Bulk Loading for Hadoop

Loading

When BULKLOAD=YES is specified (the default value), SAS creates a “staging file” that contains the data to be loaded, and then loads the data into Hive by issuing the Hive LOAD DATA query.
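
For example, a minimal sketch that makes the default explicit; BULKLOAD= is shown here as a LIBNAME option, and the credentials are placeholders:

/* BULKLOAD=YES is the default and is shown explicitly for illustration */
libname hdp_air hadoop user=myusr1 pwd=mypwd1 connection-options bulkload=yes;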

The exact sequence of commands varies, depending on the file formats of the target table and the staging file.

If the target table does not exist, the table's file format can be specified with the DBCREATE_TABLE_OPTS= LIBNAME option or data set option. When this option is not used, Hive creates the target table in its default file format, which is specified by the hive.default.fileformat configuration property.

The staging file’s format can be different from the target table’s file format. Starting in SAS 9.4M8, you can specify the file format of the staging file with the BL_FORMAT= option. For example, you can create the Hive table with a file format of Parquet by specifying DBCREATE_TABLE_OPTS="STORED AS PARQUET" and write the staging file in ORC format by specifying BL_FORMAT="ORC". For more information, see the BL_FORMAT= LIBNAME option or data set option. Hive transforms the data from the staging file into the target table’s file format as part of the load operation.
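
For example, a sketch that combines the two options from this paragraph as data set options; the output table name is a placeholder:

/* Create the target table as Parquet, but write the staging file as ORC.
   Hive converts the staged data to Parquet during the load operation. */
data hdp_air.flights_parquet (dbcreate_table_opts='stored as parquet'
                              bl_format='orc');
   set sasflt.flt98;
run;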

New tables are created in the Hive warehouse. To create tables outside of the Hive warehouse, use the DBCREATE_TABLE_EXTERNAL= and DBCREATE_TABLE_LOCATION= options. These options are available as LIBNAME and data set options.
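
For example, a sketch that creates an external table; the HDFS path and output table name are placeholders:

/* Create the table outside of the Hive warehouse at the specified location */
data hdp_air.flights_ext (dbcreate_table_external=yes
                          dbcreate_table_location='/data/external/flights');
   set sasflt.flt98;
run;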

Here is the general sequence of operations (a sketch that shows how to log these queries follows the list):

  1. If the target table does not exist, SAS submits a CREATE TABLE query to Hive to create the target table. The CREATE TABLE query contains the column definitions, any Hive table properties, and the effects of any SAS LIBNAME options or data set options that are specified.
  2. SAS creates the staging file that contains the data to be loaded to Hive.
    • When BL_FORMAT=TEXT, SAS writes the data to the HDFS temporary directory. The staging file is created as a UTF-8 delimited text file, using CTRL-A (‘\001’) as a field delimiter and newline (‘\n’) as a record separator.
    • When BL_FORMAT= is not TEXT, the staging file is created locally on the machine where SAS is running and then copied to HDFS.
  3. SAS issues a CREATE TABLE query to create a temporary table in Hive and then issues a LOAD DATA query to load the staging file into the temporary table. This temporary table is referred to as the “staging table.”
  4. SAS issues an INSERT INTO target-table SELECT * FROM staging-table query, which copies the data from the staging table to the target table.
  5. SAS issues a DROP TABLE query to drop the temporary staging table.
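
One way to watch this sequence for a given load is the SASTRACE= system option, which logs the queries that SAS sends to the DBMS. A minimal sketch, reusing the table names from the examples later in this section:

/* Log the Hive queries that SAS issues during the load */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

data hdp_air.flights98;
   set sasflt.flt98;
run;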

In some limited cases, SAS can perform the LOAD DATA operation directly into the target table, bypassing the staging table.

When PROC APPEND is used to append to a Hive table, the Hadoop interface places the data in a new HDFS file and then follows either the direct LOAD DATA pattern or the LOAD DATA plus INSERT INTO pattern described earlier.

SAS/ACCESS Interface to Hadoop also supports BULKLOAD=NO. When BULKLOAD=NO, SAS sends SQL INSERT INTO ... VALUES queries to Hive. Although this approach is slower than BULKLOAD=YES, it is useful when HDFS write access is not available.
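
For example, a sketch of a row-by-row insert; BULKLOAD= is shown here as a LIBNAME option, and the credentials are placeholders:

/* BULKLOAD=NO sends INSERT INTO ... VALUES queries to Hive,
   so no HDFS write access is needed */
libname hdp_ins hadoop user=myusr1 pwd=mypwd1 connection-options bulkload=no;

proc append base=hdp_ins.allflights data=sasflt.flt98;
run;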

Bulk Loading to Cloudera Data Platform Private Cloud - Data Services

To enable bulk loading to Cloudera Data Platform (CDP) Private Cloud - Data Services, you must set the SAS_HADOOP_HDFS_AUTH_TYPE environment variable to KERBEROS. This setting enables you to make the LIBNAME connection to Hive using USER= and PASSWORD= authentication and connect to HDFS using Kerberos.

(Optional) Set the environment variable in the SAS code by entering the following code before the LIBNAME statement:

options set=SAS_HADOOP_HDFS_AUTH_TYPE="KERBEROS";

(Optional) Alternatively, set the environment variable for all SAS sessions in the !SASROOT/bin/sasenv_local file on UNIX, or as a system environment variable on Windows.
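
Putting these pieces together, a sketch of a CDP Private Cloud connection; the library name and credentials are placeholders:

/* Authenticate to Hive with USER= and PASSWORD=,
   and to HDFS with Kerberos */
options set=SAS_HADOOP_HDFS_AUTH_TYPE="KERBEROS";

libname cdp_hive hadoop user=myusr1 password=mypwd1 connection-options;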

Configure Bulk Loading to ADLS Gen2 for Azure HDInsight

For ADLS Gen2 storage on Azure HDInsight, most customers use bulk loading, which is available by default. However, additional configuration is required to enable access, and the AZURETENANTID= system option is required when you submit code (see the sketch after these steps).

Follow these steps to enable access:

  1. If you have not already done so, complete the instructions in "Configuring Hadoop JAR Files and Cluster Configuration Files" in SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
  2. Open the Azure portal or Microsoft Azure Storage Explorer. Locate the ADLS Gen2 storage account for the Interactive Query cluster and copy the primary access key for the storage account.
  3. In the HDInsight Interactive Query (Apache Hive LLAP) cluster, edit the core-site.xml file to add two properties. In the names of both properties, specify the account name. In the second property, paste the primary access key.
    <!-- Add these two properties manually -->
    <property>
       <name>fs.azure.account.auth.type.account-name.dfs.core.windows.net</name>
       <value>SharedKey</value>
    </property>
    <property>
       <name>fs.azure.account.key.account-name.dfs.core.windows.net</name>
       <value>primary-access-key</value>
    </property>
  4. In the core-site.xml file, remove or comment out these two properties:
    <!--
    <property>
       <name>fs.azure.account.auth.type</name>
       <value>Custom</value>
    </property>
     
    <property>
       <name>fs.azure.account.oauth.provider.type</name>
       <value>com.microsoft.azure.storage.oauth2.TokenServiceBasedAccessTokenProvider</value>
    </property>
    -->
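
After the cluster is configured, a sketch of the SAS side follows; the tenant ID, library name, and credentials are placeholders:

/* AZURETENANTID= is required when submitting code against ADLS Gen2 storage */
options azuretenantid="your-azure-tenant-id";

libname hdi_hive hadoop user=myusr1 pwd=mypwd1 connection-options;

data hdi_hive.flights98;
   set sasflt.flt98;
run;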

Examples

This example creates and loads the FLIGHTS98 HiveServer2 table from the SASFLT.FLT98 SAS data set.

libname sasflt 'SAS-library';
libname hdp_air hadoop user=myusr1 pwd=mypwd1 connection-options;

proc sql;
create table hdp_air.flights98
        as select * from sasflt.flt98;
quit;

This example creates and loads the ALLFLIGHTS HiveServer2 table in SEQUENCEFILE format from the SASFLT.ALLFLIGHTS SAS data set.

data hdp_air.allflights (dbcreate_table_opts='stored as sequencefile');
   set sasflt.allflights;
run;

In this example, the SASFLT.FLT98 SAS data set is appended to the ALLFLIGHTS HiveServer2 table.

proc append base=hdp_air.allflights data=sasflt.flt98;
run;