When BULKLOAD=YES is specified (the default value), SAS creates a “staging file” that contains the data to be loaded into Hive, and then loads the data into Hive by issuing the Hive LOAD DATA query.
The exact sequence of commands varies, depending on the file formats of the target table and the staging file.
If the target table does not exist, you can specify the table's file format
using the DBCREATE_TABLE_OPTS= LIBNAME option or data set option. When this option is not used, Hive creates the target table in its default
file format, which is controlled by the hive.default.fileformat configuration property.
The staging file's format can differ from the target table's file format. Starting in SAS 9.4M8, you can specify the file format of the staging file with the BL_FORMAT= option. For example, you can create the Hive table in Parquet format by specifying DBCREATE_TABLE_OPTS="STORED AS PARQUET" while writing the staging file in ORC format by specifying BL_FORMAT="ORC". For more information, see the BL_FORMAT= LIBNAME option or data set option. Hive converts the data from the staging file into the target table's file format as part of the load operation.
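As a sketch of how these two options combine (the library and table names here are hypothetical), a DATA step can create a Parquet target table while the staging file is written in ORC:

```
/* Hypothetical example: create the Hive target table in Parquet */
/* format while writing the intermediate staging file in ORC.    */
data hdp.mytable (dbcreate_table_opts='stored as parquet'
                  bl_format='ORC');
   set work.mydata;
run;
```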
New tables are created in the Hive warehouse. To create tables outside of the Hive warehouse, use the DBCREATE_TABLE_EXTERNAL= and DBCREATE_TABLE_LOCATION= options. These options are available as LIBNAME and data set options.
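For instance, a LIBNAME statement along these lines (the user credentials and HDFS path are hypothetical) directs new tables to an external location outside the Hive warehouse:

```
/* Hypothetical example: create new tables as external Hive      */
/* tables at a specific location instead of in the warehouse.    */
libname hdp hadoop user=myusr1 pwd=mypwd1
   dbcreate_table_external=yes
   dbcreate_table_location='/user/myusr1/external_tables';
```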
Here is the general sequence of operations: SAS writes the staging file to HDFS, issues a Hive LOAD DATA statement to load the staging file into a temporary table, and then issues an INSERT INTO ... SELECT statement to move the data into the target table in the target table's file format.
In some limited cases, when the staging file's format already matches the target table's format, SAS can perform the LOAD DATA operation directly into the target table.
When PROC APPEND is used to append to the Hive table, the Hadoop interface places data in a new HDFS file. The interface then issues either the LOAD DATA pattern or the LOAD DATA plus INSERT INTO pattern described earlier.
SAS/ACCESS Interface to Hadoop also supports BULKLOAD=NO. When BULKLOAD=NO, SAS creates an SQL INSERT INTO ... VALUES query to send to Hive. Although this approach is slower than using BULKLOAD=YES, BULKLOAD=NO is useful when HDFS Write access is not available.
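As a sketch (table and library names are hypothetical), BULKLOAD=NO can be set as a data set option when HDFS Write access is not available:

```
/* Hypothetical example: load rows through Hive with             */
/* INSERT INTO ... VALUES instead of staging a file in HDFS.     */
/* Slower than bulk loading, but requires no HDFS Write access.  */
data hdp.smalltable (bulkload=no);
   set work.smalldata;
run;
```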
To enable bulk loading to Cloudera Data Platform (CDP) Private Cloud - Data Services, you must set the SAS_HADOOP_HDFS_AUTH_TYPE environment variable to KERBEROS. This setting enables you to make the LIBNAME connection to Hive using USER= and PASSWORD= authentication and connect to HDFS using Kerberos.
(Optional) Set the environment variable in the SAS code by entering the following code before the LIBNAME statement:
options set=SAS_HADOOP_HDFS_AUTH_TYPE="KERBEROS";
(Optional) Alternatively, set the environment variable for all SAS sessions in the !SASROOT/bin/sasenv_local file on a UNIX SAS installation, or as a system environment variable on a Windows SAS installation.
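On UNIX, the line added to sasenv_local might look like the following (a sketch; adjust for your shell and installation):

```shell
# Export the variable so that every SAS session started from this
# environment connects to HDFS with Kerberos authentication.
export SAS_HADOOP_HDFS_AUTH_TYPE=KERBEROS
```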
For ADLS Gen2 storage on Azure HDInsight, bulk loading is available by default and is what most customers use. However, additional configuration is required to enable access, and the AZURETENANTID= system option is required when you submit code.
Follow these steps to enable access:
<!-- Add these two keys manually -->
<property>
<name>fs.azure.account.auth.type.account-name.dfs.core.windows.net</name>
<value>SharedKey</value>
</property>
<property>
<name>fs.azure.account.key.account-name.dfs.core.windows.net</name>
<value>primary-access-key</value>
</property>
<!--
<property>
<name>fs.azure.account.auth.type</name>
<value>Custom</value>
</property>
<property>
<name>fs.azure.account.oauth.provider.type</name>
<value>com.microsoft.azure.storage.oauth2.TokenServiceBasedAccessTokenProvider</value>
</property>
-->
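When you submit code, the tenant ID is supplied with the AZURETENANTID= system option before the LIBNAME statement. Here is a sketch (the tenant ID value and connection details are hypothetical):

```
/* Hypothetical example: specify the Azure tenant ID before      */
/* connecting to Hive on HDInsight.                              */
options azuretenantid="your-azure-tenant-id";
libname hdp_az hadoop user=myusr1 pwd=mypwd1 connection-options;
```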
This example creates and loads the FLIGHTS98 HiveServer2 table from the SASFLT.FLT98 SAS data set.
libname sasflt 'SAS-library';
libname hdp_air hadoop user=myusr1 pwd=mypwd1 connection-options;
proc sql;
create table hdp_air.flights98
as select * from sasflt.flt98;
quit;
This example creates and loads the ALLFLIGHTS HiveServer2 table in SEQUENCEFILE format from the SASFLT.ALLFLIGHTS SAS data set.
data hdp_air.allflights (dbcreate_table_opts='stored as sequencefile');
set sasflt.allflights;
run;
In this example, the SASFLT.FLT98 SAS data set is appended to the ALLFLIGHTS HiveServer2 table.
proc append base=hdp_air.allflights data=sasflt.flt98;
run;