Bulk loading with the Impala engine can be accomplished in two ways: by using WebHDFS or by using Java JAR files to upload the table data to HDFS. For details, see Configuration Details.
During the bulk-loading process, the Impala engine writes the table data to a data file, transfers that file to HDFS (by using either WebHDFS or Java), and then loads the data from the file into the target Impala table.
The BULKLOAD= data set option is required for bulk loading; all other Impala bulk-load data set options, such as BL_DATAFILE=, BL_HOST=, BL_PORT=, BL_DELETE_DATAFILE=, and HDFS_PRINCIPAL=, are optional. For more information, see Data Set Options.
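For instance, here is a minimal sketch in which only the required BULKLOAD= data set option is specified; the libref and table names are illustrative, and all other bulk-load options take their default values:

data mydblib.mytable (BULKLOAD=YES);   /* BULKLOAD=YES is the only required bulk-load option */
   set work.mydata;
run;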
Bulk loading to HDFS requires Java JAR files, HDFS host-specific configuration information, or both. The HDFS host that is required when you use Java or WebHDFS might differ from the Impala host; for example, it might be another machine in the same cluster. Depending on your configuration, you must specify this information when you bulk load data to Impala.
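For example, here is a minimal sketch, assuming a cluster in which the Impala server (impala-node1) and the HDFS NameNode (hdfs-node1) are different machines; all host, user, and table names are hypothetical:

libname mydblib impala host=impala-node1
        db=users user=myusr1 password=mypwd1;
data mydblib.mytable (BULKLOAD=YES
                      BL_HOST='hdfs-node1'   /* HDFS host differs from the Impala host */
                      BL_PORT=50070);
   set work.mydata;
run;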
Here is what to specify if you use WebHDFS to upload table data:

SAS_HADOOP_RESTFUL=1 (required).

SAS_HADOOP_CONFIG_PATH=<configuration-directory> (contains the hdfs-site.xml file for the specific host). As an alternative, you can use the BL_HOST= data set option to specify the HDFS host name. Using SAS_HADOOP_CONFIG_PATH= is the preferred solution.
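For example, here is a minimal sketch of the WebHDFS settings; the configuration directory shown is hypothetical:

option set=SAS_HADOOP_RESTFUL 1;   /* required: transfer data with WebHDFS */
option set=SAS_HADOOP_CONFIG_PATH "/configs/myhdfshost";   /* directory that contains hdfs-site.xml */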
Here is what to specify if you use Java to upload table data:

SAS_HADOOP_RESTFUL=0 (optional).

SAS_HADOOP_JAR_PATH=<jar-directory>. For instructions, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

SAS_HADOOP_CONFIG_PATH=<configuration-directory> (contains the hdfs-site.xml file for the specific host). As an alternative, you can use the BL_HOST= data set option to specify the HDFS host name. Using SAS_HADOOP_CONFIG_PATH= is the preferred solution.
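Likewise, here is a minimal sketch of the Java settings; the JAR and configuration directories shown are hypothetical:

option set=SAS_HADOOP_RESTFUL 0;   /* optional: transfer data with Java rather than WebHDFS */
option set=SAS_HADOOP_JAR_PATH "/hadoop/jars";   /* directory that contains the Hadoop JAR files */
option set=SAS_HADOOP_CONFIG_PATH "/configs/myhdfshost";   /* directory that contains hdfs-site.xml */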
For more information about the SAS_HADOOP_RESTFUL, SAS_HADOOP_JAR_PATH, and SAS_HADOOP_CONFIG_PATH environment variables, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS. Here are ways that you can specify them:

At SAS invocation:

SAS ... -SET SAS_HADOOP_RESTFUL 1

In SAS code, with the OPTIONS statement:

option set=SAS_HADOOP_RESTFUL 1;

This example shows how you can use a SAS data set, SASFLT.FLT98, to create and load an Impala table, FLIGHTS98.
libname sasflt 'SAS-data-library';
libname mydblib impala host=mysrv1
db=users user=myusr1 password=mypwd1;
proc sql;
create table mydblib.flights98
(BULKLOAD=YES
BL_DATAFILE='/tmp/mytable.dat'
BL_HOST='192.168.x.x'
BL_PORT=50070)
as select * from sasflt.flt98;
quit;
This example shows how you can append the SAS data set SASFLT.FLT98 to the existing Impala table FLIGHTS98. In this example, the HDFS_PRINCIPAL= data set option is specified as well. You specify the HDFS_PRINCIPAL= data set option when you configure HDFS to allow Kerberos authentication. BL_DELETE_DATAFILE=NO causes the engine to leave the data file in place after the load has completed, rather than deleting it.
proc append base=mydblib.flights98
(BULKLOAD=yes
BL_DATAFILE='/tmp/mytable.dat'
BL_DELETE_DATAFILE=no
HDFS_PRINCIPAL='hdfs/hdfs_host.example.com@test.example.com'
BL_HOST='192.168.x.x'
BL_PORT=50070)
data=sasflt.flt98;
run;
This example shows how to use a SAS data set, SASFLT.FLT98, to create and load an Impala table, FLIGHTS98, using WebHDFS and configuration files.
option set=SAS_HADOOP_RESTFUL 1;
option set=SAS_HADOOP_CONFIG_PATH "/configs/myhdfshost";
/* This path should point to the directory that contains */
/* the hdfs-site.xml file for the cluster that hosts the */
/* 'mysrv1' Impala host */
libname sasflt 'SAS-data-library';
libname mydblib impala host=mysrv1
db=users user=myusr1 password=mypwd1;
proc sql;
create table mydblib.flights98
(BULKLOAD=YES
BL_DATAFILE='/tmp/mytable.dat')
/* no BL_HOST= or BL_PORT= needed; the configuration file supplies the HDFS host */
as select * from sasflt.flt98;
quit;