Starting in SAS 9.4M9, the Spark engine supports bulk loading and bulk unloading for Databricks in Azure, bulk loading for Databricks in AWS, and bulk loading for Spark in HDFS.
Bulk loading is the fastest way to load data into a database. Bulk unloading is beneficial when retrieving more than 1 GB of data from a database.
Bulk loading is not available for Databricks in Google Cloud Platform.
The Spark engine provides bulk loading and bulk unloading to Databricks through Azure Data Lake Storage (ADLS) Gen2. The Spark engine can bulk load and bulk unload data files between an ADLS Gen2 storage location and SAS.
The bulk operations access ADLS Gen2 using an abfss:// path. Mount points are not supported. You must be able to authenticate to and access ADLS Gen2 from Databricks with the abfss:// path. See Databricks documentation for Azure Data Lake Storage Gen2 for information about setting up direct access to data.
To authenticate to Microsoft Azure, set the following system options:
AZURETENANTID= system option (required)
AZUREAUTHCACHELOC= system option (optional)
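For example, a minimal sketch with placeholder values (the AZUREAUTHCACHELOC= line is optional and the directory shown is illustrative):

options azuretenantid="<tenant-id>"
        azureauthcacheloc="<writable-directory>";  /* optional location for the cached authentication token */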
SAS/ACCESS authenticates to ADLS Gen2 storage with a device code. When logging in to Databricks through device authentication, the Microsoft Entra Application ID is the principal that is authorized to access the data instead of a specific user. Any person who has credentials for the Microsoft Entra Application ID can access the data.
To use device-code authentication, supply credentials using the BL_APPLICATIONID=, USER=, and PASSWORD= options.
On a user’s first attempt to access data by using device-code authentication, an error message is printed in the SAS log. The user must authenticate with Microsoft. That process is explained in the documentation for the required AZURETENANTID= system option.
BULKLOAD=YES is the default when connecting to Databricks in Azure. Bulk unloading is disabled in the Spark engine by default. To enable bulk unloading, you must specify BULKUNLOAD=YES.
You must specify the following options with BULKLOAD=YES or BULKUNLOAD=YES:
BL_ACCOUNTNAME= LIBNAME option or data set option
BL_APPLICATIONID= LIBNAME option or data set option
BL_FILESYSTEM= LIBNAME option or data set option
BULKUNLOAD=YES LIBNAME option or data set option
This option is also available:
BL_FOLDER= LIBNAME option or data set option
Use LIBNAME options or data set options. Do not use a combination of both.
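For example, here is a LIBNAME-level sketch. The libref azdbx, the quoted placeholder values, and <connection-options> are illustrative, not literal syntax for your site:

libname azdbx spark <connection-options>
   bulkunload=yes
   bl_accountname="<adls-storage-account>"
   bl_applicationid="<entra-application-id>"
   bl_filesystem="<adls-file-system>"
   bl_folder="<staging-folder>";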
Bulk operations use the Databricks COPY command. When BULKLOAD=YES and related options are specified, a UTF-8 encoded CSV staging file is created on the client and copied to Databricks, where the data is created as a Delta table. When BULKUNLOAD=YES is specified, a UTF-8 encoded CSV file is created in the database and downloaded to SAS.
See Other Options for Bulk Loading and Unloading for information about options that are available to customize and manage the files associated with the bulk load and bulk unload processes.
SAS/ACCESS uses Amazon Simple Storage Service (S3) to move data from the client to the Databricks database in AWS. Files are copied to an Amazon S3 bucket using S3 utilities. You can find more information about managing data in the Databricks on AWS documentation.
BULKLOAD=YES is the default when accessing Databricks in AWS.
SAS/ACCESS provides the following options for authenticating to S3 for bulk loading. BL_BUCKET= is required. Use LIBNAME options or data set options. Do not use a mix of both.
BL_AWS_CREDENTIALS_FILE= LIBNAME option or data set option is optional
BL_AWS_CONFIG_FILE= LIBNAME option or data set option is optional
BL_AWS_PROFILE_NAME= LIBNAME option or data set option is optional
BL_BUCKET= LIBNAME option or data set option is required
BL_ENC_KEY= LIBNAME option or data set option
BL_IAM_ASSUME_ROLE_ARN= LIBNAME option or data set option
BL_IAM_ASSUME_ROLE_NAME= LIBNAME option or data set option
BL_IAM_ROLE= data set option is optional
BL_KEY= LIBNAME option or data set option
BL_TOKEN= LIBNAME option or data set option
BL_USE_SSL= LIBNAME option or data set option
These options are also available:
BL_CONFIG= LIBNAME option or data set option
BL_REGION= LIBNAME option or data set option
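For example, here is a sketch that uses a hypothetical libref awsdbx and placeholder values. BL_BUCKET= is the only required option; the others are shown for illustration:

libname awsdbx spark <connection-options>
   bl_bucket="<s3-bucket-name>"
   bl_aws_profile_name="<aws-credentials-profile>"   /* optional */
   bl_region="<aws-region>";                         /* optional */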
The options in this section are supported for bulk loading in both Databricks cloud host environments, with some exceptions. Options related to bulk unloading apply only to Databricks in Azure. Use LIBNAME options or data set options to request a bulk operation.
BULKLOAD=NO LIBNAME option or data set option disables bulk loading.
These options modify the format of the CSV staging files created by the bulk load and unload processes:
BL_COMPRESS= LIBNAME option or data set option
BL_DELIMITER= LIBNAME option or data set option
BL_FORMAT= LIBNAME option or data set option
BL_FORMAT_OPT= LIBNAME option or data set option
These options manage the files associated with the bulk load and unload processes:
BL_DEFAULT_DIR= LIBNAME option or data set option
BL_DELETE_DATAFILE= LIBNAME option or data set option
BL_DNSSUFFIX= LIBNAME option or data set option
BL_NUM_DATAFILES= data set option
BL_TIMEOUT= LIBNAME option or data set option
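For example, assuming the hypothetical azdbx libref from earlier and a WORK.SALES data set, this sketch writes the CSV staging file to a specific directory and keeps the file after the load completes (the option names come from the lists above; the directory value is a placeholder):

data azdbx.sales (bl_default_dir="<staging-directory>" bl_delete_datafile=no);
   set work.sales;
run;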
When HDFS access is configured, SAS/ACCESS Interface to Spark makes no differentiation between bulk loading and a standard load process. As a result, the BULKLOAD=YES syntax is supported, but it is not required.
Here is how a text-based table (STORED AS TEXTFILE) is created:
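This is a minimal sketch, assuming that omitting the DBCREATE_TABLE_OPTS= option produces the default text format; the table and data set names are illustrative:

data spk.new_text_table;
   set sas_table;
run;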
Here is how a non-text table is created:
data spk.new_hive_table(DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE');
   set sas_table;
run;
The format of the bulk load staging file can be controlled with the BL_FORMAT= option. The column delimiter can be changed with the BL_DELIMITER= option.
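For example, here is a sketch (reusing the spk libref and sas_table names from the examples above) that writes the staging file with a pipe delimiter instead of the default:

data spk.new_hive_table (bl_delimiter='|');
   set sas_table;
run;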
Spark considers a table to be a collection of files in a directory that bears the table name. The CREATE TABLE command creates this directory either directly in the Spark warehouse or in a subdirectory, if a nondefault schema is used. The LOAD DATA command moves the data to the correct location.
When PROC APPEND is used to append to the Spark table, the Spark interface places data in a new HDFS file. The interface then issues either the LOAD DATA pattern or the LOAD DATA plus INSERT INTO pattern described earlier.
This example creates and loads the FLIGHTS98 HiveServer2 table from the SASFLT.FLT98 SAS data set.
libname sasflt 'SAS-library';
libname spk_air spark <connection-options>;
proc sql;
   create table spk_air.flights98
      as select * from sasflt.flt98;
quit;
This example creates and loads the ALLFLIGHTS HiveServer2 table in SEQUENCEFILE format from the SASFLT.ALLFLIGHTS SAS data set.
data spk_air.allflights (dbcreate_table_opts='stored as sequencefile');
   set sasflt.allflights;
run;
In this example, the SASFLT.FLT98 SAS data set is appended to the ALLFLIGHTS HiveServer2 table.
proc append base=spk_air.allflights data=sasflt.flt98;
run;