Starting in SAS 9.4M9, the Spark engine supports bulk loading and bulk unloading for Databricks in Azure, bulk loading for Databricks in AWS, and bulk loading for Spark in HDFS.
Bulk loading is the fastest way to load data into a database. Bulk unloading is beneficial when retrieving more than 1 GB of data from a database.
Bulk loading is not available for Databricks in Google Cloud Platform.
The Spark engine provides bulk loading and bulk unloading to Databricks through Azure Data Lake Storage (ADLS) Gen2. The Spark engine can bulk load and bulk unload data files between an ADLS Gen2 storage location and SAS.
The bulk operations access ADLS Gen2 using an abfss:// path. Mount points are not supported. You must be able to authenticate to and access ADLS Gen2 from Databricks with the abfss:// path. See Databricks documentation for Azure Data Lake Storage Gen2 for information about setting up direct access to data.
To authenticate to Microsoft Azure, set the following system options:
AZURETENANTID= system option (required)
AZUREAUTHCACHELOC= system option (optional)
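For example, a minimal sketch with placeholder values (the AZUREAUTHCACHELOC= line is optional and the directory shown is illustrative):

options azuretenantid="<tenant-id>"
        azureauthcacheloc="<writable-directory>";  /* optional location for the cached authentication token */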
SAS/ACCESS authenticates to ADLS Gen2 storage with a device code. When logging in to Databricks through device authentication, the Microsoft Entra Application ID is the principal that is authorized to access the data instead of a specific user. Any person who has credentials for the Microsoft Entra Application ID can access the data.
To use device-code authentication, supply credentials using the BL_APPLICATIONID=, USER=, and PASSWORD= options.
On a user’s first attempt to access data by using device-code authentication, an error message is printed in the SAS log. The user must authenticate with Microsoft. That process is explained in the documentation for the required AZURETENANTID= system option.
BULKLOAD=YES is the default when connecting to Databricks in Azure. Bulk unloading is disabled in the Spark engine by default. To enable bulk unloading, you must specify BULKUNLOAD=YES.
You must specify the following options with BULKLOAD=YES or BULKUNLOAD=YES:
BL_ACCOUNTNAME= LIBNAME option or data set option
BL_APPLICATIONID= LIBNAME option or data set option
BL_FILESYSTEM= LIBNAME option or data set option
BULKUNLOAD=YES LIBNAME option or data set option
This option is also available:
BL_FOLDER= LIBNAME option or data set option
Use LIBNAME options or data set options. Do not use a combination of both.
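For example, here is a LIBNAME-level sketch. The libref azdbx, the quoted placeholder values, and <connection-options> are illustrative, not literal syntax for your site:

libname azdbx spark <connection-options>
   bulkunload=yes
   bl_accountname="<adls-storage-account>"
   bl_applicationid="<entra-application-id>"
   bl_filesystem="<adls-file-system>"
   bl_folder="<staging-folder>";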
Bulk operations use the Databricks COPY command. When BULKLOAD=YES and related options are specified, a UTF-8 encoded CSV staging file is created on the client and copied to Databricks, where the data is created as a Delta table. When BULKUNLOAD=YES is specified, a UTF-8 encoded CSV file is created in the database and downloaded to SAS.
See Other Options for Bulk Loading and Unloading for information about options that are available to customize and manage the files associated with the bulk load and bulk unload processes.
SAS/ACCESS uses Amazon Simple Storage Service (S3) to move data from the client to the Databricks database in AWS. Files are copied to an Amazon S3 bucket using S3 utilities. You can find more information about managing data in the Databricks on AWS documentation.
BULKLOAD=YES is the default when accessing Databricks in AWS.
SAS/ACCESS provides the following options for authenticating to S3 for bulk loading. BL_BUCKET= is required. Use LIBNAME options or data set options. Do not use a mix of both.
BL_AWS_CREDENTIALS_FILE= LIBNAME option or data set option is optional
BL_AWS_CONFIG_FILE= LIBNAME option or data set option is optional
BL_AWS_PROFILE_NAME= LIBNAME option or data set option is optional
BL_BUCKET= LIBNAME option or data set option is required
BL_ENC_KEY= LIBNAME option or data set option
BL_IAM_ASSUME_ROLE_ARN= LIBNAME option or data set option
BL_IAM_ASSUME_ROLE_NAME= LIBNAME option or data set option
BL_IAM_ROLE= data set option is optional
BL_KEY= LIBNAME option or data set option
BL_TOKEN= LIBNAME option or data set option
BL_USE_SSL= LIBNAME option or data set option
These options are also available:
BL_CONFIG= LIBNAME option or data set option
BL_REGION= LIBNAME option or data set option
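For example, here is a sketch that uses a hypothetical libref awsdbx and placeholder values. BL_BUCKET= is the only required option; the others are shown for illustration:

libname awsdbx spark <connection-options>
   bl_bucket="<s3-bucket-name>"
   bl_aws_profile_name="<aws-credentials-profile>"   /* optional */
   bl_region="<aws-region>";                         /* optional */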
The options in this section are supported for bulk loading in both Databricks cloud host environments, with some exceptions. Options related to bulk unloading apply only to Databricks in Azure. Use LIBNAME options or data set options to request a bulk operation.
BULKLOAD=NO LIBNAME option or data set option disables bulk loading.
These options modify the format of the CSV staging files created by the bulk load and unload processes:
BL_COMPRESS= LIBNAME option or data set option
BL_DELIMITER= LIBNAME option or data set option
BL_FORMAT= LIBNAME option or data set option
BL_FORMAT_OPT= LIBNAME option or data set option
These options manage the files associated with the bulk load and unload processes:
BL_DEFAULT_DIR= LIBNAME option or data set option
BL_DELETE_DATAFILE= LIBNAME option or data set option
BL_DNSSUFFIX= LIBNAME option or data set option
BL_NUM_DATAFILES= data set option
BL_TIMEOUT= LIBNAME option or data set option
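For example, assuming the hypothetical azdbx libref from earlier and a WORK.SALES data set, this sketch writes the CSV staging file to a specific directory and keeps the file after the load completes (the option names come from the lists above; the directory value is a placeholder):

data azdbx.sales (bl_default_dir="<staging-directory>" bl_delete_datafile=no);
   set work.sales;
run;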
When HDFS access is configured, SAS/ACCESS Interface to Spark makes no differentiation between bulk loading and a standard load process. As a result, the BULKLOAD=YES syntax is supported, but it is not required.
Here is how a text-based table (STORED AS TEXTFILE) is created:
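This is a minimal sketch, assuming that omitting the DBCREATE_TABLE_OPTS= option produces the default text format; the table and data set names are illustrative:

data spk.new_text_table;
   set sas_table;
run;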
Here is how a non-text table is created:
data spk.new_hive_table(DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE');
   set sas_table;
run;
The format of the bulk load staging file can be controlled with the BL_FORMAT= option. The column delimiter can be changed with the BL_DELIMITER= option.
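For example, here is a sketch (reusing the spk libref and sas_table names from the examples above) that writes the staging file with a pipe delimiter instead of the default:

data spk.new_hive_table (bl_delimiter='|');
   set sas_table;
run;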
Spark considers a table to be a collection of files in a directory that bears the table name. The CREATE TABLE command creates this directory either directly in the Spark warehouse or in a subdirectory, if a nondefault schema is used. The LOAD DATA command moves the data to the correct location.
When PROC APPEND is used to append to the Spark table, the Spark interface places data in a new HDFS file. The interface then issues either the LOAD DATA pattern or the LOAD DATA plus INSERT INTO pattern described earlier.
This example creates and loads the FLIGHTS98 HiveServer2 table from the SASFLT.FLT98 SAS data set.
libname sasflt 'SAS-library';
libname spk_air spark <connection-options>;
proc sql;
   create table spk_air.flights98
      as select * from sasflt.flt98;
quit;
This example creates and loads the ALLFLIGHTS HiveServer2 table in SEQUENCEFILE format from the SASFLT.ALLFLIGHTS SAS data set.
data spk_air.allflights (dbcreate_table_opts='stored as sequencefile');
   set sasflt.allflights;
run;
In this example, the SASFLT.FLT98 SAS data set is appended to the ALLFLIGHTS HiveServer2 table.
proc append base=spk_air.allflights data=sasflt.flt98;
run;