LIBNAME Statement for the Spark Engine

Overview

This section describes the LIBNAME statement that SAS/ACCESS Interface to Spark supports. For general information about this feature, see LIBNAME Statement for Relational Databases.

IMPORTANT: SAS 9.4M9 has new required locations for Hadoop JAR files and third-party JDBC driver files. Using the old location for the JAR files causes an error, and the system ignores driver files installed in other locations. For more information about the requirements, see "Configuring SAS/ACCESS Interface to Spark" in SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS. These requirements do not apply to SAS Viya 3.5. The documentation at that link includes a resolution for the following error: Fetch error: java.lang.NoClassDefFoundError: Could not initialize class.
Note: SAS 9.4M9 has a new LIBNAME option, PLATFORM=SPARK | DATABRICKS, that enables you to specify a Databricks or Spark connection with the SERVER=, PORT=, DATABASE=, and HTTPPATH= LIBNAME options. You are no longer required to use the URI= option to specify a Databricks connection. In addition, some Hadoop configuration options can be specified as connection options in the LIBNAME statement instead of in a configuration file. When a configuration file also exists, the LIBNAME settings override the settings in the configuration file. For details, see these connection options: ALLOW_SELF_SIGNED_CERTS=, HTTPPATH=, MAPREDUCE_JOB_NAME=, MAPREDUCE_JOB_QUEUENAME=, PLATFORM=, SSL_TRUST_STORE=, TEZ_QUEUE_NAME=, TRUST_STORE_PASSWORD=, ZOOKEEPER_NAMESPACE=, ZOOKEEPER_PORT=, and ZOOKEEPER_QUORUM=.

Information about using SAS/ACCESS Interface to Spark with Hadoop distributions that support the Hadoop Distributed File System (HDFS) is provided in the LIBNAME Statement for the Hadoop Engine. This includes information about Read and Write security and related HDFS permissions.

Here is the LIBNAME statement syntax for accessing Spark.

LIBNAME libref spark <connection-options> <LIBNAME-options>;

For general information about the LIBNAME statement that is not specific to SAS/ACCESS, see LIBNAME Statement in SAS Global Statements: Reference.

Arguments

libref

specifies any SAS name that serves as an alias to associate SAS with a database, schema, server, or group of tables and views.

spark

specifies the SAS/ACCESS engine name for the Spark interface.

connection-options

provides options that are required to connect to Spark SQL.

All connection options are also valid in the CONNECT statement when you use the SQL pass-through facility (SQL procedure) to connect to your DBMS.

ALLOW_SELF_SIGNED_CERTS= NO | YES

enables or disables SSL wire encryption of client requests.

When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.

CLASSPATH="directory-path"

This option behaves differently in SAS 9.4M9 than in SAS 9.4M8 and SAS Viya 3.5.

For SAS 9.4M9, the option specifies additional resources, including JDBC drivers, to add to the JDBC class path. The directory-path must be a subdirectory of the required location. If you specify CLASSPATH=, also specify the DRIVERCLASS= option to ensure that the correct driver is selected.

  • On UNIX, this is the required location in SAS 9.4M9:
    /SASHOME/AccessClients/9.4/DataDrivers/jdbc

    or, if SASHOME is not set:

    /usr/local/SAS/AccessClients/9.4/DataDrivers/jdbc
  • On Windows, this is the required location in SAS 9.4M9:
    SASHOME\AccessClients\9.4\DataDrivers\jdbc

    or, if SASHOME is not set:

    C:\Program Files\SASHome\AccessClients\9.4\DataDrivers\jdbc

For SAS 9.4M8 and SAS Viya 3.5, the option specifies the path to the JDBC JAR files. The value that you specify for CLASSPATH= overrides the value that is set for the SAS_ACCESS_CLASSPATH environment variable. You must set a value for either the CLASSPATH= connection option or the SAS_ACCESS_CLASSPATH environment variable; the default is the value of the SAS_ACCESS_CLASSPATH environment variable.

Example:

libname x spark platform=databricks user=token pwd='mytoken' 
classpath='/usr/local/SAS/AccessClients/9.4/DataDrivers/jdbc/alt' 
driverclass='class' 
...more-connection-options...;
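
For SAS 9.4M8 and SAS Viya 3.5, here is a minimal sketch that uses the SAS_ACCESS_CLASSPATH environment variable instead; the driver path and connection values are placeholders for your own environment:

/* Hypothetical location of the Spark JDBC driver JAR files. */
option set=SAS_ACCESS_CLASSPATH="/opt/sas/jdbc-drivers";

libname x spark server='spark-server' port=10001 database='default'
  user=myuserid password='myuserpwd';
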
DRIVERCLASS=<">class<">

specifies the fully qualified class name of the JDBC driver to use for your connection.

Alias: DRIVER=

Here are some examples that show how the DRIVERCLASS= value interacts with the PLATFORM= option; a combined sketch follows the list. If the PLATFORM= option is omitted, SAS attempts to determine it from the DRIVERCLASS= value. If both PLATFORM= and DRIVERCLASS= are omitted, the Cloudera driver is used.

  • Databricks JDBC driver (default when PLATFORM=DATABRICKS)
    driverclass='com.simba.databricks.jdbc.Driver'
  • Apache Hive JDBC driver (default when PLATFORM=SPARK)
    driverclass='org.apache.hive.jdbc.HiveDriver'
  • Databricks JDBC driver (available from Databricks):
    driverclass='com.databricks.client.jdbc.Driver'
  • Cloudera driver for Apache Hive (default when PLATFORM= is not specified)
    driverclass='com.cloudera.hive.jdbc.HS2Driver'
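
Here is a minimal sketch that pairs PLATFORM= with an explicit DRIVERCLASS=; the server, token, and HTTP path values are placeholders:

/* Connect to Databricks with the Databricks-provided driver class
   instead of the default Simba class. All values are hypothetical. */
libname db spark platform=databricks
  driverclass='com.databricks.client.jdbc.Driver'
  server='databricks-server' port=443 database='default'
  user=token pwd='personal-access-token'
  httpPath='myHttpPath';
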
HDFS_DATADIR='path'

specifies the path to the Spark directory where SAS reads and writes data (for example, '/sas/hpa'). This option applies only when the HDFS_METADIR= option is also specified.

Note: This option is valid only when SAS is configured with access to HDFS.

Note: This option is deprecated in SAS 9.4M9.

HDFS_METADIR='path'

specifies the path to an HDFS directory that contains XML-based table definitions. Through these descriptors, SAS then accesses the data using HDFS instead of Spark. If you want the Spark engine to connect using Spark SQL, do not specify this option.

Note: This option is valid only when SAS is configured with access to HDFS.

Note: This option is deprecated in SAS 9.4M9.

HDFS_TEMPDIR='path'

specifies the path to the HDFS directory where SAS reads and writes temporary data.

Default: HDFS_TEMPDIR='/tmp'

Note: This option is valid only when SAS is configured with access to HDFS.

Note: This option is deprecated in SAS 9.4M9.

HIVE_PRINCIPAL='Hive-principal-name'

specifies the principal name of the Spark server in an environment that uses Kerberos.

Alias: HIVE_KERBEROS_PRINCIPAL=

Default: none

Note: This argument is not required if hive-site.xml exists in the directory that the SAS_HADOOP_CONFIG_PATH environment variable specifies.
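
Here is a minimal sketch of a Kerberos connection that sets HIVE_PRINCIPAL= explicitly; the server and principal values are hypothetical, and USER= is omitted because the cluster is secured by Kerberos:

libname krb spark server='spark-server' port=10001 database='default'
  hive_principal='hive/_HOST@EXAMPLE.COM';
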
HTTPPATH="path"

specifies the HTTP or HTTPS endpoint for building the JDBC URL connection string. This option is ignored when the URL= option is specified.

Default: none

Note: Support for this option starts in SAS 9.4M9.

MAPREDUCE_JOB_NAME="name"

specifies the MapReduce job name.

When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.

MAPREDUCE_JOB_QUEUENAME="name"

specifies the MapReduce queue name.

When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.

PASSWORD=<'>Spark-password<'>

specifies the Spark password that is associated with your user ID. If the password contains spaces or nonalphanumeric characters, you must enclose it in quotation marks. If you do not want to enter your Spark password in uncoded text on this statement, see PROC PWENCODE in the Base SAS Procedures Guide for a method to encode it; a sketch follows this entry.

Alias: PASS=, PWD=, PW=

Note: The USER= and PASSWORD= options remain valid. However, these options are less secure because you are saving authentication information in your programs. As a better security practice, use the AUTHDOMAIN= option instead.
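
Here is a minimal sketch of encoding a password with PROC PWENCODE and then using the encoded value; the password and the encoded string are hypothetical, and the encoding method shown is one of several that PROC PWENCODE supports:

/* Write an encoded form of the password to the SAS log. */
proc pwencode in='myuserpwd' method=sas004;
run;

/* Copy the encoded string from the log into the LIBNAME statement. */
libname x spark server='spark-server' port=10001
  user=myuserid password='{SAS004}hypothetical-encoded-string';
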
PLATFORM=SPARK | DATABRICKS

specifies the target platform: SPARK specifies a Spark SQL data source, and DATABRICKS specifies Databricks. The value is used internally to construct an appropriate JDBC URL connection string from the SAS connection options. Set PLATFORM=DATABRICKS when connecting to Databricks.

Default: SPARK

Tip: Specifying PLATFORM= and connection options can be easier than writing a JDBC URL for the URL= option.

Note: Support for this option starts in SAS 9.4M9.
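
Here is a minimal sketch of the default case; PLATFORM=SPARK could be omitted because SPARK is the default, and the connection values are placeholders:

libname sp spark platform=spark
  server='spark-server' port=10001 database='default'
  user=myuserid password='myuserpwd';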

SCHEMA=Spark-schema

specifies the Spark schema to use for your database connection.

Alias: DATABASE=, DB=

Default: default

SSL_TRUST_STORE="path"

specifies the path to the SSL trust store.

When specified in the LIBNAME statement, this argument overrides the value stored in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.

TEZ_QUEUE_NAME="name"

specifies an Apache Tez queue name in which to execute client requests.

When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.

TRUST_STORE_PASSWORD="password"

specifies the password for the trust store.

When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.

URI='jdbc:driver-name://driver-connection-options'

specifies the JDBC URL to connect to Spark SQL.

See the Databricks documentation for information about how to construct a Databricks JDBC driver connection string.

See the documentation for your JDBC driver for information to construct a connection string for other Spark data sources.

Alias: URL=

Default: none

Note: The URI= option can be used to indicate which driver to use. Typically, the DRIVERCLASS= option takes precedence over the URI= option.
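
Here is a minimal sketch of a URI= connection. The URL parameters depend on your JDBC driver, so treat the host, path, and settings below as hypothetical and consult your driver documentation for the authoritative syntax:

libname db spark
  uri='jdbc:spark://databricks-server:443/default;transportMode=http;ssl=1;httpPath=myHttpPath;AuthMech=3;UID=token;PWD=personal-access-token';
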
USER=<'>Spark-user-name<'>

specifies the user name for Read (JDBC) and Write (HDFS) operations. Do not use the USER= argument if your Hadoop cluster is secured by Kerberos.

Alias: UID=

Note: USER= must be specified as a connection option and cannot be specified by using the URL= option.

Note: The USER= and PASSWORD= options remain valid. However, these options are less secure because you are saving authentication information in your programs. As a better security practice, use the AUTHDOMAIN= option instead.

ZOOKEEPER_NAMESPACE=namespace

specifies the ZooKeeper namespace.

When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.

ZOOKEEPER_PORT=port-number

specifies the port number on which ZooKeeper listens for client requests.

Note: Support for this option starts in SAS 9.4M9.

ZOOKEEPER_QUORUM="node-names"

specifies the server nodes that are available to run client requests. Use commas to separate node names.

When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.

Note: Support for this option starts in SAS 9.4M9.
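
The ZooKeeper options are typically used together for HiveServer2 service discovery. Here is a minimal sketch; the quorum nodes, port, and namespace are hypothetical values for illustration:

libname zk spark platform=spark
  zookeeper_quorum='zknode1,zknode2,zknode3'
  zookeeper_port=2181
  zookeeper_namespace='hiveserver2'
  user=myuserid password='myuserpwd';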

LIBNAME-options

specify how SAS processes DBMS objects. The following table describes the LIBNAME options for SAS/ACCESS Interface to Spark, with the applicable default values. For details, including which options are also valid in the CONNECT statement in the SQL procedure, see LIBNAME Options for Relational Databases.

SAS/ACCESS LIBNAME Options for Spark

Option                        Default Value

ACCESS=                       none
ANALYZE=                      NO
AUTHDOMAIN=                   none
BATCH_UPDATE=                 YES if BULKLOAD=NO and your JDBC driver
                              supports batch updates; NO otherwise
BL_ACCOUNTNAME=               none
BL_APPLICATIONID=             none
BL_AWS_CONFIG_FILE=           ~/.aws/config (UNIX);
                              C:\Users\<userid>\.aws\config (Windows)
BL_AWS_CREDENTIALS_FILE=      ~/.aws/credentials (UNIX);
                              C:\Users\<userid>\.aws\credentials (Windows)
BL_AWS_PROFILE_NAME=          none
BL_BUCKET=                    none
BL_COMPRESS=                  NO
BL_CONFIG=                    none
BL_DEFAULT_DIR=               temporary file directory that is specified
                              by the UTILLOC= system option
BL_DELETE_DATAFILE=           YES
BL_DELIMITER=                 comma
BL_DNSSUFFIX=                 dfs.core.windows.net
BL_ENCKEY=                    none
BL_FILESYSTEM=                none
BL_FOLDER=                    none
BL_FORMAT=                    Databricks: CSV; Spark in HDFS: Text
BL_FORMAT_OPT=                none
BL_IAM_ASSUME_ROLE_ARN=       none
BL_IAM_ASSUME_ROLE_NAME=      none
BL_KEY=                       none
BL_PORT=                      8020
BL_REGION=                    none
BL_TIMEOUT=                   60
BL_TOKEN=                     none
BL_USE_SSL=                   NO
BULKLOAD=                     YES
BULKUNLOAD=                   NO
CONNECTION=                   SHAREDREAD
DBCLIENT_MAX_BYTES=           matches the maximum number of bytes per
                              single character of the SAS session encoding
DBCONINIT=                    none
DBCONTERM=                    none
DBCREATE_TABLE_EXTERNAL=      NO
DBCREATE_TABLE_OPTS=          none
DBGEN_NAME=                   DBMS
DBLIBINIT=                    none
DBLIBTERM=                    none
DBMAX_TEXT=                   1024
DBMAX_TEXT_TYPES=             LONG
DBMSTEMP=                     NO
DBSASLABEL=                   COMPAT
DEFER=                        NO
DIRECT_SQL=                   YES
DRIVER_TRACE=                 none
DRIVER_TRACEFILE=             none
DRIVER_TRACEOPTIONS=          the trace log is overwritten and includes
                              no time stamps or thread identification
IGNORE_BASELINE=              NO
INSERTBUFF=                   automatically calculated based on row length
LOGIN_TIMEOUT=                30
MULTI_DATASRC_OPT=            NONE
POST_STMT_OPTS=               none
PRESERVE_COL_NAMES=           none
PRESERVE_TAB_NAMES=           NO
PROPERTIES=                   none
QUERY_TIMEOUT=                0
READBUFF=                     automatically calculated based on row length
SCHEMA=                       the Spark schema default
SCRATCH_DB=                   none
SPOOL=                        YES
SQL_FUNCTIONS=                none
SQL_FUNCTIONS_COPY=           none
SUB_CHAR=                     none
TRANSCODE_FAIL=               ERROR

Spark LIBNAME Statement Examples

The following examples provide a basic introduction to common SAS/ACCESS to Spark configurations.

Databricks in Azure, Using Bulk Operations

Use the following example to establish a connection between SAS and Databricks. Authentication is through a Databricks personal access token: specify USER=token and supply the personal access token in PASSWORD=. Using the AUTHDOMAIN= option is preferred to avoid exposing sensitive credentials.

This example includes the LIBNAME option PLATFORM=DATABRICKS. The PLATFORM= LIBNAME option, new in SAS 9.4M9, tailors the connection string that is generated internally from the options in the LIBNAME statement to the specified platform. The option also selects the appropriate JDBC driver for the connection from the default drivers.

Starting in SAS 9.4M9, bulk loading and bulk unloading are available for Azure connections when appropriate Azure options are specified. Bulk loading to Databricks is enabled by default so that there is no need to specify BULKLOAD=YES. To enable bulk unloading, you must specify BULKUNLOAD=YES.

Important: When connecting to Azure, you must specify the Azure tenant identifier using the AZURETENANTID= system option in addition to supplying Azure connection options in the LIBNAME statement.

options azuretenantid="tenant-user-id";

libname x spark platform=databricks bulkunload=yes
server='databricks-server' database='database' port=443
user=token pwd='personal-access-token'
httpPath='myHttpPath'
bl_applicationid='azure-application-id'
bl_accountname='adls2-storage-account'
bl_filesystem='adls2-blob-container'
bl_folder='adls2-folder';

For more information about the options for bulk loading and unloading in Azure, see Bulk Loading and Unloading to Databricks in Azure.

Databricks in Amazon Web Services (AWS), Using Bulk Loading

Starting in SAS 9.4M9, bulk loading is available for AWS when appropriate AWS options are specified. AWS connections use S3 storage. Bulk loading to Databricks is enabled by default so that there is no need to specify BULKLOAD=YES. Bulk unloading is currently not available for Databricks in AWS.

libname y spark platform=databricks
  server='databricks-server'
  database=database port=443
  user=token pwd='personal-access-token'
  httpPath='myHttpPath'
  bl_bucket='mybucket'
  bl_aws_config_file='path-to-config-file'
  bl_aws_credentials_file='path-to-credentials-file'
  bl_delete_datafile=no;

For more information about the options for bulk loading to Databricks in AWS, see Bulk Loading to Databricks in AWS.

Apache Spark, with HDFS Support

Use the following example to connect SAS to Apache Spark on a Hadoop distribution that has access to HDFS. When HDFS is available, SAS/ACCESS Interface to Spark can use HDFS for bulk loading. BULKLOAD=YES is enabled by default. When HDFS is not available, specify BULKLOAD=NO.

Specifying PLATFORM=SPARK is optional, because SPARK is the default PLATFORM= value.

option set=SAS_HADOOP_JAR_PATH="/usr/local/SAS/AccessClients/9.4/DataDrivers/mysrv1_jars";
option set=SAS_HADOOP_CONFIG_PATH="/mysrv1_cfg";
 
libname spkhdfs spark server='spark-server' port=10001 database='default'
  user=myuserid password="myuserpwd";

For more information, see Bulk Loading to Spark Using HDFS.
