This section describes the LIBNAME statement that SAS/ACCESS Interface to Hadoop supports. For general information about this feature, see LIBNAME Statement for Relational Databases.
Here is the LIBNAME statement syntax for accessing Hadoop:
LIBNAME libref hadoop <connection-options> <LIBNAME-options>;
For general information about the LIBNAME statement that is not specific to SAS/ACCESS, see LIBNAME Statement in SAS Global Statements: Reference.
SAS/ACCESS can access Hadoop data through a JDBC connection to a HiveServer2 service. Depending on your Hive release, Hive might not implement Read security. If Hive does not implement Read security, a successful connection from SAS allows Read access to all data that is accessible to the Hive service. SAS/ACCESS can connect to a Hive or HiveServer2 service that is unsecured, secured with a user name and password, or secured by Kerberos.
SAS/ACCESS creates and appends to Hive tables using the HDFS service. Your HDFS connection needs Write access to the HDFS /tmp directory. After data is written to /tmp, a Hive LOAD command is issued on your JDBC connection to associate the data with a Hive table. Therefore, the JDBC Hive session also needs Write access to /tmp.
To optimize big data Reads, SAS/ACCESS creates a temporary table in HDFS /tmp. This requires that the SAS JDBC connection have Write access to /tmp. The temporary table is read using HDFS, so the SAS HDFS connection needs Read access to the temporary table that is written to /tmp. Alternatively, use the HDFS_TEMPDIR option to specify an HDFS directory to use instead of /tmp.
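For example, here is a minimal sketch of a LIBNAME statement that redirects temporary data away from /tmp (the server name, credentials, and the /sas/tempdata directory are placeholders):
libname hdp hadoop user='user-name' pwd='password'
   server='hive-server' database='default'
   hdfs_tempdir='/sas/tempdata';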
specifies any SAS name that serves as an alias to associate SAS with a database, schema, server, or group of tables and views.
specifies the SAS/ACCESS engine name for the Hadoop interface.
provides connection information and controls how SAS manages timing and concurrence of the connection to Hadoop. All connection options are also valid in the CONNECT statement when you use the SQL pass-through facility (SQL procedure) to connect to your DBMS.
enables or disables SSL wire encryption of client requests.
When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
This option behaves differently in SAS 9.4M9 than in SAS 9.4M8 and SAS Viya 3.5.
[SAS 9.4M9] specifies additional resources, including JDBC drivers, to add to the JDBC class path. If you specify CLASSPATH=, also specify the DRIVERCLASS= option to ensure that the correct driver is selected.
UNIX default:
/SASHOME/AccessClients/9.4/DataDrivers/jdbc
or, if SASHOME is not set:
/usr/local/SAS/AccessClients/9.4/DataDrivers/jdbc
Windows default:
SASHOME\AccessClients\9.4\DataDrivers\jdbc
or, if SASHOME is not set:
C:\Program Files\SASHome\AccessClients\9.4\DataDrivers\jdbc
[SAS 9.4M8 and SAS Viya 3.5] specifies the path to the JDBC JAR files. The value that you specify for CLASSPATH= overrides the value that is set for the SAS_ACCESS_CLASSPATH environment variable. You must set a value for the CLASSPATH= connection option or the SAS_ACCESS_CLASSPATH environment variable. The default value is the value that is set for the SAS_ACCESS_CLASSPATH environment variable.
Example:
libname x hadoop user=hive pwd='XXXXXX'
classpath='/usr/local/SAS/AccessClients/9.4/DataDrivers/jdbc/alt'
driverclass='class'
server=cdp71p3hive
database=default;
specifies the fully qualified class name of the JDBC driver to use for your database connection.
Alias: DRIVER=, CLASS=, HIVECLASS=
Examples:
driverclass="com.cloudera.hive.jdbc.HS2Driver"
driverclass="org.apache.hive.jdbc.HiveDriver"
when not in Hive mode, specifies the path to the Hadoop directory where SAS reads and writes data (for example, '/sas/hpa'). For details, see Accessing Data Independently from Hive in Base SAS Procedures Guide.
Alias: HDFS_PERMDIR=
Use this option only when you are not using Hive or HiveServer2.
Note: This option is deprecated in SAS 9.4M9.
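Example (a sketch that uses the HDFS_PERMDIR= alias and the '/sas/hpa' directory shown above; the server name and credentials are placeholders):
libname hdp hadoop server='hadoop-server' user='user-name' pwd='password'
   hdfs_permdir='/sas/hpa';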
specifies the path to an HDFS directory that contains XML-based table definitions, called SASHDMD descriptors. Through these descriptors, SAS then accesses the data using HDFS instead of Hive.
Use this option only when you are not using Hive or HiveServer2.
Note: This option is deprecated in SAS 9.4M9.
specifies the path to the HDFS directory where SAS reads and writes temporary data.
Default: HDFS_TEMPDIR='/tmp'
Use this option only when you are not using Hive or HiveServer2.
Note: This option is deprecated in SAS 9.4M9.
specifies the principal name of the Hive server in an environment that uses Kerberos.
Default: none
This argument is not required if hive-site.xml exists in the directory that is specified by the SAS_HADOOP_CONFIG_PATH environment variable.
Alias: HIVE_KERBEROS_PRINCIPAL=
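Example (a sketch that uses the HIVE_KERBEROS_PRINCIPAL= alias with the service principal name from the Kerberos example later in this section; the server name is a placeholder):
libname hdp hadoop server='hive-server'
   hive_kerberos_principal='hive/node1.example.com@EXAMPLE.COM';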
specifies the HTTP or HTTPS endpoint for building the JDBC URL connection string. This option is ignored when the URL= option is specified.
Default: none
Note: Support for this option starts in SAS 9.4M9.
specifies the MapReduce job name.
When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
specifies the MapReduce queue name.
When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
specifies the Hadoop password that is associated with your user ID. If the password contains spaces or nonalphanumeric characters, you must enclose it in quotation marks. If you do not want to enter your Hadoop password in uncoded text on this statement, see PROC PWENCODE in the Base SAS Procedures Guide for a method to encode it.
Alias: PASS=, PWD=, PW=
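For example, this sketch encodes a password with PROC PWENCODE and then supplies the encoded value in the LIBNAME statement (the {SAS002} value shown is illustrative, not real output; the server name and user are placeholders):
proc pwencode in='MyHadoopPwd1';
run;
/* The encoded value is written to the SAS log. */
libname hdp hadoop server='hive-server' user='user-name'
   pwd='{SAS002}DBCC5712369DE1C65B19864C1564FB79';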
specifies the port number to use to connect to the specified Hive service.
Alias: SERVICE=
Default: 10000
specifies the Hive schema.
Alias: DATABASE=, DB=
Default: default
specifies the Hadoop server name that runs the Hive service. If the server name contains spaces or nonalphanumeric characters, you must enclose it in quotation marks.
Alias: HOST=
specifies the path to the SSL trust store.
When specified in the LIBNAME statement, this argument overrides the value stored in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
specifies an Apache Tez queue name in which to execute client requests.
When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
specifies the password for the trust store.
When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
specifies the full Hive JDBC connection string. The value is used exactly as specified to connect to Hive; SAS does not modify the connection string. The JDBC URL options in the connection string are driver-specific. For details about the driver options to include in the string, refer to your driver documentation. See Hadoop LIBNAME Statement Examples for examples of URL values.
Alias: URL=
Default: none
specifies the user name for Read (JDBC) and Write (HDFS) operations. Do not use the USER= argument if your Hadoop cluster is secured by Kerberos.
Alias: UID=
specifies the ZooKeeper namespace.
When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
specifies the port number that ZooKeeper listens to for client requests.
Note: Support for this option starts in SAS 9.4M9.
specifies the server nodes that are available to run client requests. Separate node names using commas.
When specified in the LIBNAME statement, this argument overrides the value in the hive-site.xml file.
Note: Support for this option starts in SAS 9.4M9.
specifies how SAS processes DBMS objects. The following table describes the LIBNAME options for SAS/ACCESS Interface to Hadoop, with the applicable default values. This table also identifies LIBNAME options that are valid in the CONNECT statement in the SQL procedure. For details, see LIBNAME Options for Relational Databases.
[Table: Option | Default Value | Valid in CONNECT]
The following examples provide a basic introduction to common SAS/ACCESS to Hadoop configurations. These examples assume that the administrator has previously run the Hadoop tracer tool and copied the Hadoop JAR and configuration files to the specified locations. The tracer tool gathers Java JAR files from the Hadoop deployment; the path to this directory is specified in the SAS_HADOOP_JAR_PATH environment variable. The XML configuration files are copied to the directory that is specified by the SAS_HADOOP_CONFIG_PATH environment variable. HDFS access is achieved using Java-based APIs. In SAS 9.4M9, JAR files have a required location. For more information, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
A typical HADOOP LIBNAME statement specifies ZooKeeper information that provides for Dynamic Service Discovery. Apache ZooKeeper is a distributed coordination service that allows Hive to run in HiveServer2 high availability mode: multiple HiveServer2 instances register with ZooKeeper, and clients query ZooKeeper to discover an available instance.
To enable Dynamic Service Discovery with SAS/ACCESS to Hadoop, specify the ZooKeeper options ZOOKEEPER_QUORUM=, ZOOKEEPER_NAMESPACE=, and ZOOKEEPER_PORT= in the LIBNAME statement. These options can be found in the hive-site.xml file for your Hadoop installation.
option set=SAS_HADOOP_JAR_PATH="/usr/local/SAS/AccessClients/9.4/DataDrivers/mysrv1_jars";
option set=SAS_HADOOP_CONFIG_PATH="/mysrv1_cfg";
libname x hadoop user='user-name' pwd='password'
database='default'
zookeeper_quorum='node1.company.com,node2.company.com,node3.company.com'
zookeeper_namespace='hiveserver2'
zookeeper_port=2181;
Specifying the USER= and PWD= options indicates that LDAP authentication is used. To avoid exposing sensitive credentials, the AUTHDOMAIN= option is preferred.
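For example, here is a sketch of the ZooKeeper connection above rewritten to use stored credentials (assuming an authentication domain named HiveAuth has already been defined for your deployment):
libname x hadoop authdomain='HiveAuth'
   database='default'
   zookeeper_quorum='node1.company.com,node2.company.com,node3.company.com'
   zookeeper_namespace='hiveserver2'
   zookeeper_port=2181;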
To connect without using ZooKeeper, specify the SERVER= option, which identifies the HiveServer2 endpoint. When SERVER= is specified, SAS/ACCESS to Hadoop does not use HiveServer2 high availability. This option is appropriate when your Hadoop endpoint is not configured with ZooKeeper.
option set=SAS_HADOOP_JAR_PATH="/usr/local/SAS/AccessClients/9.4/DataDrivers/mysrv1_jars";
option set=SAS_HADOOP_CONFIG_PATH="/mysrv1_cfg";
libname x hadoop user='user-name' pwd='password'
server='hive-server' database='default' port=10000;
Apache Knox is a gateway service that provides secure access to Apache Hadoop clusters. It acts as a reverse proxy, enabling users and applications to interact with Hadoop services (such as HDFS, Hive, and YARN) without directly exposing them to external networks.
This type of connection has the advantage of not requiring any JAR files to be pulled from the Hadoop cluster, although Hadoop configuration files are still required. The following example connects to Hive through a Knox gateway:
options set=SAS_HADOOP_RESTFUL=1;
options set=SAS_HADOOP_CONFIG_PATH="/mysrv1_cfg";
options set=KNOX_GATEWAY_URL="cloudera-URL";
libname x hadoop user=user-name pw='password'
driverclass=com.cloudera.hive.jdbc.HS2Driver
uri="jdbc:hive2://cloudera-server-id:port/database;transportMode=http;ssl=1;httpPath=myHttpPath;
useNativeQuery=1";
Before Kerberos can be used, an administrator must first set up Kerberos for JDBC connections using kinit. For instructions, see information about SAS/ACCESS Interface to Hadoop in SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS. In this example, the URL= option connects to a Hive server with Kerberos enabled, but without SSL enabled. The Kerberos service principal name is hive/node1.example.com@EXAMPLE.COM, the host name for the data source is node1.example.com, and the server is listening on port 10000 for JDBC connections.
The KrbAuthType= value in the URL= string is specific to the operating environment. The value 3 is for AIX. For Windows and Linux, use 0 or 2. For details about the driver options to include in the URL= string, refer to your driver documentation.
option set=SAS_HADOOP_JAR_PATH="/usr/local/SAS/AccessClients/9.4/DataDrivers/mysrv1_jars";
option set=SAS_HADOOP_CONFIG_PATH="/mysrv1_cfg";
%kerberos_kinit_domain(database);
libname x hadoop
driverclass=com.cloudera.hive.jdbc.HS2Driver
url='jdbc:hive2://node1.example.com:10000/default;AuthMech=1;KrbAuthType=3;KrbRealm=example.com;
KrbHostFQDN=hs2node1.example.com;KrbServiceName=hive'
;