Hadoop is a general-purpose data storage and computing platform that includes SQL-like products, such as Apache Hive. SAS/ACCESS Interface to Hadoop lets you work with your data by using HiveQL, a query language that is based on SQL. HiveQL has its own extensions and limitations.
With SAS/ACCESS Interface to Hadoop, you can read and write data to and from Hadoop as if it were any other relational data source to which SAS can connect. This interface provides fast, efficient access to data stored in Hadoop through HiveQL. You can access Hive tables as if they were native SAS data sets and then analyze them using SAS.
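For example, a LIBNAME statement assigns a libref to a Hive schema, after which Hive tables can be referenced like SAS data sets. This is a minimal sketch; the server name, user, schema, and table names below are hypothetical placeholders for your site's values.

```sas
/* Assign a libref to Hive through the Hadoop engine.            */
/* Server name, user, and schema are hypothetical placeholders.  */
libname hdp hadoop server="hive.example.com" port=10000
        user=sasuser schema=default;

/* Read a Hive table as if it were a SAS data set */
proc print data=hdp.sales(obs=10);
run;

/* Analyze it with a standard SAS procedure */
proc means data=hdp.sales;
   var amount;
run;
```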
SAS/ACCESS Interface to Hadoop also reads data directly from the Hadoop Distributed File System (HDFS) when possible to improve performance. This differs from the traditional SAS/ACCESS engine behavior, which exclusively uses database SQL to read and write data. For more information, see the READ_METHOD= LIBNAME option.
When using HiveQL, the Hadoop engine requires access to the HiveServer2 service that runs on the Hadoop cluster, typically on port 10000. For HDFS operations, such as writing data to Hive tables, the Hadoop engine also requires access to the HDFS service that runs on the Hadoop cluster, typically on port 8020. If the Hadoop engine cannot access the HDFS service, full functionality is not available and the READ_METHOD= option is constrained to READ_METHOD=JDBC.
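As a hedged illustration of the fallback described above: if only HiveServer2 is reachable and the HDFS service is not, you can force all reads through JDBC with the READ_METHOD= option. The server name and credentials here are assumptions for the example.

```sas
/* Hypothetical example: the HDFS service (often port 8020) is   */
/* blocked, so direct HDFS reads are unavailable. Force reads    */
/* through HiveServer2 (often port 10000) via JDBC instead.      */
libname hdp hadoop server="hive.example.com" port=10000
        user=sasuser read_method=jdbc;
```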
SAS/ACCESS Interface to Hadoop includes SAS Data Connector to Hadoop. If you have the appropriate license, you might also have access to the SAS Data Connect Accelerator to Hadoop. The data connector or data connect accelerator enables you to load large amounts of data into the CAS server for parallel processing. For more information, see these sections:
For available SAS/ACCESS features, see Hadoop supported features. For more information about Hadoop, see your Hadoop documentation.
Your Hadoop administrator configures the Hadoop cluster that you use. Your administrator has specified the defaults for system parameters, such as block size and replication factor, that affect the Read and Write performance of your system. Replication factors greater than 1 help prevent data loss but also slow the writing of data. Consult with your administrator about how your particular cluster was configured.
For the configuration steps that are needed before you can connect to a Hadoop server, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
Keep these limitations in mind when working with Hadoop and Hive.
One such limitation involves table comments. You can specify a table comment with DBCREATE_TABLE_OPTS="COMMENT 'my table comment'". Although this option accepts any NLS character, the NLS portion of the comment is not displayed properly later.
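The table comment limitation can be sketched as follows. The libref and table names are hypothetical placeholders; note that the comment should avoid NLS characters because of the display limitation described above.

```sas
/* Create a Hive table with a table comment via                  */
/* DBCREATE_TABLE_OPTS=. Keep the comment to non-NLS characters; */
/* the NLS portion of a comment is not displayed properly later. */
/* The libref hdp and the table name are placeholders.           */
data hdp.class_copy(dbcreate_table_opts="COMMENT 'my table comment'");
   set sashelp.class;
run;
```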