Here are the Hive data types that the Hadoop engine supports: ARRAY, BIGINT, BINARY, BOOLEAN, CHAR(n), DATE, DECIMAL, DOUBLE, FLOAT, INT, INTERVAL, MAP, SMALLINT, STRING, STRUCT, TIMESTAMP, TINYINT, and VARCHAR(n).
This table shows the default formats that SAS/ACCESS Interface to Hadoop assigns to SAS variables when you use the LIBNAME statement for the Hadoop engine. These default formats are based on Hive column attributes.
| Hive Data Type | SAS Data Type | Default SAS Format |
|---|---|---|
| ARRAY | character | none |
| BINARY | character | $HEX32767. |
| BOOLEAN | numeric | w. (1.) |
| CHAR(n)¹ | character | $w. |
| DATE | numeric | DATE9. |
| DECIMAL(p,s)² | numeric | w.d |
| DOUBLE | numeric | w. |
| FLOAT | numeric | w. |
| INT | numeric | w. (11.) |
| INTERVAL | numeric | DATETIME25.6 |
| MAP | character | none |
| SMALLINT | numeric | w. (6.) |
| STRING³ | character | $32767. |
| STRUCT | character | none |
| TIMESTAMP | numeric | DATETIME25.6 |
| TINYINT | numeric | w. (4.) |
| VARCHAR(n)¹ | character | $w. |

¹ n in Hive data types is equivalent to w. in SAS formats.
² p,s in Hive data types is equivalent to w.d in SAS formats.
³ The STRING data type might be mapped to the VARCHAR data type by the JDBC client driver. For more information, check your JDBC vendor's documentation.
This table shows the default Hive data types that SAS/ACCESS assigns to SAS variable formats during output operations when you use the LIBNAME statement.
| SAS Variable Format | Hive Data Type |
|---|---|
| w.d | DOUBLE |
| w. | INT, SMALLINT, TINYINT, BIGINT |
| $w. | VARCHAR |
| datetime formats | TIMESTAMP |
| date formats | DATE |
| time formats¹ | VARCHAR |

¹ A column created in Hive using a TIME format is created as a VARCHAR column with a SASFMT format.
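As an illustration of these output mappings, here is a minimal sketch (the table name type_demo is hypothetical; the connection values are the same placeholders used throughout this section) that writes a SAS data set to Hive and relies on the default type assignments in the table above.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;

data hdp.type_demo;
   /* formats drive the Hive column types that SAS/ACCESS assigns */
   format id 8. amount 8.2 name $20. dt datetime25.6 d date9.;
   id=1; amount=19.95; name='widget'; dt=datetime(); d=today();
run;

Based on the table above, id (a w. format) maps to one of the integer types depending on its width, amount (w.d) maps to DOUBLE, name ($w.) maps to VARCHAR, dt (a datetime format) maps to TIMESTAMP, and d (a date format) maps to DATE.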
Below are some potential conversion issues. For example, a Hive STRING column is read into SAS as a $32767. character variable by default. To limit the length of an individual STRING column, you can add a SASFMT table property, as in this statement:

ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:webdata'='CHAR(1000)')

To specify a general limit for multiple STRING columns, use the DBMAX_TEXT= option. Workarounds are based on how you access data.
SAS/ACCESS generates SAS table properties only when creating a new Hive table. Many or perhaps all of your Hive tables are created by other means. For example, your Hadoop administrator might create Hive table definitions by submitting DDL scripts to the Hive CLI. SAS and SAS users can benefit by adding SAS table properties to existing Hive table definitions. In this example, a Hive table has already been specified.
CREATE EXTERNAL TABLE weblogs (extract_date STRING, extract_type INT, webdata STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
   STORED AS TEXTFILE LOCATION '/user/hadoop/web_data'
Based on this table definition, here is how SAS interprets the columns.
libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=2012-02-21 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
Alphabetic List of Variables and Attributes

#  Variable       Type  Len    Format    Informat  Label
1  extract_date   Char  32767  $32767.   $32767.   extract_date
2  extract_type   Num   8      11.       11.       extract_type
3  webdata        Char  32767  $32767.   $32767.   webdata
Notice that Hive describes the extract_date column to SAS as a STRING with a length of 32767, and it describes the webdata column the same way. As a result, SAS/ACCESS reads both columns as character data and applies the $32767. format to them. The result is an overly wide SAS data set with an observation (row) width of 64 kilobytes that also does not format extract_date as a SAS DATE.
SAS issues a warning message for this situation, which includes the maximum column length that was in the result set. In this example, the maximum length read for the extract_date STRING column is 10 bytes, and the maximum length read for the webdata STRING column is 320 bytes.
WARNING:
SAS/ACCESS assigned these columns a length of 32767. If resulting
SAS character variables remain this length, SAS performance is impacted.
See SAS/ACCESS documentation for details. Columns followed by the
maximum length observed were: extract_date:10, webdata:320

The example below assumes
that the length of the webdata STRING
in Hive never exceeds 1000 characters. A Hadoop user ID with the appropriate
authority can issue Hive ALTER TABLE statements to add SAS table properties
to the Hive table definition.
ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:extract_date'='DATE(9.0)')
ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:webdata'='CHAR(1000)')
SAS/ACCESS recognizes the added properties. Here is the result.
libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=21FEB2012 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
Alphabetic List of Variables and Attributes

#  Variable       Type  Len   Format   Informat  Label
1  extract_date   Num   8     DATE9.   DATE9.    extract_date
2  extract_type   Num   8     11.      11.       extract_type
3  webdata        Char  1000  $1000.   $1000.    webdata
The resulting SAS data
set that is created from the Hive table has a much smaller observation
width, which helps SAS save disk space and reduce CPU consumption.
It also automatically converts and formats extract_date to
SAS standard DATE9. format.
Adding SAS properties to existing Hive tables does not affect how non-SAS software uses those tables. You can also issue ALTER TABLE and other DDL statements using SAS/ACCESS explicit SQL. (See SQL Pass-Through Facility Specifics for Hadoop.) Issuing DDL such as ALTER TABLE statements can be restricted to the Hadoop administrator.
Some issues currently exist when reading Hadoop data into SAS. (See Issues When Converting Data from Hive to SAS.) For example, Hive STRING columns default to the $32767. SAS character format unless a SASFMT table property or a SAS override option such as DBSASTYPE= is specified.
Here is how you can address specific conversion issues.
To automatically convert Hive STRING columns that contain ANSI date, timestamp, or time values to suitable SAS formats, you can use the following HiveQL ALTER TABLE statements. The statements use these sample Hive columns: d contains an ANSI date, ts contains an ANSI timestamp, and t contains an ANSI time.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='DATE(9.0)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='DATETIME(25.6)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='TIME(15.6)')
Alternatively, you could use these statements to create SAS character variables of optimal length that contain the same ANSI representation that is stored in Hive:
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='CHAR(9)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='CHAR(25)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='CHAR(15)')
You can use the following
statement for other Hive STRING columns where the maximum length is
less than 32767. Here, the string_col column
has a maximum length of 100.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(100)')
However, if you anticipate
that the Hive string_col column might
grow to a maximum length of 200 in the future, you could instead use
this statement to specify the table property.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(200)')
Hive STRING columns longer than 32767 characters are truncated when they are read into SAS. Here is how the warning for this data loss is flagged in the SAS log:
WARNING: Column 'string_col' was truncated 1 times. Observation (row) number 2 was the first observation truncated.
Converting a Hadoop
BIGINT column to a SAS numeric column can cause a loss of precision.
A SAS numeric column can accurately preserve only 15 digits of precision.
However, a BIGINT column can preserve up to 19 significant digits
of precision, plus one character for the possible sign (+/-). You
can address this issue by applying a CHAR(20) table property format.
SAS then automatically reads a Hive BIGINT column into a SAS character
string with $20. format. This format preserves all BIGINT digits in
character format. Here is an example, using the bgint BIGINT
column.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:bgint'='CHAR(20)')
Keep this important consideration in mind, however: For Hive tables that SAS/ACCESS creates, you might not need to issue ALTER TABLE statements.
CAUTION
Do not create multiple table properties for a single Hive column. Unpredictable data conversion can result.
For various reasons, it might be impractical or undesirable to issue ALTER TABLE statements to create SAS table properties. In such cases, you can instead use these data set options.
Use DBSASTYPE= in your SAS code to cause data conversion from Hive to SAS that is identical to automatic conversion with table properties. The pairs below are SAS DATA steps with identical behavior. The first of each pair assumes a SASFMT table property, the second one assumes no table property, and DBSASTYPE= is added to achieve the same functionality. (For details, see the DBSASTYPE= data set option.)
Here is the SAS LIBNAME statement for all of these SAS DATA steps.
libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='DATE(9.0)')
[---assumes table property 'SASFMT:d'='DATE(9.0)' ---]
data work.local_sample; set hdp.sample_table( keep=d ); run;
[---assumes no table property for column ‘d’---]
data work.local_sample;
set hdp.sample_table( keep=d dbsastype=(d='DATE(9.0)') ); run;
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='DATETIME(25.6)')
[---assumes table property 'SASFMT:ts'='DATETIME(25.6)' ---]
data work.local_sample; set hdp.sample_table( keep=ts ); run;
[---assumes no table property for column ‘ts’---]
data work.local_sample;
set hdp.sample_table( keep=ts dbsastype=(ts='DATETIME(25.6)') ); run;
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='TIME(15.6)')
[---assumes table property 'SASFMT:t'='TIME(15.6)' ---]
data work.local_sample; set hdp.sample_table( keep=t ); run;
[---assumes no table property for column ‘t’---]
data work.local_sample;
set hdp.sample_table( keep=t dbsastype=(t='TIME(15.6)') ); run;
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(200)')
[---assumes table property 'SASFMT:string_col'='CHAR(200)' ---]
data work.local_sample; set hdp.sample_table( keep=string_col ); run;
[---assumes no table property for column ‘string_col’---]
data work.local_sample;
set hdp.sample_table( keep=string_col dbsastype=(string_col='CHAR(200)') );
run;
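The same pattern applies to the BIGINT column from the earlier example. This sketch assumes that sample_table also contains the bgint column; the CHAR(20) mapping matches the table property shown previously.

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:bgint'='CHAR(20)')
[---assumes table property 'SASFMT:bgint'='CHAR(20)' ---]
data work.local_sample; set hdp.sample_table( keep=bgint ); run;
[---assumes no table property for column 'bgint'---]
data work.local_sample;
set hdp.sample_table( keep=bgint dbsastype=(bgint='CHAR(20)') ); run;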
You can use the DBMAX_TEXT=
option to limit the SAS length of all STRING columns read from Hive.
For example, if you specify DBMAX_TEXT=100, then all SAS character
variables that are created from Hive STRING columns are limited to
width $100. Specifying the DBMAX_TEXT=
option likewise limits the length in SAS of Hive 12 and higher CHAR
and VARCHAR columns.
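For example, here is a sketch of DBMAX_TEXT= specified on the LIBNAME statement (it can also be specified as a data set option); the connection values and sample_table are the same placeholders used throughout this section.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1 dbmax_text=100;
/* every Hive STRING column in this library is read as a $100. character variable */
data work.local_sample; set hdp.sample_table; run;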
Neither table properties nor DBSASTYPE= address data conversion issues from Hive to SAS if you use pass-through SQL to read Hive data. For pass-through SQL, you might need to explicitly convert and format each Hive column as you want it to be represented in SAS. For example, suppose you use SAS to create a table with SAS table properties that are generated for all but the BIGINT column. Here is the table that SAS creates.
libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;
data hdp.passthrough_ex( dbtype=(bgint="BIGINT") );
bgint='1234567890123456789';
format ts datetime25.6; ts=datetime();
format d date9.; d=today();
format t time10.; t=time();
format string_col $20.; string_col='hello';
run;
SAS issues this HiveQL when creating the table.
CREATE TABLE `PASSTHROUGH_EX` (`bgint` BIGINT,`ts` STRING,
`d` STRING,`t` STRING,`string_col` STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\001' STORED AS TEXTFILE TBLPROPERTIES
('SASFMT:ts'='DATETIME(25.6)','SASFMT:d'='DATE(9.0)',
'SASFMT:t'='TIME(10.0)','SASFMT:string_col'='CHAR(20)')
Next, an ALTER TABLE
statement is issued to add a table property for BIGINT
column bgint.
ALTER TABLE passthrough_ex SET TBLPROPERTIES ('SASFMT:bgint'='CHAR(20)')
A LIBNAME-based read of the table into SAS recognizes the table properties.
data work.local; set hdp.passthrough_ex; run;
data _null_; set work.local;
put bgint=; put ts=; put d=; put t=; put string_col=;
run;
bgint=1234567890123456789
ts=25FEB2012:02:00:55.141000
d=25FEB2012
t=2:00:55
string_col=hello
This pass-through SQL step converts and formats each column identically to the LIBNAME-based step that applied the table properties.
proc sql; connect to hadoop(server=mysrv1 user=myusr1 pwd=mypwd1);
create table work.local as select
bgint length 20 format $20. informat $20.,
input(ts, IS8601DT26.) as ts format datetime25.6 informat datetime25.6,
input(d, yymmdd10.) as d format date9. informat date9.,
input(t, IS8601TM15.) as t format time15.6 informat time15.6,
string_col length 20 format $20. informat $20.
from connection to hadoop( select cast(bgint as STRING)
as bgint,ts,d,t,string_col from passthrough_ex );
quit;
data _null_; set work.local;
put bgint=; put ts=; put d=; put t=; put string_col=;
run;
bgint=1234567890123456789
ts=25FEB2012:02:00:55.141000
d=25FEB2012
t=2:00:55.141000
string_col=hello
If SAS detects that a character column's length in SAS could be 32767 when the values that were read are shorter, it writes a message:

WARNING: These columns could have a length in SAS of 32767. If so, SAS performance is impacted. See SAS/ACCESS documentation for details. The columns read from Hive followed by the maximum length observed were: bgint:20, ts:26, d:9, t:15, string_col:20.

Hadoop has a special value called NULL. A Hadoop NULL value means an absence of information and is analogous to a SAS missing value. When SAS/ACCESS reads a Hadoop NULL value, it interprets it as a SAS missing value. For more information about how SAS handles NULL values, see Potential Result Set Differences When Processing Null Data.
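Here is a minimal sketch of that behavior, assuming a hypothetical sample_table whose string_col column contains NULL values in Hive.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;

data work.nulls;
   set hdp.sample_table( keep=string_col );
   /* is_null is 1 when the Hive value was NULL (read as a SAS missing value), 0 otherwise */
   is_null = missing(string_col);
run;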