Data Types for Hadoop

Supported Hive Data Types

The following tables show the Hive data types that the Hadoop engine supports and how they are converted.

LIBNAME Statement Data Conversions

This table shows the default formats that SAS/ACCESS Interface to Hadoop assigns to SAS variables when you use the LIBNAME statement for the Hadoop engine. These default formats are based on Hive column attributes.

LIBNAME Statement: Default SAS Formats for Hive Data Types

Hive Data Type     SAS Data Type    Default SAS Format
ARRAY              character        none
BINARY             character        $HEX32767
BOOLEAN            numeric          w. (1.)
CHAR(n) 1          character        $w.
DATE               numeric          DATE9.
DECIMAL(p,s) 2     numeric          w.d
DOUBLE             numeric          w.
FLOAT              numeric          w.
INT                numeric          w. (11.)
INTERVAL           numeric          DATETIME25.6
MAP                character        none
SMALLINT           numeric          w. (6.)
STRING 3           character        $32767.
STRUCT             character        none
TIMESTAMP          numeric          DATETIME25.6
TINYINT            numeric          w. (4.)
VARCHAR(n) 1       character        $w.

1 n in Hive data types is equivalent to w. in SAS formats.
2 p,s in Hive data types is equivalent to w.d in SAS formats.
3 The STRING data type might be mapped to the VARCHAR data type by the JDBC client driver. For more information, check your JDBC vendor’s documentation.
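For example, after you assign a libref to a Hive schema, PROC CONTENTS reports the default format that SAS assigns to each column according to this table. This is a minimal sketch; the mytable table name is illustrative, and the connection options match the later examples.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;

/* The Format column of the output shows the defaults from the table above,
   for example DATETIME25.6 for a Hive TIMESTAMP column and 11. for INT. */
proc contents data=hdp.mytable;
run;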

This table shows the default Hive data types that SAS/ACCESS assigns to SAS variable formats during output operations when you use the LIBNAME statement.

LIBNAME Statement: Default Hive Data Types for SAS Variable Formats

SAS Variable Format    Hive Data Type
w.d                    DOUBLE
w.                     INT, SMALLINT, TINYINT, BIGINT
$w.                    VARCHAR
datetime formats       TIMESTAMP
date formats           DATE
time formats 1         VARCHAR

1 A column created in Hive using a TIME format is created as a VARCHAR column with a SASFMT table property.
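For example, a DATA step like this sketch (the format_demo table name and variables are illustrative) creates a Hive table whose column types follow the mappings in this table.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;

data hdp.format_demo;
format amount 8.2;        /* w.d format      -> DOUBLE               */
format qty 6.;            /* w. format       -> a Hive integer type  */
format name $30.;         /* $w. format      -> VARCHAR              */
format ts datetime25.6;   /* datetime format -> TIMESTAMP            */
format d date9.;          /* date format     -> DATE                 */
format t time15.6;        /* time format     -> VARCHAR with SASFMT  */
amount=19.95; qty=3; name='widget';
ts=datetime(); d=today(); t=time();
run;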

Issues When Converting Data from Hive to SAS

The following topics describe some potential conversion issues. Workarounds depend on how you access the data.

Leverage Table Properties for Existing Hive Tables

SAS/ACCESS generates SAS table properties only when it creates a new Hive table. Many or perhaps all of your Hive tables are created by other means. For example, your Hadoop administrator might create Hive table definitions by submitting DDL scripts to the Hive CLI. SAS and SAS users can benefit from adding SAS table properties to existing Hive table definitions. In this example, a Hive table has already been defined.

CREATE EXTERNAL TABLE weblogs (extract_date STRING, 
   extract_type INT, webdata STRING) ROW FORMAT DELIMITED FIELDS 
   TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hadoop/web_data'

Based on this table definition, here is how SAS interprets the columns.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=2012-02-21 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
          Alphabetic List of Variables and Attributes
#  Variable       Type   Len     Format    Informat    Label
1  extract_date   Char   32767   $32767.   $32767.     extract_date
2  extract_type   Num        8   11.       11.         extract_type
3  webdata        Char   32767   $32767.   $32767.     webdata

Notice that Hive describes both the extract_date and webdata columns to SAS as STRING columns with a length of 32767. So SAS/ACCESS treats both columns as character data and applies the $32767. format to them. The result is an overly wide SAS data set with an observation (row) width of 64 kilobytes, and extract_date is not formatted as a SAS date.

SAS issues a warning message for this situation, which includes the maximum column length that was observed in the result set. In this example, the maximum length read for the extract_date STRING column is 10 bytes, and the maximum length read for the webdata STRING column is 320 bytes.

WARNING: SAS/ACCESS assigned these columns a length of 32767. If resulting SAS character variables remain this length, SAS performance is impacted. See SAS/ACCESS documentation for details. Columns followed by the maximum length observed were: extract_date:10, webdata:320

The example below assumes that the length of the webdata STRING in Hive never exceeds 1000 characters. A Hadoop user ID with the appropriate authority can issue Hive ALTER TABLE statements to add SAS table properties to the Hive table definition.

ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:extract_date'='DATE(9.0)')
ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:webdata'='CHAR(1000)')

SAS/ACCESS recognizes the added properties and here is the result.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=21FEB2012 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
          Alphabetic List of Variables and Attributes
#  Variable       Type   Len     Format    Informat    Label
1  extract_date   Num       8    DATE9.    DATE9.      extract_date
2  extract_type   Num       8    11.       11.         extract_type
3  webdata        Char   1000    $1000.    $1000.      webdata

The resulting SAS data set that is created from the Hive table has a much smaller observation width, which helps SAS save disk space and reduce CPU consumption. It also automatically converts and formats extract_date to SAS standard DATE9. format.

Adding SAS table properties to existing Hive tables does not affect how non-SAS software uses those tables. You can also issue ALTER TABLE and other DDL statements using SAS/ACCESS explicit SQL. (See SQL Pass-Through Facility Specifics for Hadoop.) Permission to issue such DDL as an ALTER TABLE statement can be restricted to the Hadoop administrator.
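For example, here is a minimal sketch of issuing one of these ALTER TABLE statements through the SQL pass-through facility. It reuses the connection options from the earlier LIBNAME examples and assumes that your Hadoop user ID is authorized to alter the table.

proc sql;
connect to hadoop (server=mysrv1 user=myusr1 pwd=mypwd1);
execute (ALTER TABLE weblogs
         SET TBLPROPERTIES ('SASFMT:webdata'='CHAR(1000)')) by hadoop;
disconnect from hadoop;
quit;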

Address Issues When Converting Data from Hive to SAS with Table Properties

Some issues currently exist when reading Hadoop data into SAS. (See Issues When Converting Data from Hive to SAS.) For example, without a SASFMT table property or a SAS override option such as DBSASTYPE=, Hive STRING columns default to the $32767. SAS character format.

Here is how you can address specific conversion issues.

STRING issues

To automatically convert Hive STRING columns that contain ANSI date, timestamp, or time values to suitable SAS formats, you can use the following HiveQL ALTER TABLE statements. The statements use these sample Hive columns: d contains an ANSI date, ts contains an ANSI timestamp, and t contains an ANSI time.

Note: Keep in mind that you cannot run ALTER TABLE commands in SAS because they are not SAS syntax. You must execute these commands within Hive.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='DATE(9.0)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='DATETIME(25.6)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='TIME(15.6)')

Alternatively, you could use these statements to create SAS character variables of optimal length that contain the same ANSI representation that is stored in Hive:

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='CHAR(9)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='CHAR(25)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='CHAR(15)')

You can use the following statement for other Hive STRING columns where the maximum length is less than 32767. Here, the string_col column has a maximum length of 100.

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(100)')

However, if you anticipate that the Hive string_col column might grow to a maximum length of 200 in the future, you could instead use this statement to specify the table property.

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(200)')

Hive STRING columns longer than 32767 characters are truncated when they are read into SAS. Here is how the warning for this data loss is flagged in the SAS log:

WARNING: Column 'string_col' was truncated 1 times.
Observation (row) number 2 was the first observation truncated.

BIGINT issues

Converting a Hadoop BIGINT column to a SAS numeric column can cause a loss of precision. A SAS numeric column can accurately preserve only 15 digits of precision, whereas a BIGINT column can hold up to 19 significant digits, plus one character for a possible sign (+/-). You can address this issue by applying a CHAR(20) table property format. SAS then automatically reads the Hive BIGINT column into a SAS character variable with a $20. format, which preserves all BIGINT digits in character form. Here is an example, using the bgint BIGINT column.

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:bgint'='CHAR(20)')
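If you cannot alter the Hive table, the DBSASTYPE= data set option (described later in this section) requests the same conversion for a single step. This sketch reuses the hdp libref and the sample_table table from the surrounding examples.

data work.local_sample;
set hdp.sample_table( keep=bgint dbsastype=(bgint='CHAR(20)') );
run;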

Keep this important consideration in mind, however: for Hive tables that SAS/ACCESS itself creates, the SASFMT table properties are generated automatically, so you might not need to issue ALTER TABLE statements.

CAUTION

Do not create multiple table properties for a single Hive column. Unpredictable data conversion can result.

Alternatives to Table Properties for Issues with Data Conversion from Hive to SAS

For various reasons, it might be impractical or undesirable to issue ALTER TABLE statements to create SAS table properties. In such cases, you can instead use these data set options.

DBSASTYPE=

Use DBSASTYPE= in your SAS code to produce the same data conversion from Hive to SAS as automatic conversion with table properties. The pairs of SAS DATA steps below behave identically: the first step of each pair assumes a SASFMT table property, and the second assumes no table property and instead adds DBSASTYPE= to achieve the same result. (For details, see the DBSASTYPE= data set option.)

Here is the SAS LIBNAME statement for all of these SAS DATA steps.

Note: Remember that you cannot run ALTER TABLE commands in SAS because they are not SAS syntax. You must execute these commands within Hive.
libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='DATE(9.0)')
[---assumes table property 'SASFMT:d'='DATE(9.0)' ---]
data work.local_sample; set hdp.sample_table( keep=d ); run;

[---assumes no table property for column ‘d’---]
data work.local_sample;
set hdp.sample_table( keep=d dbsastype=(d='DATE(9.0)') ); run;

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='DATETIME(25.6)')
[---assumes table property 'SASFMT:ts'='DATETIME(25.6)' ---]
data work.local_sample; set hdp.sample_table( keep=ts ); run;

[---assumes no table property for column ‘ts’---]
data work.local_sample;
set hdp.sample_table( keep=ts dbsastype=(ts='DATETIME(25.6)') ); run;

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='TIME(15.6)')
[---assumes table property 'SASFMT:t'='TIME(15.6)' ---]
data work.local_sample; set hdp.sample_table( keep=t ); run;

[---assumes no table property for column ‘t’---]
data work.local_sample;
set hdp.sample_table( keep=t dbsastype=(t='TIME(15.6)') ); run;

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(200)')
[---assumes table property 'SASFMT:string_col'='CHAR(200)' ---]
data work.local_sample; set hdp.sample_table( keep=string_col ); run;

[---assumes no table property for column ‘string_col’---]
data work.local_sample;
set hdp.sample_table( keep=string_col dbsastype=(string_col='CHAR(200)') );
run;
DBMAX_TEXT=[n]

You can use the DBMAX_TEXT= option to limit the SAS length of all STRING columns that are read from Hive. For example, if you specify DBMAX_TEXT=100, then all SAS character variables that are created from Hive STRING columns are limited to a length of 100 ($100. format). Specifying the DBMAX_TEXT= option likewise limits the SAS length of CHAR and VARCHAR columns in Hive 12 and later.
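For example, you can specify DBMAX_TEXT= on the LIBNAME statement so that it applies to every table in the library, or as a data set option on an individual table reference. This sketch reuses the libref and the weblogs table from the earlier examples.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1 dbmax_text=100;

/* Or limit STRING lengths for a single step only. */
data work.local_sample;
set hdp.weblogs( dbmax_text=100 );
run;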

Address Issues When Converting Data from Hive to SAS for Pass-Through SQL

Neither table properties nor DBSASTYPE= address data conversion issues from Hive to SAS if you use pass-through SQL to read Hive data. For pass-through SQL, you might need to explicitly convert and format each Hive column as you want it to be represented in SAS. For example, suppose you use SAS to create a table with SAS table properties that are generated for all but the BIGINT column. Here is the table that SAS creates.

libname hdp hadoop server=mysrv1 user=myusr1 pwd=mypwd1;
data hdp.passthrough_ex( dbtype=(bgint="BIGINT") );
bgint='1234567890123456789';
format ts datetime25.6; ts=datetime();
format d date9.; d=today();
format t time10.; t=time();
format string_col $20.; string_col='hello';
run;

SAS issues this HiveQL when creating the table.

CREATE TABLE `PASSTHROUGH_EX` (`bgint` BIGINT,`ts` STRING,
   `d` STRING,`t` STRING,`string_col` STRING) ROW FORMAT DELIMITED FIELDS 
   TERMINATED BY '\001' STORED AS TEXTFILE TBLPROPERTIES
('SASFMT:ts'='DATETIME(25.6)','SASFMT:d'='DATE(9.0)',
   'SASFMT:t'='TIME(10.0)','SASFMT:string_col'='CHAR(20)')

Next, an ALTER TABLE statement is issued to add a table property for BIGINT column bgint.

ALTER TABLE passthrough_ex SET TBLPROPERTIES ('SASFMT:bgint'='CHAR(20)')

A LIBNAME-based read of the table into SAS recognizes the table properties.

data work.local; set hdp.passthrough_ex; run;

data _null_; set work.local;
put bgint=; put ts=; put d=; put t=; put string_col=;
run;
bgint=1234567890123456789
ts=25FEB2012:02:00:55.141000
d=25FEB2012
t=2:00:55
string_col=hello

This pass-through SQL step converts and formats each column identically to the LIBNAME-based step that applied the table properties.

proc sql; connect to hadoop(server=mysrv1 user=myusr1 pwd=mypwd1);
create table work.local as select
bgint length 20 format $20. informat $20.,
input(ts, IS8601DT26.) as ts format datetime25.6 informat datetime25.6,
input(d, yymmdd10.) as d format date9. informat date9.,
input(t, IS8601TM15.) as t format time15.6 informat time15.6,
string_col length 20 format $20. informat $20.
from connection to hadoop( select cast(bgint as STRING)
   as bgint,ts,d,t,string_col from passthrough_ex );
quit;
data _null_; set work.local;
put bgint=; put ts=; put d=; put t=; put string_col=;
run;
bgint=1234567890123456789
ts=25FEB2012:02:00:55.141000
d=25FEB2012
t=2:00:55.141000
string_col=hello

If SAS detects that the length for a character column is 32767 in SAS and could be less, it writes a message that includes the maximum length observed:

WARNING: These columns could have a length in SAS of 32767. If so, SAS performance is impacted. See SAS/ACCESS documentation for details. The columns read from Hive followed by the maximum length observed were: bgint:20, ts:26, d:9, t:15, string_col:20.

Hadoop Null Values

Hadoop has a special value called NULL. A Hadoop NULL value means an absence of information and is analogous to a SAS missing value. When SAS/ACCESS reads a Hadoop NULL value, it interprets it as a SAS missing value. For more information about how SAS handles NULL values, see Potential Result Set Differences When Processing Null Data.
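For example, this minimal sketch (reusing the hdp libref and the weblogs table from the earlier examples) tests for Hive NULLs after they arrive in SAS as missing values:

data work.check_nulls;
set hdp.weblogs;
/* A NULL extract_type in Hive is read as a SAS missing value. */
if missing(extract_type) then put 'NULL read as missing for ' extract_date=;
run;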
