Limiting the number of rows that the DBMS returns to SAS is an extremely important performance consideration. The less data that the SAS job requests, the faster the job runs.
Wherever possible, specify selection criteria that limit the number of rows that the DBMS returns to SAS. Use the SAS WHERE clause to retrieve a subset of the DBMS data.
If you are interested
in only the first few rows of a table, consider adding the OBS= option.
SAS passes this option to the DBMS to limit the number of rows to
transmit across the network. Using the OBS= option can significantly
improve performance for larger tables. To do this if you are using
SAS Enterprise Guide, select View
Explorer, select the table
that you want from the list of tables, and select the member that
you want to see the contents of the table.
Likewise, select only the DBMS columns that your program needs. Selecting unnecessary columns slows your job.
Just as with a SAS data set you can use the DROP= and KEEP= data set options to prevent retrieving unneeded columns from your DBMS table.
In this example, the KEEP= data set option causes the SAS/ACCESS engine to select only the SALARY and DEPT columns when it reads the Employees table.
libname myhadlib hadoop user=user1 password=mypwd1 server=hadoopsvr;
proc print data=myhadlib.employees (keep=salary dept);
where dept='ACC024';
quit;
The DBMS generates SQL that is similar to this:
SELECT "SALARY", "DEPT" FROM EMPLOYEES
WHERE(DEPT="ACC024")
Without the KEEP option, the DBMS processes SQL that is similar to this code:
SELECT * FROM EMPLOYEES WHERE(DEPT="ACC024")
This results in all columns from the Employees table being read in to SAS.
The DROP= data set option is a parallel option that specifies columns to omit from the output table. Keep in mind that the DROP= and KEEP= data set options are not interchangeable with the DROP and KEEP statements. Use of the DROP and KEEP statements when selecting data from a DBMS can result in retrieval of all columns into SAS. This can seriously impact performance.
For example, this code results in all columns from the Employees table being retrieved into SAS. When creating the output data set, the KEEP statement is applied.
libname myhadlib hadoop user=testid password=testpass server=hadoopsvr;
data temp;
set myhadlib.employees;
keep salary;
run;
Here is how you can use the KEEP= data set option to retrieve only the SALARY column.
data temp;
set myhadlib.employees(keep=salary);
run;