Bulk loading provides high-performance access to external data sources. Multiple Greenplum instances read data in parallel, which enhances performance.
Bulk loading is the fastest way to insert large numbers of rows into Greenplum tables. You can also use bulk loading to execute high-performance SQL queries against external data sources, without first loading those data sources into a Greenplum database. These fast SQL queries let you optimize extraction, transformation, and loading tasks that are common in data warehousing.
Two types of external data sources, external tables and web tables, have different access methods. External tables contain static data that can be scanned multiple times. The data does not change during queries. Web tables provide access to dynamic data sources as if those sources were regular database tables. Web tables cannot be scanned multiple times. The data can change during the course of a query.
You must specify BULKLOAD=YES to use the bulk-load facility.
The following sections show you how to access external tables and web tables using the bulk-load facility.
Use these protocols to access (static) external tables.
To use the gpfdist:// protocol, install and configure the gpfdist (Greenplum file distribution) program on the host that stores the external tables, see Configuring the File Server. The gpfdist utility serves external tables in parallel to the primary Greenplum database segments. The gpfdist:// protocol is advantageous because it ensures that all Greenplum database segments are used during the loading of external tables.
To specify files to gpfdist, use the BL_DATAFILE= data set option. Specify file paths that are relative to the directory from which gpfdist is serving files (the directory where you executed gpfdist).
The gpfdist utility is part of the Greenplum Clients package for the platform where SAS is running. You can also download it from VMware Tanzu Greenplum.
To use the file:// protocol, external tables must reside on a segment host in a location that Greenplum superusers (gpadmin) can access. The segment host name must match the host name, as specified in the gp_configuration system catalog table. In other words, the external tables that you want to load must reside on a host that is part of the set of servers that comprise the database configuration. The file:// protocol is advantageous because it does not require configuration.
Follow these steps to configure the gpfdist file server.
The directory path must be relative to the directory in which you execute gpfdist, and it must exist before gpfdist tries to access it.
export GPLOAD_HOME=directory
C:> gpfdist -d %GPLOAD_HOME% -p 8080 -l %GPLOAD_HOME%\gpfdist.log
$ gpfdist -d $GPLOAD_HOME -p 8080 -l $GPLOAD_HOME/gpfdist.log &
You can run multiple instances of gpfdist on the same host as long each instance has a unique port and directory.
If you do not specify GPLOAD_HOME, the value of the BL_DATAFILE= data set option specifies the directory that contains the external tables to be loaded. If BL_DATAFILE is not specified, the current directory is assumed to contain the external tables.
In Windows, to stop an instance of gpfdist, use the Task Manager or close the Command Window that you used to start that instance of gpfdist.
Follow these steps In UNIX to stop an instance of gpfdist.
$ ps ax | grep gpfdist (Linux)
$ kill 3456
Run this command to test connectivity between an instance of gpfdist and a Greenplum database segment.
$ wget http://gpfdist_hostname:port/file-name
You can use the file:// protocol to identify external files for bulk loading with no additional configuration required. However, using the GPLOAD_HOME environment variable is highly recommended. If you do not specify GPLOAD_HOME, the BL_DATAFILE data set option specifies the source directory. The default source directory is the current directory if you do not specify BL_DATAFILE=. The Greenplum server must have access to the source directory.
Use these data set options to access web tables:
Here are the Greenplum bulk-load data set options. For detailed information about these options, see Data Set Options for Relational Databases.
This first example shows how you can use a SAS data set, SASFLT.FLT98, to create and load a Greenplum table, FLIGHTS98.
libname sasflt 'SAS-data-library';
libname mydblib greenplm server=mysrv1_users
db=users user=myusr1 password=mypwd1;
proc sql;
create table net_air.flights98
(BULKLOAD=YES
BL_DATAFILE='c:\temp\greenplum\data.dat'
BL_USE_PIPE=NO
BL_DELETE_DATAFILE=yes
BL_HOST='192.168.x.x'
BL_PORT=8081)
as select * from sasflt.flt98;
quit;
This next example shows how you can append the SAS data set, SASFLT.FLT98, to the existing Greenplum table ALLFLIGHTS. The BL_USE_PIPE=NO option forces SAS/ACCESS Interface to Greenplum to write data to a flat file, as specified in the BL_DATAFILE= option. Rather than deleting the data file, BL_DELETE_DATAFILE=NO causes the engine to leave it after the load has completed.
proc append base=new_air.flights98
(BULKLOAD=YES
BL_DATAFILE='c:\temp\greenplum\data.dat'
BL_USE_PIPE=NO
BL_DELETE_DATAFILE=no
BL_HOST='192.168.x.x'
BL_PORT=8081)
data=sasflt.flt98;
run;