Bulk Loading for Greenplum

Overview

Bulk loading provides high-performance access to external data sources. Multiple Greenplum instances read data in parallel, which enhances performance.

Bulk loading is the fastest way to insert large numbers of rows into Greenplum tables. You can also use bulk loading to execute high-performance SQL queries against external data sources, without first loading those data sources into a Greenplum database. These fast SQL queries let you optimize extraction, transformation, and loading tasks that are common in data warehousing.

Two types of external data sources, external tables and web tables, have different access methods. External tables contain static data that can be scanned multiple times. The data does not change during queries. Web tables provide access to dynamic data sources as if those sources were regular database tables. Web tables cannot be scanned multiple times. The data can change during the course of a query.

You must specify BULKLOAD=YES to use the bulk-load facility.

The following sections show you how to access external tables and web tables using the bulk-load facility.

Using Protocols to Access External Tables

Use these protocols to access (static) external tables.

gpfdist://

To use the gpfdist:// protocol, install and configure the gpfdist (Greenplum file distribution) program on the host that stores the external tables, see Configuring the File Server. The gpfdist utility serves external tables in parallel to the primary Greenplum database segments. The gpfdist:// protocol is advantageous because it ensures that all Greenplum database segments are used during the loading of external tables.

To specify files to gpfdist, use the BL_DATAFILE= data set option. Specify file paths that are relative to the directory from which gpfdist is serving files (the directory where you executed gpfdist).

The gpfdist utility is part of the Greenplum Clients package for the platform where SAS is running. You can also download it from VMware Tanzu Greenplum.

file://

To use the file:// protocol, external tables must reside on a segment host in a location that Greenplum superusers (gpadmin) can access. The segment host name must match the host name, as specified in the gp_configuration system catalog table. In other words, the external tables that you want to load must reside on a host that is part of the set of servers that comprise the database configuration. The file:// protocol is advantageous because it does not require configuration.

Configuring the File Server

Follow these steps to configure the gpfdist file server.

  1. Download and install gpfdist from the Greenplum Clients package at VMware Tanzu Greenplum.
  2. Specify and load a new environment variable called GPLOAD_HOME.
  3. Set the value of the variable to the directory that contains the external tables that you want to load.

    The directory path must be relative to the directory in which you execute gpfdist, and it must exist before gpfdist tries to access it.

    • For Windows, open My Computer, select the Advanced tab, and click the Environment Variables button.
    • For UNIX, enter this command or add it to your profile:
      export GPLOAD_HOME=directory
  4. Start gpfdist as shown in these examples.
    • For Windows:
      C:> gpfdist -d %GPLOAD_HOME% -p 8080 -l %GPLOAD_HOME%\gpfdist.log
    • For UNIX:
      $ gpfdist -d $GPLOAD_HOME -p 8080 -l $GPLOAD_HOME/gpfdist.log &

You can run multiple instances of gpfdist on the same host as long each instance has a unique port and directory.

If you do not specify GPLOAD_HOME, the value of the BL_DATAFILE= data set option specifies the directory that contains the external tables to be loaded. If BL_DATAFILE is not specified, the current directory is assumed to contain the external tables.

Stopping gpfdist

In Windows, to stop an instance of gpfdist, use the Task Manager or close the Command Window that you used to start that instance of gpfdist.

Follow these steps In UNIX to stop an instance of gpfdist.

  1. Find the process ID:
    $ ps ax | grep gpfdist (Linux)
  2. Kill the process. Here is an example:
    $ kill 3456

Troubleshooting gpfdist

Run this command to test connectivity between an instance of gpfdist and a Greenplum database segment.

$ wget http://gpfdist_hostname:port/file-name

Using the file:// Protocol

You can use the file:// protocol to identify external files for bulk loading with no additional configuration required. However, using the GPLOAD_HOME environment variable is highly recommended. If you do not specify GPLOAD_HOME, the BL_DATAFILE data set option specifies the source directory. The default source directory is the current directory if you do not specify BL_DATAFILE=. The Greenplum server must have access to the source directory.

Accessing Dynamic Data in Web Tables

Use these data set options to access web tables:

Data Set Options for Bulk Loading

Here are the Greenplum bulk-load data set options. For detailed information about these options, see Data Set Options for Relational Databases.

Bulk Loading Examples

This first example shows how you can use a SAS data set, SASFLT.FLT98, to create and load a Greenplum table, FLIGHTS98.

libname sasflt 'SAS-data-library';
libname mydblib greenplm server=mysrv1_users
    db=users user=myusr1 password=mypwd1;

proc sql;
create table net_air.flights98
    (BULKLOAD=YES
     BL_DATAFILE='c:\temp\greenplum\data.dat'
     BL_USE_PIPE=NO
     BL_DELETE_DATAFILE=yes
     BL_HOST='192.168.x.x'
     BL_PORT=8081)
    as select * from sasflt.flt98;
quit;

This next example shows how you can append the SAS data set, SASFLT.FLT98, to the existing Greenplum table ALLFLIGHTS. The BL_USE_PIPE=NO option forces SAS/ACCESS Interface to Greenplum to write data to a flat file, as specified in the BL_DATAFILE= option. Rather than deleting the data file, BL_DELETE_DATAFILE=NO causes the engine to leave it after the load has completed.

proc append base=new_air.flights98
    (BULKLOAD=YES
     BL_DATAFILE='c:\temp\greenplum\data.dat'
     BL_USE_PIPE=NO
     BL_DELETE_DATAFILE=no
     BL_HOST='192.168.x.x'
     BL_PORT=8081)
    data=sasflt.flt98;
run;
Last updated: February 3, 2026