Running a Loading Job

Clearing and Initializing the Graph Store

There are two aspects to clearing the system: flushing the data and clearing the schema definitions in the catalog. Two different commands are available.

CLEAR GRAPH STORE

Available only to superusers.

The CLEAR GRAPH STORE command flushes all the data out of the graph store (database). By default, the system will ask the user to confirm that you really want to discard all the graph data. To force the clear operation and bypass the confirmation question, use the -HARD option, e.g.,

CLEAR GRAPH STORE -HARD

Clearing the graph store does not affect the schema.

  1. Use the -HARD option with extreme caution. There is no undo option. -HARD must be in all capital letters.

  2. CLEAR GRAPH STORE stops all the TigerGraph servers (GPE, GSE, RESTPP, Kafka, and Zookeeper).

  3. Loading jobs and queries are aborted.

DROP ALL

Available only to superusers.

The DROP ALL statement clears the graph store and removes all definitions from the catalog: vertex types, edge types, graph types, jobs, and queries.

Running a Loading Job

Running a loading job executes a previously installed loading job. The job reads lines from an input source, parses each line into data tokens, and applies loading rules and conditions to create new vertex and edge instances to store in the graph data store. The input sources could be defined in the load job or could be provided when running the job. Additionally, loading jobs can also be run by submitted a HTTP request to the REST++ server.

RUN LOADING JOB

RUN LOADING JOB syntax for concurrent loading
RUN LOADING JOB syntax for concurrent loading
RUN LOADING JOB [-noprint] [-dryrun] [-n [i],j] jobname [
USING filevar [="filepath_string"][, filevar [="filepath_string"]]*
[, CONCURRENCY="cnum"][,BATCH_SIZE="bnum"]
]

Note that the keyword LOADING is included. This makes it more clear to users and to GSQL that the job is a loading job and not some other type of job ( such as a SCHEMA_CHANGE JOB).

When a concurrent loading job is submitted, it is assigned a job ID number, which is displayed on the GSQL console. The user can use this job ID to refer to the job, for a status update, to abort the job, or to re-start the job. These operations are described later in this section.

Options

-noprint

By default, the command will print several several lines of status information while the loading is running. If the -noprint option is included, the job will run omit the progress and summary details, but it will still display the job id and the location of the log file.

Example of minimal output when -noprint option is used
Kick off the following job:
JobName: load_videoE, jobid: gsql_demo_m1.1525091090494
Loading log: '/usr/local/tigergraph/logs/restpp/restpp_loader_logs/gsql_demo/gsql_demo_m1.1525091090494.log'

-dryrun

If -dryrun is used, the system will read the data files and process the data as instructed by the job, but will NOT load any data into the graph. This option can be a useful diagnostic tool.

-n [i], j

The -n option limits the loading job to processing only a range of lines of each input data file. The -n flag accepts one or two arguments. For example, -n 50 means read lines 1 to 50. -n 10, 50 means read lines 10 to 50. The special symbol $ is interpreted as "last line", so -n 10,$ means reads from line 10 to the end.

filevar list

The optional USING clause may contain a list of file variables. Each file variable may optionally be assigned a filepath_string , obeying the same format as in the CREATE LOADING JOB. This list of file variables determines which parts of a loading job are run and what data files are used.

  1. When a loading job is compiled, it generates one RESTPP endpoint for each filevar and filepath_string . As a consequence, a loading job can be run in parts. When RUN LOADING JOB is executed, only those endpoints whose filevar or file identifier (" __GSQL_FILENAME_n__") is mentioned in the USING clause will be used. However, if the USING clause is omitted, then the entire loading job will be run.

  2. If a filepath_string is given, it overrides the filepath_string defined in the loading job. If a particular filevar is not assigned a filepath_string either in the loading job or in the RUN LOADING JOB statement, then an error is reported and the job exits.

CONCURRENCY

The CONCURRENCY parameter sets the maximum number of concurrent requests that the loading job may send to the GPE. The default is 256.

BATCH_SIZE

The BATCH_SIZE parameter sets the number of data lines included in each concurrent request sent to the GPE. The default is 1024.

Running Loading Jobs as REST Requests

Another way to run a loading job is to submit an HTTP request to the POST /ddl/<graph_name> endpoint of the REST++ server. Since the REST++ server has more direct access to the graph processing engine, this can execute more quickly than a RUN LOADING JOB statement in GSQL.

When a CREATE LOADING JOB block is executed, the GSQL system creates one REST endpoint for each file source. Therefore, one REST request can invoke loading for one file source at a time. Running an entire loading job may take more than one REST request.

The Linux curl command is a handy way to make HTTP requests. If the data size is small, it can be included directly in the command line by using the -d flag with a data string:

Curl/REST++ syntax for loading using the POST /ddl endpoint
curl -X POST -d "<data_string>" "http://<server_ip>:9000/ddl/<graph_name>?<parameters>?tag=<job_name>&filename=<filepath><&optional_parameters>"

If the data size is large, it is better to reference the data filename, using the --data-binary flag:

Curl/REST++ syntax for loading using the POST /ddl endpoint
curl -X POST --data-binary @<data_filename> "http://<server_ip>:9000/ddl/<graph_name>?tag=<job_name>&filename=<filename_variable><&optional_parameters>"

<filepath> should be replaced with either a file variable (from a DEFINE FILENAME statement) or a position-based file identifier ("__GSQL_FILENAME_n__") for an explicit filepath_string.

For more information, about sending REST++ requests, see the RESTPP API User Guide .

Example : The code block below shows three equivalent commands for the same loading job. The first uses the gsql command RUN JOB. The second uses the Linux curl command to support a HTTP request, placing the parameter values in the URL's query string. T he third gives the parameter values through the curl command's data payload -d option.

REST++ ddl loading examples
# Case 1: Using GSQL
GSQL -g gsql_demo RUN LOADING JOB load_cf USING FILENAME="../cf/data/cf_data.csv"
# Case 2: Using REST++ Request with data in a file, where file1 is one of the filename variables defined in the loading job
curl -X POST --data-binary @data/cf_data.csv "http://localhost:9000/ddl/gsql_demo?&tag=load_cf&filename=file1&sep=,&eol=\n"
# Case 3: Using REST++ Request with data inline, where file1 is one of the filename variables defined in the loading job
curl -X POST -d
"id2,id1\nid2,id3\nid3,id1\nid3,id4\nid5,id1\nid5,id2\nid5,id4" "http://localhost:9000/ddl/gsql_demo?tag=load_cf&filename=file1&sep=,&eol=\n"

Inspecting and Managing Loading Jobs

Starting with v2.0, there are now commands to checking loading job status, to abort a loading job and to restart a loading job.

Job ID and Status

When a loading job starts, the GSQL server assigns it a job id and displays it for the user to see. The job id format is typically the name of the graph, followed by the machine alias, following by a code number, e.g., gsql_demo_m1.1525091090494

Example of SHOW LOADING STATUS output
Kick off the following job, i.e.
JobName: load_test1, jobid: demo_graph_m1.1523663024967
Loading log: '/home/tigergraph/tigergraph/logs/restpp/restpp_loader_logs/demo_graph/demo_graph_m1.1523663024967.log'
Job "demo_graph_m1.1523663024967" loading status
[RUNNING] m1 ( Finished: 3 / Total: 4 )
[LOADING] /data/output/company.data
[============= ] 20%, 200 kl/s
[LOADED]
+-------------------------------------------------------------------+
| FILENAME | LOADED LINES | AVG SPEED | DURATION|
| /data/output/movie.dat | 100 | 100 l/s | 1.00 s|
|/data/output/person.dat | 100 | 100 l/s | 1.00 s|
| /data/output/roles.dat | 200 | 200 l/s | 1.00 s|
+-------------------------------------------------------------------+
[RUNNING] m2 ( Finished: 1 / Total: 2 )
[LOADING] /data/output/company.data
[========================== ] 60%, 200 kl/s
[LOADED]
+-------------------------------------------------------------------+
| FILENAME | LOADED LINES | AVG SPEED | DURATION|
| /data/output/movie.dat | 100 | 100 l/s | 1.00 s|
+-------------------------------------------------------------------+

By default, an active loading job will display periodic updates of its progress. There are two ways to inhibit these automatic output displays:

  1. Run the loading job with the -noprint option.

  2. After the loading job has started, enter CTRL+C. This will abort the output display process, but the loading job will continue.

SHOW LOADING STATUS

The command SHOW LOADING JOB shows the current status of either a specified loading job or all current jobs:

SHOW LOADING JOB syntax
SHOW LOADING STATUS job_id|ALL

The display format is the same as that displayed during the periodic progress updates of the RUN LOADING JOB command. If you do not know the job id, but you know the job name and possibly the machine, then the ALL option is a handy way to see a list of active job ids.

ABORT LOADING JOB

The command ABORT LOADING JOB aborts either a specified load job or all active loading jobs:

ABORT LOADING JOB syntax
ABORT LOADING JOB job_id|ALL

The output will show a summary of aborted loading jobs.

ABORT LOADING JOB example
gsql -g demo_graph "abort loading job all"
Job "demo_graph_m1.1519111662589" loading status
[ABORT_SUCCESS] m1
[SUMMARY] Finished: 0 / Total: 2
+--------------------------------------------------------------------------------------+
| FILENAME | LOADED LINES | AVG SPEED | DURATION | PERCENTAGE|
| /home/tigergraph/data.csv | 23901701 | 174 kl/s | 136.83 s | 65 %|
|/home/tigergraph/data1.csv | 0 | 0 l/s | 0.00 s | 0 %|
+--------------------------------------------------------------------------------------+
Job "demo_graph_m2.1519111662615" loading status
[ABORT_SUCCESS] m2
[SUMMARY] Finished: 0 / Total: 2
+--------------------------------------------------------------------------------------+
| FILENAME | LOADED LINES | AVG SPEED | DURATION | PERCENTAGE|
| /home/tigergraph/data.csv | 23860559 | 175 kl/s | 136.23 s | 65 %|
|/home/tigergraph/data1.csv | 0 | 0 l/s | 0.00 s | 0 %|
+--------------------------------------------------------------------------------------+

RESUME LOADING JOB

The command RESUME LOADING JOB will restart a previously-run job which ended for some reason before completion.

RESUME LOADING JOB syntax
RESUME LOADING JOB job_id

If the job is finished, this command will do nothing. The RESUME command should pick up where the previous run ended; that is, it should not load the same data twice.

RESUME LOADING JOB example
gsql -g demo_graph "RESUME LOADING JOB demo_graph_m1.1519111662589"
[RESUME_SUCCESS] m1
[MESSAGE] The current job got resummed

Verifying and Debugging a Loading Job

Every loading job creates a log file. When the job starts, it will display the location of the log file. Typically, the file is located at

<TigerGraph.root.dir>/logs/restpp/restpp_loader_logs/<graph_name>/<job_id>.log

This file contains the following information which most users will find useful:

  • A list of all the parameter and option settings for the loading job

  • A copy of the status information that is printed

  • Statistics report on the number of lines successfully read and parsed

The statistics report include how many objects of each type is created, and how many lines are invalid due to different reasons. This report also shows which lines cause the errors. Here is the list of statistics shown in the report. There are two types of statistics. One is file level (the number of lines), and the other is data object level (the number of objects). If an file level error occurs, e.g., a line does not have enough columns, this line of data is skipped for all LOAD statements in this loading job. If an object level error or failed condition occurs, only the corresponding object is not created, i.e., all other objects in the same loading job are still created if no object level error or failed condition for each corresponding object.

File level statistics

Explanation

Valid lines

The number of valid lines in the source file

Reject lines

The number of lines which are rejected by reject_line_rules

Invalid Json format

The number of lines with invalid JSON format

Not enough token

The number of lines with missing column(s)

Oversize token

The number of lines with oversize token(s). Please increase "OutputTokenBufferSize" in the tigergraph/dev/gdk/gsql/config file.

Object level statistics

Explanation

Valid Object

The number of objects which have been loaded successfully

No ID found

The number of objects in which PRIMARY_ID is empty

Invalid Attributes

The number of invalid objects caused by wrong data format for the attribute type

Invalid primary id

The number of invalid objects caused by wrong data format for the PRIMARY_ID type

incorrect fixed binary length

The number of invalid objects caused by the mismatch of the length of the data to the type defined in the schema

Note that failing a WHERE clause is not necessarily a bad result. If the user's intent for the WHERE clause is to select only certain lines, then it is natural for some lines to pass and some lines to fail.

Below is an example.

CREATE VERTEX movie (PRIMARY_ID id UINT, title STRING, country STRING COMPRESS, year UINT)
CREATE DIRECTED EDGE sequel_of (FROM movie, TO movie)
CREATE GRAPH movie_graph(*)
CREATE LOADING JOB load_movie FOR GRAPH movie_graph{
DEFINE FILENAME f
LOAD f TO VERTEX movie VALUES ($0, $1, $2, $3) WHERE to_int($3) < 2000;
}
RUN LOADING JOB load_movie USING f="movie.dat"
movie.dat
0,abc,USA,-1990
1,abc,CHN,1990
2,abc,CHN,1990
3,abc,FRA,2015
4,abc,FRA,2005
5,abc,USA,1990
6,abc,1990

The above loading job and data generate the following report

load_output.log (tail)
--------------------Statistics------------------------------
Valid lines: 6
Reject lines: 0
Invalid Json format: 0
Not enough token: 1 [ERROR] (e.g. 7)
Oversize token: 0
Vertex: movie
Valid Object: 3
No ID found: 0
Invalid Attributes: 1 [ERROR] (e.g. 1:year)
Invalid primary id: 0
Incorrect fixed
binary length: 0
Passed condition lines: 4
Failed condition lines: 2 (e.g. 4,5)

There are a total of 7 data lines. The report shows that

  • Six of the lines are valid data lines

  • One line (Line 7) does not have enough tokens.

Of the 6 valid lines,

  • Three of the 6 valid lines generate valid movie vertices.

  • One line has an invalid attribute (Line 1: year)

  • Two lines (Lines 4 and 5) do not pass the WHERE clause.