Bulk export for YCQL

YSQL
YCQL

Bulk export is available for YugabyteDB's Cassandra-compatible YCQL API. To export data from a YugabyteDB or an Apache Cassandra table, you can use the cassandra-unloader tool.

A typical workflow is to start with creating a source YugabyteDB table and populate it with data, then exporting the data using the cassandra-unloader tool.

Create source table

The following example represents the schema of the destination YugabyteDB table:

CREATE KEYSPACE load;
USE load;

CREATE TABLE users(
  user_id varchar,
  score1 double,
  score2 double,
  points int,
  object_id varchar,
   PRIMARY KEY (user_id));

Generate sample data

You can generate data by executing a Python script, as per the following example:

# sample usage:
#  To generate a 10GB (10240 MB) file.
#  % python gen_csv.py <outfile_name> <outfile_size_MB>
#  % python gen_csv.py file01.csv 10240
#
import numpy as np
import uuid
import csv
import os
import sys
outfile    = sys.argv[1] # output file name
outsize_mb = int(sys.argv[2])
print("Outfile = " + outfile)
print("Outfile Size (MB) = " + str(outsize_mb))
chunksize = 10000
with open(outfile, 'ab') as csvfile:
    while (os.path.getsize(outfile)//1024**2) < outsize_mb:
        data = [[uuid.uuid4() for i in range(chunksize)],
                np.random.random(chunksize)*1000,
                np.random.random(chunksize)*50,
                np.random.randint(1000000, size=(chunksize,)),
                [uuid.uuid4() for i in range(chunksize)]]
        csvfile.writelines(['%s,%.6f,%.6f,%i,%s\n' % row for row in zip(*data)])

The following shows sample rows generated by the script:

head file00.csv

3399bebc-d2cc-40c6-89d4-26102e08ff61,622.491927,40.262305,658257,44d73f8c-1d3c-424e-8fd2-d316c56b8454
4f362eac-f79f-45f6-b6b1-bd5a81f931dc,141.344278,3.024717,694290,7768b010-8411-490a-b523-88cc3ec53cb5
a24a6587-eea4-4907-ac7f-9f99dcac8f82,345.110599,3.869150,510943,5765d1d3-2855-4dbe-9f11-bb3b8631789f
...

To generate five CSV files of approximately 5 GB each, run the following commands:

python ./gen_csv.py file00.csv 5120 &
python ./gen_csv.py file01.csv 5120 &
python ./gen_csv.py file02.csv 5120 &
python ./gen_csv.py file03.csv 5120 &
python ./gen_csv.py file04.csv 5120 &

Load sample data

cassandra-loader is a general-purpose bulk loader for CQL that supports various types of delimited files, particularly CSV files. For details, review the README file of the YugabyteDB cassandra-loader fork. Note that cassandra-loader requires quotes for collection types (for example, "[1,2,3]" rather than [1,2,3] for lists).

You can install cassandra-loader as follows:

wget https://github.com/yugabyte/cassandra-loader/releases/download/<latest-version>/cassandra-loader

chmod a+x cassandra-loader

You can run cassandra-loader and queue up the files for upload one at a time, as follows:

./cassandra-loader \
    -schema "load.users(user_id, score1, score2, points, object_id)" \
    -boolStyle 1_0 \
    -numFutures 1000 \
    -rate 10000 \
    -queryTimeout 65 \
    -numRetries 10 \
    -progressRate 200000 \
    -host <clusterNodeIP> \
    -f file01.csv

For additional options, refer to cassandra-loader options.

Export data

You can install cassandra-unloader as follows:

wget https://github.com/brianmhess/cassandra-loader/releases/download/<latest-version>/cassandra-unloader

chmod a+x cassandra-unloader

You can run cassandra-unloader as follows:

./cassandra-unloader \
   -schema "load.users(user_id, score1, score2, points, object_id)" \
   -boolStyle 1_0 \
   -host <clusterNodeIP> \
   -f outfile.csv

For additional options, refer to cassandra-unloader options.

Best practices

Be sure to always specify the time zone, as it is not added to the default timestamp formats when using the cassandra-loader and cassandra-unloader utilities. Timestamps must be exported and imported in the same format, including the time zone. For example, yyyy-MM-dd HH:mm:ss.SSSZ and yyyy-MM-dd HH:mm:ss.SSSXXX.

It is recommended to use tab character as delimiter on JSONB columns, as the default delimiter, comma ( , ) does not work with these type of columns. For example, -delim $'\t'.