Overview

This version of a COPY FROM statement, allows you to specify the imported file format and currently three types are supported.

With each file type there will be differences in performance and behavior

Here’s a list of supported formats:

  • CSV (comma-separated values): simple columnar text format
  • ORC (optimized row columnar): columnar storage format developed by Apache
  • Parquet (Apache Parquet): columnar data storage used in Apache Hadoop ecosystem
Each query without a specified file format assumes to be importing a CSV file (There is no format detection in place)

Syntax

In order to sepcify the file format using the COPY FROM statement you can use the following syntax:

COPY tablename FROM 'file_path' (FORMAT format_name);
Format name is case insensitive

Example

When copying from the CSV, ORC or Parquet files as the first step, you need to create a destination table:

CREATE TABLE cab_types (id bigint, cab_type text);

Once that is done, you can copy the file content into the table, in the following way:

  • ORC
copy cab_types from 's3://oxla-testdata/cab_types.orc' (format orc);
  • Parquet
copy cab_types from 's3://oxla-testdata/cab_types.parquet' (format parquet);
Copying from parquet files is memory consuming. Files bigger than a few gigabytes might result in out of memory error
  • CSV
copy cab_types from 's3://oxla-testdata/cab_types.csv' (format csv);
  • CSV with Specified Delimiter
copy cab_types from 's3://oxla-testdata/cab_types.csv' (format csv, delimiter ':');

Differences in Behavior

  • Ignored Options

    • HEADER, DELIMITER, NULL options are ignored not affecting the execution of the queries for formats different than CSV.
  • Null Values Handling

    • All ORC files have nullable columns. In order to import a nullable column to an Oxla column, which are described as NOT NULL, the column in the ORC file cannot contain a null value, otherwise the request will be terminated.
    • For Parquet files, inserting a nullable column to a non nullable column is allowed as long as there are no null values in the source column.
  • Column Matching

    • ORC and Parquet files are only being matched based on column index, while CSVs can be matched both with names or indexes. For more information regarding that, please refer to our COPY FROM with HEADER doc