Overview

pairs_to_parquet is a .parquet extension of the .pairs file format.

The main purpose of this extension is to leverage the row groups and metadata features of the Parquet format in order to:

  • speed up data selection, filtering and sorting

  • address a minor limitation of the .pairs format, where metadata cannot be easily parsed by generic CSV readers

  • reduce storage space required for the data

  • improve I/O performance

The main challenge was finding a tool that can store additional metadata in a .parquet file. DuckDB proved well suited to this, in particular because it can write key-value metadata. Here is how the process is organised:

Conversion of the header (.pairs.gz) to key-value metadata (.parquet)
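
As a rough illustration of this step, the sketch below parses the "#"-prefixed header lines of a .pairs file into a flat key-value mapping, using only the Python standard library. The function names (parse_pairs_header, read_pairs_header), the format_version key, and the choice to join repeated keys such as #chromsize with newlines are assumptions made for this example and may differ from what pairs_to_parquet actually does.

```python
import gzip
from typing import Dict, Iterable, List


def parse_pairs_header(lines: Iterable[str]) -> Dict[str, str]:
    """Collect '#key: value' header lines into a string-to-string mapping.

    Hypothetical helper: repeated keys (e.g. several '#chromsize:' lines)
    are joined with newlines so that every value stays a plain string.
    """
    collected: Dict[str, List[str]] = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first data line
        text = line.rstrip("\n").lstrip("#").strip()
        if text.startswith("pairs format"):
            # First header line, e.g. "## pairs format v1.0"
            collected.setdefault("format_version", []).append(text.split()[-1])
            continue
        key, sep, value = text.partition(":")
        if not sep:
            continue  # skip header lines that are not 'key: value'
        collected.setdefault(key.strip(), []).append(value.strip())
    return {key: "\n".join(values) for key, values in collected.items()}


def read_pairs_header(path: str) -> Dict[str, str]:
    """Read the header of a gzip-compressed .pairs file (assumed layout)."""
    with gzip.open(path, "rt") as handle:
        return parse_pairs_header(handle)


# Small in-memory example; the header values here are made up.
example_header = [
    "## pairs format v1.0\n",
    "#sorted: chr1-chr2-pos1-pos2\n",
    "#chromsize: chr1 248956422\n",
    "#chromsize: chr2 242193529\n",
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2\n",
    "read1\tchr1\t100\tchr2\t200\t+\t-\n",
]
metadata = parse_pairs_header(example_header)
```

A mapping like this could then be handed to DuckDB, whose COPY ... TO statement accepts a KV_METADATA option for Parquet output; keeping every value a plain string matches the string-to-string nature of Parquet key-value metadata.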