Overview
pairs_to_parquet is a .parquet extension of the .pairs file format.
The main purpose of this extension is to leverage the row groups and metadata features of the Parquet format in order to:
speed up data selection, filtering and sorting
address a minor limitation of the .pairs format, where metadata cannot be easily parsed by generic CSV readers
reduce storage space required for the data
improve I/O performance
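To illustrate the metadata limitation mentioned above, here is a minimal sketch of the kind of custom header parsing a .pairs file needs before a generic CSV reader can be used. It assumes header lines start with `#` and mostly follow a `#key: value` layout; the function name `parse_pairs_header`, the `format_line` key, and the sample data are hypothetical and not part of pairs_to_parquet.

```python
import io
from collections import defaultdict


def parse_pairs_header(lines):
    """Collect '#key: value' header lines into a dict of lists.

    Keys may repeat (e.g. one 'chromsize' line per chromosome),
    so every key maps to a list of values.
    """
    header = defaultdict(list)
    for line in lines:
        if not line.startswith("#"):
            break  # first data row reached: the header is over
        text = line.lstrip("#").strip()
        key, sep, value = text.partition(":")
        if sep:
            header[key.strip()].append(value.strip())
        else:
            # Lines without a colon, such as the format declaration,
            # are kept under a hypothetical 'format_line' key.
            header["format_line"].append(text)
    return dict(header)


# Hypothetical sample data, for illustration only.
EXAMPLE = """\
## pairs format v1.0
#sorted: chr1-chr2-pos1-pos2
#chromsize: chr1 248956422
#chromsize: chr2 242193529
#columns: readID chr1 pos1 chr2 pos2 strand1 strand2
read1\tchr1\t10000\tchr1\t20000\t+\t-
"""

header = parse_pairs_header(io.StringIO(EXAMPLE))
columns = header["columns"][0].split()
print(columns)
```

A generic CSV reader would need this kind of pre-pass to recover the column names and the rest of the header. Keeping the same information in the Parquet file's own metadata makes that step unnecessary.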
The main challenge was finding a tool that can store additional metadata in a .parquet file. DuckDB turned out to be well suited for this, in particular because it can store key-value metadata. Here is how the process is organised:
Contents: