Parquet is the new TSV

Larger-than-memory data access with .parquet files

Categories: bioinformatics, database
Author: Fabrício Almeida-Silva
Published: June 5, 2025

Motivation

For programming languages that store data in memory (e.g., R and Python), working with large data files can be a problem. For example, if you need to subset a few rows from a large table stored in a 5 GB TSV/CSV file, you first need to read the entire table and then subset the rows you want. If your data is larger than your memory capacity, reading the entire data set is simply not possible. Apache Parquet is a column-oriented file format (unlike row-oriented formats such as TSV/CSV) designed for efficient data storage and retrieval, and it can be used with packages such as arrow to analyze larger-than-memory data sets. Here, I will demonstrate the advantages of storing large data in Parquet files (compared to TSV/CSV files), and benchmark data retrieval using Parquet files and other alternatives.

# Load required packages
library(tidyverse)
library(arrow)
library(duckdb)

Example data

PLAZA is a database for plant comparative genomics data. Among many important features and data resources, PLAZA provides orthologous relationships for plant genes using different ‘orthology types’. Here, I will use orthologous genes obtained with the best-hits-and-inparalogs (BHI) type. This is a large (3 GB) CSV file containing the orthologs for all genes in all species in PLAZA Dicots 5.0. For example, if you have a gene of interest in Arabidopsis thaliana, you can use this file to find the corresponding (orthologous) gene(s) in other plants for comparative studies.

Let’s first download the file to a temporary directory.

# Download file
options(timeout = 1e6) # download might take a while

csv_file <- file.path(tempdir(), "plaza_dicots_bhif.csv.gz")
download.file(
    url = "https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/IntegrativeOrthology/integrative_orthology.BHIF.csv.gz",
    destfile = csv_file
)

From CSV to Parquet

To create a Parquet file, we could read the CSV file into the R session and export it as a .parquet file using the arrow or nanoparquet packages. This works well if you have access to an HPC or a powerful server with enough memory. If you don’t, don’t panic: you can use the duckdb package to create a Parquet file from a CSV file without loading it into memory first. This can be achieved with the following code:

# Directly convert CSV to Parquet using {duckdb}
con <- dbConnect(duckdb())
parquet_file <- file.path(tempdir(), "plaza_dicots_bhif.parquet")

# DuckDB streams the gzipped CSV and auto-detects its dialect,
# so the table is never loaded into R's memory
query <- paste0("
  COPY (SELECT * FROM '", csv_file, "')
  TO '", parquet_file, "' (FORMAT PARQUET)
")
dbExecute(con, query)
dbDisconnect(con)
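
If your machine does have enough memory, the read-then-write approach mentioned above is simpler. Below is a minimal sketch using readr and arrow, assuming the whole table fits in RAM (the object name tab is just illustrative):

# In-memory alternative: read the table, then export it as Parquet
# Note: despite the .csv.gz extension, this PLAZA file is tab-separated
# (see the note in the Benchmark section below)
tab <- read_tsv(csv_file, show_col_types = FALSE)
arrow::write_parquet(tab, parquet_file)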

In terms of storage, the Parquet file is comparable in size to the gzipped CSV file (here, it is actually slightly larger).

# Inspect file sizes
data.frame(
    Format = c("CSV", "Parquet"),
    Size = fs::file_size(c(csv_file, parquet_file)),
    row.names = NULL
)
   Format  Size
1     CSV 3.03G
2 Parquet 3.67G

Larger-than-memory data access with {arrow}

Once you have tabular data in a Parquet file, you can use the arrow package to ‘connect’ to the file, perform some data transformation (e.g., filter rows, subset columns, summarize data by groups, etc.) using tidyverse verbs, and read only the output of that transformation into your R session. If you’re familiar with SQL, this is similar to performing SQL queries without loading the data into memory.

To demonstrate how this works, we will extract orthologs (i.e., best-hits-and-inparalogs) of the gene AT2G14610, which encodes a pathogenesis-related protein 1 (PR-1) in Arabidopsis thaliana.

# Connect to the Parquet file
bhi <- arrow::open_dataset(parquet_file)
bhi
FileSystemDataset with 1 Parquet file
4 columns
#query_gene: string
query_species: string
orthologous_gene: string
orthologous_species: string

Once we connect to the Parquet file, we can see that it contains a table with four columns named #query_gene, query_species, orthologous_gene, and orthologous_species. Now, we will filter the table to keep only rows that have ‘AT2G14610’ in the #query_gene column, and collect the results into the R session.

# Extract best-hits-and-inparalogs of 'AT2G14610'
pr1_orthologs <- bhi |>
    filter(`#query_gene` == "AT2G14610") |>
    collect()

head(pr1_orthologs)
# A tibble: 6 × 4
  `#query_gene` query_species orthologous_gene               orthologous_species
  <chr>         <chr>         <chr>                          <chr>              
1 AT2G14610     ath           AagrBONN_evm.TU.Sc2ySwM_228.2… aag                
2 AT2G14610     ath           Aa31LG2G16880                  aar                
3 AT2G14610     ath           Aa31LG2G16870                  aar                
4 AT2G14610     ath           Atru.chr5.2212                 acertr             
5 AT2G14610     ath           Actinidia00340                 ach                
6 AT2G14610     ath           AL3G46130                      aly                
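
And if you are more comfortable with SQL, the same filter can be written as a query that DuckDB executes directly against the Parquet file, again without loading the full table into memory. A minimal sketch (the object name pr1_orthologs_sql is just illustrative):

# Run an SQL query directly on the Parquet file with {duckdb}
con <- dbConnect(duckdb())
query <- paste0(
    "SELECT * FROM '", parquet_file, "' ",
    "WHERE \"#query_gene\" = 'AT2G14610'"
)
pr1_orthologs_sql <- dbGetQuery(con, query)
dbDisconnect(con)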

Brilliant, isn’t it? Using Parquet files, we can seamlessly subset a table that is too large to fit in memory, solving the problem of larger-than-memory data analysis.
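
The same lazy approach works for the other dplyr verbs mentioned earlier (select(), group_by(), summarize(), etc.), which arrow translates and executes on the file before collection. As an illustration, counting how many orthologs of this gene each species contributes could look like the sketch below (not run here):

# Count BHI orthologs of 'AT2G14610' per species (illustrative sketch)
pr1_per_species <- bhi |>
    filter(`#query_gene` == "AT2G14610") |>
    group_by(orthologous_species) |>
    summarize(n_orthologs = n()) |>
    arrange(desc(n_orthologs)) |>
    collect()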

Benchmark

Now, you might be asking yourself:

Do I need Parquet files? Why not extract the rows I want from the CSV file using Bash code?

That is indeed possible. In your terminal, you can use Bash code to read the CSV file line by line and search for rows that match our condition (i.e., ‘AT2G14610’ in column #query_gene). The Bash code would look something like this:

zcat file.csv.gz | grep -E '^#query|^AT2G14610' > filtered.csv

Let’s compare the performance of the Bash-based approach with the arrow-based approach. Of note, the .csv.gz file from PLAZA is actually a TSV file, not CSV, but that doesn’t impact performance.

# Wrapper function for the Bash-based approach
# (system2() pastes the arguments into one command line, so the shell
# interprets the pipe and the output redirection)
filter_bash <- function() {
    outfile <- file.path(tempdir(), "output.tsv")
    args <- c(csv_file, " | grep -E '^#query|^AT2G14610' > ", outfile)
    system2("zcat", args = args)
    
    df <- read_tsv(outfile, show_col_types = FALSE) |> as.data.frame()
    return(df)
}

# Benchmark
bnch <- bench::mark(
    arrow = bhi |>
        filter(`#query_gene` == "AT2G14610") |>
        collect() |>
        as.data.frame(),
    bash = filter_bash()
)

bnch
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 arrow        29.8ms   33.8ms   27.2       58.6KB     4.53
2 bash          58.3s    58.3s    0.0171     1.8MB     0   

The benchmark shows that the arrow-based approach (using Parquet files) is much faster than the Bash-based approach (milliseconds versus about a minute!). Hence, developers and maintainers of databases that provide users with large data files should consider offering Parquet files in addition to traditional CSV/TSV files. In the era of machine learning, AI, and large data, I believe this would make data analysis workflows much faster and more efficient.