When Parquet Columns Get Too Big
This article is for engineers who use Apache Parquet for data exchange and don’t want nasty surprises.
Actually the above title is wrong. It should really read “When integer overflow bugs occur”, but we’ll come to that. This post is about the Parquet file structure, what’s special – although not unique – about it, how Python/Pandas users of the format out there may want to tweak some parameters, and how Vortexa is helping open-source.
What is Parquet?
Apache Parquet is a columnar file format. Common files we are used to, such as text files, CSV etc. store information one row (or record if you will) at a time. This is easy to conceptualize, and has some advantages, but there are also advantages to pivoting this model and storing data column by column. To explain them, we need to dive into the structure first.
A Parquet file is structured thus (with some simplification):
- The file ends with a footer, containing index data for where other data can be found within the file.
- The file is split into row groups, which as you might expect contain groups of rows.
- The row group contains information about each column for a set of rows.
- Each column chunk contains metadata and dictionary lookup data, along with the data for a single column across that row group’s rows, stored contiguously.
- The column data itself is split into pages. Pages are indivisible i.e. must be fully decoded if they are accessed, and can have compression applied.
The above diagram shows how this is nested, with details on subsequent row groups, columns etc. omitted.
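As a rough sketch of how that hierarchy is exposed in practice (assuming pyarrow and a local file named file.parquet, both purely for illustration), the footer metadata can be inspected without reading any column data:

import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")   # opening only parses the footer
meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

rg = meta.row_group(0)                # first row group
print(rg.num_rows, rg.total_byte_size)

col = rg.column(0)                    # first column chunk in that row group
print(col.path_in_schema, col.compression, col.total_compressed_size)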
So why go to all this trouble?
- Data for a given column is stored together – it is likely many values are similar, and this structure lends itself to much better data compression.
- When reading a file, only the columns which interest us need to be read and decoded.
- Columns contain metadata with statistics on their contents – we can skip data which we know doesn’t interest us – known as predicate push-down.
All this means less IO and faster data reads, at the expense of more complex data writes.
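To make that concrete, here is a small sketch of column pruning and predicate push-down from Python (the file name and column names are hypothetical):

import pandas as pd
import pyarrow.parquet as pq

# only the requested column chunks are read and decoded
df = pd.read_parquet("trades.parquet", columns=["vessel_id", "quantity"])

# row groups whose min/max statistics rule out the predicate can be skipped
table = pq.read_table("trades.parquet", filters=[("quantity", ">", 1000)])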
Parquet is part of the Hadoop ecosystem, and tools such as Hive and Presto can efficiently read vast quantities of data using it at scale.
The adventure began in production
Among the myriad of moving parts behind the scenes at Vortexa, we have a Python process which produces about 6GB of output in a Parquet file, and a downstream Java process which consumes it.
One day it stopped consuming the file and started blowing up instead, with an ArrayList constructor complaining about a negative size parameter. Why?
Well, what’s one of the first things an engineer does these days when faced with an obscure error? Google it! I came across this Jira bug raised against the Parquet library itself. A “major” bug outstanding since August 2019. Hmm.
A two-pronged approach
I decided to do two things in parallel:
- Look for a work-around to fix production – quickly!
- Attempt to address the bug itself, contributing back to the Apache Parquet project
The work-around
I suspected 32-bit signed integer overflow. The file contained about 67 million records, a lot less than 2³¹ ~= 2.1 billion, but perhaps something else was wrong.
I fetched “parquet-tools” using brew (I use a Mac), and ran the “meta” command against the file. I can’t share the precise data I saw for our production file, but I saw this – 98% of the data was in the first row group. I could see the size of this row group was over 6GB, well over what I thought (incorrectly) was a 2³¹ bytes limit, given the description in the Jira bug.
The solution was, when writing the file in Python using Pandas, to specify a row group size argument of a million records, thus:
data_frame.to_parquet("file.parquet", engine="pyarrow", row_group_size=1000000)
Adding that parameter added perhaps 5% overhead to the file size, but every inner structure of the file now fitted comfortably within 2GB and the overflow problem went away.
It seems that by default Pandas doesn’t split the output into multiple row groups at all, which can harm performance when using Hive, Presto and other technologies. Row groups, especially when combined with intelligent sorting, can effectively shard the data.
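For example (a sketch – “timestamp” stands in for whichever column your queries usually filter on), sorting before writing keeps each row group’s min/max statistics narrow, so readers can skip whole groups:

data_frame.sort_values("timestamp").to_parquet(
    "file.parquet", engine="pyarrow", row_group_size=1000000
)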
The Bug Fix
I downloaded the Apache library source and forked it. I had a local internal project from which I could recreate the problem with the old “broken” 6GB production file, and I modified it to pull the Parquet library from this local fork instead of from Maven.
After poking around in the debugger, tracing back where things failed, I made some discoveries:
- Row group sizes are not directly causing the overflow issue.
- Column sizes within a row group are causing the problem.
- String columns usually compress well and have dictionary lookups applied to them (Parquet does this), but we had one column containing SHA hashes with very high entropy which could not compress. In one row group, this one column was over 4GB in size.
- The Parquet specification does not limit these data structures to 2GB (2³¹ bytes) or even 4GB (2³² bytes) in size. The Python/Pandas output may not be efficient when used with certain tools, but it was not wrong.
Thus the library was at fault for not being able to read this valid, albeit not well-structured, file coming out of Python/Pandas.
Some more experimentation showed that having a large (over 2GB) column did not break the Java library as such, but if there was a subsequent row group after that data, reading that later data would break. Debugging, I found a 64-bit signed value being read from the Parquet file as a file offset, then cast to a 32-bit signed value, truncating anything over 2³¹.
We effectively had a positive number needing 32 bits (or more) forced into a 32-bit signed type, which made our number negative – integer overflow. This led to ArrayList’s constructor being called with a negative size value, and the exception which unravelled it all.
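As a toy illustration (not the library’s actual code) of what that cast does, anything above 2³¹ − 1 flips negative once it is squeezed into a signed 32-bit value:

def as_signed_int32(offset):
    offset &= 0xFFFFFFFF              # keep only the low 32 bits
    return offset - 2**32 if offset >= 2**31 else offset

print(as_signed_int32(3_000_000_000))  # -1294967296, a "negative" size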
You can see in this pull request (approved at the time of writing) the changes made to fix and unit-test this overflow issue.
A challenge has been creating an adequate unit test, because the library cannot write a file which reproduces the issue – we hit another 32-bit overflow during file writing! Arguably that is less serious; the more serious issue is that the library cannot read a file which is technically correct. We also need approximately 3GB of Java heap to reproduce the issue, which is too heavy for unit tests in many CI pipelines.
Here is the parquet-tools meta output for a file I created to reproduce the problem. Nearly 38 million rows in the first row group, pushing a string column full of random (incompressible) data to approx 2.1GB in size, plus 100,000 remaining rows in a second row group to ensure a reader would base a file offset on the preceding data, triggering the bug:
Here RC = Row Count and TS = Total Size. It is row group 1’s “string” column total size of 2229232586 bytes which causes the issue, followed by some data in row group 2.
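For anyone wanting to build a similar file, here is a scaled-down sketch of the approach using pyarrow (row counts are deliberately tiny so it runs quickly; the real repro needs the roughly 38 million rows described above):

import os
import pyarrow as pa
import pyarrow.parquet as pq

def random_hex_strings(n, length=56):
    # high-entropy strings: dictionary encoding and compression gain almost nothing
    return [os.urandom(length // 2).hex() for _ in range(n)]

schema = pa.schema([("string", pa.string())])
with pq.ParquetWriter("repro.parquet", schema) as writer:
    # large first row group -- the real repro used ~38 million rows (~2.1GB)
    writer.write_table(pa.table({"string": random_hex_strings(1_000_000)}, schema=schema))
    # small second row group, so a reader must compute offsets past the first one
    writer.write_table(pa.table({"string": random_hex_strings(100_000)}, schema=schema))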
Conclusion
Vortexa makes heavy use of a wide range of technologies. Modern systems are very complex, and whenever we find a way to fix issues and contribute back to the open-source community, we will.
We are hiring – come join an amazing team on an amazing mission.