34 changes: 34 additions & 0 deletions src/components/fundable/descriptions/ParquetNullOptimizations.md
@@ -0,0 +1,34 @@
#### Overview

Apache Parquet is an open source, column-oriented data file format designed for
efficient data storage and retrieval. Together with Apache Arrow for in-memory data,
it has become the de facto standard for efficient columnar analytics.

While Parquet and Arrow are most often used together, they have incompatible physical
representations of data with optional values: data where some values can be
missing or "null". While Arrow uses a validity bitmap for each schema field and nesting level,
Parquet condenses that information in a more sophisticated structure called definition
levels (borrowing ideas from Google's Dremel project).
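
To make the difference concrete for a flat optional column, where the maximum definition level is 1: a level of 1 marks a present value and a level of 0 marks a null. The following is a minimal, self-contained sketch and not Arrow's actual implementation; `ValidityResult` and `DefLevelsToValidityBitmap` are hypothetical names, and the sketch assumes the levels have already been decoded into 16-bit integers. It produces an Arrow-style validity bitmap (least-significant bit first, 1 meaning valid).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch (not an Arrow API).
struct ValidityResult {
  std::vector<uint8_t> bitmap;  // one bit per value, LSB first, 1 = valid
  int64_t null_count = 0;
};

// Convert the definition levels of a flat optional column to a validity bitmap.
// Assumption: the maximum definition level is 1, so 1 = present and 0 = null.
ValidityResult DefLevelsToValidityBitmap(const std::vector<int16_t>& def_levels) {
  ValidityResult result;
  result.bitmap.assign((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    assert(def_levels[i] == 0 || def_levels[i] == 1);
    if (def_levels[i] == 1) {
      // Mark slot i as valid.
      result.bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    } else {
      ++result.null_count;
    }
  }
  return result;
}
```

Every value goes through a 16-bit integer before becoming a single bit, even when the column contains no nulls at all.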

Converting between those two representations is non-trivial and often turns out
to be a performance bottleneck when reading a Parquet file as in-memory Arrow data.
Even columns that in practice do not contain any nulls can still suffer from it if
the data is declared nullable (optional) at the schema level.

We propose to optimize the conversion of null values from Parquet in Arrow C++
for flat (non-nested) data:

1. decoding Parquet definition levels directly into an Arrow validity bitmap, rather than using an
intermediate representation as 16-bit integers;

2. avoiding decoding definition levels entirely when a data page's statistics show
it cannot contain any nulls (or, conversely, when it cannot contain any non-null values).
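
The two optimizations above can be sketched as follows. This is an illustrative sketch under stated assumptions, not Arrow C++ code: `LevelRun`, `PageInfo`, `SetBits` and `BuildValidityBitmap` are hypothetical names, a `null_count` of -1 stands for "statistics unavailable", and the list of runs is a simplified stand-in for Parquet's RLE/bit-packed hybrid encoding of definition levels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a data page (not an Arrow or Parquet API).
struct LevelRun {
  int16_t level;   // 0 = null, 1 = present (flat optional column)
  int64_t length;  // number of consecutive values sharing this level
};

struct PageInfo {
  int64_t num_values = 0;
  int64_t null_count = -1;     // from page statistics; -1 = unknown (assumption)
  std::vector<LevelRun> runs;  // stand-in for the encoded definition levels
};

// Set `length` bits starting at bit `offset` (LSB first within each byte).
void SetBits(std::vector<uint8_t>& bitmap, int64_t offset, int64_t length) {
  for (int64_t i = offset; i < offset + length; ++i) {
    bitmap[static_cast<std::size_t>(i / 8)] |= static_cast<uint8_t>(1u << (i % 8));
  }
}

// Build the validity bitmap for one page. *levels_decoded is set to false
// when the statistics alone were enough and the levels were never read.
std::vector<uint8_t> BuildValidityBitmap(const PageInfo& page, bool* levels_decoded) {
  std::vector<uint8_t> bitmap(static_cast<std::size_t>((page.num_values + 7) / 8), 0);
  *levels_decoded = false;
  if (page.null_count == 0) {  // no nulls: every slot is valid
    SetBits(bitmap, 0, page.num_values);
    return bitmap;
  }
  if (page.null_count == page.num_values) {  // only nulls: all bits stay 0
    return bitmap;
  }
  // General case: write each run straight into the bitmap, with no
  // per-value 16-bit intermediate buffer.
  *levels_decoded = true;
  int64_t offset = 0;
  for (const LevelRun& run : page.runs) {
    if (run.level == 1) SetBits(bitmap, offset, run.length);
    offset += run.length;
  }
  assert(offset == page.num_values);
  return bitmap;
}
```

A production decoder would likely fill whole bytes or words at a time for long runs instead of setting bits one by one; the per-bit loop here only keeps the sketch short.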

This work can optionally be extended so as to apply to schemas with moderate amounts
of nesting.

Depending on the characteristics of the Parquet data, this could make Parquet reading 2x
faster, or even more in some cases. If you are unsure whether your workload could
benefit, we can discuss this based on sample Parquet files you provide us.

##### Are you interested in this project? Contact us for more information on how to help fund it, either entirely or partially
17 changes: 15 additions & 2 deletions src/components/fundable/projectsDetails.ts
@@ -5,8 +5,9 @@ import EmscriptenForgePackageRequestsMD from "@site/src/components/fundable/desc
import SVE2SupportInXsimdMD from "@site/src/components/fundable/descriptions/SVE2SupportInXsimd.md"
import MatrixOperationsInXtensorMD from "@site/src/components/fundable/descriptions/MatrixOperationsInXtensor.md"
import BinaryViewInArrowCppMD from "@site/src/components/fundable/descriptions/BinaryViewInArrowCpp.md"
import Decimal32InArrowCppMD from"@site/src/components/fundable/descriptions/Decimal32InArrowCpp.md"
import Float16InArrowCppMD from"@site/src/components/fundable/descriptions/Float16InArrowCpp.md"
import Decimal32InArrowCppMD from "@site/src/components/fundable/descriptions/Decimal32InArrowCpp.md"
import Float16InArrowCppMD from "@site/src/components/fundable/descriptions/Float16InArrowCpp.md"
import ParquetNullOptimizationsMD from "@site/src/components/fundable/descriptions/ParquetNullOptimizations.md"

export const fundableProjectsDetails = {
jupyterEcosystem: [
@@ -125,6 +126,18 @@ export const fundableProjectsDetails = {
currentNbOfFunders: 0,
currentFundingPercentage: 0,
repoLink: "https://github.com/apache/arrow"
},
{
category: "Apache Arrow and Parquet",
title: "Parquet C++ reader optimizations",
pageName: "ParquetNullOptimizations",
shortDescription: "Converting Parquet optional values to nullable Arrow data is often a performance bottleneck.",
description: ParquetNullOptimizationsMD,
price: "TBD",
maxNbOfFunders: 1,
currentNbOfFunders: 0,
currentFundingPercentage: 0,
repoLink: "https://github.com/apache/arrow"
}
]

9 changes: 9 additions & 0 deletions src/pages/fundable/ParquetNullOptimizations/GetAQuote.tsx
@@ -0,0 +1,9 @@
import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
import GetAQuotePage from '@site/src/components/fundable/GetAQuotePage';

export default function FundablePage() {
const { siteConfig } = useDocusaurusContext();
return (
<GetAQuotePage/>
);
}
9 changes: 9 additions & 0 deletions src/pages/fundable/ParquetNullOptimizations/index.tsx
@@ -0,0 +1,9 @@
import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
import LargeProjectCardPage from '@site/src/components/fundable/LargeProjectCardPage';

export default function FundablePage() {
const { siteConfig } = useDocusaurusContext();
return (
<LargeProjectCardPage/>
);
}