Apache DataFusion Comet

Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on the Apache DataFusion query engine. Comet is designed to improve the performance of Apache Spark workloads on commodity hardware, and it integrates with the Spark ecosystem without requiring any code changes.

On the TPC-H benchmark at the 1 TB scale, Comet provides a 2x speedup, which corresponds to roughly 50% cost savings.

That 2x speedup gives you a choice: finish the same Spark workload in half the time on the cluster you already have, or match your current Spark performance on roughly half the resources. Either way, the gain translates directly into lower cloud bills, reduced on-prem capacity, and lower energy usage, with no changes to your existing Spark SQL, DataFrame, or PySpark code. Comet runs on commodity hardware: no GPUs, FPGAs, or other specialized accelerators are required, so the savings come from better utilization of the infrastructure you already run on.
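
The 50% figure follows from simple arithmetic if cost is assumed to scale linearly with node-hours. As a rough sketch, with N (number of nodes), T (baseline runtime), r (price per node-hour), and s (speedup) all introduced here purely for illustration:

```latex
\begin{align*}
C_{\text{Spark}} &= N \cdot T \cdot r \\
C_{\text{Comet}} &= N \cdot \frac{T}{s} \cdot r = \frac{C_{\text{Spark}}}{s} \\
\text{savings} &= 1 - \frac{1}{s} = 1 - \frac{1}{2} = 50\% \quad (s = 2)
\end{align*}
```

The same result holds if the runtime stays at T and the node count drops to N/s, provided the workload scales linearly across nodes. Real clusters rarely scale perfectly, so treat this as an idealized estimate rather than a guarantee.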

See the Comet Benchmarking Guide for more details.

What Comet Accelerates

Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion. It uses Apache Arrow for zero-copy data transfer between the JVM and native code.

Parquet scans: native Parquet reader integrated with Spark's query planner
Apache Iceberg: accelerated Parquet scans when reading Iceberg tables from Spark (see the Iceberg guide)
Shuffle: native columnar shuffle with support for hash and range partitioning
Expressions: hundreds of supported Spark expressions across math, string, datetime, array, map, JSON, hash, and predicate categories
Aggregations: hash aggregate with support for FILTER (WHERE ...) clauses
Joins: hash join, sort-merge join, and broadcast join

For the authoritative lists, see the supported expressions and supported operators pages.

Drop-In Integration

Comet is designed as a drop-in accelerator for Apache Spark: you can add it to your existing Spark deployments and workflows with no code changes and without disrupting your Spark applications.

Getting Started

Comet supports Apache Spark 3.4 and 3.5, and provides experimental support for Spark 4.0. See the installation guide for the detailed version, Java, and Scala compatibility matrix.

Install Comet by adding the jar for your Spark and Scala version to the Spark classpath and enabling the plugin. A typical configuration looks like:

export COMET_JAR=/path/to/comet-spark-spark3.5_2.12-<version>.jar

$SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.explainFallback.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g

For full installation instructions, published jar downloads, and the complete list of configuration options, see the installation guide and the configuration reference.
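
The jar file name in the configuration above encodes the Spark and Scala versions, so a mismatch is an easy mistake to make. The following is a minimal sketch, not part of Comet, of two helper functions that build the file name from those versions and fail early if the jar is missing. The function names `comet_jar_name` and `resolve_comet_jar` are hypothetical, and the naming pattern is assumed from the single example shown above; check the installation guide for the jars actually published.

```shell
# Build the Comet jar file name.
# $1 = Spark version (e.g. 3.5), $2 = Scala version (e.g. 2.12),
# $3 = Comet release version (left for you to supply).
# Assumption: the pattern matches the example jar name shown above.
comet_jar_name() {
    printf 'comet-spark-spark%s_%s-%s.jar\n' "$1" "$2" "$3"
}

# Print the full jar path, or return non-zero if the file does not exist.
# $1 = directory holding the jar; remaining arguments as for comet_jar_name.
resolve_comet_jar() {
    _jar_dir="$1"
    shift
    _jar_path="$_jar_dir/$(comet_jar_name "$@")"
    if [ ! -f "$_jar_path" ]; then
        echo "Comet jar not found: $_jar_path" >&2
        return 1
    fi
    printf '%s\n' "$_jar_path"
}

# Example usage (commented out; set COMET_VERSION to your release first):
# COMET_JAR=$(resolve_comet_jar /path/to 3.5 2.12 "$COMET_VERSION") || exit 1
# export COMET_JAR
```

Checking for the file before launching `spark-shell` turns a misnamed or missing jar into an immediate, readable error instead of a failure during Spark startup.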

Community

Join the DataFusion Slack and Discord channels to connect with other users, ask questions, and share your experiences with Comet.

Contributing

We welcome contributions from the community to help improve and enhance Apache DataFusion Comet. Whether it's fixing bugs, adding new features, writing documentation, or optimizing performance, your contributions are invaluable in shaping the future of Comet. Check out our contributor guide to get started.

License

Apache DataFusion Comet is licensed under the Apache License 2.0. See the LICENSE.txt file for details.
