Introducing DataPelago Accelerator for Spark — the next frontier in Spark performance and efficiency

Apache Spark is the backbone of modern data processing, powering ETL pipelines, analytics, machine learning, and now GenAI workloads. But as datasets grow and infrastructure costs escalate, Spark’s performance and cost-efficiency gaps become more pronounced.

At DataPelago, we set out to solve this challenge, not by replacing Spark, but by accelerating it. We’re excited to introduce DataPelago Accelerator for Spark, the industry’s first plug-and-play accelerator for Apache Spark, built for heterogeneous compute and designed for immediate cost and performance gains.

What is DataPelago Accelerator?

DataPelago Accelerator for Spark, or Accelerator, is a drop-in acceleration layer for Apache Spark, available for both self-managed and managed deployments. Without changing a line of code or moving a byte of data, DataPelago Accelerator delivers:

Up to 10x faster performance
Up to 80% lower compute cost

Powered by DataPelago Nucleus, DataPelago’s Universal Data Processing Engine, the Accelerator enables Spark workloads to run seamlessly across CPUs, GPUs, and other accelerators.

With DataPelago Accelerator for Spark, we are able to complete a few of our heaviest OLAP cube jobs on OSS Spark—something that had been challenging due to data skew and performance bottlenecks. This opens the door for a migration from managed platforms without compromising speed or reliability while reducing our costs by 50%.

Arya Ketan, Distinguished Engineer, ShareChat

Why accelerate Spark?

Apache Spark is flexible and widely adopted, but it's not inherently optimized for modern hardware. Open-source Spark offers cost control but lags in performance. Managed Spark services improve performance but at a significantly higher cost. Users are often forced to choose between speed and affordability.

DataPelago Accelerator removes this trade-off, delivering high performance and low cost — without compromising openness, flexibility, or compatibility.

Key Capabilities

The Accelerator optimizes Spark across the full execution pipeline — from query planning to code generation to runtime. Core enhancements include:

Intelligent Acceleration Engine

Dynamically detects available hardware (CPU, GPU, FPGA) and generates optimized execution plans.
Executes plans natively using vectorized, columnar processing for maximum throughput.
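To make "vectorized, columnar processing" concrete, here is a minimal sketch of the general idea in plain Python. It is not DataPelago's implementation: the sample records and the helper names (`sum_amount_row_wise`, `to_columns`, `sum_amount_columnar`) are made up for illustration. It only shows why a column-oriented layout lets an aggregate read one tightly packed column instead of walking every field of every row.

```python
from array import array

# Illustrative sample data (hypothetical, not from any real workload).
rows = [
    {"user_id": 1, "country": "IN", "amount": 12.5},
    {"user_id": 2, "country": "US", "amount": 7.0},
    {"user_id": 3, "country": "IN", "amount": 30.0},
]


def sum_amount_row_wise(records):
    """Row-at-a-time: visit each full record just to read one field."""
    total = 0.0
    for record in records:
        total += record["amount"]
    return total


def to_columns(records):
    """Pivot the records into one contiguous sequence per column."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),
        "country": [r["country"] for r in records],
        "amount": array("d", (r["amount"] for r in records)),
    }


def sum_amount_columnar(columns):
    """Columnar: the aggregate touches only the packed 'amount' column."""
    return sum(columns["amount"])


columns = to_columns(rows)
print(sum_amount_row_wise(rows), sum_amount_columnar(columns))
```

Both functions return the same total. A native engine applies the same layout idea with SIMD or GPU kernels over whole column batches, which is where the throughput gain comes from.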

Broad Spark Coverage

Accelerates a wide range of Spark operators (Scan, Join, Aggregate, Sort, etc.) and functions (string, timestamp, regex, UDFs).
Supports native read/write acceleration, complex data types, and all major table and file formats (Iceberg, Delta, JSON, Parquet, ORC, etc.).

AI-Native Architecture

Treats AI/ML models as first-class citizens in the data pipeline.
Accelerates GenAI workloads for pre-training, fine-tuning, RAG, and Agentic AI.

Real-World Impact

DataPelago Accelerator is already accelerating diverse production workloads:

ETL Pipelines*: 3x faster ETL jobs, 80% cost savings on large Parquet datasets.
OLAP Analytics*: 3x faster cube generation, enabling transition from expensive managed Spark to OSS.
BI Dashboards*: 2x lower latency, 3x higher query throughput, 7x lower costs for interactive dashboards.
GenAI Pipelines: 18x speedup on embedding and knowledge extraction stages in RAG pipelines.

* Acceleration and cost savings achieved with the same servers as before.

Seamless Integration

DataPelago Accelerator is designed for drop-in simplicity:

Plug-and-play deployment with 1-line cluster configuration.
No code changes, rewrites, or migrations needed.
Compatible with all Spark tools — Jupyter, Zeppelin, Airflow, Tableau, etc.
Native integration with Spark UI for query profiling and logs.
Available on GCP Marketplace and deployable on GKE, Google Dataproc, AWS EKS, and on-prem.
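This post does not name the actual configuration line, so the following is only a hedged sketch of what a one-line enablement could look like. It assumes, purely for illustration, that an accelerator is loaded through Spark's standard `spark.plugins` property. The plugin class name `com.example.accelerator.SparkPlugin` and the job file `etl_job.py` are hypothetical placeholders, not DataPelago's real identifiers; consult the product documentation for the real setting.

```python
# Hypothetical plugin class name, used only as a placeholder.
ACCELERATOR_PLUGIN_CLASS = "com.example.accelerator.SparkPlugin"


def build_submit_command(app_path, plugin_class=ACCELERATOR_PLUGIN_CLASS):
    """Build a spark-submit command that adds one --conf entry.

    Assumption: the accelerator is enabled through Spark's generic
    `spark.plugins` property. The application itself is left unchanged.
    """
    return [
        "spark-submit",
        "--conf", f"spark.plugins={plugin_class}",
        app_path,
    ]


command = build_submit_command("etl_job.py")
print(" ".join(command))
```

The point of the sketch is that the job script is passed through untouched; only the cluster or submit configuration gains a line.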

Built for the Enterprise

The Accelerator meets the demands of modern data teams:

Security & Governance: Leverages Spark’s existing auth, policies, and compliance controls.
Observability: Streams metrics and logs through existing Spark interfaces.
Analyzer Tool: Built-in AI/ML-powered utility to identify workloads best suited for acceleration.

Availability

DataPelago Accelerator is now generally available for both open-source and managed Apache Spark clusters. It supports hybrid and multi-cloud environments, including GPU cloud providers and AI factories.

You can activate the Accelerator today with a single line during cluster startup — and immediately accelerate your Spark workloads while cutting costs.

The Future of Spark is Accelerated

We believe Spark should be fast, efficient, and open — without compromise. DataPelago Accelerator delivers on that vision, giving data teams the performance they need with the control they want.

The Accelerator is available now at datapelago.ai. For a demo, contact us at info@datapelago.com.