The AI race is won or lost in data processing. While enterprises invest billions in GenAI initiatives, they hit a fundamental constraint: existing GPU frameworks, though industry-leading, still leave massive performance on the table. This performance gap directly translates to economic bottlenecks that stall AI adoption at scale.
NVIDIA's cuDF has set the industry standard for GPU-accelerated data analytics, enabling organizations to move beyond CPU-only processing for their most demanding workloads. Yet even with cuDF's impressive capabilities, enterprises running large-scale AI and analytics pipelines face escalating infrastructure costs and processing delays that limit what's economically viable.
DataPelago Nucleus is designed to shatter these remaining performance barriers. Our universal data processing engine features an accelerator-centric virtual machine with a domain-specific instruction set architecture, purpose-built to unlock the full potential of the GPU infrastructure that enterprises have already invested in.
cuDF represents the current pinnacle of GPU acceleration for data analytics—if we can demonstrate substantial improvements over cuDF, it validates the transformational potential available even within existing GPU infrastructure. For enterprises already spending millions on GPU clusters, performance gains of this magnitude translate to massive economic impact.
The operators we benchmarked represent the foundation of virtually every AI and analytics workload: scanning, filtering, and projection for data preparation, aggregations for feature engineering, joins for data enrichment, and sorting for ranking systems. Our benchmarks also replicate real-world workloads, incorporating complex expressions, multi-column keys, and variable-length strings. Optimizing these core operations creates multiplicative benefits across entire data pipelines.
Our comparison used an AWS p5.48xlarge instance with an AMD EPYC 7R13 processor (192 vCPUs), 2 TB of DRAM, and 8x NVIDIA H100 GPUs. All measurements were conducted on a single GPU using DataPelago Nucleus v3.0 and cuDF v25.04.00. We evaluated a comprehensive suite of physical operators that form the building blocks of enterprise AI and analytics workloads.
Foundation Operations: Filter and Project Performance
Figure 1: Project, Filter, and Aggregate Throughput
Our first benchmark evaluated complex expressions across five columns with three aggregation functions. These operations form the foundation of every data pipeline—filtering irrelevant data and projecting required columns before subsequent processing.
DataPelago Nucleus is up to 10.5x faster for project operations, up to 10.1x faster for filter operations, and up to 4.3x faster for aggregate operations compared to cuDF.
Foundation operations affect every query in your pipeline. Improvements of 4x to 10x here multiply across thousands of daily operations, dramatically reducing both processing time and infrastructure costs for AI data preparation workloads.
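To make the shape of these three operators concrete, here is a minimal sketch in pandas (whose DataFrame API cuDF mirrors). The table size, column names, expression, and predicate below are illustrative stand-ins, not the exact benchmark kernels:

```python
import numpy as np
import pandas as pd

# Synthetic five-column table; the row count is illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({f"c{i}": rng.random(1_000_000) for i in range(5)})

# Project: evaluate a complex expression across several columns.
projected = df.assign(score=df.c0 * df.c1 + df.c2 - df.c3 / (df.c4 + 1.0))

# Filter: keep only rows satisfying a compound predicate.
filtered = projected[(projected.score > 0.25) & (projected.c0 < 0.9)]

# Aggregate: three aggregation functions over the surviving rows.
summary = filtered.agg({"score": "sum", "c1": "mean", "c2": "max"})
```

On cuDF the same code runs on the GPU after swapping `import pandas as pd` for `import cudf as pd`; the benchmark measures throughput of these operator kernels, not the Python dispatch overhead.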
Memory-Intensive Operations: Hash Aggregations and Joins
Figure 2: Hash Aggregate and Hash Join operations with a varying number of unique keys
Hash table operations represent the core of most analytics and ML feature engineering pipelines. We evaluated three hash-aggregation functions and hash-join operations on four-column tables, varying the number of unique keys to stress-test memory performance.
Nucleus outperforms cuDF by up to 4.5x for hash aggregate operations. Gains increased as the number of unique keys grew.
These operations are central to ML feature engineering and data enrichment workflows. As data complexity increases (more unique keys), Nucleus's architectural advantages become more pronounced, ensuring performance scales with your data growth rather than degrading.
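A sketch of the two benchmarked operations, again using the pandas API that cuDF mirrors; the row count, key cardinality, and column names are illustrative assumptions, and varying `n_keys` reproduces the cardinality sweep described above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows, n_keys = 1_000_000, 10_000  # raise n_keys to stress the hash table

left = pd.DataFrame({
    "k1": rng.integers(0, n_keys, n_rows),
    "k2": rng.integers(0, n_keys, n_rows),
    "v1": rng.random(n_rows),
    "v2": rng.random(n_rows),
})

# Hash aggregate: three aggregation functions grouped on a multi-column key.
agg = left.groupby(["k1", "k2"], sort=False).agg(
    v1_sum=("v1", "sum"), v2_mean=("v2", "mean"), v1_max=("v1", "max")
)

# Hash join: enrich the fact table against a dimension table on the same keys.
dims = left[["k1", "k2"]].drop_duplicates().assign(label=lambda d: d.k1 % 7)
joined = left.merge(dims, on=["k1", "k2"], how="inner")
```

Both operations build a hash table over the key columns, so their memory-access pattern, and hence the performance gap, shifts as the number of distinct keys grows.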
Critical Real-Time Operations: Top-K Performance
Figure 3: Top-K Throughput
The "ORDER BY column LIMIT N" operation—known as Top-K—is ubiquitous in analytics workloads, particularly for real-time ranking and recommendation systems. We sorted four-column tables and extracted the top K rows across various K values.
Nucleus delivers up to 8.2x faster performance for 'Top-K' operations compared to cuDF.
This is one of our most dramatic improvements and addresses one of the most common bottlenecks in real-time analytics. For recommendation engines, search ranking, and real-time dashboards, 8x faster Top-K operations can transform what's possible for interactive user experiences.
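The Top-K pattern can be sketched in two equivalent forms, shown here in pandas (table size, sort column, and K are illustrative). A full sort followed by truncation does O(n log n) work, while a dedicated Top-K selection only needs to track the K best rows seen so far:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({f"c{i}": rng.random(100_000) for i in range(4)})
k = 100

# Naive form: fully sort the table, then truncate -- the "ORDER BY ... LIMIT N" plan.
top_sorted = df.sort_values("c0", ascending=False).head(k)

# Top-K form: a bounded selection that never materializes the full sorted table.
top_k = df.nlargest(k, "c0")
```

An engine that recognizes the `ORDER BY ... LIMIT N` pattern can execute the second form directly, which is where most of the headroom in this benchmark comes from.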
String Processing: Variable-Length Data Performance
Figure 4: Throughput of various operators on strings
String processing presents unique challenges for GPU acceleration due to variable-length data structures. We benchmarked various physical operators with strings of different lengths to evaluate real-world text processing performance.
For hash join operations, Nucleus achieves 38.6x faster throughput compared to cuDF for smaller strings and 3.9x faster throughput for extra-large strings.
With unstructured text data driving most GenAI applications, string processing optimization directly impacts the economics of large language model data preparation, document processing, and text analytics pipelines.
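The string hash-join scenario can be sketched as follows in pandas; the key widths, row count, and padding scheme are illustrative assumptions used to mimic the small-versus-extra-large string regimes, not the benchmark's actual data generator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 100_000

# Variable-length string keys: render a numeric key as a short string, then pad
# it out to emulate the extra-large string regime.
ids = rng.integers(0, 10_000, n)
small_keys = pd.Series(ids).astype(str)                    # a few bytes each
large_keys = small_keys.str.pad(width=256, side="left", fillchar="x")

facts = pd.DataFrame({"key": large_keys, "v": rng.random(n)})
dims = facts[["key"]].drop_duplicates()
dims["label"] = np.arange(len(dims))

# Hash join on the string key; swapping large_keys for small_keys above
# exercises the short-string regime with the same code.
joined = facts.merge(dims, on="key", how="inner")
```

Because every probe must hash and compare variable-length byte sequences rather than fixed-width integers, key width dominates the cost profile, which is why the speedup varies so sharply between small and extra-large strings.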
These technical improvements translate directly to tangible business value.
These results demonstrate DataPelago Nucleus's foundational advantage: our universal data processing engine doesn't just optimize individual operations—it transforms the entire economics of GPU-accelerated computing. While cuDF represents the current state-of-the-art, Nucleus shows what becomes possible when you architect specifically for the AI era.
All these performance gains are immediately available to Apache Spark users through our DataPelago Accelerator for Spark (DPA-S), requiring zero code changes to existing applications. Organizations can realize these improvements while preserving their entire existing infrastructure and workflow investments.
Even industry-leading GPU acceleration frameworks have substantial performance headroom. DataPelago Nucleus doesn't just incrementally improve GPU workloads—it fundamentally transforms what's economically possible with existing infrastructure.
For enterprises already invested in GPU infrastructure, these results represent an opportunity to dramatically expand what's achievable without additional hardware investments. The question isn't whether your organization can afford to optimize these core operations—it's whether you can afford not to.