Speed at Scale: Benchmarking GPU-Accelerated HeavyDB
From the very beginning, HEAVY.AI was built as a vertically integrated platform designed to allow fully interactive visual exploration of massive multi-billion row datasets, without needing to index, downsample, or pre-aggregate the data. This required the development of a built-in capability to zero-copy render large datasets using the GPU, as well as a web-based visual analytics frontend, HeavyImmerse, that would allow users to effortlessly slice-and-dice data with a few clicks and drags of their mouse.
However, none of the blistering speeds the platform is known for would be possible without the performance of its core engine, the GPU-accelerated analytics database HeavyDB. Not only was HeavyDB built from the ground up to leverage the unique architecture of Nvidia GPUs, but we architected it to exploit the full parallelism of modern hardware, both GPU and CPU. Here is a non-exhaustive list of key architectural innovations in the system:
- HeavyDB is built around a highly optimized compilation engine that lowers query plans to LLVM IR, from which we can target Nvidia GPUs as well as x86 and ARM CPUs, tailoring optimization passes for each architecture (a simplified codegen sketch follows this list).
- We employ a number of strategies for generating and probing group-by and join hash tables, depending on the hardware and the cardinalities and characteristics of the data involved. For example, on GPU, hash tables can be global or privatized per CUDA block, can be placed in global or shared memory, and can use a murmur or perfect hash strategy, all depending on the situation, to maximize performance (see the aggregation sketch after this list).
- HeavyDB is built to carefully manage the memory across multiple GPUs per server with a separate buffer pool per GPU (and CPU), keeping data close to compute and effectively managing the complexity of a distributed database implementation within a single process for maximum performance.
- As mentioned above, for visualization workloads, we can optionally zero-copy render SQL query results using a subset of the open-source Vega specification, particularly powerful for generating instant visualizations of billions of geospatial records (points, lines, and polygons).
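To make the first point above concrete, here is a minimal host-side sketch of what lowering a query fragment to LLVM IR can look like, using the LLVM C++ API to emit a predicate for a hypothetical WHERE x > 10 AND x < 20 clause. This is an illustration of the general technique only, not HeavyDB's actual generated code; in a real engine the resulting module would be JIT-compiled for x86/ARM or lowered through the NVPTX backend for Nvidia GPUs.

```cuda
// Illustrative only: emit IR for the predicate `x > 10 AND x < 20`,
// roughly how a query compiler might lower a WHERE clause.
// Build (example): clang++ codegen.cpp $(llvm-config --cxxflags --ldflags --libs core)
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/raw_ostream.h>

using namespace llvm;

int main() {
    LLVMContext ctx;
    Module mod("query_fragment", ctx);
    IRBuilder<> b(ctx);

    // Declare `i1 row_filter(i32 x)` and open its entry block.
    auto* fnTy = FunctionType::get(b.getInt1Ty(), {b.getInt32Ty()}, false);
    auto* fn = Function::Create(fnTy, Function::ExternalLinkage, "row_filter", mod);
    b.SetInsertPoint(BasicBlock::Create(ctx, "entry", fn));

    // Lower `x > 10 AND x < 20` into two comparisons and an AND.
    Value* x = fn->getArg(0);
    Value* gt = b.CreateICmpSGT(x, b.getInt32(10), "gt10");
    Value* lt = b.CreateICmpSLT(x, b.getInt32(20), "lt20");
    b.CreateRet(b.CreateAnd(gt, lt, "pred"));

    mod.print(outs(), nullptr);  // dump the generated textual IR
    return 0;
}
```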
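Similarly, here is a simplified CUDA sketch of the privatized per-block aggregation strategy from the second point, using a perfect hash (the key value directly indexes the table) over a small key domain and a shared-memory table per thread block. All names and sizes are illustrative; HeavyDB generates specialized variants of this pattern through its compiler rather than using a fixed kernel like this.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed small key domain, so a perfect hash (key == slot index) applies.
constexpr int kNumGroups = 64;

// Per-block privatized COUNT(*) GROUP BY key: each block aggregates into
// a shared-memory table, then merges its partial counts into the global one.
__global__ void groupByCount(const int* keys, int n, unsigned long long* counts) {
    __shared__ unsigned int local[kNumGroups];
    for (int i = threadIdx.x; i < kNumGroups; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Atomic contention stays within the block's fast shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[keys[i]], 1u);
    __syncthreads();

    // One global atomic per group per block, instead of one per row.
    for (int i = threadIdx.x; i < kNumGroups; i += blockDim.x)
        atomicAdd(&counts[i], (unsigned long long)local[i]);
}

int main() {
    const int n = 1 << 20;
    int* keys;
    unsigned long long* counts;
    cudaMallocManaged(&keys, n * sizeof(int));
    cudaMallocManaged(&counts, kNumGroups * sizeof(unsigned long long));
    for (int i = 0; i < n; ++i) keys[i] = i % kNumGroups;
    cudaMemset(counts, 0, kNumGroups * sizeof(unsigned long long));

    groupByCount<<<256, 256>>>(keys, n, counts);
    cudaDeviceSynchronize();
    printf("group 0 count: %llu\n", counts[0]);  // expect n / kNumGroups

    cudaFree(keys);
    cudaFree(counts);
    return 0;
}
```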
While we often focus on the capabilities of the full vertical stack, over time our core engine has improved significantly both in terms of performance and capabilities. Hence we thought it would be interesting to test the raw performance of the system and compare against leading CPU data warehouses, using the widely used Star Schema Benchmark (SSB) and TPC-H tests. We think the results are impressive (with still further room for optimization), but we’ll let you judge for yourself.
Hardware and Software Setup
We partnered with Vultr, a leading GPU cloud provider, which supplied the instances used for testing. We benchmarked our software on a machine with 8X80GB (640GB VRAM total) Nvidia A100 GPUs, costing $12.74/hr, and one of Nvidia’s new ARM-based Grace Hopper GH200 instances, costing $2.99/hr, with a single 96GB Hopper GPU able to move data to and from 480GB of CPU RAM over an ultra-fast 960GB/sec bidirectional bandwidth NVLink interconnect. As we will cover in more detail in a future post, we believe that the new fast NVLink between CPU and GPU in these new Nvidia ARM-based systems (Grace Hopper currently, to be followed by Grace Blackwell) represents a breakthrough in the types of problems and scales of data that can be processed efficiently on GPU.
Both GPU setups are effectively commodity supercomputers; either would have easily made the Top500 supercomputer list a little over 10 years ago. The GH200 system with a single GPU possesses 67 teraFLOPS of FP64 compute and nearly 4 petaFLOPS using the new FP8 standard, with up to 144GB of GPU memory at 5TB/sec (our system had 96GB VRAM at 4TB/sec). The 8XA100 system, while based on an older (Ampere) architecture, makes up for it in aggregate compute and memory bandwidth by having 8 GPUs, for a total of 156 teraFLOPS of FP64 compute with 640GB VRAM accessible at 16TB/sec. It should be noted that the newer Hopper architecture of the GH200 possesses other key architectural advantages (e.g. larger caches) that allow the small GH200 machine to best the much larger 8XA100 machine in some scenarios, as seen below.
To provide a baseline by which to gauge our performance, we also benchmarked the leading cloud data warehouse vendors: Snowflake (Large instance type, costing $16/hr), Databricks (Large Instance Type, costing $28/hr), Google BigQuery (400 Query Slots, costing $24/hr), and Amazon Redshift (32 RPUs, costing $12/hr).
Methodology
Following the trend of other published benchmarks, we ran the tests “warm”, taking the average timing of the 2nd, 3rd, and 4th runs of each query. This eliminates the overhead of moving data from storage on the first run, which would essentially make the benchmark storage-bound rather than a measure of execution performance. It also dovetails with common usage patterns of our system, where users typically identify data of interest (which might be certain columns of a table for a given time range) and repeatedly query that subset. For certain vendors like Snowflake, we had to disable result set caching, which is enabled by default and otherwise causes queries after the first run to return nearly instantaneously. A minimal sketch of the warm-run timing approach follows.
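The sketch below is a hypothetical harness, not our actual benchmark code: it times a query callable four times and reports the average of runs two through four, discarding the cold first run.

```cuda
#include <chrono>
#include <cstdio>
#include <functional>

// Hypothetical harness: run a query four times and average runs 2-4,
// discarding the cold first run (storage and cache warmup).
double warmAverageMs(const std::function<void()>& runQuery) {
    double ms[4];
    for (int run = 0; run < 4; ++run) {
        auto start = std::chrono::steady_clock::now();
        runQuery();  // stand-in for issuing the SQL query to the database
        auto end = std::chrono::steady_clock::now();
        ms[run] = std::chrono::duration<double, std::milli>(end - start).count();
    }
    return (ms[1] + ms[2] + ms[3]) / 3.0;  // average of 2nd, 3rd, 4th runs
}

int main() {
    double avg = warmAverageMs([] { /* execute benchmark query here */ });
    printf("warm average: %.1f ms\n", avg);
    return 0;
}
```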
Star Schema Benchmark (SSB)
The Star Schema Benchmark (SSB) is a popular SQL benchmark designed to evaluate the performance of database systems when handling complex analytical queries. It models a typical data warehouse environment, specifically focusing on a “star schema” data model where a large fact table is surrounded by smaller dimension tables, commonly used in business intelligence and reporting systems. It should be noted that SSB is derived from the TPC-H benchmark but is simplified to emphasize key aspects of star schema-based workloads.
For our tests we used a Scale Factor (SF) of 1000, meaning 1000GB (1TB) of data, comprising approximately 6 billion records. This is a common scale at which to run this benchmark, and illustrative of data scales our customers commonly target with our system.
As can be seen from the results, HeavyDB is significantly faster than all of the cloud data warehouses on this workload on both the 8XA100 and 1XGH200 instances. The largest advantage accrues to the 8XA100 setup, likely because the Star Schema Benchmark's simpler queries require fewer intermediate reductions and thus parallelize well across multiple GPUs.
In addition, the hot data for some queries did not fit entirely in the 96GB VRAM buffer of the GH200 instance, requiring paging over NVLink from CPU memory. Although this interconnect, at 480GB/sec unidirectional (960GB/sec bidirectional), is significantly faster than any PCIe-based interconnect (such as the one in the 8XA100 system), it is still not as fast as the 4TB/sec HBM3 VRAM, resulting in moderate slowdowns for the queries that had to page from CPU. That said, it is a huge leap forward to have such high bandwidth between CPU and GPU, portending an end to the “thin straw” scenario of having to pull large amounts of data over low-bandwidth PCIe. For common usage scenarios of our system, such as when there is a large number of users or users are hitting large datasets that cannot fit entirely in VRAM, having such a fast interconnect is a boon, as the entire contents of VRAM can be swapped in a barely noticeable few hundred milliseconds.
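The arithmetic behind that last claim is straightforward: 96GB of VRAM refilled at 480GB/sec unidirectional works out to roughly 200ms for a full swap. For anyone who wants to check the achieved (rather than rated) host-to-device bandwidth of their own machine, here is a minimal CUDA sketch; the buffer size and expected figures are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Measures achieved host-to-device copy bandwidth using pinned memory.
// On a GH200-class NVLink-C2C link one would expect a few hundred GB/sec;
// over PCIe, tens of GB/sec.
int main() {
    const size_t bytes = size_t(4) << 30;  // 4GB test buffer (illustrative)
    void *host, *device;
    cudaMallocHost(&host, bytes);  // pinned host memory for full bandwidth
    cudaMalloc(&device, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (bytes / 1e9) / (ms / 1e3);
    // e.g. at 480GB/sec, swapping a full 96GB VRAM buffer takes ~200ms.
    printf("H2D: %.1f GB/sec (%.1f ms for %zu bytes)\n", gbps, ms, bytes);

    cudaFreeHost(host);
    cudaFree(device);
    return 0;
}
```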
Overall, HeavyDB is between 25X and 130X faster than the other data warehouses on the 8XA100 machine, while costing less to run than all of them except Redshift. Notably, nearly half of the queries (6 of 13) ran in under 100ms, and all queries ran in under 240ms. This points to the differentiation the HEAVY.AI platform offers for interactive analytics: even at the 1TB scale, queries of moderate complexity return well within the latency threshold typically termed fully interactive (under 500ms), even with multiple such queries on a dashboard or multiple users querying the system simultaneously.
TPC-H
The TPC-H Benchmark is an industry-standard SQL benchmark designed to evaluate the performance of decision support systems by simulating complex business-oriented queries. It represents a broad range of ad hoc queries and concurrent data modifications, reflecting real-world applications where databases are used for analytical processing and reporting. Generally the queries are more complex than those in the Star Schema Benchmark, revolving around a snowflake rather than a star schema.
The GH200 instance shone on this benchmark in particular. Despite having far less aggregate GPU memory bandwidth than the 8XA100 system, the lower reduction overhead of using a single GPU (instead of 8), combined with specific innovations introduced with the Hopper architecture, yielded significant performance gains compared to the 8-GPU Ampere instance. Compared to the CPU data warehouse systems tested, HEAVY.AI on the GH200 instance was between 3.4X and 21.1X faster in absolute terms, and normalized for cost, 13.6X to 169X faster per dollar.
You’ll see a blank result for TPCH21, which contains a form of correlated subquery that we cannot currently rewrite; future work will address this. For parity in reporting, we subtracted the TPCH21 times from the other databases' results. While the remaining queries are quite performant, we have identified additional optimizations that will further improve HeavyDB's performance on these queries, and plan to implement them over upcoming releases.
Conclusion
In conclusion, these benchmark results showcase the performance advantages of HeavyDB’s GPU-accelerated architecture, offering speedups up to 130X compared to traditional cloud data warehouses. More significant than the raw numbers is what this means for organizations: the ability to perform complex analytical queries on massive datasets with sub-second latency, enabling truly interactive data exploration and visualization at scale.
Part of the performance advantage of course lies in the significant compute and memory bandwidth advantages of GPUs. It is still mind-boggling to me that you can effectively rent a supercomputer in a box in the cloud for as little as $2.99 an hour (in the case of the GH200 system). We are entering a golden age of abundant and cheap compute, the full impact of which is only beginning to be felt.
On the other hand, without key architectural innovations in HeavyDB, such as its LLVM-based query compilation engine and advanced GPU memory management, much of this theoretical hardware edge would be left untapped. We’ve spent the last 10 years building out and optimizing our system to run analytical SQL queries as fast as possible on GPUs, and even now we see a long list of opportunities for further optimization: low-level changes to maximize use of the GPU, higher-level changes such as how we reduce results between GPUs, and more mundane improvements such as better query plan rewrites. The pursuit of maximum performance is never finished, and we will continue to follow up with additional benchmarks as we land further optimizations.
We’d love for you to try our platform for yourself, which you can do via a variety of methods, including downloading our Free edition or trying it in our cloud. Until next time, wishing you speedy queries.