Todd Mostak
Oct 29, 2024

Speed at Scale: Benchmarking GPU-Accelerated HeavyDB


From the very beginning, HEAVY.AI was built as a vertically integrated platform designed to allow fully interactive visual exploration of massive multi-billion row datasets, without needing to index, downsample, or pre-aggregate the data. This required the development of a built-in capability to zero-copy render large datasets using the GPU, as well as a web-based visual analytics frontend, HeavyImmerse, that would allow users to effortlessly slice-and-dice data with a few clicks and drags of their mouse.

However, none of the blistering speed the platform is known for would be possible without the performance of its core engine, the GPU-accelerated analytics database HeavyDB. Not only was HeavyDB built from the ground up to leverage the unique architecture of Nvidia GPUs, but we architected it to ensure it could exploit the full parallelism of modern hardware, both GPU and CPU. Here is a non-exhaustive list of key architectural innovations of the system:

  1. HeavyDB is built around a highly optimized compilation engine that lowers query plans to LLVM IR, from which we can target Nvidia GPUs as well as X86 and ARM CPUs, tailoring optimization passes for each architecture (see the toy illustration following this list).
  2. There are a number of strategies that are taken for generating and probing group-by and join hash tables, depending on the hardware and the cardinalities and characteristics of the data involved. For example, on GPU, hash tables can be global or privatized per CUDA block, can be put in global or shared memory, and can use a murmur or perfect hash strategy, all depending on the situation to maximize performance.
  3. HeavyDB is built to carefully manage the memory across multiple GPUs per server with a separate buffer pool per GPU (and CPU), keeping data close to compute and effectively managing the complexity of a distributed database implementation within a single process for maximum performance.
  4. As mentioned above, for visualization workloads, we can optionally zero-copy render SQL query results using a subset of the open-source Vega specification, particularly powerful for generating instant visualizations of billions of geospatial records (points, lines, and polygons).
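
To make the first point concrete, below is a toy illustration of lowering an expression to LLVM IR using the open-source llvmlite Python package, which wraps the same LLVM infrastructure. This is not HeavyDB’s actual code generation path; the predicate and all names are invented for the example.

```python
# Build LLVM IR for a trivial row-level predicate: x > 10 AND y < 100.
# A real engine would lower an entire query plan this way, then run
# architecture-specific optimization passes and JIT-compile the result
# for CPU (X86/ARM) or GPU (NVPTX).
from llvmlite import ir

i64 = ir.IntType(64)
module = ir.Module(name="toy_query")

fnty = ir.FunctionType(ir.IntType(1), [i64, i64])  # (i64, i64) -> i1
func = ir.Function(module, fnty, name="row_predicate")
x, y = func.args

builder = ir.IRBuilder(func.append_basic_block(name="entry"))
gt = builder.icmp_signed(">", x, ir.Constant(i64, 10))
lt = builder.icmp_signed("<", y, ir.Constant(i64, 100))
builder.ret(builder.and_(gt, lt))

print(module)  # emits the textual LLVM IR for inspection
```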

While we often focus on the capabilities of the full vertical stack, over time our core engine has improved significantly in both performance and functionality. Hence we thought it would be interesting to test the raw performance of the system and compare it against leading CPU-based cloud data warehouses, using the widely used Star Schema Benchmark (SSB) and TPC-H tests. We think the results are impressive (with still further room for optimization), but we’ll let you judge for yourself.

Hardware and Software Setup

We partnered with Vultr, a leading GPU cloud provider, which provided the instances used for testing. We benchmarked our software on a machine with eight 80GB Nvidia A100 GPUs (640GB VRAM total), costing $12.74/hr, and on one of Nvidia’s new ARM-based Grace Hopper GH200 instances, costing $2.99/hr, with a single 96GB Hopper GPU able to move data to and from 480GB of CPU RAM over an ultra-fast NVLink interconnect with 960GB/sec of bidirectional bandwidth. As we will cover in more detail in a future post, we believe that the fast NVLink between CPU and GPU in these new Nvidia ARM-based systems (Grace Hopper currently, to be followed by Grace Blackwell) represents a breakthrough in the types of problems and scales of data that can be processed efficiently on GPU.

Both GPU setups are effectively commodity supercomputers; both would have easily made the Top500 supercomputer list a little over 10 years ago. The GH200 system, with a single GPU, possesses 67 teraFLOPS of FP64 compute and nearly 4 petaFLOPS using the new FP8 standard, with up to 144GB of GPU memory at 5TB/sec (our system had 96GB VRAM at 4TB/sec). The 8XA100 system, while based on an older (Ampere) architecture, makes up for it in aggregate compute and memory bandwidth by having 8 GPUs, for a total of 156 teraFLOPS of FP64 compute with 640GB VRAM accessible at 16TB/sec. It should be noted that the newer Hopper architecture of the GH200 possesses other key architectural advantages (e.g. larger caches) that allow the small GH200 machine to best the much larger 8XA100 machine in some scenarios, as seen below.

To provide a baseline by which to gauge our performance, we also benchmarked the leading cloud data warehouse vendors: Snowflake (Large instance, costing $16/hr), Databricks (Large instance, costing $28/hr), Google BigQuery (400 query slots, costing $24/hr), and Amazon Redshift Serverless (32 RPUs, costing $12/hr).

Methodology

Following the trend of other published benchmarks, we ran the tests “warm”, taking the average timing of the 2nd, 3rd, and 4th run of each query. This eliminates the overhead of moving data from storage on the first run, which would essentially make the benchmark storage-bound rather than a measure of execution performance. It also dovetails with common usage patterns of our system, where users typically identify data of interest (say, certain columns of a table for a given time range) and repeatedly query that subset. For certain vendors like Snowflake, we had to disable result set caching, which is enabled by default and otherwise makes queries after the first run return nearly instantaneously.
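
As a minimal sketch of this methodology (the `warm_time` helper and its `execute` callable are illustrative stand-ins, not the actual benchmark scripts used here):

```python
import statistics
import time

def warm_time(execute, query, runs=4):
    """Time `query` over `runs` executions and return the mean latency
    (seconds) of every run after the first, discarding the cold run that
    pays the cost of moving data from storage."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        execute(query)  # run 1 is cold; runs 2..4 measure execution speed
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings[1:])
```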

Star Schema Benchmark (SSB)

The Star Schema Benchmark (SSB) is a popular SQL benchmark designed to evaluate the performance of database systems when handling complex analytical queries. It models a typical data warehouse environment, specifically focusing on a “star schema” data model where a large fact table is surrounded by smaller dimension tables, commonly used in business intelligence and reporting systems. It should be noted that SSB is derived from the TPC-H benchmark but is simplified to emphasize key aspects of star schema-based workloads.

For our tests we used a Scale Factor (SF) of 1000, meaning 1000GB (1TB) of data, comprising approximately 6 billion records. This is a common scale at which to run the benchmark, and illustrative of the data scales our customers commonly target with our system.
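
For reference, here is the first query of the suite, Q1.1, per the published SSB specification (some implementations rename the `date` table, e.g. to `ddate`, to sidestep the SQL reserved word), wrapped as a Python string so it could feed a harness like `warm_time` above:

```python
# SSB Q1.1: revenue from a band of discounted, small-quantity orders in 1993.
SSB_Q1_1 = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder, date
WHERE lo_orderdate = d_datekey
  AND d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
"""
```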

All per-query times are in milliseconds; total runtimes are in seconds.

| Query | HeavyDB GH200 | HeavyDB 8XA100 | Snowflake Large | Databricks Large | Redshift Serverless 32 RPUs | BigQuery 400 Slots |
|---|---|---|---|---|---|---|
| Q001.1 | 86 | 30 | 3026 | 1903 | 994 | 3305 |
| Q001.2 | 72 | 29 | 2129 | 1769 | 782 | 2968 |
| Q001.3 | 60 | 29 | 1752 | 1881 | 723 | 2771 |
| Q002.1 | 203 | 82 | 5252 | 7652 | 1994 | 9999 |
| Q002.2 | 289 | 98 | 4488 | 7426 | 1638 | 8511 |
| Q002.3 | 190 | 63 | 2741 | 6810 | 1331 | 8166 |
| Q003.1 | 307 | 112 | 8648 | 16449 | 9726 | 56843 |
| Q003.2 | 213 | 232 | 4787 | 7637 | 2742 | 15258 |
| Q003.3 | 190 | 216 | 3669 | 7627 | 1443 | 19007 |
| Q003.4 | 212 | 223 | 2624 | 2821 | 1411 | 18997 |
| Q004.1 | 816 | 142 | 9352 | 12661 | 8635 | 22653 |
| Q004.2 | 715 | 144 | 5739 | 7647 | 5048 | 23762 |
| Q004.3 | 348 | 155 | 5662 | 5586 | 2075 | 10926 |
| Total runtime (sec) | 3.701 | 1.555 | 59.869 | 87.869 | 38.542 | 203.166 |

As the results show, HeavyDB is significantly faster than all of the cloud data warehouses on this workload on both the 8XA100 and 1XGH200 instances. The largest advantage accrues to the 8XA100 setup, likely because the Star Schema Benchmark, with its simpler queries and fewer intermediate reductions, parallelizes especially well across multiple GPUs.

In addition, the hot data for some queries did not fit entirely in the 96GB VRAM buffer of the GH200 instance, requiring paging from CPU memory over NVLink. Although this interconnect, at 480GB/sec unidirectional (960GB/sec bidirectional), is significantly faster than any PCIe-based interconnect (such as the 8XA100 system uses), it is still not as fast as the 4TB/sec HBM3 VRAM, so the queries that had to page from CPU were somewhat slower. That said, it’s a huge leap forward to have such high bandwidth between CPU and GPU, portending an end to the “thin straw” scenario of pulling large amounts of data over low-bandwidth PCIe. For common usage scenarios of our system, such as when many users are active or datasets cannot fit entirely in VRAM, such a fast interconnect is a boon: the entire 96GB contents of VRAM can be swapped in a barely noticeable 200 milliseconds or so (96GB at 480GB/sec).

Overall, HeavyDB on the 8XA100 machine is between 25X and 130X faster than the other data warehouses, while costing less to run than all of them except Redshift. Notably, nearly half of the queries (6 of 13) ran in under 100ms, and all ran in under 240ms. This points to the differentiation the HEAVY.AI platform offers for interactive analytics: even at the 1TB scale, queries of moderate complexity return well within the latency threshold typically termed fully interactive (under 500ms), even with multiple such queries on a dashboard or multiple users querying the system simultaneously.

TPC-H

The TPC-H Benchmark is an industry-standard SQL benchmark designed to evaluate the performance of decision support systems by simulating complex business-oriented queries. It represents a broad range of ad hoc queries and concurrent data modifications, reflecting real-world applications where databases are used for analytical processing and reporting. Generally, the queries are more complex than those in the Star Schema Benchmark, and they revolve around a snowflake rather than a star schema.

The GH200 instance shined on this benchmark in particular. Despite having far less aggregate GPU memory bandwidth than the 8XA100 system, the lower reduction overhead of running on a single GPU (instead of eight), combined with specific innovations introduced with the Hopper architecture, yielded significant performance gains compared to the 8-GPU Ampere instance. Compared to the CPU data warehouse systems tested, HEAVY.AI on the GH200 instance was between 3.4X and 21.1X faster in absolute terms, and, normalized for cost, 13.6X to 169X faster per dollar.
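
For transparency, here is how those figures fall out of the total runtimes in the table below (which exclude TPCH21, as explained next) and the hourly instance costs quoted earlier; this is a simple derivation using only numbers from this post:

```python
# Absolute and cost-normalized TPC-H speedups for HeavyDB on GH200,
# from total runtimes in seconds (excluding TPCH21) and $/hr costs.
GH200 = {"runtime": 3.905, "cost": 2.99}

others = {
    "Snowflake Large":    {"runtime": 34.850, "cost": 16.0},
    "Databricks Large":   {"runtime": 28.771, "cost": 28.0},
    "Redshift 32 RPUs":   {"runtime": 13.231, "cost": 12.0},
    "BigQuery 400 Slots": {"runtime": 82.237, "cost": 24.0},
}

for name, other in others.items():
    speedup = other["runtime"] / GH200["runtime"]           # absolute
    per_dollar = speedup * (other["cost"] / GH200["cost"])  # cost-normalized
    print(f"{name}: {speedup:.1f}X faster, {per_dollar:.1f}X per dollar")

# Redshift is the nearest competitor at 3.4X (13.6X per dollar);
# BigQuery is the furthest at 21.1X (169X per dollar).
```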

You’ll see a blank result for TPCH21, which contains a form of correlated subquery that we currently cannot rewrite; future work will address this. For parity in reporting, we subtracted the TPCH21 times from the other databases’ totals. While the remaining queries are quite performant, we have identified optimizations that will further improve HeavyDB’s performance across these queries, and plan to land the associated changes over upcoming releases.

All per-query times are in milliseconds; total runtimes are in seconds and exclude TPCH21.

| Query | HeavyDB GH200 | Snowflake Large | Databricks Large | Redshift Serverless 32 RPUs | BigQuery 400 Slots |
|---|---|---|---|---|---|
| TPCH01 | 43 | 1656 | 1514 | 1081 | 1941 |
| TPCH02 | 111 | 926 | 877 | 505 | 893 |
| TPCH03 | 36 | 1834 | 2836 | 414 | 7768 |
| TPCH04 | 427 | 1507 | 2053 | 358 | 3670 |
| TPCH05 | 48 | 1752 | 785 | 624 | 11192 |
| TPCH06 | 12 | 280 | 559 | 125 | 855 |
| TPCH07 | 60 | 1218 | 657 | 416 | 1451 |
| TPCH08 | 97 | 1820 | 866 | 401 | 11223 |
| TPCH09 | 379 | 2712 | 597 | 1006 | 1198 |
| TPCH10 | 64 | 2069 | 3632 | 725 | 3119 |
| TPCH11 | 117 | 2924 | 549 | 463 | 922 |
| TPCH12 | 22 | 860 | 1475 | 286 | 4605 |
| TPCH13 | 73 | 3382 | 1690 | 1624 | 2793 |
| TPCH14 | 30 | 696 | 1072 | 212 | 739 |
| TPCH15 | 88 | 766 | 673 | 756 | 2614 |
| TPCH16 | 453 | 2256 | 689 | 547 | 905 |
| TPCH17 | 307 | 943 | 1738 | 440 | 9924 |
| TPCH18 | 291 | 3198 | 2217 | 1307 | 12916 |
| TPCH19 | 93 | 1016 | 1230 | 836 | 652 |
| TPCH20 | 777 | 2125 | 650 | 654 | 1524 |
| TPCH21 | — | 3541 | 629 | 1979 | 3759 |
| TPCH22 | 377 | 910 | 2412 | 451 | 1333 |
| Total runtime (sec) | 3.905 | 34.85 | 28.771 | 13.231 | 82.237 |

Conclusion

These benchmark results showcase the performance advantages of HeavyDB’s GPU-accelerated architecture, with speedups of up to 130X over traditional cloud data warehouses. More significant than the raw numbers is what this means for organizations: the ability to perform complex analytical queries on massive datasets with sub-second latency, enabling truly interactive data exploration and visualization at scale.

Part of the performance advantage of course lies in the significant compute and memory bandwidth advantages of GPUs. It is still mind-boggling to me that you can effectively rent a supercomputer in a box in the cloud for as little as $2.99 an hour (in the case of the GH200 system). We are entering a golden age of abundant and cheap compute, the full impact of which is only beginning to be felt.

On the other hand, without key architectural innovations in HeavyDB, such as its LLVM-based query compilation engine and advanced GPU memory management, much of this theoretical hardware edge would be left untapped. We’ve spent the last 10 years building out and optimizing our system to run analytical SQL queries as fast as possible on GPUs, and even now we see a long list of opportunities for further optimization: low-level changes to maximize use of the GPU, higher-level changes such as how we reduce results between GPUs, and more mundane improvements such as better query plan rewrites. The pursuit of maximum performance is never finished, and we will continue to follow up with additional benchmarks as we land further optimizations.

We’d love for you to try our platform for yourself, which you can do via a variety of methods, including downloading our Free edition or trying it in our cloud. Until next time, wishing you speedy queries.

Todd Mostak

Todd is the CTO and Co-founder of HEAVY.AI. Todd built the original prototype of HEAVY.AI after tiring of the inability of conventional tools to allow for interactive exploration of big datasets while conducting his Harvard graduate research on the role of Twitter in the Arab Spring. He then joined MIT as a research fellow focusing on GPU databases before turning the HEAVY.AI project into a startup.