Todd Mostak
Oct 29, 2024

Speed at Scale: Benchmarking GPU-Accelerated HeavyDB


From the very beginning, HEAVY.AI was built as a vertically integrated platform designed to allow fully interactive visual exploration of massive multi-billion row datasets, without needing to index, downsample, or pre-aggregate the data. This required the development of a built-in capability to zero-copy render large datasets using the GPU, as well as a web-based visual analytics frontend, HeavyImmerse, that would allow users to effortlessly slice-and-dice data with a few clicks and drags of their mouse.

However, none of the blistering speed the platform is known for would be possible without the performance of its core engine, the GPU-accelerated analytics database HeavyDB. Not only was HeavyDB built from the ground up to leverage the unique architecture of Nvidia GPUs, but we architected it to ensure it could exploit the full parallelism of modern hardware, both GPU and CPU. Here is a non-exhaustive list of key architectural innovations of the system:

  1. HeavyDB is built around a highly optimized compilation engine that lowers query plans to LLVM IR, from which we can target Nvidia GPUs as well as X86 and ARM CPUs, tailoring optimization passes for each architecture (see the toy illustration following this list).
  2. There are a number of strategies that are taken for generating and probing group-by and join hash tables, depending on the hardware and the cardinalities and characteristics of the data involved. For example, on GPU, hash tables can be global or privatized per CUDA block, can be put in global or shared memory, and can use a murmur or perfect hash strategy, all depending on the situation to maximize performance.
  3. HeavyDB is built to carefully manage the memory across multiple GPUs per server with a separate buffer pool per GPU (and CPU), keeping data close to compute and effectively managing the complexity of a distributed database implementation within a single process for maximum performance.
  4. As mentioned above, for visualization workloads, we can optionally zero-copy render SQL query results using a subset of the open-source Vega specification, particularly powerful for generating instant visualizations of billions of geospatial records (points, lines, and polygons).
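
To make the first point concrete, below is a toy illustration of lowering an expression to LLVM IR using the open-source llvmlite Python package, which wraps the same LLVM infrastructure. This is not HeavyDB’s actual code generation path; the predicate and all names are invented for the example.

```python
# Build LLVM IR for a trivial row-level predicate: x > 10 AND y < 100.
# A real engine would lower an entire query plan this way, then run
# architecture-specific optimization passes and JIT-compile the result
# for CPU (X86/ARM) or GPU (NVPTX).
from llvmlite import ir

i64 = ir.IntType(64)
module = ir.Module(name="toy_query")

fnty = ir.FunctionType(ir.IntType(1), [i64, i64])  # (i64, i64) -> i1
func = ir.Function(module, fnty, name="row_predicate")
x, y = func.args

builder = ir.IRBuilder(func.append_basic_block(name="entry"))
gt = builder.icmp_signed(">", x, ir.Constant(i64, 10))
lt = builder.icmp_signed("<", y, ir.Constant(i64, 100))
builder.ret(builder.and_(gt, lt))

print(module)  # emits the textual LLVM IR for inspection
```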

While we often focus on the capabilities of the full vertical stack, over time our core engine has improved significantly in both performance and functionality. Hence we thought it would be interesting to test the raw performance of the system and compare it against leading CPU-based cloud data warehouses, using the widely used Star Schema Benchmark (SSB) and TPC-H tests. We think the results are impressive (with still further room for optimization), but we’ll let you judge for yourself.

Hardware and Software Setup

We partnered with Vultr, a leading GPU cloud provider, which provided the instances used for testing. We benchmarked our software on a machine with eight 80GB Nvidia A100 GPUs (640GB VRAM total), costing $12.74/hr, and on one of Nvidia’s new ARM-based Grace Hopper GH200 instances, costing $2.99/hr, with a single 96GB Hopper GPU able to move data to and from 480GB of CPU RAM over an ultra-fast NVLink interconnect with 960GB/sec of bidirectional bandwidth. As we will cover in more detail in a future post, we believe that the fast NVLink between CPU and GPU in these new Nvidia ARM-based systems (Grace Hopper currently, to be followed by Grace Blackwell) represents a breakthrough in the types of problems and scales of data that can be processed efficiently on GPU.

Both GPU setups are effectively commodity supercomputers; both would have easily made the Top500 supercomputer list a little over 10 years ago. The GH200 system, with a single GPU, possesses 67 teraFLOPS of FP64 compute and nearly 4 petaFLOPS using the new FP8 standard, with up to 144GB of GPU memory at 5TB/sec (our system had 96GB VRAM at 4TB/sec). The 8XA100 system, while based on an older (Ampere) architecture, makes up for it in aggregate compute and memory bandwidth by having 8 GPUs, for a total of 156 teraFLOPS of FP64 compute with 640GB VRAM accessible at 16TB/sec. It should be noted that the newer Hopper architecture of the GH200 possesses other key architectural advantages (e.g. larger caches) that allow the small GH200 machine to best the much larger 8XA100 machine in some scenarios, as seen below.

To provide a baseline by which to gauge our performance, we also benchmarked the leading cloud data warehouse vendors: Snowflake (Large instance, costing $16/hr), Databricks (Large instance, costing $28/hr), Google BigQuery (400 query slots, costing $24/hr), and Amazon Redshift Serverless (32 RPUs, costing $12/hr).

Methodology

Following the trend of other published benchmarks, we ran the tests “warm”, taking the average timing of the 2nd, 3rd, and 4th run of each query. This eliminates the overhead of moving data from storage on the first run, which would essentially make the benchmark storage-bound rather than a measure of execution performance. It also dovetails with common usage patterns of our system, where users typically identify data of interest (say, certain columns of a table for a given time range) and repeatedly query that subset. For certain vendors like Snowflake, we had to disable result set caching, which is enabled by default and otherwise makes queries after the first run return nearly instantaneously.
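
As a minimal sketch of this methodology (the `warm_time` helper and its `execute` callable are illustrative stand-ins, not the actual benchmark scripts used here):

```python
import statistics
import time

def warm_time(execute, query, runs=4):
    """Time `query` over `runs` executions and return the mean latency
    (seconds) of every run after the first, discarding the cold run that
    pays the cost of moving data from storage."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        execute(query)  # run 1 is cold; runs 2..4 measure execution speed
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings[1:])
```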

Star Schema Benchmark (SSB)

The Star Schema Benchmark (SSB) is a popular SQL benchmark designed to evaluate the performance of database systems when handling complex analytical queries. It models a typical data warehouse environment, specifically focusing on a “star schema” data model where a large fact table is surrounded by smaller dimension tables, commonly used in business intelligence and reporting systems. It should be noted that SSB is derived from the TPC-H benchmark but is simplified to emphasize key aspects of star schema-based workloads.

For our tests we used a Scale Factor (SF) of 1000, meaning 1000GB (1TB) of data, comprising approximately 6 billion records. This is a common scale at which to run the benchmark, and illustrative of the data scales our customers commonly target with our system.
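
For reference, here is the first query of the suite, Q1.1, per the published SSB specification (some implementations rename the `date` table, e.g. to `ddate`, to sidestep the SQL reserved word), wrapped as a Python string so it could feed a harness like `warm_time` above:

```python
# SSB Q1.1: revenue from a band of discounted, small-quantity orders in 1993.
SSB_Q1_1 = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder, date
WHERE lo_orderdate = d_datekey
  AND d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
"""
```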

All per-query times are in milliseconds; total runtimes are in seconds.

| Query | HeavyDB GH200 | HeavyDB 8XA100 | Snowflake Large | Databricks Large | Redshift Serverless 32 RPUs | BigQuery 400 Slots |
|---|---|---|---|---|---|---|
| Q001.1 | 86 | 30 | 3026 | 1903 | 994 | 3305 |
| Q001.2 | 72 | 29 | 2129 | 1769 | 782 | 2968 |
| Q001.3 | 60 | 29 | 1752 | 1881 | 723 | 2771 |
| Q002.1 | 203 | 82 | 5252 | 7652 | 1994 | 9999 |
| Q002.2 | 289 | 98 | 4488 | 7426 | 1638 | 8511 |
| Q002.3 | 190 | 63 | 2741 | 6810 | 1331 | 8166 |
| Q003.1 | 307 | 112 | 8648 | 16449 | 9726 | 56843 |
| Q003.2 | 213 | 232 | 4787 | 7637 | 2742 | 15258 |
| Q003.3 | 190 | 216 | 3669 | 7627 | 1443 | 19007 |
| Q003.4 | 212 | 223 | 2624 | 2821 | 1411 | 18997 |
| Q004.1 | 816 | 142 | 9352 | 12661 | 8635 | 22653 |
| Q004.2 | 715 | 144 | 5739 | 7647 | 5048 | 23762 |
| Q004.3 | 348 | 155 | 5662 | 5586 | 2075 | 10926 |
| Total runtime (sec) | 3.701 | 1.555 | 59.869 | 87.869 | 38.542 | 203.166 |

As the results show, HeavyDB is significantly faster than all of the cloud data warehouses on this workload on both the 8XA100 and 1XGH200 instances. The largest advantage accrues to the 8XA100 setup, likely because the Star Schema Benchmark, with its simpler queries and fewer intermediate reductions, parallelizes especially well across multiple GPUs.

In addition, the hot data for some queries did not fit entirely in the 96GB VRAM buffer of the GH200 instance, requiring paging from CPU memory over NVLink. Although this interconnect, at 480GB/sec unidirectional (960GB/sec bidirectional), is significantly faster than any PCIe-based interconnect (such as the 8XA100 system uses), it is still not as fast as the 4TB/sec HBM3 VRAM, so the queries that had to page from CPU were somewhat slower. That said, it’s a huge leap forward to have such high bandwidth between CPU and GPU, portending an end to the “thin straw” scenario of pulling large amounts of data over low-bandwidth PCIe. For common usage scenarios of our system, such as when many users are active or datasets cannot fit entirely in VRAM, such a fast interconnect is a boon: the entire 96GB contents of VRAM can be swapped in a barely noticeable 200 milliseconds or so (96GB at 480GB/sec).

Overall, HeavyDB on the 8XA100 machine is between 25X and 130X faster than the other data warehouses, while costing less to run than all of them except Redshift. Notably, nearly half of the queries (6 of 13) ran in under 100ms, and all ran in under 240ms. This points to the differentiation the HEAVY.AI platform offers for interactive analytics: even at the 1TB scale, queries of moderate complexity return well within the latency threshold typically termed fully interactive (under 500ms), even with multiple such queries on a dashboard or multiple users querying the system simultaneously.

TPC-H

The TPC-H Benchmark is an industry-standard SQL benchmark designed to evaluate the performance of decision support systems by simulating complex business-oriented queries. It represents a broad range of ad hoc queries and concurrent data modifications, reflecting real-world applications where databases are used for analytical processing and reporting. Generally, the queries are more complex than those in the Star Schema Benchmark, and they revolve around a snowflake rather than a star schema.

The GH200 instance shined on this benchmark in particular. Despite having far less aggregate GPU memory bandwidth than the 8XA100 system, the lower reduction overhead of running on a single GPU (instead of eight), combined with specific innovations introduced with the Hopper architecture, yielded significant performance gains compared to the 8-GPU Ampere instance. Compared to the CPU data warehouse systems tested, HEAVY.AI on the GH200 instance was between 3.4X and 21.1X faster in absolute terms, and, normalized for cost, 13.6X to 169X faster per dollar.
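
For transparency, here is how those figures fall out of the total runtimes in the table below (which exclude TPCH21, as explained next) and the hourly instance costs quoted earlier; this is a simple derivation using only numbers from this post:

```python
# Absolute and cost-normalized TPC-H speedups for HeavyDB on GH200,
# from total runtimes in seconds (excluding TPCH21) and $/hr costs.
GH200 = {"runtime": 3.905, "cost": 2.99}

others = {
    "Snowflake Large":    {"runtime": 34.850, "cost": 16.0},
    "Databricks Large":   {"runtime": 28.771, "cost": 28.0},
    "Redshift 32 RPUs":   {"runtime": 13.231, "cost": 12.0},
    "BigQuery 400 Slots": {"runtime": 82.237, "cost": 24.0},
}

for name, other in others.items():
    speedup = other["runtime"] / GH200["runtime"]           # absolute
    per_dollar = speedup * (other["cost"] / GH200["cost"])  # cost-normalized
    print(f"{name}: {speedup:.1f}X faster, {per_dollar:.1f}X per dollar")

# Redshift is the nearest competitor at 3.4X (13.6X per dollar);
# BigQuery is the furthest at 21.1X (169X per dollar).
```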

You’ll see a blank result for TPCH21, which contains a form of correlated subquery that we currently cannot rewrite; future work will address this. For parity in reporting, we subtracted the TPCH21 times from the other databases’ totals. While the remaining queries are quite performant, we have identified optimizations that will further improve HeavyDB’s performance across these queries, and plan to land the associated changes over upcoming releases.

All per-query times are in milliseconds; total runtimes are in seconds and exclude TPCH21.

| Query | HeavyDB GH200 | Snowflake Large | Databricks Large | Redshift Serverless 32 RPUs | BigQuery 400 Slots |
|---|---|---|---|---|---|
| TPCH01 | 43 | 1656 | 1514 | 1081 | 1941 |
| TPCH02 | 111 | 926 | 877 | 505 | 893 |
| TPCH03 | 36 | 1834 | 2836 | 414 | 7768 |
| TPCH04 | 427 | 1507 | 2053 | 358 | 3670 |
| TPCH05 | 48 | 1752 | 785 | 624 | 11192 |
| TPCH06 | 12 | 280 | 559 | 125 | 855 |
| TPCH07 | 60 | 1218 | 657 | 416 | 1451 |
| TPCH08 | 97 | 1820 | 866 | 401 | 11223 |
| TPCH09 | 379 | 2712 | 597 | 1006 | 1198 |
| TPCH10 | 64 | 2069 | 3632 | 725 | 3119 |
| TPCH11 | 117 | 2924 | 549 | 463 | 922 |
| TPCH12 | 22 | 860 | 1475 | 286 | 4605 |
| TPCH13 | 73 | 3382 | 1690 | 1624 | 2793 |
| TPCH14 | 30 | 696 | 1072 | 212 | 739 |
| TPCH15 | 88 | 766 | 673 | 756 | 2614 |
| TPCH16 | 453 | 2256 | 689 | 547 | 905 |
| TPCH17 | 307 | 943 | 1738 | 440 | 9924 |
| TPCH18 | 291 | 3198 | 2217 | 1307 | 12916 |
| TPCH19 | 93 | 1016 | 1230 | 836 | 652 |
| TPCH20 | 777 | 2125 | 650 | 654 | 1524 |
| TPCH21 | — | 3541 | 629 | 1979 | 3759 |
| TPCH22 | 377 | 910 | 2412 | 451 | 1333 |
| Total runtime (sec) | 3.905 | 34.85 | 28.771 | 13.231 | 82.237 |

Conclusion

These benchmark results showcase the performance advantages of HeavyDB’s GPU-accelerated architecture, with speedups of up to 130X over traditional cloud data warehouses. More significant than the raw numbers is what this means for organizations: the ability to perform complex analytical queries on massive datasets with sub-second latency, enabling truly interactive data exploration and visualization at scale.

Part of the performance advantage of course lies in the significant compute and memory bandwidth advantages of GPUs. It is still mind-boggling to me that you can effectively rent a supercomputer in a box in the cloud for as little as $2.99 an hour (in the case of the GH200 system). We are entering a golden age of abundant and cheap compute, the full impact of which is only beginning to be felt.

On the other hand, without key architectural innovations in HeavyDB, such as its LLVM-based query compilation engine and advanced GPU memory management, much of this theoretical hardware edge would be left untapped. We’ve spent the last 10 years building out and optimizing our system to run analytical SQL queries as fast as possible on GPUs, and even now we see a long list of opportunities for further optimization: low-level changes to maximize use of the GPU, higher-level changes such as how we reduce results between GPUs, and more mundane improvements such as better query plan rewrites. The pursuit of maximum performance is never finished, and we will continue to follow up with additional benchmarks as we land further optimizations.

We’d love for you to try our platform for yourself, which you can do via a variety of methods, including downloading our Free edition or trying it in our cloud. Until next time, wishing you speedy queries.

Todd Mostak

Todd is the CTO and Co-founder of HEAVY.AI. Todd built the original prototype of HEAVY.AI after tiring of the inability of conventional tools to allow for interactive exploration of big datasets while conducting his Harvard graduate research on the role of Twitter in the Arab Spring. He then joined MIT as a research fellow focusing on GPU databases before turning the HEAVY.AI project into a startup.