Todd Mostak
Apr 16, 2025

Connect the Dots in Real-Time: Benchmarking geospatial join performance in GPU-Accelerated HeavyDB against CPU databases

While HEAVY.AI is a general purpose analytics platform, many of our customers’ key use cases are centered around geospatial data and workflows. Part of the reason for this lies in the scalable visualization capabilities of the system; by using GPUs for both querying and their original purpose of rendering data, we allow our users to make scalable and fully interactive maps of billions of points, lines, and polygons, no downsampling or pre-tiling required.

But HEAVY.AI can do far more than just visualize big geospatial datasets; it also has the ability to rapidly query, manipulate, and join such data. HeavyDB, the core SQL engine at the heart of the HEAVY.AI platform, natively supports Open Geospatial Consortium (OGC) datatypes and operators, and can execute spatial operations with unprecedented speed.

Of particular importance for geospatial workflows is spatial join performance, as fast spatial joins allow for ad-hoc and interactive fusion of multiple disparate spatial datasets based on an implicit spatial key. For example, retailers and CPGs might join customer address data to census block groups to get an indication of income and other attributes of their customers, to help improve inventory management or optimize store placement. In other cases, telcos might want to determine the distance between cell towers and customers’ home addresses as well as retail establishments like coffee shops, as distance between handset and tower is the largest determinant of coverage quality.

Given the importance of spatial joins, we set out to test the performance of GPU-accelerated HeavyDB against five databases commonly used for spatial analysis: PostgreSQL, DuckDB, Snowflake, Redshift, and BigQuery. Snowflake, Redshift, and BigQuery natively support the necessary OGC geospatial data types and operators, while PostgreSQL and DuckDB require the PostGIS and DuckDB spatial extensions, respectively. Snowflake was tested on a Large instance costing $16/hr, Redshift on a 48 RPU serverless deployment costing $12/hr, and BigQuery on a 400-slot deployment costing $24/hr. Finally, HeavyDB, Postgres, and DuckDB were tested on a Grace Hopper GH200 server running in Vultr Cloud, available for $2.99/hr. As we’ve written elsewhere, the Grace Hopper and Grace Blackwell architectures, with fast NVLink connections between CPU and GPU, are highly advantageous for GPU-accelerated analytics workloads, as data that doesn’t fit in GPU VRAM can be quickly streamed across the bus from the CPU. Particularly relevant for Postgres and DuckDB, the instance has a powerful 72-core Arm Neoverse V2 CPU, and in past testing we’ve found DuckDB runs significantly faster on it than on much costlier x86 instances.

Our benchmark queries can be bucketed into two categories: point-in-polygon joins and point-to-point distance joins.

Point-in-Polygon Joins

Point-in-polygon joins generally use the ST_CONTAINS or ST_INTERSECTS operators to merge (join) a table containing points (for example, tweets with locations, cell phone records) and a table containing polygon geometries (i.e. census block groups, cell coverage polygons). This generally allows enrichment of the points dataset with metadata from the polygons dataset, in addition to allowing rollup aggregations per polygon (i.e. per census block group).

For example, a common use case for point-in-polygon joins is to take a dataset of interest with a geographic point attribute (i.e. tweets with locations, call data records, traffic tickets, etc) and roll them up to the level of a geographic region like country, state/province, census block group or equivalent, or even building. Note that generally the focal dataset does not already have this region info computed (i.e. it just has the point location and no metadata for geographic region), otherwise it would be easier and more performant to just use the pre-computed region attribute.
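As an illustration, a rollup of tweets to census block groups might look like the sketch below, using the tweet and census block group tables from the benchmarks that follow; the metadata columns (geoid, median_income) are assumptions for the sake of the example, not the exact schema of those tables.

SELECT cbg.geoid,                  -- assumed region identifier carried over from the polygon table
       cbg.median_income,          -- assumed metadata column used to enrich the points
       COUNT(*) AS num_tweets      -- per-polygon rollup of the point dataset
FROM us_tweets_100m AS t
JOIN usa_census_block_groups AS cbg
  ON ST_CONTAINS(cbg.geom, t.location)
GROUP BY cbg.geoid, cbg.median_income;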

As an example, the U.S. Geological Survey (USGS) in conjunction with UC San Diego recently released LiDAR-derived Digital Elevation Model (DEM) and Digital Surface Model (DSM) datasets for the areas impacted by the devastating Palisades and Eaton fires. We wanted to use this data to determine which structures were damaged by the fire, as a potential precursor to building a better fire risk model. To do this, we joined the DEM and DSM datasets, containing over 1B data points each, to Overture building footprints for the area, looking for areas in which the height of man-made structures was close to the base elevation height, suggesting a structure that was burned down. As can be seen in the map below generated from the HeavyImmerse notebook SQL query, the purple structures generally represent areas that were burned over.

Using two point-in-polygon joins to compute the difference between base elevation (DEM) and structure heights (DSM) for structures in the vicinity of the Pacific Palisades fire, to determine which structures were burned over.
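To give a flavor of that workflow, a simplified sketch of the two point-in-polygon joins might look like the following; the table and column names (overture_buildings, dem_points, dsm_points, id, z, location, geom) are illustrative assumptions, not the exact query behind the map above.

WITH dem_per_building AS (
  -- average bare-earth elevation (DEM) inside each building footprint
  SELECT b.id, AVG(p.z) AS base_elevation
  FROM overture_buildings AS b
  JOIN dem_points AS p ON ST_CONTAINS(b.geom, p.location)
  GROUP BY b.id
),
dsm_per_building AS (
  -- average surface elevation (DSM) inside each building footprint
  SELECT b.id, AVG(p.z) AS surface_elevation
  FROM overture_buildings AS b
  JOIN dsm_points AS p ON ST_CONTAINS(b.geom, p.location)
  GROUP BY b.id
)
-- where surface and bare-earth elevations converge, the structure has likely burned down
SELECT dem.id,
       dsm.surface_elevation - dem.base_elevation AS approx_structure_height
FROM dem_per_building AS dem
JOIN dsm_per_building AS dsm ON dem.id = dsm.id;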

With this example use case in mind, let’s review the benchmarks. To start, here are the queries that were run in each system. Note that to isolate geospatial join performance as much as possible, simple count aggregates were computed, which also made it easy to verify correctness in each system by comparing the returned counts. In a real-world scenario, we would generally roll the data up to the geographic region we are spatially joining to, similar to the query above joining elevations to building footprints.

Q1. SELECT COUNT(*) FROM us_tweets_100m AS T1 JOIN usa_states AS T2 ON ST_CONTAINS(T2.geom, T1.location);

Q2. SELECT COUNT(*) FROM us_tweets_100m AS T1 JOIN usa_census_block_groups AS T2 ON ST_CONTAINS(T2.geom, T1.location);

Q3. SELECT COUNT(*) FROM taxi_trips AS T1 JOIN nyc_taxi_zones AS T2 ON ST_CONTAINS(T2.geom, T1.dropoff_point);

Q4. SELECT COUNT(*) FROM taxi_trips AS T1 JOIN nyc_buildings AS T2 ON ST_CONTAINS(T2.geom, T1.dropoff_point);

Below is an overview of the datasets used in testing.

Fact Tables (Point Datasets)

| Table Name | # Rows | Description |
|---|---|---|
| us_tweets_100m | 100,000,000 | Sample of 100 million geocoded tweets from 2020 and 2021, primarily from the US |
| nyc_taxi_trips | 1,095,120,325 | NYC taxi pickup and dropoff points, spanning April 2010 to May 2016 |

Dimension Tables (Polygon Datasets)

| Table Name | # Rows/Polygons | Avg Verts/Polygon | Description |
|---|---|---|---|
| usa_states | 52 | 266 | Geometries and metadata for each US state and territory, from ESRI |
| usa_census_block_groups | 239,203 | 218 | Geometries and census metadata for each US Census Block Group (CBG) for the 2020 census, from ESRI |
| nyc_taxi_zones | 263 | 372 | Geometries for all NYC taxi zones |
| nyc_buildings | 1,083,261 | 8 | Geometries for building footprints in NYC, from NYC Open Data |

And below are the results of the testing. All times, reported in seconds, were measured as the average of 3 “warm” queries after an initial query was run on cold data, to eliminate disparities between data load times and storage subsystem performance of each database. HeavyDB, Snowflake, Redshift, and BigQuery do not require spatial indexes, while for Postgres and DuckDB indexes were added to the polygon/multipolygon columns of the dimension tables to achieve the performance shown. When comparing systems, the extra time and operational complexity of creating and maintaining these indexes should be factored into end-to-end performance and administrative overhead.
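For reference, creating those indexes looks roughly like the following, shown for usa_census_block_groups; the index names are arbitrary, and exact syntax may vary by PostGIS and DuckDB spatial version.

-- PostGIS: GiST index on the polygon column of a dimension table
CREATE INDEX usa_census_block_groups_geom_idx
  ON usa_census_block_groups USING GIST (geom);

-- DuckDB spatial extension: R-tree index on the same column
LOAD spatial;
CREATE INDEX usa_census_block_groups_geom_idx
  ON usa_census_block_groups USING RTREE (geom);

No equivalent step was needed for HeavyDB, Snowflake, Redshift, or BigQuery.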

| | HeavyDB (1x GH200 GPU) | Postgres (GH200 instance, 72 cores) | DuckDB (GH200 instance, 72 cores) | Snowflake (Large) | Redshift (Serverless, 48 RPUs) | BigQuery (400 Slots) |
|---|---|---|---|---|---|---|
| Q1 (s) | 0.635 | 206.729 | 116.526 | 24.794 | 5.324 | 7.878 |
| Q2 (s) | 4.092 | 908.059 | 153.108 | 539.518 | 10.419 | 26.109 |
| Q3 (s) | 5.818 | DNF | 979.848 | 203.484 | 36.78 | 100.25 |
| Q4 (s) | 1.217 | DNF | 987.314 | 94.759 | 19.375 | 134.735 |
| Total Runtime (s) | 11.763 | N/A | 2,236.796 | 862.555 | 71.898 | 268.972 |
| Instance Cost | $2.99/hr | $2.99/hr | $2.99/hr | $16/hr | $12/hr | $24/hr |
| HeavyDB Runtime Advantage | -- | N/A | 190.2X | 73.3X | 6.1X | 22.9X |
| HeavyDB Runtime Price-Performance Advantage | -- | N/A | 190.2X | 392.2X | 24.5X | 183.8X |
| Geometric Mean (s) | 2.071 | N/A | 362.459 | 130.400 | 14.100 | 40.827 |
| HeavyDB Geometric Mean Advantage | -- | N/A | 175.0X | 62.97X | 6.8X | 19.7X |
| HeavyDB Geometric Mean Price-Performance Advantage | -- | N/A | 175.0X | 337.0X | 27.3X | 158.1X |

As you can see, HeavyDB, running on a single GPU, is between 6X and 190X faster than the tested CPU databases; Postgres is excluded from the totals because it could not finish Q3 and Q4. Beyond the raw numbers, the speed differences mean that fairly large geospatial joins can be run interactively in HeavyDB, allowing quick ad-hoc analyses involving spatial joins that would need to run as batch workflows in the other systems.

It should also be noted that the costs of the three cloud data warehouses tested (Snowflake, Redshift, and BigQuery) were significantly higher than the cost of the HeavyDB GH200 instance on Vultr Cloud, making the overall price-performance ratios (the HeavyDB speedup in geometric mean runtime multiplied by the ratio of instance costs) between 27.3X and 337.0X.

Distance Joins

Another relevant geospatial use case is joining datasets based on the distance between geometries. For example, imagine you are a retailer and want to join customer residential addresses to store locations to determine the nearest store for each customer. With geospatial SQL, such use cases are as simple as a join using an ST_DISTANCE or ST_DWITHIN predicate, although, as we’ll see below, the performance you get varies wildly between databases.
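As a minimal sketch, assuming hypothetical customers(customer_id, home_point) and stores(store_id, location) tables, a nearest-store query could look like this:

SELECT c.customer_id,
       -- distance to the nearest store, in the units of the coordinate system
       MIN(ST_DISTANCE(c.home_point, s.location)) AS dist_to_nearest_store
FROM customers AS c
JOIN stores AS s
  -- coarse ST_DWITHIN predicate prunes customer/store pairs that are far apart
  ON ST_DWITHIN(c.home_point, s.location, 0.1)
GROUP BY c.customer_id;

Customers with no store inside the ST_DWITHIN threshold drop out of the result; a LEFT JOIN variant would retain them.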

For simplicity, we ran the joins in their native 4326/WGS84 longitude/latitude coordinates, using the 0.01-degree distance threshold (roughly 1 km) shown in Q5 and Q6 below. In reality, one would likely either reproject to Google Mercator or a local UTM zone, or use casts to GEOGRAPHY, so that the distance threshold could be expressed in meters rather than degrees. However, while HeavyDB has a beta feature for on-the-fly accelerated hash joins on GEOGRAPHY-cast types (not yet included in a release), other systems like Redshift and DuckDB do not support this, so for the time being we benchmarked join performance using degree distances in the native 4326 projection.
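For illustration, the meters-based variants of Q6 would look roughly like the sketches below; the UTM SRID 32611 and the GEOGRAPHY cast syntax are assumptions that differ across databases, and neither form is what we benchmarked.

-- Variant 1: reproject both sides to a local UTM zone so the threshold is in meters
SELECT COUNT(*)
FROM us_tweets_100m AS T1
JOIN world_airports AS T2
  ON ST_DWITHIN(ST_TRANSFORM(T1.location, 32611),
                ST_TRANSFORM(T2.geom, 32611), 1000);

-- Variant 2: cast to GEOGRAPHY so ST_DWITHIN takes its threshold in meters
SELECT COUNT(*)
FROM us_tweets_100m AS T1
JOIN world_airports AS T2
  ON ST_DWITHIN(CAST(T1.location AS GEOGRAPHY),
                CAST(T2.geom AS GEOGRAPHY), 1000);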

Using a geographic distance join to find all cafes that are within 100 meters of a cell tower in San Francisco.

Below are the datasets used in testing:

Fact Tables (Point Datasets)

| Table Name | # Rows | Description |
|---|---|---|
| us_tweets_100m | 100,000,000 | Sample of 100 million geocoded tweets from 2020 and 2021, primarily from the US |
| world_airports | 70,078 | Location and metadata for airports worldwide |

Q5. SELECT COUNT(*) FROM world_airports AS T1 JOIN world_airports AS T2 ON ST_DWITHIN(T1.geom, T2.geom, 0.01) AND T1.id > T2.id;

Q6. SELECT COUNT(*) FROM us_tweets_100m AS T1 JOIN world_airports AS T2 ON ST_DWITHIN(T1.location, T2.geom, 0.01);

Note that for Postgres and DuckDB, spatial indexes were added to all point geometry columns to accelerate joins.

| | HeavyDB (1x GH200 GPU) | Postgres (GH200 instance, 72 cores) | DuckDB (GH200 instance, 72 cores) | Snowflake (Large) | Redshift (Serverless, 48 RPUs) |
|---|---|---|---|---|---|
| Q5 (s) | 0.012 | 0.726 | 504.626 | 0.481 | 4.02 |
| Q6 (s) | 2.267 | 373.532 | DNF | 8.712 | DNF |
| Total (s) | 2.279 | 374.258 | N/A | 9.193 | N/A |
| Instance Cost | $2.99/hr | $2.99/hr | $2.99/hr | $16/hr | $12/hr |
| HeavyDB Runtime Advantage | -- | 164.2X | N/A | 4.0X | N/A |
| HeavyDB Runtime Price-Performance Advantage | -- | 164.2X | N/A | 21.4X | N/A |
| Geometric Mean (s) | 0.165 | 16.468 | N/A | 2.047 | N/A |
| HeavyDB Geometric Mean Advantage | -- | 99.8X | N/A | 12.4X | N/A |
| HeavyDB Geometric Mean Price-Performance Advantage | -- | 99.8X | N/A | 68.4X | N/A |

Outside of HeavyDB, only Snowflake could run the second query, joining 100 million tweet locations to 70K airport locations, in under ten seconds. Surprisingly, DuckDB was very slow in this test, taking over eight minutes just to finish the self-join of the airports table, where HeavyDB took only 12 milliseconds and even Postgres took 726 milliseconds. Interestingly, while Redshift was the fastest system outside of HeavyDB for the point-in-polygon joins, it fared quite poorly in this point-distance test, with Q6 not returning.

Again, HeavyDB running on the GH200 instance yields significant price-performance advantages over the other systems: 68.4X, based on geometric mean runtimes, over the next fastest system, Snowflake.

Conclusion

In conclusion, the above benchmarks demonstrate the substantial advantages of GPU acceleration for geospatial join operations. HeavyDB, running on a cost-effective Grace Hopper GH200 instance, consistently outperformed leading CPU-based databases like PostgreSQL, DuckDB, Snowflake, Redshift, and BigQuery across both point-in-polygon and distance join scenarios, achieving speedups ranging from 7X to over 175X. This raw performance, combined with significantly lower instance costs of HeavyDB running on the GH200 compared to the tested cloud data warehouses, resulted in dramatic price-performance gains, often exceeding 100X. Ultimately, HeavyDB's ability to execute complex spatial joins in seconds, rather than minutes or hours, transforms large-scale geospatial analysis from a batch process into an interactive exploration, empowering users to connect the dots in real-time and derive insights more rapidly and effortlessly than was possible before.

Todd Mostak

Todd is the CEO and Co-founder of HEAVY.AI. Todd built the original prototype of HEAVY.AI after tiring of the inability of conventional tools to allow for interactive exploration of big datasets while conducting his Harvard graduate research on the role of Twitter in the Arab Spring. He then joined MIT as a research fellow focusing on GPU databases before turning the HEAVY.AI project into a startup.