Todd Mostak
Oct 10, 2024

Making SQL Smarter: How to Embed AI into your Queries with HeavyIQ


Over the last year we’ve been hard at work bringing native conversational AI capabilities to the HEAVY.AI platform as part of our HeavyIQ module, focused to date primarily on using locally hosted, custom-trained LLMs to let users generate queries and visualizations with natural language.


However, there are many use cases that call for deeper embedding of conversational AI into the analytics pipeline. What if vanilla SQL queries could be augmented to handle use cases that require the sort of “fuzzy intelligence” that humans excel at, but that are difficult to write explicit logic for? ELT use cases like wrangling poorly formatted dates, sentiment analysis, entity extraction, and even basic fact augmentation are difficult, if not impossible, to do with standard SQL, but constitute sweet-spot use cases for LLMs.

To this end, we’ve added the ability to call the HeavyIQ LLM directly from SQL with a new LLM_TRANSFORM operator, which can be a huge time saver for use cases like the above. To make this work, we’ve trained the HeavyIQ Language Model to tersely answer a wide variety of questions, and we even support grammar constraints in the form of allowed outputs and regexes to ensure the output is readily usable downstream for analysis. On the performance side, we call the LLM in parallel across many threads to yield the best possible throughput, and we make only a single call per unique input to avoid redundant processing (for example, calling LLM_TRANSFORM on a column with 1,000 unique values over a billion rows will be fast, as the LLM only needs to be queried 1,000 times).
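As a minimal sketch of the call syntax (the reviews table and product_review column here are hypothetical placeholders, not from an actual dataset), the operator takes the input column and a natural-language instruction, with an optional third argument to constrain the output:

    -- Hypothetical example: extract an entity from free-form text
    -- First argument: the input column; second: a natural-language instruction
    SELECT
      product_review,
      LLM_TRANSFORM(
        product_review,
        'Extract the brand name mentioned in this review, or OTHER if none'
      ) AS brand
    FROM reviews;

Because the operator is just another SQL expression, its output can be selected, filtered, and grouped like any other column.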

The combination of the ease of use of SQL, a state-of-the-art Large Language Model purpose-built for embedded analytics use cases (namely, answering succinctly and reliably across a broad set of data enrichment tasks), and the performance optimizations outlined above makes the new LLM_TRANSFORM operator a powerful way to clean data and perform ELT tasks. Let’s walk through some examples that showcase the benefits of having the power of LLMs at your (SQL) fingertips.

Dirty Dates

We’ll start with an example that’s sure to bring up some uncomfortable memories for anyone who has spent enough time wrangling messy data. If you’re ingesting data from non-standardized sources, such as government parcel data or anything originating from handwritten records, you’ll often find dates recorded in all sorts of different and arcane formats. Because the dates don’t follow any one consistent format, you can’t easily use a regex for transformation. This is where the new LLM_TRANSFORM operator can be highly useful. As you can see below, we’re using LLM_TRANSFORM combined with TRY_CAST to not only clean up the dates into a standardized format, but to actually cast the strings to a DATE type that we could either persist or use for downstream analysis.
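A sketch of what such a query might look like (the raw_parcels table name is a hypothetical placeholder; the messy_date column and prompt mirror the example discussed below):

    -- Normalize free-form date strings with the LLM, then cast them to a DATE type
    SELECT
      messy_date,
      TRY_CAST(
        LLM_TRANSFORM(
          messy_date,
          'Transform this date string into a proper ISO formatted date in the form of YYYY-MM-DD'
        ) AS DATE
      ) AS clean_date
    FROM raw_parcels;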

Note that it was not necessary here, but if we wanted to ensure that the LLM put the dates in the proper format, we could enforce a constraint on the output by adding a regex as an optional third argument: LLM_TRANSFORM(messy_date, 'Transform this date string into a proper ISO formatted date in the form of YYYY-MM-DD', '/\d{4}-\d{2}-\d{2}/'). Note that for the LLM_TRANSFORM operator, regexes are always bounded with / on each side, as they would be in a sed expression.

Classification

Classifying data into categories is a common task in data analysis and ETL (Extract, Transform, Load) processes. This task can serve several purposes:

  1. Data cleaning: Organizing messy data into consistent categories. For example, standardizing political campaign donations where candidate names may have spelling variations or inconsistent use of middle initials.
  2. Data aggregation: Grouping detailed items into broader categories. An example is categorizing credit card transactions into general spending categories like "groceries" or "entertainment".
  3. Predictive analytics: Using classification to make predictions. For instance, identifying potentially fraudulent credit card transactions based on certain characteristics.

The LLM_TRANSFORM operator works well for all of these types of classification tasks, particularly when coupled with the ability to constrain outputs to a discrete set of items by passing a pipe-delimited set of possible outputs as the optional third argument. One example we found useful internally was helping to separate human from bot users when analyzing our internal Slack usage. Not only can LLM_TRANSFORM be used to classify inputs, but the output can be filtered or grouped by like any other SQL output, making downstream analytics on the output of the LLM trivial.
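A hedged sketch of that Slack example (slack_messages and user_name are hypothetical names; the pipe-delimited third argument restricts the output to the listed values):

    -- Classify each poster as a human or a bot, then count messages per class
    SELECT
      LLM_TRANSFORM(
        user_name,
        'Is this Slack user a human or an automated bot?',
        'human|bot'
      ) AS user_type,
      COUNT(*) AS message_count
    FROM slack_messages
    GROUP BY 1
    ORDER BY message_count DESC;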

We’ve just scratched the surface on compelling ways you might use the LLM for embedded classification of your data, and will share other interesting examples in future posts here.

Sentiment Analysis

A classic NLP task is to analyze the sentiment of text, for example to pinpoint shifts in sentiment towards a brand or a political candidate over time. While an LLM can perform sentiment analysis with some degree of accuracy out of the box, we fine-tuned the HeavyIQ LLM specifically on sentiment analysis tasks with human-annotated “ground truth” data to ensure maximum accuracy for this use case. Let’s see how we can use the LLM_TRANSFORM operator to analyze sentiment of Twitter data.
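A sketch of such a query, assuming a hypothetical tweets table with a tweet_text column, and constraining the output to three sentiment labels:

    -- Label each tweet's sentiment, constrained to a discrete set of outputs
    SELECT
      tweet_text,
      LLM_TRANSFORM(
        tweet_text,
        'What is the sentiment of this tweet?',
        'positive|negative|neutral'
      ) AS sentiment
    FROM tweets
    LIMIT 20;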

It’s always good to “vibe check” a model, and here the HeavyIQ LLM returns the results we would expect from a human annotator. To actually use the sentiment output for analysis, we can do things like roll up this data by week to see trends over time.
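For instance, a weekly rollup might look like this (again using the hypothetical tweets table, plus an assumed tweet_time timestamp column):

    -- Count tweets per sentiment label per week
    SELECT
      DATE_TRUNC(WEEK, tweet_time) AS week,
      LLM_TRANSFORM(
        tweet_text,
        'What is the sentiment of this tweet?',
        'positive|negative|neutral'
      ) AS sentiment,
      COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY 1, 2
    ORDER BY week, sentiment;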

And if you forget the syntax of the operator or are less comfortable writing SQL, you can even use HeavyIQ itself to write queries like this and automatically visualize them, for example with the following prompt: “Make a map of the location and sentiment of all tweets within 500 meters of the US Capitol. Use LLM_TRANSFORM to extract the sentiment.” Here, in true Inception style, we are literally using the AI model to write the SQL that calls the AI model.

Early feedback from users has pointed to the immense usefulness of bringing fuzzy intelligence into the SQL pipeline, where it can play the role of a “smart regex” that handles tasks which would ordinarily require dedicated machine learning models. While we have further work slated to improve both the expressive power (for example, support for string interpolation syntax to use multiple column inputs) and the performance (such as more sophisticated query rewrite capabilities to push down simple filters into subqueries when LLM_TRANSFORM is used in WHERE clauses), we think you will find that the new LLM_TRANSFORM operator already unlocks a number of compelling analytic use cases directly from SQL.
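As a final hedged sketch, here is what using LLM_TRANSFORM as a filter in a WHERE clause might look like, again against the hypothetical tweets table:

    -- Count only the tweets the LLM labels as negative
    SELECT
      COUNT(*) AS negative_tweets
    FROM tweets
    WHERE LLM_TRANSFORM(
            tweet_text,
            'What is the sentiment of this tweet?',
            'positive|negative|neutral'
          ) = 'negative';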

If you want to try it out for yourself, feel free to download HEAVY.AI Free or spin up a cloud trial. And as always, we’d love to hear your feedback in our Community Forum.

Todd Mostak

Todd is the CTO and Co-founder of HEAVY.AI. Todd built the original prototype of HEAVY.AI after tiring of the inability of conventional tools to allow for interactive exploration of big datasets while conducting his Harvard graduate research on the role of Twitter in the Arab Spring. He then joined MIT as a research fellow focusing on GPU databases before turning the HEAVY.AI project into a startup.