dataquarium

Correlated Subqueries in the Wild with DuckDB, Pt 1

July 18, 2023

Mark Raasveldt recently shared a post about Correlated Subqueries and their performance in DuckDB. As he states in his post,

Subqueries in SQL are a powerful abstraction that allow simple queries to be used as composable building blocks. They allow you to break down complex problems into smaller parts, and subsequently make it easier to write, understand and maintain large and complex queries.

I want to show a couple of real world use cases where correlated subqueries are useful for analyzing business data. In this first post, let’s look at a scenario involving deduplicating data.

Scenario 1: Aggregating data in an OBT at different granularities

I work on an analytics product called Rill that leverages a One Big Table (OBT) architecture for its data models. As the name suggests, with OBT we load all of our data to be analyzed into a single table, as opposed to using multiple tables in a star or snowflake schema. The benefits of using OBT are that the model becomes simpler, and common data analysis operations like filtering and aggregation become faster and more scalable. However, using an OBT presents challenges as well.

One such challenge is that our data is modeled at a single level of granularity. Let's say we have some Sales data, with an OBT structure that shows a list of orders, the order year, the sales amount, and the salesperson:

| OrderId | OrderYear | SalesAmount | Salesperson |
|---------|-----------|-------------|-------------|
| 1       | 2020      | 100         | Beth        |
| 2       | 2021      | 50          | Beth        |
| 3       | 2021      | 20          | Beth        |
| 4       | 2022      | 10          | Beth        |
| 5       | 2021      | 80          | John        |
| 6       | 2021      | 20          | John        |
| 7       | 2022      | 40          | John        |

In this table, our lowest granularity is OrderId, with the associated SalesAmount. This works well for doing calculations on the SalesAmount: we can sum it, average it, and slice it by OrderYear or Salesperson, and every aggregation returns a correct result.
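For example, totals by year or by salesperson are straightforward aggregations (a sketch; the table name `sales` is assumed):

```sql
-- Total sales per year
select OrderYear, sum(SalesAmount) as TotalSales
from sales
group by OrderYear;

-- Total sales per salesperson
select Salesperson, sum(SalesAmount) as TotalSales
from sales
group by Salesperson;
```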

However, what if we add a new measure that doesn't match this level of granularity? For example, let's say that each salesperson has a salary for each year. We define the Salary at the Salesperson and Year level:

| Salesperson | Year | Salary |
|-------------|------|--------|
| Beth        | 2020 | 100    |
| Beth        | 2021 | 100    |
| Beth        | 2022 | 120    |
| John        | 2021 | 80     |
| John        | 2022 | 90     |

When modeling data into an OBT, this Salary data will frequently get duplicated across all records, resulting in the following data:

| OrderId | OrderYear | SalesAmount | Salesperson | Salary |
|---------|-----------|-------------|-------------|--------|
| 1       | 2020      | 100         | Beth        | 100    |
| 2       | 2021      | 50          | Beth        | 100    |
| 3       | 2021      | 20          | Beth        | 100    |
| 4       | 2022      | 10          | Beth        | 120    |
| 5       | 2021      | 80          | John        | 80     |
| 6       | 2021      | 20          | John        | 80     |
| 7       | 2022      | 40          | John        | 90     |

This duplication of records happens due to the mismatched granularities between SalesAmount (OrderId level) and Salary (Salesperson and OrderYear level). It prevents us from getting some pretty basic answers from our OBT with a total salary metric, like the total salary paid out in each year or each salesperson's total salary: a naive sum over the Salary column counts each salary once per order, not once per salesperson-year.
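To see the problem concretely, consider a naive salary total over the duplicated data (a sketch; the table name `sales` is assumed):

```sql
-- Naive aggregation double-counts duplicated salaries: Beth has two
-- orders in 2021, so her 2021 "Total Salary" comes out as 200, not 100.
select OrderYear, Salesperson, sum(Salary) as "Total Salary"
from sales
group by OrderYear, Salesperson;
```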

In Rill, users define a SQL expression for a metric in a dashboard YAML config file. That expression is then injected into SQL queries for calculating KPIs, trends, and dimension leaderboards. Therefore, we need a single expression for total salary that will properly deduplicate our Salary column by Salesperson and OrderYear when calculating any kind of aggregate over our OBT. In other words, our measure expression needs to work when injected into many different query shapes: unfiltered KPI totals, trends grouped by time, and leaderboards grouped by a dimension.
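For instance, the same expression might be injected into query shapes like these (illustrative; `<measure expression>` stands in for the SQL we are about to build, and the table name `sales` is assumed):

```sql
-- Big-number KPI over the whole (possibly filtered) table
select <measure expression> from sales;

-- Trend by year
select OrderYear, <measure expression> from sales group by OrderYear;

-- Dimension leaderboard
select Salesperson, <measure expression> from sales group by Salesperson;
```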

Enter correlated subqueries. Rill is powered by DuckDB, so we can take advantage of its correlated subquery performance to solve this problem. With a correlated subquery, we can write a query as our measure expression. We can borrow a trick found in Malloy for doing this deduping. In our query, we first use a list aggregation function to collect all of our distinct Salaries by OrderYear and Salesperson:

list(distinct { key: concat(Salesperson,'~',OrderYear), val: Salary})
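Grouped by Salesperson, for example, the distinct struct list collapses Beth's two identical 2021 rows into a single entry (a sketch; the table name `sales` is assumed, and list element order may vary):

```sql
-- Beth's list holds one entry per year:
--   {key: Beth~2020, val: 100}, {key: Beth~2021, val: 100}, {key: Beth~2022, val: 120}
select Salesperson,
       list(distinct { key: concat(Salesperson, '~', OrderYear), val: Salary }) as salaries
from sales
group by Salesperson;
```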

Then, we can use unnest to expand that list into a table of values that can be selected from:

select unnest(list(...)) inner;

Finally, we can run our sum aggregation step over that inner table of deduped records:

select sum(inner.val) from (select unnest(list(...)) inner);

That entire query statement can be used as our measure expression:

(select sum(inner.val) from
  (select unnest(list(distinct { key: concat(Salesperson,'~',OrderYear), val: Salary})) inner)
) as "Total Salary"

If we try this expression in various query contexts with different aggregations and filters, we will see it consistently gives us the correct answer we were looking for:

[Screenshot: inspecting the correct query results in Rill]

That's it for this scenario. In Pt 2, I will show how to write a measure expression that accumulates over a timestamp column, using a correlated subquery that can be reused across aggregation contexts.

Tags: SQL, DuckDB