Skill mastery is an endeavor of diminishing returns. You can get from zero to good enough (think first-time chess player to a 500 online rating) within 100 hours of deliberate practice; it is vastly more difficult to get from good enough to mastery (think a 500-rated chess player trying to become a 2200-FIDE-rated Candidate Master), which requires over 100x more deliberate practice. SQL is the basic building block of data analysis, and I posit that most people (product managers, marketers, salespeople, and other makeshift data analysts) simply need to be good enough.
This tutorial aims to cover almost all the SQL techniques you need to glean actionable insights from your (non-transactional) datasets. We will use the Taobao User Behavior dataset and duckdb to simulate a SQL interface. There will be idiosyncrasies depending on which flavor of SQL you are using (Postgres, Presto, Redshift, BigQuery, and so on), but you should be able to adapt the principles outlined here with a little help from modern coding assistants.
I recommend that you first skim through this post to get a rough idea of what SQL is all about, then move on to complete the exercise for some hands-on shenanigans. While completing it, feel free to refer back to this post and read some sections in more detail. The road to becoming a true SQL monkey starts with writing queries.
Code
import duckdb
import pandas as pd
pd.set_option('display.max_rows', 6)

import numpy as np
import random
from tqdm.auto import tqdm
import os
import timeit

df = pd.read_csv('../../data/sql_almost/UserBehavior.csv', header=None)
df.columns = ['user_id','item_id','category_id','behavior_type','timestamp']

# sample 5M rows out of 100M to run in reasonable time
df = df.sample(n=5_000_000, random_state=112).reset_index(drop=True)

# # save to parquet
# con = duckdb.connect()
# con.register('transaction_tbl', df)
# output_dir = '../../data/sql_almost/transaction_tbl'
# os.makedirs(output_dir, exist_ok=True)
# con.execute(f"""
# COPY transaction_tbl
# TO '{output_dir}'
# (FORMAT PARQUET, PARTITION_BY (category_id, behavior_type), OVERWRITE_OR_IGNORE);
# """)
0. Intuition
Imagine you are at a library looking for a certain piece of information. The entire library is off limits to the general public. The only way you can access any information is to tell the librarian exactly what you want and let them fetch it for you. The entire library is your database. The librarian is your query engine and what you tell them is your query. In the library, there are a number of shelves (representing tables) that contain any number of books (representing rows in a table) to answer your query. Each book from the same shelf contains the same number of pages (representing columns or fields in a row), each page having a distinct piece of information. Following your instruction, the librarian may walk to different shelves, scour some books and pages, mix-and-match the information, then present the answer to you. Therefore, our task is to give an instruction such that the librarian can give us the most accurate answer using the least time and energy possible.
1. Sanity Checks Keep You Sane
We are treating the Taobao User Behavior dataset as our table called transaction_tbl. It is a 5M-row subset stored as parquet files partitioned by category_id and behavior_type.
Code
con = duckdb.connect()

# short-hand function to execute a query with duckdb easily
def execute_query(query):
    return con.execute(query).fetchdf()

query = f"""
CREATE OR REPLACE TABLE transaction_tbl AS
SELECT *
FROM read_parquet('../../data/sql_almost/transaction_tbl/*/*/*.parquet', hive_partitioning=true);
"""

# load saved, partitioned table as transaction_tbl
execute_query(query)
The table is a run-of-the-mill user behavior log consisting of
user_id: identifier of the user
item_id: identifier of the item
timestamp: unix timestamp of when the action happened
behavior_type: what type of action it was
category_id: identifier of the category the item belongs to
Example: select *
The first thing you do with a table you have not seen before is to select a few rows to look at. The text in the variable query is what you would type into your SQL interface.
Code
query = f"""
select * from transaction_tbl limit 100;
"""
execute_query(query)
    user_id  item_id   timestamp   behavior_type  category_id
0   184439   688483    1512105722  buy            1000858
1   860167   3032030   1511876592  pv             1000858
2   29019    1174848   1511952716  pv             1000858
..  ...      ...       ...         ...            ...
97  109393   336502    1512302682  pv             1003726
98  928768   2826670   1511780591  pv             1003726
99  719622   2297792   1511703912  pv             1003726

100 rows × 5 columns
select * returns all columns from a table, and is almost always followed by limit [NUMBER OF ROWS YOU WANT] so that you do not return the entire table but only a small, arbitrary subset of it. This is less of a problem nowadays since most query engines have a built-in safeguard against displaying an entire table as a result; otherwise you could crash your system by asking for terabytes of output.
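If you only care about a few columns, you can name them explicitly instead of using *. A minimal sketch using the columns described above:
Code
query = f"""
select user_id
    ,item_id
from transaction_tbl
limit 10;
"""
execute_query(query)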
Example: describe
Another useful command is describe. It will tell you what the columns are, their data types, whether they can contain null (missing) values, and so on. Common data types are int/bigint for integers, float/double for approximate decimals, varchar/string for text, boolean for true/false, and timestamp/date for, well, timestamps and dates. However, they always differ ever so slightly depending on which SQL you are using, so it is better to confirm with the documentation, such as this one for Presto.
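In duckdb this is a one-liner, as in the sketch below; the exact command name and output columns vary by engine, so treat this as an illustration rather than a universal recipe.
Code
query = f"""
describe transaction_tbl;
"""
execute_query(query)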
The last check you always want to do is to count how many rows exist in the table:
Code
query = f"""
select count(*) from transaction_tbl;
"""
execute_query(query)
   count_star()
0  5000000
We can also count how many unique values are in any column by adding distinct in front of the column name inside count. You can also name your derived columns with as so you know which one means what.
Code
# count how many unique users
query = f"""
select count(*) as nb_event
    ,count(distinct user_id) as nb_user
from transaction_tbl;
"""
execute_query(query)
   nb_event  nb_user
0  5000000   888335
2. Assorted Platter of Selects
Now that you know how to sanity-check a table, let us move on to selecting what you want. This is often done by filtering the result with conditions using the where clause. Here are some examples you want to familiarize yourself with.
Example: Who are the users that viewed item 2067266?
Link multiple conditions with logical operators and / or.
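A minimal sketch of what this could look like, assuming a "view" means a pv event; we revisit this exact question in section 3 when we talk about partitions:
Code
query = f"""
select distinct user_id
from transaction_tbl
where item_id = 2067266
    and behavior_type = 'pv';
"""
execute_query(query)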
Example: How many users interact (pv, cart, fav) with Taobao but never buy?
We can use not as a negation for in.
Code
query = f"""
select count(distinct user_id)
from transaction_tbl
where behavior_type not in ('buy');
"""
execute_query(query)
   count(DISTINCT user_id)
0  882077
Example: How many behavior_type values end with a v?
We use like to match text that is similar to what we want. % is the wildcard meaning anything can come before/after it. In this simple example, we already know the answer is 2: pv and fav.
Code
query = f"""
select count(distinct behavior_type)
from transaction_tbl
where behavior_type like '%v';
"""
execute_query(query)
   count(DISTINCT behavior_type)
0  2
3. Partitions, Partitions, Partitions
Once you have mastered the basic select patterns, the next thing you must never forget when selecting from a table is to ALWAYS SPECIFY THE PARTITIONS YOU NEED IN THE WHERE CLAUSE. A modern SQL table is stored as multiple compressed files (most commonly parquet) in a nested subfolder structure as seen below. In this case, the partitions (subfolders) are columns category_id and behavior_type.
Fun Fact: Not only do you need to specify the partitions, you need to specify them with the correct data types. I once had a query that ran for 12 hours instead of 5 minutes, simply because I did not check that the partition was a datetime, not a string. Do not be me.
Code
def print_table_structure(table_path, first_category_id_dir):
    # print root folder
    print(table_path.split(os.sep)[-1])

    # get category_id partitions
    category_ids = sorted([d for d in os.listdir(table_path)
                           if os.path.isdir(os.path.join(table_path, d)) and d.startswith('category_id=')])

    # print only the first category_id partition
    category_id_path = os.path.join(table_path, first_category_id_dir)
    print(f" └── {first_category_id_dir}")

    # get behavior_type partitions
    behavior_types = sorted([d for d in os.listdir(category_id_path)
                             if os.path.isdir(os.path.join(category_id_path, d)) and d.startswith('behavior_type=')])

    # print all behavior_type partitions
    for behavior_type_dir in behavior_types:
        behavior_type_path = os.path.join(category_id_path, behavior_type_dir)
        print(f"     └── {behavior_type_dir}")

        # get all parquet files
        files = sorted([f for f in os.listdir(behavior_type_path) if f.endswith('.parquet')])

        # print all files under the category_id, behavior_type partitions
        for i, file in enumerate(files):
            if i == len(files) - 1:
                print(f"         └── {file}")
            else:
                print(f"         ├── {file}")

    if len(category_ids) > 1:
        print(f" {len(category_ids) - 1} more category_id folders...")

print_table_structure('../../data/sql_almost/transaction_tbl', 'category_id=1462446')
To understand why specifying partitions is crucial, let us try our first query again and see how long it takes to run. We conduct 100 trials of 100 runs each to get mean and standard deviation of query time.
Example: Who are the users that viewed item 2067266? Run WITHOUT partition in where clause.
Code
query = """
select distinct user_id from transaction_tbl
where item_id = 2067266;
"""
ts = timeit.repeat(lambda: execute_query(query), number=100, repeat=100)
print(f'Query time WITHOUT partition: {np.mean(ts):.2f}±{np.std(ts):.2f} seconds')
Query time WITHOUT partition: 0.15±0.08 seconds
Example: Who are the users that viewed item 2067266? Run WITH category_id partition in where clause.
Code
query = """
select distinct user_id from transaction_tbl
where item_id = 2067266 and category_id = 4339722;
"""
ts = timeit.repeat(lambda: execute_query(query), number=100, repeat=100)
print(f'Query time WITH partition: {np.mean(ts):.2f}±{np.std(ts):.2f} seconds')
Query time WITH partition: 0.06±0.06 seconds
What sorcery is this? The query time is roughly halved! This is because when we include partitions in the where clause, we are telling the query engine to only look at specific parts of the data rather than the entire table. In real life, not specifying the partitions you need, as precisely as possible, can spell the difference between waiting two hours and a few seconds.
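As a rule of thumb, spell out every partition column you can and compare each one to a literal of the matching data type. A hedged sketch, reusing the values from the query above:
Code
query = f"""
select distinct user_id
from transaction_tbl
where item_id = 2067266
    and category_id = 4339722   -- partition column, integer literal matching its type
    and behavior_type = 'pv'    -- partition column, string literal matching its type
;
"""
execute_query(query)
Depending on the engine, comparing category_id to the string '4339722' may still return the right rows, but it can force a cast on every row and defeat partition pruning, which is exactly the 12-hour mistake described in the fun fact above.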
4. Column-wise Manipulation
Before we move on, you might have noticed that our dataset is a little bland with only a timestamp and categorical columns (timestamp, user_id, item_id, category_id, behavior_type). In reality, a table like the one we are using often contains the price of the item and the quantity purchased. We can add those by manipulating existing columns.
Example: Add price column where price is item_id modulus 1000 + 50
Here we add the derived price column to the result. You can manipulate any numeric column with arithmetic operators.
Example: Add quantity column that is a random number between 1 and 10, only for buy events. For all other events, leave it as missing values (null).
When you buy from a store, you need to specify a quantity of the items; however, for other events (when you view, favorite, or add to cart), you do not. We must give the query engine if-else logic depending on the values in each row. In SQL, we do this using case when [CONDITION] then [VALUE] else [DEFAULT VALUE] end. We use setseed to keep the randomized numbers reproducible.
Code
# random() generates a number between 0 and 1, then floor rounds it down
query = f"""
select setseed(0.112);
select *
    ,item_id % 1000 + 50 as price
    ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
from transaction_tbl
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity
0        184439   688483   1512105722  buy            1000858      533    2.0
1        860167   3032030  1511876592  pv             1000858      80     NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN
...      ...      ...      ...         ...            ...          ...    ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN

5000000 rows × 7 columns
We can also have as many when clauses as we want. If we explicitly state the conditions for every behavior_type, it looks something like:
Code
query = f"""
select setseed(0.112);
select *
    ,item_id % 1000 + 50 as price
    ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1
          when behavior_type = 'pv' then null
          when behavior_type = 'cart' then null
          when behavior_type = 'fav' then null
          else null end as quantity
from transaction_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity
0        184439   688483   1512105722  buy            1000858      533    4.0
1        860167   3032030  1511876592  pv             1000858      80     NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN
...      ...      ...      ...         ...            ...          ...    ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN

5000000 rows × 7 columns
Example: Calculate sales by multiplying price and quantity
We can do operations among columns of the table. This example requires us to perform select twice: first to create price and quantity, then to create sales by multiplying them together. We do this by writing a subquery.
Code
query = f"""
select setseed(0.112);
select whatever_subquery_alias_you_want.*
    ,price * quantity as sales
from (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
) whatever_subquery_alias_you_want;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales
0        184439   688483   1512105722  buy            1000858      533    6.0       3198.0
1        860167   3032030  1511876592  pv             1000858      80     NaN       NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN       NaN
...      ...      ...      ...         ...            ...          ...    ...       ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN       NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN       NaN

5000000 rows × 8 columns
You will notice that the query becomes much less readable with subqueries. This is especially the case when you have multiple nested subqueries; they become unreadable even to you three months later. An elegant solution is to separate these subqueries with with clauses. This is especially useful if you have a subquery you would like to reuse later in the same query. There is usually no performance difference between subqueries and with clauses, so pick whichever is easiest for you to read.
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
)
select *
    ,price * quantity as sales
from t1_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales
0        184439   688483   1512105722  buy            1000858      533    6.0       3198.0
1        860167   3032030  1511876592  pv             1000858      80     NaN       NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN       NaN
...      ...      ...      ...         ...            ...          ...    ...       ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN       NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN       NaN

5000000 rows × 8 columns
Example: Fill in missing values in sales with zero.
We use coalesce to fill in missing values. Notice that we can have multiple with clauses (consecutive ones punctuated by , and not requiring another with), depending on what you think is most readable.
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
),
t2_tbl as (
    select *
        ,price * quantity as sales
    from t1_tbl
)
select user_id
    ,item_id
    ,timestamp
    ,behavior_type
    ,category_id
    ,price
    ,quantity
    ,coalesce(sales, 0) as sales
from t2_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales
0        184439   688483   1512105722  buy            1000858      533    2.0       1066.0
1        860167   3032030  1511876592  pv             1000858      80     NaN       0.0
2        29019    1174848  1511952716  pv             1000858      898    NaN       0.0
...      ...      ...      ...         ...            ...          ...    ...       ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       0.0
4999998  891831   2064564  1512255716  pv             999980       614    NaN       0.0
4999999  10430    2006178  1512012994  pv             999980       228    NaN       0.0

5000000 rows × 8 columns
Example: Convert timestamp from unix timestamp to yyyy-mm-dd format.
Datetime conversion, as in any programming language, is a very confusing affair, so I highly recommend you refer to your specific SQL's documentation, such as this one for Spark. But the idea is simply applying a function over your columns. Here we use to_timestamp to convert the integer unix timestamp into a proper timestamp, then strftime to format it as a yyyy-mm-dd string.
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
),
t2_tbl as (
    select *
        ,price * quantity as sales
    from t1_tbl
),
t3_tbl as (
    select user_id
        ,item_id
        ,timestamp
        ,behavior_type
        ,category_id
        ,price
        ,quantity
        ,coalesce(sales, 0) as sales
    from t2_tbl
)
select *
    ,to_timestamp(timestamp) as event_timestamp
    ,strftime(to_timestamp(timestamp), '%Y-%m-%d') as event_date
from t3_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales   event_timestamp            event_date
0        184439   688483   1512105722  buy            1000858      533    5.0       2665.0  2017-12-01 14:22:02+09:00  2017-12-01
1        860167   3032030  1511876592  pv             1000858      80     NaN       0.0     2017-11-28 22:43:12+09:00  2017-11-28
2        29019    1174848  1511952716  pv             1000858      898    NaN       0.0     2017-11-29 19:51:56+09:00  2017-11-29
...      ...      ...      ...         ...            ...          ...    ...       ...     ...                        ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       0.0     2017-11-26 15:24:01+09:00  2017-11-26
4999998  891831   2064564  1512255716  pv             999980       614    NaN       0.0     2017-12-03 08:01:56+09:00  2017-12-03
4999999  10430    2006178  1512012994  pv             999980       228    NaN       0.0     2017-11-30 12:36:34+09:00  2017-11-30

5000000 rows × 10 columns
Example: Make a string column year_month that takes only the yyyy-mm part from event_timestamp.
Most query engines have built-in functions to manipulate text, such as these for duckdb. However, in this case, since event_timestamp is a timestamp, we need to convert it to a string before applying substring, using cast([COLUMN] as [DATA TYPE]).
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
),
t2_tbl as (
    select *
        ,price * quantity as sales
    from t1_tbl
),
t3_tbl as (
    select user_id
        ,item_id
        ,timestamp
        ,behavior_type
        ,category_id
        ,price
        ,quantity
        ,coalesce(sales, 0) as sales
    from t2_tbl
),
t4_tbl as (
    select *
        ,to_timestamp(timestamp) as event_timestamp
        ,strftime(to_timestamp(timestamp), '%Y-%m-%d') as event_date
    from t3_tbl
)
select *
    ,substring(cast(event_timestamp as varchar), 1, 7) as year_month
from t4_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales   event_timestamp            event_date  year_month
0        184439   688483   1512105722  buy            1000858      533    5.0       2665.0  2017-12-01 14:22:02+09:00  2017-12-01  2017-12
1        860167   3032030  1511876592  pv             1000858      80     NaN       0.0     2017-11-28 22:43:12+09:00  2017-11-28  2017-11
2        29019    1174848  1511952716  pv             1000858      898    NaN       0.0     2017-11-29 19:51:56+09:00  2017-11-29  2017-11
...      ...      ...      ...         ...            ...          ...    ...       ...     ...                        ...         ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       0.0     2017-11-26 15:24:01+09:00  2017-11-26  2017-11
4999998  891831   2064564  1512255716  pv             999980       614    NaN       0.0     2017-12-03 08:01:56+09:00  2017-12-03  2017-12
4999999  10430    2006178  1512012994  pv             999980       228    NaN       0.0     2017-11-30 12:36:34+09:00  2017-11-30  2017-11

5000000 rows × 11 columns
Example: Save the manipulations done so far as a view to be used later.
You will notice that even with the with clauses, our query is getting substantially more clunky. We do not want to re-write these lines every time we reuse this set of results in other queries. Luckily, there is a solution called a view. A view saves query logic that can be reused by other queries later. Unlike actual SQL tables, the data is not stored physically in your database, so the query saved to a view will still run every time it is called.
This query creates the view:
Code
query = f"""
select setseed(0.112);
create or replace view transaction_tbl_x as (
    with t1_tbl as (
        select *
            ,item_id % 1000 + 50 as price
            ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
        from transaction_tbl
    ),
    t2_tbl as (
        select *
            ,price * quantity as sales
        from t1_tbl
    ),
    t3_tbl as (
        select user_id
            ,item_id
            ,timestamp
            ,behavior_type
            ,category_id
            ,price
            ,quantity
            ,coalesce(sales, 0) as sales
        from t2_tbl
    ),
    t4_tbl as (
        select *
            ,to_timestamp(timestamp) as event_timestamp
            ,strftime(to_timestamp(timestamp), '%Y-%m-%d') as event_date
        from t3_tbl
    )
    select *
        ,substring(cast(event_timestamp as varchar), 1, 7) as year_month
    from t4_tbl
)
"""
execute_query(query)
Count
This query reuses it with a simple select *:
Code
query = f"""
select setseed(0.112);
select * from transaction_tbl_x limit 100;
"""
execute_query(query)
    user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales   event_timestamp            event_date  year_month
0   184439   688483   1512105722  buy            1000858      533    5.0       2665.0  2017-12-01 14:22:02+09:00  2017-12-01  2017-12
1   860167   3032030  1511876592  pv             1000858      80     NaN       0.0     2017-11-28 22:43:12+09:00  2017-11-28  2017-11
2   29019    1174848  1511952716  pv             1000858      898    NaN       0.0     2017-11-29 19:51:56+09:00  2017-11-29  2017-11
..  ...      ...      ...         ...            ...          ...    ...       ...     ...                        ...         ...
97  109393   336502   1512302682  pv             1003726      552    NaN       0.0     2017-12-03 21:04:42+09:00  2017-12-03  2017-12
98  928768   2826670  1511780591  pv             1003726      720    NaN       0.0     2017-11-27 20:03:11+09:00  2017-11-27  2017-11
99  719622   2297792  1511703912  pv             1003726      842    NaN       0.0     2017-11-26 22:45:12+09:00  2017-11-26  2017-11

100 rows × 11 columns
5. Aggregation aka Group By
When you have millions of rows of data, you do not want to look at them one by one; you want to know how they look in aggregate: how many events there are of each type, which items are favorited the most/least, how many items users buy on average, and so on. In fact, you have already learned how to do some of this. count and count(distinct [COLUMN]) are examples of aggregation over the entire table. In this section, we will also learn to use group by to get aggregated values according to the columns we want.
Example: How many events are there for each event type (behavior_type)?
You can order by the newly created nb_event column to sort event types in descending order according to how many events they have. As expected, it is views, add-to-carts, favorites, then purchases.
Code
query = f"""
select setseed(0.112);
select behavior_type
    ,count(*) as nb_event
from transaction_tbl_x
group by behavior_type
order by nb_event desc;
"""
execute_query(query)
   behavior_type  nb_event
0  pv             4478549
1  cart           276307
2  fav            144452
3  buy            100692
One neat trick is that you do not have to write the column names in group by or order by; you can use numbering instead. For instance, 1 means the first column selected, in this case behavior_type.
Code
query = f"""
select setseed(0.112);
select behavior_type
    ,count(*) as nb_event
from transaction_tbl_x
group by 1
order by 2 desc;
"""
execute_query(query)
   behavior_type  nb_event
0  pv             4478549
1  cart           276307
2  fav            144452
3  buy            100692
Example: What are the top 10 items favorited by the largest number of unique customers?
When used in conjunction with where, the where clause comes before group by.
Code
query = f"""
select setseed(0.112);
select item_id
    ,count(distinct user_id) as nb_user
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
order by 2 desc
limit 10;
"""
execute_query(query)
    item_id  nb_user
0   2331370  43
1   3845720  41
2   2279428  40
..  ...      ...
7   2364679  37
8   3403645  35
9   1783990  33

10 rows × 2 columns
Example: What is the average, standard deviation, min, and max spend per user?
First, we need to sum up all sales for each user, then perform the avg, stddev, min, and max aggregations over all users. We can also look at quantiles using functions like approx_quantile([COLUMN], [QUANTILE]).
Code
query = f"""
select setseed(0.112);
select min(spend) as min_spend
    ,approx_quantile(spend, 0.25) as p25
    ,avg(spend) as avg_spend
    ,stddev(spend) as std_spend
    ,approx_quantile(spend, 0.5) as p50
    ,approx_quantile(spend, 0.75) as p75
    ,max(spend) as max_spend
from (
    select user_id
        ,sum(sales) as spend
    from transaction_tbl_x
    group by 1
) a;
"""
execute_query(query)
   min_spend  p25  avg_spend   std_spend   p50  p75  max_spend
0  0.0        0.0  343.119825  1343.99162  0.0  0.0  33513.0
Example: What percentage of customers purchased at least once?
As you might notice from the last example, most customers have spend equal to zero. This is a typical distribution in a retail business, where most users come to window shop and only a few make a purchase. We can find the percentage of those who made at least one purchase by combining the avg aggregation with case when if-else logic.
Code
query = f"""
select setseed(0.112);
select avg(case when spend > 0 then 1 else 0 end) as conversion_rate
from (
    select user_id
        ,sum(sales) as spend
    from transaction_tbl_x
    group by 1
) a;
"""
execute_query(query)
   conversion_rate
0  0.102035
Example: Give me only the customers whose cheapest purchased item costs more than $100.
In the same manner that we use the where clause to filter a select, we can use having to filter aggregations. This works exactly the same as aggregating first and then filtering with a subquery (see the equivalent form after the result below). Most modern query engines treat them the same way in terms of performance, so pick whichever is more readable to you.
Code
query = f"""
select setseed(0.112);
select user_id
    ,min(price) as min_price
from transaction_tbl_x
where behavior_type = 'buy'
group by 1
having min_price > 100
order by min_price;
"""
execute_query(query)
       user_id  min_price
0      782669   101
1      302287   101
2      773094   101
...    ...      ...
85577  340274   1049
85578  603498   1049
85579  438867   1049

85580 rows × 2 columns
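For reference, this is a sketch of the equivalent subquery form mentioned above; both should return the same result.
Code
query = f"""
select setseed(0.112);
select user_id
    ,min_price
from (
    select user_id
        ,min(price) as min_price
    from transaction_tbl_x
    where behavior_type = 'buy'
    group by 1
) a
where min_price > 100
order by min_price;
"""
execute_query(query)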
Example: For each user who has favorited anything, give me the list of items they favorited, ordered by timestamp in ascending order.
These days you might find yourself preparing a sequence dataset to train LLMs, in which case you want to concatenate a series of values from a column. Most modern SQL dialects handle this, such as Spark SQL and Presto. In duckdb, we can use string_agg and keep the order from earliest to latest timestamp by using order by.
Code
query = f"""
select setseed(0.112);
select user_id
    ,string_agg(item_id order by timestamp) as item_id_list
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
"""
execute_query(query)
       user_id  item_id_list
0      744534   2149136,88810,4629792
1      517461   4325698,5109079,2036947
2      261004   1289993,4431704
...    ...      ...
98434  702058   4753515
98435  865408   2643630
98436  391171   5154868

98437 rows × 2 columns
6. Window Function
Aggregation is very powerful, but a major downside is that it collapses all rows into aggregated values. This means that after using group by, you lose the individual detail of each original row. You can no longer easily ask questions about the sequence of events, like "What happened just before this?" or "What is the third item in this list?". Window functions solve this by allowing you to perform calculations on a related set of rows (a "window") without making those rows disappear. They let you calculate things like rankings, running totals, or comparisons across rows, all while keeping your original data rows intact.
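For instance, a running total of each user's spend over time keeps every row while adding a cumulative column. A minimal sketch using the columns from our transaction_tbl_x view:
Code
query = f"""
select user_id
    ,event_timestamp
    ,sales
    ,sum(sales) over (partition by user_id order by event_timestamp) as running_spend
from transaction_tbl_x
limit 10;
"""
execute_query(query)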
Example: For each user, find their second-to-last event.
We might be able to use min/max aggregation with some subqueries to get the last event, but the second-to-last event is convoluted to retrieve with group by alone. It is trivial when we use row_number() over (partition by [WINDOW COLUMN] order by [ORDERING COLUMN] asc/desc) to get ranking numbers (starting at 1), then filter only the rows we need (rnk = 2).
Code
query = f"""
select setseed(0.112);
select *
from (
    select *
        ,row_number() over (partition by user_id order by timestamp desc) as rnk
    from transaction_tbl_x
) a
where rnk=2
"""
execute_query(query)
        user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales  event_timestamp            event_date  year_month  rnk
0       6        730246   1512151025  pv             4756105      296    NaN       0.0    2017-12-02 02:57:05+09:00  2017-12-02  2017-12     2
1       58       274807   1512191821  pv             4458428      857    NaN       0.0    2017-12-02 14:17:01+09:00  2017-12-02  2017-12     2
2       73       2561888  1512224052  pv             753984       938    NaN       0.0    2017-12-02 23:14:12+09:00  2017-12-02  2017-12     2
...     ...      ...      ...         ...            ...          ...    ...       ...    ...                        ...         ...         ...
753406  1017789  797757   1512179970  pv             883960       807    NaN       0.0    2017-12-02 10:59:30+09:00  2017-12-02  2017-12     2
753407  1017827  4777442  1512258856  pv             3607361      492    NaN       0.0    2017-12-03 08:54:16+09:00  2017-12-03  2017-12     2
753408  1017858  850224   1512291735  pv             3800818      274    NaN       0.0    2017-12-03 18:02:15+09:00  2017-12-03  2017-12     2

753409 rows × 12 columns
We can confirm that each of these rows is the second-to-last event of a user who has at least 2 events.
Example: What is the first item in each category that each user views?
We can also partition the window by multiple columns.
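One hedged way to sanity-check this: count the users with at least 2 events and verify that it matches the number of rows returned above.
Code
query = f"""
select count(*) as nb_user_with_2plus_events
from (
    select user_id
    from transaction_tbl_x
    group by 1
    having count(*) >= 2
) a;
"""
execute_query(query)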
Code
query = f"""
select setseed(0.112);
select *
from (
    select user_id
        ,item_id
        ,row_number() over (partition by user_id, category_id order by timestamp asc) as rnk
    from transaction_tbl_x
    where behavior_type = 'pv'
) a
where rnk=1;
"""
execute_query(query)
         user_id  item_id  rnk
0        7        516760   1
1        19       3615870  1
2        21       4928897  1
...      ...      ...      ...
3389298  1017999  3251062  1
3389299  1018006  2789562  1
3389300  1018011  3190817  1

3389301 rows × 3 columns
Example: For each user, what was their previous action before a buy action?
We can use lag/lead to get the value just before or after each row.
Code
pd.set_option('display.max_rows', 10)

query = f"""
select setseed(0.112);
select *
from (
    select user_id
        ,event_timestamp
        ,behavior_type as current_event
        ,lag(behavior_type) over (partition by user_id order by event_timestamp asc) as previous_event
    from transaction_tbl_x
    order by user_id, event_timestamp asc
) a
where current_event = 'buy'
limit 10
"""
execute_query(query)
   user_id  event_timestamp            current_event  previous_event
0  2        2017-12-02 20:34:34+09:00  buy            pv
1  41       2017-11-28 15:22:28+09:00  buy            pv
2  45       2017-12-03 11:49:02+09:00  buy            pv
3  50       2017-11-27 09:02:55+09:00  buy            None
4  50       2017-12-02 11:35:55+09:00  buy            pv
5  50       2017-12-03 20:56:57+09:00  buy            buy
6  59       2017-11-27 01:29:32+09:00  buy            pv
7  62       2017-12-03 16:30:30+09:00  buy            pv
8  68       2017-11-27 19:59:34+09:00  buy            pv
9  80       2017-11-25 22:52:00+09:00  buy            pv
Confirm the query works by looking at user_id=50 (whose three buys were preceded by no event, a pv, and another buy) and user_id=41 (whose single buy was preceded by a pv).
Code
pd.set_option('display.max_rows', 30)

query = f"""
select setseed(0.112);
select *
from transaction_tbl_x
where user_id in (50, 41)
order by user_id, event_timestamp
"""
execute_query(query)
    user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales    event_timestamp            event_date  year_month
0   41       2350782  1511590506  fav            3576283      832    NaN       0.0      2017-11-25 15:15:06+09:00  2017-11-25  2017-11
1   41       4810522  1511592979  pv             1464116      572    NaN       0.0      2017-11-25 15:56:19+09:00  2017-11-25  2017-11
2   41       259923   1511651601  pv             4170419      973    NaN       0.0      2017-11-26 08:13:21+09:00  2017-11-26  2017-11
3   41       4786486  1511758455  pv             2572604      536    NaN       0.0      2017-11-27 13:54:15+09:00  2017-11-27  2017-11
4   41       460114   1511850148  buy            4804883      164    5.0       820.0    2017-11-28 15:22:28+09:00  2017-11-28  2017-11
5   41       1876500  1511850397  pv             3158249      550    NaN       0.0      2017-11-28 15:26:37+09:00  2017-11-28  2017-11
6   41       3599288  1512050217  cart           1537669      338    NaN       0.0      2017-11-30 22:56:57+09:00  2017-11-30  2017-11
7   41       2479959  1512050346  pv             3491350      1009   NaN       0.0      2017-11-30 22:59:06+09:00  2017-11-30  2017-11
8   41       3095445  1512072759  fav            4801426      495    NaN       0.0      2017-12-01 05:12:39+09:00  2017-12-01  2017-12
9   41       3937435  1512072770  pv             4801426      485    NaN       0.0      2017-12-01 05:12:50+09:00  2017-12-01  2017-12
10  41       2601044  1512073082  pv             2735466      94     NaN       0.0      2017-12-01 05:18:02+09:00  2017-12-01  2017-12
11  41       3017816  1512073131  pv             2735466      866    NaN       0.0      2017-12-01 05:18:51+09:00  2017-12-01  2017-12
12  41       3976745  1512073512  pv             3002561      795    NaN       0.0      2017-12-01 05:25:12+09:00  2017-12-01  2017-12
13  50       1808993  1511740975  buy            4690421      1043   10.0      10430.0  2017-11-27 09:02:55+09:00  2017-11-27  2017-11
14  50       1353441  1511743131  cart           64179        491    NaN       0.0      2017-11-27 09:38:51+09:00  2017-11-27  2017-11
15  50       1963474  1511743387  pv             3422001      524    NaN       0.0      2017-11-27 09:43:07+09:00  2017-11-27  2017-11
16  50       972172   1511854245  pv             4756105      222    NaN       0.0      2017-11-28 16:30:45+09:00  2017-11-28  2017-11
17  50       407777   1511948105  pv             4756105      827    NaN       0.0      2017-11-29 18:35:05+09:00  2017-11-29  2017-11
18  50       2559047  1512108229  pv             4756105      97     NaN       0.0      2017-12-01 15:03:49+09:00  2017-12-01  2017-12
19  50       3408121  1512110060  pv             1320293      171    NaN       0.0      2017-12-01 15:34:20+09:00  2017-12-01  2017-12
20  50       4225949  1512110662  pv             1879194      999    NaN       0.0      2017-12-01 15:44:22+09:00  2017-12-01  2017-12
21  50       1820775  1512112710  cart           4537973      825    NaN       0.0      2017-12-01 16:18:30+09:00  2017-12-01  2017-12
22  50       3096190  1512179808  pv             472273       240    NaN       0.0      2017-12-02 10:56:48+09:00  2017-12-02  2017-12
23  50       5155205  1512179846  cart           4762182      255    NaN       0.0      2017-12-02 10:57:26+09:00  2017-12-02  2017-12
24  50       518956   1512181946  pv             4835206      1006   NaN       0.0      2017-12-02 11:32:26+09:00  2017-12-02  2017-12
25  50       120958   1512182155  buy            4762182      1008   3.0       3024.0   2017-12-02 11:35:55+09:00  2017-12-02  2017-12
26  50       4619331  1512302217  buy            3884119      381    7.0       2667.0   2017-12-03 20:56:57+09:00  2017-12-03  2017-12
27  50       5045605  1512302237  pv             820727       655    NaN       0.0      2017-12-03 20:57:17+09:00  2017-12-03  2017-12
Example: What is the contribution of each item to their overall category sales?
We can also use aggregations like sum in conjunction with window functions.
Code
pd.set_option('display.max_rows', 6)

query = f"""
select setseed(0.112);
select item_id
    ,category_id
    ,item_sales / category_sales as percentage_category_sales
from (
    select item_id
        ,category_id
        ,item_sales
        ,sum(item_sales) over (partition by category_id) as category_sales
    from (
        select item_id
            ,category_id
            ,sum(sales) as item_sales
        from transaction_tbl_x
        group by 1,2
    ) a
) b
where category_sales > 0;
"""
execute_query(query)
         item_id  category_id  percentage_category_sales
0        644316   75275        0.0
1        4356670  75275        0.0
2        45889    75275        0.0
...      ...      ...          ...
1219659  4607802  5078340      0.0
1219660  4052466  5078340      0.0
1219661  104515   5078340      0.0

1219662 rows × 3 columns
7. Concatenation aka Union
Sometimes you want to concatenate multiple tables that share the same set of columns (called a schema). union all is a simple concatenation, whereas union also deduplicates the rows for you, returning only rows that are not perfectly identical to one another.
Example: Concatenate monthly summary of 2017-11 and 2017-12 together.
We do not need to worry about duplicates here, so we can simply use union all. See how we can combine count, distinct, and case when to get monthly active customers and purchasers.
Code
query = f"""
select setseed(0.112);
select * from (
    select year_month
        ,count(distinct user_id) as nb_active_customer
        ,count(distinct case when behavior_type='buy' then user_id else null end) as nb_purchaser
        ,sum(sales) as total_sales
    from transaction_tbl_x
    where year_month = '2017-11'
    group by 1
)
union all
(
    select year_month
        ,count(distinct user_id) as nb_active_customer
        ,count(distinct case when behavior_type='buy' then user_id else null end) as nb_purchaser
        ,sum(sales) as total_sales
    from transaction_tbl_x
    where year_month = '2017-12'
    group by 1
);
"""
execute_query(query)
   year_month  nb_active_customer  nb_purchaser  total_sales
0  2017-11     759497              58836         192782711.0
1  2017-12     679573              34938         111758189.0
Example: Give me a list of unique users who made at least one purchase or favorited items at least 5 times.
Concatenate then deduplicate.
Code
query = f"""
select setseed(0.112);
select user_id
from transaction_tbl_x
where behavior_type = 'buy'
group by 1
union
select user_id
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
having count(*)>=5
"""
execute_query(query)
       user_id
0      898251
1      576709
2      837520
...    ...
92122  86617
92123  906571
92124  103138

92125 rows × 1 columns
Simple concatenation will give duplicates.
Code
query = f"""
select setseed(0.112);
select user_id
from transaction_tbl_x
where behavior_type = 'buy'
group by 1
union all
select user_id
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
having count(*)>=5
"""
execute_query(query)
       user_id
0      286908
1      910422
2      669209
...    ...
92359  251575
92360  598643
92361  906629

92362 rows × 1 columns
8. Joins
join connects data from one table to another based on columns they share. There are many types of joins but 95% of your life will revolve around left join and inner join.
Example: Among users who were active (had at least one event) on 2017-12-03, what percentage were active in 2017-11?
left join starts with all rows from the left-side table (the former one) and adds rows from the right-side table (the latter one) to it, if and only if the rows meet the conditions given in the on clause. These conditions usually require a value in a column from the left-side table to be equal to, not equal to, or greater/less than one in a column from the right-side table.
Be sure to give aliases to the tables you are selecting from and joining on; your query engine needs to know exactly which table a column came from when both tables have a column with the same name.
Code
query = f"""
select setseed(0.112);
with november_active_tbl as (
    select user_id
        ,1 as active_in_november
    from transaction_tbl_x
    where year_month = '2017-11'
    group by 1,2
),
dec3_active_tbl as (
    select user_id
    from transaction_tbl_x
    where event_date = '2017-12-03'
    group by 1
)
select dec3.user_id
    ,active_in_november
from dec3_active_tbl dec3
left join november_active_tbl nov
    on dec3.user_id = nov.user_id;
"""
execute_query(query)
        user_id  active_in_november
0       78674    1
1       755902   1
2       829454   1
...     ...      ...
375424  743851   <NA>
375425  543372   <NA>
375426  610566   <NA>

375427 rows × 2 columns
As you can see, users who were not active in 2017-11 have null values in their active_in_november column. This is because we keep all rows from the left-side table (dec3_active_tbl). Finally, we can find the percentage with a simple aggregation. This is how you calculate the percentage of returning users.
Code
query = f"""
select setseed(0.112);
with november_active_tbl as (
    select user_id
        ,1 as active_in_november
    from transaction_tbl_x
    where year_month = '2017-11'
    group by 1,2
),
dec3_active_tbl as (
    select user_id
    from transaction_tbl_x
    where event_date = '2017-12-03'
    group by 1
)
select avg(coalesce(active_in_november, 0)) as percent_active_last_month
from dec3_active_tbl dec3
left join november_active_tbl nov
    on dec3.user_id = nov.user_id;
"""
execute_query(query)
   percent_active_last_month
0  0.824562
Example: What is each day's contribution to its total monthly sales, expressed as a percentage?
inner join is used when you only want rows where the on conditions are satisfied in both tables. In this case, we know that year_month, the key we are joining on, exists in both the daily and monthly sales tables, so we can use inner join without fear of losing information. inner join generally performs better than left join, so prioritize it when you can, especially when you are working with huge tables.
One sneaky thing you can do: technically, you can put a condition that involves only one table in the on clause, such as daily_sales > 0 below. It will have the same performance as putting it in the where clause.
Code
pd.set_option('display.max_rows', 10)

query = f"""
select setseed(0.112);
with monthly_sales_tbl as (
    select year_month
        ,sum(sales) as monthly_sales
    from transaction_tbl_x
    group by 1
    having sum(sales)>0
),
daily_sales_tbl as (
    select year_month
        ,event_date
        ,sum(sales) as daily_sales
    from transaction_tbl_x
    group by 1,2
)
select daily.year_month
    ,event_date
    ,daily_sales / monthly_sales as percentage_of_monthly_sales
from daily_sales_tbl daily
inner join monthly_sales_tbl monthly
    on daily.year_month = monthly.year_month
    and daily_sales > 0
order by event_date;
"""
execute_query(query)
   year_month  event_date  percentage_of_monthly_sales
0  2017-11     2017-11-25  0.149288
1  2017-11     2017-11-26  0.162279
2  2017-11     2017-11-27  0.177516
3  2017-11     2017-11-28  0.165055
4  2017-11     2017-11-29  0.171182
5  2017-11     2017-11-30  0.176354
6  2017-12     2017-12-01  0.284772
7  2017-12     2017-12-02  0.352953
8  2017-12     2017-12-03  0.346453
9  2017-12     2017-12-04  0.016650
9. Tribal Knowledge
We have now gone through SQL techniques to accomplish most tasks as a SQL monkey. I would like to close with some tips and tricks I have learned over the years:
Get your hands dirty. Whether it is the exercise or your own data, get out there and make it happen!
ALWAYS SPECIFY THE PARTITIONS YOU NEED IN THE WHERE CLAUSE, and use the right data types.
Sanity check for data quality issues, namely duplicates, missing values, improbable values, and wrong data types (see the duplicate-check sketch at the end of this post). Make sure the values in the columns you care about are distributed in a reasonable manner.
Work with assumptions and experiments. Have a set of clear hypotheses about what you are trying to learn or do, compose the query, run it, and record the results. Work incrementally, not all at once, lest you regret it during debugging. Do not just randomly write queries and hope for the best; the rabbit hole is too deep.
Performance is king. Always pick the pattern with better performance when possible: for instance, inner join over left join, union all over union, and no distinct when values are already unique.
Query code styling must be readable and consistent. In this day and age, when a coding assistant can fix your code styling in a few seconds, the best style for your query is the one most readable to your team. Whether it is tabs vs. spaces for indentation, leading vs. trailing commas, with vs. subquery vs. view, or capitalized vs. uncapitalized keywords, just pick one style and stick with it.
When in doubt, explain. Most query engines will show you how they plan to execute your queries. If you are a beginner, this might not be extremely helpful, but at least you can catch simple things: whether the partitions you specified are being used, whether the engine is making unnecessary data type conversions, which part of the process takes the most time, and so on. explain only shows you the plan, while explain analyze executes the query and then tells you how it went.
Code
query = """
explain analyze
select * from transaction_tbl
where item_id = 2067266 and category_id = 4339722
"""
print(execute_query(query)['explain_value'][0])
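Finally, the duplicate check mentioned in the tips above could look something like the sketch below: any row returned means the same user performed the same action on the same item at the same second more than once.
Code
query = f"""
select user_id
    ,item_id
    ,behavior_type
    ,timestamp
    ,count(*) as nb_row
from transaction_tbl
group by 1,2,3,4
having count(*) > 1;
"""
execute_query(query)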