Skill mastery is an endeavor of diminishing returns. You can get from zero to good enough (think first-time chess player to a 500 online rating) within 100 hours of deliberate practice; it is vastly more difficult to get from good enough to mastery (think a 500-rated chess player trying to become a 2200-FIDE-rated Candidate Master), which requires over 100x more deliberate practice. SQL is the basic building block of data analysis, and I posit that most people (product managers, marketers, salespeople, and other makeshift data analysts) simply need to be good enough.
This tutorial aims to cover almost all the SQL techniques you need to glean actionable insights from your (non-transactional) datasets. We will use the Taobao User Behavior dataset and duckdb to simulate a SQL interface. There will be idiosyncrasies depending on which flavor of SQL you are using (Postgres, Presto, Redshift, BigQuery, and so on), but you should be able to adapt the principles outlined here with a little help from modern coding assistants.
I recommend that you first skim through this post to get a rough idea of what SQL is all about, then move on to complete the exercise for some hands-on shenanigans. While completing it, feel free to refer back to this post and read some sections in more detail. The road to becoming a true SQL monkey starts with writing queries.
Code
import duckdb
import pandas as pd
pd.set_option('display.max_rows', 6)

import numpy as np
import random
from tqdm.auto import tqdm
import os
import timeit

df = pd.read_csv('../../data/sql_almost/UserBehavior.csv', header=None)
df.columns = ['user_id','item_id','category_id','behavior_type','timestamp']

# sample 5M rows out of 100M to run in reasonable time
df = df.sample(n=5_000_000, random_state=112).reset_index(drop=True)

# # save to parquet
# con = duckdb.connect()
# con.register('transaction_tbl', df)
# output_dir = '../../data/sql_almost/transaction_tbl'
# os.makedirs(output_dir, exist_ok=True)
# con.execute(f"""
# COPY transaction_tbl
# TO '{output_dir}'
# (FORMAT PARQUET, PARTITION_BY (category_id, behavior_type), OVERWRITE_OR_IGNORE);
# """)
0. Intuition
Imagine you are at a library looking for a certain piece of information. The entire library is off limits to the general public. The only way you can access any information is to tell the librarian exactly what you want and let them fetch it for you. The entire library is your database. The librarian is your query engine and what you tell them is your query. In the library, there are a number of shelves (representing tables) that contain any number of books (representing rows in a table) to answer your query. Each book from the same shelf contains the same number of pages (representing columns or fields in a row), each page having a distinct piece of information. Following your instruction, the librarian may walk to different shelves, scour some books and pages, mix-and-match the information, then present the answer to you. Therefore, our task is to give an instruction such that the librarian can give us the most accurate answer using the least time and energy possible.
1. Sanity Checks Keep You Sane
We are treating the Taobao User Behavior dataset as our table called transaction_tbl. It is a 5M-row subset stored as parquet files partitioned by category_id and behavior_type.
Code
con = duckdb.connect()

# short-hand function to execute a query with duckdb easily
def execute_query(query):
    return con.execute(query).fetchdf()

query = f"""
CREATE OR REPLACE TABLE transaction_tbl AS
SELECT *
FROM read_parquet('../../data/sql_almost/transaction_tbl/*/*/*.parquet', hive_partitioning=true);
"""

# load saved, partitioned table as transaction_tbl
execute_query(query)
The table is a run-of-the-mill user behavior log consisting of
user_id: identifier of the user
item_id: identifier of the item
timestamp: unix timestamp of when the action happened
behavior_type: what type of action it was
category_id: identifier of the category the item belongs to
Example: select *
The first thing you do with a table you have not seen before is to select a few rows to look at. The text in the variable query is what you would type into your SQL interface.
Code
query = f"""
select * from transaction_tbl limit 100;
"""
execute_query(query)
    user_id  item_id   timestamp   behavior_type  category_id
0   184439   688483    1512105722  buy            1000858
1   860167   3032030   1511876592  pv             1000858
2   29019    1174848   1511952716  pv             1000858
..  ...      ...       ...         ...            ...
97  109393   336502    1512302682  pv             1003726
98  928768   2826670   1511780591  pv             1003726
99  719622   2297792   1511703912  pv             1003726

100 rows × 5 columns
select * returns all columns from a table, and is almost always followed by limit [NUMBER OF ROWS YOU WANT] so that you do not return the entire table but only a small, arbitrary subset of it. This is less of a problem nowadays since most query engines have a built-in safeguard against displaying an entire table as a result; otherwise you could crash your system by asking for terabytes of output.
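If you only care about a few columns, you can name them explicitly instead of using *. A minimal sketch using the columns described above:
Code
query = f"""
select user_id
    ,item_id
from transaction_tbl
limit 10;
"""
execute_query(query)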
Example: describe
Another useful command is describe. It will tell you what the columns are, their data types, whether they can contain null (missing) values, and so on. Common data types are int/bigint for integers, float/double for approximate decimals, varchar/string for text, boolean for true/false, and timestamp/date for, well, timestamps and dates. However, they always differ ever so slightly depending on which SQL you are using, so it is better to confirm with the documentation, such as this one for Presto.
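In duckdb this is a one-liner, as in the sketch below; the exact command name and output columns vary by engine, so treat this as an illustration rather than a universal recipe.
Code
query = f"""
describe transaction_tbl;
"""
execute_query(query)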
The last check you always want to do is to count how many rows exist in the table:
Code
query = f"""
select count(*) from transaction_tbl;
"""
execute_query(query)
   count_star()
0  5000000
We can also count how many unique values are in any column by adding distinct in front of the column name inside count. You can also name your derived columns with as so you know which one means what.
Code
# count how many unique users
query = f"""
select count(*) as nb_event
    ,count(distinct user_id) as nb_user
from transaction_tbl;
"""
execute_query(query)
   nb_event  nb_user
0  5000000   888335
2. Assorted Platter of Selects
Now that you know how to sanity-check a table, let us move on to selecting what you want. This is often done by filtering the result with conditions using the where clause. Here are some examples you want to familiarize yourself with.
Example: Who are the users that viewed item 2067266?
Link multiple conditions with logical operators and / or.
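A minimal sketch of what this could look like, assuming a "view" means a pv event; we revisit this exact question in section 3 when we talk about partitions:
Code
query = f"""
select distinct user_id
from transaction_tbl
where item_id = 2067266
    and behavior_type = 'pv';
"""
execute_query(query)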
Example: How many users interact (pv, cart, fav) with Taobao but never buy?
We can use not as a negation for in.
Code
query = f"""
select count(distinct user_id)
from transaction_tbl
where behavior_type not in ('buy');
"""
execute_query(query)
   count(DISTINCT user_id)
0  882077
Example: How many behavior_type values end with a v?
We use like to match text that is similar to what we want. % is the wildcard meaning anything can come before/after it. In this simple example, we already know the answer is 2: pv and fav.
Code
query = f"""
select count(distinct behavior_type)
from transaction_tbl
where behavior_type like '%v';
"""
execute_query(query)
   count(DISTINCT behavior_type)
0  2
3. Partitions, Partitions, Partitions
Once you have mastered the basic select patterns, the next thing you must never forget when selecting from a table is to ALWAYS SPECIFY THE PARTITIONS YOU NEED IN THE WHERE CLAUSE. A modern SQL table is stored as multiple compressed files (most commonly parquet) in a nested subfolder structure as seen below. In this case, the partitions (subfolders) are columns category_id and behavior_type.
Fun Fact: Not only do you need to specify the partitions, you need to specify them with the correct data types. I once had a query that ran for 12 hours instead of 5 minutes, simply because I did not check that the partition was a datetime, not a string. Do not be me.
Code
def print_table_structure(table_path, first_category_id_dir):
    # print root folder
    print(table_path.split(os.sep)[-1])

    # get category_id partitions
    category_ids = sorted([d for d in os.listdir(table_path)
                           if os.path.isdir(os.path.join(table_path, d)) and d.startswith('category_id=')])

    # print only the first category_id partition
    category_id_path = os.path.join(table_path, first_category_id_dir)
    print(f" └── {first_category_id_dir}")

    # get behavior_type partitions
    behavior_types = sorted([d for d in os.listdir(category_id_path)
                             if os.path.isdir(os.path.join(category_id_path, d)) and d.startswith('behavior_type=')])

    # print all behavior_type partitions
    for behavior_type_dir in behavior_types:
        behavior_type_path = os.path.join(category_id_path, behavior_type_dir)
        print(f"     └── {behavior_type_dir}")

        # get all parquet files
        files = sorted([f for f in os.listdir(behavior_type_path) if f.endswith('.parquet')])

        # print all files under the category_id, behavior_type partitions
        for i, file in enumerate(files):
            if i == len(files) - 1:
                print(f"         └── {file}")
            else:
                print(f"         ├── {file}")

    if len(category_ids) > 1:
        print(f" {len(category_ids) - 1} more category_id folders...")

print_table_structure('../../data/sql_almost/transaction_tbl', 'category_id=1462446')
To understand why specifying partitions is crucial, let us try our first query again and see how long it takes to run. We conduct 100 trials of 100 runs each to get mean and standard deviation of query time.
Example: Who are the users that viewed item 2067266? Run WITHOUT partition in where clause.
Code
query = """
select distinct user_id from transaction_tbl
where item_id = 2067266;
"""
ts = timeit.repeat(lambda: execute_query(query), number=100, repeat=100)
print(f'Query time WITHOUT partition: {np.mean(ts):.2f}±{np.std(ts):.2f} seconds')
Query time WITHOUT partition: 0.15±0.08 seconds
Example: Who are the users that viewed item 2067266? Run WITH category_id partition in where clause.
Code
query = """
select distinct user_id from transaction_tbl
where item_id = 2067266 and category_id = 4339722;
"""
ts = timeit.repeat(lambda: execute_query(query), number=100, repeat=100)
print(f'Query time WITH partition: {np.mean(ts):.2f}±{np.std(ts):.2f} seconds')
Query time WITH partition: 0.06±0.06 seconds
What sorcery is this? The query time is roughly halved! This is because when we include partitions in the where clause, we are telling the query engine to only look at specific parts of the data rather than the entire table. In real life, not specifying the partitions you need, as precisely as possible, can spell the difference between waiting two hours and a few seconds.
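As a rule of thumb, spell out every partition column you can and compare each one to a literal of the matching data type. A hedged sketch, reusing the values from the query above:
Code
query = f"""
select distinct user_id
from transaction_tbl
where item_id = 2067266
    and category_id = 4339722   -- partition column, integer literal matching its type
    and behavior_type = 'pv'    -- partition column, string literal matching its type
;
"""
execute_query(query)
Depending on the engine, comparing category_id to the string '4339722' may still return the right rows, but it can force a cast on every row and defeat partition pruning, which is exactly the 12-hour mistake described in the fun fact above.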
4. Column-wise Manipulation
Before we move on, you might have noticed that our dataset is a little bland with only a timestamp and categorical columns (timestamp, user_id, item_id, category_id, behavior_type). In reality, a table like the one we are using often contains the price of the item and the quantity purchased. We can add those by manipulating existing columns.
Example: Add price column where price is item_id modulus 1000 + 50
Here we add the derived price column to the result. You can manipulate any numeric column with arithmetic operators.
Example: Add quantity column that is a random number between 1 and 10, only for buy events. For all other events, leave it as missing values (null).
When you buy from a store, you need to specify a quantity of the items; however, for other events (when you view, favorite, or add to cart), you do not. We must give the query engine if-else logic depending on the values in each row. In SQL, we do this using case when [CONDITION] then [VALUE] else [DEFAULT VALUE] end. We use setseed to keep the randomized numbers reproducible.
Code
# random() generates a number between 0 and 1, then floor rounds it down
query = f"""
select setseed(0.112);
select *
    ,item_id % 1000 + 50 as price
    ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
from transaction_tbl
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity
0        184439   688483   1512105722  buy            1000858      533    2.0
1        860167   3032030  1511876592  pv             1000858      80     NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN
...      ...      ...      ...         ...            ...          ...    ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN

5000000 rows × 7 columns
We can also have as many when clauses as we want. If we explicitly state the conditions for every behavior_type, it looks something like:
Code
query = f"""
select setseed(0.112);
select *
    ,item_id % 1000 + 50 as price
    ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1
          when behavior_type = 'pv' then null
          when behavior_type = 'cart' then null
          when behavior_type = 'fav' then null
          else null end as quantity
from transaction_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity
0        184439   688483   1512105722  buy            1000858      533    4.0
1        860167   3032030  1511876592  pv             1000858      80     NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN
...      ...      ...      ...         ...            ...          ...    ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN

5000000 rows × 7 columns
Example: Calculate sales by multiplying price and quantity
We can do operations among columns of the table. This example requires us to perform select twice: first to create price and quantity, then to create sales by multiplying them together. We do this by writing a subquery.
Code
query = f"""
select setseed(0.112);
select whatever_subquery_alias_you_want.*
    ,price * quantity as sales
from (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
) whatever_subquery_alias_you_want;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales
0        184439   688483   1512105722  buy            1000858      533    6.0       3198.0
1        860167   3032030  1511876592  pv             1000858      80     NaN       NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN       NaN
...      ...      ...      ...         ...            ...          ...    ...       ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN       NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN       NaN

5000000 rows × 8 columns
You will notice that the query becomes much less readable with subqueries. This is especially the case when you have multiple nested subqueries; they become unreadable even to you three months later. An elegant solution is to separate these subqueries with with clauses. This is especially useful if you have a subquery you would like to reuse later in the same query. There is usually no performance difference between subqueries and with clauses, so pick whichever is easiest for you to read.
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
)
select *
    ,price * quantity as sales
from t1_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales
0        184439   688483   1512105722  buy            1000858      533    6.0       3198.0
1        860167   3032030  1511876592  pv             1000858      80     NaN       NaN
2        29019    1174848  1511952716  pv             1000858      898    NaN       NaN
...      ...      ...      ...         ...            ...          ...    ...       ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       NaN
4999998  891831   2064564  1512255716  pv             999980       614    NaN       NaN
4999999  10430    2006178  1512012994  pv             999980       228    NaN       NaN

5000000 rows × 8 columns
Example: Fill in missing values in sales with zero.
We use coalesce to fill in missing values. Notice that we can have multiple with clauses (consecutive ones punctuated by , and not requiring another with), depending on what you think is most readable.
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
),
t2_tbl as (
    select *
        ,price * quantity as sales
    from t1_tbl
)
select user_id
    ,item_id
    ,timestamp
    ,behavior_type
    ,category_id
    ,price
    ,quantity
    ,coalesce(sales, 0) as sales
from t2_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales
0        184439   688483   1512105722  buy            1000858      533    2.0       1066.0
1        860167   3032030  1511876592  pv             1000858      80     NaN       0.0
2        29019    1174848  1511952716  pv             1000858      898    NaN       0.0
...      ...      ...      ...         ...            ...          ...    ...       ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       0.0
4999998  891831   2064564  1512255716  pv             999980       614    NaN       0.0
4999999  10430    2006178  1512012994  pv             999980       228    NaN       0.0

5000000 rows × 8 columns
Example: Convert timestamp from unix timestamp to yyyy-mm-dd format.
Datetime conversion, as in any programming language, is a very confusing affair, so I highly recommend you refer to your specific SQL's documentation, such as this one for Spark. But the idea is simply applying a function over your columns. Here we use to_timestamp to convert the integer unix timestamp into a proper timestamp, then strftime to format it as a yyyy-mm-dd string.
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
),
t2_tbl as (
    select *
        ,price * quantity as sales
    from t1_tbl
),
t3_tbl as (
    select user_id
        ,item_id
        ,timestamp
        ,behavior_type
        ,category_id
        ,price
        ,quantity
        ,coalesce(sales, 0) as sales
    from t2_tbl
)
select *
    ,to_timestamp(timestamp) as event_timestamp
    ,strftime(to_timestamp(timestamp), '%Y-%m-%d') as event_date
from t3_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales   event_timestamp            event_date
0        184439   688483   1512105722  buy            1000858      533    5.0       2665.0  2017-12-01 14:22:02+09:00  2017-12-01
1        860167   3032030  1511876592  pv             1000858      80     NaN       0.0     2017-11-28 22:43:12+09:00  2017-11-28
2        29019    1174848  1511952716  pv             1000858      898    NaN       0.0     2017-11-29 19:51:56+09:00  2017-11-29
...      ...      ...      ...         ...            ...          ...    ...       ...     ...                        ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       0.0     2017-11-26 15:24:01+09:00  2017-11-26
4999998  891831   2064564  1512255716  pv             999980       614    NaN       0.0     2017-12-03 08:01:56+09:00  2017-12-03
4999999  10430    2006178  1512012994  pv             999980       228    NaN       0.0     2017-11-30 12:36:34+09:00  2017-11-30

5000000 rows × 10 columns
Example: Make a string column year_month that takes only the yyyy-mm part from event_timestamp.
Most query engines have built-in functions to manipulate text, such as these for duckdb. However, in this case, since event_timestamp is a timestamp, we need to convert it to a string before applying substring, using cast([COLUMN] as [DATA TYPE]).
Code
query = f"""
select setseed(0.112);
with t1_tbl as (
    select *
        ,item_id % 1000 + 50 as price
        ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
    from transaction_tbl
),
t2_tbl as (
    select *
        ,price * quantity as sales
    from t1_tbl
),
t3_tbl as (
    select user_id
        ,item_id
        ,timestamp
        ,behavior_type
        ,category_id
        ,price
        ,quantity
        ,coalesce(sales, 0) as sales
    from t2_tbl
),
t4_tbl as (
    select *
        ,to_timestamp(timestamp) as event_timestamp
        ,strftime(to_timestamp(timestamp), '%Y-%m-%d') as event_date
    from t3_tbl
)
select *
    ,substring(cast(event_timestamp as varchar), 1, 7) as year_month
from t4_tbl;
"""
execute_query(query)
         user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales   event_timestamp            event_date  year_month
0        184439   688483   1512105722  buy            1000858      533    5.0       2665.0  2017-12-01 14:22:02+09:00  2017-12-01  2017-12
1        860167   3032030  1511876592  pv             1000858      80     NaN       0.0     2017-11-28 22:43:12+09:00  2017-11-28  2017-11
2        29019    1174848  1511952716  pv             1000858      898    NaN       0.0     2017-11-29 19:51:56+09:00  2017-11-29  2017-11
...      ...      ...      ...         ...            ...          ...    ...       ...     ...                        ...         ...
4999997  461462   1642299  1511677441  pv             999980       349    NaN       0.0     2017-11-26 15:24:01+09:00  2017-11-26  2017-11
4999998  891831   2064564  1512255716  pv             999980       614    NaN       0.0     2017-12-03 08:01:56+09:00  2017-12-03  2017-12
4999999  10430    2006178  1512012994  pv             999980       228    NaN       0.0     2017-11-30 12:36:34+09:00  2017-11-30  2017-11

5000000 rows × 11 columns
Example: Save the manipulations done so far as a view to be used later.
You will notice that even with the with clauses, our query is getting substantially more clunky. We do not want to re-write these lines every time we reuse this set of results in other queries. Luckily, there is a solution called a view. A view saves query logic that can be reused by other queries later. Unlike actual SQL tables, the data is not stored physically in your database, so the query saved to a view will still run every time it is called.
This query creates the view:
Code
query = f"""
select setseed(0.112);
create or replace view transaction_tbl_x as (
    with t1_tbl as (
        select *
            ,item_id % 1000 + 50 as price
            ,case when behavior_type = 'buy' then FLOOR(RANDOM() * 10) + 1 else null end as quantity
        from transaction_tbl
    ),
    t2_tbl as (
        select *
            ,price * quantity as sales
        from t1_tbl
    ),
    t3_tbl as (
        select user_id
            ,item_id
            ,timestamp
            ,behavior_type
            ,category_id
            ,price
            ,quantity
            ,coalesce(sales, 0) as sales
        from t2_tbl
    ),
    t4_tbl as (
        select *
            ,to_timestamp(timestamp) as event_timestamp
            ,strftime(to_timestamp(timestamp), '%Y-%m-%d') as event_date
        from t3_tbl
    )
    select *
        ,substring(cast(event_timestamp as varchar), 1, 7) as year_month
    from t4_tbl
)
"""
execute_query(query)
Count
This query reuses it with a simple select *:
Code
query = f"""
select setseed(0.112);
select * from transaction_tbl_x limit 100;
"""
execute_query(query)
    user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales   event_timestamp            event_date  year_month
0   184439   688483   1512105722  buy            1000858      533    5.0       2665.0  2017-12-01 14:22:02+09:00  2017-12-01  2017-12
1   860167   3032030  1511876592  pv             1000858      80     NaN       0.0     2017-11-28 22:43:12+09:00  2017-11-28  2017-11
2   29019    1174848  1511952716  pv             1000858      898    NaN       0.0     2017-11-29 19:51:56+09:00  2017-11-29  2017-11
..  ...      ...      ...         ...            ...          ...    ...       ...     ...                        ...         ...
97  109393   336502   1512302682  pv             1003726      552    NaN       0.0     2017-12-03 21:04:42+09:00  2017-12-03  2017-12
98  928768   2826670  1511780591  pv             1003726      720    NaN       0.0     2017-11-27 20:03:11+09:00  2017-11-27  2017-11
99  719622   2297792  1511703912  pv             1003726      842    NaN       0.0     2017-11-26 22:45:12+09:00  2017-11-26  2017-11

100 rows × 11 columns
5. Aggregation aka Group By
When you have millions of rows of data, you do not want to look at them one by one; you want to know how they look in aggregate: how many events there are of each type, which items are favorited the most/least, how many items users buy on average, and so on. In fact, you have already learned how to do some of this. count and count(distinct [COLUMN]) are examples of aggregation over the entire table. In this section, we will also learn to use group by to get aggregated values according to the columns we want.
Example: How many events are there for each event type (behavior_type)?
You can order by the newly created nb_event column to sort event types in descending order according to how many events they have. As expected, it is views, add-to-carts, favorites, then purchases.
Code
query = f"""
select setseed(0.112);
select behavior_type
    ,count(*) as nb_event
from transaction_tbl_x
group by behavior_type
order by nb_event desc;
"""
execute_query(query)
   behavior_type  nb_event
0  pv             4478549
1  cart           276307
2  fav            144452
3  buy            100692
One neat trick is that you do not have to write the column names in group by or order by; you can use numbering instead. For instance, 1 means the first column selected, in this case behavior_type.
Code
query = f"""
select setseed(0.112);
select behavior_type
    ,count(*) as nb_event
from transaction_tbl_x
group by 1
order by 2 desc;
"""
execute_query(query)
   behavior_type  nb_event
0  pv             4478549
1  cart           276307
2  fav            144452
3  buy            100692
Example: What are the top 10 items favorited by the largest number of unique customers?
When used in conjunction with where, the where clause comes before group by.
Code
query = f"""
select setseed(0.112);
select item_id
    ,count(distinct user_id) as nb_user
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
order by 2 desc
limit 10;
"""
execute_query(query)
    item_id  nb_user
0   2331370  43
1   3845720  41
2   2279428  40
..  ...      ...
7   2364679  37
8   3403645  35
9   1783990  33

10 rows × 2 columns
Example: What is the average, standard deviation, min, and max spend per user?
First, we need to sum up all sales for each user, then perform the avg, stddev, min, and max aggregations over all users. We can also look at quantiles using functions like approx_quantile([COLUMN], [QUANTILE]).
Code
query = f"""
select setseed(0.112);
select min(spend) as min_spend
    ,approx_quantile(spend, 0.25) as p25
    ,avg(spend) as avg_spend
    ,stddev(spend) as std_spend
    ,approx_quantile(spend, 0.5) as p50
    ,approx_quantile(spend, 0.75) as p75
    ,max(spend) as max_spend
from (
    select user_id
        ,sum(sales) as spend
    from transaction_tbl_x
    group by 1
) a;
"""
execute_query(query)
   min_spend  p25  avg_spend   std_spend   p50  p75  max_spend
0  0.0        0.0  343.119825  1343.99162  0.0  0.0  33513.0
Example: What percentage of customers purchased at least once?
As you might notice from the last example, most customers have spend equal to zero. This is a typical distribution in a retail business, where most users come to window shop and only a few make a purchase. We can find the percentage of those who made at least one purchase by combining the avg aggregation with case when if-else logic.
Code
query = f"""
select setseed(0.112);
select avg(case when spend > 0 then 1 else 0 end) as conversion_rate
from (
    select user_id
        ,sum(sales) as spend
    from transaction_tbl_x
    group by 1
) a;
"""
execute_query(query)
   conversion_rate
0  0.102035
Example: Give me only the customers whose cheapest purchased item costs more than $100.
In the same manner that we use the where clause to filter a select, we can use having to filter aggregations. This works exactly the same as aggregating first and then filtering with a subquery (see the equivalent form after the result below). Most modern query engines treat them the same way in terms of performance, so pick whichever is more readable to you.
Code
query = f"""
select setseed(0.112);
select user_id
    ,min(price) as min_price
from transaction_tbl_x
where behavior_type = 'buy'
group by 1
having min_price > 100
order by min_price;
"""
execute_query(query)
       user_id  min_price
0      782669   101
1      302287   101
2      773094   101
...    ...      ...
85577  340274   1049
85578  603498   1049
85579  438867   1049

85580 rows × 2 columns
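For reference, this is a sketch of the equivalent subquery form mentioned above; both should return the same result.
Code
query = f"""
select setseed(0.112);
select user_id
    ,min_price
from (
    select user_id
        ,min(price) as min_price
    from transaction_tbl_x
    where behavior_type = 'buy'
    group by 1
) a
where min_price > 100
order by min_price;
"""
execute_query(query)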
Example: For each user who has favorited anything, give me the list of items they favorited, ordered by timestamp in ascending order.
These days you might find yourself preparing a sequence dataset to train LLMs, in which case you want to concatenate a series of values from a column. Most modern SQL dialects handle this, such as Spark SQL and Presto. In duckdb, we can use string_agg and keep the order from earliest to latest timestamp by using order by.
Code
query = f"""
select setseed(0.112);
select user_id
    ,string_agg(item_id order by timestamp) as item_id_list
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
"""
execute_query(query)
       user_id  item_id_list
0      744534   2149136,88810,4629792
1      517461   4325698,5109079,2036947
2      261004   1289993,4431704
...    ...      ...
98434  702058   4753515
98435  865408   2643630
98436  391171   5154868

98437 rows × 2 columns
6. Window Function
Aggregation is very powerful, but a major downside is that it collapses all rows into aggregated values. This means that after using group by, you lose the individual detail of each original row. You can no longer easily ask questions about the sequence of events, like "What happened just before this?" or "What is the third item in this list?". Window functions solve this by allowing you to perform calculations on a related set of rows (a "window") without making those rows disappear. They let you calculate things like rankings, running totals, or comparisons across rows, all while keeping your original data rows intact.
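For instance, a running total of each user's spend over time keeps every row while adding a cumulative column. A minimal sketch using the columns from our transaction_tbl_x view:
Code
query = f"""
select user_id
    ,event_timestamp
    ,sales
    ,sum(sales) over (partition by user_id order by event_timestamp) as running_spend
from transaction_tbl_x
limit 10;
"""
execute_query(query)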
Example: For each user, find their second-to-last event.
We might be able to use min/max aggregation with some subqueries to get the last event, but the second-to-last event is convoluted to retrieve with group by alone. It is trivial when we use row_number() over (partition by [WINDOW COLUMN] order by [ORDERING COLUMN] asc/desc) to get ranking numbers (starting at 1), then filter only the rows we need (rnk = 2).
Code
query = f"""
select setseed(0.112);
select *
from (
    select *
        ,row_number() over (partition by user_id order by timestamp desc) as rnk
    from transaction_tbl_x
) a
where rnk=2
"""
execute_query(query)
        user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales  event_timestamp            event_date  year_month  rnk
0       6        730246   1512151025  pv             4756105      296    NaN       0.0    2017-12-02 02:57:05+09:00  2017-12-02  2017-12     2
1       58       274807   1512191821  pv             4458428      857    NaN       0.0    2017-12-02 14:17:01+09:00  2017-12-02  2017-12     2
2       73       2561888  1512224052  pv             753984       938    NaN       0.0    2017-12-02 23:14:12+09:00  2017-12-02  2017-12     2
...     ...      ...      ...         ...            ...          ...    ...       ...    ...                        ...         ...         ...
753406  1017789  797757   1512179970  pv             883960       807    NaN       0.0    2017-12-02 10:59:30+09:00  2017-12-02  2017-12     2
753407  1017827  4777442  1512258856  pv             3607361      492    NaN       0.0    2017-12-03 08:54:16+09:00  2017-12-03  2017-12     2
753408  1017858  850224   1512291735  pv             3800818      274    NaN       0.0    2017-12-03 18:02:15+09:00  2017-12-03  2017-12     2

753409 rows × 12 columns
We can confirm that each of these rows is the second-to-last event of a user who has at least 2 events.
Example: What is the first item in each category that each user views?
We can also partition the window by multiple columns.
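One hedged way to sanity-check this: count the users with at least 2 events and verify that it matches the number of rows returned above.
Code
query = f"""
select count(*) as nb_user_with_2plus_events
from (
    select user_id
    from transaction_tbl_x
    group by 1
    having count(*) >= 2
) a;
"""
execute_query(query)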
Code
query = f"""
select setseed(0.112);
select *
from (
    select user_id
        ,item_id
        ,row_number() over (partition by user_id, category_id order by timestamp asc) as rnk
    from transaction_tbl_x
    where behavior_type = 'pv'
) a
where rnk=1;
"""
execute_query(query)
         user_id  item_id  rnk
0        7        516760   1
1        19       3615870  1
2        21       4928897  1
...      ...      ...      ...
3389298  1017999  3251062  1
3389299  1018006  2789562  1
3389300  1018011  3190817  1

3389301 rows × 3 columns
Example: For each user, what was their previous action before a buy action?
We can use lag/lead to get the value just before or after each row.
Code
pd.set_option('display.max_rows', 10)

query = f"""
select setseed(0.112);
select *
from (
    select user_id
        ,event_timestamp
        ,behavior_type as current_event
        ,lag(behavior_type) over (partition by user_id order by event_timestamp asc) as previous_event
    from transaction_tbl_x
    order by user_id, event_timestamp asc
) a
where current_event = 'buy'
limit 10
"""
execute_query(query)
   user_id  event_timestamp            current_event  previous_event
0  2        2017-12-02 20:34:34+09:00  buy            pv
1  41       2017-11-28 15:22:28+09:00  buy            pv
2  45       2017-12-03 11:49:02+09:00  buy            pv
3  50       2017-11-27 09:02:55+09:00  buy            None
4  50       2017-12-02 11:35:55+09:00  buy            pv
5  50       2017-12-03 20:56:57+09:00  buy            buy
6  59       2017-11-27 01:29:32+09:00  buy            pv
7  62       2017-12-03 16:30:30+09:00  buy            pv
8  68       2017-11-27 19:59:34+09:00  buy            pv
9  80       2017-11-25 22:52:00+09:00  buy            pv
Confirm the query works by looking at user_id=50 (whose three buys were preceded by no event, a pv, and another buy) and user_id=41 (whose single buy was preceded by a pv).
Code
pd.set_option('display.max_rows', 30)

query = f"""
select setseed(0.112);
select *
from transaction_tbl_x
where user_id in (50, 41)
order by user_id, event_timestamp
"""
execute_query(query)
    user_id  item_id  timestamp   behavior_type  category_id  price  quantity  sales    event_timestamp            event_date  year_month
0   41       2350782  1511590506  fav            3576283      832    NaN       0.0      2017-11-25 15:15:06+09:00  2017-11-25  2017-11
1   41       4810522  1511592979  pv             1464116      572    NaN       0.0      2017-11-25 15:56:19+09:00  2017-11-25  2017-11
2   41       259923   1511651601  pv             4170419      973    NaN       0.0      2017-11-26 08:13:21+09:00  2017-11-26  2017-11
3   41       4786486  1511758455  pv             2572604      536    NaN       0.0      2017-11-27 13:54:15+09:00  2017-11-27  2017-11
4   41       460114   1511850148  buy            4804883      164    5.0       820.0    2017-11-28 15:22:28+09:00  2017-11-28  2017-11
5   41       1876500  1511850397  pv             3158249      550    NaN       0.0      2017-11-28 15:26:37+09:00  2017-11-28  2017-11
6   41       3599288  1512050217  cart           1537669      338    NaN       0.0      2017-11-30 22:56:57+09:00  2017-11-30  2017-11
7   41       2479959  1512050346  pv             3491350      1009   NaN       0.0      2017-11-30 22:59:06+09:00  2017-11-30  2017-11
8   41       3095445  1512072759  fav            4801426      495    NaN       0.0      2017-12-01 05:12:39+09:00  2017-12-01  2017-12
9   41       3937435  1512072770  pv             4801426      485    NaN       0.0      2017-12-01 05:12:50+09:00  2017-12-01  2017-12
10  41       2601044  1512073082  pv             2735466      94     NaN       0.0      2017-12-01 05:18:02+09:00  2017-12-01  2017-12
11  41       3017816  1512073131  pv             2735466      866    NaN       0.0      2017-12-01 05:18:51+09:00  2017-12-01  2017-12
12  41       3976745  1512073512  pv             3002561      795    NaN       0.0      2017-12-01 05:25:12+09:00  2017-12-01  2017-12
13  50       1808993  1511740975  buy            4690421      1043   10.0      10430.0  2017-11-27 09:02:55+09:00  2017-11-27  2017-11
14  50       1353441  1511743131  cart           64179        491    NaN       0.0      2017-11-27 09:38:51+09:00  2017-11-27  2017-11
15  50       1963474  1511743387  pv             3422001      524    NaN       0.0      2017-11-27 09:43:07+09:00  2017-11-27  2017-11
16  50       972172   1511854245  pv             4756105      222    NaN       0.0      2017-11-28 16:30:45+09:00  2017-11-28  2017-11
17  50       407777   1511948105  pv             4756105      827    NaN       0.0      2017-11-29 18:35:05+09:00  2017-11-29  2017-11
18  50       2559047  1512108229  pv             4756105      97     NaN       0.0      2017-12-01 15:03:49+09:00  2017-12-01  2017-12
19  50       3408121  1512110060  pv             1320293      171    NaN       0.0      2017-12-01 15:34:20+09:00  2017-12-01  2017-12
20  50       4225949  1512110662  pv             1879194      999    NaN       0.0      2017-12-01 15:44:22+09:00  2017-12-01  2017-12
21  50       1820775  1512112710  cart           4537973      825    NaN       0.0      2017-12-01 16:18:30+09:00  2017-12-01  2017-12
22  50       3096190  1512179808  pv             472273       240    NaN       0.0      2017-12-02 10:56:48+09:00  2017-12-02  2017-12
23  50       5155205  1512179846  cart           4762182      255    NaN       0.0      2017-12-02 10:57:26+09:00  2017-12-02  2017-12
24  50       518956   1512181946  pv             4835206      1006   NaN       0.0      2017-12-02 11:32:26+09:00  2017-12-02  2017-12
25  50       120958   1512182155  buy            4762182      1008   3.0       3024.0   2017-12-02 11:35:55+09:00  2017-12-02  2017-12
26  50       4619331  1512302217  buy            3884119      381    7.0       2667.0   2017-12-03 20:56:57+09:00  2017-12-03  2017-12
27  50       5045605  1512302237  pv             820727       655    NaN       0.0      2017-12-03 20:57:17+09:00  2017-12-03  2017-12
Example: What is the contribution of each item to their overall category sales?
We can also use aggregations like sum in conjunction with window functions.
Code
pd.set_option('display.max_rows', 6)

query = f"""
select setseed(0.112);
select item_id
    ,category_id
    ,item_sales / category_sales as percentage_category_sales
from (
    select item_id
        ,category_id
        ,item_sales
        ,sum(item_sales) over (partition by category_id) as category_sales
    from (
        select item_id
            ,category_id
            ,sum(sales) as item_sales
        from transaction_tbl_x
        group by 1,2
    ) a
) b
where category_sales > 0;
"""
execute_query(query)
         item_id  category_id  percentage_category_sales
0        644316   75275        0.0
1        4356670  75275        0.0
2        45889    75275        0.0
...      ...      ...          ...
1219659  4607802  5078340      0.0
1219660  4052466  5078340      0.0
1219661  104515   5078340      0.0

1219662 rows × 3 columns
7. Concatenation aka Union
Sometimes you want to concatenate multiple tables that share the same set of columns (called a schema). union all is a simple concatenation, whereas union also deduplicates the rows for you, returning only rows that are not perfectly identical to one another.
Example: Concatenate monthly summary of 2017-11 and 2017-12 together.
We do not need to worry about duplicates here, so we can simply use union all. See how we can combine count, distinct, and case when to get monthly active customers and purchasers.
Code
query = f"""
select setseed(0.112);
select * from (
    select year_month
        ,count(distinct user_id) as nb_active_customer
        ,count(distinct case when behavior_type='buy' then user_id else null end) as nb_purchaser
        ,sum(sales) as total_sales
    from transaction_tbl_x
    where year_month = '2017-11'
    group by 1
)
union all
(
    select year_month
        ,count(distinct user_id) as nb_active_customer
        ,count(distinct case when behavior_type='buy' then user_id else null end) as nb_purchaser
        ,sum(sales) as total_sales
    from transaction_tbl_x
    where year_month = '2017-12'
    group by 1
);
"""
execute_query(query)
   year_month  nb_active_customer  nb_purchaser  total_sales
0  2017-11     759497              58836         192782711.0
1  2017-12     679573              34938         111758189.0
Example: Give me a list of unique users who made at least one purchase or favorited items at least 5 times.
Concatenate then deduplicate.
Code
query = f"""
select setseed(0.112);
select user_id
from transaction_tbl_x
where behavior_type = 'buy'
group by 1
union
select user_id
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
having count(*)>=5
"""
execute_query(query)
       user_id
0      898251
1      576709
2      837520
...    ...
92122  86617
92123  906571
92124  103138

92125 rows × 1 columns
Simple concatenation will give duplicates.
Code
query = f"""
select setseed(0.112);
select user_id
from transaction_tbl_x
where behavior_type = 'buy'
group by 1
union all
select user_id
from transaction_tbl_x
where behavior_type = 'fav'
group by 1
having count(*)>=5
"""
execute_query(query)
       user_id
0      286908
1      910422
2      669209
...    ...
92359  251575
92360  598643
92361  906629

92362 rows × 1 columns
8. Joins
join connects data from one table to another based on columns they share. There are many types of joins but 95% of your life will revolve around left join and inner join.
Example: Among users who were active (had at least one event) on 2017-12-03, what percentage were active in 2017-11?
left join starts with all rows from the left-side table (the former one) and adds rows from the right-side table (the latter one) to it, if and only if the rows meet the conditions given in the on clause. These conditions usually require a value in a column from the left-side table to be equal to, not equal to, or greater/less than one in a column from the right-side table.
Be sure to give aliases to the tables you are selecting from and joining on; your query engine needs to know exactly which table a column came from when both tables have a column with the same name.
Code
query = f"""
select setseed(0.112);
with november_active_tbl as (
    select user_id
        ,1 as active_in_november
    from transaction_tbl_x
    where year_month = '2017-11'
    group by 1,2
),
dec3_active_tbl as (
    select user_id
    from transaction_tbl_x
    where event_date = '2017-12-03'
    group by 1
)
select dec3.user_id
    ,active_in_november
from dec3_active_tbl dec3
left join november_active_tbl nov
    on dec3.user_id = nov.user_id;
"""
execute_query(query)
        user_id  active_in_november
0       78674    1
1       755902   1
2       829454   1
...     ...      ...
375424  743851   <NA>
375425  543372   <NA>
375426  610566   <NA>

375427 rows × 2 columns
As you can see, users who were not active in 2017-11 have null values in their active_in_november column. This is because we keep all rows from the left-side table (dec3_active_tbl). Finally, we can find the percentage with a simple aggregation. This is how you calculate the percentage of returning users.
Code
query = f"""
select setseed(0.112);
with november_active_tbl as (
    select user_id
        ,1 as active_in_november
    from transaction_tbl_x
    where year_month = '2017-11'
    group by 1,2
),
dec3_active_tbl as (
    select user_id
    from transaction_tbl_x
    where event_date = '2017-12-03'
    group by 1
)
select avg(coalesce(active_in_november, 0)) as percent_active_last_month
from dec3_active_tbl dec3
left join november_active_tbl nov
    on dec3.user_id = nov.user_id;
"""
execute_query(query)
   percent_active_last_month
0  0.824562
Example: What is each day's contribution to its total monthly sales, expressed as a percentage?
inner join is used when you only want rows where the on conditions are satisfied in both tables. In this case, we know that year_month, the key we are joining on, exists in both the daily and monthly sales tables, so we can use inner join without fear of losing information. inner join generally performs better than left join, so prioritize it when you can, especially when you are working with huge tables.
One sneaky thing you can do: technically, you can put a condition that involves only one table in the on clause, such as daily_sales > 0 below. It will have the same performance as putting it in the where clause.
Code
pd.set_option('display.max_rows', 10)

query = f"""
select setseed(0.112);
with monthly_sales_tbl as (
    select year_month
        ,sum(sales) as monthly_sales
    from transaction_tbl_x
    group by 1
    having sum(sales)>0
),
daily_sales_tbl as (
    select year_month
        ,event_date
        ,sum(sales) as daily_sales
    from transaction_tbl_x
    group by 1,2
)
select daily.year_month
    ,event_date
    ,daily_sales / monthly_sales as percentage_of_monthly_sales
from daily_sales_tbl daily
inner join monthly_sales_tbl monthly
    on daily.year_month = monthly.year_month
    and daily_sales > 0
order by event_date;
"""
execute_query(query)
   year_month  event_date  percentage_of_monthly_sales
0  2017-11     2017-11-25  0.149288
1  2017-11     2017-11-26  0.162279
2  2017-11     2017-11-27  0.177516
3  2017-11     2017-11-28  0.165055
4  2017-11     2017-11-29  0.171182
5  2017-11     2017-11-30  0.176354
6  2017-12     2017-12-01  0.284772
7  2017-12     2017-12-02  0.352953
8  2017-12     2017-12-03  0.346453
9  2017-12     2017-12-04  0.016650
9. Tribal Knowledge
We have now gone through SQL techniques to accomplish most tasks as a SQL monkey. I would like to close with some tips and tricks I have learned over the years:
Get your hands dirty. Whether it is the exercise or your own data, get out there and make it happen!
ALWAYS SPECIFY THE PARTITIONS YOU NEED IN THE WHERE CLAUSE, and use the right data types.
Sanity check for data quality issues, namely duplicates, missing values, improbable values, and wrong data types (see the duplicate-check sketch at the end of this post). Make sure the values in the columns you care about are distributed in a reasonable manner.
Work with assumptions and experiments. Have a set of clear hypotheses about what you are trying to learn or do, compose the query, run it, and record the results. Work incrementally, not all at once, lest you regret it during debugging. Do not just randomly write queries and hope for the best; the rabbit hole is too deep.
Performance is king. Always pick the pattern with better performance when possible: for instance, inner join over left join, union all over union, and no distinct when values are already unique.
Query code styling must be readable and consistent. In this day and age, when a coding assistant can fix your code styling in a few seconds, the best style for your query is the one most readable to your team. Whether it is tabs vs. spaces for indentation, leading vs. trailing commas, with vs. subquery vs. view, or capitalized vs. uncapitalized keywords, just pick one style and stick with it.
When in doubt, explain. Most query engines will show you how they plan to execute your queries. If you are a beginner, this might not be extremely helpful, but at least you can catch simple things: whether the partitions you specified are being used, whether the engine is making unnecessary data type conversions, which part of the process takes the most time, and so on. explain only shows you the plan, while explain analyze executes the query and then tells you how it went.
Code
query = """
explain analyze
select * from transaction_tbl
where item_id = 2067266 and category_id = 4339722
"""
print(execute_query(query)['explain_value'][0])
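Finally, the duplicate check mentioned in the tips above could look something like the sketch below: any row returned means the same user performed the same action on the same item at the same second more than once.
Code
query = f"""
select user_id
    ,item_id
    ,behavior_type
    ,timestamp
    ,count(*) as nb_row
from transaction_tbl
group by 1,2,3,4
having count(*) > 1;
"""
execute_query(query)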