library(tidyverse)
sales<-tribble(~ year , ~country , ~product , ~profit ,~Own ,~Time,
2000 , "Finland", "Computer" , 1500 ,"Y" ,"D",
2000 , "Finland" ,"Phone" , 100 ,"Y" ,"D",
2001 , "Finland" ,"Phone" , 10 ,"N" ,"D",
2000 , "India" ,"Calculator" , 75 ,"Y" ,"E",
2000 , "India" ,"Calculator" , 75 ,"N" ,"E",
2000 , "India" ,"Computer" , 1200 ,"N" ,"E",
2000 , "USA" , "Calculator", 75 ,"Y" ,"E",
2000 , "USA" , "Computer" , 1500 ,"N" ,"E",
2001 , "USA" ,"Calculator" , 50 ,"Y" ,"E",
2001 , "USA" ,"Computer" , 1500 ,"Y" ,"E",
2001 , "USA" ,"Computer" , 1200 ,"Y" ,"D",
2001 , "USA" , "TV" , 150 ,"Y" ,"D",
2001 , "USA" , "TV" , 100 ,"Y" ,"D")
SQL Window Functions
Window functions
- A window function performs an aggregate-like operation on a set of query rows.
However, whereas an aggregate operation groups query rows into a single result row, a window function produces a result for each query row:
Anatomy of a window function
FUNCTION_NAME() OVER()
ORDER BY
PARTITION BY
ROWS/RANGE PRECEDING/FOLLOWING/UNBOUNDED
Creating a fake dataset in R
Load in the necessary packages
Set up a database using the datafile
Lets get started
- The row for which function evaluation occurs is called the current row.
- The query rows related to the current row over which function evaluation occurs comprise the window for the current row.
ROW_NUMBER() OVER()
- use it if maybe you have duplicates or intend to create rowids for your data
SELECT ROW_NUMBER() OVER () AS row_id , year, country, product, profit
FROM sales;
row_id | year | country | product | profit |
---|---|---|---|---|
1 | 2000 | Finland | Computer | 1500 |
2 | 2000 | Finland | Phone | 100 |
3 | 2001 | Finland | Phone | 10 |
4 | 2000 | India | Calculator | 75 |
5 | 2000 | India | Calculator | 75 |
6 | 2000 | India | Computer | 1200 |
7 | 2000 | USA | Calculator | 75 |
8 | 2000 | USA | Computer | 1500 |
9 | 2001 | USA | Calculator | 50 |
10 | 2001 | USA | Computer | 1500 |
PARTITION BY
- This splits the table into
partitions
based on a column’s unique values and results aren’t rolled into one column
therefore ROW_NUMBER()
With PARTITION BY
produces the row number of each row within its partition. In this case, rows are numbered per country. By default, partition rows are unordered and row numbering is nondeterministic.
SELECT ROW_NUMBER() OVER(PARTITION BY country) AS row_num1,
year, country, product, profit
FROM sales;
row_num1 | year | country | product | profit |
---|---|---|---|---|
1 | 2000 | Finland | Computer | 1500 |
2 | 2000 | Finland | Phone | 100 |
3 | 2001 | Finland | Phone | 10 |
1 | 2000 | India | Calculator | 75 |
2 | 2000 | India | Calculator | 75 |
3 | 2000 | India | Computer | 1200 |
1 | 2000 | USA | Calculator | 75 |
2 | 2000 | USA | Computer | 1500 |
3 | 2001 | USA | Calculator | 50 |
4 | 2001 | USA | Computer | 1500 |
its also possible to use
ROW_NUMBER()
WithORDER BY
to determine ranks
SELECT ROW_NUMBER() OVER(ORDER BY profit DESC) AS row_num1,
year, country, product, profit
FROM sales;
row_num1 | year | country | product | profit |
---|---|---|---|---|
1 | 2000 | Finland | Computer | 1500 |
2 | 2000 | USA | Computer | 1500 |
3 | 2001 | USA | Computer | 1500 |
4 | 2000 | India | Computer | 1200 |
5 | 2001 | USA | Computer | 1200 |
6 | 2001 | USA | TV | 150 |
7 | 2000 | Finland | Phone | 100 |
8 | 2001 | USA | TV | 100 |
9 | 2000 | India | Calculator | 75 |
10 | 2000 | India | Calculator | 75 |
in the above example , we see that we have ranked our data according to profits
implying that each rowid also signifies the rank of profit
RANK()
- ROW_NUMBER assigns different numbers to profits with the same amount, so it’s not a useful ranking function; if profits are the same , they should have the same rank.
in such a case we use
RANK()
SELECT RANK() OVER(ORDER BY profit DESC) AS RANK,
year, country, product
profit ,FROM sales;
RANK | profit | year | country | product |
---|---|---|---|---|
1 | 1500 | 2000 | Finland | Computer |
1 | 1500 | 2000 | USA | Computer |
1 | 1500 | 2001 | USA | Computer |
4 | 1200 | 2000 | India | Computer |
4 | 1200 | 2001 | USA | Computer |
6 | 150 | 2001 | USA | TV |
7 | 100 | 2000 | Finland | Phone |
7 | 100 | 2001 | USA | TV |
9 | 75 | 2000 | India | Calculator |
9 | 75 | 2000 | India | Calculator |
DENSE_RANK()
we can use DENSE_RANK()
if we need ranks to be ordered or require a further partition
SELECT DENSE_RANK() OVER(PARTITION BY country ORDER BY profit DESC) AS RANK,
year, country, product
profit ,FROM sales;
RANK | profit | year | country | product |
---|---|---|---|---|
1 | 1500 | 2000 | Finland | Computer |
2 | 100 | 2000 | Finland | Phone |
3 | 10 | 2001 | Finland | Phone |
1 | 1200 | 2000 | India | Computer |
2 | 75 | 2000 | India | Calculator |
2 | 75 | 2000 | India | Calculator |
1 | 1500 | 2000 | USA | Computer |
1 | 1500 | 2001 | USA | Computer |
2 | 1200 | 2001 | USA | Computer |
3 | 150 | 2001 | USA | TV |
now we have ranked our profits by country starting with the highest profit in each country
Aggregations
- window functions are similar to
GROUP BY
aggregate function but all rows stay in the output
using the sales information table, these two queries perform aggregate operations that produce a single global sum for all rows taken as a group, and sums grouped per country:
SELECT SUM(profit) AS total_profit
FROM sales;
total_profit |
---|
7535 |
lets us see a GROUP BY
in action
SELECT country, SUM(profit) AS country_profit
FROM sales
GROUP BY country
ORDER BY country;
country | country_profit |
---|---|
Finland | 1610 |
India | 1350 |
USA | 4575 |
On the other hand , window operations do not collapse groups of query rows to a single output row. Instead, they produce a result for each row. Like the preceding queries, the following query uses SUM()
, but this time as a window function:
SELECT
year, country, product, profit,
SUM(profit) OVER() AS total_profit,
SUM(profit) OVER(PARTITION BY country) AS country_profit
FROM sales
ORDER BY country, year, product, profit;
year | country | product | profit | total_profit | country_profit |
---|---|---|---|---|---|
2000 | Finland | Computer | 1500 | 7535 | 1610 |
2000 | Finland | Phone | 100 | 7535 | 1610 |
2001 | Finland | Phone | 10 | 7535 | 1610 |
2000 | India | Calculator | 75 | 7535 | 1350 |
2000 | India | Calculator | 75 | 7535 | 1350 |
2000 | India | Computer | 1200 | 7535 | 1350 |
2000 | USA | Calculator | 75 | 7535 | 4575 |
2000 | USA | Computer | 1500 | 7535 | 4575 |
2001 | USA | Calculator | 50 | 7535 | 4575 |
2001 | USA | Computer | 1200 | 7535 | 4575 |
we have included OVER
clause that specifies how to partition query rows into groups for processing by the window function:
- The first OVER clause is empty, which treats the entire set of query rows as a single partition. The window function thus produces a global sum, but does so for each row.
- The second OVER clause partitions rows by country, producing a sum per partition (per country). The function produces this sum for each partition row.
Window functions are permitted only in the select list and ORDER BY
clause. Query result rows are determined from the FROM
clause, after WHERE, GROUP BY
, and HAVING
processing, and windowing execution occurs before ORDER BY
, LIMIT
, and SELECT DISTINCT
.
WITH CLAUSE
This is used to create a temporary relation such that the output of this temporary relation is available and is used by the query that is associated with the WITH
clause. for instance the following table can be created as a temporary table
FRAMES
Frames allow you to “peek” forwards or backward without first using the relative fetching functions, LAG
and LEAD
, to fetch previous rows’ values into the current row.
ROWS BETWEEN
-
n PRECEDING
:n
rows before current row -
CURRENT ROW
: the current row -
n FOLLOWING
:n
rows after the current row
the following query calculates the number of sales per year and per country(i.e grouped by year and country)
SELECT
year, country, COUNT(*) AS num_sales
FROM sales
GROUP BY year, country;
year | country | num_sales |
---|---|---|
2000 | Finland | 2 |
2000 | India | 3 |
2000 | USA | 2 |
2001 | Finland | 1 |
2001 | USA | 5 |
like soo…
WITH
and a FRAME
- We create a temporary table called
country_sales
WITH Country_sales AS (
SELECT
year, country, COUNT(*) AS num_sales
FROM sales
GROUP BY year, country)
SELECT
year, country, num_sales,
-- Calculate each country's 3s-ales moving total
SUM(num_sales) OVER
PARTITION BY country
(ORDER BY year ASC
ROWS BETWEEN
2 PRECEDING AND CURRENT ROW) AS sales_MA
FROM Country_sales
ORDER BY country ASC, year ASC;
year | country | num_sales | sales_MA |
---|---|---|---|
2000 | Finland | 2 | 2 |
2001 | Finland | 1 | 3 |
2000 | India | 3 | 3 |
2000 | USA | 2 | 2 |
2001 | USA | 5 | 7 |