R and SQL , Best of both
Data Science for Everyone
Loading the databases used in this Exercise
SELECT * FROM staff;
SELECT
statement
- In SQL, you can SELECT data FROM a table using a SELECT statement. For example, the following query SELECTs the name column FROM the people table:
SELECT name FROM people;
- you may want to SELECT all columns FROM a table use
*
rather than typing all the columns
SELECT all columns FROM the
staff
table
SELECT last_name , email,department, salary FROM staff;
- R’s dplyr uses A select statement as well
df |> select(var1,var2,var3)
Filtering data
filtering using WHERE
in SQL
WHERE is a filtering clause
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
You can build up your WHERE queries by combining multiple conditions with the AND keyword.
SELECT last_name , email,department, salary
FROM staff
WHERE department = 'Computers' ;
- in R we use the
filter
command and almost similar comparison operators as in SQL.
SELECT last_name , email,department, salary
FROM staff
WHERE salary > 149307 ;
Aggregate functions in SQL and R
Find the
MIN()
OF salary
SELECT MIN(salary) FROM staff;
Find the
MIN()
OF salary andaliase
it asmin_salary
SELECT MIN(salary) AS ____ FROM employees;
SELECT __(salary) AS min_salary,
AS max_salary,
__(salary) AS ___
___(______) FROM employees;
filtering text data in SQL and R
- use
LIKE
- the
LIKE
operator can be used in a WHERE clause to search for a pattern in a column. - To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:
The
%
wildcard will match zero, one, or many characters in text
The
_
wildcard will match a single character
SELECT *
FROM staff
WHERE email LIKE 'walle%'
- in R we use the function str_like
Grouping and aggregating in SQL and R
SELECT gender,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
AVG(salary) AS avg_salary
FROM employees
GROUP BY gender;
Grouping , aggregating and Ordering in SQL
SELECT department,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary;
CASE
statements
this contains a
WHEN,THEN
andELSE
statement ,finished withEND
Take the stair CASE
- using
CASE WHEN
to categorise or group data as follows SELECT
….CASE WHEN
conditionTHEN
categoryWHEN
conditionTHEN
category- …………………………………..
ELSE
some_categoryEND AS
variable name
SELECT last_name ,
salary ,CASE WHEN salary BETWEEN 0 AND 50000 THEN '0-50000'
WHEN salary BETWEEN 50001 AND 100000 THEN '50001-100000'
WHEN salary BETWEEN 100001 AND 120000 THEN '100001-120000'
WHEN salary IS NULL THEN 'missing'
ELSE '>120000'
END AS salary_range
FROM staff;
SELECT last_name ,
salary ,CASE WHEN salary BETWEEN 0 AND 50000 THEN '0-50000'
WHEN salary BETWEEN 50001 AND 100000 THEN '50001-100000'
WHEN salary BETWEEN 100001 AND 120000 THEN '100001-120000'
WHEN salary IS NULL THEN 'missing'
ELSE '>120000'
END AS salary_range,
COUNT(*) AS range_count
FROM staff
GROUP BY salary_range;
Perform an operation across a set of rows that are somehow related to the current row Similar to GROUP BY aggregate functions, but all rows remain in the output.
Uses
- Fetching values from preceding or following rows (e.g. fetching the previous row’s value) Determining reigning champion status
- Calculating growth over time
- Assigning ordinal ranks (1rst, 2nd, etc.) to rows based on their values’ positions in a sorted list
- Running totals, moving averages
Window functions in SQL AND R
- Anatomy of a window function
FUNCTION_NAME() OVER (...)
- ORDER BY
- PARTITION BY
- ROWS/RANGE PRECEDING/FOLLOWING/UNBOUNDED
SELECT department, job_title, salary,
ROW_NUMBER() OVER () AS Row_N
FROM staff
WHERE department = 'Computers';
Enter ORDER BY
ORDER BY in OVER orders the rows related to the current row Example: Ordering by salary in descending order in ROW_NUMBER ’s OVER clause will assign 1 to the largest salary salary’s rows
SELECT department, job_title, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS Row_N
FROM staff
WHERE department = 'Computers';
ORDER BY inside OVER takes effect before the one outside
Enter PARTITION BY
- PARTITION BY splits the table into partitions based on a column’s unique values
- The results aren’t rolled into one column Operated on separately by the window function ROW_NUMBER will reset for each partition
- LAG will only fetch a row’s previous value if its previous row is in the same partition
SELECT department, job_title, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS Row_N
FROM staff;
The ranking functions
- ROW_NUMBER() always assigns unique numbers, even if two rows’ values are the same
- RANK() assigns the same number to rows with identical values, skipping over the next numbers in such cases
- DENSE_RANK() also assigns the same number to rows with identical values, but doesn’t skip over the next numbers
SELECT
department, job_title, salary,ROW_NUMBER() OVER (ORDER BY salary ASC) AS Row_N,
RANK() OVER (ORDER BY salary ASC) AS Rank_N
FROM staff
WHERE
= 'Computers'; department