R and SQL , Best of both

Data Science for Everyone

Author

Bongani Ncube

Loading the databases used in this Exercise

staff database
Staff dataset from R

SELECT * FROM staff;

`SELECT` statement

SELECT IN SQL
SELECT IN R

In SQL, you can SELECT data FROM a table using a SELECT statement. For example, the following query SELECTs the name column FROM the people table: SELECT name FROM people;
you may want to SELECT all columns FROM a table use * rather than typing all the columns

SELECT all columns FROM the staff table

SELECT last_name , email,department, salary FROM staff;

R’s dplyr uses A select statement as well

df |> select(var1,var2,var3)

Filtering data

filtering IN SQL
filtering IN R

filtering using WHERE in SQL

WHERE is a filtering clause
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
= equal
<> not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
You can build up your WHERE queries by combining multiple conditions with the AND keyword.

SELECT last_name , email,department, salary 
FROM staff
WHERE department = 'Computers' ;

in R we use the filter command and almost similar comparison operators as in SQL.

Using greater than /less than IN SQL
Using greater than /less than IN R

SELECT last_name , email,department, salary 
FROM staff
WHERE salary > 149307 ;

Aggregate functions in SQL and R

Finding Minimum IN SQL
Finding Minimum IN R

Find the MIN() OF salary

SELECT MIN(salary) FROM staff;

Find the MIN() OF salary and aliase it as min_salary

SELECT MIN(salary) AS ____ FROM employees;

Find the MIN(),MAX(),AVG(),STDEV() OF salary in SQL
Find the MIN(),MAX(),AVG(),STDEV() OF salary in R

SELECT __(salary) AS min_salary,
       __(salary) AS max_salary,
       ___(______) AS ___
FROM employees;

filtering text data in SQL and R

use LIKE
the LIKE operator can be used in a WHERE clause to search for a pattern in a column.
To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text

The _ wildcard will match a single character

Filter using LIKE in SQL
Filter using LIKE in R

SELECT * 
FROM staff 
WHERE email LIKE 'walle%'

in R we use the function str_like

Grouping and aggregating in SQL and R

Grouping to Find the MIN(),MAX(),AVG(),STDEV() OF salary in SQL for each gender
Grouping to Find the MIN(),MAX(),AVG(),STDEV() OF salary in R for each gender

SELECT gender,
      MIN(salary) AS min_salary,
      MAX(salary) AS max_salary,
      AVG(salary) AS avg_salary
FROM employees
GROUP BY gender;

Grouping , aggregating and Ordering in SQL

in SQL
in R

SELECT department,
      MIN(salary) AS min_salary,
      MAX(salary) AS max_salary,
      AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary;

`CASE` statements

this contains a WHEN,THEN and ELSE statement ,finished with END

Take the stair `CASE`

using CASE WHEN to categorise or group data as follows
SELECT….
CASE WHEN condition THEN category
WHEN condition THEN category
…………………………………..
ELSE some_category
END AS variable name

Case statement in SQL
Case When in R

SELECT last_name ,
           salary ,
          CASE WHEN salary BETWEEN 0 AND 50000 THEN '0-50000'
               WHEN salary BETWEEN 50001 AND 100000 THEN '50001-100000'
               WHEN salary BETWEEN 100001 AND 120000 THEN '100001-120000'
               WHEN salary IS NULL THEN 'missing'
               ELSE '>120000'
              END AS salary_range
FROM staff;

Case statement and GROUP BY in SQL
Case When AND GROUP BY in R

SELECT last_name ,
           salary ,
          CASE WHEN salary BETWEEN 0 AND 50000 THEN '0-50000'
               WHEN salary BETWEEN 50001 AND 100000 THEN '50001-100000'
               WHEN salary BETWEEN 100001 AND 120000 THEN '100001-120000'
               WHEN salary IS NULL THEN 'missing'
               ELSE '>120000'
              END AS salary_range,
              COUNT(*) AS range_count
FROM staff
GROUP BY salary_range;

Window functions

Perform an operation across a set of rows that are somehow related to the current row Similar to GROUP BY aggregate functions, but all rows remain in the output.

Uses

Fetching values from preceding or following rows (e.g. fetching the previous row’s value) Determining reigning champion status
Calculating growth over time
Assigning ordinal ranks (1rst, 2nd, etc.) to rows based on their values’ positions in a sorted list
Running totals, moving averages

Window functions in SQL AND R

Anatomy of a window function

FUNCTION_NAME() OVER (...)

ORDER BY
PARTITION BY
ROWS/RANGE PRECEDING/FOLLOWING/UNBOUNDED

ROW_NUMBER in SQL
ROW_NUMBER in R

SELECT department, job_title, salary,
ROW_NUMBER() OVER () AS Row_N
FROM staff
WHERE department = 'Computers';

Enter ORDER BY

ORDER BY in OVER orders the rows related to the current row Example: Ordering by salary in descending order in ROW_NUMBER ’s OVER clause will assign 1 to the largest salary salary’s rows

Order by and ROW_NUMBER in SQL
Order by and ROW_NUMBER in R

SELECT department, job_title, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS Row_N
FROM staff
WHERE department = 'Computers';

ORDER BY inside OVER takes effect before the one outside

Enter PARTITION BY

PARTITION BY splits the table into partitions based on a column’s unique values
The results aren’t rolled into one column Operated on separately by the window function ROW_NUMBER will reset for each partition
LAG will only fetch a row’s previous value if its previous row is in the same partition

Partition in SQL
Partition in R

SELECT department, job_title, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS Row_N
FROM staff;

The ranking functions

ROW_NUMBER() always assigns unique numbers, even if two rows’ values are the same
RANK() assigns the same number to rows with identical values, skipping over the next numbers in such cases
DENSE_RANK() also assigns the same number to rows with identical values, but doesn’t skip over the next numbers

Ranking in SQL
Ranking in R

SELECT
department, job_title, salary,
ROW_NUMBER() OVER (ORDER BY salary ASC) AS Row_N,
RANK() OVER (ORDER BY salary ASC) AS Rank_N
FROM staff
WHERE
department = 'Computers';