R and SQL , Best of both

Data Science for Everyone

Author

Bongani Ncube

Loading the databases used in this Exercise

SELECT * FROM staff;

SELECT statement

  • In SQL, you can SELECT data FROM a table using a SELECT statement. For example, the following query SELECTs the name column FROM the people table: SELECT name FROM people;
  • you may want to SELECT all columns FROM a table use * rather than typing all the columns

SELECT all columns FROM the staff table

SELECT last_name , email,department, salary FROM staff;
  • R’s dplyr uses A select statement as well

df |> select(var1,var2,var3)

Filtering data

filtering using WHERE in SQL

  • WHERE is a filtering clause

  • In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:

  • You can build up your WHERE queries by combining multiple conditions with the AND keyword.

SELECT last_name , email,department, salary 
FROM staff
WHERE department = 'Computers' ;
  • in R we use the filter command and almost similar comparison operators as in SQL.
SELECT last_name , email,department, salary 
FROM staff
WHERE salary > 149307 ;

Aggregate functions in SQL and R

Find the MIN() OF salary

SELECT MIN(salary) FROM staff;

Find the MIN() OF salary and aliase it as min_salary

SELECT MIN(salary) AS ____ FROM employees;
SELECT __(salary) AS min_salary,
       __(salary) AS max_salary,
       ___(______) AS ___
FROM employees;

filtering text data in SQL and R

  • use LIKE
  • the LIKE operator can be used in a WHERE clause to search for a pattern in a column.
  • To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text

The _ wildcard will match a single character

SELECT * 
FROM staff 
WHERE email LIKE 'walle%'
  • in R we use the function str_like

Grouping and aggregating in SQL and R

SELECT gender,
      MIN(salary) AS min_salary,
      MAX(salary) AS max_salary,
      AVG(salary) AS avg_salary
FROM employees
GROUP BY gender;

Grouping , aggregating and Ordering in SQL

SELECT department,
      MIN(salary) AS min_salary,
      MAX(salary) AS max_salary,
      AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary;

CASE statements

this contains a WHEN,THEN and ELSE statement ,finished with END

Take the stair CASE

  • using CASE WHEN to categorise or group data as follows
  • SELECT….
  • CASE WHEN condition THEN category
  • WHEN condition THEN category
  • …………………………………..
  • ELSE some_category
  • END AS variable name
SELECT last_name ,
           salary ,
          CASE WHEN salary BETWEEN 0 AND 50000 THEN '0-50000'
               WHEN salary BETWEEN 50001 AND 100000 THEN '50001-100000'
               WHEN salary BETWEEN 100001 AND 120000 THEN '100001-120000'
               WHEN salary IS NULL THEN 'missing'
               ELSE '>120000'
              END AS salary_range
FROM staff;
SELECT last_name ,
           salary ,
          CASE WHEN salary BETWEEN 0 AND 50000 THEN '0-50000'
               WHEN salary BETWEEN 50001 AND 100000 THEN '50001-100000'
               WHEN salary BETWEEN 100001 AND 120000 THEN '100001-120000'
               WHEN salary IS NULL THEN 'missing'
               ELSE '>120000'
              END AS salary_range,
              COUNT(*) AS range_count
FROM staff
GROUP BY salary_range;
Window functions

Perform an operation across a set of rows that are somehow related to the current row Similar to GROUP BY aggregate functions, but all rows remain in the output.

Uses

  • Fetching values from preceding or following rows (e.g. fetching the previous row’s value) Determining reigning champion status
  • Calculating growth over time
  • Assigning ordinal ranks (1rst, 2nd, etc.) to rows based on their values’ positions in a sorted list
  • Running totals, moving averages

Window functions in SQL AND R

  • Anatomy of a window function
FUNCTION_NAME() OVER (...)
  • ORDER BY
  • PARTITION BY
  • ROWS/RANGE PRECEDING/FOLLOWING/UNBOUNDED
SELECT department, job_title, salary,
ROW_NUMBER() OVER () AS Row_N
FROM staff
WHERE department = 'Computers';

Enter ORDER BY

ORDER BY in OVER orders the rows related to the current row Example: Ordering by salary in descending order in ROW_NUMBER ’s OVER clause will assign 1 to the largest salary salary’s rows

SELECT department, job_title, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS Row_N
FROM staff
WHERE department = 'Computers';

ORDER BY inside OVER takes effect before the one outside

Enter PARTITION BY

  • PARTITION BY splits the table into partitions based on a column’s unique values
  • The results aren’t rolled into one column Operated on separately by the window function ROW_NUMBER will reset for each partition
  • LAG will only fetch a row’s previous value if its previous row is in the same partition
SELECT department, job_title, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS Row_N
FROM staff;

The ranking functions

  • ROW_NUMBER() always assigns unique numbers, even if two rows’ values are the same
  • RANK() assigns the same number to rows with identical values, skipping over the next numbers in such cases
  • DENSE_RANK() also assigns the same number to rows with identical values, but doesn’t skip over the next numbers
SELECT
department, job_title, salary,
ROW_NUMBER() OVER (ORDER BY salary ASC) AS Row_N,
RANK() OVER (ORDER BY salary ASC) AS Rank_N
FROM staff
WHERE
department = 'Computers';