An SQL case study

databases
sql training
data manipulation
Author

Bongani Ncube

Published

4 February 2024

Load in required packages

Read in data

sales<-readr::read_csv("sales.csv")

What variables do we have

names(sales)
[1] "Order ID"         "Product"          "Quantity Ordered" "Price Each"      
[5] "Order Date"       "Purchase Address"
Objectives only for

look at the data types

sapply(sales[1,],class)
        Order ID          Product Quantity Ordered       Price Each 
     "character"      "character"      "character"      "character" 
      Order Date Purchase Address 
     "character"      "character" 

look at missing data

colSums(is.na(sales))
        Order ID          Product Quantity Ordered       Price Each 
             545              545              545              545 
      Order Date Purchase Address 
             545              545 

whats wrong with the data

Note

try to find out where the missing values are coming from , will check the row numbers with missing values.

(to_run<-sales |>
  mutate(id=row_number()) |> 
  relocate(id) |> 
  filter(is.na(`Price Each`)))
# A tibble: 545 × 7
      id `Order ID` Product `Quantity Ordered` `Price Each` `Order Date`
   <int> <chr>      <chr>   <chr>              <chr>        <chr>       
 1   265 <NA>       <NA>    <NA>               <NA>         <NA>        
 2   649 <NA>       <NA>    <NA>               <NA>         <NA>        
 3   681 <NA>       <NA>    <NA>               <NA>         <NA>        
 4  1386 <NA>       <NA>    <NA>               <NA>         <NA>        
 5  1496 <NA>       <NA>    <NA>               <NA>         <NA>        
 6  2130 <NA>       <NA>    <NA>               <NA>         <NA>        
 7  2221 <NA>       <NA>    <NA>               <NA>         <NA>        
 8  3045 <NA>       <NA>    <NA>               <NA>         <NA>        
 9  3243 <NA>       <NA>    <NA>               <NA>         <NA>        
10  3374 <NA>       <NA>    <NA>               <NA>         <NA>        
# ℹ 535 more rows
# ℹ 1 more variable: `Purchase Address` <chr>

now that we have seen that the columns are missing on common rows ,we can now remove all these!

sales<-sales |> 
  filter(!is.na(`Order ID`))

Glimpse at the data

glimpse(sales)
Rows: 186,305
Columns: 6
$ `Order ID`         <chr> "295665", "295666", "295667", "295668", "295669", "…
$ Product            <chr> "Macbook Pro Laptop", "LG Washing Machine", "USB-C …
$ `Quantity Ordered` <chr> "1", "1", "1", "1", "1", "1", "1", "2", "1", "4", "…
$ `Price Each`       <chr> "1700", "600.0", "11.95", "149.99", "11.95", "3.84"…
$ `Order Date`       <chr> "12/30/19 00:01", "12/29/19 07:03", "12/12/19 18:21…
$ `Purchase Address` <chr> "136 Church St, New York City, NY 10001", "562 2nd …

why are price each and order quantity variables characters?

Note

why is price each a character?…from the output of glimpse what can we see?

Looks like there has been some typo in the data as all the column observations contain the column name so let us drop these rows and rename certain columns

sales<-sales |> 
filter(!grepl("Price",`Price Each`,perl = T)) |> #drop unwanted rows
  rename(quantity=`Quantity Ordered`) |> 
  rename(address = `Purchase Address`) |> 
  rename(Price = `Price Each`)

do we have any duplicates?

dim(sales)
[1] 185950      6
Note

we have 186305 rows in our data

sales |> 
  summarise(unique_rows=n_distinct(`Order ID`))
# A tibble: 1 × 1
  unique_rows
        <int>
1      178437
Note

but we have 178438 unique rows … this is because one person could order different products

sales |> 
  janitor::tabyl(quantity) |> 
  as_tibble() |> 
  mutate(percent=round(percent,4)) 
# A tibble: 9 × 3
  quantity      n percent
  <chr>     <int>   <dbl>
1 1        168552  0.906 
2 2         13324  0.0717
3 3          2920  0.0157
4 4           806  0.0043
5 5           236  0.0013
6 6            80  0.0004
7 7            24  0.0001
8 8             5  0     
9 9             3  0     
  • well,people have a tendency of ordering what`s enough for them so its not surprising that most people would order one thing

look at class again

sapply(sales[1,],class)
   Order ID     Product    quantity       Price  Order Date     address 
"character" "character" "character" "character" "character" "character" 

Extracting more information from the data

Note

let’s begin off by using R to achieve this and then later on use SQL for this!

lets look at address column

sales |> 
  relocate(address)
# A tibble: 185,950 × 6
   address                        `Order ID` Product quantity Price `Order Date`
   <chr>                          <chr>      <chr>   <chr>    <chr> <chr>       
 1 136 Church St, New York City,… 295665     Macboo… 1        1700  12/30/19 00…
 2 562 2nd St, New York City, NY… 295666     LG Was… 1        600.0 12/29/19 07…
 3 277 Main St, New York City, N… 295667     USB-C … 1        11.95 12/12/19 18…
 4 410 6th St, San Francisco, CA… 295668     27in F… 1        149.… 12/22/19 15…
 5 43 Hill St, Atlanta, GA 30301  295669     USB-C … 1        11.95 12/18/19 12…
 6 200 Jefferson St, New York Ci… 295670     AA Bat… 1        3.84  12/31/19 22…
 7 928 12th St, Portland, OR 970… 295671     USB-C … 1        11.95 12/16/19 15…
 8 813 Hickory St, Dallas, TX 75… 295672     USB-C … 2        11.95 12/13/19 09…
 9 718 Wilson St, Dallas, TX 750… 295673     Bose S… 1        99.99 12/15/19 23…
10 77 7th St, Dallas, TX 75001    295674     AAA Ba… 4        2.99  12/28/19 11…
# ℹ 185,940 more rows
Note

one should note that this column contains useful information such as the street , city and zip . we need to extract the city and create a column for this!

  • we can use the function separate() to achieve this!
(sales_df<-sales |> 
  separate(address,into=c("address1","city","code"),",") |> 
  mutate_at(c(3,4),as.numeric) |> 
  mutate_if(is.character,as.factor) |> 
  relocate(c("address1","city","code")))
# A tibble: 185,950 × 8
   address1         city   code  `Order ID` Product quantity  Price `Order Date`
   <fct>            <fct>  <fct> <fct>      <fct>      <dbl>  <dbl> <fct>       
 1 136 Church St    " New… " NY… 295665     Macboo…        1 1.7 e3 12/30/19 00…
 2 562 2nd St       " New… " NY… 295666     LG Was…        1 6   e2 12/29/19 07…
 3 277 Main St      " New… " NY… 295667     USB-C …        1 1.20e1 12/12/19 18…
 4 410 6th St       " San… " CA… 295668     27in F…        1 1.50e2 12/22/19 15…
 5 43 Hill St       " Atl… " GA… 295669     USB-C …        1 1.20e1 12/18/19 12…
 6 200 Jefferson St " New… " NY… 295670     AA Bat…        1 3.84e0 12/31/19 22…
 7 928 12th St      " Por… " OR… 295671     USB-C …        1 1.20e1 12/16/19 15…
 8 813 Hickory St   " Dal… " TX… 295672     USB-C …        2 1.20e1 12/13/19 09…
 9 718 Wilson St    " Dal… " TX… 295673     Bose S…        1 1.00e2 12/15/19 23…
10 77 7th St        " Dal… " TX… 295674     AAA Ba…        4 2.99e0 12/28/19 11…
# ℹ 185,940 more rows
Note

nice , now we have the variable city standing on its own!

  • now I can store this as a database! for comparisons sake

Storing as database

sales_df<-sales_df |> 
  mutate(Date=mdy_hm(`Order Date`),
         Date1=as.character(Date))

sold<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(sold,sales_df)