working with strings in SQL

Author

Bongani Ncube

Published

4 February 2024

introduction

most data cleaning processes involve working with structured and unstructured character/string datatypes . the ability to manipulate string data can be a super power

Load in the necessary packages

library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)

create a fake dataset in R using `tribble()`

original_table<-tribble(~country,~points,
                        "US", 5,
                        "U.S.",6,
                        "US (United States)",7,
                        "us",8,
                        "  US",9,
                        "in Zimbabwe",10,
                        "Zimba.bwe",12,
                        "Great Zimbabwe",10,
                        "Zimbabwean",7,
                        "ZIMBABWE",5,
                        "zimbabwe country",9)
original_table

# A tibble: 11 × 2
   country              points
   <chr>                 <dbl>
 1 "US"                      5
 2 "U.S."                    6
 3 "US (United States)"      7
 4 "us"                      8
 5 "  US"                    9
 6 "in Zimbabwe"            10
 7 "Zimba.bwe"              12
 8 "Great Zimbabwe"         10
 9 "Zimbabwean"              7
10 "ZIMBABWE"                5
11 "zimbabwe country"        9

Set up a database using the datafile

window<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(window,original_table)

common issues

‘apple’ $\neq$ ‘Apple’
’’ $\neq$ NULL
’ apple’ $\neq$ ‘apple’

SELECT *
FROM original_table;

Displaying records 1 - 10
country	points
US	5
U.S.	6
US (United States)	7
us	8
US	9
in Zimbabwe	10
Zimba.bwe	12
Great Zimbabwe	10
Zimbabwean	7
ZIMBABWE	5

the main goal here would be to consolidate the above character strings since the following can be noted in our dataset:

typing errors
capitalised names and non capitalised names .
same name with different characters typed
spaces and dots

`REPLACE()`

The REPLACE() function is a versatile function that allows you to replace or remove characters from a string. The syntax is as follows:

REPLACE(fieldName, 'searchFor', 'replaceWith')

SELECT country , REPLACE(country,'.','') AS country_cleaned,points
FROM original_table;

Displaying records 1 - 10
country	country_cleaned	points
US	US	5
U.S.	US	6
US (United States)	US (United States)	7
us	us	8
US	US	9
in Zimbabwe	in Zimbabwe	10
Zimba.bwe	Zimbabwe	12
Great Zimbabwe	Great Zimbabwe	10
Zimbabwean	Zimbabwean	7
ZIMBABWE	ZIMBABWE	5

great way to start , we can clearly compare the two columns and note that we have changed U.S. to US and Zimba.bwe to Zimbabwe

if we use this result with a SUM and GROUP BY it kinda reduces our burden though not that much.

SELECT REPLACE(country,'.','') AS country_cleaned,
       SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;

Displaying records 1 - 10
country_cleaned	points
US	9
Great Zimbabwe	10
US	11
US (United States)	7
ZIMBABWE	5
Zimbabwe	12
Zimbabwean	7
in Zimbabwe	10
us	8
zimbabwe country	9

great ,but still we have a long way to go. whats going on with US? ,are the strings the same?.. no.

`TRIM()`

as we have noted above ,the other US has blank spaces inbetween, TRIM allows us to remove unnecessary blank spaces. Let us use TRIM and REPLACE together with aggregate functions

SELECT country , 
       REPLACE(TRIM(country),'.','') AS country_cleaned,
       SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;

9 records
country	country_cleaned	total_points
Great Zimbabwe	Great Zimbabwe	10
US	US	20
US (United States)	US (United States)	7
ZIMBABWE	ZIMBABWE	5
Zimba.bwe	Zimbabwe	12
Zimbabwean	Zimbabwean	7
in Zimbabwe	in Zimbabwe	10
us	us	8
zimbabwe country	zimbabwe country	9

its now like a step in the right direction , the only problem now is ,us and US(United States)

`UPPER() and LOWER()`

These change strings to capital or small letters , Let us use UPPER(), TRIM and REPLACE together with aggregate functions .

SELECT UPPER(country) AS country_cleaned ,points
FROM original_table;

Displaying records 1 - 10
country_cleaned	points
US	5
U.S.	6
US (UNITED STATES)	7
US	8
US	9
IN ZIMBABWE	10
ZIMBA.BWE	12
GREAT ZIMBABWE	10
ZIMBABWEAN	7
ZIMBABWE	5

we nolonger have us character ,do we?… naah

SELECT country , 
       REPLACE(TRIM(UPPER(country)),'.','') AS country_cleaned,
       SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;

7 records
country	country_cleaned	total_points
Great Zimbabwe	GREAT ZIMBABWE	10
in Zimbabwe	IN ZIMBABWE	10
US	US	28
US (United States)	US (UNITED STATES)	7
Zimba.bwe	ZIMBABWE	17
zimbabwe country	ZIMBABWE COUNTRY	9
Zimbabwean	ZIMBABWEAN	7

By using the above QUERY ,Zimba.bwe and Zimbabwe now turned to ZIMBABWE while US,us, US and U.S. changed to US but our problem is not yet solved because US(United States) has not been included.

`SUBSTR(string,val1,val2)` ,`RIGHT` and `LEFT`

since we are using SQLite in R we tend to use SUBSTR() and not LEFT or RIGHT

SELECT SUBSTR(country,1,2) AS country_cleaned,points
FROM original_table;

Displaying records 1 - 10
country_cleaned	points
US	5
U.	6
US	7
us	8
	9
in	10
Zi	12
Gr	10
Zi	7
ZI	5

Let us use SUBSTR(), UPPER(), TRIM() and REPLACE() together with aggregate functions

SELECT SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) AS country_cleaned,
        SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;

4 records
country_cleaned	points
GR	10
IN	10
US	35
ZI	33

We awe almost safe now since we have managed to combine some strings into US only but lets see if we can wrap it all together and solve our problem

LIKE

the LIKE clause can be used in a WHERE CLAUSE to search for a pattern in a column.

the % wildcard will match zero,one or many characters in text.

let us wrap all this up in a CASE statement.

SELECT CASE 
    WHEN SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) = 'US' 
    THEN 'United States'
    WHEN LOWER(country) LIKE '%zimb%'
    THEN 'Zimbabwe' END AS country_cleaned,
        SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;

2 records
country_cleaned	total_points
United States	35
Zimbabwe	53

Great

introduction

Load in the necessary packages

create a fake dataset in R using tribble()

Set up a database using the datafile

common issues

REPLACE()

TRIM()

UPPER() and LOWER()