working with strings in SQL

databases
sql training
data manipulation
Author

Bongani Ncube

Published

4 February 2024

introduction

most data cleaning processes involve working with structured and unstructured character/string datatypes . the ability to manipulate string data can be a super power

Load in the necessary packages

create a fake dataset in R using tribble()

original_table<-tribble(~country,~points,
                        "US", 5,
                        "U.S.",6,
                        "US (United States)",7,
                        "us",8,
                        "  US",9,
                        "in Zimbabwe",10,
                        "Zimba.bwe",12,
                        "Great Zimbabwe",10,
                        "Zimbabwean",7,
                        "ZIMBABWE",5,
                        "zimbabwe country",9)
original_table
# A tibble: 11 × 2
   country              points
   <chr>                 <dbl>
 1 "US"                      5
 2 "U.S."                    6
 3 "US (United States)"      7
 4 "us"                      8
 5 "  US"                    9
 6 "in Zimbabwe"            10
 7 "Zimba.bwe"              12
 8 "Great Zimbabwe"         10
 9 "Zimbabwean"              7
10 "ZIMBABWE"                5
11 "zimbabwe country"        9

Set up a database using the datafile

window<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(window,original_table)

common issues

SELECT *
FROM original_table;
Displaying records 1 - 10
country points
US 5
U.S. 6
US (United States) 7
us 8
US 9
in Zimbabwe 10
Zimba.bwe 12
Great Zimbabwe 10
Zimbabwean 7
ZIMBABWE 5

the main goal here would be to consolidate the above character strings since the following can be noted in our dataset:

REPLACE()

  • The REPLACE() function is a versatile function that allows you to replace or remove characters from a string. The syntax is as follows:

REPLACE(fieldName, 'searchFor', 'replaceWith')

SELECT country , REPLACE(country,'.','') AS country_cleaned,points
FROM original_table;
Displaying records 1 - 10
country country_cleaned points
US US 5
U.S. US 6
US (United States) US (United States) 7
us us 8
US US 9
in Zimbabwe in Zimbabwe 10
Zimba.bwe Zimbabwe 12
Great Zimbabwe Great Zimbabwe 10
Zimbabwean Zimbabwean 7
ZIMBABWE ZIMBABWE 5

great way to start , we can clearly compare the two columns and note that we have changed U.S. to US and Zimba.bwe to Zimbabwe

  • if we use this result with a SUM and GROUP BY it kinda reduces our burden though not that much.
SELECT REPLACE(country,'.','') AS country_cleaned,
       SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;
Displaying records 1 - 10
country_cleaned points
US 9
Great Zimbabwe 10
US 11
US (United States) 7
ZIMBABWE 5
Zimbabwe 12
Zimbabwean 7
in Zimbabwe 10
us 8
zimbabwe country 9

great ,but still we have a long way to go. whats going on with US? ,are the strings the same?.. no.

TRIM()

  • as we have noted above ,the other US has blank spaces inbetween, TRIM allows us to remove unnecessary blank spaces. Let us use TRIM and REPLACE together with aggregate functions
SELECT country , 
       REPLACE(TRIM(country),'.','') AS country_cleaned,
       SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
9 records
country country_cleaned total_points
Great Zimbabwe Great Zimbabwe 10
US US 20
US (United States) US (United States) 7
ZIMBABWE ZIMBABWE 5
Zimba.bwe Zimbabwe 12
Zimbabwean Zimbabwean 7
in Zimbabwe in Zimbabwe 10
us us 8
zimbabwe country zimbabwe country 9

its now like a step in the right direction , the only problem now is ,us and US(United States)

UPPER() and LOWER()

  • These change strings to capital or small letters , Let us use UPPER(), TRIM and REPLACE together with aggregate functions .
SELECT UPPER(country) AS country_cleaned ,points
FROM original_table;
Displaying records 1 - 10
country_cleaned points
US 5
U.S. 6
US (UNITED STATES) 7
US 8
US 9
IN ZIMBABWE 10
ZIMBA.BWE 12
GREAT ZIMBABWE 10
ZIMBABWEAN 7
ZIMBABWE 5

we nolonger have us character ,do we?… naah

SELECT country , 
       REPLACE(TRIM(UPPER(country)),'.','') AS country_cleaned,
       SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
7 records
country country_cleaned total_points
Great Zimbabwe GREAT ZIMBABWE 10
in Zimbabwe IN ZIMBABWE 10
US US 28
US (United States) US (UNITED STATES) 7
Zimba.bwe ZIMBABWE 17
zimbabwe country ZIMBABWE COUNTRY 9
Zimbabwean ZIMBABWEAN 7

By using the above QUERY ,Zimba.bwe and Zimbabwe now turned to ZIMBABWE while US,us, US and U.S. changed to US but our problem is not yet solved because US(United States) has not been included.

SUBSTR(string,val1,val2) ,RIGHT and LEFT

since we are using SQLite in R we tend to use SUBSTR() and not LEFT or RIGHT

SELECT SUBSTR(country,1,2) AS country_cleaned,points
FROM original_table;
Displaying records 1 - 10
country_cleaned points
US 5
U. 6
US 7
us 8
9
in 10
Zi 12
Gr 10
Zi 7
ZI 5
  • Let us use SUBSTR(), UPPER(), TRIM() and REPLACE() together with aggregate functions
SELECT SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) AS country_cleaned,
        SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;
4 records
country_cleaned points
GR 10
IN 10
US 35
ZI 33

We awe almost safe now since we have managed to combine some strings into US only but lets see if we can wrap it all together and solve our problem

LIKE

  • the LIKE clause can be used in a WHERE CLAUSE to search for a pattern in a column.

the % wildcard will match zero,one or many characters in text.

SELECT CASE 
    WHEN SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) = 'US' 
    THEN 'United States'
    WHEN LOWER(country) LIKE '%zimb%'
    THEN 'Zimbabwe' END AS country_cleaned,
        SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
2 records
country_cleaned total_points
United States 35
Zimbabwe 53

Great