working with strings in SQL
introduction
most data cleaning processes involve working with structured and unstructured character/string datatypes . the ability to manipulate string data can be a super power
Load in the necessary packages
create a fake dataset in R using tribble()
original_table<-tribble(~country,~points,
"US", 5,
"U.S.",6,
"US (United States)",7,
"us",8,
" US",9,
"in Zimbabwe",10,
"Zimba.bwe",12,
"Great Zimbabwe",10,
"Zimbabwean",7,
"ZIMBABWE",5,
"zimbabwe country",9)
original_table# A tibble: 11 × 2
country points
<chr> <dbl>
1 "US" 5
2 "U.S." 6
3 "US (United States)" 7
4 "us" 8
5 " US" 9
6 "in Zimbabwe" 10
7 "Zimba.bwe" 12
8 "Great Zimbabwe" 10
9 "Zimbabwean" 7
10 "ZIMBABWE" 5
11 "zimbabwe country" 9
Set up a database using the datafile
common issues
SELECT *
FROM original_table;| country | points |
|---|---|
| US | 5 |
| U.S. | 6 |
| US (United States) | 7 |
| us | 8 |
| US | 9 |
| in Zimbabwe | 10 |
| Zimba.bwe | 12 |
| Great Zimbabwe | 10 |
| Zimbabwean | 7 |
| ZIMBABWE | 5 |
the main goal here would be to consolidate the above character strings since the following can be noted in our dataset:
REPLACE()
- The
REPLACE()function is a versatile function that allows you to replace or remove characters from a string. The syntax is as follows:
REPLACE(fieldName, 'searchFor', 'replaceWith')
SELECT country , REPLACE(country,'.','') AS country_cleaned,points
FROM original_table;| country | country_cleaned | points |
|---|---|---|
| US | US | 5 |
| U.S. | US | 6 |
| US (United States) | US (United States) | 7 |
| us | us | 8 |
| US | US | 9 |
| in Zimbabwe | in Zimbabwe | 10 |
| Zimba.bwe | Zimbabwe | 12 |
| Great Zimbabwe | Great Zimbabwe | 10 |
| Zimbabwean | Zimbabwean | 7 |
| ZIMBABWE | ZIMBABWE | 5 |
great way to start , we can clearly compare the two columns and note that we have changed
U.S.toUSandZimba.bwe to Zimbabwe
- if we use this result with a
SUMandGROUP BYit kinda reduces our burden though not that much.
SELECT REPLACE(country,'.','') AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;| country_cleaned | points |
|---|---|
| US | 9 |
| Great Zimbabwe | 10 |
| US | 11 |
| US (United States) | 7 |
| ZIMBABWE | 5 |
| Zimbabwe | 12 |
| Zimbabwean | 7 |
| in Zimbabwe | 10 |
| us | 8 |
| zimbabwe country | 9 |
great ,but still we have a long way to go. whats going on with
US? ,are the strings the same?.. no.
TRIM()
- as we have noted above ,the other
UShas blank spaces inbetween,TRIMallows us to remove unnecessary blank spaces. Let us useTRIMandREPLACEtogether with aggregate functions
SELECT country ,
REPLACE(TRIM(country),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;| country | country_cleaned | total_points |
|---|---|---|
| Great Zimbabwe | Great Zimbabwe | 10 |
| US | US | 20 |
| US (United States) | US (United States) | 7 |
| ZIMBABWE | ZIMBABWE | 5 |
| Zimba.bwe | Zimbabwe | 12 |
| Zimbabwean | Zimbabwean | 7 |
| in Zimbabwe | in Zimbabwe | 10 |
| us | us | 8 |
| zimbabwe country | zimbabwe country | 9 |
its now like a step in the right direction , the only problem now is ,
usandUS(United States)
UPPER() and LOWER()
- These change strings to capital or small letters , Let us use
UPPER(),TRIMandREPLACEtogether with aggregate functions .
SELECT UPPER(country) AS country_cleaned ,points
FROM original_table;| country_cleaned | points |
|---|---|
| US | 5 |
| U.S. | 6 |
| US (UNITED STATES) | 7 |
| US | 8 |
| US | 9 |
| IN ZIMBABWE | 10 |
| ZIMBA.BWE | 12 |
| GREAT ZIMBABWE | 10 |
| ZIMBABWEAN | 7 |
| ZIMBABWE | 5 |
we nolonger have
uscharacter ,do we?… naah
SELECT country ,
REPLACE(TRIM(UPPER(country)),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;| country | country_cleaned | total_points |
|---|---|---|
| Great Zimbabwe | GREAT ZIMBABWE | 10 |
| in Zimbabwe | IN ZIMBABWE | 10 |
| US | US | 28 |
| US (United States) | US (UNITED STATES) | 7 |
| Zimba.bwe | ZIMBABWE | 17 |
| zimbabwe country | ZIMBABWE COUNTRY | 9 |
| Zimbabwean | ZIMBABWEAN | 7 |
By using the above QUERY ,
Zimba.bweandZimbabwenow turned toZIMBABWEwhileUS,us, US and U.S.changed toUSbut our problem is not yet solved becauseUS(United States)has not been included.
SUBSTR(string,val1,val2) ,RIGHT and LEFT
since we are using
SQLitein R we tend to useSUBSTR()and notLEFT or RIGHT
SELECT SUBSTR(country,1,2) AS country_cleaned,points
FROM original_table;| country_cleaned | points |
|---|---|
| US | 5 |
| U. | 6 |
| US | 7 |
| us | 8 |
| 9 | |
| in | 10 |
| Zi | 12 |
| Gr | 10 |
| Zi | 7 |
| ZI | 5 |
- Let us use
SUBSTR(),UPPER(),TRIM()andREPLACE()together with aggregate functions
SELECT SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;| country_cleaned | points |
|---|---|
| GR | 10 |
| IN | 10 |
| US | 35 |
| ZI | 33 |
We awe almost safe now since we have managed to combine some strings into
USonly but lets see if we can wrap it all together and solve our problem
LIKE
- the
LIKEclause can be used in aWHERECLAUSE to search for a pattern in a column.
the
%wildcard will match zero,one or many characters in text.
SELECT CASE
WHEN SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) = 'US'
THEN 'United States'
WHEN LOWER(country) LIKE '%zimb%'
THEN 'Zimbabwe' END AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;| country_cleaned | total_points |
|---|---|
| United States | 35 |
| Zimbabwe | 53 |
Great