working with strings in SQL
introduction
most data cleaning processes involve working with structured and unstructured character/string datatypes . the ability to manipulate string data can be a super power
Load in the necessary packages
create a fake dataset in R using tribble()
original_table<-tribble(~country,~points,
"US", 5,
"U.S.",6,
"US (United States)",7,
"us",8,
" US",9,
"in Zimbabwe",10,
"Zimba.bwe",12,
"Great Zimbabwe",10,
"Zimbabwean",7,
"ZIMBABWE",5,
"zimbabwe country",9)
original_table
# A tibble: 11 × 2
country points
<chr> <dbl>
1 "US" 5
2 "U.S." 6
3 "US (United States)" 7
4 "us" 8
5 " US" 9
6 "in Zimbabwe" 10
7 "Zimba.bwe" 12
8 "Great Zimbabwe" 10
9 "Zimbabwean" 7
10 "ZIMBABWE" 5
11 "zimbabwe country" 9
Set up a database using the datafile
common issues
SELECT *
FROM original_table;
country | points |
---|---|
US | 5 |
U.S. | 6 |
US (United States) | 7 |
us | 8 |
US | 9 |
in Zimbabwe | 10 |
Zimba.bwe | 12 |
Great Zimbabwe | 10 |
Zimbabwean | 7 |
ZIMBABWE | 5 |
the main goal here would be to consolidate the above character strings since the following can be noted in our dataset:
REPLACE()
- The
REPLACE()
function is a versatile function that allows you to replace or remove characters from a string. The syntax is as follows:
REPLACE(fieldName, 'searchFor', 'replaceWith')
SELECT country , REPLACE(country,'.','') AS country_cleaned,points
FROM original_table;
country | country_cleaned | points |
---|---|---|
US | US | 5 |
U.S. | US | 6 |
US (United States) | US (United States) | 7 |
us | us | 8 |
US | US | 9 |
in Zimbabwe | in Zimbabwe | 10 |
Zimba.bwe | Zimbabwe | 12 |
Great Zimbabwe | Great Zimbabwe | 10 |
Zimbabwean | Zimbabwean | 7 |
ZIMBABWE | ZIMBABWE | 5 |
great way to start , we can clearly compare the two columns and note that we have changed
U.S.
toUS
andZimba.bwe to Zimbabwe
- if we use this result with a
SUM
andGROUP BY
it kinda reduces our burden though not that much.
SELECT REPLACE(country,'.','') AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;
country_cleaned | points |
---|---|
US | 9 |
Great Zimbabwe | 10 |
US | 11 |
US (United States) | 7 |
ZIMBABWE | 5 |
Zimbabwe | 12 |
Zimbabwean | 7 |
in Zimbabwe | 10 |
us | 8 |
zimbabwe country | 9 |
great ,but still we have a long way to go. whats going on with
US
? ,are the strings the same?.. no.
TRIM()
- as we have noted above ,the other
US
has blank spaces inbetween,TRIM
allows us to remove unnecessary blank spaces. Let us useTRIM
andREPLACE
together with aggregate functions
SELECT country ,
REPLACE(TRIM(country),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
country | country_cleaned | total_points |
---|---|---|
Great Zimbabwe | Great Zimbabwe | 10 |
US | US | 20 |
US (United States) | US (United States) | 7 |
ZIMBABWE | ZIMBABWE | 5 |
Zimba.bwe | Zimbabwe | 12 |
Zimbabwean | Zimbabwean | 7 |
in Zimbabwe | in Zimbabwe | 10 |
us | us | 8 |
zimbabwe country | zimbabwe country | 9 |
its now like a step in the right direction , the only problem now is ,
us
andUS(United States)
UPPER() and LOWER()
- These change strings to capital or small letters , Let us use
UPPER()
,TRIM
andREPLACE
together with aggregate functions .
SELECT UPPER(country) AS country_cleaned ,points
FROM original_table;
country_cleaned | points |
---|---|
US | 5 |
U.S. | 6 |
US (UNITED STATES) | 7 |
US | 8 |
US | 9 |
IN ZIMBABWE | 10 |
ZIMBA.BWE | 12 |
GREAT ZIMBABWE | 10 |
ZIMBABWEAN | 7 |
ZIMBABWE | 5 |
we nolonger have
us
character ,do we?… naah
SELECT country ,
REPLACE(TRIM(UPPER(country)),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
country | country_cleaned | total_points |
---|---|---|
Great Zimbabwe | GREAT ZIMBABWE | 10 |
in Zimbabwe | IN ZIMBABWE | 10 |
US | US | 28 |
US (United States) | US (UNITED STATES) | 7 |
Zimba.bwe | ZIMBABWE | 17 |
zimbabwe country | ZIMBABWE COUNTRY | 9 |
Zimbabwean | ZIMBABWEAN | 7 |
By using the above QUERY ,
Zimba.bwe
andZimbabwe
now turned toZIMBABWE
whileUS,us, US and U.S.
changed toUS
but our problem is not yet solved becauseUS(United States)
has not been included.
SUBSTR(string,val1,val2)
,RIGHT
and LEFT
since we are using
SQLite
in R we tend to useSUBSTR()
and notLEFT or RIGHT
SELECT SUBSTR(country,1,2) AS country_cleaned,points
FROM original_table;
country_cleaned | points |
---|---|
US | 5 |
U. | 6 |
US | 7 |
us | 8 |
9 | |
in | 10 |
Zi | 12 |
Gr | 10 |
Zi | 7 |
ZI | 5 |
- Let us use
SUBSTR()
,UPPER()
,TRIM()
andREPLACE()
together with aggregate functions
SELECT SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;
country_cleaned | points |
---|---|
GR | 10 |
IN | 10 |
US | 35 |
ZI | 33 |
We awe almost safe now since we have managed to combine some strings into
US
only but lets see if we can wrap it all together and solve our problem
LIKE
- the
LIKE
clause can be used in aWHERE
CLAUSE to search for a pattern in a column.
the
%
wildcard will match zero,one or many characters in text.
SELECT CASE
WHEN SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) = 'US'
THEN 'United States'
WHEN LOWER(country) LIKE '%zimb%'
THEN 'Zimbabwe' END AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
country_cleaned | total_points |
---|---|
United States | 35 |
Zimbabwe | 53 |
Great