most data cleaning processes involve working with structured and unstructured character/string datatypes . the ability to manipulate string data can be a super power
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)tribble()original_table<-tribble(~country,~points,
"US", 5,
"U.S.",6,
"US (United States)",7,
"us",8,
" US",9,
"in Zimbabwe",10,
"Zimba.bwe",12,
"Great Zimbabwe",10,
"Zimbabwean",7,
"ZIMBABWE",5,
"zimbabwe country",9)
original_tablewindow<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(window,original_table)SELECT *
FROM original_table;| country | points |
|---|---|
| US | 5 |
| U.S. | 6 |
| US (United States) | 7 |
| us | 8 |
| US | 9 |
| in Zimbabwe | 10 |
| Zimba.bwe | 12 |
| Great Zimbabwe | 10 |
| Zimbabwean | 7 |
| ZIMBABWE | 5 |
the main goal here would be to consolidate the above character strings since the following can be noted in our dataset:
REPLACE()REPLACE() function is a versatile function that
allows you to replace or remove characters from a string. The syntax is
as follows:
REPLACE(fieldName, 'searchFor', 'replaceWith')
SELECT country , REPLACE(country,'.','') AS country_cleaned,points
FROM original_table;| country | country_cleaned | points |
|---|---|---|
| US | US | 5 |
| U.S. | US | 6 |
| US (United States) | US (United States) | 7 |
| us | us | 8 |
| US | US | 9 |
| in Zimbabwe | in Zimbabwe | 10 |
| Zimba.bwe | Zimbabwe | 12 |
| Great Zimbabwe | Great Zimbabwe | 10 |
| Zimbabwean | Zimbabwean | 7 |
| ZIMBABWE | ZIMBABWE | 5 |
great way to start , we can clearly compare the two columns and note that we have changed
U.S.toUSandZimba.bwe to Zimbabwe
SUM and
GROUP BY it kinda reduces our burden though not that
much.SELECT REPLACE(country,'.','') AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;| country_cleaned | points |
|---|---|
| US | 9 |
| Great Zimbabwe | 10 |
| US | 11 |
| US (United States) | 7 |
| ZIMBABWE | 5 |
| Zimbabwe | 12 |
| Zimbabwean | 7 |
| in Zimbabwe | 10 |
| us | 8 |
| zimbabwe country | 9 |
great ,but still we have a long way to go. whats going on with
US? ,are the strings the same?.. no.
TRIM()US has blank spaces
inbetween, TRIM allows us to remove unnecessary blank
spaces. Let us use TRIM and REPLACE together
with aggregate functionsSELECT country ,
REPLACE(TRIM(country),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;| country | country_cleaned | total_points |
|---|---|---|
| Great Zimbabwe | Great Zimbabwe | 10 |
| US | US | 20 |
| US (United States) | US (United States) | 7 |
| ZIMBABWE | ZIMBABWE | 5 |
| Zimba.bwe | Zimbabwe | 12 |
| Zimbabwean | Zimbabwean | 7 |
| in Zimbabwe | in Zimbabwe | 10 |
| us | us | 8 |
| zimbabwe country | zimbabwe country | 9 |
its now like a step in the right direction , the only problem now is ,
usandUS(United States)
UPPER() and LOWER()UPPER(), TRIM and REPLACE
together with aggregate functions .SELECT UPPER(country) AS country_cleaned ,points
FROM original_table;| country_cleaned | points |
|---|---|
| US | 5 |
| U.S. | 6 |
| US (UNITED STATES) | 7 |
| US | 8 |
| US | 9 |
| IN ZIMBABWE | 10 |
| ZIMBA.BWE | 12 |
| GREAT ZIMBABWE | 10 |
| ZIMBABWEAN | 7 |
| ZIMBABWE | 5 |
we nolonger have
uscharacter ,do we?… naah
SELECT country ,
REPLACE(TRIM(UPPER(country)),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;| country | country_cleaned | total_points |
|---|---|---|
| Great Zimbabwe | GREAT ZIMBABWE | 10 |
| in Zimbabwe | IN ZIMBABWE | 10 |
| US | US | 28 |
| US (United States) | US (UNITED STATES) | 7 |
| Zimba.bwe | ZIMBABWE | 17 |
| zimbabwe country | ZIMBABWE COUNTRY | 9 |
| Zimbabwean | ZIMBABWEAN | 7 |
By using the above QUERY ,
Zimba.bweandZimbabwenow turned toZIMBABWEwhileUS,us, US and U.S.changed toUSbut our problem is not yet solved becauseUS(United States)has not been included.
SUBSTR(string,val1,val2) ,RIGHT and
LEFTsince we are using
SQLitein R we tend to useSUBSTR()and notLEFT or RIGHT
SELECT SUBSTR(country,1,2) AS country_cleaned,points
FROM original_table;| country_cleaned | points |
|---|---|
| US | 5 |
| U. | 6 |
| US | 7 |
| us | 8 |
| 9 | |
| in | 10 |
| Zi | 12 |
| Gr | 10 |
| Zi | 7 |
| ZI | 5 |
SUBSTR(), UPPER(),
TRIM() and REPLACE() together with aggregate
functionsSELECT SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;| country_cleaned | points |
|---|---|
| GR | 10 |
| IN | 10 |
| US | 35 |
| ZI | 33 |
We awe almost safe now since we have managed to combine some strings into
USonly but lets see if we can wrap it all together and solve our problem
LIKE clause can be used in a WHERE
CLAUSE to search for a pattern in a column.the
%wildcard will match zero,one or many characters in text.
CASE statement.SELECT CASE
WHEN SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) = 'US'
THEN 'United States'
WHEN LOWER(country) LIKE '%zimb%'
THEN 'Zimbabwe' END AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;| country_cleaned | total_points |
|---|---|
| United States | 35 |
| Zimbabwe | 53 |
Great