most data cleaning processes involve working with structured and unstructured character/string datatypes . the ability to manipulate string data can be a super power
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)
tribble()
<-tribble(~country,~points,
original_table"US", 5,
"U.S.",6,
"US (United States)",7,
"us",8,
" US",9,
"in Zimbabwe",10,
"Zimba.bwe",12,
"Great Zimbabwe",10,
"Zimbabwean",7,
"ZIMBABWE",5,
"zimbabwe country",9)
original_table
country <chr> | points <dbl> | |||
---|---|---|---|---|
US | 5 | |||
U.S. | 6 | |||
US (United States) | 7 | |||
us | 8 | |||
US | 9 | |||
in Zimbabwe | 10 | |||
Zimba.bwe | 12 | |||
Great Zimbabwe | 10 | |||
Zimbabwean | 7 | |||
ZIMBABWE | 5 |
<-dbConnect(RSQLite::SQLite(), ":memory:")
windowcopy_to(window,original_table)
SELECT *
FROM original_table;
country | points |
---|---|
US | 5 |
U.S. | 6 |
US (United States) | 7 |
us | 8 |
US | 9 |
in Zimbabwe | 10 |
Zimba.bwe | 12 |
Great Zimbabwe | 10 |
Zimbabwean | 7 |
ZIMBABWE | 5 |
the main goal here would be to consolidate the above character strings since the following can be noted in our dataset:
REPLACE()
REPLACE()
function is a versatile function that
allows you to replace or remove characters from a string. The syntax is
as follows:
REPLACE(fieldName, 'searchFor', 'replaceWith')
SELECT country , REPLACE(country,'.','') AS country_cleaned,points
FROM original_table;
country | country_cleaned | points |
---|---|---|
US | US | 5 |
U.S. | US | 6 |
US (United States) | US (United States) | 7 |
us | us | 8 |
US | US | 9 |
in Zimbabwe | in Zimbabwe | 10 |
Zimba.bwe | Zimbabwe | 12 |
Great Zimbabwe | Great Zimbabwe | 10 |
Zimbabwean | Zimbabwean | 7 |
ZIMBABWE | ZIMBABWE | 5 |
great way to start , we can clearly compare the two columns and note that we have changed
U.S.
toUS
andZimba.bwe to Zimbabwe
SUM
and
GROUP BY
it kinda reduces our burden though not that
much.SELECT REPLACE(country,'.','') AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;
country_cleaned | points |
---|---|
US | 9 |
Great Zimbabwe | 10 |
US | 11 |
US (United States) | 7 |
ZIMBABWE | 5 |
Zimbabwe | 12 |
Zimbabwean | 7 |
in Zimbabwe | 10 |
us | 8 |
zimbabwe country | 9 |
great ,but still we have a long way to go. whats going on with
US
? ,are the strings the same?.. no.
TRIM()
US
has blank spaces
inbetween, TRIM
allows us to remove unnecessary blank
spaces. Let us use TRIM
and REPLACE
together
with aggregate functionsSELECT country ,
REPLACE(TRIM(country),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
country | country_cleaned | total_points |
---|---|---|
Great Zimbabwe | Great Zimbabwe | 10 |
US | US | 20 |
US (United States) | US (United States) | 7 |
ZIMBABWE | ZIMBABWE | 5 |
Zimba.bwe | Zimbabwe | 12 |
Zimbabwean | Zimbabwean | 7 |
in Zimbabwe | in Zimbabwe | 10 |
us | us | 8 |
zimbabwe country | zimbabwe country | 9 |
its now like a step in the right direction , the only problem now is ,
us
andUS(United States)
UPPER() and LOWER()
UPPER()
, TRIM
and REPLACE
together with aggregate functions .SELECT UPPER(country) AS country_cleaned ,points
FROM original_table;
country_cleaned | points |
---|---|
US | 5 |
U.S. | 6 |
US (UNITED STATES) | 7 |
US | 8 |
US | 9 |
IN ZIMBABWE | 10 |
ZIMBA.BWE | 12 |
GREAT ZIMBABWE | 10 |
ZIMBABWEAN | 7 |
ZIMBABWE | 5 |
we nolonger have
us
character ,do we?… naah
SELECT country ,
REPLACE(TRIM(UPPER(country)),'.','') AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
country | country_cleaned | total_points |
---|---|---|
Great Zimbabwe | GREAT ZIMBABWE | 10 |
in Zimbabwe | IN ZIMBABWE | 10 |
US | US | 28 |
US (United States) | US (UNITED STATES) | 7 |
Zimba.bwe | ZIMBABWE | 17 |
zimbabwe country | ZIMBABWE COUNTRY | 9 |
Zimbabwean | ZIMBABWEAN | 7 |
By using the above QUERY ,
Zimba.bwe
andZimbabwe
now turned toZIMBABWE
whileUS,us, US and U.S.
changed toUS
but our problem is not yet solved becauseUS(United States)
has not been included.
SUBSTR(string,val1,val2)
,RIGHT
and
LEFT
since we are using
SQLite
in R we tend to useSUBSTR()
and notLEFT or RIGHT
SELECT SUBSTR(country,1,2) AS country_cleaned,points
FROM original_table;
country_cleaned | points |
---|---|
US | 5 |
U. | 6 |
US | 7 |
us | 8 |
9 | |
in | 10 |
Zi | 12 |
Gr | 10 |
Zi | 7 |
ZI | 5 |
SUBSTR()
, UPPER()
,
TRIM()
and REPLACE()
together with aggregate
functionsSELECT SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) AS country_cleaned,
SUM(points) AS points
FROM original_table
GROUP BY country_cleaned;
country_cleaned | points |
---|---|
GR | 10 |
IN | 10 |
US | 35 |
ZI | 33 |
We awe almost safe now since we have managed to combine some strings into
US
only but lets see if we can wrap it all together and solve our problem
LIKE
clause can be used in a WHERE
CLAUSE to search for a pattern in a column.the
%
wildcard will match zero,one or many characters in text.
CASE
statement.SELECT CASE
WHEN SUBSTR(TRIM(REPLACE(UPPER(country),'.','')),1,2) = 'US'
THEN 'United States'
WHEN LOWER(country) LIKE '%zimb%'
THEN 'Zimbabwe' END AS country_cleaned,
SUM(points) AS total_points
FROM original_table
GROUP BY country_cleaned;
country_cleaned | total_points |
---|---|
United States | 35 |
Zimbabwe | 53 |
Great