This notebook answers some of the most common “How To..” questions
that pop-up during data cleaning.
I’ve also cleaned this notebook in Python. You may check out the Python
version here
on Kaggle.
I hope both versions of this work helps beginners to understand corresponding functions in Python and R.
The Audible Dataset used in this notebook can be found here.
#read data
audible_data = readr::read_csv(paste0(working_dir, "/data/audible_dataset/audible_uncleaned.csv"))## # A tibble: 4 × 8
## name author narrator time releasedate language stars price
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Geronimo Stilton #11 &… Writt… Narrate… 2 hr… 04-08-08 English 5 ou… 468.…
## 2 The Burning Maze Writt… Narrate… 13 h… 01-05-18 English 4.5 … 820.…
## 3 The Deep End Writt… Narrate… 2 hr… 06-11-20 English 4.5 … 410.…
## 4 Daughter of the Deep Writt… Narrate… 11 h… 05-10-21 English 4.5 … 615.…
From the sample data above and information below, we can list out
some inconsistencies in the dataframe.
1. author- “Writtenby:” can be removed; multiple author
names can be placed in different columns.
2. narrator- “Narratedby:” can be removed.
3. time- is of type str; can be converted
to numeric or datetime type. We’ll see what should be done here.
4. releasedate- is of type str; should be
of type date.
5. language- all values should be in uniform case;
first letter capital, rest smallcase.
6. stars- can be split into i) number of stars and ii)
total number of ratings columns.
7. price- is of type str; should be
float.
## Rows: 87,489
## Columns: 8
## $ name <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The Dee…
## $ author <chr> "Writtenby:GeronimoStilton", "Writtenby:RickRiordan", "Wri…
## $ narrator <chr> "Narratedby:BillLobely", "Narratedby:RobbieDaymond", "Narr…
## $ time <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 min…
## $ releasedate <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-10"…
## $ language <chr> "English", "English", "English", "English", "English", "En…
## $ stars <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rating…
## $ price <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.00"…
## name author narrator time
## Length:87489 Length:87489 Length:87489 Length:87489
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## releasedate language stars price
## Length:87489 Length:87489 Length:87489 Length:87489
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
#total count of each unique book
book_count <- audible_data %>%
group_by(name) %>%
summarize(total_count = n()) %>%
arrange(-total_count)
#books with 10 or more observations.
book_count %>% filter(total_count >= 10)## # A tibble: 18 × 2
## name total_count
## <chr> <int>
## 1 The Art of War 20
## 2 Sterling Biographies 19
## 3 Sterling Point Books 16
## 4 The Odyssey 16
## 5 Hamlet 15
## 6 A Christmas Carol 14
## 7 Pride and Prejudice 14
## 8 The Prophet 14
## 9 As a Man Thinketh 13
## 10 The Iliad 13
## 11 The Science of Getting Rich 13
## 12 Abraham Lincoln 12
## 13 The Picture of Dorian Gray 12
## 14 Meditations 11
## 15 The Prince 11
## 16 The Raven 11
## 17 The Richest Man in Babylon 11
## 18 Unstoppable 10
## [1] 0
So, for example, ‘ChandlerBing’ becomes ‘Chandler Bing’.
NOTE:
1. Some rows contains muliple author names. e.g. “ChandlerBing,
RossGeller, PhoebeBuffay”.
2. If str_replace() is used below, space
will be added to only the first author name, ie. “Chandler Bing,
RossGeller, PhoebeBuffay”.
3. Applying str_replace_all() adds space
to all author names, i.e. “Chandler Bing, Ross Geller, Phoebe
Buffay”.
#Add space between the first, middle and last names of Authors.
#e.g. JaneAustin becomes Jane Austin
audible_data$author <- stringr::str_replace_all(audible_data$author,
pattern = "([a-z])([A-Z])",
replacement = "\\1 \\2")
head(audible_data[,'author'], 4)## # A tibble: 4 × 1
## author
## <chr>
## 1 Geronimo Stilton
## 2 Rick Riordan
## 3 Jeff Kinney
## 4 Rick Riordan
str_count() below finds the number of
occurances of comma ‘,’ in each row of author column. Then maximum
number of occurances is found and 1 is added.
So, if there are three commmas in any given row, it implies that row has
4 names in the author column.
And 4 columns will be created to store each author name.
#maximum number of authors for any book.
#helps to determine max number of columns needed to keep indiviudal authors
max_columns <- max(str_count(audible_data$author, ',')) + 1#split the author column.
audible_data <- audible_data %>% separate(col = author,
into = paste0('author', seq_len(max_columns)),
sep = ',',
fill = 'right')## # A tibble: 3 × 11
## name author1 author2 author3 author4 narrator time releasedate language
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 The Innoc… Mark T… <NA> <NA> <NA> Narrate… 19 h… 30-12-16 English
## 2 A Sentime… Lauren… <NA> <NA> <NA> Narrate… 4 hr… 23-02-11 English
## 3 Havana Mark K… <NA> <NA> <NA> Narrate… 6 hr… 07-03-17 English
## # ℹ 2 more variables: stars <chr>, price <chr>
Here, I wanted to see if it’s at all useful to split the author
column or not.
As we can see from the NA value counts below, the newly created columns
author2, author3, author4 are mostly
vacant.
Majority of the books have only 1 author and about 14,000 books have 2
authors.
(Total number of unique books in the data- 82767)
#find total number of NA in colnames containing 'author'
#(author1, author2, author3, author4)
rbind(colSums(is.na(audible_data %>% select(contains('author')))))## author1 author2 author3 author4
## [1,] 0 73762 85135 86713
#make a copy of time column to understand all kind of formats in which data is present
time_column = audible_data$timeAll digits have been removed to see what all phrases are present.
#replace all numbers with blanks
time_column = str_replace_all(time_column,
pattern = '[0-9]', replacement = '')
#keep only unique patterns
unique(time_column)## [1] " hrs and mins" " hrs" " hrs and min"
## [4] " hr and mins" " mins" " hr"
## [7] " hr and min" " min" "Less than minute"
Assumption: All rows with
Less than xx minutes will be replaced with
xx minutes.#find rows with "less than x minute" values in audiobook duration
less_than_duration = grep(pattern = 'less than', audible_data$time, ignore.case = T)
#check all unique values with 'less than' pattern
audible_data[less_than_duration,'time'] %>% distinct()## # A tibble: 1 × 1
## time
## <chr>
## 1 Less than 1 minute
The objective is to:
1. Create 2 columns hour_component and
min_component to store the hour and minute
duration of the audiobook.
2. Examples:
- ‘17 hrs and 5 mins’ will split into 17 in
hour column and 5 in minute column.
- ‘24 mins’ will be stored as 0 in hour column
and 24 in minute column.
- ‘less than 1 minute’ will be stored as 0 in
hour and 1 in minute column.
Regular Expression used:
1. “^-” indicates start matching the pattern from
beginning of the string.
2. “([0-9]+)”- is the capture group. Means look for one
or more digits.
3. + means one or more occurances of digits.
4. “([0-9]+) hr” captures digits before the word hr or
hrs.
5. “([0-9]+) min” captures digits before
min/mins/minutes words.
#extract number of hours
audible_data$hour_component <- stringr::str_extract(audible_data$time,
pattern = '^([0-9]+) hr')
#remove the 'hr' part
audible_data$hour_component <- stringr::str_replace(audible_data$hour_component,
pattern = ' hr', replacement = '')
#extract number of minutes
audible_data$min_component <- stringr::str_extract(audible_data$time, pattern = '([0-9]+) min')
#remove the 'min' part
audible_data$min_component <- stringr::str_replace(audible_data$min_component,
pattern = ' min', replacement = '')## # A tibble: 5 × 3
## time hour_component min_component
## <chr> <chr> <chr>
## 1 2 hrs and 20 mins 2 20
## 2 13 hrs and 8 mins 13 8
## 3 2 hrs and 3 mins 2 3
## 4 11 hrs and 16 mins 11 16
## 5 10 hrs 10 <NA>
## Rows: 87,489
## Columns: 13
## $ name <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The …
## $ author1 <chr> "Geronimo Stilton", "Rick Riordan", "Jeff Kinney", "Ric…
## $ author2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ narrator <chr> "Bill Lobely", "Robbie Daymond", "Dan Russell", "Soneel…
## $ time <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 …
## $ releasedate <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-…
## $ language <chr> "English", "English", "English", "English", "English", …
## $ stars <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rat…
## $ price <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.…
## $ hour_component <chr> "2", "13", "2", "11", "10", "10", "2", "12", "10", "13"…
## $ min_component <chr> "20", "8", "3", "16", NA, "35", "23", "32", "56", "22",…
Now, after digits extraction, values like ‘24
mins’ will keep 24 in minute column and NA in hour
column.
Similarly, a value like ‘11 hrs’ will keep NA
in minutes column.
Below, these NA values have been replaced with 0.
Now, we can change the datatype of hour and minutes from strings to integers.
#convert the hour and min column to int type
audible_data = audible_data %>%
mutate(hour_component = as.integer(hour_component),
min_component = as.integer(min_component))Dates are written in dd-mm-yy format.
They are in character format and will be converted to date type.
But before that, we need to make sure that each of the components are
consistent. That is:
- dd doesn’t exceed 31.
- mm doesn’t exceed 12.
- In the datacard
on Audible dataset page, its mentioned that some books are yet to be
released, which means we can have year value greater than 2023.
## # A tibble: 4 × 1
## releasedate
## <chr>
## 1 04-08-08
## 2 01-05-18
## 3 06-11-20
## 4 05-10-21
Regular expressions used:
1. “^([0-9]+)”- check for the digit from beginning of
string (^) and before hypher -.
2. “-([0-9]+)-” check for the digits between the two
hyphens.
3. “-([0-9]+)$”- capture the digits after hyphen sign
and end of the string.
#day component- dd
#check if any value in day part > 31.
any(
as.integer(
unique(
str_extract(string = audible_data$releasedate, pattern = '^([0-9]+)')
)
) > 31)## [1] FALSE
#check for inconsistencies in month of release date
#Check to see whether middle values (months) contain any number > 12
#all unique months.
#extract all months; E.g. extracted format is: "-05-" for May
unique_months <- unique(str_extract(string = audible_data$releasedate, pattern = '-([0-9]+)-'))
#remove both hyphens
unique_months <- str_replace_all(string = unique_months, pattern = '-', repl = '')
#convert to integer and check if any value exceeds 12 (12 months)
any(as.integer(unique_months) > 12)## [1] FALSE
In the Py version of this notebook here
on Kaggle, I’ve added a note in this part.
If date string is in format: 04-08-23, then separator
used in “format” parameter should be
%d-%m-%Y and not
%d/%m/%Y.
But no such caution was needed here. Only thing to note in case of R
is to make sure the right function is used.
Since the date in our df is of form: dd-mm-yy, hence
dmy() has been used.
If in the date, month was mentioned first, followed by day, then year,
then mdy() would have been the right
function.
## # A tibble: 3 × 2
## name releasedate
## <chr> <chr>
## 1 Geronimo Stilton #11 & #12 04-08-08
## 2 The Burning Maze 01-05-18
## 3 The Deep End 06-11-20
#convert to date type
audible_data$release_date <- lubridate::dmy(audible_data$releasedate)
#view datatype of all cols now
glimpse(audible_data)## Rows: 87,489
## Columns: 14
## $ name <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The …
## $ author1 <chr> "Geronimo Stilton", "Rick Riordan", "Jeff Kinney", "Ric…
## $ author2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ narrator <chr> "Bill Lobely", "Robbie Daymond", "Dan Russell", "Soneel…
## $ time <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 …
## $ releasedate <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-…
## $ language <chr> "English", "English", "English", "English", "English", …
## $ stars <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rat…
## $ price <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.…
## $ hour_component <int> 2, 13, 2, 11, 10, 10, 2, 12, 10, 13, 8, 5, 6, 12, 11, 2…
## $ min_component <int> 20, 8, 3, 16, 0, 35, 23, 32, 56, 22, 48, 23, 1, 58, 55,…
## $ release_date <date> 2008-08-04, 2018-05-01, 2020-11-06, 2021-10-05, 2010-0…
## [1] "English" "Hindi" "spanish" "german"
## [5] "french" "catalan" "swedish" "italian"
## [9] "danish" "finnish" "dutch" "hebrew"
## [13] "russian" "polish" "galician" "afrikaans"
## [17] "icelandic" "romanian" "japanese" "tamil"
## [21] "portuguese" "urdu" "hungarian" "czech"
## [25] "bulgarian" "mandarin_chinese" "basque" "korean"
## [29] "arabic" "greek" "turkish" "ukrainian"
## [33] "slovene" "norwegian" "telugu" "lithuanian"
#capitalize the language names
audible_data$language <- stringr::str_to_title(audible_data$language)The objective is to create 2 columns out of the stars column:
- stars_out_of_5- will contain numeric
values like 2, 4.5 etc.
- total_ratings- total number of reviews
received on the particular audiobook.
total_ratings column, “ratings”
will be removed and values will be converted to float type.## # A tibble: 3 × 2
## name stars
## <chr> <chr>
## 1 Geronimo Stilton #11 & #12 5 out of 5 stars34 ratings
## 2 The Burning Maze 4.5 out of 5 stars41 ratings
## 3 The Deep End 4.5 out of 5 stars38 ratings
#split the stars column after the word 'stars'
#rows with 'Not yet rated' return NA in 'total_ratings' column after separation
audible_data <- audible_data %>%
separate(col = stars,
into = c('stars_out_of_5', 'total_ratings'),
sep = 'stars',
#do not remove the original column
remove = F,
#row containing 'Not yet rated' returns NA.
#NA should be filled starting from right-hand-side column
fill = 'right')#some unique rows in the data
audible_data[sample(nrow(audible_data),40),c('stars', 'stars_out_of_5', 'total_ratings')] %>% distinct()## # A tibble: 6 × 3
## stars stars_out_of_5 total_ratings
## <chr> <chr> <chr>
## 1 Not rated yet "Not rated yet" <NA>
## 2 4.5 out of 5 stars2 ratings "4.5 out of 5 " 2 ratings
## 3 5 out of 5 stars1 rating "5 out of 5 " 1 rating
## 4 4.5 out of 5 stars11 ratings "4.5 out of 5 " 11 ratings
## 5 3 out of 5 stars1 rating "3 out of 5 " 1 rating
## 6 4 out of 5 stars1 rating "4 out of 5 " 1 rating
Out of 87,489 rows in the df, 72,417 contain “Not yet rated” for books that have not been rated yet!
There is no point in removing all these rows and perform an analysis. But I want to showcase some more queries that arise while cleaning a dataframe. For e.g. How do you extract a floating point number occuring before a text? and so on..
The objective is to make these columns numeric by extracting the numeric part from the ratings e.g. 4.5 out of 5 becomes 4.5 & 106 ratings becomes 106.
When this is done, the “Not rated yet” rows will get converted to NA.
I wanted to keep a copy of the data cleaned so far.
So, the subsequent cleaning steps have been performed on a copy of this
dataframe.
stars_out_of_5 column to retain only the digit part.total_ratings column to retain only the numeric part.total_ratings column. (Bcz this column has to be converted
to numeric.)#remove 'out of 5' from 'stars_out_of_5' column
audible_data$stars_out_of_5 <- stringr::str_replace(audible_data$stars_out_of_5,
pattern = ' out of 5', replacement = '')
#remove 'ratings' from 'total_ratings' column
audible_data$total_ratings <- stringr::str_replace(audible_data$total_ratings,
pattern = ' ratings', replacement = '')
#remove commas from values like 1,500 from total_ratings
audible_data$total_ratings <- str_replace(audible_data$total_ratings,
pattern = ',', replacement = '')#finally change the datatype of both columns
audible_data <- audible_data %>%
mutate(stars_out_of_5 = as.numeric(stars_out_of_5),
total_ratings = as.integer(total_ratings))## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `stars_out_of_5 = as.numeric(stars_out_of_5)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
## Rows: 87,489
## Columns: 16
## $ name <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The …
## $ author1 <chr> "Geronimo Stilton", "Rick Riordan", "Jeff Kinney", "Ric…
## $ author2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ narrator <chr> "Bill Lobely", "Robbie Daymond", "Dan Russell", "Soneel…
## $ time <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 …
## $ releasedate <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-…
## $ language <chr> "English", "English", "English", "English", "English", …
## $ stars <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rat…
## $ stars_out_of_5 <dbl> 5.0, 4.5, 4.5, 4.5, 4.5, 5.0, 5.0, 5.0, 5.0, 5.0, 4.5, …
## $ total_ratings <int> 34, 41, 38, 12, 181, 72, 11, 50, 5, 58, 130, 6, 7, 41, …
## $ price <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.…
## $ hour_component <int> 2, 13, 2, 11, 10, 10, 2, 12, 10, 13, 8, 5, 6, 12, 11, 2…
## $ min_component <int> 20, 8, 3, 16, 0, 35, 23, 32, 56, 22, 48, 23, 1, 58, 55,…
## $ release_date <date> 2008-08-04, 2018-05-01, 2020-11-06, 2021-10-05, 2010-0…
#convert price to float type
#replace 'free' with 0
audible_data$price = str_replace(audible_data$price, regex(pattern = 'free',ignore_case = TRUE),
replacement = '')
#remove ',' from the values
audible_data$price = str_replace(audible_data$price, pattern = ',', replacement = '')
#convert to float
audible_data$price = as.numeric(audible_data$price)## tibble [87,489 × 16] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:87489] "Geronimo Stilton #11 & #12" "The Burning Maze" "The Deep End" "Daughter of the Deep" ...
## $ author1 : chr [1:87489] "Geronimo Stilton" "Rick Riordan" "Jeff Kinney" "Rick Riordan" ...
## $ author2 : chr [1:87489] NA NA NA NA ...
## $ author3 : chr [1:87489] NA NA NA NA ...
## $ author4 : chr [1:87489] NA NA NA NA ...
## $ narrator : chr [1:87489] "Bill Lobely" "Robbie Daymond" "Dan Russell" "Soneela Nankani" ...
## $ time : chr [1:87489] "2 hrs and 20 mins" "13 hrs and 8 mins" "2 hrs and 3 mins" "11 hrs and 16 mins" ...
## $ releasedate : chr [1:87489] "04-08-08" "01-05-18" "06-11-20" "05-10-21" ...
## $ language : chr [1:87489] "English" "English" "English" "English" ...
## $ stars : chr [1:87489] "5 out of 5 stars34 ratings" "4.5 out of 5 stars41 ratings" "4.5 out of 5 stars38 ratings" "4.5 out of 5 stars12 ratings" ...
## $ stars_out_of_5: num [1:87489] 5 4.5 4.5 4.5 4.5 5 5 5 5 5 ...
## $ total_ratings : int [1:87489] 34 41 38 12 181 72 11 50 5 58 ...
## $ price : num [1:87489] 468 820 410 615 820 ...
## $ hour_component: int [1:87489] 2 13 2 11 10 10 2 12 10 13 ...
## $ min_component : int [1:87489] 20 8 3 16 0 35 23 32 56 22 ...
## $ release_date : Date[1:87489], format: "2008-08-04" "2018-05-01" ...
## [1] "name" "author1" "author2" "author3"
## [5] "author4" "narrator" "time" "releasedate"
## [9] "language" "stars" "stars_out_of_5" "total_ratings"
## [13] "price" "hour_component" "min_component" "release_date"
And finally, I’ve created a subset of the dataframe with all relevant columns.
#select relevant columns
final_data <- audible_data %>%
select(name, contains('author'), narrator, time, hour_component, min_component,
release_date, language, stars_out_of_5, total_ratings, price)
#view a sample of the final data
final_data[sample(nrow(final_data),5),]## # A tibble: 5 × 14
## name author1 author2 author3 author4 narrator time hour_component
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 Empowerment Tap… Margar… <NA> <NA> <NA> Margare… 1 hr… 1
## 2 Flawless Scott … Greg C… <NA> <NA> Don Hag… 10 h… 10
## 3 History for 8th… V.Suvo… <NA> <NA> <NA> Ilya Gl… 5 hr… 5
## 4 Tales from the … Brian … <NA> <NA> <NA> Lee Gor… 6 hr… 6
## 5 Zwietracht Tanja … <NA> <NA> <NA> Rebecca… 4 hr… 4
## # ℹ 6 more variables: min_component <int>, release_date <date>, language <chr>,
## # stars_out_of_5 <dbl>, total_ratings <int>, price <dbl>
I used this step in initial phase of cleaning this dataframe, but
then dropped it from the final draft.
But since this notebook contains “How To..” steps, I’ve included it
below just for the sake of future reference.
stars_out_of_5 column- example-
4.5 out of 5.
#example
ratings <- data.frame(rating1 = c('4.5 out of 5', '3 out of 5', '.5 out of 5'))
ratings$rating2 <- ratings$rating1Regular expression used:
## rating1 rating2
## 1 4.5 out of 5 4.5
## 2 3 out of 5 3
## 3 .5 out of 5 .5
So, this is it!
Ofc, there are many more ways of carrying out the cleaning
process.
I hope this will be helpful for the readers in some way.
Thanks for reading :)