Introduction

This notebook answers some of the most common “How To..” questions that pop-up during data cleaning.
I’ve also cleaned this notebook in Python. You may check out the Python version here on Kaggle.

I hope both versions of this work helps beginners to understand corresponding functions in Python and R.

The Audible Dataset used in this notebook can be found here.

library(tidyverse) # metapackage of all tidyverse packages
library(lubridate) # to work on dates

#the working directory
working_dir <- getwd()

#read data
audible_data = readr::read_csv(paste0(working_dir, "/data/audible_dataset/audible_uncleaned.csv"))

#preview of the data
head(audible_data, 4)

## # A tibble: 4 × 8
##   name                    author narrator time  releasedate language stars price
##   <chr>                   <chr>  <chr>    <chr> <chr>       <chr>    <chr> <chr>
## 1 Geronimo Stilton #11 &… Writt… Narrate… 2 hr… 04-08-08    English  5 ou… 468.…
## 2 The Burning Maze        Writt… Narrate… 13 h… 01-05-18    English  4.5 … 820.…
## 3 The Deep End            Writt… Narrate… 2 hr… 06-11-20    English  4.5 … 410.…
## 4 Daughter of the Deep    Writt… Narrate… 11 h… 05-10-21    English  4.5 … 615.…

From the sample data above and information below, we can list out some inconsistencies in the dataframe.
1. author- “Writtenby:” can be removed; multiple author names can be placed in different columns.
2. narrator- “Narratedby:” can be removed.
3. time- is of type str; can be converted to numeric or datetime type. We’ll see what should be done here.
4. releasedate- is of type str; should be of type date.
5. language- all values should be in uniform case; first letter capital, rest smallcase.
6. stars- can be split into i) number of stars and ii) total number of ratings columns.
7. price- is of type str; should be float.

#information about the data
glimpse(audible_data)

## Rows: 87,489
## Columns: 8
## $ name        <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The Dee…
## $ author      <chr> "Writtenby:GeronimoStilton", "Writtenby:RickRiordan", "Wri…
## $ narrator    <chr> "Narratedby:BillLobely", "Narratedby:RobbieDaymond", "Narr…
## $ time        <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 min…
## $ releasedate <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-10"…
## $ language    <chr> "English", "English", "English", "English", "English", "En…
## $ stars       <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rating…
## $ price       <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.00"…

#stats
summary(audible_data)

##      name              author            narrator             time          
##  Length:87489       Length:87489       Length:87489       Length:87489      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  releasedate          language            stars              price          
##  Length:87489       Length:87489       Length:87489       Length:87489      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

Data Cleaning Steps

1. How to check total count of each category?

#total count of each unique book
book_count <- audible_data %>%
                group_by(name) %>%
                summarize(total_count = n()) %>%
                arrange(-total_count)

#books with 10 or more observations.
book_count %>% filter(total_count >= 10)

## # A tibble: 18 × 2
##    name                        total_count
##    <chr>                             <int>
##  1 The Art of War                       20
##  2 Sterling Biographies                 19
##  3 Sterling Point Books                 16
##  4 The Odyssey                          16
##  5 Hamlet                               15
##  6 A Christmas Carol                    14
##  7 Pride and Prejudice                  14
##  8 The Prophet                          14
##  9 As a Man Thinketh                    13
## 10 The Iliad                            13
## 11 The Science of Getting Rich          13
## 12 Abraham Lincoln                      12
## 13 The Picture of Dorian Gray           12
## 14 Meditations                          11
## 15 The Prince                           11
## 16 The Raven                            11
## 17 The Richest Man in Babylon           11
## 18 Unstoppable                          10

2. How to check number of duplicate rows?

#total number of duplicate rows
sum(duplicated(audible_data))

## [1] 0

3. How to remove a string pattern?

Changes mades to the author column:

Removed “Writtenby:”.
Add space between first, middle(if any) and last names of the authors.
Since some books have multiple authors, this column is split into multiple columns with 1 author name in each.

The 3rd step is optional.
I’ve added it here just to showcase how to split a text column into multiple columns, and assign column names automatically using prefixes like author1, author2 etc.

#Author column
#remove the phrase "Writtenby:"
audible_data$author <- stringr::str_replace(audible_data$author, 
                                            pattern = 'Writtenby:', replacement = '')
#after removing "Written By", author column looks like:
head(audible_data[,'author'], 4)

## # A tibble: 4 × 1
##   author         
##   <chr>          
## 1 GeronimoStilton
## 2 RickRiordan    
## 3 JeffKinney     
## 4 RickRiordan

4. How to add space between 2 or more word blocks? (here first, middle, last name)

Regular Expression used:

“([a-z])([A-Z])”- means match a lowercase letter followed by an uppercase letter. So, group 1 is [a-z] and group 2 is [A-Z].
“\\1 \\2”- means add space wherever uppercase letter(group 2) is followed by lowercase letter group(group 1).

So, for example, ‘ChandlerBing’ becomes ‘Chandler Bing’.

NOTE:
1. Some rows contains muliple author names. e.g. “ChandlerBing, RossGeller, PhoebeBuffay”.
2. If str_replace() is used below, space will be added to only the first author name, ie. “Chandler Bing, RossGeller, PhoebeBuffay”.
3. Applying str_replace_all() adds space to all author names, i.e. “Chandler Bing, Ross Geller, Phoebe Buffay”.

#Add space between the first, middle and last names of Authors.
#e.g. JaneAustin becomes Jane Austin
audible_data$author <- stringr::str_replace_all(audible_data$author, 
                                            pattern = "([a-z])([A-Z])", 
                                            replacement = "\\1 \\2")
head(audible_data[,'author'], 4)

## # A tibble: 4 × 1
##   author          
##   <chr>           
## 1 Geronimo Stilton
## 2 Rick Riordan    
## 3 Jeff Kinney     
## 4 Rick Riordan

5. How to split a value into multiple columns and assign column name with prefix/suffix?

str_count() below finds the number of occurances of comma ‘,’ in each row of author column. Then maximum number of occurances is found and 1 is added.
So, if there are three commmas in any given row, it implies that row has 4 names in the author column.
And 4 columns will be created to store each author name.

#maximum number of authors for any book.
#helps to determine max number of columns needed to keep indiviudal authors
max_columns <- max(str_count(audible_data$author, ',')) + 1

#split the author column.
audible_data <- audible_data %>% separate(col = author,
                         into = paste0('author', seq_len(max_columns)),
                         sep = ',',
                         fill = 'right')

tail(audible_data,3)

## # A tibble: 3 × 11
##   name       author1 author2 author3 author4 narrator time  releasedate language
##   <chr>      <chr>   <chr>   <chr>   <chr>   <chr>    <chr> <chr>       <chr>   
## 1 The Innoc… Mark T… <NA>    <NA>    <NA>    Narrate… 19 h… 30-12-16    English 
## 2 A Sentime… Lauren… <NA>    <NA>    <NA>    Narrate… 4 hr… 23-02-11    English 
## 3 Havana     Mark K… <NA>    <NA>    <NA>    Narrate… 6 hr… 07-03-17    English 
## # ℹ 2 more variables: stars <chr>, price <chr>

6. How to find total number of NA in each column?

Here, I wanted to see if it’s at all useful to split the author column or not.
As we can see from the NA value counts below, the newly created columns author2, author3, author4 are mostly vacant.
Majority of the books have only 1 author and about 14,000 books have 2 authors.
(Total number of unique books in the data- 82767)

#find total number of NA in colnames containing 'author' 
#(author1, author2, author3, author4)
rbind(colSums(is.na(audible_data %>% select(contains('author')))))

##      author1 author2 author3 author4
## [1,]       0   73762   85135   86713

Changes made to narrator column:

Removed “Narratedby:”.
Add space between first, middle and last names of the narrator.
Here, I haven’t split this column for multiple narrators.

#remove "narrated by" from the beginning
audible_data$narrator <- stringr::str_replace(audible_data$narrator, 
                                             pattern = "Narratedby:",
                                             replacement = "")

#add space between first and last names
audible_data$narrator <- stringr::str_replace_all(audible_data$narrator,
                                                 pattern = "([a-z])([A-Z])",
                                                 replacement = "\\1 \\2")

Time column

This column contains duration of audiobooks in text format.
Some rows contain ‘xx hrs and yy mins’ while others contain ‘y mins’, ‘less than 1 minute’ etc.
In order to see all unique formats present, a copy of the time column has been created.

#make a copy of time column to understand all kind of formats in which data is present
time_column = audible_data$time

All digits have been removed to see what all phrases are present.

#replace all numbers with blanks
time_column = str_replace_all(time_column,
                              pattern = '[0-9]', replacement = '')
#keep only unique patterns
unique(time_column)

## [1] " hrs and  mins"    " hrs"              " hrs and  min"    
## [4] " hr and  mins"     " mins"             " hr"              
## [7] " hr and  min"      " min"              "Less than  minute"

Assumption: All rows with Less than xx minutes will be replaced with xx minutes.
Below, I’ve filtered out all row indices which have this ‘less than..’ values to see all distinct durations.
Since only 1 unique value present, less than 1 minute, it can be simply replaced with ‘1’.
If other values were present like less than 5 minutes, less than 10 mins etc., then we would have used regular expressions to extract and store the digits.

#find rows with "less than x minute" values in audiobook duration
less_than_duration = grep(pattern = 'less than', audible_data$time, ignore.case = T)
#check all unique values with 'less than' pattern
audible_data[less_than_duration,'time'] %>% distinct()

## # A tibble: 1 × 1
##   time              
##   <chr>             
## 1 Less than 1 minute

The objective is to:
1. Create 2 columns hour_component and min_component to store the hour and minute duration of the audiobook.
2. Examples:
- ‘17 hrs and 5 mins’ will split into 17 in hour column and 5 in minute column.
- ‘24 mins’ will be stored as 0 in hour column and 24 in minute column.
- ‘less than 1 minute’ will be stored as 0 in hour and 1 in minute column.

#create 2 columns- 1 for hour component and the other for minute component.
audible_data$hour_component <- 0
audible_data$min_component <- 0

7. How to extract digits before a specifc word?

Regular Expression used:
1. “^-” indicates start matching the pattern from beginning of the string.
2. “([0-9]+)”- is the capture group. Means look for one or more digits.
3. + means one or more occurances of digits.
4. “([0-9]+) hr” captures digits before the word hr or hrs.
5. “([0-9]+) min” captures digits before min/mins/minutes words.

#extract number of hours
audible_data$hour_component <- stringr::str_extract(audible_data$time, 
                                                    pattern = '^([0-9]+) hr')
#remove the 'hr' part
audible_data$hour_component <- stringr::str_replace(audible_data$hour_component,
                                                   pattern = ' hr', replacement = '')
#extract number of minutes
audible_data$min_component <- stringr::str_extract(audible_data$time, pattern = '([0-9]+) min')
#remove the 'min' part
audible_data$min_component <- stringr::str_replace(audible_data$min_component,
                                                   pattern = ' min', replacement = '')

#preview of extracted time
head(audible_data %>% select(time, hour_component, min_component), 5)

## # A tibble: 5 × 3
##   time               hour_component min_component
##   <chr>              <chr>          <chr>        
## 1 2 hrs and 20 mins  2              20           
## 2 13 hrs and 8 mins  13             8            
## 3 2 hrs and 3 mins   2              3            
## 4 11 hrs and 16 mins 11             16           
## 5 10 hrs             10             <NA>

#checking the datatype of all cols now
glimpse(audible_data)

## Rows: 87,489
## Columns: 13
## $ name           <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The …
## $ author1        <chr> "Geronimo Stilton", "Rick Riordan", "Jeff Kinney", "Ric…
## $ author2        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author3        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author4        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ narrator       <chr> "Bill Lobely", "Robbie Daymond", "Dan Russell", "Soneel…
## $ time           <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 …
## $ releasedate    <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-…
## $ language       <chr> "English", "English", "English", "English", "English", …
## $ stars          <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rat…
## $ price          <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.…
## $ hour_component <chr> "2", "13", "2", "11", "10", "10", "2", "12", "10", "13"…
## $ min_component  <chr> "20", "8", "3", "16", NA, "35", "23", "32", "56", "22",…

8. How to convert NA values to 0?

Now, after digits extraction, values like ‘24 mins’ will keep 24 in minute column and NA in hour column.
Similarly, a value like ‘11 hrs’ will keep NA in minutes column.
Below, these NA values have been replaced with 0.

audible_data$hour_component <- audible_data$hour_component %>% replace_na('0')
audible_data$min_component <- audible_data$min_component %>% replace_na('0')

9. How to change the datatype of multiple columns?

Now, we can change the datatype of hour and minutes from strings to integers.

#convert the hour and min column to int type
audible_data = audible_data %>%
                mutate(hour_component = as.integer(hour_component),
                      min_component = as.integer(min_component))

Date column

Dates are written in dd-mm-yy format.
They are in character format and will be converted to date type.
But before that, we need to make sure that each of the components are consistent. That is:
- dd doesn’t exceed 31.
- mm doesn’t exceed 12.
- In the datacard on Audible dataset page, its mentioned that some books are yet to be released, which means we can have year value greater than 2023.

head(audible_data %>% select(releasedate), 4)

## # A tibble: 4 × 1
##   releasedate
##   <chr>      
## 1 04-08-08   
## 2 01-05-18   
## 3 06-11-20   
## 4 05-10-21

10. How to check for inconsistencies in date column stored as a string?

Regular expressions used:
1. “^([0-9]+)”- check for the digit from beginning of string (^) and before hypher -.
2. “-([0-9]+)-” check for the digits between the two hyphens.
3. “-([0-9]+)$”- capture the digits after hyphen sign and end of the string.

#day component- dd
#check if any value in day part > 31.
any(
    as.integer(
        unique(
            str_extract(string = audible_data$releasedate, pattern = '^([0-9]+)')
            )
    ) > 31)

## [1] FALSE

#check for inconsistencies in month of release date
#Check to see whether middle values (months) contain any number > 12
#all unique months. 

#extract all months; E.g. extracted format is: "-05-" for May
unique_months <- unique(str_extract(string = audible_data$releasedate, pattern = '-([0-9]+)-'))
#remove both hyphens
unique_months <- str_replace_all(string = unique_months, pattern = '-', repl = '')
#convert to integer and check if any value exceeds 12 (12 months)
any(as.integer(unique_months) > 12)

## [1] FALSE

11. How to convert date stored as string to ‘date’ type?

In the Py version of this notebook here on Kaggle, I’ve added a note in this part.
If date string is in format: 04-08-23, then separator used in “format” parameter should be %d-%m-%Y and not %d/%m/%Y.

But no such caution was needed here. Only thing to note in case of R is to make sure the right function is used.
Since the date in our df is of form: dd-mm-yy, hence dmy() has been used.
If in the date, month was mentioned first, followed by day, then year, then mdy() would have been the right function.

head(audible_data %>% select(name, releasedate), 3)

## # A tibble: 3 × 2
##   name                       releasedate
##   <chr>                      <chr>      
## 1 Geronimo Stilton #11 & #12 04-08-08   
## 2 The Burning Maze           01-05-18   
## 3 The Deep End               06-11-20

#convert to date type
audible_data$release_date <- lubridate::dmy(audible_data$releasedate)
#view datatype of all cols now
glimpse(audible_data)

## Rows: 87,489
## Columns: 14
## $ name           <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The …
## $ author1        <chr> "Geronimo Stilton", "Rick Riordan", "Jeff Kinney", "Ric…
## $ author2        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author3        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author4        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ narrator       <chr> "Bill Lobely", "Robbie Daymond", "Dan Russell", "Soneel…
## $ time           <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 …
## $ releasedate    <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-…
## $ language       <chr> "English", "English", "English", "English", "English", …
## $ stars          <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rat…
## $ price          <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.…
## $ hour_component <int> 2, 13, 2, 11, 10, 10, 2, 12, 10, 13, 8, 5, 6, 12, 11, 2…
## $ min_component  <int> 20, 8, 3, 16, 0, 35, 23, 32, 56, 22, 48, 23, 1, 58, 55,…
## $ release_date   <date> 2008-08-04, 2018-05-01, 2020-11-06, 2021-10-05, 2010-0…

12. How to convert text to one uniform case?

#all unique languages in the dataset
unique(audible_data$language)

##  [1] "English"          "Hindi"            "spanish"          "german"          
##  [5] "french"           "catalan"          "swedish"          "italian"         
##  [9] "danish"           "finnish"          "dutch"            "hebrew"          
## [13] "russian"          "polish"           "galician"         "afrikaans"       
## [17] "icelandic"        "romanian"         "japanese"         "tamil"           
## [21] "portuguese"       "urdu"             "hungarian"        "czech"           
## [25] "bulgarian"        "mandarin_chinese" "basque"           "korean"          
## [29] "arabic"           "greek"            "turkish"          "ukrainian"       
## [33] "slovene"          "norwegian"        "telugu"           "lithuanian"

#capitalize the language names
audible_data$language <- stringr::str_to_title(audible_data$language)

Stars column

The objective is to create 2 columns out of the stars column:
- stars_out_of_5- will contain numeric values like 2, 4.5 etc.
- total_ratings- total number of reviews received on the particular audiobook.

Example: "4.5 out of 5 stars7 ratings"
This column will be split after the word ‘stars’.
Then ’ out of 5’ will be removed and remaining number will be converted to float type.
In the total_ratings column, “ratings” will be removed and values will be converted to float type.

head(audible_data %>% select(name, stars), 3)

## # A tibble: 3 × 2
##   name                       stars                       
##   <chr>                      <chr>                       
## 1 Geronimo Stilton #11 & #12 5 out of 5 stars34 ratings  
## 2 The Burning Maze           4.5 out of 5 stars41 ratings
## 3 The Deep End               4.5 out of 5 stars38 ratings

#split the stars column after the word 'stars'
#rows with 'Not yet rated' return NA in 'total_ratings' column after separation
audible_data <- audible_data %>% 
                    separate(col = stars,
                            into = c('stars_out_of_5', 'total_ratings'),
                            sep = 'stars',
                            #do not remove the original column
                            remove = F,
                            #row containing 'Not yet rated' returns NA.
                            #NA should be filled starting from right-hand-side column
                            fill = 'right')

#some unique rows in the data
audible_data[sample(nrow(audible_data),40),c('stars', 'stars_out_of_5', 'total_ratings')] %>% distinct()

## # A tibble: 6 × 3
##   stars                        stars_out_of_5  total_ratings
##   <chr>                        <chr>           <chr>        
## 1 Not rated yet                "Not rated yet" <NA>         
## 2 4.5 out of 5 stars2 ratings  "4.5 out of 5 " 2 ratings    
## 3 5 out of 5 stars1 rating     "5 out of 5 "   1 rating     
## 4 4.5 out of 5 stars11 ratings "4.5 out of 5 " 11 ratings   
## 5 3 out of 5 stars1 rating     "3 out of 5 "   1 rating     
## 6 4 out of 5 stars1 rating     "4 out of 5 "   1 rating

Out of 87,489 rows in the df, 72,417 contain “Not yet rated” for books that have not been rated yet!

There is no point in removing all these rows and perform an analysis. But I want to showcase some more queries that arise while cleaning a dataframe. For e.g. How do you extract a floating point number occuring before a text? and so on..

The objective is to make these columns numeric by extracting the numeric part from the ratings e.g. 4.5 out of 5 becomes 4.5 & 106 ratings becomes 106.

When this is done, the “Not rated yet” rows will get converted to NA. I wanted to keep a copy of the data cleaned so far.
So, the subsequent cleaning steps have been performed on a copy of this dataframe.

#make a copy of dataframe cleaned so far.
audible_data_copy <- audible_data

Steps taken below:

Remove ‘out of 5’ from stars_out_of_5 column to retain only the digit part.
Remove ‘ratings’ from total_ratings column to retain only the numeric part.
Remove commas from values like 1,500 in total_ratings column. (Bcz this column has to be converted to numeric.)

#remove 'out of 5' from 'stars_out_of_5' column
audible_data$stars_out_of_5 <- stringr::str_replace(audible_data$stars_out_of_5, 
                                                    pattern = ' out of 5', replacement = '')
#remove 'ratings' from 'total_ratings' column
audible_data$total_ratings <- stringr::str_replace(audible_data$total_ratings,
                                                  pattern = ' ratings', replacement = '')
#remove commas from values like 1,500 from total_ratings
audible_data$total_ratings <- str_replace(audible_data$total_ratings, 
                                         pattern = ',', replacement = '')

#finally change the datatype of both columns
audible_data <- audible_data %>%
                    mutate(stars_out_of_5 = as.numeric(stars_out_of_5),
                          total_ratings = as.integer(total_ratings))

## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `stars_out_of_5 = as.numeric(stars_out_of_5)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

glimpse(audible_data)

## Rows: 87,489
## Columns: 16
## $ name           <chr> "Geronimo Stilton #11 & #12", "The Burning Maze", "The …
## $ author1        <chr> "Geronimo Stilton", "Rick Riordan", "Jeff Kinney", "Ric…
## $ author2        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author3        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author4        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ narrator       <chr> "Bill Lobely", "Robbie Daymond", "Dan Russell", "Soneel…
## $ time           <chr> "2 hrs and 20 mins", "13 hrs and 8 mins", "2 hrs and 3 …
## $ releasedate    <chr> "04-08-08", "01-05-18", "06-11-20", "05-10-21", "13-01-…
## $ language       <chr> "English", "English", "English", "English", "English", …
## $ stars          <chr> "5 out of 5 stars34 ratings", "4.5 out of 5 stars41 rat…
## $ stars_out_of_5 <dbl> 5.0, 4.5, 4.5, 4.5, 4.5, 5.0, 5.0, 5.0, 5.0, 5.0, 4.5, …
## $ total_ratings  <int> 34, 41, 38, 12, 181, 72, 11, 50, 5, 58, 130, 6, 7, 41, …
## $ price          <chr> "468.00", "820.00", "410.00", "615.00", "820.00", "656.…
## $ hour_component <int> 2, 13, 2, 11, 10, 10, 2, 12, 10, 13, 8, 5, 6, 12, 11, 2…
## $ min_component  <int> 20, 8, 3, 16, 0, 35, 23, 32, 56, 22, 48, 23, 1, 58, 55,…
## $ release_date   <date> 2008-08-04, 2018-05-01, 2020-11-06, 2021-10-05, 2010-0…

Price column

Steps taken:

Replace ‘Free’ with 0.
Remove commas from prices. Currently, this column is of type str; it has to be converted to numeric.
Convert to float.

#convert price to float type

#replace 'free' with 0
audible_data$price = str_replace(audible_data$price, regex(pattern = 'free',ignore_case = TRUE),
                                replacement = '')
#remove ',' from the values
audible_data$price = str_replace(audible_data$price, pattern = ',', replacement = '')
#convert to float
audible_data$price = as.numeric(audible_data$price)

str(audible_data)

## tibble [87,489 × 16] (S3: tbl_df/tbl/data.frame)
##  $ name          : chr [1:87489] "Geronimo Stilton #11 & #12" "The Burning Maze" "The Deep End" "Daughter of the Deep" ...
##  $ author1       : chr [1:87489] "Geronimo Stilton" "Rick Riordan" "Jeff Kinney" "Rick Riordan" ...
##  $ author2       : chr [1:87489] NA NA NA NA ...
##  $ author3       : chr [1:87489] NA NA NA NA ...
##  $ author4       : chr [1:87489] NA NA NA NA ...
##  $ narrator      : chr [1:87489] "Bill Lobely" "Robbie Daymond" "Dan Russell" "Soneela Nankani" ...
##  $ time          : chr [1:87489] "2 hrs and 20 mins" "13 hrs and 8 mins" "2 hrs and 3 mins" "11 hrs and 16 mins" ...
##  $ releasedate   : chr [1:87489] "04-08-08" "01-05-18" "06-11-20" "05-10-21" ...
##  $ language      : chr [1:87489] "English" "English" "English" "English" ...
##  $ stars         : chr [1:87489] "5 out of 5 stars34 ratings" "4.5 out of 5 stars41 ratings" "4.5 out of 5 stars38 ratings" "4.5 out of 5 stars12 ratings" ...
##  $ stars_out_of_5: num [1:87489] 5 4.5 4.5 4.5 4.5 5 5 5 5 5 ...
##  $ total_ratings : int [1:87489] 34 41 38 12 181 72 11 50 5 58 ...
##  $ price         : num [1:87489] 468 820 410 615 820 ...
##  $ hour_component: int [1:87489] 2 13 2 11 10 10 2 12 10 13 ...
##  $ min_component : int [1:87489] 20 8 3 16 0 35 23 32 56 22 ...
##  $ release_date  : Date[1:87489], format: "2008-08-04" "2018-05-01" ...

colnames(audible_data)

##  [1] "name"           "author1"        "author2"        "author3"       
##  [5] "author4"        "narrator"       "time"           "releasedate"   
##  [9] "language"       "stars"          "stars_out_of_5" "total_ratings" 
## [13] "price"          "hour_component" "min_component"  "release_date"

And finally, I’ve created a subset of the dataframe with all relevant columns.

#select relevant columns
final_data <- audible_data %>%
                select(name, contains('author'), narrator, time, hour_component, min_component,
                      release_date, language, stars_out_of_5, total_ratings, price)
#view a sample of the final data
final_data[sample(nrow(final_data),5),]

## # A tibble: 5 × 14
##   name             author1 author2 author3 author4 narrator time  hour_component
##   <chr>            <chr>   <chr>   <chr>   <chr>   <chr>    <chr>          <int>
## 1 Empowerment Tap… Margar… <NA>    <NA>    <NA>    Margare… 1 hr…              1
## 2 Flawless         Scott … Greg C… <NA>    <NA>    Don Hag… 10 h…             10
## 3 History for 8th… V.Suvo… <NA>    <NA>    <NA>    Ilya Gl… 5 hr…              5
## 4 Tales from the … Brian … <NA>    <NA>    <NA>    Lee Gor… 6 hr…              6
## 5 Zwietracht       Tanja … <NA>    <NA>    <NA>    Rebecca… 4 hr…              4
## # ℹ 6 more variables: min_component <int>, release_date <date>, language <chr>,
## #   stars_out_of_5 <dbl>, total_ratings <int>, price <dbl>

14. How to extract decimal numbers from a text?

I used this step in initial phase of cleaning this dataframe, but then dropped it from the final draft.
But since this notebook contains “How To..” steps, I’ve included it below just for the sake of future reference.

stars_out_of_5 column- example- 4.5 out of 5.
- What I’ve done above- replace ‘out of 5’ with ’’, then convert 4.5 from string to float.
- Another way- extract the floating-point number from this text, i.e. 4.5 and then convert it to float.

#example
ratings <- data.frame(rating1 = c('4.5 out of 5', '3 out of 5', '.5 out of 5'))
ratings$rating2 <- ratings$rating1

Regular expression used:

[0-9]*- means 0 or more occurance of a digit.
“.?”- means presence of a decimal point is optional.
So, “[0-9]*.?” means 0 or more occurances of digits before decimal point.
This pattern helps to capture values like .5 where there are no digits before decimal.
“.?[0-9]+” ensures that there must be one or more digits after the decimal point.

ratings$rating2 <- str_extract(ratings$rating1, regex(pattern = '([0-9]*\\.?[0-9]+)'))
ratings

##        rating1 rating2
## 1 4.5 out of 5     4.5
## 2   3 out of 5       3
## 3  .5 out of 5      .5

So, this is it!
Ofc, there are many more ways of carrying out the cleaning process.
I hope this will be helpful for the readers in some way.
Thanks for reading :)

Data Cleaning (R) - Audible Dataset

Shefali C.

2023-10-04