Data Wrangling Exercise 2: Dealing with missing values

Notes:

A knitted RMarkdown HTML document of this project is available on RPubs here.
My Springboard GitHub repository for all projects is available here.

0: Load the data in RStudio

Save the data set as a CSV file called titanic_original.csv and load it in RStudio into a data frame.

library(readxl)
titanic <- read_excel("titanic3.xls")

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in M1306 / R1306C13: '328'

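# row.names = FALSE keeps R's row index from being written as an extra unnamed column
write.csv(titanic, "titanic_original.csv", row.names = FALSE)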
head(titanic)

## # A tibble: 6 x 14
##   pclass survived name  sex      age sibsp parch ticket  fare cabin
##    <dbl>    <dbl> <chr> <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1      1        1 Alle~ fema~ 29         0     0 24160  211.  B5   
## 2      1        1 Alli~ male   0.917     1     2 113781 152.  C22 ~
## 3      1        0 Alli~ fema~  2         1     2 113781 152.  C22 ~
## 4      1        0 Alli~ male  30         1     2 113781 152.  C22 ~
## 5      1        0 Alli~ fema~ 25         1     2 113781 152.  C22 ~
## 6      1        1 Ande~ male  48         0     0 19952   26.6 E12  
## # ... with 4 more variables: embarked <chr>, boat <chr>, body <dbl>,
## #   home.dest <chr>
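
Before filling anything in, it can help to see how many values are missing in each column. The sketch below is not part of the original exercise: it uses a small made-up data frame (`toy`, a hypothetical stand-in for `titanic`) and base R only, so it runs without extra packages.

```r
# Toy data frame standing in for titanic (values are made up for illustration)
toy <- data.frame(
  age      = c(29, NA, 2, NA),
  cabin    = c("B5", NA, NA, NA),
  embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

Running the same `colSums(is.na(...))` call on the real data frame should show which columns the steps below need to address.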

1: Port of embarkation

The embarked column has some missing values, which are known to correspond to passengers who actually embarked at Southampton. Find the missing values and replace them with S. (Caution: Sometimes a missing value might be read into R as a blank or empty string.)

library(tidyr)
titanic <- titanic %>% replace_na(list(embarked = "S"))
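
Note that `replace_na()` only touches true `NA` values. If a missing port had been read in as a blank or whitespace-only string, as the caution above warns, it would be left alone. A minimal sketch of one way to handle both cases, using a made-up vector (`embarked`, a hypothetical stand-in for `titanic$embarked`) and base R:

```r
# Toy vector standing in for titanic$embarked (values are made up)
embarked <- c("S", "C", NA, "", " ", "Q")

# Treat empty or whitespace-only strings as missing
embarked[!is.na(embarked) & trimws(embarked) == ""] <- NA

# Fill every missing port with Southampton
embarked[is.na(embarked)] <- "S"

print(embarked)
```

Converting blanks to `NA` first means a single replacement step covers every form of "missing".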

2: Age

You’ll notice that a lot of the values in the Age column are missing. While there are many ways to fill these missing values, using the mean or median of the rest of the values is quite common in such cases.

Calculate the mean of the Age column and use that value to populate the missing values.

Think about other ways you could have populated the missing values in the age column. Why would you pick any of those over the mean (or not)?

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

summarise(titanic, Average = mean(age, na.rm = TRUE)) # 29.88

## # A tibble: 1 x 1
##   Average
##     <dbl>
## 1    29.9

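# Fill missing ages with the column mean; the replacement must be numeric, not the string "30"
titanic <- titanic %>% replace_na(list(age = mean(titanic$age, na.rm = TRUE)))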
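
As for other ways to populate the missing ages, two common alternatives are the overall median (less sensitive to extreme ages than the mean) and a median computed within a group such as passenger class. The sketch below illustrates both on a made-up data frame (`toy` is hypothetical, and grouping by `pclass` is an assumption chosen for illustration, not something the exercise prescribes).

```r
# Toy data standing in for titanic (values are made up)
toy <- data.frame(
  pclass = c(1, 1, 1, 3, 3, 3),
  age    = c(40, NA, 50, 20, 22, NA)
)

# Option A: fill with the overall median
age_median <- ifelse(is.na(toy$age), median(toy$age, na.rm = TRUE), toy$age)

# Option B: fill with the median of the passenger's own class
# ave() returns, for each row, the statistic computed over that row's group
class_median <- ave(toy$age, toy$pclass,
                    FUN = function(x) median(x, na.rm = TRUE))
age_by_class <- ifelse(is.na(toy$age), class_median, toy$age)

print(age_median)
print(age_by_class)
```

A group-wise fill keeps more of the structure in the data, at the cost of assuming that age really does vary with the grouping variable.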

3: Lifeboat

You’re interested in looking at the distribution of passengers in different lifeboats, but as we know, many passengers did not make it to a boat :-( This means that there are a lot of missing values in the boat column. Fill these empty slots with a dummy value, e.g. the string ‘None’ or ‘NA’.

titanic <- titanic %>% replace_na(list(boat = 'None'))
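
Once the missing boats carry a dummy label, the distribution can be tabulated directly. A small sketch with a made-up vector (`boat` here is a hypothetical stand-in for `titanic$boat`), using base R's `table()`:

```r
# Toy vector standing in for titanic$boat (values are made up)
boat <- c("2", NA, "11", NA, "2", NA)

# Label passengers without a boat, as in the step above
boat[is.na(boat)] <- "None"

# Count passengers per boat label, largest group first
boat_counts <- table(boat)
print(sort(boat_counts, decreasing = TRUE))
```

Without the dummy value, `table()` would silently drop the `NA` entries unless `useNA = "ifany"` is passed.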

4: Cabin

You notice that many passengers don’t have a cabin number associated with them.

Does it make sense to fill missing cabin numbers with a value?

What does a missing value here mean?

You have a hunch that the fact that the cabin number is missing might be a useful indicator of survival. Create a new column has_cabin_number which has 1 if there is a cabin number, and 0 otherwise.

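# Flag passengers with a recorded cabin: 1 if cabin is present, 0 if it is NA
titanic <- titanic %>% mutate(has_cabin_number = ifelse(is.na(cabin), 0, 1))
View(titanic) # opens the RStudio data viewer; interactive sessions only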
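
One way to check the hunch is to compare the survival rate between the two groups. The sketch below does this on made-up data (`toy` is hypothetical and its numbers say nothing about the real passengers); because `survived` is coded 0/1, the mean of each group is its survival rate.

```r
# Toy data standing in for titanic (values are made up)
toy <- data.frame(
  survived = c(1, 1, 0, 0, 0, 1),
  cabin    = c("B5", "C22", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Same indicator as above: 1 if a cabin is recorded, 0 otherwise
toy$has_cabin_number <- ifelse(is.na(toy$cabin), 0, 1)

# Mean of the 0/1 survived column within each group
survival_rate <- tapply(toy$survived, toy$has_cabin_number, mean)
print(survival_rate)
```

A clear gap between the two rates on the real data would support keeping `has_cabin_number` as a feature.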

5: Submit the project on Github

Include your code, the original data as a CSV file titanic_original.csv, and the cleaned up data as a CSV file called titanic_clean.csv.

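# row.names = FALSE keeps R's row index from being written as an extra unnamed column
write.csv(titanic, "titanic_clean.csv", row.names = FALSE)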
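
Before submitting, it may be worth reading the file back to confirm that no missing values remain. A minimal sketch using a made-up data frame and a temporary file (both `toy` and `path` are hypothetical, and only the filled-in columns are modelled here):

```r
# Toy data standing in for the cleaned titanic data frame (values are made up)
toy <- data.frame(
  age  = c(29, 30),
  boat = c("2", "None"),
  stringsAsFactors = FALSE
)

# Write to a temporary file instead of titanic_clean.csv
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read the file back and check for leftover missing values
check <- read.csv(path, stringsAsFactors = FALSE)
print(names(check))
print(anyNA(check))
```

One caveat: `read.csv()` treats the string `NA` as missing by default, so a boat filled with ‘NA’ rather than ‘None’ would come back as a missing value on reload.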