library("plyr")
library("dplyr")
library("tidyr")
library("readxl")
library("knitr")

Exercise

Using R, you’ll be handling missing values in this data set (titanic3), and creating a new data set.

Steps

1) Load the data in RStudio:

Save the data set as a CSV file called titanic_original.csv and load it in RStudio into a data frame.

titanic_original <- read_excel("C:/Users/xena0/Downloads/titanic3.xls")
## Warning in read_fun(path = path, sheet_i = sheet, limits = limits, shim =
## shim, : Coercing text to numeric in M1306 / R1306C13: '328'
View(titanic_original)

2) Port of embarkation

The embarked column has some missing values, which are known to correspond to passengers who actually embarked at Southampton. Find the missing values and replace them with S. (Caution: Sometimes a missing value might be read into R as a blank or empty string.)

titanic_clean <- titanic_original %>% 
  mutate(embarked = ifelse(grepl(" ", embarked, ignore.case = TRUE), "S", embarked)) %>%
  replace_na(list(embarked = "S"))

3) Age

You’ll notice that a lot of the values in the Age column are missing. While there are many ways to fill these missing values, using the mean or median of the rest of the values is quite common in such cases.

Calculate the mean of the Age column and use that value to populate the missing values

Think about other ways you could have populated the missing values in the age column. Why would you pick any of those over the mean (or not)?

Answer: When you are using such a large data set, I would think using the mean would be the best way to predict the age of passangers with missing age information. If the passenger list was smaller, you may want to look at other options.

titanic_clean %>% 
  summarise(avg_age = mean(age, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_age
##     <dbl>
## 1    29.9
titanic_clean <- titanic_clean %>% 
  replace_na(list(age = 30))

4) Lifeboat

You’re interested in looking at the distribution of passengers in different lifeboats, but as we know, many passengers did not make it to a boat :-( This means that there are a lot of missing values in the boat column. Fill these empty slots with a dummy value e.g. the string ‘None’ or ‘NA’

titanic_clean <- titanic_clean %>% 
  replace_na(list(boat = "NONE"))

5) Cabin

You notice that many passengers don’t have a cabin number associated with them.

Does it make sense to fill missing cabin numbers with a value?

Answer: Yes, so that it is easier to pull out the data.

What does a missing value here mean?

Answer: It could mean a lot of things. It could just be a mistake. It could indicate that the passanger is lower class and didn’t have a cabin. It could indicate a stow away that was found after the ship embarked. It could corolate with survival… Some of these answers may be found analizing this data against the know data points.

You have a hunch that the fact that the cabin number is missing might be a useful indicator of survival. Create a new column has_cabin_number which has 1 if there is a cabin number, and 0 otherwise.

titanic_clean <- titanic_clean %>% 
  mutate(has_cabin_number = ifelse(grepl("NA", cabin), 0, 1))

6) Submit the project on Github

Include your code, the original data as a CSV file titanic_original.csv, and the cleaned up data as a CSV file called titanic_clean.csv.

kable(titanic_clean[1:5, ], caption = "titanic clean")
titanic clean
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest has_cabin_number
1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NA St Louis, MO 1
1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NA Montreal, PQ / Chesterville, ON 1
1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NONE NA Montreal, PQ / Chesterville, ON 1
1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NONE 135 Montreal, PQ / Chesterville, ON 1
1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NONE NA Montreal, PQ / Chesterville, ON 1