Notes:
The knitted RMarkdown HTML document for this project is available on RPubs here.
My Springboard GitHub repository for all projects is available here.
Save the data set as a CSV file called titanic_original.csv and load it in RStudio into a data frame.
library(readxl)
titanic <- read_excel("titanic3.xls")
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in M1306 / R1306C13: '328'
write.csv(titanic, "titanic_original.csv", row.names = FALSE) # omit the row-name column
head(titanic)
## # A tibble: 6 x 14
##   pclass survived name  sex      age sibsp parch ticket  fare cabin
##    <dbl>    <dbl> <chr> <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1      1        1 Alle~ fema~ 29         0     0 24160  211.  B5
## 2      1        1 Alli~ male   0.917     1     2 113781 152.  C22 ~
## 3      1        0 Alli~ fema~  2         1     2 113781 152.  C22 ~
## 4      1        0 Alli~ male  30         1     2 113781 152.  C22 ~
## 5      1        0 Alli~ fema~ 25         1     2 113781 152.  C22 ~
## 6      1        1 Ande~ male  48         0     0 19952   26.6 E12
## # ... with 4 more variables: embarked <chr>, boat <chr>, body <dbl>,
## # home.dest <chr>
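The task asks for the data to be loaded from a CSV file, so the write and read steps are worth checking as a round trip. The following is a minimal sketch only: `toy` is a made-up stand-in for the real data, and a temporary file is used instead of titanic_original.csv. It shows what the `row.names` argument of `write.csv()` changes when the file is read back.

```r
# Hypothetical stand-in for the real data; only the column names are borrowed.
toy <- data.frame(pclass = c(1, 3), embarked = c("S", NA),
                  stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")

# By default write.csv() also writes the row names as an unnamed first column,
# which read.csv() then labels "X".
write.csv(toy, path)
with_rownames <- read.csv(path, stringsAsFactors = FALSE)
names(with_rownames)

# Passing row.names = FALSE keeps only the real columns.
write.csv(toy, path, row.names = FALSE)
round_trip <- read.csv(path, stringsAsFactors = FALSE)
names(round_trip)
```

Reading the file back with `read.csv()` also turns the literal text `NA` into a missing value, so missing entries survive the round trip.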
The embarked column has some missing values, which are known to correspond to passengers who actually embarked at Southampton. Find the missing values and replace them with S. (Caution: Sometimes a missing value might be read into R as a blank or empty string.)
library(tidyr)
titanic <- titanic %>% replace_na(list(embarked = "S"))
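`replace_na()` only touches true `NA` values, so it would not catch the blank strings that the caution mentions. A hedged sketch on a made-up tibble (`toy` is hypothetical, not the real column) that treats `NA`, empty strings, and whitespace-only strings as missing:

```r
library(dplyr)

# Made-up values: one NA, one empty string, one whitespace-only string.
toy <- tibble(embarked = c("S", NA, "", "C", " "))

toy <- toy %>%
  mutate(embarked = if_else(is.na(embarked) | trimws(embarked) == "",
                            "S", embarked))

toy$embarked
```

Whether the real embarked column contains blanks at all depends on how the file was read, so this is a precaution rather than a required step.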
You’ll notice that a lot of the values in the Age column are missing. While there are many ways to fill these missing values, using the mean or median of the rest of the values is quite common in such cases.
Calculate the mean of the Age column and use that value to populate the missing values
Think about other ways you could have populated the missing values in the age column. Why would you pick any of those over the mean (or not)?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summarise(titanic, Average = mean(age, na.rm = TRUE)) # 29.88
## # A tibble: 1 x 1
## Average
## <dbl>
## 1 29.9
titanic <- titanic %>% replace_na(list(age = mean(titanic$age, na.rm = TRUE))) # fill with the mean, keeping age numeric
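On the question of other ways to fill the ages: the median is less sensitive to a few very young or very old passengers, and filling within groups preserves differences between groups. One possible sketch, on invented rows (the values in `toy` are made up for illustration), fills each missing age with the median age of passengers of the same class and sex:

```r
library(dplyr)

# Invented rows; only the column names match the real data.
toy <- tibble(
  pclass = c(1, 1, 1, 3, 3, 3),
  sex    = c("male", "male", "male", "female", "female", "female"),
  age    = c(40, 50, NA, 20, NA, 24)
)

toy <- toy %>%
  group_by(pclass, sex) %>%
  mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age)) %>%
  ungroup()

toy$age
```

A group in which every age is missing would still end up with `NA`, so a fallback to the overall median or mean may be needed on real data.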
You’re interested in looking at the distribution of passengers in different lifeboats, but as we know, many passengers did not make it to a boat :-( This means that there are a lot of missing values in the boat column. Fill these empty slots with a dummy value e.g. the string ‘None’ or ‘NA’
titanic <- titanic %>% replace_na(list(boat = 'None'))
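With a placeholder in place, the distribution that the exercise mentions can be tabulated directly. A minimal sketch on invented boat labels (the `toy` values are made up and do not reflect the real counts):

```r
library(dplyr)
library(tidyr)

# Invented boat labels with some missing entries.
toy <- tibble(boat = c("2", NA, "11", NA, "2", NA))

boat_counts <- toy %>%
  replace_na(list(boat = "None")) %>%
  count(boat, sort = TRUE)   # one row per boat, largest group first

boat_counts
```

Using the string 'None' rather than 'NA' avoids confusion with R's own missing-value marker when the file is read back in.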
You notice that many passengers don’t have a cabin number associated with them.
Does it make sense to fill missing cabin numbers with a value?
What does a missing value here mean?
You have a hunch that the fact that the cabin number is missing might be a useful indicator of survival. Create a new column has_cabin_number which has 1 if there is a cabin number, and 0 otherwise.
titanic <- titanic %>% mutate(has_cabin_number = ifelse(is.na(cabin), 0, 1))
View(titanic)
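One way to probe the hunch is to compare survival rates between passengers with and without a cabin number. A sketch on made-up rows (the `toy` data are invented, so the resulting rates say nothing about the real passengers):

```r
library(dplyr)

# Invented rows; survived is 1 for survivors and 0 otherwise.
toy <- tibble(
  survived = c(1, 0, 1, 0, 0, 1),
  cabin    = c("B5", "C22", NA, NA, NA, "E12")
)

rates <- toy %>%
  mutate(has_cabin_number = ifelse(is.na(cabin), 0, 1)) %>%
  group_by(has_cabin_number) %>%
  summarise(passengers = n(), survival_rate = mean(survived))

rates
```

Because survived is coded as 0 and 1, its mean within each group is the share of survivors in that group.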
5: Submit the project on GitHub
Include your code, the original data as a CSV file titanic_original.csv, and the cleaned up data as a CSV file called titanic_clean.csv.
write.csv(titanic, "titanic_clean.csv", row.names = FALSE) # omit the row-name column
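Before submitting, it may be worth confirming that the columns cleaned above no longer contain missing values. A small sketch, using a made-up data frame (`toy_clean`) in place of the real one:

```r
# Made-up stand-in for the cleaned data; only the column names are borrowed.
toy_clean <- data.frame(
  embarked         = c("S", "C"),
  age              = c(29.88, 30),
  boat             = c("None", "2"),
  has_cabin_number = c(0, 1),
  stringsAsFactors = FALSE
)

cleaned_cols <- c("embarked", "age", "boat", "has_cabin_number")

# Count the remaining NA values per cleaned column.
missing_counts <- colSums(is.na(toy_clean[cleaned_cols]))
missing_counts
```

Columns that were deliberately left alone, such as cabin, would still show missing values in the real data.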