Overview

This is an exercise for the Springboard Data Foundations course. In each step we’ll look at the data from the Titanic and fill in missing data.

Load the data
Port of embarcation
Age
Lifeboat
Cabin
Save the data

Load the data

We’re going to load the original Excel spreadsheet, titanic3.xls. We’ll save an as-is copy as titanic_original.csv.

titanic_orig <- read.csv("titanic3_original.csv")
titanic <- titanic_orig

Port of embarcation

The embarked column has some missing values, which are known to correspond to passengers who actually embarked at Southampton. Find the missing values and replace them with S.

FWIW, some rows to fix: 169, 284

print(titanic$embarked[168:170])

## [1] S   C
## Levels:  C Q S

for(p in 1:length(titanic$embarked)) {
  if(titanic$embarked[p] == "") { titanic$embarked[p] <- "S" }
}
print(titanic$embarked[168:170])

## [1] S S C
## Levels:  C Q S

Age

Calculate the mean of the Age column and use that value to populate the missing values.

FWIW, some rows to fix: 16, 41

print(titanic$age[15:17])

## [1] 80 NA 24

age.m <- mean(titanic$age, na.rm=TRUE)
for(p in 1:length(titanic$age)) {
  if(is.na(titanic$age[p])) { titanic$age[p] <- age.m }
}
print(titanic$age[15:17])

## [1] 80.00000 29.88113 24.00000

Lifeboats

Fill empty slots with “None”.

FWIW, some rows to fix: 3, 4, 5

titanic$boat <- as.character(titanic$boat)
print(titanic$boat[1:6])

## [1] "2"  "11" ""   ""   ""   "3"

for(p in 1:length(titanic$boat)) {
  if(titanic$boat[p] == "") { titanic$boat[p] <- "None" }
}
print(titanic$boat[1:6])

## [1] "2"    "11"   "None" "None" "None" "3"

Cabin

Create a new column has_cabin_number which has 1 if there is a cabin number, and 0 otherwise.

FYI, rows without cabins: 10, 14

titanic$cabin <- as.character(titanic$cabin)
print(titanic$cabin[1:6])

## [1] "B5"      "C22 C26" "C22 C26" "C22 C26" "C22 C26" "E12"

has_cabin <- c()
for(p in 1:length(titanic$cabin)) {
  has_cabin <- append(has_cabin, if_else(titanic$cabin[p] == "", 0, 1))
}
titanic <- bind_cols(titanic, has_cabin_number = has_cabin)
print(titanic$has_cabin_number[1:6])

## [1] 1 1 1 1 1 1

Save data

And save it as titanic_clean.csv.

write.csv(titanic, "titanic_clean.csv")

wrangling2.Rmd

mikec964

August 12, 2017