This is an exercise for the Springboard Data Foundations course. In each step we’ll look at the data from the Titanic and fill in missing data.
We’re going to load the original Excel spreadsheet, titanic3.xls. We’ll save an as-is copy as titanic_original.csv.
titanic_orig <- read.csv("titanic3_original.csv")
titanic <- titanic_orig
The embarked column has some missing values, which are known to correspond to passengers who actually embarked at Southampton. Find the missing values and replace them with S.
FWIW, some rows to fix: 169, 284
print(titanic$embarked[168:170])
## [1] S C
## Levels: C Q S
for(p in 1:length(titanic$embarked)) {
if(titanic$embarked[p] == "") { titanic$embarked[p] <- "S" }
}
print(titanic$embarked[168:170])
## [1] S S C
## Levels: C Q S
Calculate the mean of the Age column and use that value to populate the missing values.
FWIW, some rows to fix: 16, 41
print(titanic$age[15:17])
## [1] 80 NA 24
age.m <- mean(titanic$age, na.rm=TRUE)
for(p in 1:length(titanic$age)) {
if(is.na(titanic$age[p])) { titanic$age[p] <- age.m }
}
print(titanic$age[15:17])
## [1] 80.00000 29.88113 24.00000
Fill empty slots with “None”.
FWIW, some rows to fix: 3, 4, 5
titanic$boat <- as.character(titanic$boat)
print(titanic$boat[1:6])
## [1] "2" "11" "" "" "" "3"
for(p in 1:length(titanic$boat)) {
if(titanic$boat[p] == "") { titanic$boat[p] <- "None" }
}
print(titanic$boat[1:6])
## [1] "2" "11" "None" "None" "None" "3"
Create a new column has_cabin_number which has 1 if there is a cabin number, and 0 otherwise.
FYI, rows without cabins: 10, 14
titanic$cabin <- as.character(titanic$cabin)
print(titanic$cabin[1:6])
## [1] "B5" "C22 C26" "C22 C26" "C22 C26" "C22 C26" "E12"
has_cabin <- c()
for(p in 1:length(titanic$cabin)) {
has_cabin <- append(has_cabin, if_else(titanic$cabin[p] == "", 0, 1))
}
titanic <- bind_cols(titanic, has_cabin_number = has_cabin)
print(titanic$has_cabin_number[1:6])
## [1] 1 1 1 1 1 1
And save it as titanic_clean.csv.
write.csv(titanic, "titanic_clean.csv")