1 Executive Summary
2 Loading and Exploring Data
- 2.1 Download Files
3 Exploratory Analysis
- 3.1 Response variable (Survived)
- 3.2 Prediction Variables (numeric)
4 Missing Data, label encoding, factorizing Variables
- 4.1 Missing Data
- 4.2 Label Encoding
5 Visualization of important variables
- 5.1 Passenger class
6 Visualization
- 6.1 Correlation
  - 6.1.1 Fare
  - 6.1.2 Price
- 6.2 Finding importance of variable with quick random forest
7 Feature engineering
- 7.1 Title
- 7.2 Cabin Letter
8 Preparing Data for modeling
- 8.1 Preprocesing predictor variables
- 8.2 Combining
9 Prepare test set for predicting
10 Modeling

1 Executive Summary

I use Titanic data set on kaggle to create prediction models. It use different attributes to predict if the passenger survived or not.
The link to the data set: https://www.kaggle.com/competitions/titanic/

2 Loading and Exploring Data

2.1 Download Files

train_path <- "train.csv"
test_path <- "test.csv"

train <- read.csv(train_path)
test <- read.csv(test_path)

The training set has 891 passengers and 12 columns.

dim(train)

## [1] 891  12

dim(test)

## [1] 418  11

str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

test_labels <- test$PassengerId
train <- train[, -1]

dim(train)

## [1] 891  11

3 Exploratory Analysis

3.1 Response variable (Survived)

table(train$Survived)

## 
##   0   1 
## 549 342

ggplot(train, aes(x = as.factor(Survived))) + geom_bar(stat = "count")

There are more people not survived than survived

3.2 Prediction Variables (numeric)

3.2.1 Correlation with Survived or not

I will find the correlations of the numeric variables to Survived

numVar <- which(sapply(train, is.numeric))
train_num <- train[, numVar]

corTab <- cor(train_num, use = "pairwise.complete.obs")
corSorted <- as.matrix(sort(abs(corTab[, "Survived"]), decreasing = TRUE))
corNames <- row.names(corSorted)

corSorted <- corTab[corNames, corNames]
corrplot.mixed(corSorted, tl.col = "black", tl.pos = "lt", tl.cex = 0.7,
    cl.cex = 0.7, number.cex = 0.7)

Pclass (passenger class) has the highest absolute correlation with the Survived variable.

3.2.2 Pclass

Passenger class

1   Upper class
2   Middle class
3   Lower class

table(train$Pclass)

## 
##   1   2   3 
## 216 184 491

We can assume that 1 means first class while 3 mean third class. Which means that the lower the class (third class is the lowest), the lower the survival rate. It is reasonable as the third class deck located at the lower part of the Titanic and thus harder to survive.

ggplot(train, aes(x = Pclass)) + geom_bar(stat = "count")

3.2.3 Fare

The money paid for the ticket

ggplot(train, aes(x = Fare)) + geom_density()

It can be seen that the price is very right skewed. I will keep in mind for later pre-processing.

cor(train$Pclass, train$Fare, use = "pairwise.complete.obs")

## [1] -0.5494996

The fare and passenger class is negatively correlated. (Third class paid the lowest and first class paid the highest). This confirmed that Pclass is the passenger class.

4 Missing Data, label encoding, factorizing Variables

4.1 Missing Data

First, find all variable with missing variables

NAcol <- which(colSums(is.na(train) | train == "") > 0)
sort(colSums(sapply(train[NAcol], function(x) is.na(x) | x ==
    "")), decreasing = TRUE)

##    Cabin      Age Embarked 
##      687      177        2

4.1.1 Cabin

summary(train$Cabin)

##    Length     Class      Mode 
##       891 character character

table(train$Cabin)

## 
##                             A10             A14             A16             A19 
##             687               1               1               1               1 
##             A20             A23             A24             A26             A31 
##               1               1               1               1               1 
##             A32             A34             A36              A5              A6 
##               1               1               1               1               1 
##              A7            B101            B102             B18             B19 
##               1               1               1               2               1 
##             B20             B22             B28              B3             B30 
##               2               2               2               1               1 
##             B35             B37             B38             B39              B4 
##               2               1               1               1               1 
##             B41             B42             B49              B5             B50 
##               1               1               2               2               1 
##     B51 B53 B55 B57 B59 B63 B66         B58 B60             B69             B71 
##               2               2               2               1               1 
##             B73             B77             B78             B79             B80 
##               1               2               1               1               1 
##         B82 B84             B86             B94         B96 B98            C101 
##               1               1               1               4               1 
##            C103            C104            C106            C110            C111 
##               1               1               1               1               1 
##            C118            C123            C124            C125            C126 
##               1               2               2               2               2 
##            C128            C148              C2         C22 C26     C23 C25 C27 
##               1               1               2               3               4 
##             C30             C32             C45             C46             C47 
##               1               1               1               1               1 
##             C49             C50             C52             C54         C62 C64 
##               1               1               2               1               1 
##             C65             C68              C7             C70             C78 
##               2               2               1               1               2 
##             C82             C83             C85             C86             C87 
##               1               2               1               1               1 
##             C90             C91             C92             C93             C95 
##               1               1               2               2               1 
##             C99               D         D10 D12             D11             D15 
##               1               3               1               1               1 
##             D17             D19             D20             D21             D26 
##               2               1               2               1               2 
##             D28             D30             D33             D35             D36 
##               1               1               2               2               2 
##             D37             D45             D46             D47             D48 
##               1               1               1               1               1 
##             D49             D50             D56              D6              D7 
##               1               1               1               1               1 
##              D9             E10            E101             E12            E121 
##               1               1               3               1               2 
##             E17             E24             E25             E31             E33 
##               1               2               2               1               2 
##             E34             E36             E38             E40             E44 
##               1               1               1               1               2 
##             E46             E49             E50             E58             E63 
##               1               1               1               1               1 
##             E67             E68             E77              E8           F E69 
##               2               1               1               2               1 
##           F G63           F G73              F2             F33             F38 
##               1               2               3               3               1 
##              F4              G6               T 
##               2               4               1

Try to separate by the letter of the cabin

train_Cab <- train[train$Cabin != "", ]
table(str_extract(train_Cab$Cabin, "^[A-Z]"))

## 
##  A  B  C  D  E  F  G  T 
## 15 47 59 33 32 13  4  1

train_Cab$CabLet <- str_extract(train_Cab$Cabin, "^[A-Z]")
temp <- train[train$Cabin == "", ]
temp$CabLet <- "None"
train_Cab <- rbind(train_Cab, temp)
train_Cab %>%
    group_by(CabLet) %>%
    summarise(survive_rate = mean(Survived), num = n()) %>%
    mutate(prop = percent(num/sum(num))) %>%
    select(!num)

## # A tibble: 9 × 3
##   CabLet survive_rate prop  
##   <chr>         <dbl> <chr> 
## 1 A             0.467 1.68% 
## 2 B             0.745 5.27% 
## 3 C             0.593 6.62% 
## 4 D             0.758 3.70% 
## 5 E             0.75  3.59% 
## 6 F             0.615 1.46% 
## 7 G             0.5   0.45% 
## 8 None          0.300 77.10%
## 9 T             0     0.11%

Cabin B, D, E has the highest survive rate of about 75%. Although T has 0 survival rate, it only has one sample. It is not that representable.
However, the passenger with Cabin variable has a significantly higher survive rate than passenger with no Cabin.

sur_rate_Cab <- mean(train$Survived[train$Cabin != ""])
sur_rate_nCab <- mean(train$Survived[train$Cabin == ""])
sur_rate <- mean(train$Survived)

t(data.frame(survive_rate = sur_rate, sur_rate_with_cabin = sur_rate_Cab,
    sur_rate_with_no_cabin = sur_rate_nCab))

##                             [,1]
## survive_rate           0.3838384
## sur_rate_with_cabin    0.6666667
## sur_rate_with_no_cabin 0.2998544

I remove the Cabin variable and replace with Cabin letter, while the ones with no Cabin will be replaced by “None”.

train$CabLet <- str_extract(train$Cabin, "^[A-Z]")
train$CabLet[is.na(train$CabLet)] <- "None"
train$Cabin <- NULL
train$CabLet <- as.factor(train$CabLet)

temp <- train %>%
    group_by(CabLet) %>%
    summarise(sur_rate = mean(Survived), num = n()) %>%
    mutate(prop = num/sum(num)) %>%
    select(!num)
ggplot(temp, aes(x = CabLet, y = sur_rate, width = prop)) + geom_bar(stat = "summary",
    fun = "mean") + labs(x = "Cabin Number", y = "Survive Rate")

4.1.2 Age

It is the age of the passenger.

ggplot(train, aes(x = Age, )) + geom_histogram(binwidth = 3)

## Warning: Removed 177 rows containing non-finite values (stat_bin).

We can see that general decreasing trend in the survival rate with respect to the increase of age.

age_summ <- train[!is.na(train$Age), ] %>%
    mutate(age_group = cut(Age, seq(0, 100, by = 10))) %>%
    group_by(age_group) %>%
    summarise(survive_rate = mean(Survived), num = n()) %>%
    mutate(prop = percent(num/sum(num))) %>%
    select(!num)

age_summ

## # A tibble: 8 × 3
##   age_group survive_rate prop 
##   <fct>            <dbl> <chr>
## 1 (0,10]           0.594 9.0% 
## 2 (10,20]          0.383 16.1%
## 3 (20,30]          0.365 32.2%
## 4 (30,40]          0.445 21.7%
## 5 (40,50]          0.384 12.0%
## 6 (50,60]          0.405 5.9% 
## 7 (60,70]          0.235 2.4% 
## 8 (70,80]          0.2   0.7%

ggplot(age_summ, aes(x = age_group, y = survive_rate)) + geom_bar(stat = "identity") +
    labs(x = "age group", y = "survival rate")

Find age and passenger class relation

kable(train[!is.na(train$Age), ] %>%
    group_by(Pclass) %>%
    summarise(avgAge = mean(Age), num = n()) %>%
    mutate(prop = percent(num/sum(num))) %>%
    select(!num))

Pclass	avgAge	prop
1	38.23344	26.1%
2	29.87763	24.2%
3	25.14062	49.7%

I will impute by the mean of the passenger class

train$Age <- ave(train$Age, train$Pclass, FUN = function(x) ifelse(is.na(x),
    mean(x, na.rm = TRUE), x))

4.1.3 Embarked

It is the port that the passengers are embarked from:

C   Cherbourg
Q   Queenstown
S   Southampton

table(train$Embarked)

## 
##       C   Q   S 
##   2 168  77 644

kable(train[train$Embarked != "", ] %>%
    group_by(Embarked) %>%
    summarise(survival_rate = mean(Survived), num = n()) %>%
    mutate(prop = percent(num/sum(num))) %>%
    select(!num))

Embarked	survival_rate	prop
C	0.5535714	19%
Q	0.3896104	9%
S	0.3369565	72%

Look at the 2 rows of missing embarked

kable(train[train$Embarked == "", ])

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	CabLet
62	1	1	Icard, Miss. Amelie	female	38	0	0	113572	80		B
830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62	0	0	113572	80		B

I will impute by replacing with mode

train$Embarked[c(62, 830)] <- names(sort(-table(train$Embarked)))[1]
train$Embarked <- as.factor(train$Embarked)

4.2 Label Encoding

Find all character variables

names(which(sapply(train, is.character)))

## [1] "Name"   "Sex"    "Ticket"

4.2.1 Name

names(which(table(train$name) > 1))

## NULL

# unique(str_extract(train$Name, ', [^.]+\\.'))

There is no repeat in the name of the passenger. Hence, I will delete the variable and create a gender variable

title <- str_extract(train$Name, ", [^.]+\\.")
train$title <- sapply(title, function(x) substring(x, 3, nchar(x)))

# # Titles # Male title M_title <- c('Mr.', 'Master.',
# 'Don.', 'Sir.', 'Jonkheer.') # Neutral / unknown title
# N_title <- c('Rev.', 'Dr.', 'Major.', 'Col.', 'Capt.') #
# Female title F_title <- c('Mrs.', 'Miss.', 'Mme.', 'Ms.',
# 'Lady.', 'Mlle.') train$gender[title %in% M_title] <- 'M'
# train$gender[title %in% N_title | is.na(title)] <- 'N'
# train$gender[title %in% F_title] <- 'F'

title_summ <- train %>%
    group_by(title) %>%
    summarise(survival_rate = mean(Survived), num = n()) %>%
    mutate(prop = num/sum(num)) %>%
    select(!num)
ggplot(title_summ, aes(x = title, y = survival_rate, width = prop)) +
    geom_bar(stat = "identity")

4.2.2 Sex

table(train$Sex)

## 
## female   male 
##    314    577

Change to factor

train$Sex <- as.factor(train$Sex)

4.2.3 Fare and Ticket

Fare is the price for one or more ticket

tickets <- table(train$Ticket)
train$count <- as.numeric(sapply(train$Ticket, FUN = function(x) tickets[x]))
train$price <- train$Fare/train$count

ggplot(train, aes(x = Fare)) + geom_histogram(bins = 10)

ggplot(train, aes(x = price)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sum(is.na(train))

## [1] 0

cor(train$price, train$Pclass)

## [1] -0.6555588

train %>%
    group_by(Pclass) %>%
    summarise(mean = mean(price))

## # A tibble: 3 × 2
##   Pclass  mean
##    <int> <dbl>
## 1      1 43.7 
## 2      2 13.3 
## 3      3  8.09

Since there is almost no repeat in ticket, it doesn’t contribute much to the result. I will remove it.

train <- select(train, -Ticket)

5 Visualization of important variables

numVar <- which(sapply(train, is.numeric))
train_num <- train[, numVar]
train_fac <- train[, -numVar]

corTab <- cor(train_num, use = "pairwise.complete.obs")
corSort <- as.matrix(sort(corTab[, "Survived"], decreasing = TRUE))

corNames <- row.names(corSort)

corMat <- corTab[corNames, corNames]

corrplot.mixed(corMat, tl.pos = "lt")

5.1 Passenger class

ggplot(train, aes(x = Pclass)) + geom_bar(stat = "count")

6 Visualization

6.1 Correlation

numVar <- which(sapply(train, is.numeric))
train_num <- train[, numVar]
corTab <- cor(train_num, use = "pairwise.complete.obs")
corSort <- as.matrix(sort(corTab[, "Survived"], decreasing = TRUE))

corNames <- row.names(corSort)

corMat <- corTab[corNames, corNames]

corrplot.mixed(corMat, tl.pos = "lt")

6.1.1 Fare

fare_summ <- train %>%
    mutate(fare_group = cut(Fare, 10)) %>%
    group_by(fare_group) %>%
    summarise(sur_rate = mean(Survived), n = n()) %>%
    mutate(prop = n/sum(n))
ggplot(fare_summ, aes(x = fare_group, y = sur_rate)) + geom_bar(stat = "summary",
    fun = "mean") + labs(x = "fare group", y = "survival rate") +
    scale_fill_discrete(drop = FALSE) + scale_x_discrete(drop = FALSE)

6.1.2 Price

price_summ <- train %>%
    mutate(price_group = cut(price, 10)) %>%
    group_by(price_group) %>%
    summarise(sur_rate = mean(Survived), n = n()) %>%
    mutate(prop = n/sum(n))
ggplot(price_summ, aes(x = price_group, y = sur_rate)) + geom_bar(stat = "summary",
    fun = "mean") + labs(x = "price group", y = "survival rate") +
    scale_fill_discrete(drop = FALSE) + scale_x_discrete(drop = FALSE)

6.2 Finding importance of variable with quick random forest

set.seed(72022)
quick_rf <- randomForest(x = train[, -1], y = train$Survived,
    ntree = 100, importance = TRUE)

## Warning in randomForest.default(x = train[, -1], y = train$Survived, ntree =
## 100, : The response has five or fewer unique values. Are you sure you want to do
## regression?

imp_rf <- importance(quick_rf)
imp_df <- data.frame(Variables = row.names(imp_rf), MSE = imp_rf[,
    1])
imp_df <- imp_df[order(imp_df$MSE, decreasing = TRUE), ]

ggplot(imp_df, aes(x = reorder(Variables, MSE), y = MSE, fill = MSE)) +
    geom_bar(stat = "identity") + labs(x = "Variables", y = "% increase MSE if variable is randomly permuted") +
    coord_flip() + theme(legend.position = "none")

6.2.1 Sex

ggplot(train, aes(x = Sex, y = Survived)) + geom_bar(stat = "summary",
    fun = "mean")

6.2.2 Pclass

ggplot(train, aes(x = Pclass)) + geom_bar(stat = "count")

6.2.3 Age

age_summ <- train[!is.na(train$Age), ] %>%
    mutate(age_group = cut(Age, seq(0, 100, by = 10))) %>%
    group_by(age_group) %>%
    summarise(survive_rate = mean(Survived), num = n()) %>%
    mutate(prop = percent(num/sum(num))) %>%
    select(!num)

ggplot(age_summ, aes(x = age_group, y = survive_rate)) + geom_bar(stat = "identity") +
    labs(x = "age group", y = "survival rate")

7 Feature engineering

7.1 Title

I will remove title as it is highly correlated to sex and passenger class. It is also the lowest in importance in the random forest.

train <- select(train, !c(title, Name, count, price))

7.2 Cabin Letter

ggplot(train, aes(x = reorder(CabLet, Survived, FUN = mean, decreasing = T),
    y = Survived)) + geom_bar(stat = "summary", fun = "mean") +
    labs(x = "Cabin Letter", y = "Survival Rate")

8 Preparing Data for modeling

8.1 Preprocesing predictor variables

train_preScale <- train
numVarNames <- names(which(sapply(train[, -1], is.numeric)))
train_num <- train[, numVarNames]
train_fac <- train[, !(names(train) %in% c(numVarNames, "Survived"))]

8.1.1 Skewness

log_names <- c()
for (i in 1:ncol(train_num)) {
    if (abs(skew(train_num[, i])) > 0.8) {
        train_num[, i] <- log(train_num[, i] + 1)
        log_names <- c(log_names, i)
    }
}
log_names <- names(train_num)[log_names]

8.1.2 Normalizing and feature scaling

train_num <- as.data.frame(scale(train_num))

8.1.3 one hot encoding

Change predictors to numeric

train_fac <- as.data.frame(model.matrix(~. - 1, train_fac))
dim(train_fac)

## [1] 891  12

8.2 Combining

train <- cbind(data.frame(Survived = as.factor(train$Survived)),
    train_num, train_fac)
str(train)

## 'data.frame':    891 obs. of  18 variables:
##  $ Survived  : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass    : num  0.827 -1.565 0.827 -1.565 0.827 ...
##  $ Age       : num  -0.552 0.659 -0.249 0.432 0.432 ...
##  $ SibSp     : num  0.889 0.889 -0.609 0.889 -0.609 ...
##  $ Parch     : num  -0.529 -0.529 -0.529 -0.529 -0.529 ...
##  $ Fare      : num  -0.879 1.36 -0.798 1.061 -0.784 ...
##  $ Sexfemale : num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Sexmale   : num  1 0 0 0 1 1 1 1 0 0 ...
##  $ EmbarkedQ : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ EmbarkedS : num  1 0 1 1 1 0 1 1 1 0 ...
##  $ CabLetB   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetC   : num  0 1 0 1 0 0 0 0 0 0 ...
##  $ CabLetD   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetE   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ CabLetF   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetNone: num  1 0 1 0 1 1 0 1 1 1 ...
##  $ CabLetT   : num  0 0 0 0 0 0 0 0 0 0 ...

log_names

## [1] "SibSp" "Parch" "Fare"

9 Prepare test set for predicting

test$CabLet <- str_extract(test$Cabin, "^[A-Z]")
test$CabLet[is.na(test$CabLet)] <- "None"

agePclassTab <- train_preScale %>%
    group_by(Pclass) %>%
    summarise(avgAge = round(mean(Age), digits = 2), num = n()) %>%
    mutate(prop = percent(num/sum(num))) %>%
    select(!num)

test$Age[is.na(test$Age)] <- agePclassTab[test$Pclass[is.na(test$Age)],
    2][[1]]

test$Embarked[is.na(test$Embarked)] <- names(sort(-table(train$Embarked)))[1]

test <- select(test, !c(Name, Ticket, Cabin))
test$Sex <- as.factor(test$Sex)

test$Fare[is.na(test$Fare)] <- mean(train_preScale$Fare)

str(test)

## 'data.frame':    418 obs. of  9 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...
##  $ CabLet     : chr  "None" "None" "None" "None" ...

test_num <- test[, numVarNames]
test_fac <- test[, !(names(test) %in% numVarNames)]

test_num[, log_names] <- log(test_num[, log_names] + 1)
test_num <- as.data.frame(scale(test_num))

test_fac <- as.data.frame(model.matrix(~. - 1, test_fac))
dim(test_fac)

## [1] 418  12

test <- cbind(test_num, test_fac)
test$CabLetT <- 0
dim(test)

## [1] 418  18

dim(train)

## [1] 891  18

str(test)

## 'data.frame':    418 obs. of  18 variables:
##  $ Pclass     : num  0.872 0.872 -0.315 0.872 0.872 ...
##  $ Age        : num  0.385 1.358 2.526 -0.199 -0.588 ...
##  $ SibSp      : num  -0.633 1.037 -0.633 -0.633 1.037 ...
##  $ Parch      : num  -0.497 -0.497 -0.497 -0.497 1.139 ...
##  $ Fare       : num  -0.868 -0.97 -0.67 -0.774 -0.445 ...
##  $ PassengerId: num  892 893 894 895 896 897 898 899 900 901 ...
##  $ Sexfemale  : num  0 1 0 0 1 0 1 0 1 0 ...
##  $ Sexmale    : num  1 0 1 1 0 1 0 1 0 1 ...
##  $ EmbarkedQ  : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ EmbarkedS  : num  0 1 0 1 1 1 0 1 0 1 ...
##  $ CabLetB    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetD    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetE    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetF    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetG    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetNone : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ CabLetT    : num  0 0 0 0 0 0 0 0 0 0 ...

str(train)

## 'data.frame':    891 obs. of  18 variables:
##  $ Survived  : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass    : num  0.827 -1.565 0.827 -1.565 0.827 ...
##  $ Age       : num  -0.552 0.659 -0.249 0.432 0.432 ...
##  $ SibSp     : num  0.889 0.889 -0.609 0.889 -0.609 ...
##  $ Parch     : num  -0.529 -0.529 -0.529 -0.529 -0.529 ...
##  $ Fare      : num  -0.879 1.36 -0.798 1.061 -0.784 ...
##  $ Sexfemale : num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Sexmale   : num  1 0 0 0 1 1 1 1 0 0 ...
##  $ EmbarkedQ : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ EmbarkedS : num  1 0 1 1 1 0 1 1 1 0 ...
##  $ CabLetB   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetC   : num  0 1 0 1 0 0 0 0 0 0 ...
##  $ CabLetD   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetE   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ CabLetF   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CabLetNone: num  1 0 1 0 1 1 0 1 1 1 ...
##  $ CabLetT   : num  0 0 0 0 0 0 0 0 0 0 ...

names(train)[which(!(names(train) %in% names(test)))]

## [1] "Survived"

10 Modeling

10.1 Random Forest

set.seed(1234)
mod_rf <- randomForest(Survived ~ ., train, ntree = 500, importance = T)
mod_rf

## 
## Call:
##  randomForest(formula = Survived ~ ., data = train, ntree = 500,      importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 17.96%
## Confusion matrix:
##     0   1 class.error
## 0 498  51  0.09289617
## 1 109 233  0.31871345

imp_rf <- importance(mod_rf)
imp_df <- data.frame(Variables = row.names(imp_rf), MSE = imp_rf[,
    1])
imp_df <- imp_df[order(imp_df$MSE, decreasing = TRUE), ]

ggplot(imp_df, aes(x = reorder(Variables, MSE), y = MSE, fill = MSE)) +
    geom_bar(stat = "identity") + labs(x = "Variables", y = "% increase MSE if variable is randomly permuted") +
    coord_flip() + theme(legend.position = "none")

pred <- predict(mod_rf, newdata = test, type = "class")
head(pred)

## 1 2 3 4 5 6 
## 0 0 0 0 0 0 
## Levels: 0 1

result <- data.frame(PassengerId = test$PassengerId, Survived = as.numeric(pred) -
    1)
# write.csv(result, 'result.csv', row.names = FALSE)

10.2 Logistic regression

mod_log <- glm(Survived ~ ., family = binomial(link = "logit"),
    data = train)
pred <- predict(mod_log, newdata = test, type = "response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

pred_bi <- ifelse(pred > 0.5, 1, 0)
result <- data.frame(PassengerId = test$PassengerId, Survived = pred_bi)
# write.csv(result, 'result_log.csv', row.names = FALSE)

10.3 decision tree

dtmod <- rpart(Survived ~ ., data = train, method = "class")

pred_dt <- predict(dtmod, newdata = test, type = "class")
result <- data.frame(PassengerId = test$PassengerId, Survived = pred_dt)
# write.csv(result, 'result_dt.csv', row.names = FALSE)

Titanic

Jamin Wong

2022-07-28