I use the Titanic data set from Kaggle to build prediction models. The models use
different passenger attributes to predict whether a passenger survived.
The data set is available at https://www.kaggle.com/competitions/titanic/
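# Packages used throughout this analysis (skew() is assumed to come from psych)
pkgs <- c("ggplot2", "dplyr", "stringr", "scales", "knitr", "corrplot", "psych", "randomForest", "rpart")
invisible(lapply(pkgs, library, character.only = TRUE))
train_path <- "train.csv"
test_path <- "test.csv"
train <- read.csv(train_path)
test <- read.csv(test_path)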
The training set has 891 passengers and 12 columns.
dim(train)
## [1] 891 12
dim(test)
## [1] 418 11
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
test_labels <- test$PassengerId
train <- train[, -1]
dim(train)
## [1] 891 11
table(train$Survived)
##
## 0 1
## 549 342
ggplot(train, aes(x = as.factor(Survived))) + geom_bar(stat = "count")
More passengers died than survived (549 versus 342).
Next, I compute the correlations of the numeric variables with Survived.
numVar <- which(sapply(train, is.numeric))
train_num <- train[, numVar]
corTab <- cor(train_num, use = "pairwise.complete.obs")
corSorted <- as.matrix(sort(abs(corTab[, "Survived"]), decreasing = TRUE))
corNames <- row.names(corSorted)
corSorted <- corTab[corNames, corNames]
corrplot.mixed(corSorted, tl.col = "black", tl.pos = "lt", tl.cex = 0.7,
cl.cex = 0.7, number.cex = 0.7)
Pclass (passenger class) has the highest absolute correlation with the Survived variable.
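The ranking step can be sketched in base R without the plot. The data frame below is a toy with invented values; only the column names mirror the Titanic data, and `ranked` is a name introduced here for illustration.

```r
# Toy data: values are invented, not taken from train.csv.
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 1),
  Pclass   = c(3, 1, 3, 1, 3, 3, 2, 2),
  Age      = c(22, 38, 26, 35, 35, NA, 54, 27),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08)
)

# Pairwise-complete correlations, so the NA in Age only affects pairs involving Age.
cor_tab <- cor(toy, use = "pairwise.complete.obs")

# Drop Survived itself, then rank predictors by absolute correlation with it.
cor_with_target <- cor_tab[, "Survived"]
cor_with_target <- cor_with_target[names(cor_with_target) != "Survived"]
ranked <- sort(abs(cor_with_target), decreasing = TRUE)
ranked
```

Sorting on the absolute value keeps strongly negative predictors such as Pclass near the top, which a plain descending sort would push to the bottom.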
Passenger class
1 Upper class
2 Middle class
3 Lower class
table(train$Pclass)
##
## 1 2 3
## 216 184 491
We can assume that 1 means first class and 3 means third class, so the lower the class (third class is the lowest), the lower the survival rate. This is plausible, since the third-class decks were located in the lower part of the Titanic, which made survival harder.
ggplot(train, aes(x = Pclass)) + geom_bar(stat = "count")
The money paid for the ticket
ggplot(train, aes(x = Fare)) + geom_density()
The fare distribution is strongly right-skewed. I will keep this in mind for the pre-processing step later.
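To put a number on the skew, here is a minimal sketch with invented fares. `skewness()` is a hypothetical moment-based helper written for this example, not the function used later in the analysis.

```r
# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Invented fares with a long right tail.
fare <- c(7.25, 7.92, 8.05, 8.46, 13, 21.08, 26.55, 51.86, 71.28, 263, 512.33)

skew_raw <- skewness(fare)
skew_log <- skewness(log1p(fare))  # log1p(x) is log(x + 1), so zero fares are safe
c(raw = skew_raw, log = skew_log)
```

A positive value indicates a right tail. The log transform pulls the few very large fares toward the bulk of the data, which is why it is a common remedy for this shape.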
cor(train$Pclass, train$Fare, use = "pairwise.complete.obs")
## [1] -0.5494996
Fare and passenger class are negatively correlated: third class paid the least and first class paid the most. This supports reading Pclass as the passenger class, with 1 as the highest class.
First, find all variables with missing values.
NAcol <- which(colSums(is.na(train) | train == "") > 0)
sort(colSums(sapply(train[NAcol], function(x) is.na(x) | x ==
"")), decreasing = TRUE)
## Cabin Age Embarked
## 687 177 2
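The same count can be wrapped in a small reusable function. This is a sketch on a toy data frame; `count_missing()` is a hypothetical helper name, and it treats both `NA` and the empty string as missing, as the code above does.

```r
# Hypothetical helper: per-column count of entries that are NA or "".
count_missing <- function(df) {
  counts <- vapply(df, function(col) sum(is.na(col) | col == ""), integer(1))
  sort(counts[counts > 0], decreasing = TRUE)
}

# Toy data with invented values.
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  Fare     = c(7.25, 71.28, 7.92, 53.10),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(toy)
missing_counts
```

Checking for `""` matters here because `read.csv()` reads blank text fields as empty strings rather than `NA`, so `is.na()` alone would miss Cabin and Embarked.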
summary(train$Cabin)
## Length Class Mode
## 891 character character
table(train$Cabin)
##
##                             A10             A14             A16             A19
##             687               1               1               1               1
##             A20             A23             A24             A26             A31
##               1               1               1               1               1
##             A32             A34             A36              A5              A6
##               1               1               1               1               1
##              A7            B101            B102             B18             B19
##               1               1               1               2               1
##             B20             B22             B28              B3             B30
##               2               2               2               1               1
##             B35             B37             B38             B39              B4
##               2               1               1               1               1
##             B41             B42             B49              B5             B50
##               1               1               2               2               1
##     B51 B53 B55 B57 B59 B63 B66         B58 B60             B69             B71
##               2               2               2               1               1
##             B73             B77             B78             B79             B80
##               1               2               1               1               1
##         B82 B84             B86             B94         B96 B98            C101
##               1               1               1               4               1
##            C103            C104            C106            C110            C111
##               1               1               1               1               1
##            C118            C123            C124            C125            C126
##               1               2               2               2               2
##            C128            C148              C2         C22 C26     C23 C25 C27
##               1               1               2               3               4
##             C30             C32             C45             C46             C47
##               1               1               1               1               1
##             C49             C50             C52             C54         C62 C64
##               1               1               2               1               1
##             C65             C68              C7             C70             C78
##               2               2               1               1               2
##             C82             C83             C85             C86             C87
##               1               2               1               1               1
##             C90             C91             C92             C93             C95
##               1               1               2               2               1
##             C99               D         D10 D12             D11             D15
##               1               3               1               1               1
##             D17             D19             D20             D21             D26
##               2               1               2               1               2
##             D28             D30             D33             D35             D36
##               1               1               2               2               2
##             D37             D45             D46             D47             D48
##               1               1               1               1               1
##             D49             D50             D56              D6              D7
##               1               1               1               1               1
##              D9             E10            E101             E12            E121
##               1               1               3               1               2
##             E17             E24             E25             E31             E33
##               1               2               2               1               2
##             E34             E36             E38             E40             E44
##               1               1               1               1               2
##             E46             E49             E50             E58             E63
##               1               1               1               1               1
##             E67             E68             E77              E8           F E69
##               2               1               1               2               1
##           F G63           F G73              F2             F33             F38
##               1               2               3               3               1
##              F4              G6               T
##               2               4               1
Next, group the cabins by their leading letter.
train_Cab <- train[train$Cabin != "", ]
table(str_extract(train_Cab$Cabin, "^[A-Z]"))
##
## A B C D E F G T
## 15 47 59 33 32 13 4 1
train_Cab$CabLet <- str_extract(train_Cab$Cabin, "^[A-Z]")
temp <- train[train$Cabin == "", ]
temp$CabLet <- "None"
train_Cab <- rbind(train_Cab, temp)
train_Cab %>%
group_by(CabLet) %>%
summarise(survive_rate = mean(Survived), num = n()) %>%
mutate(prop = percent(num/sum(num))) %>%
select(!num)
## # A tibble: 9 × 3
## CabLet survive_rate prop
## <chr> <dbl> <chr>
## 1 A 0.467 1.68%
## 2 B 0.745 5.27%
## 3 C 0.593 6.62%
## 4 D 0.758 3.70%
## 5 E 0.75 3.59%
## 6 F 0.615 1.46%
## 7 G 0.5 0.45%
## 8 None 0.300 77.10%
## 9 T 0 0.11%
Cabins B, D, and E have the highest survival rates, at about 75%. Cabin T
has a survival rate of 0, but it contains only one passenger, so it is not
representative.
Overall, passengers with a recorded cabin have a much higher
survival rate than passengers without one.
sur_rate_Cab <- mean(train$Survived[train$Cabin != ""])
sur_rate_nCab <- mean(train$Survived[train$Cabin == ""])
sur_rate <- mean(train$Survived)
t(data.frame(survive_rate = sur_rate, sur_rate_with_cabin = sur_rate_Cab,
sur_rate_with_no_cabin = sur_rate_nCab))
## [,1]
## survive_rate 0.3838384
## sur_rate_with_cabin 0.6666667
## sur_rate_with_no_cabin 0.2998544
I drop the Cabin variable and replace it with the cabin letter; passengers with no recorded cabin get the level “None”.
train$CabLet <- str_extract(train$Cabin, "^[A-Z]")
train$CabLet[is.na(train$CabLet)] <- "None"
train$Cabin <- NULL
train$CabLet <- as.factor(train$CabLet)
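For reference, the same letter extraction can be sketched in base R without stringr. The cabin strings below are a toy vector chosen to cover the awkward cases (empty, multi-cabin, `NA`); `cab_let` is a name used only in this example.

```r
# Toy cabin values: empty, single cabin, several cabins, letter-only, and NA.
cabin <- c("", "C85", "C23 C25 C27", "F G73", "T", NA)

# Keep the first character when it is a capital letter; otherwise fall back to "None".
first_char <- substr(cabin, 1, 1)
is_letter  <- !is.na(first_char) & grepl("^[A-Z]$", first_char)
cab_let    <- factor(ifelse(is_letter, first_char, "None"))
cab_let
```

As with the `"^[A-Z]"` pattern above, an entry listing several cabins is assigned the letter of the first one, so "F G73" counts as F.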
temp <- train %>%
group_by(CabLet) %>%
summarise(sur_rate = mean(Survived), num = n()) %>%
mutate(prop = num/sum(num)) %>%
select(!num)
ggplot(temp, aes(x = CabLet, y = sur_rate, width = prop)) + geom_bar(stat = "summary",
fun = "mean") + labs(x = "Cabin Number", y = "Survive Rate")
It is the age of the passenger.
ggplot(train, aes(x = Age, )) + geom_histogram(binwidth = 3)
## Warning: Removed 177 rows containing non-finite values (stat_bin).
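Children aged 10 or under have the highest survival rate (about 59%) and passengers over 60 the lowest (20% to 24%). Between those ages the rate stays between roughly 37% and 45%, so the decline with age is not steady.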
age_summ <- train[!is.na(train$Age), ] %>%
mutate(age_group = cut(Age, seq(0, 100, by = 10))) %>%
group_by(age_group) %>%
summarise(survive_rate = mean(Survived), num = n()) %>%
mutate(prop = percent(num/sum(num))) %>%
select(!num)
age_summ
## # A tibble: 8 × 3
## age_group survive_rate prop
## <fct> <dbl> <chr>
## 1 (0,10] 0.594 9.0%
## 2 (10,20] 0.383 16.1%
## 3 (20,30] 0.365 32.2%
## 4 (30,40] 0.445 21.7%
## 5 (40,50] 0.384 12.0%
## 6 (50,60] 0.405 5.9%
## 7 (60,70] 0.235 2.4%
## 8 (70,80] 0.2 0.7%
ggplot(age_summ, aes(x = age_group, y = survive_rate)) + geom_bar(stat = "identity") +
labs(x = "age group", y = "survival rate")
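The binning behind this summary can be sketched in base R. The ages and outcomes below are invented; the point is only how `cut()` assigns boundary values.

```r
# Invented ages and outcomes.
age      <- c(0.42, 10, 10.5, 35, 80)
survived <- c(1, 1, 0, 1, 0)

# Intervals are right-closed by default: age 10 falls in (0,10], age 10.5 in (10,20].
age_group <- cut(age, breaks = seq(0, 100, by = 10))

# Survival rate per bin; bins with no passengers come back as NA.
rate <- tapply(survived, age_group, mean)
rate
```

Because the first interval is open on the left, an age of exactly 0 would become `NA`; passing `include.lowest = TRUE` to `cut()` would close that gap if it ever mattered.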
Check how age relates to passenger class.
kable(train[!is.na(train$Age), ] %>%
group_by(Pclass) %>%
summarise(avgAge = mean(Age), num = n()) %>%
mutate(prop = percent(num/sum(num))) %>%
select(!num))
| Pclass | avgAge | prop |
|---|---|---|
| 1 | 38.23344 | 26.1% |
| 2 | 29.87763 | 24.2% |
| 3 | 25.14062 | 49.7% |
I will impute each missing age with the mean age of the passenger's class.
train$Age <- ave(train$Age, train$Pclass, FUN = function(x) ifelse(is.na(x),
mean(x, na.rm = TRUE), x))
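A small worked example shows what the `ave()` call does. The data frame is a toy with invented ages, and `Age_imputed` is a column added only for this illustration.

```r
# Toy data: two classes, one missing age in each.
toy <- data.frame(
  Pclass = c(1, 1, 1, 3, 3, 3),
  Age    = c(40, NA, 36, 20, 30, NA)
)

# ave() applies FUN within each Pclass group and returns values in the original row order.
toy$Age_imputed <- ave(toy$Age, toy$Pclass, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
toy
```

The missing first-class age becomes 38 (the mean of 40 and 36) and the missing third-class age becomes 25 (the mean of 20 and 30); observed ages are left unchanged.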
Embarked is the port at which the passenger boarded:
C Cherbourg
Q Queenstown
S Southampton
table(train$Embarked)
##
##       C   Q   S
##   2 168  77 644
kable(train[train$Embarked != "", ] %>%
group_by(Embarked) %>%
summarise(survival_rate = mean(Survived), num = n()) %>%
mutate(prop = percent(num/sum(num))) %>%
select(!num))
| Embarked | survival_rate | prop |
|---|---|---|
| C | 0.5535714 | 19% |
| Q | 0.3896104 | 9% |
| S | 0.3369565 | 72% |
Look at the two rows with a missing Embarked value.
kable(train[train$Embarked == "", ])
|  | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | CabLet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 |  | B |
| 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 |  | B |
I will impute them with the mode.
train$Embarked[c(62, 830)] <- names(sort(-table(train$Embarked)))[1]
train$Embarked <- as.factor(train$Embarked)
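Mode imputation can also be written without hard-coded row numbers. This is a sketch on an invented vector; `mode_value()` is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: most frequent value, ignoring NA and empty strings.
mode_value <- function(x) {
  x <- x[!is.na(x) & x != ""]
  names(which.max(table(x)))
}

# Invented ports with two blanks.
embarked <- c("S", "C", "", "S", "Q", "S", "")

embarked[embarked == ""] <- mode_value(embarked)
embarked <- factor(embarked)
table(embarked)
```

Excluding the blanks before taking the mode guards against the corner case where the empty string itself is the most common value.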
Find all character variables
names(which(sapply(train, is.character)))
## [1] "Name" "Sex" "Ticket"
# Column names are case-sensitive: the column is Name, not name
names(which(table(train$Name) > 1))
## character(0)
# unique(str_extract(train$Name, ', [^.]+\\.'))
No passenger name is repeated, so the full name cannot help prediction on its own. Instead, I extract each passenger's title from the name and will drop the name itself later.
title <- str_extract(train$Name, ", [^.]+\\.")
train$title <- sapply(title, function(x) substring(x, 3, nchar(x)))
# # Titles
# # Male title
# M_title <- c('Mr.', 'Master.', 'Don.', 'Sir.', 'Jonkheer.')
# # Neutral / unknown title
# N_title <- c('Rev.', 'Dr.', 'Major.', 'Col.', 'Capt.')
# # Female title
# F_title <- c('Mrs.', 'Miss.', 'Mme.', 'Ms.', 'Lady.', 'Mlle.')
# train$gender[title %in% M_title] <- 'M'
# train$gender[title %in% N_title | is.na(title)] <- 'N'
# train$gender[title %in% F_title] <- 'F'
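The title extraction can be checked on a few names in base R. The first two names come from the `str(train)` output above; the regular expression is an equivalent sketch, not the stringr call used in the analysis.

```r
# Names in "Surname, Title. Given names" form.
name <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

# Capture everything between the first ", " and the first "." (keeping the dot).
title <- sub("^[^,]*, ([^.]+\\.).*$", "\\1", name)
title
```

Note that `sub()` returns the input unchanged when the pattern does not match, whereas `str_extract()` returns `NA`, so a name without a comma would need separate handling in this variant.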
title_summ <- train %>%
group_by(title) %>%
summarise(survival_rate = mean(Survived), num = n()) %>%
mutate(prop = num/sum(num)) %>%
select(!num)
ggplot(title_summ, aes(x = title, y = survival_rate, width = prop)) +
geom_bar(stat = "identity")
table(train$Sex)
##
## female male
## 314 577
Change to factor
train$Sex <- as.factor(train$Sex)
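Fare is the price of the whole ticket, which may cover more than one passenger, so I divide it by the number of passengers sharing the ticket to get a per-person price.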
tickets <- table(train$Ticket)
train$count <- as.numeric(sapply(train$Ticket, FUN = function(x) tickets[x]))
train$price <- train$Fare/train$count
ggplot(train, aes(x = Fare)) + geom_histogram(bins = 10)
ggplot(train, aes(x = price)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sum(is.na(train))
## [1] 0
cor(train$price, train$Pclass)
## [1] -0.6555588
train %>%
group_by(Pclass) %>%
summarise(mean = mean(price))
## # A tibble: 3 × 2
## Pclass mean
## <int> <dbl>
## 1 1 43.7
## 2 2 13.3
## 3 3 8.09
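The per-person price calculation is easiest to follow on a handful of rows. The tickets and fares below are a toy example (ticket 113572 at a fare of 80 does appear in the data above; the other rows are invented).

```r
# Toy data: two passengers share one ticket, three share another.
toy <- data.frame(
  Ticket = c("113572", "113572", "A/5 21171", "PC 17599", "PC 17599", "PC 17599"),
  Fare   = c(80, 80, 7.25, 90, 90, 90),
  stringsAsFactors = FALSE
)

# Number of passengers on each ticket, looked up by ticket number.
ticket_counts <- table(toy$Ticket)
toy$count <- as.numeric(ticket_counts[toy$Ticket])

# Split the ticket fare evenly across the passengers sharing it.
toy$price <- toy$Fare / toy$count
toy
```

One caveat of this approach: the counts only include passengers present in the data frame, so a party split between the training and test sets would be undercounted.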
Most ticket numbers appear only once, so Ticket itself adds little to the result beyond the count derived from it. I will remove it.
train <- select(train, -Ticket)
numVar <- which(sapply(train, is.numeric))
train_num <- train[, numVar]
train_fac <- train[, -numVar]
corTab <- cor(train_num, use = "pairwise.complete.obs")
corSort <- as.matrix(sort(corTab[, "Survived"], decreasing = TRUE))
corNames <- row.names(corSort)
corMat <- corTab[corNames, corNames]
corrplot.mixed(corMat, tl.pos = "lt")
fare_summ <- train %>%
mutate(fare_group = cut(Fare, 10)) %>%
group_by(fare_group) %>%
summarise(sur_rate = mean(Survived), n = n()) %>%
mutate(prop = n/sum(n))
ggplot(fare_summ, aes(x = fare_group, y = sur_rate)) + geom_bar(stat = "summary",
fun = "mean") + labs(x = "fare group", y = "survival rate") +
scale_fill_discrete(drop = FALSE) + scale_x_discrete(drop = FALSE)
price_summ <- train %>%
mutate(price_group = cut(price, 10)) %>%
group_by(price_group) %>%
summarise(sur_rate = mean(Survived), n = n()) %>%
mutate(prop = n/sum(n))
ggplot(price_summ, aes(x = price_group, y = sur_rate)) + geom_bar(stat = "summary",
fun = "mean") + labs(x = "price group", y = "survival rate") +
scale_fill_discrete(drop = FALSE) + scale_x_discrete(drop = FALSE)
set.seed(72022)
quick_rf <- randomForest(x = train[, -1], y = train$Survived,
ntree = 100, importance = TRUE)
## Warning in randomForest.default(x = train[, -1], y = train$Survived, ntree =
## 100, : The response has five or fewer unique values. Are you sure you want to do
## regression?
imp_rf <- importance(quick_rf)
imp_df <- data.frame(Variables = row.names(imp_rf), MSE = imp_rf[,
1])
imp_df <- imp_df[order(imp_df$MSE, decreasing = TRUE), ]
ggplot(imp_df, aes(x = reorder(Variables, MSE), y = MSE, fill = MSE)) +
geom_bar(stat = "identity") + labs(x = "Variables", y = "% increase MSE if variable is randomly permuted") +
coord_flip() + theme(legend.position = "none")
ggplot(train, aes(x = Sex, y = Survived)) + geom_bar(stat = "summary",
fun = "mean")
ggplot(train, aes(x = Pclass)) + geom_bar(stat = "count")
age_summ <- train[!is.na(train$Age), ] %>%
mutate(age_group = cut(Age, seq(0, 100, by = 10))) %>%
group_by(age_group) %>%
summarise(survive_rate = mean(Survived), num = n()) %>%
mutate(prop = percent(num/sum(num))) %>%
select(!num)
ggplot(age_summ, aes(x = age_group, y = survive_rate)) + geom_bar(stat = "identity") +
labs(x = "age group", y = "survival rate")
I will remove title because it is highly correlated with sex and passenger class. It also has the lowest importance in the random forest.
train <- select(train, !c(title, Name, count, price))
# reorder() has no decreasing argument; negate Survived to sort bars from highest to lowest rate
ggplot(train, aes(x = reorder(CabLet, -Survived, FUN = mean),
y = Survived)) + geom_bar(stat = "summary", fun = "mean") +
labs(x = "Cabin Letter", y = "Survival Rate")
train_preScale <- train
numVarNames <- names(which(sapply(train[, -1], is.numeric)))
train_num <- train[, numVarNames]
train_fac <- train[, !(names(train) %in% c(numVarNames, "Survived"))]
log_names <- c()
for (i in 1:ncol(train_num)) {
if (abs(skew(train_num[, i])) > 0.8) {
train_num[, i] <- log(train_num[, i] + 1)
log_names <- c(log_names, i)
}
}
log_names <- names(train_num)[log_names]
train_num <- as.data.frame(scale(train_num))
Convert the categorical predictors to numeric dummy variables.
train_fac <- as.data.frame(model.matrix(~. - 1, train_fac))
dim(train_fac)
## [1] 891 12
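The column count follows from how `model.matrix()` encodes factors. The sketch below uses a toy data frame with two factors to show the difference between dropping and keeping the intercept.

```r
# Toy data with two factors.
toy <- data.frame(
  Sex      = factor(c("male", "female", "female", "male")),
  Embarked = factor(c("S", "C", "Q", "S"))
)

# Without an intercept, the first factor keeps all of its levels.
full_first <- model.matrix(~ . - 1, toy)
# With an intercept, every factor drops its first level as the reference.
with_ref <- model.matrix(~ ., toy)

colnames(full_first)
colnames(with_ref)
```

This is why the encoded data contains both Sexfemale and Sexmale but no EmbarkedC or CabLetA. The two Sex columns always sum to 1, which is harmless for tree-based models but makes them collinear with an intercept in a regression.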
train <- cbind(data.frame(Survived = as.factor(train$Survived)),
train_num, train_fac)
str(train)
## 'data.frame': 891 obs. of 18 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : num 0.827 -1.565 0.827 -1.565 0.827 ...
## $ Age : num -0.552 0.659 -0.249 0.432 0.432 ...
## $ SibSp : num 0.889 0.889 -0.609 0.889 -0.609 ...
## $ Parch : num -0.529 -0.529 -0.529 -0.529 -0.529 ...
## $ Fare : num -0.879 1.36 -0.798 1.061 -0.784 ...
## $ Sexfemale : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Sexmale : num 1 0 0 0 1 1 1 1 0 0 ...
## $ EmbarkedQ : num 0 0 0 0 0 1 0 0 0 0 ...
## $ EmbarkedS : num 1 0 1 1 1 0 1 1 1 0 ...
## $ CabLetB : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetC : num 0 1 0 1 0 0 0 0 0 0 ...
## $ CabLetD : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetE : num 0 0 0 0 0 0 1 0 0 0 ...
## $ CabLetF : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetNone: num 1 0 1 0 1 1 0 1 1 1 ...
## $ CabLetT : num 0 0 0 0 0 0 0 0 0 0 ...
log_names
## [1] "SibSp" "Parch" "Fare"
test$CabLet <- str_extract(test$Cabin, "^[A-Z]")
test$CabLet[is.na(test$CabLet)] <- "None"
agePclassTab <- train_preScale %>%
group_by(Pclass) %>%
summarise(avgAge = round(mean(Age), digits = 2), num = n()) %>%
mutate(prop = percent(num/sum(num))) %>%
select(!num)
test$Age[is.na(test$Age)] <- agePclassTab[test$Pclass[is.na(test$Age)],
2][[1]]
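# train no longer has an Embarked column at this point, so take the mode from train_preScale
test$Embarked[is.na(test$Embarked) | test$Embarked == ""] <- names(sort(-table(train_preScale$Embarked)))[1]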
test <- select(test, !c(Name, Ticket, Cabin))
test$Sex <- as.factor(test$Sex)
test$Fare[is.na(test$Fare)] <- mean(train_preScale$Fare)
str(test)
## 'data.frame': 418 obs. of 9 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
## $ CabLet : chr "None" "None" "None" "None" ...
test_num <- test[, numVarNames]
test_fac <- test[, !(names(test) %in% numVarNames)]
test_num[, log_names] <- log(test_num[, log_names] + 1)
test_num <- as.data.frame(scale(test_num))
test_fac <- as.data.frame(model.matrix(~. - 1, test_fac))
dim(test_fac)
## [1] 418 12
test <- cbind(test_num, test_fac)
test$CabLetT <- 0
dim(test)
## [1] 418 18
dim(train)
## [1] 891 18
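The test set above is standardised with `scale(test_num)`, which uses the test set's own means and standard deviations. A common alternative is to reuse the training parameters so both sets share one scale. The sketch below shows that variant on invented numbers; it is not what the code above does, and `centers` and `spreads` are names introduced here.

```r
# Invented numeric predictors.
train_num <- data.frame(Age = c(20, 30, 40), Fare = c(10, 20, 60))
test_num  <- data.frame(Age = c(30, 50),     Fare = c(20, 30))

# Estimate the centring and scaling parameters on the training set only.
centers <- colMeans(train_num)
spreads <- sapply(train_num, sd)

# Apply the same parameters to both sets.
train_scaled <- as.data.frame(scale(train_num, center = centers, scale = spreads))
test_scaled  <- as.data.frame(scale(test_num,  center = centers, scale = spreads))
test_scaled
```

For tree-based models the choice makes little difference to which splits are available, but for logistic regression the coefficients are tied to the training scale, so mismatched scaling can shift the predicted probabilities.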
str(test)
## 'data.frame': 418 obs. of 18 variables:
## $ Pclass : num 0.872 0.872 -0.315 0.872 0.872 ...
## $ Age : num 0.385 1.358 2.526 -0.199 -0.588 ...
## $ SibSp : num -0.633 1.037 -0.633 -0.633 1.037 ...
## $ Parch : num -0.497 -0.497 -0.497 -0.497 1.139 ...
## $ Fare : num -0.868 -0.97 -0.67 -0.774 -0.445 ...
## $ PassengerId: num 892 893 894 895 896 897 898 899 900 901 ...
## $ Sexfemale : num 0 1 0 0 1 0 1 0 1 0 ...
## $ Sexmale : num 1 0 1 1 0 1 0 1 0 1 ...
## $ EmbarkedQ : num 1 0 1 0 0 0 1 0 0 0 ...
## $ EmbarkedS : num 0 1 0 1 1 1 0 1 0 1 ...
## $ CabLetB : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetD : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetF : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetNone : num 1 1 1 1 1 1 1 1 1 1 ...
## $ CabLetT : num 0 0 0 0 0 0 0 0 0 0 ...
str(train)
## 'data.frame': 891 obs. of 18 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : num 0.827 -1.565 0.827 -1.565 0.827 ...
## $ Age : num -0.552 0.659 -0.249 0.432 0.432 ...
## $ SibSp : num 0.889 0.889 -0.609 0.889 -0.609 ...
## $ Parch : num -0.529 -0.529 -0.529 -0.529 -0.529 ...
## $ Fare : num -0.879 1.36 -0.798 1.061 -0.784 ...
## $ Sexfemale : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Sexmale : num 1 0 0 0 1 1 1 1 0 0 ...
## $ EmbarkedQ : num 0 0 0 0 0 1 0 0 0 0 ...
## $ EmbarkedS : num 1 0 1 1 1 0 1 1 1 0 ...
## $ CabLetB : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetC : num 0 1 0 1 0 0 0 0 0 0 ...
## $ CabLetD : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetE : num 0 0 0 0 0 0 1 0 0 0 ...
## $ CabLetF : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CabLetNone: num 1 0 1 0 1 1 0 1 1 1 ...
## $ CabLetT : num 0 0 0 0 0 0 0 0 0 0 ...
names(train)[which(!(names(train) %in% names(test)))]
## [1] "Survived"
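Adding `CabLetT` by hand works because only one dummy column is missing from the test set. A more general sketch, on toy column names, fills in every training column that the test set lacks and then matches the column order; `train_cols` and `test_df` are illustrative names.

```r
# Predictor columns expected by the model (toy subset).
train_cols <- c("Pclass", "Sexmale", "CabLetG", "CabLetT")

# Toy test set in which no passenger has cabin letter T.
test_df <- data.frame(Pclass = c(3, 1), Sexmale = c(1, 0), CabLetG = c(0, 0))

# Add each missing dummy column as all zeros, then reorder to match training.
missing_cols <- setdiff(train_cols, names(test_df))
test_df[missing_cols] <- 0
test_df <- test_df[, train_cols]
test_df
```

A zero is the right fill for a dummy variable, since it simply states that no test passenger belongs to that level.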
set.seed(1234)
mod_rf <- randomForest(Survived ~ ., train, ntree = 500, importance = T)
mod_rf
##
## Call:
## randomForest(formula = Survived ~ ., data = train, ntree = 500, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 17.96%
## Confusion matrix:
## 0 1 class.error
## 0 498 51 0.09289617
## 1 109 233 0.31871345
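The error rates in this output follow directly from the confusion matrix, where rows are the actual classes and columns the out-of-bag predictions. The arithmetic can be reproduced as follows, using the four counts printed above.

```r
# Confusion matrix from the output above.
conf_mat <- matrix(c(498, 51, 109, 233), nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Overall OOB error: share of passengers off the diagonal.
oob_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)

round(oob_error, 4)
class_error
```

The overall error is (51 + 109) / 891, about 17.96%. The model is much better at identifying passengers who died (about 9% error) than survivors (about 32% error).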
imp_rf <- importance(mod_rf)
# mod_rf is a classification forest, so rank by mean decrease in accuracy
imp_df <- data.frame(Variables = row.names(imp_rf), MDA = imp_rf[,
"MeanDecreaseAccuracy"])
imp_df <- imp_df[order(imp_df$MDA, decreasing = TRUE), ]
ggplot(imp_df, aes(x = reorder(Variables, MDA), y = MDA, fill = MDA)) +
geom_bar(stat = "identity") + labs(x = "Variables", y = "Mean decrease in accuracy if variable is randomly permuted") +
coord_flip() + theme(legend.position = "none")
pred <- predict(mod_rf, newdata = test, type = "class")
head(pred)
## 1 2 3 4 5 6
## 0 0 0 0 0 0
## Levels: 0 1
result <- data.frame(PassengerId = test$PassengerId, Survived = as.numeric(pred) -
1)
# write.csv(result, 'result.csv', row.names = FALSE)
mod_log <- glm(Survived ~ ., family = binomial(link = "logit"),
data = train)
pred <- predict(mod_log, newdata = test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred_bi <- ifelse(pred > 0.5, 1, 0)
result <- data.frame(PassengerId = test$PassengerId, Survived = pred_bi)
# write.csv(result, 'result_log.csv', row.names = FALSE)
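The rank-deficiency warning above is consistent with the dummy encoding: Sexfemale and Sexmale always sum to 1, which duplicates the intercept. The sketch below reproduces the effect on simulated data (the survival probabilities are invented) and shows one possible remedy, dropping one of the two columns.

```r
set.seed(1)
n <- 200
sex_male <- rbinom(n, 1, 0.6)

# Simulated data with both Sex dummies, as in the encoded training set.
toy <- data.frame(
  Survived  = rbinom(n, 1, ifelse(sex_male == 1, 0.2, 0.7)),
  Sexfemale = 1 - sex_male,
  Sexmale   = sex_male
)

# Both dummies plus the intercept are collinear, so one coefficient is NA.
fit_full <- glm(Survived ~ ., family = binomial(link = "logit"), data = toy)
coef(fit_full)

# Keeping a single dummy gives a full-rank fit.
fit_ref <- glm(Survived ~ Sexmale, family = binomial(link = "logit"), data = toy)
prob    <- predict(fit_ref, newdata = toy, type = "response")
pred_bi <- as.integer(prob > 0.5)
table(pred_bi)
```

The fitted probabilities of the two models are the same, since the dropped column carries no extra information; removing it only avoids the `NA` coefficient and the warning at prediction time.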
dtmod <- rpart(Survived ~ ., data = train, method = "class")
pred_dt <- predict(dtmod, newdata = test, type = "class")
result <- data.frame(PassengerId = test$PassengerId, Survived = pred_dt)
# write.csv(result, 'result_dt.csv', row.names = FALSE)