This is the Kaggle R Tutorial on Machine Learning, a one-hour beginner course available on DataCamp, with some additional details that are not provided in the course. The aim of the course is to introduce some machine learning concepts in R very briefly and to walk through the Kaggle Titanic: Machine Learning from Disaster challenge for beginners. Some of the code is modified from David Langer's YouTube videos.
# Assign the training set
train <- read.csv(url("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"))
# Assign the testing set
test <- read.csv(url("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"))
# Make sure to have a look at your training and testing set
head(train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
head(test)
## PassengerId Pclass Name Sex
## 1 892 3 Kelly, Mr. James male
## 2 893 3 Wilkes, Mrs. James (Ellen Needs) female
## 3 894 2 Myles, Mr. Thomas Francis male
## 4 895 3 Wirz, Mr. Albert male
## 5 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female
## 6 897 3 Svensson, Mr. Johan Cervin male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 1 34.5 0 0 330911 7.8292 Q
## 2 47.0 1 0 363272 7.0000 S
## 3 62.0 0 0 240276 9.6875 Q
## 4 27.0 0 0 315154 8.6625 S
## 5 22.0 1 1 3101298 12.2875 S
## 6 14.0 0 0 7538 9.2250 S
Create the column Child and indicate whether the passenger is a child or not. To do that, we create a new column populated with zeros and then assign one wherever Age is less than \(18\).
# Create the Child column: flag passengers younger than 18
train$Child <- 0
train$Child[train$Age < 18] <- 1
How many people in your training set survived the Titanic disaster? Let's calculate the counts and proportions, broken down by the Child variable.
# Two-way comparison
table(train$Child, train$Survived)
##
## 0 1
## 0 497 281
## 1 52 61
prop.table(table(train$Child, train$Survived),1)
##
## 0 1
## 0 0.6388175 0.3611825
## 1 0.4601770 0.5398230
Males and females that survived vs. males and females that passed away:
table(train$Sex, train$Survived)
##
## 0 1
## female 81 233
## male 468 109
prop.table(table(train$Sex, train$Survived),1)
##
## 0 1
## female 0.2579618 0.7420382
## male 0.8110919 0.1889081
In one of the previous exercises you discovered that, in your training set, females had over \(50\%\) chance of surviving and males less than \(50\%\). Hence, you could use this information for your first prediction: all females in the test set survive and all males in the test set die.
As mentioned in the beginning you use your test set for validating your predictions. You might have seen that, contrary to the training set, the test set has no Survived column. You add such a column using your predicted values. Next, when uploading your results, Kaggle will use this variable (= your predictions) to score your performance.
# prediction based on gender
test_one <- test
test_one$Survived <- 0
test_one$Survived[test_one$Sex == 'female'] <- 1
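If you want to submit this gender-based prediction to Kaggle, you can write it out in the same two-column format used later in this tutorial. A minimal sketch, where gender_solution and the file name gender_submission.csv are just illustrative choices:
# Build the submission data frame from the gender-only prediction and save it (hypothetical names)
gender_solution <- data.frame(PassengerId = test_one$PassengerId, Survived = test_one$Survived)
write.csv(gender_solution, file = "gender_submission.csv", row.names = FALSE)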
We can look at whether passenger class, recorded as Pclass, has any role to play in a passenger's survival.
# Load up ggplot2 package to use for visualizations
library(ggplot2)
#train$Pclass <- as.factor(train$Pclass)
ggplot(train, aes(x = as.factor(Pclass), fill = factor(Survived))) +
geom_bar() +
xlab("Passenger Class") +
ylab("Total Count") +
labs(fill = "Survived")
From the bar chart, we can clearly see that passengers from class 3 had a lower chance of survival than passengers from class 1. Therefore, we can say \(Pclass\) does play a role in survival.
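To back the plot up with numbers, the same prop.table approach used above for Child and Sex can be applied to Pclass (a quick sketch using the train set loaded earlier):
# Row-wise survival proportions per passenger class
prop.table(table(train$Pclass, train$Survived), 1)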
Let's create our first decision tree. You will use the rpart() function from the rpart package to build it. The rpart() function takes multiple arguments: a model formula, the data set, and the method ("class" for a classification tree).
# Load the rpart library to create our decision tree
library(rpart)
#Display the Structure
str(train)
## 'data.frame': 891 obs. of 13 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ Child : num 0 0 0 0 0 0 0 1 0 1 ...
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
# Build the decision tree
my_tree_two <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train, method="class")
# Visualize the decision tree using plot() and text()
plot(my_tree_two)
text(my_tree_two)
Let's create a fancified version of our tree:
# Load in the packages to create a fancified version of your tree
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)
# Time to plot your fancified tree
fancyRpartPlot(my_tree_two)
# Make your prediction using the test set
my_prediction <- predict(my_tree_two, test, type = "class")
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)
# Check that your data frame has 418 entries
nrow(my_solution)
## [1] 418
# Write your solution away to a csv file with the name my_solution.csv
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)
# Create a new decision tree my_tree_three
library(rpart)
my_tree_three <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                       data = train, method = "class", control = rpart.control(minsplit = 50, cp = 0))
The cp parameter (complexity parameter) determines when the splitting of the decision tree stops: a split must improve the fit by at least cp to be kept, so cp = 0 as used above never stops splitting for lack of improvement. The minsplit parameter controls the minimum number of observations a node must contain for a split to be attempted (50 in the call above); if that threshold is not reached, no further splitting is done. However, if you submitted this solution to Kaggle, your score would be lower than the score of a simple model based on, for example, gender. Why? Because you went too far when setting the rules for the decision tree. You created very specific rules based on the data in the training set that are therefore only relevant for the training set and cannot be generalized to unknown data. You overfitted. So when creating decision trees, always be aware of this danger!
# Visualize your new decision tree
fancyRpartPlot(my_tree_three)
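One rough way to see this overfitting in action is to compare in-sample accuracy: the deeper my_tree_three typically fits the training data better than my_tree_two, even though it generalizes worse. A sketch, assuming both trees from above are still in memory:
# Accuracy on the training data itself (higher is not necessarily better - it can signal overfitting)
mean(predict(my_tree_two, train, type = "class") == train$Survived)
mean(predict(my_tree_three, train, type = "class") == train$Survived)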
Now, create a new train set with the new variable. Family size is determined by the variables \(SibSp\) and \(Parch\), which indicate the number of family members a certain passenger is traveling with. So when doing feature engineering, you add a new variable \(family_size\), which is the sum of SibSp and Parch plus \(one\) (the observation itself), to the train and test sets.
train_two <- train
train_two$family_size <- train_two$SibSp + train_two$Parch + 1
# Create a new decision tree my_tree_four
my_tree_four <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + family_size, data=train_two, method="class")
# Visualize your new decision tree
fancyRpartPlot(my_tree_four)
Was it a coincidence that upper-class Rose survived and third-class passenger Jack did not? Let's have a look. We will create new train and test sets named \(train_new\) and \(test_new\). These data sets contain a new column with the name Title (referring to Miss, Mr, etc.).
# What is up with the 'Miss.' and 'Mr.' thing?
library(stringr)
# Any correlation with other variables (e.g., sibsp)?
misses <- train[which(str_detect(train$Name, "Miss.")),]
misses[1:5,]
## PassengerId Survived Pclass Name Sex
## 3 3 1 3 Heikkinen, Miss. Laina female
## 11 11 1 3 Sandstrom, Miss. Marguerite Rut female
## 12 12 1 1 Bonnell, Miss. Elizabeth female
## 15 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female
## 23 23 1 3 McGowan, Miss. Anna "Annie" female
## Age SibSp Parch Ticket Fare Cabin Embarked Child
## 3 26 0 0 STON/O2. 3101282 7.9250 S 0
## 11 4 1 1 PP 9549 16.7000 G6 S 1
## 12 58 0 0 113783 26.5500 C103 S 0
## 15 14 0 0 350406 7.8542 S 1
## 23 15 0 0 330923 8.0292 Q 1
Look at the \(Survived\) column. It seems that four out of the five misses shown here survived.
# Hypothesis - Name titles correlate with age
mrses <- train[which(str_detect(train$Name, "Mrs.")), ]
mrses[1:5,]
## PassengerId Survived Pclass
## 2 2 1 1
## 4 4 1 1
## 9 9 1 3
## 10 10 1 2
## 16 16 1 2
## Name Sex Age SibSp
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1
## 16 Hewlett, Mrs. (Mary D Kingcome) female 55 0
## Parch Ticket Fare Cabin Embarked Child
## 2 0 PC 17599 71.2833 C85 C 0
## 4 0 113803 53.1000 C123 S 0
## 9 2 347742 11.1333 S 0
## 10 0 237736 30.0708 C 1
## 16 0 248706 16.0000 S 0
All five of these Mrs. passengers survived!
# Check out males to see if pattern continues
males <- train[which(train$Sex == "male"), ]
males[1:5,]
## PassengerId Survived Pclass Name Sex Age
## 1 1 0 3 Braund, Mr. Owen Harris male 22
## 5 5 0 3 Allen, Mr. William Henry male 35
## 6 6 0 3 Moran, Mr. James male NA
## 7 7 0 1 McCarthy, Mr. Timothy J male 54
## 8 8 0 3 Palsson, Master. Gosta Leonard male 2
## SibSp Parch Ticket Fare Cabin Embarked Child
## 1 1 0 A/5 21171 7.2500 S 0
## 5 0 0 373450 8.0500 S 0
## 6 0 0 330877 8.4583 Q 0
## 7 0 0 17463 51.8625 E46 S 0
## 8 3 1 349909 21.0750 S 1
Now, if you look at the \(Survived\) column, none of the first five men survived. So there must be a strong correlation between gender and survival. This makes sense, because women and children were given priority when the Titanic sank.
Create a utility function to help with title extraction. NOTE - we use the grep() function here, but could have used the str_detect() function as well (a sketch of that variant follows the function).
Get_Title <- function(name) {
name <- as.character(name)
if (length(grep("Miss.", name)) > 0) {
return ("Miss.")
} else if (length(grep("Master.", name)) > 0) {
return ("Master.")
} else if (length(grep("Mrs.", name)) > 0) {
return ("Mrs.")
} else if (length(grep("Mr.", name)) > 0) {
return ("Mr.")
} else {
return ("Other")
}
}
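For reference, the str_detect-based variant mentioned above could look like the sketch below; Get_Title_str is a hypothetical name, and fixed() is used so the "." is matched literally rather than as a regex wildcard (stringr was loaded earlier).
# A sketch of a str_detect-based alternative to Get_Title (hypothetical helper name)
Get_Title_str <- function(name) {
  name <- as.character(name)
  if (str_detect(name, fixed("Miss."))) {
    return("Miss.")
  } else if (str_detect(name, fixed("Master."))) {
    return("Master.")
  } else if (str_detect(name, fixed("Mrs."))) {
    return("Mrs.")
  } else if (str_detect(name, fixed("Mr."))) {
    return("Mr.")
  } else {
    return("Other")
  }
}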
# NOTE - The code below uses a for loop, which is not a very R way of
# doing things; a vectorized alternative follows below
Title <- NULL
for (i in 1:nrow(train)) {
Title <- c(Title, Get_Title(train[i,"Name"]))
}
train$Title <- as.factor(Title)
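The same result can be obtained without an explicit loop; a quick sketch using sapply, equivalent to the for loop above but more idiomatic R:
# Apply Get_Title to every name at once and convert the result to a factor
train$Title <- as.factor(sapply(as.character(train$Name), Get_Title, USE.NAMES = FALSE))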
train_new<-train
Since we now have a new variable, Title, we can see which title has the highest and which the lowest survival chance in each of the three passenger classes.
ggplot(train, aes(x = as.factor(Title), fill = factor(Survived))) +
geom_bar() +
facet_wrap(~Pclass) +
ggtitle("Passenger Class") +
xlab("Title") +
ylab("Total Count") +
labs(fill = "Survived")
From this bar chart, it is clear that passengers with the title Mr. are less likely to survive, and it gets worse as the passenger class goes down. So we can say a poor man had the least chance of survival.
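To quantify what the plot shows, you can also compute the survival rate per Title and passenger class directly (a quick sketch on the train set; Survived is still coded 0/1 here, so its mean is the survival rate):
# Mean survival rate for each Title within each passenger class
aggregate(Survived ~ Title + Pclass, data = train, FUN = mean)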
We just added a new column, \(Title\), to our train data set. Now we will do the same for the test data set so that it also gets the new column \(Title\).
Title <- NULL
for (i in 1:nrow(test)) {
Title <- c(Title, Get_Title(test[i,"Name"]))
}
test$Title <- as.factor(Title)
test_new<-test
Title is another example of feature engineering: creating a new variable that possibly improves the model. Let's create a new model, my_tree_five.
library(rpart)
my_tree_five <-rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title, data= train_new, method = 'class')
# Visualize your new decision tree
fancyRpartPlot(my_tree_five)
# Make your prediction using `my_tree_five` and `test_new`
my_prediction <- predict(my_tree_five,test_new, type = 'class')
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
my_solution <- data.frame(PassengerId= test_new$PassengerId, Survived = my_prediction)
# Write your solution away to a csv file with the name my_solution.csv
write.csv(my_solution,file= 'my_solution.csv',row.names=F)
A detailed study of Random Forests would take this tutorial a bit too far. However, since it is an often-used machine learning technique, it makes sense to provide a general understanding and to illustrate how to apply the technique in R. In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set. At prediction time, each tree produces a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees, with 2 saying a passenger in the test set will survive and 1 saying he will not, the passenger is classified as a survivor. This approach of overtraining trees, but letting the majority vote decide the final classification, avoids overfitting. Before starting with the actual analysis, you first need to meet one big condition of Random Forests: no missing values in your data frame. Let's look at what we have in our train and test data sets.
names(train)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "Child" "Title"
names(test)
## [1] "PassengerId" "Pclass" "Name" "Sex" "Age"
## [6] "SibSp" "Parch" "Ticket" "Fare" "Cabin"
## [11] "Embarked" "Title"
In order to follow the DataCamp exercise, we drop the \(Child\) column from the train data set and add \(family_size\) so that it matches our test data set. Furthermore, we add a \(Survived\) column to our test data set and populate it with "None". By doing this we have the same variables in both data sets, so we can combine them together into a single data set called \(all_data\).
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
# Drop the Child column from the train set and add family_size to both data sets
train$Child <- NULL
train$family_size <- train$SibSp + train$Parch + 1
test$family_size <- test$SibSp + test$Parch + 1
# Add a "Survived" variable to the test set to allow for combining data sets
test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,])
# Combine data sets
all_data <- rbind(train, test.survived)
# All data, both training and test set
head(all_data)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked Title family_size
## 1 0 A/5 21171 7.2500 S Mr. 2
## 2 0 PC 17599 71.2833 C85 C Mrs. 2
## 3 0 STON/O2. 3101282 7.9250 S Miss. 1
## 4 0 113803 53.1000 C123 S Mrs. 2
## 5 0 373450 8.0500 S Mr. 1
## 6 0 330877 8.4583 Q Mr. 1
# Passengers on rows 62 and 830 do not have a value for embarkment.
# Since many passengers embarked at Southampton, we give them the value "S".
# We re-code all embarkment codes as factors.
all_data$Embarked[c(62,830)] = "S"
all_data$Embarked <- factor(all_data$Embarked)
# Passenger on row 1044 has an NA Fare value. Let's replace it with the median fare value.
all_data$Fare[1044] <- median(all_data$Fare, na.rm=TRUE)
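In case you wonder how these problem rows can be located, a quick sketch (run before the two fixes above) would be:
# Locate the blank embarkment codes and the missing fare
which(all_data$Embarked == "")  # rows 62 and 830
which(is.na(all_data$Fare))     # row 1044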
# Impute the missing Age values with a regression tree (method = "anova") fitted on the rows where Age is known
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + family_size,
                       data = all_data[!is.na(all_data$Age), ], method = "anova")
all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age), ])
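As a sanity check before growing the forest, you can confirm that no NA values remain anywhere in all_data (a quick sketch):
# Every column should now report zero missing values
sapply(all_data, function(x) sum(is.na(x)))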
# Split the data back into a train set and a test set
train <- all_data[1:891,]
test <- all_data[892:1309,]
# Train set and test set
str(train)
## 'data.frame': 891 obs. of 14 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : chr "0" "1" "1" "1" ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Title : Factor w/ 5 levels "Master.","Miss.",..: 3 4 2 4 3 3 3 1 4 4 ...
## $ family_size: num 2 2 1 2 1 1 1 5 3 2 ...
str(test)
## 'data.frame': 418 obs. of 14 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Survived : chr "None" "None" "None" "None" ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 438 1298 1162 1303 1072 1259 178 949 896 994 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : Factor w/ 929 levels "110152","110413",..: 781 841 726 776 252 869 787 159 745 520 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Title : Factor w/ 5 levels "Master.","Miss.",..: 3 4 3 3 4 3 2 3 4 3 ...
## $ family_size: num 1 2 1 1 3 1 1 3 1 3 ...
# Set seed for reproducibility
set.seed(111)
# Apply the Random Forest Algorithm
Afzal_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title, data=train, importance=TRUE, ntree=1000)
# Make your prediction using the test set
my_prediction <- predict(Afzal_forest, test,type= 'class')
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)
# Write your solution away to a csv file with the name my_solution.csv
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)
Remember that we set importance = TRUE? Now we can see which variables are important using:
varImpPlot(Afzal_forest)
Type it into the console and see what happens. When running the function, two graphs appear: the accuracy plot shows how much worse the model would perform without each included variable, so a large decrease (a high value on the x-axis) indicates a highly predictive variable. The second plot is based on the Gini coefficient: the higher a variable scores here, the more important it is for the model.
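If you prefer the numbers behind the two plots, the importance() function from the randomForest package returns them directly (a sketch using the Afzal_forest object fitted above):
importance(Afzal_forest, type = 1)  # mean decrease in accuracy
importance(Afzal_forest, type = 2)  # mean decrease in Gini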