# HOMEWORK #2
Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results.
Create a random forest for regression and analyze the results.
Format: document with screen captures & analysis.
https://gist.github.com/fyyying/4aa5b471860321d7b47fd881898162b7#file-titanic_dataset-csv
About this file: the Titanic dataset (891 passengers, 12 variables).
The variables are shown by str() below.
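The chunks below rely on several packages (dplyr for select() and %>%, rpart and rpart.plot for the classification trees, randomForest, and tree for the regression tree). Loading them up front keeps everything self-contained, assuming they are installed:
library(dplyr)         # select(), %>%
library(rpart)         # rpart() classification trees
library(rpart.plot)    # rpart.plot()
library(randomForest)  # randomForest(), importance(), varImpPlot()
library(tree)          # tree(), cv.tree(), prune.tree()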
# define the filename (manual procedure)
filename1 <- "C:/Users/Lisa/OneDrive/CUNY/622/HW2/titanic_dataset.csv"
# load the CSV file from the local directory
dataset1 <- read.csv(filename1, header=TRUE)
str(dataset1)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(dataset1)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.00 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.00 Class :character
## Median :446.0 Median :0.0000 Median :3.00 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.31
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.00
## Max. :891.0 Max. :1.0000 Max. :3.00
## NA's :1
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
We need to recode the following variables as factors:
dataset1$Survived <- as.factor(dataset1$Survived)
dataset1$Pclass <- as.factor(dataset1$Pclass)
dataset1$Sex <- as.factor(dataset1$Sex)
dataset1$Embarked <- as.factor(dataset1$Embarked)
str(dataset1)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Now we keep the following variables in the analysis and retain only the complete cases.
ds2 <- dataset1 %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
dim(ds2)
## [1] 891 8
# keep only complete cases (note: the two blank Embarked values are "", not NA, so they remain)
ds2 <- ds2[complete.cases(ds2), ]
dim(ds2)
## [1] 713 8
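The drop from 891 to 713 rows (178 rows) is driven almost entirely by the 177 missing Age values flagged in the summary above, plus the one other NA it reports. A per-column NA count on the unfiltered data confirms where the missingness lives:
sapply(dataset1[, names(ds2)], function(x) sum(is.na(x)))  # NAs per retained column, before filtering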
head(ds2)
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 1 0 3 male 22 1 0 7.2500 S
## 2 1 1 female 38 1 0 71.2833 C
## 3 1 3 female 26 0 0 7.9250 S
## 4 1 1 female 35 1 0 53.1000 S
## 5 0 3 male 35 0 0 8.0500 S
## 7 0 1 male 54 0 0 51.8625 S
summary(ds2)
## Survived Pclass Sex Age SibSp Parch
## 0:424 1:185 female:261 Min. : 0.42 Min. :0.0000 Min. :0.000
## 1:289 2:173 male :452 1st Qu.:20.00 1st Qu.:0.0000 1st Qu.:0.000
## 3:355 Median :28.00 Median :0.0000 Median :0.000
## Mean :29.70 Mean :0.5133 Mean :0.432
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :80.00 Max. :5.0000 Max. :6.000
## Fare Embarked
## Min. : 0.00 : 2
## 1st Qu.: 8.05 C:130
## Median : 15.74 Q: 28
## Mean : 34.69 S:553
## 3rd Qu.: 33.00
## Max. :512.33
Let's predict survival using a decision tree algorithm.
set.seed(1234)
sample_set <- sample(nrow(ds2), round(nrow(ds2) * 0.75), replace = FALSE)
ds_train <- ds2[sample_set, ]
ds_test <- ds2[-sample_set, ]
round(prop.table(table(select(ds2, Survived))), 2)
##
## 0 1
## 0.59 0.41
round(prop.table(table(select(ds_train, Survived))), 2)
##
## 0 1
## 0.6 0.4
round(prop.table(table(select(ds_test, Survived))), 2)
##
## 0 1
## 0.58 0.42
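Here the random split happens to track the full-data class balance closely; if it drifted (as it does for Sex later on), a stratified split would hold the proportions fixed. A sketch using caret (an assumption, since caret is not loaded anywhere in this analysis):
# caret is assumed to be installed; it is not used elsewhere in this homework
idx <- caret::createDataPartition(ds2$Survived, p = 0.75, list = FALSE)
ds_train_strat <- ds2[idx, ]
ds_test_strat <- ds2[-idx, ]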
set.seed(123)
ds_mod <- rpart(Survived ~ ., method = "class", data = ds_train)
rpart.plot(ds_mod)
### Test the tree model
Survived_pred <- predict(ds_mod, ds_test, type = "class")
pred_table1 <- table(ds_test$Survived, Survived_pred)
pred_table1
## Survived_pred
## 0 1
## 0 91 13
## 1 25 49
sum(diag(pred_table1)) / nrow(ds_test)
## [1] 0.7865169
Predicting Survived with the decision tree achieves about 79% accuracy (0.7865) on the test data.
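Accuracy alone can hide class-specific errors. Since the rows of pred_table1 are the true classes and the columns the predictions, sensitivity and specificity follow directly; a minimal sketch:
accuracy <- sum(diag(pred_table1)) / sum(pred_table1)
sensitivity <- pred_table1["1", "1"] / sum(pred_table1["1", ])  # recall on survivors
specificity <- pred_table1["0", "0"] / sum(pred_table1["0", ])  # recall on non-survivors
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)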
Switch variables to generate 2 decision trees and compare the results.
Let's predict Sex using a decision tree algorithm.
set.seed(12345)
sample_set <- sample(nrow(ds2), round(nrow(ds2) * 0.75), replace = FALSE)
ds_train <- ds2[sample_set, ]
ds_test <- ds2[-sample_set, ]
round(prop.table(table(select(ds2, Sex))), 2)
##
## female male
## 0.37 0.63
round(prop.table(table(select(ds_train, Sex))), 2)
##
## female male
## 0.34 0.66
round(prop.table(table(select(ds_test, Sex))), 2)
##
## female male
## 0.44 0.56
set.seed(4321)
ds_mod2 <- rpart(Sex ~ ., method = "class", data = ds_train)
rpart.plot(ds_mod2)
### Test the tree model
set.seed(123)
Sex_pred <- predict(ds_mod2, ds_test, type = "class")
pred_table2 <- table(ds_test$Sex, Sex_pred)
pred_table2
## Sex_pred
## female male
## female 45 33
## male 6 94
sum(diag(pred_table2)) / nrow(ds_test)
## [1] 0.7808989
Predicting Sex with the decision tree achieves about 78% accuracy (0.7809) on the test data.
Both tree models were satisfactory, predicting survival and sex with similar accuracy.
These two variables are likely related: a far larger share of women than men survived the Titanic, so Sex is a strong predictor of Survived, and survival in turn carries information about Sex.
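A quick cross-tabulation of the full data backs this up; row-wise proportions give the survival rate within each sex:
round(prop.table(table(ds2$Sex, ds2$Survived), margin = 1), 2)  # survival rate by sex (rows sum to 1)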
Create a random forest for regression and analyze the results.
Let's fit a random forest regression and see how well it performs, using "Fare" as the response variable.
set.seed(5675)
rf.Fare <- randomForest(Fare ~ ., data = ds_train, importance = TRUE)
rf.Fare
##
## Call:
## randomForest(formula = Fare ~ ., data = ds_train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 1680.53
## % Var explained: 45.5
Two predictors are considered at each split (mtry = 2), 45.5% of the variance is explained, and the out-of-bag MSE is 1680.53.
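The "% Var explained" figure is the out-of-bag pseudo R-squared, 1 - MSE/Var(Fare). randomForest keeps the per-tree OOB traces, so the final figures can be read back directly; a quick sketch:
tail(rf.Fare$mse, 1)  # OOB MSE after all 500 trees (the 1680.53 reported above)
tail(rf.Fare$rsq, 1)  # OOB pseudo R-squared (the 45.5% above, as a proportion)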
Let's also fit a separate random forest on the test data (note: this trains a new model on the test set rather than evaluating the trained model; that evaluation follows further below).
set.seed(23422)
rf.Fare2 <- randomForest(Fare ~ ., data = ds_test, importance = TRUE)
rf.Fare2
##
## Call:
## randomForest(formula = Fare ~ ., data = ds_test, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 903.5824
## % Var explained: 53.46
This separate forest reports a lower OOB MSE (903.5824) and a higher variance explained (53.46%). Since it was fit directly on the test data, however, it does not measure how the training-set model generalizes; that check comes below.
importance(rf.Fare2)
## %IncMSE IncNodePurity
## Survived 5.194411 13980.852
## Pclass 30.646714 107088.564
## Sex 4.275656 8412.378
## Age 5.666830 46514.020
## SibSp 8.014390 25900.986
## Parch 10.763046 32938.212
## Embarked 5.894784 29258.477
varImpPlot(rf.Fare2)
Pclass is the most important variable in predicting the fare. This is intuitive: the fare is largely determined by the passenger's class. (Note that this importance comes from the forest fit on the test data, rf.Fare2.)
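For comparison, the more conventional check uses the forest fit on the training data (output not shown here):
importance(rf.Fare)   # variable importance from the training-set forest
varImpPlot(rf.Fare)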
How well does the rf model perform on the test set?
set.seed(12345)
yhat.rf <- predict(rf.Fare, newdata = ds_test)
plot(yhat.rf, ds_test$Fare)
abline(0, 1)
mean((yhat.rf - ds_test$Fare)^2)
## [1] 659.5736
This test-set MSE (659.5736) is far lower than the training OOB MSE (1680.53), and also below the 903.5824 OOB figure from the separate forest fit on the test data above.
Let’s do a decision tree for regression to see how it compares to the random forest model above:
set.seed(1)
tree.fare <- tree(Fare ~ ., ds_train)
summary(tree.fare)
##
## Regression tree:
## tree(formula = Fare ~ ., data = ds_train)
## Variables actually used in tree construction:
## [1] "Pclass" "Parch" "Sex" "Age" "Embarked" "SibSp"
## Number of terminal nodes: 9
## Residual mean deviance: 1523 = 801100 / 526
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -123.200 -8.469 -5.744 0.000 7.756 433.300
Six variables were used in constructing the tree, which has 9 terminal nodes.
Let's plot the tree:
plot(tree.fare)
text(tree.fare, pretty=0)
Do we need to prune this tree?
cv.fare <- cv.tree(tree.fare)
plot(cv.fare$size, cv.fare$dev, type = "b")
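To read the plot programmatically, the size that minimizes the cross-validated deviance can be pulled straight from cv.fare; a minimal sketch:
cv.fare$size[which.min(cv.fare$dev)]  # number of terminal nodes with the lowest CV deviance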
Now prune the tree to four terminal nodes:
prune.fare <- prune.tree(tree.fare, best = 4)
plot(prune.fare)
text(prune.fare, pretty = 0)
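The pruned tree is plotted above but never scored. As a quick sketch (yhat.prune is a hypothetical name, not part of the original analysis), its test-set MSE can be computed the same way as below:
yhat.prune <- predict(prune.fare, newdata = ds_test)  # predictions from the 4-node pruned tree
mean((yhat.prune - ds_test$Fare)^2)                   # test MSE of the pruned tree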
Now let's make predictions on the test set, using the unpruned tree:
yhat <- predict(tree.fare, newdata = ds_test)
plot(yhat, ds_test$Fare)
abline(0, 1)
mean((yhat - ds_test$Fare)^2)
## [1] 927.6231
Notice that the single tree's test MSE (927.6231) is noticeably higher than the random forest's (659.5736), implying that the random forest model performs better.
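Collecting the test-set errors reported above makes the comparison explicit:
# test-set MSEs as reported above
data.frame(model = c("random forest", "single regression tree"),
           test_MSE = c(659.5736, 927.6231))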