#HOMEWORK #2

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.

Switch variables to generate 2 decision trees and compare the results.

Create a random forest for regression and analyze the results.

Format: document with screen captures & analysis.

Dataset selection

https://gist.github.com/fyyying/4aa5b471860321d7b47fd881898162b7#file-titanic_dataset-csv

About this file:

Titantic dataset

Variables:

  • PassengerId
  • Survived
  • Pclass
  • Name
  • Sex
  • Age
  • SibSp
  • Parch
  • Ticket
  • Fare
  • Cabin
  • Embarked
# define the filename-manual procedure
filename1 <- "C:/Users/Lisa/OneDrive/CUNY/622/HW2/titanic_dataset.csv"
# load the CSV file from the local directory
dataset1 <- read.csv(filename1, header=TRUE)
str(dataset1)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(dataset1)
##   PassengerId       Survived          Pclass         Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.00   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.00   Class :character  
##  Median :446.0   Median :0.0000   Median :3.00   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.31                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.00                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.00                     
##                                   NA's   :1                        
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

We need to set the following character variables as factors……

dataset1$Survived <- as.factor(dataset1$Survived)
dataset1$Pclass <- as.factor(dataset1$Pclass)
dataset1$Sex <- as.factor(dataset1$Sex)
dataset1$Embarked<-as.factor(dataset1$Embarked)
str(dataset1)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

And now we keep the following variables in the analysis:

  • Pclass
  • Sex
  • Age
  • SibSp
  • Parch
  • Fare
  • Embarked

and keep completed cases.

ds2 <- dataset1 %>% 
  select(Survived, 
Pclass,
Sex,
Age,
SibSp,
Parch,
Fare,
Embarked)
dim(ds2)
## [1] 891   8
#keep only complete cases

ds2<-ds2[complete.cases(ds2), ]
dim(ds2)
## [1] 713   8
head(ds2)
##   Survived Pclass    Sex Age SibSp Parch    Fare Embarked
## 1        0      3   male  22     1     0  7.2500        S
## 2        1      1 female  38     1     0 71.2833        C
## 3        1      3 female  26     0     0  7.9250        S
## 4        1      1 female  35     1     0 53.1000        S
## 5        0      3   male  35     0     0  8.0500        S
## 7        0      1   male  54     0     0 51.8625        S
summary(ds2)
##  Survived Pclass      Sex           Age            SibSp            Parch      
##  0:424    1:185   female:261   Min.   : 0.42   Min.   :0.0000   Min.   :0.000  
##  1:289    2:173   male  :452   1st Qu.:20.00   1st Qu.:0.0000   1st Qu.:0.000  
##           3:355                Median :28.00   Median :0.0000   Median :0.000  
##                                Mean   :29.70   Mean   :0.5133   Mean   :0.432  
##                                3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.000  
##                                Max.   :80.00   Max.   :5.0000   Max.   :6.000  
##       Fare        Embarked
##  Min.   :  0.00    :  2   
##  1st Qu.:  8.05   C:130   
##  Median : 15.74   Q: 28   
##  Mean   : 34.69   S:553   
##  3rd Qu.: 33.00           
##  Max.   :512.33

Decision Tree - predict Survival - Titantic

Let’s predict survival using a decision tree algorithm

Splitting the data

set.seed(1234)
sample_set<-sample(nrow(ds2), round(nrow(ds2)*.75), replace=FALSE)

ds_train<-ds2[sample_set,]
ds_test<-ds2[-sample_set,]

round(prop.table(table(select(ds2,Survived))),2)
## 
##    0    1 
## 0.59 0.41
round(prop.table(table(select(ds_train,Survived
                       ))),2)
## 
##   0   1 
## 0.6 0.4
round(prop.table(table(select(ds_test,Survived))),2)
## 
##    0    1 
## 0.58 0.42

Training the model

set.seed(123)
ds_mod<-rpart(Survived~., 
                        method="class",
                        data=ds_train
                        )
rpart.plot(ds_mod)

### Test the tree model

Survived_pred<-predict(ds_mod,ds_test, type="class")

pred_table1<-table(ds_test$Survived,Survived_pred)

pred_table1
##    Survived_pred
##      0  1
##   0 91 13
##   1 25 49
sum(diag(pred_table1)/nrow(ds_test))
## [1] 0.7865169

Predicting the Survival is 78% accuracy on the test data.

Decision Tree - predict sex- MALE, FEMALE

Switch variables to generate 2 decision trees and compare the results.

Let’s predict Sex using a decision tree algorithm

Splitting the data

set.seed(12345)
sample_set<-sample(nrow(ds2), round(nrow(ds2)*.75), replace=FALSE)

ds_train<-ds2[sample_set,]
ds_test<-ds2[-sample_set,]

round(prop.table(table(select(ds2,Sex))),2)
## 
## female   male 
##   0.37   0.63
round(prop.table(table(select(ds_train,Sex
                       ))),2)
## 
## female   male 
##   0.34   0.66
round(prop.table(table(select(ds_test,Sex))),2)
## 
## female   male 
##   0.44   0.56

Training the model

set.seed(4321)
ds_mod2<-rpart(Sex~., 
                        method="class",
                        data=ds_train
                        )
rpart.plot(ds_mod2)

### Test the tree model

set.seed(123)
Sex_pred<-predict(ds_mod2,ds_test, type="class")

pred_table2<-table(ds_test$Sex,Sex_pred)

pred_table2
##         Sex_pred
##          female male
##   female     45   33
##   male        6   94
sum(diag(pred_table2)/nrow(ds_test))
## [1] 0.7808989

Predicting Sex by use of the decision tree is 78%.

The Tree models were satisfactory in predicting the survival as well as the sex.

Perhaps, these two variables are related? I believe many females survived the titanic.

Random forest

Create a random forest for regression and analyze the results.

Lets fit a random forest regression and see how well it performs. We will use the response variable “Fare”.

set.seed(5675)
rf.Fare<-randomForest(Fare~.,data=ds_train, importance=TRUE)
rf.Fare
## 
## Call:
##  randomForest(formula = Fare ~ ., data = ds_train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 1680.53
##                     % Var explained: 45.5

2 predictors are considered at each split. and 45.5% of the variance is explained. MSE=1680.53.

Let’s run random forest on the test data….

set.seed(23422)
rf.Fare2<-randomForest(Fare~.,data=ds_test, importance=TRUE)
rf.Fare2
## 
## Call:
##  randomForest(formula = Fare ~ ., data = ds_test, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 903.5824
##                     % Var explained: 53.46

The model gives better results for the test data, MSE = 903.5824 and variance explained 53.46% indicating that the model is performing well, even better than the training set.

importance(rf.Fare2)
##            %IncMSE IncNodePurity
## Survived  5.194411     13980.852
## Pclass   30.646714    107088.564
## Sex       4.275656      8412.378
## Age       5.666830     46514.020
## SibSp     8.014390     25900.986
## Parch    10.763046     32938.212
## Embarked  5.894784     29258.477
varImpPlot(rf.Fare2)

The PCLASS is the most important variable in predicting the fare. This is intuitive because the passenger class is probably what the fare is based upon.

How well does the rf model perform on the test set?

set.seed(12345)
yhat.rf<-predict( rf.Fare, newdata=ds_test)
plot( yhat.rf, ds_test$Fare)
abline(0,1)

mean((yhat.rf - ds_test$Fare)^2)
## [1] 659.5736

This IS lower compared to the training MSE.

The test set MSE associated with the random forest is 903.5824.

Regression tree predicting fare

Let’s do a decision tree for regression to see how it compares to the random forest model above:

set.seed(1)
tree.fare<-tree(Fare~., ds_train)
summary (tree.fare)
## 
## Regression tree:
## tree(formula = Fare ~ ., data = ds_train)
## Variables actually used in tree construction:
## [1] "Pclass"   "Parch"    "Sex"      "Age"      "Embarked" "SibSp"   
## Number of terminal nodes:  9 
## Residual mean deviance:  1523 = 801100 / 526 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -123.200   -8.469   -5.744    0.000    7.756  433.300

6 variables have been used in the tree.

Let’s plot tree

plot(tree.fare)
text(tree.fare, pretty=0)

Do we need to prune this tree?

cv.fare<-cv.tree(tree.fare)
plot(cv.fare$size, cv.fare$dev, type="b")

Now to prune tree…..

prune.fare<-prune.tree(tree.fare, best=4)
plot(prune.fare)
text(prune.fare,pretty=0)

Now let’s make predictions.

yhat<-predict(tree.fare, newdata=ds_test)
plot(yhat,ds_test$Fare)
abline(0,1)

mean((yhat-ds_test$Fare)^2)
## [1] 927.6231

Notice the MSE on the tree is slightly higher than the MSE on the random forest model, implying that the random forest model is performing better.