Part 1 of Exam 2: Predictive Models on the Iris Dataset

Load the data frame called iris in R using the following code: data(iris). Also, type the name of the data frame to view it in R using the following code: iris

data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Randomly sample 80% of the rows in the iris data frame to create a training set. Create a testing set containing the rest of the rows in the iris data frame.

ind <- sample(nrow(iris), nrow(iris)*0.8) 
iris_train <- iris[ind,]
iris_test <- iris[-ind,]
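Because the split is random, the error rates and prediction counts below change from run to run (which is why the screenshots in the Word doc show slightly different numbers than the output here). A one-line addition, not part of the original answer, that would make the results reproducible if placed before the sample() call:

set.seed(123)   # any fixed value works; 123 is an arbitrary choice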

The iris dataset contains sepal and petal measurements for three species of flowers. Using 500 bootstrapped sets, develop a bagging model on your training set to predict a flower’s species based on its sepal and petal measurements.

library(ipred)
iris_bag <- bagging(Species ~., data = iris_train, nbagg = 500)
iris_bag
## 
## Bagging classification trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train, 
##     nbagg = 500)
  a. What is the out-of-bag error for your model?
iris_bag_oob <- bagging(Species~., data = iris_train, coob = T, nbagg = 500)
iris_bag_oob
## 
## Bagging classification trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train, 
##     coob = T, nbagg = 500)
## 
## Out-of-bag estimate of misclassification error:  0.05

See the screenshot included in the Word doc as 3.a). In the output above, the out-of-bag estimate of misclassification error for the bagging model is 0.05; the screenshot shows 0.0417 from a separate run, since the exact value depends on the random training sample.

  b. Use your model to make predictions for the observations in your testing set.
iris_bag_pred <- predict(iris_bag, newdata = iris_test)
summary(iris_bag_pred)
##     setosa versicolor  virginica 
##          6         10         14

See the screenshot included in the Word doc as 3.a). In the output above, the bagging model predicts 6 setosa, 10 versicolor, and 14 virginica flowers for the testing set; the counts in the screenshot differ slightly because they come from a separate run of the random split.
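These predicted labels can also be checked against the true species in the testing set. The following is a small sketch, not part of the original answer, using the object names created above:

table(Predicted = iris_bag_pred, Actual = iris_test$Species)   # confusion matrix for the testing set
mean(iris_bag_pred != iris_test$Species)                       # proportion of testing-set flowers misclassified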

Using 500 trees, develop a random forest model on your training set to predict a flower’s species based on its sepal and petal measurements.

  a. What is the out-of-bag error for your model?
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
iris_rf <- randomForest(formula = Species~., data = iris_train, importance = TRUE, ntree = 500)
iris_rf
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris_train, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.17%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         44          0         0  0.00000000
## versicolor      0         38         2  0.05000000
## virginica       0          3        33  0.08333333

In the output above, the out-of-bag error for the random forest is 4.17%; the screenshot included in the Word doc as 4.a) shows 3.33% from a separate run.

  b. Based on the out-of-bag error, is the random forest better at predicting flower species than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?
iris_bag_oob
## 
## Bagging classification trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train, 
##     coob = T, nbagg = 500)
## 
## Out-of-bag estimate of misclassification error:  0.05
iris_rf
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris_train, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.17%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         44          0         0  0.00000000
## versicolor      0         38         2  0.05000000
## virginica       0          3        33  0.08333333

In the output above, the bagging model's out-of-bag error was 0.05 (5%) and the random forest's was 4.17%, so the random forest performed slightly better (smaller error) than the bagging model, although the two are close. See the screenshot included in the Word doc as 4.b).

  c. How many flowers in your training set are misclassified by your random forest model?

As seen in the confusion matrix above, 5 flowers in the training set are misclassified (2 versicolor predicted as virginica and 3 virginica predicted as versicolor), which matches the 4.17% OOB error rate on 120 training rows. The screenshot included in the Word doc as 4.c), from a separate run, shows 4 misclassified flowers.
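The same count can also be read off programmatically from the confusion matrix stored in the fitted model, a small check not included in the original answer:

rf_confusion <- iris_rf$confusion[, 1:3]      # drop the class.error column
sum(rf_confusion) - sum(diag(rf_confusion))   # number of misclassified training flowers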

  d. Use your random forest to make predictions for the observations in your testing set.
iris_rf_pred <- predict(iris_rf, newdata = iris_test)
summary(iris_rf_pred)
##     setosa versicolor  virginica 
##          6         10         14

See the screenshot included in the Word doc as 4.d). In the output above, the random forest predicts 6 setosa, 10 versicolor, and 14 virginica flowers for the testing set; the counts in the screenshot differ slightly because they come from a separate run.
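Since the random forest was fit with importance = TRUE, its variable importance can also be inspected. This is not asked for in this part, but it provides a useful cross-check against the relative influence reported by the boosting model below:

importance(iris_rf)   # mean decrease in accuracy and in Gini impurity for each predictor
varImpPlot(iris_rf)   # plots both importance measures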

Develop a boosting model on your training set to predict a flower’s species based on its sepal and petal measurements. In developing this model, you can either use R’s default settings for things like the number of trees and shrinkage, or you can use values of your own choosing. (Note that flower species is a non-numeric categorical variable. So you want to make sure that you are developing a boosting model for the purpose of categorical classification.)

library(gbm)
# Species is a 3-level factor, so use the multinomial distribution for categorical classification
iris_boost <- gbm(formula = Species ~ ., data = iris_train, distribution = "multinomial", n.trees = 10000, shrinkage = 0.01, interaction.depth = 3)
summary(iris_boost)

##                       var   rel.inf
## Petal.Length Petal.Length 65.917006
## Petal.Width   Petal.Width 25.817141
## Sepal.Length Sepal.Length  4.944938
## Sepal.Width   Sepal.Width  3.320914
  a. Which flower measurement is most important in predicting its species (i.e., sepal length, sepal width, petal length, or petal width)?
summary(iris_boost)

##                       var   rel.inf
## Petal.Length Petal.Length 65.917006
## Petal.Width   Petal.Width 25.817141
## Sepal.Length Sepal.Length  4.944938
## Sepal.Width   Sepal.Width  3.320914

See the screenshot included in the Word doc as 5.a). Petal length (Petal.Length) has by far the highest relative influence, so it is the most important measurement for predicting a flower's species, followed by petal width.

  b. Use your boosting model to make predictions for the observations in your testing set, and create a confusion matrix displaying your predictions. What is the misclassification error rate for flowers in your testing set?
iris_boost_prob <- predict(iris_boost, newdata = iris_test, n.trees = 10000, type = "response")
# predicted class = the species with the largest predicted probability
iris_boost_pred <- colnames(iris_boost_prob)[apply(iris_boost_prob, 1, which.max)]
summary(factor(iris_boost_pred))
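A confusion matrix and the misclassification error rate can then be produced from these predicted labels. This is a short sketch using the objects defined above; the exact counts depend on the random training/testing split:

table(Predicted = iris_boost_pred, Actual = iris_test$Species)   # confusion matrix for the testing set
mean(iris_boost_pred != iris_test$Species)                       # misclassification error rate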

The misclassification error rate for the testing set is the proportion of flowers whose predicted species does not match their actual species, i.e. the off-diagonal entries of the confusion matrix divided by the number of testing rows; the exact value depends on the random training/testing split.

  c. Create a plot comparing the number of trees used in the boosting model with the misclassification error on your testing set. Based on your plot, are a large number of trees needed to have a fairly accurate boosting model?
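One way to build this plot, sketched here under the assumption that the multinomial boosting model and testing-set objects above are available (the exact curve depends on the random split):

ntree_grid <- seq(100, 10000, by = 100)
# staged predictions: class probabilities for every value of n.trees in the grid
prob_by_ntree <- predict(iris_boost, newdata = iris_test, n.trees = ntree_grid, type = "response")
err_by_ntree <- sapply(seq_along(ntree_grid), function(i) {
  pred <- colnames(prob_by_ntree)[apply(prob_by_ntree[, , i], 1, which.max)]
  mean(pred != iris_test$Species)
})
plot(ntree_grid, err_by_ntree, type = "l",
     xlab = "Number of trees", ylab = "Testing-set misclassification error")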

Based on such a plot, the testing-set error typically levels off well before the full 10,000 trees, so a very large number of trees is generally not needed for a fairly accurate boosting model on this dataset.

Using your training set, create a regression model to predict a flower’s petal length based on its petal width, sepal length, sepal width, and species. Determine if your regression model could be improved with a Box-Cox transformation, and if so, perform the most appropriate Box-Cox transformation.

library(MASS)
model1 <- lm(Petal.Length ~ ., data = iris_train)
boxcox(model1)

model2 <- lm(sqrt(Petal.Length) ~ ., data = iris_train)
summary(model2)
## 
## Call:
## lm(formula = sqrt(Petal.Length) ~ ., data = iris_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.167651 -0.036868  0.000188  0.036168  0.191817 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.58221    0.07117   8.180 4.46e-13 ***
## Sepal.Length       0.13297    0.01356   9.808  < 2e-16 ***
## Sepal.Width       -0.01591    0.02035  -0.782  0.43594    
## Petal.Width        0.10103    0.03243   3.115  0.00232 ** 
## Speciesversicolor  0.59972    0.04453  13.468  < 2e-16 ***
## Speciesvirginica   0.74320    0.06399  11.615  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06254 on 114 degrees of freedom
## Multiple R-squared:  0.9853, Adjusted R-squared:  0.9847 
## F-statistic:  1530 on 5 and 114 DF,  p-value: < 2.2e-16
boxcox(model2)

Yes. The Box-Cox plot for model1 indicated a lambda of about 0.5, so a square-root transformation of Petal.Length was applied to create model2. The Box-Cox plot for model2 shows a lambda of about 1, indicating that no further transformation is needed.
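The lambda value can also be read off numerically rather than from the plot. This is a small sketch using MASS::boxcox's return value; it was not part of the original answer:

bc <- boxcox(model1, plotit = FALSE)   # grid of lambda values and their profile log-likelihoods
bc$x[which.max(bc$y)]                  # lambda with the highest log-likelihood (about 0.5 here)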

Part 2 of Exam 2: Filter Joins

Create a new data frame called Faculty containing the following table of information about four faculty at a university.

library(data.table)
id <- c(1, 2, 3, 7)
name <- c("Grayson", "Wayne", "Stark", "Grey")
code <- c("ART", "ART", "COMP", "HIST")
faculty <- data.table(id, name, code)

Create a new data frame called Department containing the following table of information:

Department_Name <- c("Art Department", "Computer Science Department", "English Department", "History Department")
Code1 <- c( "ART", "COMP", "ENG", "HIST")
Department <- data.table(Code1, Department_Name)

Using a filter join, display only those departments that have a faculty member listed in the Faculty data frame.

Department %>% semi_join(faculty, by = c("Code1" = "code")) -> joined_data
joined_data
##    Code1             Department_Name
## 1:   ART              Art Department
## 2:  COMP Computer Science Department
## 3:  HIST          History Department

semi_join() is the filtering join here: it keeps only the Department rows that have at least one match in faculty, without adding any faculty columns.

Using a filter join, display the department that does not have a faculty member listed in the Faculty data frame.

Department %>% anti_join(faculty, by = c("Code1" = "code")) -> joined_data1
joined_data1
##    Code1    Department_Name
## 1:   ENG English Department
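Since faculty and Department were built as data.tables, the same two filters could also be written in data.table's own syntax. A small alternative sketch, not required by the question:

Department[Code1 %in% faculty$code]    # departments with at least one listed faculty member
Department[!Code1 %in% faculty$code]   # the department with no listed faculty member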