Part 1: Predictive Models on the Iris Dataset

Question 1

Load the data frame called iris in R using the following code: data(iris)

Also, type the name of the data frame to view it in R using the following code: iris

data(iris)
iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Question 2

Randomly sample 80% of the rows in the iris data frame to create a training set. Create a testing set containing the rest of the rows in the iris data frame.

index <- sample(nrow(iris), nrow(iris)*0.80)
iris_train <- iris[index,]
iris_test <- iris[-index,]
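
Because `sample()` draws different rows on each run, the error rates reported later in this document will vary from one run to the next. A minimal sketch of a reproducible version of the split follows; the seed value 123 and the `_demo` object names are arbitrary choices for illustration and are not part of the original analysis.

```r
# Reproducible 80/20 split (the seed 123 is an arbitrary, assumed value)
data(iris)
set.seed(123)

n <- nrow(iris)
# Draw 80% of the row indices without replacement
index_demo <- sample(n, size = floor(0.80 * n))

train_demo <- iris[index_demo, ]
test_demo  <- iris[-index_demo, ]

# The two sets should partition the 150 rows into 120 and 30
nrow(train_demo)
nrow(test_demo)
```

Fixing the seed before the split (and before each model fit) is one way to keep the written answers in step with the printed output.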

Question 3

The iris dataset contains sepal and petal measurements for three species of flowers. Using 500 bootstrapped sets, develop a bagging model on your training set to predict a flower’s species based on its sepal and petal measurements.

#install.packages('ipred')
library(ipred)
iris_bag <- bagging(formula = Species~., data = iris_train, nbagg = 500)
iris_bag
## 
## Bagging classification trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train, 
##     nbagg = 500)
  1. What is the out-of-bag error for your model?
iris_bag_oob <- bagging(formula = Species~.,
                          data = iris_train,
                          coob = T,
                          nbagg = 500) 
iris_bag_oob
## 
## Bagging classification trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train, 
##     coob = T, nbagg = 500)
## 
## Out-of-bag estimate of misclassification error:  0.0667
  2. Use your model to make predictions for the observations in your testing set.
iris_bag_pred <- predict(iris_bag, newdata = iris_test)
table(iris_bag_pred)
## iris_bag_pred
##     setosa versicolor  virginica 
##         12          7         11
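
The table above only counts how many flowers were assigned to each species; it does not show how many of those predictions were correct. One possible way to check is to cross-tabulate predictions against the observed species. The helper `misclass_summary()` and the toy labels below are hypothetical and serve only to show the shape of the result.

```r
# Compare predicted and actual classes (factors or character vectors)
misclass_summary <- function(pred, actual) {
  stopifnot(length(pred) == length(actual))
  # Rows are predicted classes, columns are observed classes
  confusion <- table(Predicted = pred, Observed = actual)
  # Share of observations whose predicted label differs from the truth
  error <- mean(as.character(pred) != as.character(actual))
  list(confusion = confusion, error = error)
}

# Hypothetical toy labels, used only to show the output shape
actual_toy <- factor(c("setosa", "setosa", "versicolor", "virginica", "virginica"))
pred_toy   <- factor(c("setosa", "setosa", "versicolor", "versicolor", "virginica"))

res <- misclass_summary(pred_toy, actual_toy)
res$confusion
res$error  # one of five labels differs, so 0.2
```

With the objects from this question, the call would be `misclass_summary(iris_bag_pred, iris_test$Species)`.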

Question 4

Using 500 trees, develop a random forest model on your training set to predict a flower’s species based on its sepal and petal measurements.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
iris_rf <- randomForest(Species~., data = iris_train, importance = TRUE)
iris_rf
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris_train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 5%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         38          0         0  0.00000000
## versicolor      0         38         3  0.07317073
## virginica       0          3        38  0.07317073
  1. What is the out-of-bag error for your model? The OOB error for the model is 5%, as reported in the output above.

  2. Based on the out-of-bag error, is the random forest better at predicting flower species than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy? Based on the OOB error, the random forest (5%) appears slightly better than the bagging model (6.67%). With only 120 training flowers, that gap amounts to about two observations, so the two models perform with roughly the same accuracy.

  3. How many flowers in your training set are misclassified by your random forest model? A total of 6 flowers were misclassified in the out-of-bag confusion matrix: 3 versicolor flowers predicted as virginica and 3 virginica flowers predicted as versicolor.

  4. Use your random forest to make predictions for the observations in your testing set.

iris_rf_pred <- predict(iris_rf, newdata = iris_test)
table(iris_rf_pred)
## iris_rf_pred
##     setosa versicolor  virginica 
##         12          8         10
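
The OOB error rate and the per-class errors printed for the random forest can be recovered by hand from its confusion matrix. The sketch below retypes the counts from that output rather than refitting the model; the object names are illustrative.

```r
# OOB confusion matrix copied from the randomForest output above
# (rows = actual species, columns = predicted species)
species <- c("setosa", "versicolor", "virginica")
oob_confusion <- matrix(
  c(38,  0,  0,
     0, 38,  3,
     0,  3, 38),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = species, predicted = species)
)

# Misclassified flowers are the off-diagonal counts
n_wrong <- sum(oob_confusion) - sum(diag(oob_confusion))
oob_error <- n_wrong / sum(oob_confusion)

# Per-class error: 1 - (correct / row total)
class_error <- 1 - diag(oob_confusion) / rowSums(oob_confusion)

n_wrong      # 6
oob_error    # 0.05
class_error  # 0 for setosa, 3/41 for the other two species
```

On a fitted model, the same matrix should be available as `iris_rf$confusion`, with the class errors in its last column.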

Question 5

Develop a boosting model on your training set to predict a flower’s species based on its sepal and petal measurements. In developing this model, you can either use R’s default settings for things like the number of trees and shrinkage, or you can use values of your own choosing. (Note that flower species is a non-numeric categorical variable, so make sure that you are developing a boosting model for the purpose of categorical classification.)

# install.packages("ggplot2")
# install.packages("caret")
# install.packages("adabag")
library(adabag)
## Loading required package: rpart
## Loading required package: caret
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
## 
## Attaching package: 'adabag'
## The following object is masked from 'package:ipred':
## 
##     bagging
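  1. Which flower measurement is most important in predicting its species (i.e., sepal length, sepal width, petal length, or petal width)? Based on the importance scores printed below, petal length is the most important measurement (about 50.3), followed by petal width (about 21.8).

  2. Use your boosting model to make predictions for the observations in your testing set, and create a confusion matrix displaying your predictions. What is the misclassification error rate for flowers in your testing set? The confusion matrix below shows 1 of the 30 testing flowers misclassified (a versicolor predicted as virginica), for a misclassification error rate of about 3.33%.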

iris_boost = boosting(Species~., data = iris_train, boos = T)
iris_boost$importance
## Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
##    50.327769    21.796364     9.453085    18.422781
pred_iris_boost = predict(iris_boost, newdata = iris_test)
pred_iris_boost$confusion
##                Observed Class
## Predicted Class setosa versicolor virginica
##      setosa         12          0         0
##      versicolor      0          8         0
##      virginica       0          1         9
pred_iris_boost$error
## [1] 0.03333333
  3. Create a plot comparing the number of trees used in the boosting model with the misclassification error on your testing set. Based on your plot, are a large number of trees needed to have a fairly accurate boosting model?
# Numbers of trees to try: 1, 20, 40, 60, 80, 100
ntree <- c(1, seq(20, 100, 20))
# Preallocate one test-error slot per setting
err <- numeric(length(ntree))
for (i in seq_along(ntree)) {
  # Refit with ntree[i] trees; separate names keep iris_boost from being overwritten
  boost_i <- boosting(Species~., data = iris_train, boos = TRUE, mfinal = ntree[i])
  pred_i <- predict(boost_i, newdata = iris_test)
  err[i] <- pred_i$error
  cat(i, " ")
}
## 1  2  3  4  5  6
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "No. of Trees", ylab = "Misclassification Error")
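
The loop above refits the boosting model once per tree count. As an alternative, adabag provides `errorevol()`, which reports the error after each successive tree of a single fitted model. The sketch below assumes adabag is installed; the seed, the `_demo` names, and the choice of 50 trees are arbitrary.

```r
# Sketch: test error after each tree from ONE boosting fit
# (runs only if the adabag package is available)
if (requireNamespace("adabag", quietly = TRUE)) {
  suppressPackageStartupMessages(library(adabag))

  data(iris)
  set.seed(123)  # arbitrary seed, assumed for reproducibility
  idx <- sample(nrow(iris), 120)
  train_demo <- iris[idx, ]
  test_demo  <- iris[-idx, ]

  # Fit once with 50 trees (an arbitrary illustrative value)
  fit_demo <- boosting(Species ~ ., data = train_demo, boos = TRUE, mfinal = 50)

  # Error on the testing set as trees are added one at a time
  evol <- errorevol(fit_demo, newdata = test_demo)

  plot(seq_along(evol$error), evol$error, type = "l",
       xlab = "No. of Trees", ylab = "Misclassification Error")
}
```

A curve that flattens after only a few trees would suggest that a large ensemble is not needed for this dataset.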

Question 6

Using your training set, create a regression model to predict a flower’s petal length based on its petal width, sepal length, sepal width, and species. Determine if your regression model could be improved with a Box-Cox transformation, and if so, perform the most appropriate Box-Cox transformation.

library(MASS)
model <- lm(Petal.Length~., data = iris_train)
summary(model)
## 
## Call:
## lm(formula = Petal.Length ~ ., data = iris_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63563 -0.19620  0.00649  0.16998  0.68638 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.20003    0.31209  -3.845 0.000199 ***
## Sepal.Length       0.63091    0.05579  11.309  < 2e-16 ***
## Sepal.Width       -0.20055    0.09349  -2.145 0.034071 *  
## Petal.Width        0.68174    0.13673   4.986 2.22e-06 ***
## Speciesversicolor  1.36948    0.19963   6.860 3.77e-10 ***
## Speciesvirginica   1.83209    0.28059   6.529 1.91e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2663 on 114 degrees of freedom
## Multiple R-squared:  0.9781, Adjusted R-squared:  0.9772 
## F-statistic:  1020 on 5 and 114 DF,  p-value: < 2.2e-16
boxcox(model)

model2 <- lm(sqrt(iris_train$Petal.Length)~.,  data = iris_train)
summary(model2)
## 
## Call:
## lm(formula = sqrt(iris_train$Petal.Length) ~ ., data = iris_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18882 -0.03996  0.00443  0.03905  0.16647 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.59491    0.07623   7.804 3.16e-12 ***
## Sepal.Length       0.13994    0.01363  10.270  < 2e-16 ***
## Sepal.Width       -0.03887    0.02284  -1.702   0.0915 .  
## Petal.Width        0.16573    0.03340   4.962 2.46e-06 ***
## Speciesversicolor  0.52290    0.04876  10.723  < 2e-16 ***
## Speciesvirginica   0.62042    0.06854   9.052 4.46e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06506 on 114 degrees of freedom
## Multiple R-squared:  0.9834, Adjusted R-squared:  0.9827 
## F-statistic:  1352 on 5 and 114 DF,  p-value: < 2.2e-16
boxcox(model2)
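
The Box-Cox plots are not reproduced here, so the reason for choosing the square-root transformation is not visible. One way to read the suggested transformation off numerically is to take the lambda that maximizes the profile log-likelihood returned by `boxcox()`. This is a sketch on a fresh split with an arbitrary seed, so its values need not match the models above.

```r
library(MASS)

data(iris)
set.seed(123)  # arbitrary seed, assumed for reproducibility
idx <- sample(nrow(iris), 120)
train_demo <- iris[idx, ]

# y = TRUE stores the response so boxcox() does not need to refit the model
fit_demo <- lm(Petal.Length ~ ., data = train_demo, y = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(fit_demo, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

lambda_hat
lambda_ci
```

An interval that contains 0.5 but not 1 would support the square-root transformation used in `model2`, while an interval that contains 1 would suggest that no transformation is needed.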

Part 2: Filter Joins

Question 7

Create a new data frame called Faculty containing the following table of information about four faculty at a university:

Faculty <- data.frame(ID = c(1,2,3,7), Name = c("Grayson", "Wayne", "Stark", "Grey"), Code = c("ART", "ART", "COMP", "HIST"))
Faculty
##   ID    Name Code
## 1  1 Grayson  ART
## 2  2   Wayne  ART
## 3  3   Stark COMP
## 4  7    Grey HIST

Question 8

Create a new data frame called Department containing the following table of information:

Department <- data.frame(Code = c("ART", "COMP", "ENG", "HIST"), Department_Name = c("Art Department", "Computer Science Department", "English Department", "History Department"))
Department
##   Code             Department_Name
## 1  ART              Art Department
## 2 COMP Computer Science Department
## 3  ENG          English Department
## 4 HIST          History Department

Question 9

Using a filter join, display only those departments that have a faculty member listed in the Faculty data frame.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   0.3.5      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ dplyr::combine()    masks randomForest::combine()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ purrr::lift()       masks caret::lift()
## ✖ ggplot2::margin()   masks randomForest::margin()
## ✖ dplyr::select()     masks MASS::select()
## ✖ purrr::when()       masks foreach::when()
joined_data <- Department %>% semi_join(Faculty, by = "Code")
joined_data
##   Code             Department_Name
## 1  ART              Art Department
## 2 COMP Computer Science Department
## 3 HIST          History Department

Question 10

Using a filter join, display the department that does not have a faculty member listed in the Faculty data frame.

joined_data <- Department %>% anti_join(Faculty, by = "Code")
joined_data
##   Code    Department_Name
## 1  ENG English Department
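
A filter join keeps or drops rows of the first table according to whether a match exists in the second, without adding any columns from it. The two joins above can be sketched in base R with `%in%`; the `_base` names are illustrative, and `stringsAsFactors = FALSE` is set so the codes stay character vectors on older versions of R.

```r
Faculty <- data.frame(ID = c(1, 2, 3, 7),
                      Name = c("Grayson", "Wayne", "Stark", "Grey"),
                      Code = c("ART", "ART", "COMP", "HIST"),
                      stringsAsFactors = FALSE)
Department <- data.frame(Code = c("ART", "COMP", "ENG", "HIST"),
                         Department_Name = c("Art Department",
                                             "Computer Science Department",
                                             "English Department",
                                             "History Department"),
                         stringsAsFactors = FALSE)

# TRUE for departments whose code appears at least once in Faculty
has_faculty <- Department$Code %in% Faculty$Code

# Equivalent of semi_join(): keep matching rows, each at most once
semi_base <- Department[has_faculty, ]
# Equivalent of anti_join(): keep rows with no match
anti_base <- Department[!has_faculty, ]

semi_base$Code  # "ART" "COMP" "HIST"
anti_base$Code  # "ENG"
```

ART appears only once in the result even though two faculty members share that code, which is what distinguishes a semi join from an inner join.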