Part 1
Question 1- Load the data frame called iris in R using the following
code:
#load the iris data set
data(iris)
#view the data frame by typing its name
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
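Printing the entire data frame floods the console with all 150 rows. For a quicker look, head() and summary() give a compact view (a minimal sketch using only base R):
#first six rows of the data frame
head(iris)
#per-column summaries, including counts for the Species factor
summary(iris)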
Question 2- Randomly sample 80% of the rows in the iris data frame
to create a training set. Create a testing set containing the rest of
the rows in the iris data frame.
#sample 80% of the row indices for training; the remaining rows form the test set
index <- sample(nrow(iris), nrow(iris) * 0.80)
iris_train <- iris[index, ]
iris_test <- iris[-index, ]
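Note that sample() draws a different random subset on every run, which is why the "initial run" results noted throughout this document differ from the current output. A minimal sketch for making the split reproducible (the seed value 123 is arbitrary):
#fix the random number generator state so the split is identical on every run
set.seed(123)
index <- sample(nrow(iris), nrow(iris) * 0.80)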
Question 3- The iris dataset contains sepal and petal measurements
for three species of flowers. Using 500 bootstrapped sets, develop a
bagging model on your training set to predict a flower’s species based
on its sepal and petal measurements.
#load the ipred package, which provides bagging()
library(ipred)
#bagging model
iris_bag <- bagging(formula = Species~., data = iris_train, nbagg = 500)
iris_bag
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## nbagg = 500)
3A- What is the out-of-bag error for your model?
iris_bag_oob <- bagging(formula = Species~.,
                        data = iris_train,
                        coob = T,
                        nbagg = 500)
iris_bag_oob
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## coob = T, nbagg = 500)
##
## Out-of-bag estimate of misclassification error: 0.05
# initial run: out-of-bag estimate of misclassification error was 0.075
3B- Use your model to make predictions for the observations in your
testing set.
iris_bag_pred <- predict(iris_bag, newdata = iris_test, type = "class")
table(iris_test$Species, iris_bag_pred, dnn = c("Truth", "Predicted"))
## Predicted
## Truth setosa versicolor virginica
## setosa 7 0 0
## versicolor 0 11 1
## virginica 0 0 11
#initial run answers below
#Predicted
#Truth setosa versicolor virginica
#setosa 12 0 0
#versicolor 0 9 1
#virginica 0 0 8
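As a cross-check on the confusion matrix, the test-set misclassification rate can be computed directly from the objects above (a one-line sketch, assuming iris_bag_pred and iris_test as defined earlier):
#proportion of test flowers whose predicted species differs from the true species
mean(iris_bag_pred != iris_test$Species)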
Question 4- Using 500 trees, develop a random forest model on your
training set to predict a flower’s species based on its sepal and petal
measurements.
#load the randomForest package
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
#model for random forests using 500 trees
iris_rf <- randomForest(Species~., data = iris_train, importance = TRUE, ntree = 500)
iris_rf
##
## Call:
## randomForest(formula = Species ~ ., data = iris_train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 3.33%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 43 0 0 0.00000000
## versicolor 0 36 2 0.05263158
## virginica 0 2 37 0.05128205
#initial run calculations
#Call:
#randomForest(formula = Species ~ ., data = iris_train, importance = TRUE, ntree = 500)
#Type of random forest: classification
#Number of trees: 500
#No. of variables tried at each split: 2
#OOB estimate of error rate: 5.83%
#Confusion matrix:
#setosa versicolor virginica class.error
#setosa 38 0 0 0.0000000
#versicolor 0 37 3 0.0750000
#virginica 0 4 38 0.0952381
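Because the model was fit with importance = TRUE, the per-variable importance scores can also be inspected (a short sketch using randomForest's own accessors):
#mean decrease in accuracy and in Gini impurity for each predictor
importance(iris_rf)
#dot chart of the same scores
varImpPlot(iris_rf)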
4A- What is the out-of-bag error for your model?
#initial run: OOB estimate of error rate was 5.83% (3.33% on the current run)
4B- Based on the out-of-bag error, is the random forest better at predicting flower species than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?
#Compared on the same scale, the random forest performs slightly better: its OOB error rate (5.83%) is lower than the bagging model's (0.075, i.e. 7.5%), though both models perform with broadly similar accuracy.
4C- How many flowers in your training set are misclassified by your random forest model?
#7 in the initial run: 3 in the versicolor row and 4 in the virginica row (4 in the current run: 2 versicolor and 2 virginica)
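The misclassified count can also be pulled programmatically from the fitted model (a sketch; iris_rf$confusion carries a class.error column, which is dropped before summing):
#off-diagonal entries of the training confusion matrix = misclassified flowers
conf <- iris_rf$confusion[, 1:3]
sum(conf) - sum(diag(conf))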
4D- Use your random forest to make predictions for the observations in your testing set.
iris_rf_pred <- predict(iris_rf, newdata = iris_test, type = "class")
table(iris_test$Species, iris_rf_pred, dnn = c("Truth", "Predicted"))
## Predicted
## Truth setosa versicolor virginica
## setosa 7 0 0
## versicolor 0 11 1
## virginica 0 0 11
#initial predictions below
#Predicted
#Truth setosa versicolor virginica
#setosa 12 0 0
#versicolor 0 9 1
#virginica 0 0 8
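For a direct comparison on held-out data, the two models' test error rates can be placed side by side (a sketch reusing the prediction vectors from above):
#test-set misclassification rates for the bagging and random forest models
c(bagging = mean(iris_bag_pred != iris_test$Species),
  random_forest = mean(iris_rf_pred != iris_test$Species))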
Question 5- Develop a boosting model on your training set to
predict a flower’s species based on its sepal and petal measurements. In
developing this model, you can either use R’s default settings for
things like the number of trees and shrinkage, or you can use values of
your own choosing.
library(adabag)
## Loading required package: rpart
## Loading required package: caret
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
##
## Attaching package: 'adabag'
## The following object is masked from 'package:ipred':
##
## bagging
iris_boost1 <- boosting(Species~., data = iris_train, boos = TRUE)
iris_boost1$importance
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 54.33778 21.79623 10.85876 13.00722
#initial run
#Petal.Length Petal.Width Sepal.Length Sepal.Width
#51.32374 19.61918 11.37238 17.68471
5A- Which flower measurement is most important in predicting its species (i.e., sepal length, sepal width, petal length, or petal width)?
#Petal length is the most important measurement.
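The importance scores are easier to compare visually (a minimal base-graphics sketch):
#horizontal bar chart of adabag's relative importance scores, sorted
barplot(sort(iris_boost1$importance), horiz = TRUE, las = 1,
        xlab = "Relative importance (%)")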
5B- Use your boosting model to make predictions for the observations
in your testing set, and create a confusion matrix displaying your
predictions. What is the misclassification error rate for flowers in
your testing set?
pred_iris_boost1 <- predict(iris_boost1, newdata = iris_test)
pred_iris_boost1$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 7 0 0
## versicolor 0 11 0
## virginica 0 1 11
pred_iris_boost1$error # initial misclassification error: 0.06666667
## [1] 0.03333333
# initial run for confusion matrix
#Observed Class
#Predicted Class setosa versicolor virginica
#setosa 12 0 0
#versicolor 0 8 0
#virginica 0 2 8
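The reported error can be verified by hand from the confusion matrix: rows are predicted classes and columns are observed classes, so the diagonal holds the correct predictions (a one-line cross-check):
#1 minus accuracy = misclassification error
1 - sum(diag(pred_iris_boost1$confusion)) / sum(pred_iris_boost1$confusion)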
5C- Create a plot comparing the number of trees used in the boosting
model with the misclassification error on your testing set. Based on
your plot, are a large number of trees needed to have a fairly accurate
boosting model?
ntree <- c(1, seq(20, 100, 20))
err <- numeric(length(ntree))
for (i in seq_along(ntree)) {
  #refit the boosting model with ntree[i] trees, then record the test error
  iris_boost1 <- boosting(Species~., data = iris_train, boos = TRUE, mfinal = ntree[i])
  pred_iris_boost1 <- predict(iris_boost1, newdata = iris_test)
  err[i] <- pred_iris_boost1$error
  cat(i, " ")
}
## 1 2 3 4 5 6
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "No. of Trees", ylab = "Misclassification Error")
[Plot: test misclassification error vs. number of trees]
#No, a large number of trees is not needed to produce an accurate model.
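Note that the loop above refits the whole model for each value of mfinal. adabag also provides errorevol(), which evaluates a single fitted model after each boosting iteration; a sketch, assuming iris_boost1 is a fit with mfinal = 100 (as in the last loop iteration):
#test error after each successive boosting iteration, from one fitted model
evol <- errorevol(iris_boost1, newdata = iris_test)
plot(evol$error, type = 'l', col = 2, lwd = 2,
     xlab = "No. of Trees", ylab = "Misclassification Error")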
Part 2
Question 7
# create a data frame called Faculty
Faculty <- data.frame(ID = c(1, 2, 3, 7),
                      Name = c("Grayson", "Wayne", "Stark", "Grey"),
                      Code = c("ART", "ART", "COMP", "HIST"))
head(Faculty)
## ID Name Code
## 1 1 Grayson ART
## 2 2 Wayne ART
## 3 3 Stark COMP
## 4 7 Grey HIST
Question 8
#create a data frame called Department
Department <- data.frame(Code = c("ART", "COMP", "ENG", "HIST"),
                         Department_Name = c("Art_Department", "Computer_Science_Department",
                                             "English_Department", "History_Department"))
head(Department)
## Code Department_Name
## 1 ART Art_Department
## 2 COMP Computer_Science_Department
## 3 ENG English_Department
## 4 HIST History_Department
Question 9- Using a filter join, display only those departments that
have a faculty member listed in the Faculty data frame.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ ggplot2::margin() masks randomForest::margin()
## ✖ dplyr::select() masks MASS::select()
## ✖ purrr::when() masks foreach::when()
#semi_join keeps the Department rows that have a match in Faculty, without adding columns
filter_join <- Department %>% semi_join(Faculty, by = "Code")
filter_join
## Code Department_Name
## 1 ART Art_Department
## 2 COMP Computer_Science_Department
## 3 HIST History_Department
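For contrast with this filter join: a mutating join such as inner_join() keeps the same matching departments but also attaches the Faculty columns (a quick sketch):
#inner_join returns matching rows AND adds Faculty's ID and Name columns
Department %>% inner_join(Faculty, by = "Code")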
Question 10- Using a filter join, display the department that does
not have a faculty member listed in the Faculty data frame.
#use a name other than anti_join so the result does not mask dplyr::anti_join()
no_faculty <- Department %>% anti_join(Faculty, by = "Code")
no_faculty
## Code Department_Name
## 1 ENG English_Department