Load the data frame called iris in R with data(iris), then type its name to view it:
data(iris)
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Randomly sample 80% of the rows of the iris data frame to create a training set, and place the remaining rows in a testing set.
# sample 80% of the row indices for training; the remaining rows form the test set
index <- sample(nrow(iris), nrow(iris) * 0.80)
iris_train <- iris[index, ]
iris_test <- iris[-index, ]
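Note that sample() draws randomly, so the split (and every error estimate below) will differ from run to run. A minimal sketch for reproducibility, assuming an arbitrary seed value of 123:
# run once before sampling; any fixed seed value works
set.seed(123)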
The iris dataset contains sepal and petal measurements for three species of flowers. Using 500 bootstrap samples, develop a bagging model on your training set to predict a flower’s species based on its sepal and petal measurements.
#install.packages('ipred')
library(ipred)
iris_bag <- bagging(formula = Species ~ ., data = iris_train, nbagg = 500)
iris_bag
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## nbagg = 500)
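By default, bagging() does not compute an out-of-bag error estimate, so the model is refit with coob = TRUE to request one: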
iris_bag_oob <- bagging(formula = Species ~ .,
                        data = iris_train,
                        coob = TRUE,
                        nbagg = 500)
iris_bag_oob
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## coob = TRUE, nbagg = 500)
##
## Out-of-bag estimate of misclassification error: 0.0667
iris_bag_pred <- predict(iris_bag, newdata = iris_test)
table(iris_bag_pred)
## iris_bag_pred
## setosa versicolor virginica
## 12 7 11
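Note that table(iris_bag_pred) only tallies the predicted classes. To gauge test-set accuracy, the predictions can be cross-tabulated against the true labels; a minimal sketch (counts will vary with the random split):
# confusion matrix of predicted vs. observed species on the test set
table(Predicted = iris_bag_pred, Observed = iris_test$Species)
# test-set misclassification rate
mean(iris_bag_pred != iris_test$Species)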
Using 500 trees, develop a random forest model on your training set to predict a flower’s species based on its sepal and petal measurements.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
# ntree defaults to 500 trees; importance = TRUE stores variable importance scores
iris_rf <- randomForest(Species ~ ., data = iris_train, importance = TRUE)
iris_rf
##
## Call:
## randomForest(formula = Species ~ ., data = iris_train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 38 0 0 0.00000000
## versicolor 0 38 3 0.07317073
## virginica 0 3 38 0.07317073
What is the out-of-bag error for your model? The OOB error for the model above is 5%.
Based on the out-of-bag error, is the random forest better at predicting flower species than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy? Based on the OOB error, the random forest model (5%) is slightly better than the bagging model (6.67%).
How many flowers in your training set are misclassified by your random forest model? From the confusion matrix, 6 flowers are misclassified: 3 versicolor predicted as virginica and 3 virginica predicted as versicolor, consistent with the 5% OOB error on 120 training rows.
Use your random forest to make predictions for the observations in your testing set.
iris_rf_pred <- predict(iris_rf, newdata = iris_test)
table(iris_rf_pred)
## iris_rf_pred
## setosa versicolor virginica
## 12 8 10
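As with the bagging model, the predictions can be compared against the true test labels, and because the forest was fit with importance = TRUE, the variable importance scores can be inspected as well; a sketch:
# confusion matrix and misclassification rate on the test set
table(Predicted = iris_rf_pred, Observed = iris_test$Species)
mean(iris_rf_pred != iris_test$Species)
# permutation and Gini importance measures for each predictor
importance(iris_rf)
varImpPlot(iris_rf)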
Develop a boosting model on your training set to predict a flower’s species based on its sepal and petal measurements. In developing this model, you can either use R’s default settings for things like the number of trees and shrinkage, or you can use values of your own choosing. (Note that flower species is a non-numeric categorical variable, so make sure you are developing a boosting model for the purpose of categorical classification.)
# install.packages("ggplot2")
# install.packages("caret")
# install.packages("adabag")
library(adabag)
## Loading required package: rpart
## Loading required package: caret
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
##
## Attaching package: 'adabag'
## The following object is masked from 'package:ipred':
##
## bagging
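Note the masking message: adabag’s bagging() now shadows ipred’s. If another ipred bagging model were needed after this point, the namespace should be qualified explicitly, e.g.:
# disambiguate with :: now that adabag masks ipred's bagging()
ipred::bagging(Species ~ ., data = iris_train, nbagg = 500)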
Which flower measurement is most important in predicting its species (i.e., sepal length, sepal width, petal length, or petal width)?
Use your boosting model to make predictions for the observations in your testing set, and create a confusion matrix displaying your predictions. What is the misclassification error rate for flowers in your testing set?
iris_boost <- boosting(Species ~ ., data = iris_train, boos = TRUE)
iris_boost$importance
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 50.327769 21.796364 9.453085 18.422781
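Petal length is the most important measurement for predicting species (importance score ≈ 50.3), followed by petal width (≈ 21.8) and sepal width (≈ 18.4); sepal length (≈ 9.5) contributes least.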
pred_iris_boost <- predict(iris_boost, newdata = iris_test)
pred_iris_boost$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 8 0
## virginica 0 1 9
pred_iris_boost$error
## [1] 0.03333333
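The misclassification error rate for the testing set is about 3.33%: the confusion matrix shows that 1 of the 30 test flowers (an observed versicolor predicted as virginica) is misclassified.
To see how the number of boosting iterations affects the test error, the model can be refit over a range of mfinal values and the resulting error plotted: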
ntree <- c(1, seq(20, 100, 20))
err <- numeric(length(ntree))
for (i in seq_along(ntree)) {
  iris_boost <- boosting(Species ~ ., data = iris_train, boos = TRUE, mfinal = ntree[i])
  pred_iris_boost <- predict(iris_boost, newdata = iris_test)
  err[i] <- pred_iris_boost$error
  cat(i, " ")  # progress indicator
}
## 1 2 3 4 5 6
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "No. of Trees", ylab = "Misclassification Error")
Using your training set, create a regression model to predict a flower’s petal length based on its petal width, sepal length, sepal width, and species. Determine if your regression model could be improved with a Box-Cox transformation, and if so, perform the most appropriate Box-Cox transformation.
library(MASS)
model <- lm(Petal.Length ~ ., data = iris_train)
summary(model)
##
## Call:
## lm(formula = Petal.Length ~ ., data = iris_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63563 -0.19620 0.00649 0.16998 0.68638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.20003 0.31209 -3.845 0.000199 ***
## Sepal.Length 0.63091 0.05579 11.309 < 2e-16 ***
## Sepal.Width -0.20055 0.09349 -2.145 0.034071 *
## Petal.Width 0.68174 0.13673 4.986 2.22e-06 ***
## Speciesversicolor 1.36948 0.19963 6.860 3.77e-10 ***
## Speciesvirginica 1.83209 0.28059 6.529 1.91e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2663 on 114 degrees of freedom
## Multiple R-squared: 0.9781, Adjusted R-squared: 0.9772
## F-statistic: 1020 on 5 and 114 DF, p-value: < 2.2e-16
boxcox(model)
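The Box-Cox plot (not reproduced here) indicates an optimal λ near 0.5, suggesting a square-root transformation of the response: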
# use the bare column name in the formula so the '.' correctly excludes the response
model2 <- lm(sqrt(Petal.Length) ~ ., data = iris_train)
summary(model2)
##
## Call:
## lm(formula = sqrt(Petal.Length) ~ ., data = iris_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18882 -0.03996 0.00443 0.03905 0.16647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59491 0.07623 7.804 3.16e-12 ***
## Sepal.Length 0.13994 0.01363 10.270 < 2e-16 ***
## Sepal.Width -0.03887 0.02284 -1.702 0.0915 .
## Petal.Width 0.16573 0.03340 4.962 2.46e-06 ***
## Speciesversicolor 0.52290 0.04876 10.723 < 2e-16 ***
## Speciesvirginica 0.62042 0.06854 9.052 4.46e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06506 on 114 degrees of freedom
## Multiple R-squared: 0.9834, Adjusted R-squared: 0.9827
## F-statistic: 1352 on 5 and 114 DF, p-value: < 2.2e-16
boxcox(model2)
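If the transformation is appropriate, this second Box-Cox plot should show λ = 1 within the confidence interval, indicating no further transformation is needed. (The R-squared and residual standard error values of the two models are not directly comparable, since their responses are on different scales.)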
Create a new data frame called Faculty containing the following table of information about four faculty at a university:
Faculty <- data.frame(ID = c(1, 2, 3, 7),
                      Name = c("Grayson", "Wayne", "Stark", "Grey"),
                      Code = c("ART", "ART", "COMP", "HIST"))
Faculty
## ID Name Code
## 1 1 Grayson ART
## 2 2 Wayne ART
## 3 3 Stark COMP
## 4 7 Grey HIST
Create a new data frame called Department containing the following table of information:
Department <- data.frame(Code = c("ART", "COMP", "ENG", "HIST"),
                         Department_Name = c("Art Department", "Computer Science Department",
                                             "English Department", "History Department"))
Department
## Code Department_Name
## 1 ART Art Department
## 2 COMP Computer Science Department
## 3 ENG English Department
## 4 HIST History Department
Using a filter join, display only those departments that have a faculty member listed in the Faculty data frame.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ ggplot2::margin() masks randomForest::margin()
## ✖ dplyr::select() masks MASS::select()
## ✖ purrr::when() masks foreach::when()
joined_data <- Department %>% semi_join(Faculty, by = "Code")
joined_data
## Code Department_Name
## 1 ART Art Department
## 2 COMP Computer Science Department
## 3 HIST History Department
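A semi join keeps the rows of Department that have at least one matching Code in Faculty, without adding any columns from Faculty.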
Using a filter join, display the department that does not have a faculty member listed in the Faculty data frame.
joined_data <- Department %>% anti_join(Faculty, by = "Code")
joined_data
## Code Department_Name
## 1 ENG English Department
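An anti join does the reverse: it keeps only the Department rows with no matching Code in Faculty, here the English Department.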