Part 1
Question 1- Load the data frame called iris in R using the following
code:
#load the iris data set
data(iris)
#view the data frame by typing its name
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
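Printing the entire data frame floods the console with all 150 rows. For a quicker look, head() and summary() give a compact view (a minimal sketch using only base R):
#first six rows of the data frame
head(iris)
#per-column summaries, including counts for the Species factor
summary(iris)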
Question 2- Randomly sample 80% of the rows in the iris data frame
to create a training set. Create a testing set containing the rest of
the rows in the iris data frame.
#sample 80% of the row indices for training; the remaining rows form the test set
index <- sample(nrow(iris), nrow(iris) * 0.80)
iris_train <- iris[index, ]
iris_test <- iris[-index, ]
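Note that sample() draws a different random subset on every run, which is why the "initial run" results noted throughout this document differ from the current output. A minimal sketch for making the split reproducible (the seed value 123 is arbitrary):
#fix the random number generator state so the split is identical on every run
set.seed(123)
index <- sample(nrow(iris), nrow(iris) * 0.80)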
Question 3- The iris dataset contains sepal and petal measurements
for three species of flowers. Using 500 bootstrapped sets, develop a
bagging model on your training set to predict a flower’s species based
on its sepal and petal measurements.
#load the ipred package, which provides bagging()
library(ipred)
#bagging model
iris_bag <- bagging(formula = Species~., data = iris_train, nbagg = 500)
iris_bag
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## nbagg = 500)
3A- What is the out-of-bag error for your model?
iris_bag_oob <- bagging(formula = Species~.,
                        data = iris_train,
                        coob = T,
                        nbagg = 500)
iris_bag_oob
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## coob = T, nbagg = 500)
##
## Out-of-bag estimate of misclassification error: 0.05
# initial run: out-of-bag estimate of misclassification error was 0.075
3B- Use your model to make predictions for the observations in your
testing set.
iris_bag_pred <- predict(iris_bag, newdata = iris_test, type = "class")
table(iris_test$Species, iris_bag_pred, dnn = c("Truth", "Predicted"))
## Predicted
## Truth setosa versicolor virginica
## setosa 7 0 0
## versicolor 0 11 1
## virginica 0 0 11
#initial run answers below
#Predicted
#Truth setosa versicolor virginica
#setosa 12 0 0
#versicolor 0 9 1
#virginica 0 0 8
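As a cross-check on the confusion matrix, the test-set misclassification rate can be computed directly from the objects above (a one-line sketch, assuming iris_bag_pred and iris_test as defined earlier):
#proportion of test flowers whose predicted species differs from the true species
mean(iris_bag_pred != iris_test$Species)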
Question 4- Using 500 trees, develop a random forest model on your
training set to predict a flower’s species based on its sepal and petal
measurements.
#load the randomForest package
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
#model for random forests using 500 trees
iris_rf <- randomForest(Species~., data = iris_train, importance = TRUE, ntree = 500)
iris_rf
##
## Call:
## randomForest(formula = Species ~ ., data = iris_train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 3.33%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 43 0 0 0.00000000
## versicolor 0 36 2 0.05263158
## virginica 0 2 37 0.05128205
#initial run calculations
#Call:
#randomForest(formula = Species ~ ., data = iris_train, importance = TRUE, ntree = 500)
#Type of random forest: classification
#Number of trees: 500
#No. of variables tried at each split: 2
#OOB estimate of error rate: 5.83%
#Confusion matrix:
#setosa versicolor virginica class.error
#setosa 38 0 0 0.0000000
#versicolor 0 37 3 0.0750000
#virginica 0 4 38 0.0952381
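Because the model was fit with importance = TRUE, the per-variable importance scores can also be inspected (a short sketch using randomForest's own accessors):
#mean decrease in accuracy and in Gini impurity for each predictor
importance(iris_rf)
#dot chart of the same scores
varImpPlot(iris_rf)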
4A- What is the out-of-bag error for your model?
#initial run: OOB estimate of error rate was 5.83% (3.33% on the current run)
4B- Based on the out-of-bag error, is the random forest better at predicting flower species than the bagging model? Or is the bagging model better than the random forest? Or do both models seem to perform with about the same accuracy?
#Compared on the same scale, the random forest performs slightly better: its OOB error rate (5.83%) is lower than the bagging model's (0.075, i.e. 7.5%), though both models perform with broadly similar accuracy.
4C- How many flowers in your training set are misclassified by your random forest model?
#7 in the initial run: 3 in the versicolor row and 4 in the virginica row (4 in the current run: 2 versicolor and 2 virginica)
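The misclassified count can also be pulled programmatically from the fitted model (a sketch; iris_rf$confusion carries a class.error column, which is dropped before summing):
#off-diagonal entries of the training confusion matrix = misclassified flowers
conf <- iris_rf$confusion[, 1:3]
sum(conf) - sum(diag(conf))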
4D- Use your random forest to make predictions for the observations in your testing set.
iris_rf_pred <- predict(iris_rf, newdata = iris_test, type = "class")
table(iris_test$Species, iris_rf_pred, dnn = c("Truth", "Predicted"))
## Predicted
## Truth setosa versicolor virginica
## setosa 7 0 0
## versicolor 0 11 1
## virginica 0 0 11
#initial predictions below
#Predicted
#Truth setosa versicolor virginica
#setosa 12 0 0
#versicolor 0 9 1
#virginica 0 0 8
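For a direct comparison on held-out data, the two models' test error rates can be placed side by side (a sketch reusing the prediction vectors from above):
#test-set misclassification rates for the bagging and random forest models
c(bagging = mean(iris_bag_pred != iris_test$Species),
  random_forest = mean(iris_rf_pred != iris_test$Species))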
Question 5- Develop a boosting model on your training set to
predict a flower’s species based on its sepal and petal measurements. In
developing this model, you can either use R’s default settings for
things like the number of trees and shrinkage, or you can use values of
your own choosing.
library(adabag)
## Loading required package: rpart
## Loading required package: caret
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
##
## Attaching package: 'adabag'
## The following object is masked from 'package:ipred':
##
## bagging
iris_boost1 <- boosting(Species~., data = iris_train, boos = TRUE)
iris_boost1$importance
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 54.33778 21.79623 10.85876 13.00722
#initial run
#Petal.Length Petal.Width Sepal.Length Sepal.Width
#51.32374 19.61918 11.37238 17.68471
5A- Which flower measurement is most important in predicting its species (i.e., sepal length, sepal width, petal length, or petal width)?
#Petal length is the most important measurement.
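The importance scores are easier to compare visually (a minimal base-graphics sketch):
#horizontal bar chart of adabag's relative importance scores, sorted
barplot(sort(iris_boost1$importance), horiz = TRUE, las = 1,
        xlab = "Relative importance (%)")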
5B- Use your boosting model to make predictions for the observations
in your testing set, and create a confusion matrix displaying your
predictions. What is the misclassification error rate for flowers in
your testing set?
pred_iris_boost1 <- predict(iris_boost1, newdata = iris_test)
pred_iris_boost1$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 7 0 0
## versicolor 0 11 0
## virginica 0 1 11
pred_iris_boost1$error # initial misclassification error: 0.06666667
## [1] 0.03333333
# initial run for confusion matrix
#Observed Class
#Predicted Class setosa versicolor virginica
#setosa 12 0 0
#versicolor 0 8 0
#virginica 0 2 8
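The reported error can be verified by hand from the confusion matrix: rows are predicted classes and columns are observed classes, so the diagonal holds the correct predictions (a one-line cross-check):
#1 minus accuracy = misclassification error
1 - sum(diag(pred_iris_boost1$confusion)) / sum(pred_iris_boost1$confusion)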
5C- Create a plot comparing the number of trees used in the boosting
model with the misclassification error on your testing set. Based on
your plot, are a large number of trees needed to have a fairly accurate
boosting model?
ntree <- c(1, seq(20, 100, 20))
err <- numeric(length(ntree))
for (i in seq_along(ntree)) {
  #refit the boosting model with ntree[i] trees, then record the test error
  iris_boost1 <- boosting(Species~., data = iris_train, boos = TRUE, mfinal = ntree[i])
  pred_iris_boost1 <- predict(iris_boost1, newdata = iris_test)
  err[i] <- pred_iris_boost1$error
  cat(i, " ")
}
## 1 2 3 4 5 6
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "No. of Trees", ylab = "Misclassification Error")
[Plot: test misclassification error vs. number of trees]
#No, a large number of trees is not needed to produce an accurate model.
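Note that the loop above refits the whole model for each value of mfinal. adabag also provides errorevol(), which evaluates a single fitted model after each boosting iteration; a sketch, assuming iris_boost1 is a fit with mfinal = 100 (as in the last loop iteration):
#test error after each successive boosting iteration, from one fitted model
evol <- errorevol(iris_boost1, newdata = iris_test)
plot(evol$error, type = 'l', col = 2, lwd = 2,
     xlab = "No. of Trees", ylab = "Misclassification Error")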
Part 2
Question 7
# create a data frame called Faculty
Faculty <- data.frame(ID = c(1, 2, 3, 7),
                      Name = c("Grayson", "Wayne", "Stark", "Grey"),
                      Code = c("ART", "ART", "COMP", "HIST"))
head(Faculty)
## ID Name Code
## 1 1 Grayson ART
## 2 2 Wayne ART
## 3 3 Stark COMP
## 4 7 Grey HIST
Question 8
#create a data frame called Department
Department <- data.frame(Code = c("ART", "COMP", "ENG", "HIST"),
                         Department_Name = c("Art_Department", "Computer_Science_Department",
                                             "English_Department", "History_Department"))
head(Department)
## Code Department_Name
## 1 ART Art_Department
## 2 COMP Computer_Science_Department
## 3 ENG English_Department
## 4 HIST History_Department
Question 9- Using a filter join, display only those departments that
have a faculty member listed in the Faculty data frame.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ ggplot2::margin() masks randomForest::margin()
## ✖ dplyr::select() masks MASS::select()
## ✖ purrr::when() masks foreach::when()
#semi_join keeps the Department rows that have a match in Faculty, without adding columns
filter_join <- Department %>% semi_join(Faculty, by = "Code")
filter_join
## Code Department_Name
## 1 ART Art_Department
## 2 COMP Computer_Science_Department
## 3 HIST History_Department
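For contrast with this filter join: a mutating join such as inner_join() keeps the same matching departments but also attaches the Faculty columns (a quick sketch):
#inner_join returns matching rows AND adds Faculty's ID and Name columns
Department %>% inner_join(Faculty, by = "Code")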
Question 10- Using a filter join, display the department that does
not have a faculty member listed in the Faculty data frame.
#use a name other than anti_join so the result does not mask dplyr::anti_join()
no_faculty <- Department %>% anti_join(Faculty, by = "Code")
no_faculty
## Code Department_Name
## 1 ENG English_Department