Load the data frame called iris in R using the following code: data(iris). Also, type the name of the data frame to view it in R using the following code: iris
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Randomly sample 80% of the rows in the iris data frame to create a training set. Create a testing set containing the rest of the rows in the iris data frame.
ind <- sample(nrow(iris), nrow(iris)*0.8)
iris_train <- iris[ind,]
iris_test <- iris[-ind,]
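Because sample() draws rows at random, the split changes from run to run, and so does every error rate reported below. A minimal sketch of a reproducible version follows. The seed value 123 is an arbitrary assumption and was not part of the original analysis, so it will not reproduce the figures shown in this document.

```r
# Fix the RNG seed so the same 120/30 split is drawn on every run.
# The seed value is an arbitrary choice for illustration.
set.seed(123)

data(iris)
n_train <- floor(0.8 * nrow(iris))   # 120 of the 150 rows
ind <- sample(nrow(iris), n_train)   # row indices for the training set

iris_train <- iris[ind, ]
iris_test  <- iris[-ind, ]           # the remaining 30 rows

dim(iris_train)
dim(iris_test)
```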
The iris dataset contains sepal and petal measurements for three species of flowers. Using 500 bootstrapped sets, develop a bagging model on your training set to predict a flower’s species based on its sepal and petal measurements.
library(ipred)
iris_bag <- bagging(Species ~., data = iris_train, nbagg = 500)
iris_bag
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## nbagg = 500)
iris_bag_oob <- bagging(Species~., data = iris_train, coob = T, nbagg = 500)
iris_bag_oob
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## coob = T, nbagg = 500)
##
## Out-of-bag estimate of misclassification error: 0.05
See screenshot in the included word doc as 3.a). The out-of-bag misclassification error for the bagging model is 0.05 (5%).
iris_bag_pred <- predict(iris_bag, newdata = iris_test)
summary(iris_bag_pred)
## setosa versicolor virginica
## 6 10 14
See screenshot in the included word doc as 3.a). On the testing set, the bagging model predicts 6 setosa, 10 versicolor, and 14 virginica flowers.
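The class counts from summary() show how many flowers were assigned to each species, but not whether those assignments were correct. One way to check is to cross-tabulate the predictions against the true species. The sketch below assumes a seeded split (an arbitrary seed, not used in the original run), so its numbers will differ from the output above.

```r
library(ipred)

set.seed(123)  # arbitrary seed, assumed for illustration only
data(iris)
ind <- sample(nrow(iris), floor(0.8 * nrow(iris)))
iris_train <- iris[ind, ]
iris_test  <- iris[-ind, ]

# Bagged classification trees, as in the model above
iris_bag <- bagging(Species ~ ., data = iris_train, nbagg = 500)
iris_bag_pred <- predict(iris_bag, newdata = iris_test)

# Rows are predicted species, columns are true species
bag_conf <- table(predicted = iris_bag_pred, actual = iris_test$Species)
bag_conf

# Share of test flowers assigned to the wrong species
bag_test_error <- mean(iris_bag_pred != iris_test$Species)
bag_test_error
```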
Using 500 trees, develop a random forest model on your training set to predict a flower’s species based on its sepal and petal measurements.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
iris_rf <- randomForest(formula = Species~., data = iris_train, importance = TRUE, ntree = 500)
iris_rf
##
## Call:
## randomForest(formula = Species ~ ., data = iris_train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.17%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 44 0 0 0.00000000
## versicolor 0 38 2 0.05000000
## virginica 0 3 33 0.08333333
The out-of-bag error for the random forest is 4.17%. See screenshot in the included word doc as 4.a).
iris_bag_oob
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## coob = T, nbagg = 500)
##
## Out-of-bag estimate of misclassification error: 0.05
iris_rf
##
## Call:
## randomForest(formula = Species ~ ., data = iris_train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.17%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 44 0 0 0.00000000
## versicolor 0 38 2 0.05000000
## virginica 0 3 33 0.08333333
The bagging model's OOB error was 0.05 (5%), and the random forest's OOB error was 4.17%. The random forest model performed better (lower OOB error) than the bagging model. See screenshot in the included word doc as 4.b).
As seen in the confusion matrix, 5 flowers are misclassified: 2 versicolor predicted as virginica and 3 virginica predicted as versicolor. See screenshot in the included word doc as 4.c).
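The OOB error rate and the per-class errors follow directly from the random forest confusion matrix printed above. A small worked check in base R, with the matrix typed in by hand from that output (rows are true species, columns are predicted species):

```r
species <- c("setosa", "versicolor", "virginica")

# Confusion matrix copied from the randomForest output above
conf <- matrix(
  c(44,  0,  0,
     0, 38,  2,
     0,  3, 33),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = species, predicted = species)
)

# Off-diagonal cells are the misclassified flowers
misclassified <- sum(conf) - sum(diag(conf))
misclassified                      # 5

# Overall OOB error: misclassified / total = 5 / 120
oob_error <- misclassified / sum(conf)
round(100 * oob_error, 2)          # 4.17 (percent)

# Per-class error: 1 - correct / row total
class_error <- 1 - diag(conf) / rowSums(conf)
round(class_error, 4)              # 0, 0.05, 0.0833
```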
iris_rf_pred <- predict(iris_rf, newdata = iris_test)
summary(iris_bag_pred)
## setosa versicolor virginica
## 6 10 14
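See screenshot in the included word doc as 4.d). The summary() call above is applied to iris_bag_pred rather than iris_rf_pred, so the counts shown (6 setosa, 10 versicolor, 14 virginica) are the bagging predictions repeated; summary(iris_rf_pred) would give the random forest counts for the testing set.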
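A sketch of how the random forest's own test-set predictions could be summarised and checked against the true species. It assumes a seeded split and refits the model, so the counts it prints are not the ones from the original run.

```r
library(randomForest)

set.seed(123)  # arbitrary seed, assumed for illustration only
data(iris)
ind <- sample(nrow(iris), floor(0.8 * nrow(iris)))
iris_train <- iris[ind, ]
iris_test  <- iris[-ind, ]

iris_rf <- randomForest(Species ~ ., data = iris_train,
                        importance = TRUE, ntree = 500)

# Predict on the held-out rows and summarise the random forest predictions
iris_rf_pred <- predict(iris_rf, newdata = iris_test)
summary(iris_rf_pred)

# Rows are predicted species, columns are true species
rf_conf <- table(predicted = iris_rf_pred, actual = iris_test$Species)
rf_conf

# Share of test flowers assigned to the wrong species
rf_test_error <- mean(iris_rf_pred != iris_test$Species)
rf_test_error
```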
Develop a boosting model on your training set to predict a flower’s species based on its sepal and petal measurements. In developing this model, you can either use R’s default settings for things like the number of trees and shrinkage, or you can use values of your own choosing.(Note that flower species is a non-numeric categorical variable. So you want to make sure that you are developing a boosting model for the purpose of categorical classification.)
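library(gbm)
# Note: distribution = "gaussian" fits a regression on the integer codes of Species (1 to 3),
# which is why the predictions further down are numeric. Classification in gbm uses distribution = "multinomial".
iris_boost <- gbm(formula = Species ~., data=iris_train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 3)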
summary(iris_boost)
## var rel.inf
## Petal.Length Petal.Length 65.917006
## Petal.Width Petal.Width 25.817141
## Sepal.Length Sepal.Length 4.944938
## Sepal.Width Sepal.Width 3.320914
See screenshot in the included word doc as 5.a). Petal.Length is the most important variable for predicting species, with a relative influence of about 65.9, followed by Petal.Width at about 25.8.
library(gbm)
iris_boost_test <- predict(iris_boost, newdata = iris_test, n.trees = 10000)
summary(iris_boost_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9898 1.9612 2.2126 2.2297 2.9033 3.1358
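The predictions summarised above run from about 0.99 to 3.14 because the model was fit with distribution = "gaussian", which regresses on the integer codes of Species instead of classifying. A hedged sketch of a classification version uses distribution = "multinomial". The seed, n.trees = 1000, and the other tuning values here are assumptions chosen for illustration, and some gbm releases print a warning that the multinomial option is no longer maintained.

```r
library(gbm)

set.seed(123)  # arbitrary seed, assumed for illustration only
data(iris)
ind <- sample(nrow(iris), floor(0.8 * nrow(iris)))
iris_train <- iris[ind, ]
iris_test  <- iris[-ind, ]

# Multinomial boosting treats Species as a categorical outcome
iris_boost_mn <- gbm(Species ~ ., data = iris_train,
                     distribution = "multinomial",
                     n.trees = 1000, shrinkage = 0.01,
                     interaction.depth = 3)

# type = "response" returns class probabilities as an
# (observations x classes x 1) array; drop the last dimension
probs <- predict(iris_boost_mn, newdata = iris_test,
                 n.trees = 1000, type = "response")
prob_mat <- probs[, , 1]

# Pick the most probable class for each test flower
species_levels <- levels(iris_train$Species)
boost_pred <- factor(species_levels[max.col(prob_mat, ties.method = "first")],
                     levels = species_levels)

table(predicted = boost_pred, actual = iris_test$Species)
```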
Using your training set, create a regression model to predict a flower’s petal length based on its petal width, sepal length, sepal width, and species. Determine if your regression model could be improved with a Box-Cox transformation, and if so, perform the most appropriate Box-Cox transformation.
# boxcox() comes from the MASS package
library(MASS)
model1 <- lm(iris_train$Petal.Length ~ ., data = iris_train)
boxcox(model1)
model2 <- lm(sqrt(iris_train$Petal.Length) ~ ., data = iris_train)
summary(model2)
##
## Call:
## lm(formula = sqrt(iris_train$Petal.Length) ~ ., data = iris_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.167651 -0.036868 0.000188 0.036168 0.191817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.58221 0.07117 8.180 4.46e-13 ***
## Sepal.Length 0.13297 0.01356 9.808 < 2e-16 ***
## Sepal.Width -0.01591 0.02035 -0.782 0.43594
## Petal.Width 0.10103 0.03243 3.115 0.00232 **
## Speciesversicolor 0.59972 0.04453 13.468 < 2e-16 ***
## Speciesvirginica 0.74320 0.06399 11.615 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06254 on 114 degrees of freedom
## Multiple R-squared: 0.9853, Adjusted R-squared: 0.9847
## F-statistic: 1530 on 5 and 114 DF, p-value: < 2.2e-16
boxcox(model2)
Yes. For model1, the Box-Cox plot indicated a lambda of about 0.5, so a square root transformation was applied to create model2. The Box-Cox plot for model2 indicates a lambda of about 1, so no further transformation is needed.
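Reading lambda off the Box-Cox plot can be supplemented by extracting it numerically. A minimal sketch, assuming a seeded split (so the value may differ from the one read off the original plot) and the default lambda grid from -2 to 2 in steps of 0.1:

```r
library(MASS)

set.seed(123)  # arbitrary seed, assumed for illustration only
data(iris)
ind <- sample(nrow(iris), floor(0.8 * nrow(iris)))
iris_train <- iris[ind, ]

# y = TRUE keeps the response in the fit so boxcox() does not need to refit
model1 <- lm(Petal.Length ~ ., data = iris_train, y = TRUE)

# plotit = FALSE returns the grid of lambdas (x) and log-likelihoods (y)
bc <- boxcox(model1, plotit = FALSE)

# Lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
lambda_ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])
lambda_ci
```

If the interval contains 0.5 but not 1, a square root transformation is a reasonable choice; if it contains 1, the untransformed response is adequate.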
Create a new data frame called Faculty containing the following table of information about four faculty at a university.
# data.table() comes from the data.table package
library(data.table)
id <- c(1,2,3,7)
name <- c("Grayson", "Wayne", "Stark", "Grey")
code <- c( "ART", "ART", "COMP", "HIST")
faculty <- data.table(id, name, code)
Create a new data frame called Department containing the following table of information:
Department_Name <- c("Art Department", "Computer Science Department", "English Department", "History Department")
Code1 <- c( "ART", "COMP", "ENG", "HIST")
Department <- data.table(Code1, Department_Name)
Using a filter join, display only those departments that have a faculty member listed in the Faculty data frame.
Department %>% semi_join(faculty, by = c("Code1" = "code")) -> joined_data
joined_data
## Code1 Department_Name
## 1: ART Art Department
## 2: COMP Computer Science Department
## 3: HIST History Department
Using a filter join, display the department that does not have a faculty member listed in the Faculty data frame.
Department %>% anti_join(faculty, by = c("Code1" = "code")) -> joined_data1
joined_data1
## Code1 Department_Name
## 1: ENG English Department
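A filter join keeps or drops rows of one table according to whether a match exists in the other, without adding columns or duplicating rows. As a cross-check, the same two results can be sketched in base R with %in%. The tables are rebuilt here as plain data frames, so the snippet runs without dplyr or data.table.

```r
faculty <- data.frame(
  id   = c(1, 2, 3, 7),
  name = c("Grayson", "Wayne", "Stark", "Grey"),
  code = c("ART", "ART", "COMP", "HIST"),
  stringsAsFactors = FALSE
)

Department <- data.frame(
  Code1 = c("ART", "COMP", "ENG", "HIST"),
  Department_Name = c("Art Department", "Computer Science Department",
                      "English Department", "History Department"),
  stringsAsFactors = FALSE
)

# TRUE for departments whose code appears in the faculty table
has_faculty <- Department$Code1 %in% faculty$code

# Equivalent of a semi join: departments with at least one faculty member
with_faculty <- Department[has_faculty, ]
with_faculty$Code1      # "ART" "COMP" "HIST"

# Equivalent of an anti join: departments with no faculty member
without_faculty <- Department[!has_faculty, ]
without_faculty$Code1   # "ENG"
```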