Use R to complete all of the following questions. Provide the instructor with the output from your code as either screenshots pasted in Word, or as output generated in an HTML document. Submit both your code and output in Brightspace. Make sure that all textual explanations match the output that you provide the instructor.
data(iris)
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
index <- sample(nrow(iris), nrow(iris)*0.80)
iris_train <- iris[index,]
iris_test <- iris[-index,]
#library needed
library(ipred)
#fitting the model
iris_bag <- bagging(formula = Species~., data = iris_train, nbagg = 500)
#nbagg = 500 draws 500 bootstrap samples and fits a classification tree to each one. The final prediction is the majority vote across the 500 trees.
#nbagg is the number of bootstrap replications we want
iris_bag
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## nbagg = 500)
iris_bag_oob <- bagging(formula = Species~.,
data = iris_train,
coob = T,
nbagg = 500) #coob = T means that it will automatically calculate the out-of-bag estimate of the misclassification error
iris_bag_oob
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Species ~ ., data = iris_train,
## coob = T, nbagg = 500)
##
## Out-of-bag estimate of misclassification error: 0.075
The original output screenshot is included below:
The out-of-bag estimate of the misclassification error is 0.075 (7.5%), as reported in the output above.
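The out-of-bag estimate relies on the rows that each bootstrap sample happens to leave out. The sketch below is only an illustration in base R, not the internals of `bagging()`: it draws one bootstrap sample of 120 row indices (the training set size used here) and compares the share of left-out rows with the expected value. The seed and the variable names are assumptions made for the example.

```r
# Sketch: which rows does a single bootstrap sample leave out?
set.seed(1)                                   # arbitrary seed, for a repeatable illustration
n <- 120                                      # training set size (80% of 150 rows)
boot_rows <- sample(n, n, replace = TRUE)     # one bootstrap sample of row indices
oob_rows  <- setdiff(seq_len(n), boot_rows)   # rows never drawn: "out of bag"

# Expected share of rows left out of one bootstrap sample: (1 - 1/n)^n, close to exp(-1)
expected_oob_share <- (1 - 1 / n)^n
observed_oob_share <- length(oob_rows) / n
c(expected = expected_oob_share, observed = observed_oob_share)
```

Each tree can be scored on its own left-out rows, which is why the out-of-bag error is available without touching the testing set.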
iris_bag_pred <- predict(iris_bag, newdata = iris_test)
iris_bag_pred
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa versicolor
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
The predictions are shown above: 11 for setosa flowers, 12 for versicolor flowers, and 7 for virginica flowers in the testing set.
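Counts like these are easier to check with `table()` than by reading the printed vector. The sketch below retypes the 30 predicted labels from the printed bagging output so that it runs on its own; with the fitted objects in memory, `table(iris_bag_pred)` should give the same tally.

```r
# Predicted classes retyped from the printed bagging output (30 test rows)
bag_pred <- factor(rep(c("setosa", "versicolor", "virginica"),
                       times = c(11, 12, 7)))

pred_counts <- table(bag_pred)   # one count per predicted class
pred_counts

# With the real objects, a cross-tabulation against the true labels would be:
# table(Predicted = iris_bag_pred, Actual = iris_test$Species)
```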
#library needed
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
iris_rf <- randomForest(Species~., data=iris_train, importance = TRUE, ntree = 500)
iris_rf
##
## Call:
## randomForest(formula = Species ~ ., data = iris_train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 6.67%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 39 0 0 0.00000000
## versicolor 0 35 3 0.07894737
## virginica 0 5 38 0.11627907
The following displays the original output generated:
The out-of-bag error for this model is 6.67%.
The two models have similar out-of-bag errors on my training set: 7.5% for bagging and 6.67% for the random forest. The out-of-bag error will change as the random training and testing split changes, but it should not change drastically. Thus, both models seem to perform with about the same accuracy.
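Because the split is drawn with `sample()`, every run produces a different training set and therefore different error estimates. One way to keep the written answers in step with the output is to fix the random seed before sampling. This is a minimal sketch; the seed value 2024 is arbitrary and was not used in the original run, so it will not reproduce the numbers shown here.

```r
# Sketch: fix the seed so the 80/20 split is identical on every run
data(iris)

set.seed(2024)   # arbitrary value chosen for illustration
index_a <- sample(nrow(iris), nrow(iris) * 0.80)

set.seed(2024)   # same seed again
index_b <- sample(nrow(iris), nrow(iris) * 0.80)

identical(index_a, index_b)   # the two draws select the same rows

iris_train <- iris[index_a, ]
iris_test  <- iris[-index_a, ]
```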
The random forest model classified all 39 setosa flowers correctly, misclassified 3 of the 38 versicolor flowers as virginica (class.error 3/38 = 0.0789), and misclassified 5 of the 43 virginica flowers as versicolor (class.error 5/43 = 0.1163). Overall, 8 of the 120 training flowers were misclassified (an out-of-bag error rate of 0.0667).
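The class errors and the overall out-of-bag error can be recomputed from the printed confusion matrix. The sketch below retypes that matrix (rows are the actual classes, columns the predicted classes) so that it runs without the fitted model; the object names are illustrative.

```r
# Confusion matrix retyped from the printed random forest output
# (rows = actual class, columns = predicted class)
classes <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(39,  0,  0,
                  0, 35,  3,
                  0,  5, 38),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

class_error <- 1 - diag(conf) / rowSums(conf)    # error rate within each actual class
oob_error   <- 1 - sum(diag(conf)) / sum(conf)   # overall out-of-bag error rate

round(class_error, 4)
round(oob_error, 4)
```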
iris_rf_pred <- predict(iris_rf, iris_test)
iris_rf_pred
## 2 3 6 12 19 28 31
## setosa setosa setosa setosa setosa setosa setosa
## 36 39 45 48 54 55 61
## setosa setosa setosa setosa versicolor versicolor versicolor
## 63 66 72 73 82 88 92
## versicolor versicolor versicolor versicolor versicolor versicolor versicolor
## 97 100 102 106 118 141 144
## versicolor versicolor virginica virginica virginica virginica virginica
## 146 148
## virginica virginica
## Levels: setosa versicolor virginica
The predictions are shown above: 11 for setosa flowers, 12 for versicolor flowers, and 7 for virginica flowers in the testing set.
#library needed
library(adabag)
## Loading required package: rpart
## Loading required package: caret
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
##
## Attaching package: 'adabag'
## The following object is masked from 'package:ipred':
##
## bagging
#creating the boosting model
iris_boost = boosting(Species~., data = iris_train, boos = T)
iris_boost$importance
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 47.64604 25.07661 13.65901 13.61834
The original output is displayed below:
Petal.Length appears to be the most important variable for predicting flower species, with a relative importance of 47.64604.
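The ranking can be read off programmatically instead of by eye. The sketch below retypes the printed importance values into a named vector (with the fitted model in memory, `iris_boost$importance` would take its place) and sorts them.

```r
# Relative importance values retyped from the printed boosting output
imp <- c(Petal.Length = 47.64604, Petal.Width = 25.07661,
         Sepal.Length = 13.65901, Sepal.Width = 13.61834)

ranked  <- sort(imp, decreasing = TRUE)   # most important variable first
top_var <- names(ranked)[1]

ranked
sum(imp)   # the printed values add up to about 100
```

Since the values sum to roughly 100, each one can be read as a percentage share of the total importance.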
pred_iris_boost = predict(iris_boost, newdata = iris_test)
pred_iris_boost
## $formula
## Species ~ .
##
## $votes
## [,1] [,2] [,3]
## [1,] 86.4818556 11.257602 4.5998653
## [2,] 86.4818556 12.074194 3.7832734
## [3,] 86.6805460 14.990601 0.6681754
## [4,] 87.5896938 10.966356 3.7832734
## [5,] 86.6805460 14.990601 0.6681754
## [6,] 87.5896938 14.081454 0.6681754
## [7,] 86.4818556 12.074194 3.7832734
## [8,] 86.4818556 15.189292 0.6681754
## [9,] 86.4818556 11.257602 4.5998653
## [10,] 87.5896938 14.081454 0.6681754
## [11,] 86.4818556 12.074194 3.7832734
## [12,] 0.0000000 87.684284 14.6550388
## [13,] 0.8537486 88.930940 12.5546347
## [14,] 2.6319429 87.040838 12.6665421
## [15,] 0.8431612 82.376745 19.1194169
## [16,] 1.9716098 94.971523 5.3961897
## [17,] 1.6891025 91.634602 9.0156182
## [18,] 0.0000000 49.281715 53.0576082
## [19,] 1.7227951 92.710088 7.9064392
## [20,] 0.0000000 84.999823 17.3395000
## [21,] 1.9716098 87.785552 12.5821613
## [22,] 1.6891025 94.068025 6.5821950
## [23,] 1.6891025 92.940173 7.7100469
## [24,] 0.0000000 18.881068 83.4582551
## [25,] 0.0000000 2.672649 99.6666739
## [26,] 1.1078381 8.982707 92.2487774
## [27,] 0.0000000 10.296339 92.0429838
## [28,] 0.0000000 11.738945 90.6003777
## [29,] 0.0000000 13.243871 89.0954521
## [30,] 0.0000000 2.221698 100.1176247
##
## $prob
## [,1] [,2] [,3]
## [1,] 0.845050107 0.11000270 0.04494719
## [2,] 0.845050107 0.11798196 0.03696793
## [3,] 0.846991593 0.14647939 0.00652902
## [4,] 0.855875253 0.10715681 0.03696793
## [5,] 0.846991593 0.14647939 0.00652902
## [6,] 0.855875253 0.13759573 0.00652902
## [7,] 0.845050107 0.11798196 0.03696793
## [8,] 0.845050107 0.14842087 0.00652902
## [9,] 0.845050107 0.11000270 0.04494719
## [10,] 0.855875253 0.13759573 0.00652902
## [11,] 0.845050107 0.11798196 0.03696793
## [12,] 0.000000000 0.85679953 0.14320047
## [13,] 0.008342332 0.86898112 0.12267655
## [14,] 0.025717806 0.85051215 0.12377004
## [15,] 0.008238878 0.80493736 0.18682376
## [16,] 0.019265418 0.92800617 0.05272841
## [17,] 0.016504921 0.89539973 0.08809535
## [18,] 0.000000000 0.48155209 0.51844791
## [19,] 0.016834146 0.90590875 0.07725710
## [20,] 0.000000000 0.83056855 0.16943145
## [21,] 0.019265418 0.85778906 0.12294552
## [22,] 0.016504921 0.91917772 0.06431736
## [23,] 0.016504921 0.90815701 0.07533807
## [24,] 0.000000000 0.18449475 0.81550525
## [25,] 0.000000000 0.02611556 0.97388444
## [26,] 0.010825146 0.08777376 0.90140109
## [27,] 0.000000000 0.10060980 0.89939020
## [28,] 0.000000000 0.11470611 0.88529389
## [29,] 0.000000000 0.12941136 0.87058864
## [30,] 0.000000000 0.02170913 0.97829087
##
## $class
## [1] "setosa" "setosa" "setosa" "setosa" "setosa"
## [6] "setosa" "setosa" "setosa" "setosa" "setosa"
## [11] "setosa" "versicolor" "versicolor" "versicolor" "versicolor"
## [16] "versicolor" "versicolor" "virginica" "versicolor" "versicolor"
## [21] "versicolor" "versicolor" "versicolor" "virginica" "virginica"
## [26] "virginica" "virginica" "virginica" "virginica" "virginica"
##
## $confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 11 0
## virginica 0 1 7
##
## $error
## [1] 0.03333333
#this is the code just for the confusion matrix
pred_iris_boost$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 11 0
## virginica 0 1 7
#SUM of the wrong flowers divided by the sum of all of the flowers
pred_iris_boost$error
## [1] 0.03333333
The output is included in the screenshot below:
The predictions are shown above: 11 for setosa flowers, 11 for versicolor flowers, and 8 for virginica flowers in the testing set. All 11 setosa flowers and all 7 virginica flowers were predicted correctly; 1 of the 12 versicolor flowers was predicted to be virginica (1/12 = 0.083). The total misclassification error rate was 0.0333 (calculated from 1/30).
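Note that this confusion matrix is laid out with predicted classes in the rows and observed classes in the columns, the reverse of the random forest printout. The sketch below retypes the printed matrix and recomputes the test error from it; the object names are illustrative.

```r
# Confusion matrix retyped from the printed boosting output
# (rows = predicted class, columns = observed class)
classes <- c("setosa", "versicolor", "virginica")
conf_boost <- matrix(c(11,  0, 0,
                        0, 11, 0,
                        0,  1, 7),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(predicted = classes, observed = classes))

test_error        <- 1 - sum(diag(conf_boost)) / sum(conf_boost)       # wrong / total
error_by_observed <- 1 - diag(conf_boost) / colSums(conf_boost)        # per observed class

round(test_error, 4)
round(error_by_observed, 4)
```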
ntree <- c(1, seq(20, 100, 20))
err <- numeric(length(ntree))
for (i in seq_along(ntree)){
iris_boost = boosting(Species~., data = iris_train, boos = T, mfinal = ntree[i])
#predict with the model fitted in this iteration so that err[i] reflects ntree[i]
pred_iris_boost = predict(iris_boost, newdata = iris_test)
err[i] = pred_iris_boost$error
cat(i, " ")
}
## 1 2 3 4 5 6
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "No. of Trees", ylab = "Misclassification Error")
The original output is included below:
If the plotted line is flat, the test misclassification error does not change with the number of trees, which means that fewer trees give the same results for this data set.
model1 <- lm(Petal.Length ~., data = iris_train)
summary(model1)
##
## Call:
## lm(formula = Petal.Length ~ ., data = iris_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78018 -0.15685 0.00861 0.14957 0.64142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.95644 0.30664 -3.119 0.0023 **
## Sepal.Length 0.61073 0.05505 11.094 < 2e-16 ***
## Sepal.Width -0.23524 0.09031 -2.605 0.0104 *
## Petal.Width 0.61181 0.12959 4.721 6.72e-06 ***
## Speciesversicolor 1.45652 0.18758 7.765 3.87e-12 ***
## Speciesvirginica 1.94464 0.26268 7.403 2.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2663 on 114 degrees of freedom
## Multiple R-squared: 0.9783, Adjusted R-squared: 0.9774
## F-statistic: 1030 on 5 and 114 DF, p-value: < 2.2e-16
library(MASS)
boxcox(model1)
The output of the box-cox graph is included below:
The log-likelihood in the Box-Cox plot peaks at a lambda of about 0.5, which suggests that a Box-Cox transformation will be useful for this model. A lambda of 0.5 corresponds to taking the square root of the petal length.
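The lambda value can also be extracted numerically rather than read from the plot. This is a sketch under one stated assumption: it fits the model on the full `iris` data so that it runs on its own, whereas the model above uses `iris_train`, so the value it returns may differ somewhat from the plot.

```r
library(MASS)   # provides boxcox()

data(iris)
# Assumption: full iris data instead of iris_train, to keep the sketch self-contained
fit <- lm(Petal.Length ~ ., data = iris, y = TRUE, qr = TRUE)

# plotit = FALSE returns the lambda grid (x) and the profile log-likelihood (y)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

lambda_hat <- bc$x[which.max(bc$y)]   # lambda with the highest log-likelihood
lambda_hat
```

As a rule of thumb, a lambda near 1 means no transformation is needed, near 0.5 points to a square root, and near 0 points to a log.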
model1b = lm(I(sqrt(Petal.Length)) ~., data = iris_train)
summary(model1b)
##
## Call:
## lm(formula = I(sqrt(Petal.Length)) ~ ., data = iris_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.198873 -0.039252 0.003451 0.040521 0.211371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.64992 0.07825 8.306 2.32e-13 ***
## Sepal.Length 0.13785 0.01405 9.813 < 2e-16 ***
## Sepal.Width -0.05132 0.02305 -2.227 0.0279 *
## Petal.Width 0.14961 0.03307 4.524 1.50e-05 ***
## Speciesversicolor 0.54170 0.04787 11.316 < 2e-16 ***
## Speciesvirginica 0.64576 0.06703 9.633 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06797 on 114 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9816
## F-statistic: 1270 on 5 and 114 DF, p-value: < 2.2e-16
The adjusted R-squared rose from 0.9774 to 0.9816 after the Box-Cox (square-root) transformation, so the transformation gave a small improvement in fit.
##Part 2: Filter Joins
Faculty <- data.frame(ID=c(1,2,3,7), Name=c("Grayson","Wayne", "Stark", "Grey"), Code=c("ART", "ART", "COMP", "HIST"))
Faculty
## ID Name Code
## 1 1 Grayson ART
## 2 2 Wayne ART
## 3 3 Stark COMP
## 4 7 Grey HIST
Department <- data.frame(Code=c("ART", "COMP", "ENG", "HIST"), Department_Name=c("Art Department", "Computer Science Department", "English Department", "History Department"))
Department
## Code Department_Name
## 1 ART Art Department
## 2 COMP Computer Science Department
## 3 ENG English Department
## 4 HIST History Department
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ ggplot2::margin() masks randomForest::margin()
## ✖ dplyr::select() masks MASS::select()
## ✖ purrr::when() masks foreach::when()
Department %>% semi_join(Faculty, by = "Code") -> joined_data
joined_data
## Code Department_Name
## 1 ART Art Department
## 2 COMP Computer Science Department
## 3 HIST History Department
Department %>% anti_join(Faculty, by = "Code") -> joined_data
joined_data
## Code Department_Name
## 1 ENG English Department
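The two filter joins above keep or drop rows of Department depending on whether their Code appears in Faculty, without adding any columns from Faculty. A base R sketch of the same logic, using `%in%` rather than dplyr, might look like this (the names `has_faculty`, `semi_base`, and `anti_base` are illustrative):

```r
Faculty <- data.frame(ID = c(1, 2, 3, 7),
                      Name = c("Grayson", "Wayne", "Stark", "Grey"),
                      Code = c("ART", "ART", "COMP", "HIST"))
Department <- data.frame(Code = c("ART", "COMP", "ENG", "HIST"),
                         Department_Name = c("Art Department",
                                             "Computer Science Department",
                                             "English Department",
                                             "History Department"))

# TRUE for departments that have at least one faculty member
has_faculty <- Department$Code %in% Faculty$Code

semi_base <- Department[has_faculty, ]    # same departments as semi_join()
anti_base <- Department[!has_faculty, ]   # same departments as anti_join()

semi_base
anti_base
```

Even though ART matches two faculty members, it appears only once in the result, which is the main difference between a filter join and a mutating join such as `inner_join()`.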