Some questions on this homework involve random sampling, so each student may obtain slightly different results. Because of this, I need to see both your code and the resulting output. For simplicity, and to ensure that your output matches your written explanations and answers, I will accept copies of your output pasted into Microsoft Word for this assignment. Alternatively, if you are more comfortable doing so, you may submit an HTML document produced in R Markdown. Just be sure that the HTML document displays your output and that your written answers match the displayed output. Use the penguins2 Excel file to answer all of the questions in this homework.
# load the readxl package to read the Excel file
library(readxl)
penguins <- read_excel("/Users/kamriefoster/Downloads/penguins2.xlsx")
penguins <- as.data.frame(penguins)
#check for missing values
sum(is.na(penguins))
## [1] 8
# impute missing values in each numeric column with that column's median
penguins$bill_length_mm[is.na(penguins$bill_length_mm)] <- median(penguins$bill_length_mm, na.rm = TRUE)
penguins$bill_depth_mm[is.na(penguins$bill_depth_mm)] <- median(penguins$bill_depth_mm, na.rm = TRUE)
penguins$flipper_length_mm[is.na(penguins$flipper_length_mm)] <- median(penguins$flipper_length_mm, na.rm = TRUE)
penguins$body_mass_g[is.na(penguins$body_mass_g)] <- median(penguins$body_mass_g, na.rm = TRUE)
#recheck for missing values
sum(is.na(penguins))
## [1] 0
penguins$species <- as.factor(penguins$species)
penguins$island <- as.factor(penguins$island)
penguins$year <- as.factor(penguins$year)
# randomly split the data: 90% for training, 10% for testing
index <- sample(nrow(penguins), nrow(penguins) * 0.90)
penguins_train <- penguins[index, ]
penguins_test <- penguins[-index, ]
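Because the split is random, each run (and each student) will produce slightly different results. If you want a reproducible split, you can set a seed before sampling; this is a minimal sketch, and the seed value 1 is an arbitrary example:
set.seed(1)  # arbitrary example seed; any integer gives a fixed, repeatable split
index <- sample(nrow(penguins), nrow(penguins) * 0.90)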
# load the gbm package to create a boosting model
#install.packages("gbm")
library(gbm)
## Loaded gbm 2.1.8.1
penguins_boost <- gbm(formula = body_mass_g ~ ., data = penguins_train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 3)
summary(penguins_boost)
## var rel.inf
## flipper_length_mm flipper_length_mm 35.309163
## species species 22.948697
## bill_length_mm bill_length_mm 18.142506
## bill_depth_mm bill_depth_mm 17.225648
## year year 3.403124
## island island 2.970861
When developing a model, you should use the training set. Because the training set contains 90% of the data, there is more information available for fitting an accurate model. The held-out testing set can then be used to check the model on data it has never seen: the model is developed with the training set and later evaluated for accuracy with the testing set.
In this predictive model, the most influential covariate is flipper_length_mm, which has the largest relative influence in the summary output above. A screenshot of the output is included:
plot(penguins_boost, i = "flipper_length_mm") # partial dependence of predicted body mass on flipper length
penguins_boost_pred_test <- predict(penguins_boost, penguins_test, n.trees = 10000)
mean((penguins_test$body_mass_g - penguins_boost_pred_test)^2)
## [1] 134149
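To express this error on the original scale of the response, take the square root to get the RMSE in grams; for the run shown above this is roughly sqrt(134149) ≈ 366 g, though your value will differ slightly because of the random split:
sqrt(mean((penguins_test$body_mass_g - penguins_boost_pred_test)^2)) # test RMSE in grams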
ntree <- seq(100, 10000, 100)
predmat <- predict(penguins_boost, newdata = penguins_test, n.trees = ntree)
#predmat
dim(predmat) # one row per test observation, one column per value of n.trees
## [1] 35 100
# apply() computes a function over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
# the 2 in the middle means the mean is taken down each column; a 1 would instead average across each row
# each column of predmat holds the predictions for one value of n.trees, so err is the test MSE at each tree count
err <- apply((predmat - penguins_test$body_mass_g)^2, 2, mean)
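As a quick illustration of the MARGIN argument, here is a toy example; the matrix m below is made up purely for demonstration:
m <- matrix(1:6, nrow = 2)  # toy 2 x 3 matrix
apply(m, 1, mean)           # MARGIN = 1: one mean per row (2 values)
apply(m, 2, mean)           # MARGIN = 2: one mean per column (3 values)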
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "n.trees", ylab = "Test MSE")
The original output of the code and the resulting plot are included in the screenshots below:
Based on the graph provided above, it is clear that the overall test MSE would improve if the number of trees were decreased from the 10,000 fit earlier. The sweet spot appears to be under 2,000 trees, and likely closer to about 1,000. This conclusion was drawn by locating the minimum value of the line graph created above.
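Rather than reading the minimum off the plot by eye, it can be located directly from the err and ntree objects computed above:
ntree[which.min(err)] # number of trees that minimizes the test MSE
min(err)              # the test MSE at that minimum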
#install.packages('randomForest')
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
penguins_rf <- randomForest(body_mass_g ~ ., data = penguins_train, importance = TRUE, ntree = 10000)
penguins_rf # mean of square residuals in the output is the out-of-bag mean squared error
##
## Call:
## randomForest(formula = body_mass_g ~ ., data = penguins_train, importance = TRUE, ntree = 10000)
## Type of random forest: regression
## Number of trees: 10000
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 106442.2
## % Var explained: 83.42
A copy of the original output:
With 83.42% of the variation explained by the model, the random forest is an overall good fit for the data. The percent of variation explained is analogous to an R-squared value, so the closer it is to 100%, the better the model fits the data. The model is by no means perfect but would provide decent results when used.
The out-of-bag error is equivalent to the mean of squared residuals, which in the output shown above is 106442.2.
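These quantities can also be read directly from the fitted object; randomForest stores mse and rsq as per-tree vectors, so the last entries give the final out-of-bag values:
tail(penguins_rf$mse, 1)        # final out-of-bag MSE (the mean of squared residuals)
tail(penguins_rf$rsq, 1) * 100  # final percent of variance explained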
penguins_rf$importance
## %IncMSE IncNodePurity
## species 451551.970 51696560
## island 107215.680 14089178
## bill_length_mm 101510.818 23279172
## bill_depth_mm 119892.350 38166879
## flipper_length_mm 362822.201 57451183
## year 4527.484 2766736
The output shows that the two most important covariates in the random forest model are species and flipper_length_mm, which have the highest %IncMSE values at 451551.970 and 362822.201, respectively. A screenshot of the original output has been included below:
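The same ranking can be displayed graphically with the importance plot built into randomForest:
varImpPlot(penguins_rf) # dotcharts of %IncMSE and IncNodePurity for each covariate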
penguins_rf_pred <- predict(penguins_rf, penguins_test)
# rather than the testing set, any data frame with the same columns can be supplied as the second argument to make predictions on
mean((penguins_test$body_mass_g - penguins_rf_pred)^2)
## [1] 112297.9
# changed the bagging predictions to the predictions from the random forest
The screenshot includes the code and the test MSE for the random forest model. Based on the displayed outputs, the random forest model is the better fit: it gives a test MSE of 112297.9, while the boosting model gave a test MSE of 134149. The lower the MSE, the better the model is at predicting values and fitting the data.
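For a direct comparison, both test MSEs can be computed side by side, reusing the prediction objects created above:
boost_mse <- mean((penguins_test$body_mass_g - penguins_boost_pred_test)^2)
rf_mse <- mean((penguins_test$body_mass_g - penguins_rf_pred)^2)
c(boosting = boost_mse, random_forest = rf_mse) # lower is better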
plot(penguins_rf$mse, type = 'l', col = 2, lwd = 2, xlab = "ntree", ylab = "OOB Error")
# plot the out-of-bag MSE stored for each number of trees; type = 'l' draws a line graph, col = 2 makes the line red, lwd = 2 sets the line width, and xlab/ylab give the axis titles
According to the plot generated above, the number of trees for the random forest could be much lower than the 10,000 used in this example. The minimum value occurs well under 2,000 trees. After about 500-1,000 trees the OOB error levels out, and there is no evidence that the results become more accurate past 1,000 trees. A screenshot of the originally generated graph is provided below:
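As with the boosting model, the exact minimum can be located programmatically from the per-tree OOB errors:
which.min(penguins_rf$mse) # tree count with the lowest out-of-bag MSE
min(penguins_rf$mse)       # the out-of-bag MSE at that point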