Homework 6

Some questions on this homework involve random sampling, and so each student may obtain slightly different results. Because of this, I need to see both your code and resulting output. For simplicity, and to ensure that your output matches your textual explanations and answers, I will accept copies of your output placed in Microsoft Word for this assignment. However, if you feel more comfortable doing so, you may instead submit an HTML document produced in R Markdown. Just be sure that the HTML document displays your output and that your written answers match the displayed output. Use the penguins2 Excel file to answer all of the questions in this homework.

  1. Load the penguins2 Excel file into R and convert the dataset into a data frame.
#import the readxl library for reading Excel files
library(readxl)

penguins <- read_excel("/Users/kamriefoster/Downloads/penguins2.xlsx")

#convert the tibble returned by read_excel into a plain data frame
penguins <- as.data.frame(penguins)
  2. Check for missing values in the dataset. For any missing value(s) in a numeric column(s) of data, impute the average of the column. (Note: You should perform this imputation in R and not in Excel.)
#check for missing values 
sum(is.na(penguins))
## [1] 8
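The total above does not show where the missing values live. As a quick optional check before imputing, the per-column counts can be inspected with base R; a minimal sketch:

#count the missing values separately in each column
colSums(is.na(penguins))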
#impute missing values with the mean (average) of each column, as the question asks
penguins$bill_length_mm[is.na(penguins$bill_length_mm)] <- mean(penguins$bill_length_mm, na.rm = TRUE)

penguins$bill_depth_mm[is.na(penguins$bill_depth_mm)] <- mean(penguins$bill_depth_mm, na.rm = TRUE)

penguins$flipper_length_mm[is.na(penguins$flipper_length_mm)] <- mean(penguins$flipper_length_mm, na.rm = TRUE)

penguins$body_mass_g[is.na(penguins$body_mass_g)] <- mean(penguins$body_mass_g, na.rm = TRUE)

#recheck for missing values
sum(is.na(penguins))
## [1] 0
  3. Convert all non-numeric columns of data into factors.
#convert the categorical columns (plus year, which is treated as a category here) to factors
penguins$species <- as.factor(penguins$species)
penguins$island <- as.factor(penguins$island)
penguins$year <- as.factor(penguins$year)
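To confirm the conversions took effect, the column types can be inspected with base R's str(); a small optional check:

#verify that species, island, and year are now factors
str(penguins)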
  4. Randomly sample the rows of your data to include 90% of the rows in your training set. Use the rest of the rows as your testing set.
#randomly select 90% of the row indices for the training set; the rest form the testing set
index <- sample(nrow(penguins), nrow(penguins) * 0.90)
penguins_train <- penguins[index, ]
penguins_test <- penguins[-index, ]
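Because the split is random, the numbers in the rest of this homework will differ slightly from run to run, as the instructions note. If a reproducible split were ever wanted, a seed could be set before sampling; a minimal sketch (the seed value 123 is an arbitrary choice):

#setting a seed makes sample() draw the same rows every run
set.seed(123)
index <- sample(nrow(penguins), nrow(penguins) * 0.90)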
  5. Using 10,000 trees that each have 3 splits, develop a boosting model to predict the body mass of a penguin. In creating your model, use a shrinkage/weight of 0.01.
#load the gbm package to create a boosting model (install it first if needed)
#install.packages("gbm")
library(gbm)
## Loaded gbm 2.1.8.1
#interaction.depth = 3 gives each tree 3 splits; shrinkage = 0.01 is the learning weight
penguins_boost <- gbm(formula = body_mass_g ~ ., data = penguins_train,
                      distribution = "gaussian", n.trees = 10000,
                      shrinkage = 0.01, interaction.depth = 3)
summary(penguins_boost) #table of each covariate's relative influence

##                                 var   rel.inf
## flipper_length_mm flipper_length_mm 35.309163
## species                     species 22.948697
## bill_length_mm       bill_length_mm 18.142506
## bill_depth_mm         bill_depth_mm 17.225648
## year                           year  3.403124
## island                       island  2.970861
  6. Does it make more sense to use the training set or testing set when developing this model?

When developing a model you should use the training set. Since the training set contains 90% of the rows, the model is fit on most of the available information, which makes an accurate model more likely. The held-out testing set can then be used to evaluate how well the finished model generalizes to data it has never seen; fitting and evaluating on the same rows would give an overly optimistic estimate of accuracy.

  7. Which covariate is most influential in your predictive model? In answering this question, be sure to provide your output from your code listing the relative influence of each covariate.

In this predictive model the most influential covariate is flipper_length_mm, which has the largest relative influence (about 35.3 in the output above). A screenshot of the output is included:
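If you prefer to pull the top covariate out programmatically instead of reading it off the table, summary.gbm can return the influence data frame without drawing the plot (plotit is a documented argument of summary.gbm); a small sketch:

#rows are sorted by relative influence, so the first row is the top covariate
influence <- summary(penguins_boost, plotit = FALSE)
influence[1, ]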

  8. Create a graph displaying the relationship (as determined in your model) between the target variable and the most influential covariate.
plot(penguins_boost, i = "flipper_length_mm") #partial dependence of predicted body mass on flipper length

  9. Use the testing set to calculate your model’s mean squared error (MSE).
#predict on the testing set using all 10,000 trees
penguins_boost_pred_test <- predict(penguins_boost, penguins_test, n.trees = 10000)
#average the squared prediction errors to get the test MSE
mean((penguins_test$body_mass_g - penguins_boost_pred_test)^2)
## [1] 134149
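Since the MSE is in squared grams, taking its square root gives a typical prediction error in grams, which can be easier to interpret; a small optional sketch using the objects above:

#RMSE: typical size of a prediction error, in grams
sqrt(mean((penguins_test$body_mass_g - penguins_boost_pred_test)^2))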

  10. Create a graph displaying the test MSE when various numbers of trees are used in the boosting model. Based on this graph, could the MSE be improved by using a smaller number of trees? Be sure to provide the instructor with all output produced by your code.
#evaluate the boosting model at every 100 trees from 100 to 10,000
ntree <- seq(100, 10000, 100)

#predict() returns a matrix with one column of test-set predictions per value of n.trees
predmat <- predict(penguins_boost, newdata = penguins_test, n.trees = ntree)
dim(predmat)
## [1]  35 100
#apply() with a 2 in the middle takes the mean of each column, giving one test MSE per tree count
#(a 1 instead of a 2 would average across each row of the matrix)
err <- apply((predmat - penguins_test$body_mass_g)^2, 2, mean)
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "n.trees", ylab = "Test MSE")

The original output of the code and the resulting plot are provided in the screenshots below:

Based on the graph provided above, it is clear that the MSE could be improved somewhat by using fewer than the 10,000 trees created earlier. The sweet spot looks to be under 2,000 trees and is likely closer to about 1,000. This conclusion was drawn by locating the minimum value on the line graph above.
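Rather than eyeballing the plot, the minimum can be located directly from the ntree and err objects computed above (the exact values will vary with each random split); a small sketch:

#number of trees with the lowest test MSE, and that MSE
ntree[which.min(err)]
min(err)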

  11. Now create a random forest of 10,000 trees to predict the body mass of a penguin. Be sure to provide the instructor with the output that you use to answer each part of this question below.
#install.packages('randomForest')
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
penguins_rf <- randomForest(body_mass_g~., data = penguins_train, importance = TRUE, ntree = 10000)

penguins_rf # mean of square residuals in the output is the out-of-bag mean squared error
## 
## Call:
##  randomForest(formula = body_mass_g ~ ., data = penguins_train,      importance = TRUE, ntree = 10000) 
##                Type of random forest: regression
##                      Number of trees: 10000
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 106442.2
##                     % Var explained: 83.42

A copy of the original output:

  12. Based on the percent of variation explained by the model, do you believe that the random forest fits your data well?

With 83.42% of the variation explained by the model, the random forest fits the data fairly well. The percent of variation explained is analogous to an R-squared value, so the closer it is to 100%, the better the model fits the data. The model is by no means perfect but should provide reasonably accurate predictions when used.
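The same figure can be pulled from the fitted object: a randomForest regression fit stores a per-tree pseudo R-squared in its rsq component, and the last element corresponds to the full forest; a small sketch:

#% variation explained by the complete 10,000-tree forest
tail(penguins_rf$rsq, 1) * 100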

  13. What is the out-of-bag error for the random forest?

The out-of-bag error is equivalent to the mean of squared residuals, which in the output above is 106442.2.
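This value is also stored in the fitted object: the mse component holds the OOB MSE after each successive tree, so the final element is the forest's out-of-bag error; a small sketch:

#out-of-bag MSE of the complete forest
tail(penguins_rf$mse, 1)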

  14. What are the two most important covariates in your random forest model?
penguins_rf$importance
##                      %IncMSE IncNodePurity
## species           451551.970      51696560
## island            107215.680      14089178
## bill_length_mm    101510.818      23279172
## bill_depth_mm     119892.350      38166879
## flipper_length_mm 362822.201      57451183
## year                4527.484       2766736

The output shows that the two most important covariates in the random forest model are species and flipper_length_mm, which have the highest %IncMSE values at 451551.970 and 362822.201 respectively. A screenshot of the original output has been included below:
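The randomForest package can also display these importances graphically with varImpPlot, which makes the gap between the top two covariates and the rest easy to see; a small optional sketch:

#dotchart of %IncMSE and IncNodePurity for each covariate
varImpPlot(penguins_rf)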

  15. Use the testing set to calculate the MSE for your random forest model. Based on your results, is the random forest more accurate than the boosting model? Or is your boosting model more accurate than your random forest?
#predict body mass for the held-out testing rows
penguins_rf_pred <- predict(penguins_rf, penguins_test)
#any other dataset could be supplied in place of penguins_test to predict on different rows

mean((penguins_test$body_mass_g - penguins_rf_pred)^2)
## [1] 112297.9

The screenshot includes the code and the MSE for the random forest model. Based on the results above, the random forest is the more accurate model: its test MSE of 112297.9 is lower than the boosting model's test MSE of 134149. The lower the MSE, the better the model is at predicting new values.
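For a side-by-side view, the two test MSEs can be collected into a named vector; this sketch simply recomputes both from the prediction objects created earlier:

#compare the test MSEs of the two models
c(boosting = mean((penguins_test$body_mass_g - penguins_boost_pred_test)^2),
  random_forest = mean((penguins_test$body_mass_g - penguins_rf_pred)^2))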

  16. Create a graph comparing the out-of-bag error to the number of trees used in the random forest model. Based on this graph, does it appear that a large number of trees is needed to develop a fairly accurate random forest model?
#penguins_rf$mse stores the OOB error after each successive tree is added
plot(penguins_rf$mse, type = 'l', col = 2, lwd = 2, xlab = "ntree", ylab = "OOB Error")

#type = 'l' draws a line graph, col = 2 makes it red, lwd = 2 sets the line width,
#and xlab/ylab set the axis titles

According to the plot generated above, the number of trees for the random forest could be much lower than the 10,000 used in this example. The OOB error levels out after about 500-1,000 trees, and there is no evidence that the results become more accurate past 1,000 trees, so a large number of trees is not needed for a fairly accurate model. A screenshot of the originally generated graph is provided below:
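As with the boosting model, the tree count that minimizes the OOB error can be read off the stored mse vector rather than the plot (the exact values will vary with each run); a small sketch:

#tree count where the OOB error is lowest, and that error
which.min(penguins_rf$mse)
min(penguins_rf$mse)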