Week 6 Homework

Load the penguins2 excel file into R, and convert the dataset into a data frame.

pen <- read_excel("C:/Users/justt/Desktop/School/621/Assignment/Homework 6/penguins2.xlsx")
pen <- data.frame(pen)
colnames(pen)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "year"
str(pen)
## 'data.frame':    344 obs. of  7 variables:
##  $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: num  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : num  3750 3800 3250 NA 3450 ...
##  $ year             : num  2007 2007 2007 2007 2007 ...

Check for missing values in the dataset. For any missing value(s) in a numeric column(s) of data, impute the average of the column. (Note: You should perform this imputation in R and not in excel.)

data.frame(num_missing=colSums(is.na(pen)))
##                   num_missing
## species                     0
## island                      0
## bill_length_mm              2
## bill_depth_mm               2
## flipper_length_mm           2
## body_mass_g                 2
## year                        0
imp_bill_length_mm <- mean(pen$bill_length_mm, na.rm = T)
pen$bill_length_mm[is.na(pen$bill_length_mm)==T] <- imp_bill_length_mm
imp_bill_depth_mm <- mean(pen$bill_depth_mm, na.rm = T)
pen$bill_depth_mm[is.na(pen$bill_depth_mm)==T] <- imp_bill_depth_mm
imp_flipper_length_mm <- mean(pen$flipper_length_mm, na.rm = T)
pen$flipper_length_mm[is.na(pen$flipper_length_mm)==T] <- imp_flipper_length_mm
imp_body_mass_g <- mean(pen$body_mass_g, na.rm = T)
pen$body_mass_g[is.na(pen$body_mass_g)==T] <- imp_body_mass_g
data.frame(num_missing=colSums(is.na(pen)))
##                   num_missing
## species                     0
## island                      0
## bill_length_mm              0
## bill_depth_mm               0
## flipper_length_mm           0
## body_mass_g                 0
## year                        0

Convert all non-numeric columns of data into factors.

str(pen)
## 'data.frame':    344 obs. of  7 variables:
##  $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 43.9 36.7 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 17.2 19.3 ...
##  $ flipper_length_mm: num  181 186 195 201 193 ...
##  $ body_mass_g      : num  3750 3800 3250 4202 3450 ...
##  $ year             : num  2007 2007 2007 2007 2007 ...
pen$species <- as.factor(pen$species)
pen$island <- as.factor(pen$island)
str(pen)
## 'data.frame':    344 obs. of  7 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 43.9 36.7 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 17.2 19.3 ...
##  $ flipper_length_mm: num  181 186 195 201 193 ...
##  $ body_mass_g      : num  3750 3800 3250 4202 3450 ...
##  $ year             : num  2007 2007 2007 2007 2007 ...

Randomly sample the rows of your data to include 90% of the rows in your training set. Use the rest of the rows as your testing set.

index <- sample(nrow(pen), nrow(pen)*0.90)
pen_train = pen[index,]
pen_test = pen[-index,]

Using 10,000 trees that each have 3 splits, develop a boosting model to predict the body mass of a penguin. In creating your model, use a shrinkage/weight of 0.01.

# pen_lm <- glm(body_mass_g~., data = pen)
# pen_lm
# summary(pen_lm)
pen_boost <- gbm(formula = body_mass_g ~., data=pen_train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 3)
summary(pen_boost)

##                                 var   rel.inf
## flipper_length_mm flipper_length_mm 42.501225
## bill_length_mm       bill_length_mm 17.652297
## bill_depth_mm         bill_depth_mm 17.362608
## species                     species 16.541087
## year                           year  3.048610
## island                       island  2.894173
  1. Does it make more sense to use the training set or testing set when developing this model?

It makes more sense to use the training set for developing the model. We will then use the testing set to validate our predictions.

  1. Which covariate is most influential in your predictive model? In answering this question, be sure to provide your output from your code listing the relative influence of each covariate.
summary(pen_boost)

##                                 var   rel.inf
## flipper_length_mm flipper_length_mm 42.501225
## bill_length_mm       bill_length_mm 17.652297
## bill_depth_mm         bill_depth_mm 17.362608
## species                     species 16.541087
## year                           year  3.048610
## island                       island  2.894173

See graph and table in the included word doc as 5.b)

In the table above we can see that flipper_length_mm is the most relevant covariate with relative influence of 39.488468.

Create a graph displaying the relationship (as determined in your model) between the target variable and the most influential covariate.

plot(pen_boost, i = "flipper_length_mm")

See graph in the included word doc as 5.c) which is the relationship between target (body_mass_g) and the most influential covariate (flipper_length_mm).

  1. Use the testing set to calculate your model’s mean squared error (MSE).
pen_boost_pred_test <- predict(pen_boost, pen_test, n.trees = 10000)
mean((pen_test$body_mass_g - pen_boost_pred_test)^2)
## [1] 170387.5

The models MSE, based on the testing set is: 106410.5 See screenshot in the included word doc as 5.d)

  1. Create a graph displaying the test MSE when various amounts of trees are used in the boosting model. Based on this graph, could the MSE be improved by using a smaller number of trees? Be sure to provide the instructor with all output produced by your code.
ntree <- seq(100, 10000, 100)
predmat <- predict(pen_boost, newdata = pen_test, n.trees = ntree)
dim(predmat)
## [1]  35 100
err <- apply((predmat-pen_test$body_mass_g)^2, 2, mean) 
plot(ntree, err, type = 'l', col = 2, lwd = 2, xlab = "n.trees", ylab = "Test MSE")
axis(1, at = ntree, las = 1)

According to the graph, the number of trees used could be improved by using approximately 400 trees. See screenshot in the included word doc as 5.e), there are 2 graphs at the end.

Now create a random forest of 10,000 trees to predict the body mass of a penguin. Be sure to provide the instructor with the output that you use to answer each part of this question below.

pen_rf <-randomForest(body_mass_g ~., data=pen_train, importance = TRUE, ntree = 10000)
pen_rf 
## 
## Call:
##  randomForest(formula = body_mass_g ~ ., data = pen_train, importance = TRUE,      ntree = 10000) 
##                Type of random forest: regression
##                      Number of trees: 10000
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 105105.1
##                     % Var explained: 83.54
pen_rf$importance
##                      %IncMSE IncNodePurity
## species           347956.198      46364227
## island             90713.419      12461619
## bill_length_mm    110456.389      25124065
## bill_depth_mm     114280.632      37464967
## flipper_length_mm 402479.225      62634393
## year                5700.136       2587601
  1. Based on the percent of variation explained by the model, do you believe that the random forest fits your data well?
pen_rf
## 
## Call:
##  randomForest(formula = body_mass_g ~ ., data = pen_train, importance = TRUE,      ntree = 10000) 
##                Type of random forest: regression
##                      Number of trees: 10000
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 105105.1
##                     % Var explained: 83.54

The variation explained is 82.32%, meaning our model fit the data 82.32%. This is a good fit for our data. See screenshot in the included word doc as 6.a).

  1. What is the out-of-bag error for the random forest?
mean(pen_rf$mse)
## [1] 105687.4
plot(pen_rf$mse, type = 'l', col = 2, lwd = 2, xlab = "ntree", ylab = "OOB Error")
axis(1, at = ntree, las = 1)

The out-of-bag error (Average MSE for the random forest) is 111468.1. The graph shows that this occurs when where the number of trees is approx 100 trees. See screenshot in the included word doc as 6.b).

  1. What are the two most important covariates in your random forest model?
pen_rf$importance
##                      %IncMSE IncNodePurity
## species           347956.198      46364227
## island             90713.419      12461619
## bill_length_mm    110456.389      25124065
## bill_depth_mm     114280.632      37464967
## flipper_length_mm 402479.225      62634393
## year                5700.136       2587601

The two most important covariates in my random forest model are species at 360383.875% and flipper_length_mm at 332536.512%.
See screenshot in the included word doc as 6.c).

  1. Use the testing set to calculate the MSE for your random forest model. Based on your results, is the random forest more accurate than the boosting model? Or is your boosting model more accurate than your random forest?
pen_rf_pred <- predict(pen_rf, pen_test)
mean((pen_test$body_mass_g - pen_rf_pred)^2)
## [1] 119080.4
mean((pen_test$body_mass_g - pen_boost_pred_test)^2)
## [1] 170387.5

The MSE for the random forest model using the testing set is 64253.36. The MSE for the boosting model using the testing set is 106410.5. This shows that the random forest model has the lower MSE compared to the boosting model, meaning that random forest is the most accurate model. See screenshot in the included word doc as 6.d).

  1. Create a graph comparing the out-of-bag error to the number of trees used in the random forest model. Based on this graph, does it appear that a large number of trees are needed to develop a fairly accurate random forest model?
plot(pen_rf$mse, type = 'l', col = 2, lwd = 2, xlab = "ntree", ylab = "OOB Error")
axis(1, at = ntree, las = 1)

The graph shows that the minimum out-of-bag error occurs when where the number of trees is approx 100 trees. A large # of trees, such as our 10,000, are not necessary. See screenshot in the included word doc as 6.e).