## Data processing
First, we process the dataset. Then, we split it into training and test sets.
Code for data processing:
set.seed(123)
outbreaks <- read.csv("C:/Users/abbie/Downloads/foodborne/outbreaks.csv")
# code the categorical variables as factors
outbreaks$Month = as.factor(outbreaks$Month)
outbreaks$State = as.factor(outbreaks$State)
outbreaks$Location = as.factor(outbreaks$Location)
# Location is then stored as its numeric factor code
outbreaks$Location = as.numeric(outbreaks$Location)
Code for removing unused columns and splitting the data:
outbreaks$State = NULL
outbreaks$Food = NULL
outbreaks$Ingredient = NULL
outbreaks$Species = NULL
outbreaks$Serotype.Genotype = NULL
outbreaks$Status = NULL
summary(outbreaks)
Year Month Location Illnesses
Min. :1998 May :1898 Min. : 1.00 Min. : 2.00
1st Qu.:2001 June :1819 1st Qu.: 57.00 1st Qu.: 3.00
Median :2005 December:1816 Median :100.00 Median : 8.00
Mean :2006 April :1725 Mean : 76.78 Mean : 19.54
3rd Qu.:2010 March :1724 3rd Qu.:100.00 3rd Qu.: 19.00
Max. :2015 July :1538 Max. :162.00 Max. :1939.00
(Other) :8599
Hospitalizations Fatalities
Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.000 Median : 0.000
Mean : 0.948 Mean : 0.022
3rd Qu.: 1.000 3rd Qu.: 0.000
Max. :308.000 Max. :33.000
NA's :3625 NA's :3601
outbreaks = na.omit(outbreaks)
# split the data into training and test sets
n = nrow(outbreaks)
n_training = round(n * 0.75)
n_test = n - n_training
index = sample(n, size = n_training)
training_data = outbreaks[index ,]
test_data = outbreaks[-index ,]
Every year, almost 48 million Americans become ill after eating contaminated food. Close to 250 known toxins and pathogens can cause foodborne illness, and each of these has the potential to start an outbreak.
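We first fit a multiple linear regression (MLR) model on the training set. The fitting chunk is not shown in the report; reconstructed from the Call line of the summary below, it is:
lm_fit <- glm(Illnesses ~ ., data = training_data)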
The summary:
Call:
glm(formula = Illnesses ~ ., data = training_data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 971.520034 156.685211 6.200 5.83e-10 ***
Year -0.471946 0.078027 -6.048 1.51e-09 ***
MonthAugust -5.373667 1.905996 -2.819 0.00482 **
MonthDecember -2.181348 1.825397 -1.195 0.23211
MonthFebruary -1.151633 1.954906 -0.589 0.55581
MonthJanuary -1.467636 1.927442 -0.761 0.44641
MonthJuly -5.405970 1.917798 -2.819 0.00483 **
MonthJune -3.073819 1.830068 -1.680 0.09306 .
MonthMarch -3.069269 1.850049 -1.659 0.09714 .
MonthMay -2.533180 1.806633 -1.402 0.16090
MonthNovember -0.367757 1.933517 -0.190 0.84915
MonthOctober -3.797262 1.979633 -1.918 0.05512 .
MonthSeptember -4.059560 2.005458 -2.024 0.04297 *
Location -0.080541 0.009603 -8.387 < 2e-16 ***
Hospitalizations 4.371968 0.074948 58.333 < 2e-16 ***
Fatalities -13.310544 1.030781 -12.913 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1776.987)
Null deviance: 26767740 on 11358 degrees of freedom
Residual deviance: 20156358 on 11343 degrees of freedom
AIC: 117249
Number of Fisher Scoring iterations: 2
The following coefficients are statistically significant at the 5% level: Year, Location, Hospitalizations, Fatalities, and the August, July, and September levels of Month (Month has twelve levels, with April as the baseline).
The test error:
lm_pred = predict(lm_fit, newdata = test_data)
(lm_test_error <- mean((test_data$Illnesses - lm_pred)^2))
[1] 1720.342
The distribution of the response variable, Illnesses, is heavily right-skewed; even so, no transformation is applied, so all errors below are reported on the original scale.
We now fit an LDA (linear discriminant analysis) model on the training set.
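The code for this step is not shown. LDA needs a categorical response, so Illnesses must have been discretized first; the following is a minimal sketch under that assumption (the cut point and variable names here are hypothetical, not the report's):
library(MASS)
# hypothetical binning of the response; the actual coding used is not shown
cutpoint <- median(training_data$Illnesses)
train_lda <- transform(training_data, HighIll = factor(Illnesses > cutpoint))
test_lda <- transform(test_data, HighIll = factor(Illnesses > cutpoint))
lda_fit <- lda(HighIll ~ . - Illnesses, data = train_lda)
lda_pred <- predict(lda_fit, newdata = test_lda)$class
mean(lda_pred != test_lda$HighIll) # classification error rate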
The test error:
[1] 0.874989
Note that this test error is based on classification, so it is not directly comparable with the previous test errors, which are mean squared errors on the original response.
We now fit a ridge regression model on the training set.
library(glmnet)
Y = training_data$Illnesses
X = as.matrix(training_data[, colnames(training_data) != "Illnesses"])
X_test = as.matrix(test_data[, colnames(test_data) != "Illnesses"])
ridge_fit <- cv.glmnet(y = Y, x = X, alpha = 0)
(This call emits repeated "NAs introduced by coercion" warnings, one per cross-validation fold: as.matrix() on a data frame containing the Month factor yields a character matrix, which glmnet coerces to numeric with NAs, so Month is effectively dropped. Building X with model.matrix(Illnesses ~ ., training_data)[, -1] would avoid this.)
The cross-validation plot shows that the ridge penalty fails to reduce the cross-validated error.
The optimal penalty parameter \(\lambda\) selected by cross-validation:
[1] 2.323024
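The test error below is presumably computed along these lines (a sketch; the report omits the chunk, and the use of lambda.min is an assumption):
ridge_pred <- predict(ridge_fit, newx = X_test, s = "lambda.min")
mean((test_data$Illnesses - ridge_pred)^2)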
The test error:
[1] 1724.468
The test error is slightly higher than that of the MLR model.
We fit a LASSO model on the training set.
Y = training_data$Illnesses
X = as.matrix(training_data[, colnames(training_data) != "Illnesses"])
library(glmnet)
lasso_fit = cv.glmnet(y = Y, x = X, alpha = 1)
(The same coercion warnings appear here, for the same reason as in the ridge fit.)
Like ridge regression, the cross-validation plot shows that the LASSO penalty also fails to reduce the cross-validated error.
The optimum \(\lambda\):
[1] 0.06616054
The corresponding fitted coefficients:
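These are presumably extracted with coef() at the selected \(\lambda\) (the use of lambda.min is an assumption):
coef(lasso_fit, s = "lambda.min")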
6 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 924.71763378
Year -0.45007107
Month .
Location -0.07701458
Hospitalizations 4.34237475
Fatalities -13.05431658
The optimal LASSO removes only one predictor, Month, from the model (its coefficient is shrunk to zero).
The test error:
[1] 1721.119
The test error is slightly higher than that of the MLR model, so the LASSO penalty is not helpful for this data.
We fit a PCR model on the training set.
library(pls)
pcr_fit = pcr(Illnesses ~ ., data = training_data, scale = TRUE, validation = "CV")
summary(pcr_fit)
Data: X dimension: 11359 15
Y dimension: 11359 1
Fit method: svdpc
Number of components considered: 15
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 48.55 46.50 46.56 46.56 46.56 46.56 46.57
adjCV 48.55 46.42 46.47 46.47 46.47 46.48 46.48
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 46.62 46.6 46.59 46.57 46.58 46.58 46.5
adjCV 46.53 46.5 46.49 46.48 46.49 46.49 46.4
14 comps 15 comps
CV 42.26 42.26
adjCV 42.25 42.26
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 9.719 17.52 24.92 32.29 39.63 46.93 54.18
Illnesses 11.977 12.09 12.09 12.09 12.09 12.10 12.10
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 61.39 68.59 75.77 82.94 90.09 95.58 99.33
Illnesses 12.21 12.24 12.26 12.37 12.38 12.77 24.67
15 comps
X 100.0
Illnesses 24.7
Validation plot:
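The MSEP values and the selected component count below are presumably computed with pls::MSEP along these lines (a sketch; the report omits the chunk and the object names are assumptions):
msep_cv <- drop(MSEP(pcr_fit)$val["CV", 1, ]) # cross-validated MSEP for 0-15 components
msep_cv
which.min(msep_cv[-1]) # component count minimizing the CV MSEP (excluding the intercept-only model)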
MSEP:
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
2356.938 2161.994 2167.554 2167.663 2167.819 2168.266
6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
2168.967 2173.105 2171.369 2170.492 2168.606 2170.146
12 comps 13 comps 14 comps 15 comps
2169.914 2161.866 1785.812 1786.182
[1] 14
The cross-validated MSEP is minimized at 14 components (1785.812), and using all 15 components (i.e., no dimension reduction) gives nearly the same value. We therefore use 14 principal components for this data.
The training error:
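A sketch of the computation, mirroring the prediction chunks shown elsewhere in the report:
pcr_pred_train <- predict(pcr_fit, newdata = training_data, ncomp = 14)
mean((pcr_pred_train - training_data$Illnesses)^2)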
[1] 1775.123
The test error:
pcr_pred_test <- predict(pcr_fit, newdata = test_data, ncomp = 14)
mean((pcr_pred_test - test_data$Illnesses)^2)
[1] 1720.75
PCR's test error is lower than those of ridge and LASSO, though still marginally above the MLR model's.
We fit a PLS model on the training set.
pls_fit = plsr(Illnesses ~ ., data = training_data, scale = TRUE, validation = "CV")
summary(pls_fit)
Data: X dimension: 11359 15
Y dimension: 11359 1
Fit method: kernelpls
Number of components considered: 15
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 48.55 45.61 42.52 42.45 42.46 42.46 42.46
adjCV 48.55 45.39 42.49 42.43 42.43 42.44 42.44
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 42.46 42.46 42.46 42.46 42.46 42.46 42.46
adjCV 42.44 42.44 42.44 42.44 42.44 42.44 42.44
14 comps 15 comps
CV 42.46 42.46
adjCV 42.44 42.44
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 8.892 13.58 20.26 25.54 27.54 34.62 41.88
Illnesses 20.728 24.65 24.69 24.69 24.70 24.70 24.70
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 49.14 56.47 63.7 71.0 78.23 85.54 92.8
Illnesses 24.70 24.70 24.7 24.7 24.70 24.70 24.7
15 comps
X 100.0
Illnesses 24.7
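As in the PCR section, the MSEP values and the selected size shown below can be obtained with pls::MSEP (again a sketch, with assumed names):
msep_pls <- drop(MSEP(pls_fit)$val["CV", 1, ])
msep_pls
which.min(msep_pls[-1])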
MSEP:
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
2356.938 2080.428 1808.056 1802.213 1802.431 1802.980
6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
1802.896 1802.898 1802.898 1802.898 1802.898 1802.898
12 comps 13 comps 14 comps 15 comps
1802.898 1802.898 1802.898 1802.898
[1] 3
The cross-validated MSEP is essentially flat beyond 3 components, so we use 3 PLS components for this data.
The training error:
pcr_pred_train <- predict(pls_fit, newdata = training_data, ncomp = 3)
mean((pcr_pred_train - training_data$Illnesses)^2)
[1] 1774.803
The test error:
pcr_pred_test <- predict(pls_fit, newdata = test_data, ncomp = 3)
mean((pcr_pred_test - test_data$Illnesses)^2)
[1] 1720.444
PLS's test error (1720.444) is essentially tied with the MLR model's (1720.342) and below those of ridge, LASSO, and PCR.
We fit a Decision Tree model on the training set.
library(tree)
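The fitting chunk is not shown; it is reconstructed here from the printed output below (tree_fit is the name used later in the prediction chunk):
tree_fit <- tree(Illnesses ~ ., data = training_data)
tree_fit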
Regression tree:
tree(formula = Illnesses ~ ., data = training_data)
Variables actually used in tree construction:
[1] "Hospitalizations" "Location"
Number of terminal nodes: 6
Residual mean deviance: 1790 = 20330000 / 11350
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-573.900 -12.110 -8.111 0.000 1.564 1575.000
The two predictors actually used to construct the tree are Hospitalizations and Location. The tree has 6 terminal nodes, and the residual mean deviance (training MSE) is 1790. Tree plot:
The test error:
pred_test <- predict(tree_fit, newdata = test_data)
(test_MSE <- mean((test_data$Illnesses - pred_test)^2))
[1] 1655.786
The tree has a lower test error than all of the previous models.
## Bagging
We fit a bagging model on the training set.
library(randomForest)
p <- ncol(training_data) - 1
bag_fit <- randomForest(Illnesses ~ ., data = training_data, mtry = p, importance = TRUE)
bag_fit
Call:
randomForest(formula = Illnesses ~ ., data = training_data, mtry = p, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals: 2073.17
% Var explained: 12.02
The training error:
pred_train <- predict(bag_fit, newdata = training_data)
(training_MSE <- mean((training_data$Illnesses - pred_train)^2))
[1] 854.9911
The test error:
pred_test <- predict(bag_fit, newdata = test_data)
(test_MSE <- mean((test_data$Illnesses - pred_test)^2))
[1] 1833.53
With a test error of 1833.53, the bagging model performs worse than the single regression tree.
We fit a random forest model on the training set.
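The selection of mtry below via which.min(test_MSE) implies a vector of test errors over the candidate mtry values; the chunk that computes it is not shown. A sketch of what it presumably looks like (the names fit_m and pred_m are assumptions):
# compute a test-set MSE for each candidate mtry value 1..p
test_MSE <- numeric(p)
for (m in 1:p) {
  fit_m <- randomForest(Illnesses ~ ., data = training_data, mtry = m)
  pred_m <- predict(fit_m, newdata = test_data)
  test_MSE[m] <- mean((test_data$Illnesses - pred_m)^2)
}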
# choose the mtry value with the lowest test error (see the sketch above)
m_opt <- (1:p)[which.min(test_MSE)]
RF_fit <- randomForest(Illnesses ~ ., data = training_data, mtry = m_opt, importance = TRUE)
The test error:
pred_test <- predict(RF_fit, newdata = test_data)
(test_MSE <- mean((test_data$Illnesses - pred_test)^2))
[1] 1652.889
The random forest's test error is well below the bagging model's and slightly below the single tree's.
Importance of variables.
%IncMSE IncNodePurity
Year 5.6014385 1001410.3
Month 1.7440596 1194744.2
Location 13.6711068 1624047.7
Hospitalizations 27.6308585 5479410.9
Fatalities -0.4881926 940522.9
The random forest finds that Hospitalizations and Location are among the most important variables.
## Boosting
We fit a boosting model on the training set.
library(gbm)
boosting_fit = gbm(Illnesses ~ ., data = training_data, distribution = "gaussian", n.trees = 500, interaction.depth = 2)
summary(boosting_fit)
var rel.inf
Hospitalizations Hospitalizations 61.395021
Month Month 20.748780
Location Location 7.639004
Fatalities Fatalities 5.436828
Year Year 4.780366
The boosting method also finds that Hospitalizations is the most important variable. However, unlike bagging and the random forest, boosting ranks Month as more important than Year.
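The test error below is presumably computed along these lines (a sketch; the chunk is omitted in the report):
boost_pred <- predict(boosting_fit, newdata = test_data) # gbm reports "Using 500 trees..."
mean((test_data$Illnesses - boost_pred)^2)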
The test error:
Using 500 trees...
[1] 1633.911
Among all of the methods considered, boosting gives the lowest test error (1633.911), followed closely by the random forest (1652.889) and the single regression tree (1655.786). The tree-based methods therefore outperform the linear approaches on this data; boosting is the best of them, though the random forest and tree models remain competitive.