## Data processing
First, we process the dataset. Then, we split it into training and test sets.
Code for data processing:
set.seed(123)
outbreaks <- read.csv("C:/Users/abbie/Downloads/foodborne/outbreaks.csv")
# code the categorical variables as factors
outbreaks$Month = as.factor(outbreaks$Month)
outbreaks$State = as.factor(outbreaks$State)
outbreaks$Location = as.factor(outbreaks$Location)
# Location is then stored as its numeric factor code
outbreaks$Location = as.numeric(outbreaks$Location)
Code for removing unused columns and splitting the data:
outbreaks$State = NULL
outbreaks$Food = NULL
outbreaks$Ingredient = NULL
outbreaks$Species = NULL
outbreaks$Serotype.Genotype = NULL
outbreaks$Status = NULL
summary(outbreaks)
Year Month Location Illnesses
Min. :1998 May :1898 Min. : 1.00 Min. : 2.00
1st Qu.:2001 June :1819 1st Qu.: 57.00 1st Qu.: 3.00
Median :2005 December:1816 Median :100.00 Median : 8.00
Mean :2006 April :1725 Mean : 76.78 Mean : 19.54
3rd Qu.:2010 March :1724 3rd Qu.:100.00 3rd Qu.: 19.00
Max. :2015 July :1538 Max. :162.00 Max. :1939.00
(Other) :8599
Hospitalizations Fatalities
Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.000 Median : 0.000
Mean : 0.948 Mean : 0.022
3rd Qu.: 1.000 3rd Qu.: 0.000
Max. :308.000 Max. :33.000
NA's :3625 NA's :3601
outbreaks = na.omit(outbreaks)
# split the data into training and test sets
n = nrow(outbreaks)
n_training = round(n * 0.75)
n_test = n - n_training
index = sample(n, size = n_training)
training_data = outbreaks[index ,]
test_data = outbreaks[-index ,]
Every year, almost 48 million Americans become ill after eating contaminated food. Close to 250 known toxins and pathogens can cause foodborne illness, and each of these has the potential to start an outbreak.
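We first fit a multiple linear regression (MLR) model on the training set. The fitting chunk is not shown in the report; reconstructed from the Call line of the summary below, it is:
lm_fit <- glm(Illnesses ~ ., data = training_data)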
The summary:
Call:
glm(formula = Illnesses ~ ., data = training_data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 971.520034 156.685211 6.200 5.83e-10 ***
Year -0.471946 0.078027 -6.048 1.51e-09 ***
MonthAugust -5.373667 1.905996 -2.819 0.00482 **
MonthDecember -2.181348 1.825397 -1.195 0.23211
MonthFebruary -1.151633 1.954906 -0.589 0.55581
MonthJanuary -1.467636 1.927442 -0.761 0.44641
MonthJuly -5.405970 1.917798 -2.819 0.00483 **
MonthJune -3.073819 1.830068 -1.680 0.09306 .
MonthMarch -3.069269 1.850049 -1.659 0.09714 .
MonthMay -2.533180 1.806633 -1.402 0.16090
MonthNovember -0.367757 1.933517 -0.190 0.84915
MonthOctober -3.797262 1.979633 -1.918 0.05512 .
MonthSeptember -4.059560 2.005458 -2.024 0.04297 *
Location -0.080541 0.009603 -8.387 < 2e-16 ***
Hospitalizations 4.371968 0.074948 58.333 < 2e-16 ***
Fatalities -13.310544 1.030781 -12.913 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1776.987)
Null deviance: 26767740 on 11358 degrees of freedom
Residual deviance: 20156358 on 11343 degrees of freedom
AIC: 117249
Number of Fisher Scoring iterations: 2
The following coefficients are statistically significant at the 5% level: Year, Location, Hospitalizations, Fatalities, and the August, July, and September levels of Month (Month has twelve levels, with April as the baseline).
The test error:
lm_pred = predict(lm_fit, newdata = test_data)
(lm_test_error <- mean((test_data$Illnesses - lm_pred)^2))
[1] 1720.342
The distribution of the response variable, Illnesses, is heavily right-skewed; even so, no transformation is applied, so all errors below are reported on the original scale.
We now fit an LDA (linear discriminant analysis) model on the training set.
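The code for this step is not shown. LDA needs a categorical response, so Illnesses must have been discretized first; the following is a minimal sketch under that assumption (the cut point and variable names here are hypothetical, not the report's):
library(MASS)
# hypothetical binning of the response; the actual coding used is not shown
cutpoint <- median(training_data$Illnesses)
train_lda <- transform(training_data, HighIll = factor(Illnesses > cutpoint))
test_lda <- transform(test_data, HighIll = factor(Illnesses > cutpoint))
lda_fit <- lda(HighIll ~ . - Illnesses, data = train_lda)
lda_pred <- predict(lda_fit, newdata = test_lda)$class
mean(lda_pred != test_lda$HighIll) # classification error rate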
The test error:
[1] 0.874989
Note that this test error is based on classification, so it is not directly comparable with the previous test errors, which are mean squared errors on the original response.
We now fit a ridge regression model on the training set.
library(glmnet)
Y = training_data$Illnesses
X = as.matrix(training_data[, colnames(training_data) != "Illnesses"])
X_test = as.matrix(test_data[, colnames(test_data) != "Illnesses"])
ridge_fit <- cv.glmnet(y = Y, x = X, alpha = 0)
(This call emits repeated "NAs introduced by coercion" warnings, one per cross-validation fold: as.matrix() on a data frame containing the Month factor yields a character matrix, which glmnet coerces to numeric with NAs, so Month is effectively dropped. Building X with model.matrix(Illnesses ~ ., training_data)[, -1] would avoid this.)
The cross-validation plot shows that the ridge penalty fails to reduce the cross-validated error.
The optimal penalty parameter \(\lambda\) selected by cross-validation:
[1] 2.323024
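The test error below is presumably computed along these lines (a sketch; the report omits the chunk, and the use of lambda.min is an assumption):
ridge_pred <- predict(ridge_fit, newx = X_test, s = "lambda.min")
mean((test_data$Illnesses - ridge_pred)^2)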
The test error:
[1] 1724.468
The test error is slightly higher than that of the MLR model.
We fit a LASSO model on the training set.
Y = training_data$Illnesses
X = as.matrix(training_data[, colnames(training_data) != "Illnesses"])
library(glmnet)
lasso_fit = cv.glmnet(y = Y, x = X, alpha = 1)
(The same coercion warnings appear here, for the same reason as in the ridge fit.)
Like ridge regression, the cross-validation plot shows that the LASSO penalty also fails to reduce the cross-validated error.
The optimum \(\lambda\):
[1] 0.06616054
The corresponding fitted coefficients:
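These are presumably extracted with coef() at the selected \(\lambda\) (the use of lambda.min is an assumption):
coef(lasso_fit, s = "lambda.min")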
6 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 924.71763378
Year -0.45007107
Month .
Location -0.07701458
Hospitalizations 4.34237475
Fatalities -13.05431658
The optimal LASSO removes only one predictor, Month, from the model (its coefficient is shrunk to zero).
The test error:
[1] 1721.119
The test error is slightly higher than that of the MLR model, so the LASSO penalty is not helpful for this data.
We fit a PCR model on the training set.
library(pls)
pcr_fit = pcr(Illnesses ~ ., data = training_data, scale = TRUE, validation = "CV")
summary(pcr_fit)
Data: X dimension: 11359 15
Y dimension: 11359 1
Fit method: svdpc
Number of components considered: 15
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 48.55 46.50 46.56 46.56 46.56 46.56 46.57
adjCV 48.55 46.42 46.47 46.47 46.47 46.48 46.48
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 46.62 46.6 46.59 46.57 46.58 46.58 46.5
adjCV 46.53 46.5 46.49 46.48 46.49 46.49 46.4
14 comps 15 comps
CV 42.26 42.26
adjCV 42.25 42.26
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 9.719 17.52 24.92 32.29 39.63 46.93 54.18
Illnesses 11.977 12.09 12.09 12.09 12.09 12.10 12.10
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 61.39 68.59 75.77 82.94 90.09 95.58 99.33
Illnesses 12.21 12.24 12.26 12.37 12.38 12.77 24.67
15 comps
X 100.0
Illnesses 24.7
Validation plot:
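The MSEP values and the selected component count below are presumably computed with pls::MSEP along these lines (a sketch; the report omits the chunk and the object names are assumptions):
msep_cv <- drop(MSEP(pcr_fit)$val["CV", 1, ]) # cross-validated MSEP for 0-15 components
msep_cv
which.min(msep_cv[-1]) # component count minimizing the CV MSEP (excluding the intercept-only model)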
MSEP:
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
2356.938 2161.994 2167.554 2167.663 2167.819 2168.266
6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
2168.967 2173.105 2171.369 2170.492 2168.606 2170.146
12 comps 13 comps 14 comps 15 comps
2169.914 2161.866 1785.812 1786.182
[1] 14
The cross-validated MSEP is minimized at 14 components (1785.812), and using all 15 components (i.e., no dimension reduction) gives nearly the same value. We therefore use 14 principal components for this data.
The training error:
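A sketch of the computation, mirroring the prediction chunks shown elsewhere in the report:
pcr_pred_train <- predict(pcr_fit, newdata = training_data, ncomp = 14)
mean((pcr_pred_train - training_data$Illnesses)^2)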
[1] 1775.123
The test error:
pcr_pred_test <- predict(pcr_fit, newdata = test_data, ncomp = 14)
mean((pcr_pred_test - test_data$Illnesses)^2)
[1] 1720.75
PCR's test error is lower than those of ridge and LASSO, though still marginally above the MLR model's.
We fit a PLS model on the training set.
pls_fit = plsr(Illnesses ~ ., data = training_data, scale = TRUE, validation = "CV")
summary(pls_fit)
Data: X dimension: 11359 15
Y dimension: 11359 1
Fit method: kernelpls
Number of components considered: 15
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 48.55 45.61 42.52 42.45 42.46 42.46 42.46
adjCV 48.55 45.39 42.49 42.43 42.43 42.44 42.44
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 42.46 42.46 42.46 42.46 42.46 42.46 42.46
adjCV 42.44 42.44 42.44 42.44 42.44 42.44 42.44
14 comps 15 comps
CV 42.46 42.46
adjCV 42.44 42.44
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 8.892 13.58 20.26 25.54 27.54 34.62 41.88
Illnesses 20.728 24.65 24.69 24.69 24.70 24.70 24.70
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 49.14 56.47 63.7 71.0 78.23 85.54 92.8
Illnesses 24.70 24.70 24.7 24.7 24.70 24.70 24.7
15 comps
X 100.0
Illnesses 24.7
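As in the PCR section, the MSEP values and the selected size shown below can be obtained with pls::MSEP (again a sketch, with assumed names):
msep_pls <- drop(MSEP(pls_fit)$val["CV", 1, ])
msep_pls
which.min(msep_pls[-1])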
MSEP:
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
2356.938 2080.428 1808.056 1802.213 1802.431 1802.980
6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
1802.896 1802.898 1802.898 1802.898 1802.898 1802.898
12 comps 13 comps 14 comps 15 comps
1802.898 1802.898 1802.898 1802.898
[1] 3
The cross-validated MSEP is essentially flat beyond 3 components, so we use 3 PLS components for this data.
The training error:
pcr_pred_train <- predict(pls_fit, newdata = training_data, ncomp = 3)
mean((pcr_pred_train - training_data$Illnesses)^2)
[1] 1774.803
The test error:
pcr_pred_test <- predict(pls_fit, newdata = test_data, ncomp = 3)
mean((pcr_pred_test - test_data$Illnesses)^2)
[1] 1720.444
PLS's test error (1720.444) is essentially tied with the MLR model's (1720.342) and below those of ridge, LASSO, and PCR.
We fit a Decision Tree model on the training set.
library(tree)
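The fitting chunk is not shown; it is reconstructed here from the printed output below (tree_fit is the name used later in the prediction chunk):
tree_fit <- tree(Illnesses ~ ., data = training_data)
tree_fit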
Regression tree:
tree(formula = Illnesses ~ ., data = training_data)
Variables actually used in tree construction:
[1] "Hospitalizations" "Location"
Number of terminal nodes: 6
Residual mean deviance: 1790 = 20330000 / 11350
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-573.900 -12.110 -8.111 0.000 1.564 1575.000
The two predictors actually used to construct the tree are Hospitalizations and Location. The tree has 6 terminal nodes, and the residual mean deviance (training MSE) is 1790. Tree plot:
The test error:
pred_test <- predict(tree_fit, newdata = test_data)
(test_MSE <- mean((test_data$Illnesses - pred_test)^2))
[1] 1655.786
The tree has a lower test error than all of the previous models.
## Bagging
We fit a bagging model on the training set.
library(randomForest)
p <- ncol(training_data) - 1
bag_fit <- randomForest(Illnesses ~ ., data = training_data, mtry = p, importance = TRUE)
bag_fit
Call:
randomForest(formula = Illnesses ~ ., data = training_data, mtry = p, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals: 2073.17
% Var explained: 12.02
The training error:
pred_train <- predict(bag_fit, newdata = training_data)
(training_MSE <- mean((training_data$Illnesses - pred_train)^2))
[1] 854.9911
The test error:
pred_test <- predict(bag_fit, newdata = test_data)
(test_MSE <- mean((test_data$Illnesses - pred_test)^2))
[1] 1833.53
With a test error of 1833.53, the bagging model performs worse than the single regression tree.
We fit a random forest model on the training set.
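The selection of mtry below via which.min(test_MSE) implies a vector of test errors over the candidate mtry values; the chunk that computes it is not shown. A sketch of what it presumably looks like (the names fit_m and pred_m are assumptions):
# compute a test-set MSE for each candidate mtry value 1..p
test_MSE <- numeric(p)
for (m in 1:p) {
  fit_m <- randomForest(Illnesses ~ ., data = training_data, mtry = m)
  pred_m <- predict(fit_m, newdata = test_data)
  test_MSE[m] <- mean((test_data$Illnesses - pred_m)^2)
}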
# choose the mtry value with the lowest test error (see the sketch above)
m_opt <- (1:p)[which.min(test_MSE)]
RF_fit <- randomForest(Illnesses ~ ., data = training_data, mtry = m_opt, importance = TRUE)
The test error:
pred_test <- predict(RF_fit, newdata = test_data)
(test_MSE <- mean((test_data$Illnesses - pred_test)^2))
[1] 1652.889
The random forest's test error is well below the bagging model's and slightly below the single tree's.
Importance of variables.
%IncMSE IncNodePurity
Year 5.6014385 1001410.3
Month 1.7440596 1194744.2
Location 13.6711068 1624047.7
Hospitalizations 27.6308585 5479410.9
Fatalities -0.4881926 940522.9
The random forest finds that Hospitalizations and Location are among the most important variables.
## Boosting
We fit a boosting model on the training set.
library(gbm)
boosting_fit = gbm(Illnesses ~ ., data = training_data, distribution = "gaussian", n.trees = 500, interaction.depth = 2)
summary(boosting_fit)
var rel.inf
Hospitalizations Hospitalizations 61.395021
Month Month 20.748780
Location Location 7.639004
Fatalities Fatalities 5.436828
Year Year 4.780366
The boosting method also finds that Hospitalizations is the most important variable. However, unlike bagging and the random forest, boosting ranks Month as more important than Year.
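The test error below is presumably computed along these lines (a sketch; the chunk is omitted in the report):
boost_pred <- predict(boosting_fit, newdata = test_data) # gbm reports "Using 500 trees..."
mean((test_data$Illnesses - boost_pred)^2)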
The test error:
Using 500 trees...
[1] 1633.911
Among all of the methods considered, boosting gives the lowest test error (1633.911), followed closely by the random forest (1652.889) and the single regression tree (1655.786). The tree-based methods therefore outperform the linear approaches on this data; boosting is the best of them, though the random forest and tree models remain competitive.