Introduction

According to the Centers for Disease Control and Prevention, every year 48 million people in the United States develop a foodborne illness, 128,000 become hospitalized, and 3,000 will die. Certain factors are known for increasing the likelihood a person becomes severely ill, such as the age and prior health status of the afflicted person. Researchers have identified more than 250 pathogens, toxins, and chemicals that can lead to a foodborne illness. Public health agencies are urging both citizens and state governments to adopt a proactive approach to reduce the number of food-borne illnesses. One of the first steps to improve and change current health practices is to conduct an analysis that helps determine where and when illnesses are more likely to occur and if there are any trends that can predict outbreaks.

I am using the “Foodborne Disease Outbreaks, 1998 - 2015” dataset from Kaggle to determine the correlation between total number of illnesses and various, independent variables. I will be analyzing how the U.S. state, year, location of food preparation, etiology, and status (whether the etiology is confirmed or suspected) impacts total number of illnesses. I assume that the total number of foodborne illnesses are increasing by the year and that foodborne illnesses are more prominent in coastal areas.

I will use the Poisson Regression Analysis because the dependent variable (foodborne illnesses) is a count variable as it measures the amount of times illnesses occur. The Poisson distribution is both discrete and positive.

The dataset “Foodborne Disease Outbreaks, 1998 - 2015” is licensed by the Centers for Disease Control and Prevention. It contains 12 variables (Year. Month, State, Location, Food, Ingredient, Species, Serotype/Genotype, Status, Illnesses, Hospitalizations, Fatalities).

Importing Dataset & Packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
library(Zelig)
## Loading required package: survival
## 
## Attaching package: 'Zelig'
## The following object is masked from 'package:ggplot2':
## 
##     stat
library(ZeligChoice)
library(readr)
foodborne_disease<-read_csv("C:/Users/Skippz/Downloads/foodborne-diseases.zip")
## Parsed with column specification:
## cols(
##   Year = col_double(),
##   Month = col_character(),
##   State = col_character(),
##   Location = col_character(),
##   Food = col_character(),
##   Ingredient = col_character(),
##   Species = col_character(),
##   `Serotype/Genotype` = col_character(),
##   Status = col_character(),
##   Illnesses = col_double(),
##   Hospitalizations = col_double(),
##   Fatalities = col_double()
## )
fb_diseases<-select(foodborne_disease, "Year", "Month", "State", "Location", "Species", "Status", "Illnesses", "Hospitalizations", "Fatalities")
head(fb_diseases)
## # A tibble: 6 x 9
##    Year Month State Location Species Status Illnesses Hospitalizations
##   <dbl> <chr> <chr> <chr>    <chr>   <chr>      <dbl>            <dbl>
## 1  1998 Janu~ Cali~ Restaur~ <NA>    <NA>          20                0
## 2  1998 Janu~ Cali~ <NA>     <NA>    <NA>         112                0
## 3  1998 Janu~ Cali~ Restaur~ <NA>    <NA>          35                0
## 4  1998 Janu~ Cali~ Restaur~ Scombr~ Confi~         4                0
## 5  1998 Janu~ Cali~ Private~ Salmon~ Confi~        26                3
## 6  1998 Janu~ Cali~ Restaur~ Shigel~ Confi~        25                3
## # ... with 1 more variable: Fatalities <dbl>

Converting Datatypes

fb_diseases$Month<-as.factor(fb_diseases$Month)
fb_diseases$State<-as.factor(fb_diseases$State)
fb_diseases$Location<-as.factor(fb_diseases$Location)
fb_diseases$Species<-as.factor(fb_diseases$Species)
fb_diseases$Status<-as.factor(fb_diseases$Status)
fb_diseases$Year<-as.integer(fb_diseases$Year)
fb_diseases$Illnesses<-as.integer(fb_diseases$Illnesses)
fb_diseases$Hospitalizations<-as.integer(fb_diseases$Hospitalizations)
fb_diseases$Fatalities<-as.integer(fb_diseases$Fatalities)
fb_diseases$Location[fb_diseases$Location=='All']<- NA
fb_diseases$Species[fb_diseases$Species=='All']<- NA
fb_diseases$Status[fb_diseases$Status=='All']<- NA
fb_diseases$Illnesses[fb_diseases$Illnesses=='All']<- NA
attr(fb_diseases,'spec')<- NULL
str(fb_diseases)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 19119 obs. of  9 variables:
##  $ Year            : int  1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 ...
##  $ Month           : Factor w/ 12 levels "April","August",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ State           : Factor w/ 55 levels "Alabama","Alaska",..: 5 5 5 5 5 5 5 5 6 6 ...
##  $ Location        : Factor w/ 161 levels "Banquet Facility",..: 99 NA 99 99 66 99 99 99 99 99 ...
##  $ Species         : Factor w/ 201 levels "Amnesic shellfish poison",..: NA NA NA 163 134 165 NA 134 199 199 ...
##  $ Status          : Factor w/ 22 levels "Confirmed","Confirmed; Confirmed",..: NA NA NA 1 1 1 NA 1 14 14 ...
##  $ Illnesses       : int  20 112 35 4 26 25 8 4 21 3 ...
##  $ Hospitalizations: int  0 0 0 0 3 3 0 3 NA NA ...
##  $ Fatalities      : int  0 0 0 0 0 0 0 0 NA 0 ...
dim(fb_diseases)
## [1] 19119     9

Selecting the Best Model

P1<- zelig(Illnesses ~ Year, model = "poisson", data = fb_diseases, cite = F)
P2<- zelig(Illnesses ~ Year + Hospitalizations, model = "poisson", data = fb_diseases, cite = F)
P3<- zelig(Illnesses ~ Year + Hospitalizations + Fatalities, model = "poisson", data = fb_diseases, cite = F)
P4<- zelig(Illnesses ~ Year + Hospitalizations + Fatalities + State, model = "poisson", data = fb_diseases, cite = F)
htmlreg(list(P1,P2,P3, P4))
Statistical models
Model 1 Model 2 Model 3 Model 4
(Intercept) 20.84*** 37.97*** 38.26*** 61.60***
(0.64) (0.71) (0.72) (0.76)
Year -0.01*** -0.02*** -0.02*** -0.03***
(0.00) (0.00) (0.00) (0.00)
Hospitalizations 0.02*** 0.02*** 0.02***
(0.00) (0.00) (0.00)
Fatalities 0.01*** -0.01***
(0.00) (0.00)
StateAlaska 0.46***
(0.04)
StateArizona 0.77***
(0.03)
StateArkansas 1.73***
(0.03)
StateCalifornia 0.73***
(0.03)
StateColorado 0.97***
(0.03)
StateConnecticut 0.81***
(0.03)
StateDelaware 0.75***
(0.07)
StateFlorida 0.23***
(0.03)
StateGeorgia 1.12***
(0.03)
StateGuam 0.51***
(0.11)
StateHawaii 0.08*
(0.03)
StateIdaho 1.13***
(0.04)
StateIllinois 0.98***
(0.03)
StateIndiana 1.09***
(0.03)
StateIowa 1.30***
(0.03)
StateKansas 0.67***
(0.03)
StateKentucky 0.87***
(0.05)
StateLouisiana 1.95***
(0.03)
StateMaine 1.13***
(0.03)
StateMaryland 0.41***
(0.03)
StateMassachusetts 1.29***
(0.03)
StateMichigan 0.76***
(0.03)
StateMinnesota 0.70***
(0.03)
StateMississippi 1.52***
(0.04)
StateMissouri 1.23***
(0.03)
StateMontana 1.04***
(0.06)
StateMultistate 1.55***
(0.03)
StateNebraska 1.17***
(0.04)
StateNevada 1.54***
(0.03)
StateNew Hampshire 1.20***
(0.04)
StateNew Jersey 1.06***
(0.03)
StateNew Mexico 1.19***
(0.04)
StateNew York 0.78***
(0.03)
StateNorth Carolina 1.47***
(0.03)
StateNorth Dakota 1.48***
(0.03)
StateOhio 0.68***
(0.03)
StateOklahoma 1.51***
(0.03)
StateOregon 0.78***
(0.03)
StatePennsylvania 0.99***
(0.03)
StatePuerto Rico 0.50***
(0.04)
StateRepublic of Palau -0.62
(0.33)
StateRhode Island 0.84***
(0.04)
StateSouth Carolina 1.44***
(0.03)
StateSouth Dakota 1.08***
(0.05)
StateTennessee 1.08***
(0.03)
StateTexas 1.39***
(0.03)
StateUtah 1.09***
(0.03)
StateVermont 0.32***
(0.05)
StateVirginia 1.22***
(0.03)
StateWashington 0.31***
(0.03)
StateWashington DC 1.68***
(0.04)
StateWest Virginia 1.12***
(0.04)
StateWisconsin 1.10***
(0.03)
StateWyoming 1.21***
(0.04)
AIC 781020.08 567059.42 549715.12 511710.75
BIC 781035.80 567082.37 549745.63 512153.03
Log Likelihood -390508.04 -283526.71 -274853.56 -255797.38
Deviance 703798.68 503952.67 488038.56 449926.19
Num. obs. 19119 15494 15145 15145
p < 0.001, p < 0.01, p < 0.05

Model 5 is the best model as it contains majority of variables (establishing its complexity) and it has the lowest AIC and BIC scores among the models.

P5<- zelig(Illnesses ~ Year+ Hospitalizations + Fatalities + State, model = "poisson", data = fb_diseases, cite = F)
summary(P5)
## Model: 
## 
## Call:
## z5$zelig(formula = Illnesses ~ Year + Hospitalizations + Fatalities + 
##     State, data = fb_diseases)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -41.264   -3.792   -2.477    0.046  107.127  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)             6.160e+01  7.621e-01  80.837  < 2e-16
## Year                   -2.967e-02  3.804e-04 -77.999  < 2e-16
## Hospitalizations        1.510e-02  6.333e-05 238.462  < 2e-16
## Fatalities             -8.533e-03  1.719e-03  -4.964 6.90e-07
## StateAlaska             4.605e-01  4.135e-02  11.137  < 2e-16
## StateArizona            7.691e-01  2.891e-02  26.605  < 2e-16
## StateArkansas           1.734e+00  3.285e-02  52.779  < 2e-16
## StateCalifornia         7.299e-01  2.555e-02  28.561  < 2e-16
## StateColorado           9.661e-01  2.720e-02  35.520  < 2e-16
## StateConnecticut        8.065e-01  2.949e-02  27.350  < 2e-16
## StateDelaware           7.454e-01  7.323e-02  10.178  < 2e-16
## StateFlorida            2.312e-01  2.642e-02   8.750  < 2e-16
## StateGeorgia            1.115e+00  2.745e-02  40.626  < 2e-16
## StateGuam               5.067e-01  1.061e-01   4.775 1.80e-06
## StateHawaii             7.592e-02  3.057e-02   2.484   0.0130
## StateIdaho              1.127e+00  3.589e-02  31.407  < 2e-16
## StateIllinois           9.779e-01  2.583e-02  37.856  < 2e-16
## StateIndiana            1.088e+00  3.078e-02  35.358  < 2e-16
## StateIowa               1.297e+00  2.978e-02  43.569  < 2e-16
## StateKansas             6.726e-01  2.874e-02  23.404  < 2e-16
## StateKentucky           8.704e-01  4.817e-02  18.071  < 2e-16
## StateLouisiana          1.949e+00  2.965e-02  65.739  < 2e-16
## StateMaine              1.129e+00  3.003e-02  37.585  < 2e-16
## StateMaryland           4.091e-01  2.742e-02  14.921  < 2e-16
## StateMassachusetts      1.288e+00  2.878e-02  44.745  < 2e-16
## StateMichigan           7.631e-01  2.666e-02  28.626  < 2e-16
## StateMinnesota          6.986e-01  2.643e-02  26.427  < 2e-16
## StateMississippi        1.524e+00  3.724e-02  40.923  < 2e-16
## StateMissouri           1.226e+00  3.120e-02  39.304  < 2e-16
## StateMontana            1.045e+00  5.713e-02  18.288  < 2e-16
## StateMultistate         1.553e+00  2.695e-02  57.627  < 2e-16
## StateNebraska           1.169e+00  4.499e-02  25.987  < 2e-16
## StateNevada             1.540e+00  3.290e-02  46.807  < 2e-16
## StateNew Hampshire      1.200e+00  3.555e-02  33.755  < 2e-16
## StateNew Jersey         1.055e+00  3.151e-02  33.484  < 2e-16
## StateNew Mexico         1.193e+00  3.796e-02  31.442  < 2e-16
## StateNew York           7.760e-01  2.682e-02  28.936  < 2e-16
## StateNorth Carolina     1.469e+00  2.903e-02  50.589  < 2e-16
## StateNorth Dakota       1.483e+00  3.247e-02  45.685  < 2e-16
## StateOhio               6.790e-01  2.615e-02  25.959  < 2e-16
## StateOklahoma           1.512e+00  3.304e-02  45.758  < 2e-16
## StateOregon             7.847e-01  2.731e-02  28.731  < 2e-16
## StatePennsylvania       9.913e-01  2.694e-02  36.804  < 2e-16
## StatePuerto Rico        4.997e-01  4.177e-02  11.964  < 2e-16
## StateRepublic of Palau -6.159e-01  3.343e-01  -1.843   0.0654
## StateRhode Island       8.427e-01  4.013e-02  21.001  < 2e-16
## StateSouth Carolina     1.442e+00  2.928e-02  49.269  < 2e-16
## StateSouth Dakota       1.084e+00  4.905e-02  22.103  < 2e-16
## StateTennessee          1.075e+00  2.762e-02  38.932  < 2e-16
## StateTexas              1.387e+00  2.739e-02  50.617  < 2e-16
## StateUtah               1.088e+00  3.467e-02  31.375  < 2e-16
## StateVermont            3.215e-01  5.376e-02   5.980 2.23e-09
## StateVirginia           1.219e+00  2.848e-02  42.790  < 2e-16
## StateWashington         3.130e-01  2.763e-02  11.331  < 2e-16
## StateWashington DC      1.685e+00  4.130e-02  40.797  < 2e-16
## StateWest Virginia      1.118e+00  4.187e-02  26.706  < 2e-16
## StateWisconsin          1.099e+00  2.702e-02  40.675  < 2e-16
## StateWyoming            1.211e+00  3.703e-02  32.706  < 2e-16
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 537812  on 15144  degrees of freedom
## Residual deviance: 449926  on 15087  degrees of freedom
##   (3974 observations deleted due to missingness)
## AIC: 511711
## 
## Number of Fisher Scoring iterations: 6
## 
## Next step: Use 'setx' method
x<- setx(P5, Year= mean(fb_diseases$Year))
x1<-setx(P5, Year=mean(fb_diseases$Year)+(sd(fb_diseases$Year)))
s<-sim(P5,x=x,x1=x1)
summary(s) 
sim x :
v
mean sd 50% 2.5% 97.5%
1,] 17.23495 0.09388558 17.23708 17.05665 17.42531
v
mean sd 50% 2.5% 97.5%
1,] 17.088 4.001033 17 10 25
sim x1 :
v
mean sd 50% 2.5% 97.5%
1,] 14.78924 0.08542617 14.788 14.6291 14.95117
v
mean sd 50% 2.5% 97.5%
1,] 14.601 3.77174 14 8 23
d
mean sd 50% 2.5% 97.5%
1,] -2.445713 0.03201976 -2.445905 -2.510489 -2.381397

Results

AAccording to the predicated values, an increase in the year indicates a decrease by -2.45 in the total number of illnesses suggesting that illnesses from foodborne diseases are actually decreasing.