Topic: Proportions of US residents who died of Lyme disease in 2015 (Week 1 to Week 52).

Introduction

Background Information

Lyme disease, also known as Lyme borreliosis, is a bacterial infection transmitted by the bite of an Ixodes tick, and only if the tick has previously bitten an infected animal. It can be transmitted to both pets and humans. The most common symptoms are headache, chills, fever, and myalgia. A visible sign on the body is a red or purple circle or mark where the tick has bitten. It is necessary to consult a doctor only if the bitten person becomes sick. Antibiotics (doxycycline, amoxicillin, or cefuroxime) treat the disease. A tick can be removed from the body at home with fine-tipped tweezers or a tick puller. To avoid ticks and their bites, a tick repellent can be sprayed on clothing.

Data

Collections

The data were collected by observation. Three variables are numeric (the MMWR week, the Lyme disease death counts, and the year, 2015) and one is categorical (the reporting area). The data describe the proportions of deaths among US residents who contracted Lyme disease. There are two sources of bias. First, the data are incomplete: values are missing for C.N.M.I., American Samoa, Guam, Puerto Rico, and the Virgin Islands. Consequently, we exclude those territories when answering the questions. Second, we assume that the reported counts are human deaths from Lyme disease.

Information

The Centers for Disease Control and Prevention (CDC) publishes the Morbidity and Mortality Weekly Report (MMWR) series every week. The MMWR series is CDC’s primary medium for scientific publication of timely, credible, definitive, correct, impartial, and useful public health facts and recommendations. Physicians, nurses, public health professionals, epidemiologists and other scientists, academics, students, and laboratorians are among the most frequent readers of the MMWR. Here, the MMWR data begin with the 1st week of 2015 and end with the 52nd week of 2015. The proportions of Lyme disease deaths are reported by region and state in the USA.

Datasets

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.4

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

setwd("C:/Users/auria/OneDrive/Bureau/MATH 217/Final Project")
lyme <- read_csv("lymeweek.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Reporting_Area = col_character(),
##   `MMWR Year` = col_double(),
##   MMWR_Week = col_double(),
##   `Lyme disease, Cum 2014` = col_double()
## )

glimpse(lyme)

## Rows: 3,380
## Columns: 4
## $ Reporting_Area           <chr> "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS"...
## $ `MMWR Year`              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ MMWR_Week                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ `Lyme disease, Cum 2014` <dbl> NA, NA, NA, NA, 1, NA, 11, 3, NA, 8, 1, 1,...

view(lyme)

Questions

Does the proportion of Lyme death cases vary between regions in the USA?
Does the proportion of Lyme death cases vary within regions in the USA?

Answers to the Research Questions (Week 1 study)

Between Regions

Does the proportion of Lyme death cases vary between regions in the USA?

Bar Chart of the Number of Deaths from Lyme Disease in Different Regions

allregions1 <- c("New England", "Mid Atlantic", "E.N. Central", "W.N. Central", "S. Atlantic", "E.S. Central", "W.S. Central", "Mountain", "Pacific")
allcases1 <- c(124, 186, 13, 2, 44, 1, 0, 1, 2)
df1 <- data.frame(allregions1, allcases1)
df1

##    allregions1 allcases1
## 1  New England       124
## 2 Mid Atlantic       186
## 3 E.N. Central        13
## 4 W.N. Central         2
## 5  S. Atlantic        44
## 6 E.S. Central         1
## 7 W.S. Central         0
## 8     Mountain         1
## 9      Pacific         2

plot1 <- df1%>%
  ggplot(aes(x=allregions1, y = allcases1, fill = allregions1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 45)) + 
  ggtitle("Number of Deaths from Lyme Disease in Different Regions")+
  ylab("Reported Cases") +
  xlab("Regions") 
plot1

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases between US regions. Ha: There is a difference in the proportion of Lyme death cases between US regions.

## [1] 392

392/9

## [1] 43.55556

null.probs = c(44/392, 47/392, 43/392, 43/392, 43/392, 43/392, 43/392, 43/392, 43/392)
allcases1 = c(124, 186, 13, 2, 44, 1, 0, 1, 2)
chisq.test(allcases1, p=null.probs)

## 
##  Chi-squared test for given probabilities
## 
## data:  allcases1
## X-squared = 819.53, df = 8, p-value < 2.2e-16

Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among US regions.
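The null probabilities used above are hand-rounded shares of 392, while the nine regional counts listed add up to 373. As a minimal sketch, assuming the intended null hypothesis is an equal share of cases for each of the nine regions, the same test can be written with `rep(1/9, 9)`, which is also what `chisq.test()` uses when `p` is omitted. The names `region_cases`, `uniform_null`, and `region_test` are illustrative only, and the resulting statistic will differ somewhat from the output above because the probabilities differ.

```r
# Week 1 counts for the nine regions, copied from the table above
region_cases <- c(124, 186, 13, 2, 44, 1, 0, 1, 2)

# Assumed null: every region has the same share of cases
uniform_null <- rep(1 / length(region_cases), length(region_cases))

region_test <- chisq.test(region_cases, p = uniform_null)

# Expected count per region under this null: total count / 9
region_test$expected

# Pearson statistic and p-value
region_test$statistic
region_test$p.value
```

Writing the null this way avoids rounding each share by hand and guarantees that the probabilities sum to 1.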

Within Regions

Does the proportion of Lyme death cases vary within regions in the USA?

New England

Bar Chart of the Number of Deaths from Lyme Disease in New England

newenglands1 <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont")
casenewenglands1 <- c(30, 15, 42, 5, 9, 8)
df2 <- data.frame(newenglands1, casenewenglands1)
df2

##    newenglands1 casenewenglands1
## 1   Connecticut               30
## 2         Maine               15
## 3 Massachusetts               42
## 4 New Hampshire                5
## 5  Rhode Island                9
## 6       Vermont                8

plot2 <- df2%>%
  ggplot(aes(x=newenglands1, y = casenewenglands1, fill = newenglands1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30)) + 
  ggtitle("Number of Deaths from Lyme Disease in New England")+
  ylab("Reported Cases") +
  xlab("States")
plot2

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in New England. Ha: There is a difference in the proportion of Lyme death cases in New England.

30+15+42+5+9+8

## [1] 109

109/6

## [1] 18.16667

null.probs = c(18/109, 18/109, 19/109, 18/109, 18/109, 18/109)
casenewenglands1 = c(30, 15, 42, 5, 9, 8)
chisq.test(casenewenglands1, p=null.probs)

## 
##  Chi-squared test for given probabilities
## 
## data:  casenewenglands1
## X-squared = 55.787, df = 5, p-value = 8.992e-11

Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among New England states.
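Rejecting the null says that the states differ, but not which states drive the difference. One hedged way to look at this is through the Pearson residuals returned by `chisq.test()`. The sketch below assumes an equal-share null (1/6 per state, the default when `p` is omitted) rather than the hand-rounded probabilities above, and the names `ne_cases`, `ne_test`, and `ne_resid` are illustrative.

```r
ne_states <- c("Connecticut", "Maine", "Massachusetts",
               "New Hampshire", "Rhode Island", "Vermont")
ne_cases <- c(30, 15, 42, 5, 9, 8)
names(ne_cases) <- ne_states

# Omitting p gives the equal-share null (1/6 per state)
ne_test <- chisq.test(ne_cases)

# Pearson residuals: (observed - expected) / sqrt(expected)
ne_resid <- round(ne_test$residuals, 2)

# States ordered from largest to smallest absolute residual
ne_resid[order(-abs(ne_resid))]
```

Large positive residuals flag states with more cases than an equal share would predict, and large negative residuals flag states with fewer.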

Mid Atlantic

Bar Chart of the Number of Deaths from Lyme Disease in Mid Atlantic

ma1 <- c("New Jersey", "New York", "Pennsylvania")
casema1 <- c(39, 44, 79)
df3 <- data.frame(ma1, casema1)
df3

##            ma1 casema1
## 1   New Jersey      39
## 2     New York      44
## 3 Pennsylvania      79

plot3 <- df3%>%
  ggplot(aes(x=ma1, y = casema1, fill = ma1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 360)) + 
  ggtitle("Number of Deaths from Lyme Disease in Mid Atlantic")+
  ylab("Reported Cases") +
  xlab("States")
plot3

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the Mid Atlantic. Ha: There is a difference in the proportion of Lyme death cases in the Mid Atlantic.

39+44+79

## [1] 162

162/3

## [1] 54

null.probs = c(54/162, 54/162, 54/162)
casem1 = c(39, 44, 79)
chisq.test(casem1, p=null.probs)

## 
##  Chi-squared test for given probabilities
## 
## data:  casem1
## X-squared = 17.593, df = 2, p-value = 0.0001513

Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among Mid Atlantic states.
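To make the arithmetic behind the test explicit, the statistic can be recomputed by hand from the observed and expected counts. This is only a sketch of the standard Pearson formula with illustrative names (`ma_cases`, `ma_expected`, `ma_stat`, `ma_p`); it should agree with the X-squared = 17.593 and p-value = 0.0001513 reported above.

```r
ma_cases <- c(39, 44, 79)                   # New Jersey, New York, Pennsylvania
ma_expected <- rep(sum(ma_cases) / 3, 3)    # 54 per state under the null

# Pearson statistic: sum of (observed - expected)^2 / expected
ma_stat <- sum((ma_cases - ma_expected)^2 / ma_expected)

# Upper-tail probability of a chi-squared with 3 - 1 = 2 degrees of freedom
ma_p <- pchisq(ma_stat, df = length(ma_cases) - 1, lower.tail = FALSE)

c(statistic = ma_stat, p.value = ma_p)
```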

E. N. Central

Bar Chart of the Number of Deaths from Lyme Disease in E. N. Central

enc1 <- c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin")
caseenc1 <- c(1, 1, 1, 1, 10)
df4 <- data.frame(enc1, caseenc1)
df4

##        enc1 caseenc1
## 1  Illinois        1
## 2   Indiana        1
## 3  Michigan        1
## 4      Ohio        1
## 5 Wisconsin       10

plot4 <- df4%>%
  ggplot(aes(x=enc1, y = caseenc1, fill = enc1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30)) + 
  ggtitle("Number of Deaths from Lyme Disease in E. N. Central")+
  ylab("Reported Cases") +
  xlab("States")
plot4

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the E. N. Central region. Ha: There is a difference in the proportion of Lyme death cases in the E. N. Central region.

1+1+1+1+10

## [1] 14

14/5

## [1] 2.8

null.probs = c(2.8/14, 2.8/14, 2.8/14, 2.8/14, 2.8/14)
caseenc1 = c(1,1,1,1,10)
chisq.test(caseenc1, p=null.probs)

## Warning in chisq.test(caseenc1, p = null.probs): Chi-squared approximation may
## be incorrect

## 
##  Chi-squared test for given probabilities
## 
## data:  caseenc1
## X-squared = 23.143, df = 4, p-value = 0.0001186

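Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among E. N. Central states, although the warning indicates that, with an expected count of only 2.8 per state, the chi-squared approximation may be unreliable.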
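When R warns that the chi-squared approximation may be incorrect, one option is to estimate the p-value by Monte Carlo simulation, which `chisq.test()` supports through its `simulate.p.value` and `B` arguments. The sketch below is not part of the original analysis; the seed, the number of replicates, and the names `enc_cases` and `enc_sim` are arbitrary choices for illustration.

```r
enc_cases <- c(1, 1, 1, 1, 10)   # Illinois, Indiana, Michigan, Ohio, Wisconsin

set.seed(217)  # arbitrary seed so the simulation is repeatable

# Monte Carlo p-value under the equal-share null; B is the number of replicates
enc_sim <- chisq.test(enc_cases, simulate.p.value = TRUE, B = 10000)

enc_sim$statistic   # same Pearson statistic as the asymptotic test
enc_sim$p.value     # simulated p-value, not based on the chi-squared approximation
```

The same approach could be tried for the other regions whose tests produce this warning.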

W. N. Central

Bar Chart of the Number of Deaths from Lyme Disease in W. N. Central

wnc1 <- c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota")
casewnc1 <- c(1, 0, 0, 0, 0, 0, 0)
df10 <- data.frame(wnc1, casewnc1)
df10

##           wnc1 casewnc1
## 1         Iowa        1
## 2       Kansas        0
## 3    Minnesota        0
## 4     Missouri        0
## 5     Nebraska        0
## 6 North Dakota        0
## 7 South Dakota        0

plot10 <- df10 %>%
  ggplot(aes(x=wnc1, y = casewnc1, fill = wnc1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30)) + 
  ggtitle("Number of Deaths from Lyme Disease in W. N. Central")+
  ylab("Reported Cases") +
  xlab("States")
plot10

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the W. N. Central region. Ha: There is a difference in the proportion of Lyme death cases in the W. N. Central region.

## [1] 1

1/7

## [1] 0.1428571

null.probs = c(0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
casewnc1 = c(1, 0, 0, 0, 0, 0, 0)
chisq.test(casewnc1, p=null.probs)

## Warning in chisq.test(casewnc1, p = null.probs): Chi-squared approximation may
## be incorrect

## 
##  Chi-squared test for given probabilities
## 
## data:  casewnc1
## X-squared = 1.5, df = 6, p-value = 0.9595

Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among W. N. Central states.

S. Atlantic

Bar Chart of the Number of Deaths from Lyme Disease in S. Atlantic

sa1 <- c("Delaware", "DC", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "West Virginia")
casesa1 <- c(6, 0, 2, 0, 15, 0, 0, 16, 1)
df5 <- data.frame(sa1, casesa1)
df5

##              sa1 casesa1
## 1       Delaware       6
## 2             DC       0
## 3        Florida       2
## 4        Georgia       0
## 5       Maryland      15
## 6 North Carolina       0
## 7 South Carolina       0
## 8       Virginia      16
## 9  West Virginia       1

plot5 <- df5%>%
  ggplot(aes(x=sa1, y = casesa1, fill = sa1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30)) + 
  ggtitle("Number of Deaths from Lyme Disease in S. Atlantic")+
  ylab("Reported Cases") +
  xlab("States")
plot5

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the S. Atlantic region. Ha: There is a difference in the proportion of Lyme death cases in the S. Atlantic region.

6+0+2+0+15+0+0+16+1

## [1] 40

40/9

## [1] 4.444444

null.probs = c(4/40, 4/40, 4/40, 4/40, 4/40, 4/40, 4/40, 8/40, 4/40)
casesa1 = c(6, 0, 2, 0, 15, 0, 0, 16, 1)
chisq.test(casesa1, p=null.probs)

## Warning in chisq.test(casesa1, p = null.probs): Chi-squared approximation may be
## incorrect

## 
##  Chi-squared test for given probabilities
## 
## data:  casesa1
## X-squared = 58.5, df = 8, p-value = 9.17e-10

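Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among S. Atlantic states, although the warning indicates that the expected counts are small enough for the chi-squared approximation to be unreliable.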

E. S. Central

Bar Chart of the Number of Deaths from Lyme Disease in E. S. Central

esc1 <- c("Alabama", "Kentucky", "Mississippi", "Tennessee")
caseesc1 <- c(0, 1, 0, 0)
df6 <- data.frame(esc1, caseesc1)
df6

##          esc1 caseesc1
## 1     Alabama        0
## 2    Kentucky        1
## 3 Mississippi        0
## 4   Tennessee        0

plot6 <- df6%>%
  ggplot(aes(x=esc1, y = caseesc1, fill = esc1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 360)) + 
  ggtitle("Number of Deaths from Lyme Disease in E. S. Central")+
  ylab("Reported Cases") +
  xlab("States")
plot6

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the E. S. Central region. Ha: There is a difference in the proportion of Lyme death cases in the E. S. Central region.

## [1] 1

1/4

## [1] 0.25

null.probs = c(0.25, 0.25, 0.25, 0.25)
caseesc1 = c(0, 1, 0, 0)
chisq.test(caseesc1, p=null.probs)

## Warning in chisq.test(caseesc1, p = null.probs): Chi-squared approximation may
## be incorrect

## 
##  Chi-squared test for given probabilities
## 
## data:  caseesc1
## X-squared = 3, df = 3, p-value = 0.3916

Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among E. S. Central states.

W. S. Central

Bar Chart of the Number of Deaths from Lyme Disease in W. S. Central

wsc1 <- c("Arkansas", "Louisiana", "Oklahoma", "Texas")
casewsc1 <- c(0, 0, 0, 0)
df7 <- data.frame(wsc1, casewsc1)
df7

##        wsc1 casewsc1
## 1  Arkansas        0
## 2 Louisiana        0
## 3  Oklahoma        0
## 4     Texas        0

plot7 <- df7%>%
  ggplot(aes(x=wsc1, y = casewsc1, fill = wsc1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 360)) + 
  ggtitle("Number of Deaths from Lyme Disease in W. S. Central")+
  ylab("Reported Cases") +
  xlab("States")
plot7

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the W. S. Central region. Ha: There is a difference in the proportion of Lyme death cases in the W. S. Central region.

Answer: By inspection, every state in the region reports zero cases, so the observed counts show no differences at all and we take the p-value to be 1. Consequently, p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among W. S. Central states.
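No test output is shown for this region because the chi-squared test cannot be computed when every count is zero. As a small sketch with illustrative names (`wsc_cases`, `wsc_result`), the call can be wrapped in `tryCatch()` to capture the error message instead of stopping the script:

```r
wsc_cases <- c(0, 0, 0, 0)   # Arkansas, Louisiana, Oklahoma, Texas

# chisq.test() needs at least one positive count, so catch the error
wsc_result <- tryCatch(
  chisq.test(wsc_cases),
  error = function(e) conditionMessage(e)
)

# A character string here means the test could not be run
wsc_result
```

In that situation the conclusion rests on inspection of the counts rather than on a computed p-value.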

Mountain

Bar Chart of the Number of Deaths from Lyme Disease in Mountain

m1 <- c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming") 
casem1 <- c(0, 0, 0, 0, 1, 0, 0, 0)
df8 <- data.frame(m1, casem1)
df8

##           m1 casem1
## 1    Arizona      0
## 2   Colorado      0
## 3      Idaho      0
## 4    Montana      0
## 5     Nevada      1
## 6 New Mexico      0
## 7       Utah      0
## 8    Wyoming      0

plot8 <- df8%>%
  ggplot(aes(x=m1, y = casem1, fill = m1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30)) + 
  ggtitle("Number of Deaths from Lyme Disease in Mountain")+
  ylab("Reported Cases") +
  xlab("States")
plot8

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the Mountain region. Ha: There is a difference in the proportion of Lyme death cases in the Mountain region.

## [1] 1

1/8

## [1] 0.125

null.probs = c(0.1, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1)
casem1 = c(0, 0, 0, 0, 1, 0, 0, 0)
chisq.test(casem1, p=null.probs)

## Warning in chisq.test(casem1, p = null.probs): Chi-squared approximation may be
## incorrect

## 
##  Chi-squared test for given probabilities
## 
## data:  casem1
## X-squared = 2.3333, df = 7, p-value = 0.9391

Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among Mountain states.

Pacific

Bar Chart of the Number of Deaths from Lyme Disease in Pacific

p1 <- c("Alaska", "California", "Hawaii", "Oregon", "Washington")
casep1 <- c(0, 1, 0, 1, 0)
df9 <- data.frame(p1, casep1)
df9

##           p1 casep1
## 1     Alaska      0
## 2 California      1
## 3     Hawaii      0
## 4     Oregon      1
## 5 Washington      0

plot9 <- df9%>%
  ggplot(aes(x=p1, y = casep1, fill = p1)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 360)) + 
  ggtitle("Number of Deaths from Lyme Disease in Pacific")+
  ylab("Reported Cases") +
  xlab("States")
plot9

Hypothesis Test Question

Ho: There is no difference in the proportion of Lyme death cases in the Pacific region. Ha: There is a difference in the proportion of Lyme death cases in the Pacific region.

1+1

## [1] 2

2/5

## [1] 0.4

null.probs = c(0.1, 0.35, 0.1, 0.35, 0.1)
casep1 = c(0, 1, 0, 1, 0)
chisq.test(casep1, p=null.probs)

## Warning in chisq.test(casep1, p = null.probs): Chi-squared approximation may be
## incorrect

## 
##  Chi-squared test for given probabilities
## 
## data:  casep1
## X-squared = 0.85714, df = 4, p-value = 0.9306

Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among Pacific states.

Odds of Dying of Lyme Disease

Mid Atlantic (the highest)

MidAtlantic <- c(186, 44)
Others1 <- c(206, 348)
testma <- data.frame(MidAtlantic, Others1)
testma

##   MidAtlantic Others1
## 1         186     206
## 2          44     348

392/9

## [1] 43.55556

392-44

## [1] 348

(186*348)/(44*206)

## [1] 7.141218

Answer: The odds of dying of Lyme disease are about 7.14 times as high in the Mid Atlantic region as for the US as a whole, taking an equal share of about 44 of the 392 cases per region as the reference.

Pacific (the lowest)

Pacific <- c(2, 44)
Others2 <- c(390, 348)
testp <- data.frame(Pacific, Others2)
testp

##   Pacific Others2
## 1       2     390
## 2      44     348

392/9

## [1] 43.55556

392-44

## [1] 348

(2*348)/(44*390)

## [1] 0.04055944

Answer: The odds of dying of Lyme disease in the Pacific region are about 0.04 times those for the US as a whole (roughly 25 times lower), taking an equal share of about 44 of the 392 cases per region as the reference.
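Both odds ratios above follow the same cross-product formula for a 2x2 table. The helper below, with the hypothetical name `odds_ratio`, is a sketch that wraps that arithmetic so the two calculations can be compared side by side; it uses the same counts as the tables above.

```r
# Odds ratio for a 2x2 table laid out as
#              region   others
# observed       a        b
# reference      c        d
odds_ratio <- function(a, b, c, d) {
  (a * d) / (b * c)
}

or_mid_atlantic <- odds_ratio(186, 206, 44, 348)
or_pacific      <- odds_ratio(2, 390, 44, 348)

round(c(mid_atlantic = or_mid_atlantic, pacific = or_pacific), 2)
```

A ratio above 1 means higher odds than the reference row, and a ratio below 1 means lower odds.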

Linear Regressions

States

library(tidyverse)
setwd("C:/Users/auria/OneDrive/Bureau/MATH 217/Final Project")
lymefits1 <- read_csv("lymefits1.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Reporting_Area = col_character(),
##   MMWR_Year = col_double(),
##   MMWR_Week = col_double(),
##   `Lyme disease, Previous 52 weeks Med` = col_double(),
##   `Lyme disease, Cum 2014` = col_double()
## )

lymefits1a <- lymefits1 %>%
  rename(
    areas = Reporting_Area,
    counts = `Lyme disease, Previous 52 weeks Med`,
    week = MMWR_Week
    )

glimpse(lymefits1)

## Rows: 2,652
## Columns: 5
## $ Reporting_Area                        <chr> "ALABAMA", "ALASKA", "ARIZONA...
## $ MMWR_Year                             <dbl> 2015, 2015, 2015, 2015, 2015,...
## $ MMWR_Week                             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ `Lyme disease, Previous 52 weeks Med` <dbl> 0, 0, 0, 0, 0, 0, 30, 6, 0, 2...
## $ `Lyme disease, Cum 2014`              <dbl> NA, NA, NA, NA, 1, NA, 11, 3,...

View(lymefits1)

fit1 <- glm(counts ~ week + areas, data=lymefits1a)
summary(fit1)

## 
## Call:
## glm(formula = counts ~ week + areas, data = lymefits1a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -17.409   -0.101   -0.002    0.081   32.947  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.623015   0.332482   1.874 0.061066 .  
## week                -0.003191   0.003011  -1.060 0.289451    
## areasALASKA         -0.538462   0.456456  -1.180 0.238244    
## areasARIZONA        -0.538462   0.456456  -1.180 0.238244    
## areasARKANSAS       -0.538462   0.456456  -1.180 0.238244    
## areasCALIFORNIA      0.019231   0.456456   0.042 0.966398    
## areasCOLORADO       -0.538462   0.456456  -1.180 0.238244    
## areasCONNECTICUT    24.884615   0.456456  54.517  < 2e-16 ***
## areasDELAWARE        4.903846   0.456456  10.743  < 2e-16 ***
## areasDIST. OF COL.   0.423077   0.456456   0.927 0.354078    
## areasFLORIDA         1.730769   0.456456   3.792 0.000153 ***
## areasGEORGIA        -0.538462   0.456456  -1.180 0.238244    
## areasHAWAII         -0.538462   0.456456  -1.180 0.238244    
## areasIDAHO          -0.538462   0.456456  -1.180 0.238244    
## areasILLINOIS        0.730769   0.456456   1.601 0.109506    
## areasINDIANA        -0.019231   0.456456  -0.042 0.966398    
## areasIOWA            0.596154   0.456456   1.306 0.191651    
## areasKANSAS         -0.538462   0.456456  -1.180 0.238244    
## areasKENTUCKY       -0.538462   0.456456  -1.180 0.238244    
## areasLOUISIANA      -0.538462   0.456456  -1.180 0.238244    
## areasMAINE          13.576923   0.456456  29.744  < 2e-16 ***
## areasMARYLAND       19.538462   0.456456  42.805  < 2e-16 ***
## areasMASSACHUSETTS  37.942308   0.456456  83.124  < 2e-16 ***
## areasMICHIGAN        0.519231   0.456456   1.138 0.255423    
## areasMINNESOTA       0.576923   0.456456   1.264 0.206372    
## areasMISSISSIPPI    -0.538462   0.456456  -1.180 0.238244    
## areasMISSOURI       -0.538462   0.456456  -1.180 0.238244    
## areasMONTANA        -0.538462   0.456456  -1.180 0.238244    
## areasNEBRASKA       -0.538462   0.456456  -1.180 0.238244    
## areasNEVADA         -0.538462   0.456456  -1.180 0.238244    
## areasNEW HAMPSHIRE   3.961538   0.456456   8.679  < 2e-16 ***
## areasNEW JERSEY     39.500000   0.456456  86.536  < 2e-16 ***
## areasNEW MEXICO     -0.538462   0.456456  -1.180 0.238244    
## areasNEW YORK       40.846154   0.456456  89.486  < 2e-16 ***
## areasNORTH CAROLINA -0.384615   0.456456  -0.843 0.399522    
## areasNORTH DAKOTA   -0.538462   0.456456  -1.180 0.238244    
## areasOHIO            1.211538   0.456456   2.654 0.007997 ** 
## areasOKLAHOMA       -0.538462   0.456456  -1.180 0.238244    
## areasOREGON          0.307692   0.456456   0.674 0.500314    
## areasPENNSYLVANIA   88.596154   0.456456 194.096  < 2e-16 ***
## areasRHODE ISLAND   10.076923   0.456456  22.076  < 2e-16 ***
## areasSOUTH CAROLINA  0.423077   0.456456   0.927 0.354078    
## areasSOUTH DAKOTA   -0.538462   0.456456  -1.180 0.238244    
## areasTENNESSEE      -0.538462   0.456456  -1.180 0.238244    
## areasTEXAS          -0.403846   0.456456  -0.885 0.376377    
## areasUTAH           -0.538462   0.456456  -1.180 0.238244    
## areasVERMONT         6.115385   0.456456  13.398  < 2e-16 ***
## areasVIRGINIA       18.750000   0.456456  41.077  < 2e-16 ***
## areasWASHINGTON     -0.538462   0.456456  -1.180 0.238244    
## areasWEST VIRGINIA   1.307692   0.456456   2.865 0.004205 ** 
## areasWISCONSIN       8.807692   0.456456  19.296  < 2e-16 ***
## areasWYOMING        -0.538462   0.456456  -1.180 0.238244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5.417143)
## 
##     Null deviance: 659682  on 2651  degrees of freedom
## Residual deviance:  14085  on 2600  degrees of freedom
## AIC: 12060
## 
## Number of Fisher Scoring iterations: 2

par(mfrow = c(2,2))
plot(fit1)

fit111 <- lm(counts ~ week + areas, data=lymefits1a)
summary(fit111)

## 
## Call:
## lm(formula = counts ~ week + areas, data = lymefits1a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.409  -0.101  -0.002   0.081  32.947 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.623015   0.332482   1.874 0.061066 .  
## week                -0.003191   0.003011  -1.060 0.289451    
## areasALASKA         -0.538462   0.456456  -1.180 0.238244    
## areasARIZONA        -0.538462   0.456456  -1.180 0.238244    
## areasARKANSAS       -0.538462   0.456456  -1.180 0.238244    
## areasCALIFORNIA      0.019231   0.456456   0.042 0.966398    
## areasCOLORADO       -0.538462   0.456456  -1.180 0.238244    
## areasCONNECTICUT    24.884615   0.456456  54.517  < 2e-16 ***
## areasDELAWARE        4.903846   0.456456  10.743  < 2e-16 ***
## areasDIST. OF COL.   0.423077   0.456456   0.927 0.354078    
## areasFLORIDA         1.730769   0.456456   3.792 0.000153 ***
## areasGEORGIA        -0.538462   0.456456  -1.180 0.238244    
## areasHAWAII         -0.538462   0.456456  -1.180 0.238244    
## areasIDAHO          -0.538462   0.456456  -1.180 0.238244    
## areasILLINOIS        0.730769   0.456456   1.601 0.109506    
## areasINDIANA        -0.019231   0.456456  -0.042 0.966398    
## areasIOWA            0.596154   0.456456   1.306 0.191651    
## areasKANSAS         -0.538462   0.456456  -1.180 0.238244    
## areasKENTUCKY       -0.538462   0.456456  -1.180 0.238244    
## areasLOUISIANA      -0.538462   0.456456  -1.180 0.238244    
## areasMAINE          13.576923   0.456456  29.744  < 2e-16 ***
## areasMARYLAND       19.538462   0.456456  42.805  < 2e-16 ***
## areasMASSACHUSETTS  37.942308   0.456456  83.124  < 2e-16 ***
## areasMICHIGAN        0.519231   0.456456   1.138 0.255423    
## areasMINNESOTA       0.576923   0.456456   1.264 0.206372    
## areasMISSISSIPPI    -0.538462   0.456456  -1.180 0.238244    
## areasMISSOURI       -0.538462   0.456456  -1.180 0.238244    
## areasMONTANA        -0.538462   0.456456  -1.180 0.238244    
## areasNEBRASKA       -0.538462   0.456456  -1.180 0.238244    
## areasNEVADA         -0.538462   0.456456  -1.180 0.238244    
## areasNEW HAMPSHIRE   3.961538   0.456456   8.679  < 2e-16 ***
## areasNEW JERSEY     39.500000   0.456456  86.536  < 2e-16 ***
## areasNEW MEXICO     -0.538462   0.456456  -1.180 0.238244    
## areasNEW YORK       40.846154   0.456456  89.486  < 2e-16 ***
## areasNORTH CAROLINA -0.384615   0.456456  -0.843 0.399522    
## areasNORTH DAKOTA   -0.538462   0.456456  -1.180 0.238244    
## areasOHIO            1.211538   0.456456   2.654 0.007997 ** 
## areasOKLAHOMA       -0.538462   0.456456  -1.180 0.238244    
## areasOREGON          0.307692   0.456456   0.674 0.500314    
## areasPENNSYLVANIA   88.596154   0.456456 194.096  < 2e-16 ***
## areasRHODE ISLAND   10.076923   0.456456  22.076  < 2e-16 ***
## areasSOUTH CAROLINA  0.423077   0.456456   0.927 0.354078    
## areasSOUTH DAKOTA   -0.538462   0.456456  -1.180 0.238244    
## areasTENNESSEE      -0.538462   0.456456  -1.180 0.238244    
## areasTEXAS          -0.403846   0.456456  -0.885 0.376377    
## areasUTAH           -0.538462   0.456456  -1.180 0.238244    
## areasVERMONT         6.115385   0.456456  13.398  < 2e-16 ***
## areasVIRGINIA       18.750000   0.456456  41.077  < 2e-16 ***
## areasWASHINGTON     -0.538462   0.456456  -1.180 0.238244    
## areasWEST VIRGINIA   1.307692   0.456456   2.865 0.004205 ** 
## areasWISCONSIN       8.807692   0.456456  19.296  < 2e-16 ***
## areasWYOMING        -0.538462   0.456456  -1.180 0.238244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.327 on 2600 degrees of freedom
## Multiple R-squared:  0.9786, Adjusted R-squared:  0.9782 
## F-statistic:  2337 on 51 and 2600 DF,  p-value: < 2.2e-16

fit10 <- glm(counts ~ areas, data=lymefits1a)
summary(fit10)

## 
## Call:
## glm(formula = counts ~ areas, data = lymefits1a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -17.481   -0.135    0.000    0.038   32.865  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.53846    0.32277   1.668 0.095387 .  
## areasALASKA         -0.53846    0.45647  -1.180 0.238255    
## areasARIZONA        -0.53846    0.45647  -1.180 0.238255    
## areasARKANSAS       -0.53846    0.45647  -1.180 0.238255    
## areasCALIFORNIA      0.01923    0.45647   0.042 0.966399    
## areasCOLORADO       -0.53846    0.45647  -1.180 0.238255    
## areasCONNECTICUT    24.88462    0.45647  54.516  < 2e-16 ***
## areasDELAWARE        4.90385    0.45647  10.743  < 2e-16 ***
## areasDIST. OF COL.   0.42308    0.45647   0.927 0.354089    
## areasFLORIDA         1.73077    0.45647   3.792 0.000153 ***
## areasGEORGIA        -0.53846    0.45647  -1.180 0.238255    
## areasHAWAII         -0.53846    0.45647  -1.180 0.238255    
## areasIDAHO          -0.53846    0.45647  -1.180 0.238255    
## areasILLINOIS        0.73077    0.45647   1.601 0.109515    
## areasINDIANA        -0.01923    0.45647  -0.042 0.966399    
## areasIOWA            0.59615    0.45647   1.306 0.191661    
## areasKANSAS         -0.53846    0.45647  -1.180 0.238255    
## areasKENTUCKY       -0.53846    0.45647  -1.180 0.238255    
## areasLOUISIANA      -0.53846    0.45647  -1.180 0.238255    
## areasMAINE          13.57692    0.45647  29.744  < 2e-16 ***
## areasMARYLAND       19.53846    0.45647  42.804  < 2e-16 ***
## areasMASSACHUSETTS  37.94231    0.45647  83.122  < 2e-16 ***
## areasMICHIGAN        0.51923    0.45647   1.138 0.255434    
## areasMINNESOTA       0.57692    0.45647   1.264 0.206383    
## areasMISSISSIPPI    -0.53846    0.45647  -1.180 0.238255    
## areasMISSOURI       -0.53846    0.45647  -1.180 0.238255    
## areasMONTANA        -0.53846    0.45647  -1.180 0.238255    
## areasNEBRASKA       -0.53846    0.45647  -1.180 0.238255    
## areasNEVADA         -0.53846    0.45647  -1.180 0.238255    
## areasNEW HAMPSHIRE   3.96154    0.45647   8.679  < 2e-16 ***
## areasNEW JERSEY     39.50000    0.45647  86.534  < 2e-16 ***
## areasNEW MEXICO     -0.53846    0.45647  -1.180 0.238255    
## areasNEW YORK       40.84615    0.45647  89.483  < 2e-16 ***
## areasNORTH CAROLINA -0.38462    0.45647  -0.843 0.399533    
## areasNORTH DAKOTA   -0.53846    0.45647  -1.180 0.238255    
## areasOHIO            1.21154    0.45647   2.654 0.007999 ** 
## areasOKLAHOMA       -0.53846    0.45647  -1.180 0.238255    
## areasOREGON          0.30769    0.45647   0.674 0.500324    
## areasPENNSYLVANIA   88.59615    0.45647 194.091  < 2e-16 ***
## areasRHODE ISLAND   10.07692    0.45647  22.076  < 2e-16 ***
## areasSOUTH CAROLINA  0.42308    0.45647   0.927 0.354089    
## areasSOUTH DAKOTA   -0.53846    0.45647  -1.180 0.238255    
## areasTENNESSEE      -0.53846    0.45647  -1.180 0.238255    
## areasTEXAS          -0.40385    0.45647  -0.885 0.376388    
## areasUTAH           -0.53846    0.45647  -1.180 0.238255    
## areasVERMONT         6.11538    0.45647  13.397  < 2e-16 ***
## areasVIRGINIA       18.75000    0.45647  41.076  < 2e-16 ***
## areasWASHINGTON     -0.53846    0.45647  -1.180 0.238255    
## areasWEST VIRGINIA   1.30769    0.45647   2.865 0.004206 ** 
## areasWISCONSIN       8.80769    0.45647  19.295  < 2e-16 ***
## areasWYOMING        -0.53846    0.45647  -1.180 0.238255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5.417399)
## 
##     Null deviance: 659682  on 2651  degrees of freedom
## Residual deviance:  14091  on 2601  degrees of freedom
## AIC: 12059
## 
## Number of Fisher Scoring iterations: 2

fit11 <- glm(counts ~ week, data=lymefits1a)
summary(fit11)

## 
## Call:
## glm(formula = counts ~ week, data = lymefits1a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
##  -6.740   -6.664   -6.587   -2.666  115.422  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.743678   0.621699  10.847   <2e-16 ***
## week        -0.003191   0.020414  -0.156    0.876    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 248.9343)
## 
##     Null deviance: 659682  on 2651  degrees of freedom
## Residual deviance: 659676  on 2650  degrees of freedom
## AIC: 22162
## 
## Number of Fisher Scoring iterations: 2

Answer: The model explains 97.82% of the variation in the observations (adjusted R-squared = 0.9782). Consequently, the Lyme disease death counts are strongly associated with the predictors in the model. The variable “area” is significant, whereas “week” is not (p = 0.289 in the full model, and dropping it leaves the residual deviance and AIC essentially unchanged). Not all the states are significant relative to the baseline state, Alabama. The significant ones are: Connecticut, Delaware, Florida, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Ohio, Pennsylvania, Rhode Island, Vermont, Virginia, West Virginia, and Wisconsin. The Normal Q-Q plot also suggests that the residuals of the model for the US states are not normally distributed.
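The outputs of `fit1` and `fit111` above are identical because `glm()` with its default gaussian family fits the same model as `lm()`. The sketch below illustrates this, along with how the adjusted R-squared is extracted and how an F-test can check whether the numeric predictor adds anything beyond the factor. Because the CSV files are not available here, it uses the built-in `mtcars` data as a stand-in, with `cyl` playing the role of `areas` and `wt` the role of `week`; all object names are illustrative.

```r
# mtcars is a stand-in; cyl plays the role of the categorical "areas" variable
cars <- transform(mtcars, cyl = factor(cyl))

fit_glm <- glm(mpg ~ wt + cyl, data = cars)   # gaussian family by default
fit_lm  <- lm(mpg ~ wt + cyl, data = cars)

# Same coefficient estimates from both calls
all.equal(coef(fit_glm), coef(fit_lm))

# Adjusted R-squared, the kind of quantity quoted as 97.82% in the text
summary(fit_lm)$adj.r.squared

# F-test for whether the numeric predictor adds anything beyond the factor
fit_factor_only <- lm(mpg ~ cyl, data = cars)
anova(fit_factor_only, fit_lm)
```

Applied to the Lyme data, the analogous comparison would be between `counts ~ areas` and `counts ~ week + areas`.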

US and Regions

library(tidyverse)
setwd("C:/Users/auria/OneDrive/Bureau/MATH 217/Final Project")
lymefits2 <- read.csv("lymefits2.csv")

glimpse (lymefits2)

## Rows: 520
## Columns: 5
## $ Reporting_Area                     <chr> "E.N. CENTRAL", "E.S. CENTRAL", ...
## $ MMWR.Year                          <int> 2015, 2015, 2015, 2015, 2015, 20...
## $ MMWR_Week                          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Lyme.disease..Previous.52.week.Med <int> 13, 1, 186, 1, 124, 2, 44, 392, ...
## $ Lyme.disease..Cum.2014             <int> 8, 1, 77, 1, 53, 2, 11, 156, 3, ...

View(lymefits2)

lymefits2a <- lymefits2 %>%
  rename(
    areas = Reporting_Area,
    counts = `Lyme.disease..Previous.52.week.Med`,
    week = MMWR_Week
    )

fit2 <- glm(counts ~ week + areas, data=lymefits2a)
summary(fit2)

## 
## Call:
## glm(formula = counts ~ week + areas, data = lymefits2a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -38.622   -1.323   -0.100    0.987   42.423  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         13.87534    1.46461   9.474  < 2e-16 ***
## week                 0.01414    0.02694   0.525      0.6    
## areasE.S. CENTRAL  -12.75000    1.80846  -7.050 5.85e-12 ***
## areasMID. ATLANTIC 184.98077    1.80846 102.286  < 2e-16 ***
## areasMOUNTAIN      -13.25000    1.80846  -7.327 9.31e-13 ***
## areasNEW ENGLAND    94.03846    1.80846  51.999  < 2e-16 ***
## areasPACIFIC       -12.11538    1.80846  -6.699 5.56e-11 ***
## areasS. ATLANTIC    44.05769    1.80846  24.362  < 2e-16 ***
## areasUNITED STATES 401.09615    1.80846 221.789  < 2e-16 ***
## areasW.N. CENTRAL   -9.28846    1.80846  -5.136 4.00e-07 ***
## areasW.S. CENTRAL  -14.11538    1.80846  -7.805 3.40e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 85.03365)
## 
##     Null deviance: 8505772  on 519  degrees of freedom
## Residual deviance:   43282  on 509  degrees of freedom
## AIC: 3799
## 
## Number of Fisher Scoring iterations: 2

par(mfrow = c(2,2))
plot(fit2)

fit3 <- lm(counts ~ week + areas, data=lymefits2a)
summary(fit3)

## 
## Call:
## lm(formula = counts ~ week + areas, data = lymefits2a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.622  -1.323  -0.100   0.987  42.423 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         13.87534    1.46461   9.474  < 2e-16 ***
## week                 0.01414    0.02694   0.525      0.6    
## areasE.S. CENTRAL  -12.75000    1.80846  -7.050 5.85e-12 ***
## areasMID. ATLANTIC 184.98077    1.80846 102.286  < 2e-16 ***
## areasMOUNTAIN      -13.25000    1.80846  -7.327 9.31e-13 ***
## areasNEW ENGLAND    94.03846    1.80846  51.999  < 2e-16 ***
## areasPACIFIC       -12.11538    1.80846  -6.699 5.56e-11 ***
## areasS. ATLANTIC    44.05769    1.80846  24.362  < 2e-16 ***
## areasUNITED STATES 401.09615    1.80846 221.789  < 2e-16 ***
## areasW.N. CENTRAL   -9.28846    1.80846  -5.136 4.00e-07 ***
## areasW.S. CENTRAL  -14.11538    1.80846  -7.805 3.40e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.221 on 509 degrees of freedom
## Multiple R-squared:  0.9949, Adjusted R-squared:  0.9948 
## F-statistic:  9952 on 10 and 509 DF,  p-value: < 2.2e-16

fit20 <- glm(counts ~ areas, data=lymefits2a)
summary(fit20)

## 
## Call:
## glm(formula = counts ~ areas, data = lymefits2a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -38.346   -1.250   -0.135    0.865   42.769  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          14.250      1.278  11.151  < 2e-16 ***
## areasE.S. CENTRAL   -12.750      1.807  -7.055 5.65e-12 ***
## areasMID. ATLANTIC  184.981      1.807 102.359  < 2e-16 ***
## areasMOUNTAIN       -13.250      1.807  -7.332 8.97e-13 ***
## areasNEW ENGLAND     94.038      1.807  52.036  < 2e-16 ***
## areasPACIFIC        -12.115      1.807  -6.704 5.39e-11 ***
## areasS. ATLANTIC     44.058      1.807  24.379  < 2e-16 ***
## areasUNITED STATES  401.096      1.807 221.947  < 2e-16 ***
## areasW.N. CENTRAL    -9.288      1.807  -5.140 3.93e-07 ***
## areasW.S. CENTRAL   -14.115      1.807  -7.811 3.26e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 84.91282)
## 
##     Null deviance: 8505772  on 519  degrees of freedom
## Residual deviance:   43306  on 510  degrees of freedom
## AIC: 3797.2
## 
## Number of Fisher Scoring iterations: 2

fit21 <- glm(counts ~ week, data=lymefits2a)
summary(fit21)

## 
## Call:
## glm(formula = counts ~ week, data = lymefits2a)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -80.81  -78.95  -69.64   30.65  375.58  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.14072   11.40287   7.028 6.63e-12 ***
## week         0.01414    0.37442   0.038     0.97    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 16420.36)
## 
##     Null deviance: 8505772  on 519  degrees of freedom
## Residual deviance: 8505748  on 518  degrees of freedom
## AIC: 6527
## 
## Number of Fisher Scoring iterations: 2

Answer: 99.48% of the variation in the observations is explained by this model (adjusted R-squared = 0.9948). Consequently, the Lyme disease death counts are associated with the predictors area + week. The variable “area” is significant, whereas “week” is not (p-value = 0.6 in the full model and 0.97 in the week-only model). Every region in this model is significant. The Normal Q-Q plot of the residuals also indicates that the Lyme disease death counts in the US and its regions are not normally distributed (the predictor “area” is categorical).
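For a Gaussian glm, the residual deviance is the residual sum of squares, so the R-squared values reported by lm can be recovered from the glm output. The following sketch uses the deviances and degrees of freedom printed by summary(fit2); the variable names are introduced here only for illustration.

```r
# Values copied from the summary(fit2) output.
null_deviance     <- 8505772  # total sum of squares
residual_deviance <- 43282    # residual sum of squares
df_null           <- 519
df_residual       <- 509

# Share of the variation explained by the model.
r_squared <- 1 - residual_deviance / null_deviance

# Adjusted version: penalises the number of estimated coefficients.
adj_r_squared <- 1 - (residual_deviance / df_residual) / (null_deviance / df_null)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

Up to rounding, these agree with the Multiple R-squared (0.9949) and Adjusted R-squared (0.9948) printed by summary(fit3).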

Conclusion

Overall Conclusion

There is a difference in the proportion of Lyme disease death cases between US regions and within some US regions. The regions within which there is no difference in the proportion of Lyme disease death cases are W. N. Central (p-value = 0.9595), E. S. Central (p-value = 0.3916), W. S. Central (p-value = 1), Mountain (p-value = 0.9391), and Pacific (p-value = 0.9306). The regions within which there is a difference are New England (p-value = 8.992e-11), Mid. Atlantic (p-value = 0.0001513), E. N. Central (p-value = 0.0001186), and S. Atlantic (p-value = 9.17e-10).

The odds of dying of Lyme disease in the Mid. Atlantic region are 7.14 times the odds for all people in the US. The odds of dying of Lyme disease in the Pacific region are 0.04 times the odds for all people in the US.
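As a reminder of how such a figure is read, here is a minimal sketch of an odds ratio computed from event and non-event counts. The function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the numbers behind the 7.14 and 0.04 reported here.

```r
# Odds ratio of group 1 relative to group 2 from event / non-event counts.
odds_ratio <- function(events_1, non_events_1, events_2, non_events_2) {
  odds_1 <- events_1 / non_events_1
  odds_2 <- events_2 / non_events_2
  odds_1 / odds_2
}

# Hypothetical counts: (20 / 80) / (10 / 90) = 2.25
example_or <- odds_ratio(events_1 = 20, non_events_1 = 80,
                         events_2 = 10, non_events_2 = 90)
print(example_or)
```

An odds ratio above 1 (such as 7.14) means higher odds than the comparison group, and an odds ratio below 1 (such as 0.04) means lower odds.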

For the US states and for the US and its regions, more than 97% of the variation in the observations is explained by the linear regression model (US states: adjusted R-squared = 0.9782; US and regions: adjusted R-squared = 0.9948). Both models have the form counts ~ week + areas. The Lyme disease death counts are associated with area + week, and the variable “area” is significant (US states: AIC = 12060; US and regions: AIC = 3799). If we use only the variable “area”, the AIC goes down slightly (US states: AIC = 12059; US and regions: AIC = 3797.2); consequently, the model is improved. On the contrary, if we use only the variable “week”, the AIC goes up sharply (US states: AIC = 22162; US and regions: AIC = 6527). Most of the Eastern states are significant in the first model (Connecticut, Delaware, Florida, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Ohio, Pennsylvania, Rhode Island, Vermont, Virginia, West Virginia, and Wisconsin). Every region is significant in the second model.
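The AIC comparison for the US and regions models can be checked by hand. The sketch below assumes the standard Gaussian AIC formula and uses the residual deviances printed in the glm summaries for fit2, fit20, and fit21; the helper gaussian_aic is a name introduced here for illustration, not part of the original analysis.

```r
# Gaussian AIC from the residual sum of squares (RSS):
#   AIC = n * (log(2 * pi * RSS / n) + 1) + 2 * k
# where k counts the regression coefficients plus the error variance.
gaussian_aic <- function(rss, n, n_coef) {
  k <- n_coef + 1
  n * (log(2 * pi * rss / n) + 1) + 2 * k
}

n_obs <- 520  # rows in lymefits2

# Residual deviances and coefficient counts read from the model summaries.
aic_values <- c(
  week_and_areas = gaussian_aic(43282,   n_obs, n_coef = 11),
  areas_only     = gaussian_aic(43306,   n_obs, n_coef = 10),
  week_only      = gaussian_aic(8505748, n_obs, n_coef = 2)
)

print(round(aic_values, 1))
print(names(which.min(aic_values)))  # model with the lowest AIC
```

Up to rounding of the printed deviances, this reproduces the AIC values in the summaries (3799, 3797.2, and 6527). The gap between the first two models is under 2 points, so dropping “week” simplifies the model more than it changes the fit.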

Opinions

I think we need more data (American Samoa, C.N.M.I., Guam, Puerto Rico, Virgin Islands) to have a better idea of the number of deaths due to Lyme disease in the US.
The model performs well: all the US regions and many of the states are significant (both those with the smallest and those with the highest reported Lyme disease death counts). Also, the adjusted R-squared is 99.48% for the US and its regions and 97.82% for the states. However, the diagnostic plots indicate that this may not be the most appropriate model, because the relationship may not be linear. Setting the diagnostic plots aside, these are good models.

Lessons Learned

Analyzing a researcher’s dataset is challenging (coding, missing data…).
I appreciate the researchers’ work more (the challenges, the difficulty of coding, and the many hours of work).

Bibliography

https://www.kaggle.com/cdc/nndss-lyme-disease-to-meningococcal?select=nndss-table-ii.-lyme-disease-to-meningococcal.csv
https://www.cdc.gov/mmwr/about.html#:~:text=The%20Morbidity%20and%20Mortality%20Weekly%20Report%20(MMWR%20)%20series%20is%20prepared,Control%20and%20Prevention%20(CDC)

Final Project

Auriane Grippi

4/20/2021
