Topic: Proportions of US residents who died of Lyme disease in 2015 (Week 1 to Week 52).
Lyme disease, also known as Lyme borreliosis, is a bacterial infection transmitted by the bite of Ixodes ticks, and only if the tick has previously bitten an infected animal. It can be transmitted to pets and humans. Headache, chills, fever, and myalgia are the most common symptoms. A visible sign on the body is a red or purple circle or mark where the tick has bitten. It is necessary to consult a doctor only if the bitten person becomes sick. Antibiotics (doxycycline, amoxicillin, or cefuroxime) treat the disease. A tick can be removed from the body at home with fine-tipped tweezers or a tick puller. To avoid ticks and their bites, clothes can be treated with a tick spray.
The data were collected by observation. Some variables are numerical (week, Lyme disease death count, and year 2015) and one is categorical (area). The data describe the proportions of deaths among US residents who contracted Lyme disease. There are two sources of bias. First, the data are incomplete: values are missing for C.N.M.I., American Samoa, Guam, Puerto Rico, and the Virgin Islands. Consequently, we exclude those territories when answering the questions. Second, we assume that the counts represent human deaths from Lyme disease.
The Centers for Disease Control and Prevention (CDC) publishes the Morbidity and Mortality Weekly Report (MMWR) series every week. The MMWR series is the CDC’s primary medium for scientific publication of timely, credible, definitive, correct, impartial, and useful public health facts and recommendations. Physicians, nurses, public health professionals, epidemiologists and other scientists, academics, students, and laboratorians are among the most frequent readers of the MMWR. Here, the MMWR data begin with the 1st week of 2015 and end with the 52nd week of 2015. The proportions of Lyme disease deaths are reported by region and state in the USA.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.4
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
setwd("C:/Users/auria/OneDrive/Bureau/MATH 217/Final Project")
lyme <- read_csv("lymeweek.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Reporting_Area = col_character(),
## `MMWR Year` = col_double(),
## MMWR_Week = col_double(),
## `Lyme disease, Cum 2014` = col_double()
## )
glimpse(lyme)
## Rows: 3,380
## Columns: 4
## $ Reporting_Area <chr> "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS"...
## $ `MMWR Year` <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ MMWR_Week <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ `Lyme disease, Cum 2014` <dbl> NA, NA, NA, NA, 1, NA, 11, 3, NA, 8, 1, 1,...
view(lyme)
Does the proportion of Lyme death cases vary between regions in the USA?
allregions1 <- c("New England", "Mid Atlantic", "E.N. Central", "W.N. Central", "S. Atlantic", "E.S. Central", "W.S. Central", "Mountain", "Pacific")
allcases1 <- c(124, 186, 13, 2, 44, 1, 0, 1, 2)
df1 <- data.frame(allregions1, allcases1)
df1
## allregions1 allcases1
## 1 New England 124
## 2 Mid Atlantic 186
## 3 E.N. Central 13
## 4 W.N. Central 2
## 5 S. Atlantic 44
## 6 E.S. Central 1
## 7 W.S. Central 0
## 8 Mountain 1
## 9 Pacific 2
plot1 <- df1%>%
ggplot(aes(x=allregions1, y = allcases1, fill = allregions1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 45)) +
ggtitle("Number of Death by Lyme in Different Regions")+
ylab("Reported Cases") +
xlab("Regions")
plot1
Ho: There is no difference in the proportion of Lyme death cases between US regions. Ha: There is a difference in the proportion of Lyme death cases between US regions.
392
## [1] 392
392/9
## [1] 43.55556
null.probs = c(44/392, 47/392, 43/392, 43/392, 43/392, 43/392, 43/392, 43/392, 43/392)
allcases1 = c(124, 186, 13, 2, 44, 1, 0, 1, 2)
chisq.test(allcases1, p=null.probs)
##
## Chi-squared test for given probabilities
##
## data: allcases1
## X-squared = 819.53, df = 8, p-value < 2.2e-16
Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among US regions.
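A null hypothesis of "no difference between regions" corresponds to an equal probability of 1/9 for each region. As a minimal sketch (not the calculation reported above, which uses hand-adjusted probabilities), the same counts could also be tested against an exactly uniform null. The names `uniform_probs` and `res_uniform` are illustrative and do not appear in the original analysis.

```r
# Regional counts copied from the analysis above
allcases1 <- c(124, 186, 13, 2, 44, 1, 0, 1, 2)

# Exactly uniform null: every region has probability 1/9 (illustrative name)
uniform_probs <- rep(1 / length(allcases1), length(allcases1))

res_uniform <- chisq.test(allcases1, p = uniform_probs)

# Expected count per region under the uniform null
res_uniform$expected

# Test statistic, degrees of freedom and p-value
res_uniform
```

Inspecting `res_uniform$expected` is a quick way to check that every expected count is at least 5, which is the usual rule of thumb for the chi-squared approximation.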
Does the proportion of Lyme death cases vary within regions in the USA?
newenglands1 <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont")
casenewenglands1 <- c(30, 15, 42, 5, 9, 8)
df2 <- data.frame(newenglands1, casenewenglands1)
df2
## newenglands1 casenewenglands1
## 1 Connecticut 30
## 2 Maine 15
## 3 Massachusetts 42
## 4 New Hampshire 5
## 5 Rhode Island 9
## 6 Vermont 8
plot2 <- df2%>%
ggplot(aes(x=newenglands1, y = casenewenglands1, fill = newenglands1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 30)) +
ggtitle("Number of Death by Lyme in New England")+
ylab("Reported Cases") +
xlab("States")
plot2
Ho: There is no difference in the proportion of Lyme death cases among New England states. Ha: There is a difference in the proportion of Lyme death cases among New England states.
30+15+42+5+9+8
## [1] 109
109/6
## [1] 18.16667
null.probs = c(18/109, 18/109, 19/109, 18/109, 18/109, 18/109)
casenewenglands1 = c(30, 15, 42, 5, 9, 8)
chisq.test(casenewenglands1, p=null.probs)
##
## Chi-squared test for given probabilities
##
## data: casenewenglands1
## X-squared = 55.787, df = 5, p-value = 8.992e-11
Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among New England states.
ma1 <- c("New Jersey", "New York", "Pennsylvania")
casema1 <- c(39, 44, 79)
df3 <- data.frame(ma1, casema1)
df3
## ma1 casema1
## 1 New Jersey 39
## 2 New York 44
## 3 Pennsylvania 79
plot3 <- df3%>%
ggplot(aes(x=ma1, y = casema1, fill = ma1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 360)) +
ggtitle("Number of Death by Lyme in Mid. Atlantic")+
ylab("Reported Cases") +
xlab("States")
plot3
Ho: There is no difference in the proportion of Lyme death cases among Mid Atlantic states. Ha: There is a difference in the proportion of Lyme death cases among Mid Atlantic states.
39+44+79
## [1] 162
162/3
## [1] 54
null.probs = c(54/162, 54/162, 54/162)
casem1 = c(39, 44, 79)
chisq.test(casem1, p=null.probs)
##
## Chi-squared test for given probabilities
##
## data: casem1
## X-squared = 17.593, df = 2, p-value = 0.0001513
Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among Mid Atlantic states.
enc1 <- c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin")
caseenc1 <- c(1, 1, 1, 1, 10)
df4 <- data.frame(enc1, caseenc1)
df4
## enc1 caseenc1
## 1 Illinois 1
## 2 Indiana 1
## 3 Michigan 1
## 4 Ohio 1
## 5 Wisconsin 10
plot4 <- df4%>%
ggplot(aes(x=enc1, y = caseenc1, fill = enc1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 30)) +
ggtitle("Number of Death by Lyme in E. N. Central")+
ylab("Reported Cases") +
xlab("States")
plot4
Ho: There is no difference in the proportion of Lyme death cases among E. N. Central states. Ha: There is a difference in the proportion of Lyme death cases among E. N. Central states.
1+1+1+1+10
## [1] 14
14/5
## [1] 2.8
null.probs = c(2.8/14, 2.8/14, 2.8/14, 2.8/14, 2.8/14)
caseenc1 = c(1,1,1,1,10)
chisq.test(caseenc1, p=null.probs)
## Warning in chisq.test(caseenc1, p = null.probs): Chi-squared approximation may
## be incorrect
##
## Chi-squared test for given probabilities
##
## data: caseenc1
## X-squared = 23.143, df = 4, p-value = 0.0001186
Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among E. N. Central states.
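The warning "Chi-squared approximation may be incorrect" is typically triggered when expected counts are small (here 2.8 per state). One possible way to check the conclusion, sketched here as an assumption rather than as part of the original analysis, is the Monte Carlo option of `chisq.test()` (`simulate.p.value = TRUE` with `B` replicates). The seed and the value of `B` are arbitrary choices, and `res_sim` is an illustrative name.

```r
set.seed(217)  # arbitrary seed so the simulation is reproducible

caseenc1 <- c(1, 1, 1, 1, 10)
null.probs <- rep(1 / 5, 5)  # equal share for each of the five states

# Monte Carlo p-value instead of the chi-squared approximation
res_sim <- chisq.test(caseenc1, p = null.probs,
                      simulate.p.value = TRUE, B = 10000)

res_sim$statistic  # same X-squared statistic as the asymptotic test
res_sim$p.value    # simulated p-value
```

The same option could be applied to the other regions whose tests produce this warning.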
wnc1 <- c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota")
casewnc1 <- c(1, 0, 0, 0, 0, 0, 0)
df10 <- data.frame(wnc1, casewnc1)
df10
## wnc1 casewnc1
## 1 Iowa 1
## 2 Kansas 0
## 3 Minnesota 0
## 4 Missouri 0
## 5 Nebraska 0
## 6 North Dakota 0
## 7 South Dakota 0
plot10 <- df10 %>%
ggplot(aes(x=wnc1, y = casewnc1, fill = wnc1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 30)) +
ggtitle("Number of Death by Lyme in W. N. Central")+
ylab("Reported Cases") +
xlab("States")
plot10
Ho: There is no difference in the proportion of Lyme death cases among W. N. Central states. Ha: There is a difference in the proportion of Lyme death cases among W. N. Central states.
1
## [1] 1
1/7
## [1] 0.1428571
null.probs = c(0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
casewnc1 = c(1, 0, 0, 0, 0, 0, 0)
chisq.test(casewnc1, p=null.probs)
## Warning in chisq.test(casewnc1, p = null.probs): Chi-squared approximation may
## be incorrect
##
## Chi-squared test for given probabilities
##
## data: casewnc1
## X-squared = 1.5, df = 6, p-value = 0.9595
Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among W. N. Central states.
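To make the reported statistic concrete, the value X-squared = 1.5 can be reproduced by hand from the observed counts and the null probabilities used above: each expected count is the total multiplied by its probability, and the statistic sums (observed - expected)^2 / expected over the states. The names `expected`, `x_squared`, and `p_value` are illustrative.

```r
casewnc1 <- c(1, 0, 0, 0, 0, 0, 0)
null.probs <- c(0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)

# Expected counts under the null: total count times each probability
expected <- sum(casewnc1) * null.probs

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((casewnc1 - expected)^2 / expected)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
p_value <- pchisq(x_squared, df = length(casewnc1) - 1, lower.tail = FALSE)

x_squared
p_value
```

Note that these null probabilities are not equal across the seven states; a strictly uniform null would use `rep(1/7, 7)` instead.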
sa1 <- c("Delaware", "DC", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "West Virginia")
casesa1 <- c(6, 0, 2, 0, 15, 0, 0, 16, 1)
df5 <- data.frame(sa1, casesa1)
df5
## sa1 casesa1
## 1 Delaware 6
## 2 DC 0
## 3 Florida 2
## 4 Georgia 0
## 5 Maryland 15
## 6 North Carolina 0
## 7 South Carolina 0
## 8 Virginia 16
## 9 West Virginia 1
plot5 <- df5%>%
ggplot(aes(x=sa1, y = casesa1, fill = sa1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 30)) +
ggtitle("Number of Death by Lyme in S. Atlantic")+
ylab("Reported Cases") +
xlab("States")
plot5
Ho: There is no difference in the proportion of Lyme death cases among S. Atlantic states. Ha: There is a difference in the proportion of Lyme death cases among S. Atlantic states.
6+0+2+0+15+0+0+16+1
## [1] 40
40/9
## [1] 4.444444
null.probs = c(4/40, 4/40, 4/40, 4/40, 4/40, 4/40, 4/40, 8/40, 4/40)
casesa1 = c(6, 0, 2, 0, 15, 0, 0, 16, 1)
chisq.test(casesa1, p=null.probs)
## Warning in chisq.test(casesa1, p = null.probs): Chi-squared approximation may be
## incorrect
##
## Chi-squared test for given probabilities
##
## data: casesa1
## X-squared = 58.5, df = 8, p-value = 9.17e-10
Answer: p-value < α. Reject the null. There is very strong evidence that the proportion of Lyme death cases differs among S. Atlantic states.
esc1 <- c("Alabama", "Kentucky", "Mississippi", "Tennessee")
caseesc1 <- c(0, 1, 0, 0)
df6 <- data.frame(esc1, caseesc1)
df6
## esc1 caseesc1
## 1 Alabama 0
## 2 Kentucky 1
## 3 Mississippi 0
## 4 Tennessee 0
plot6 <- df6%>%
ggplot(aes(x=esc1, y = caseesc1, fill = esc1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 360)) +
ggtitle("Number of Death by Lyme in E. S. Central")+
ylab("Reported Cases") +
xlab("States")
plot6
Ho: There is no difference in the proportion of Lyme death cases among E. S. Central states. Ha: There is a difference in the proportion of Lyme death cases among E. S. Central states.
1
## [1] 1
1/4
## [1] 0.25
null.probs = c(0.25, 0.25, 0.25, 0.25)
caseesc1 = c(0, 1, 0, 0)
chisq.test(caseesc1, p=null.probs)
## Warning in chisq.test(caseesc1, p = null.probs): Chi-squared approximation may
## be incorrect
##
## Chi-squared test for given probabilities
##
## data: caseesc1
## X-squared = 3, df = 3, p-value = 0.3916
Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among E. S. Central states.
wsc1 <- c("Arkansas", "Louisiana", "Oklahoma", "Texas")
casewsc1 <- c(0, 0, 0, 0)
df7 <- data.frame(wsc1, casewsc1)
df7
## wsc1 casewsc1
## 1 Arkansas 0
## 2 Louisiana 0
## 3 Oklahoma 0
## 4 Texas 0
plot7 <- df7%>%
ggplot(aes(x=wsc1, y = casewsc1, fill = wsc1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 360)) +
ggtitle("Number of Death by Lyme in W. S. Central")+
ylab("Reported Cases") +
xlab("States")
plot7
Ho: There is no difference in the proportion of Lyme death cases among W. S. Central states. Ha: There is a difference in the proportion of Lyme death cases among W. S. Central states.
Answer: By inspection, every state reports zero cases, so the chi-squared statistic cannot be computed (all expected counts are zero). Since all the observed values are the same, we take the p-value to be 1. Consequently, p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among W. S. Central states.
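When every count is zero, the expected counts are also zero and the statistic is undefined; in that situation `chisq.test()` stops with an error instead of returning a p-value. A small sketch of guarding against this case, using a hypothetical helper `safe_gof()` that is not part of the original analysis:

```r
# Hypothetical helper: run the goodness-of-fit test only when it is defined
safe_gof <- function(counts, probs = rep(1 / length(counts), length(counts))) {
  if (sum(counts) == 0) {
    # No observations at all: the test cannot be computed
    return(NULL)
  }
  chisq.test(counts, p = probs)
}

casewsc1 <- c(0, 0, 0, 0)
result <- safe_gof(casewsc1)
is.null(result)  # TRUE: no test was run for W. S. Central
```

Returning `NULL` is only one possible design; the helper could equally report the region as "not testable" in a summary table.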
m1 <- c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming")
casem1 <- c(0, 0, 0, 0, 1, 0, 0, 0)
df8 <- data.frame(m1, casem1)
df8
## m1 casem1
## 1 Arizona 0
## 2 Colorado 0
## 3 Idaho 0
## 4 Montana 0
## 5 Nevada 1
## 6 New Mexico 0
## 7 Utah 0
## 8 Wyoming 0
plot8 <- df8%>%
ggplot(aes(x=m1, y = casem1, fill = m1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 30)) +
ggtitle("Number of Death by Lyme in Mountain")+
ylab("Reported Cases") +
xlab("States")
plot8
Ho: There is no difference in the proportion of Lyme death cases among Mountain states. Ha: There is a difference in the proportion of Lyme death cases among Mountain states.
1
## [1] 1
1/8
## [1] 0.125
null.probs = c(0.1, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1)
casem1 = c(0, 0, 0, 0, 1, 0, 0, 0)
chisq.test(casem1, p=null.probs)
## Warning in chisq.test(casem1, p = null.probs): Chi-squared approximation may be
## incorrect
##
## Chi-squared test for given probabilities
##
## data: casem1
## X-squared = 2.3333, df = 7, p-value = 0.9391
Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among Mountain states.
p1 <- c("Alaska", "California", "Hawaii", "Oregon", "Washington")
casep1 <- c(0, 1, 0, 1, 0)
df9 <- data.frame(p1, casep1)
df9
## p1 casep1
## 1 Alaska 0
## 2 California 1
## 3 Hawaii 0
## 4 Oregon 1
## 5 Washington 0
plot9 <- df9%>%
ggplot(aes(x=p1, y = casep1, fill = p1)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 360)) +
ggtitle("Number of Death by Lyme in Pacific")+
ylab("Reported Cases") +
xlab("States")
plot9
Ho: There is no difference in the proportion of Lyme death cases among Pacific states. Ha: There is a difference in the proportion of Lyme death cases among Pacific states.
1+1
## [1] 2
2/5
## [1] 0.4
null.probs = c(0.1, 0.35, 0.1, 0.35, 0.1)
casep1 = c(0, 1, 0, 1, 0)
chisq.test(casep1, p=null.probs)
## Warning in chisq.test(casep1, p = null.probs): Chi-squared approximation may be
## incorrect
##
## Chi-squared test for given probabilities
##
## data: casep1
## X-squared = 0.85714, df = 4, p-value = 0.9306
Answer: p-value > α. Fail to reject the null. There is no compelling evidence that the proportion of Lyme death cases differs among Pacific states.
MidAtlantic <- c(186, 44)
Others1 <- c(206, 348)
testma <- data.frame(MidAtlantic, Others1)
testma
## MidAtlantic Others1
## 1 186 206
## 2 44 348
392/9
## [1] 43.55556
392-44
## [1] 348
(186*348)/(44*206)
## [1] 7.141218
Answer: The odds of dying of Lyme disease in the Mid Atlantic region are 7.14 times as high as for the US as a whole.
Pacific <- c(2, 44)
Others2 <- c(390, 348)
testp <- data.frame(Pacific, Others2)
testp
## Pacific Others2
## 1 2 390
## 2 44 348
392/9
## [1] 43.55556
392-44
## [1] 348
(2*348)/(44*390)
## [1] 0.04055944
Answer: The odds of dying of Lyme disease in the Pacific region are 0.04 times those for the US as a whole, that is, substantially lower.
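Both odds ratios above follow the cross-product formula for a 2x2 table. A minimal sketch, with `odds_ratio()` as a hypothetical helper that is not part of the original analysis:

```r
# Hypothetical helper: cross-product odds ratio for a 2x2 table laid out as
#   n11  n12
#   n21  n22
odds_ratio <- function(n11, n12, n21, n22) {
  (n11 * n22) / (n12 * n21)
}

# Mid Atlantic: 186 and 206 in the first row, 44 and 348 in the second
or_midatlantic <- odds_ratio(186, 206, 44, 348)

# Pacific: 2 and 390 in the first row, 44 and 348 in the second
or_pacific <- odds_ratio(2, 390, 44, 348)

or_midatlantic
or_pacific
```

A value above 1 indicates higher odds than the comparison group, and a value below 1 indicates lower odds.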
library(tidyverse)
setwd("C:/Users/auria/OneDrive/Bureau/MATH 217/Final Project")
lymefits1 <- read_csv("lymefits1.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Reporting_Area = col_character(),
## MMWR_Year = col_double(),
## MMWR_Week = col_double(),
## `Lyme disease, Previous 52 weeks Med` = col_double(),
## `Lyme disease, Cum 2014` = col_double()
## )
lymefits1a <- lymefits1 %>%
rename(
areas = Reporting_Area,
counts = `Lyme disease, Previous 52 weeks Med`,
week = MMWR_Week
)
glimpse(lymefits1)
## Rows: 2,652
## Columns: 5
## $ Reporting_Area <chr> "ALABAMA", "ALASKA", "ARIZONA...
## $ MMWR_Year <dbl> 2015, 2015, 2015, 2015, 2015,...
## $ MMWR_Week <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ `Lyme disease, Previous 52 weeks Med` <dbl> 0, 0, 0, 0, 0, 0, 30, 6, 0, 2...
## $ `Lyme disease, Cum 2014` <dbl> NA, NA, NA, NA, 1, NA, 11, 3,...
View(lymefits1)
fit1 <- glm(counts ~ week + areas, data=lymefits1a)
summary(fit1)
##
## Call:
## glm(formula = counts ~ week + areas, data = lymefits1a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.409 -0.101 -0.002 0.081 32.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.623015 0.332482 1.874 0.061066 .
## week -0.003191 0.003011 -1.060 0.289451
## areasALASKA -0.538462 0.456456 -1.180 0.238244
## areasARIZONA -0.538462 0.456456 -1.180 0.238244
## areasARKANSAS -0.538462 0.456456 -1.180 0.238244
## areasCALIFORNIA 0.019231 0.456456 0.042 0.966398
## areasCOLORADO -0.538462 0.456456 -1.180 0.238244
## areasCONNECTICUT 24.884615 0.456456 54.517 < 2e-16 ***
## areasDELAWARE 4.903846 0.456456 10.743 < 2e-16 ***
## areasDIST. OF COL. 0.423077 0.456456 0.927 0.354078
## areasFLORIDA 1.730769 0.456456 3.792 0.000153 ***
## areasGEORGIA -0.538462 0.456456 -1.180 0.238244
## areasHAWAII -0.538462 0.456456 -1.180 0.238244
## areasIDAHO -0.538462 0.456456 -1.180 0.238244
## areasILLINOIS 0.730769 0.456456 1.601 0.109506
## areasINDIANA -0.019231 0.456456 -0.042 0.966398
## areasIOWA 0.596154 0.456456 1.306 0.191651
## areasKANSAS -0.538462 0.456456 -1.180 0.238244
## areasKENTUCKY -0.538462 0.456456 -1.180 0.238244
## areasLOUISIANA -0.538462 0.456456 -1.180 0.238244
## areasMAINE 13.576923 0.456456 29.744 < 2e-16 ***
## areasMARYLAND 19.538462 0.456456 42.805 < 2e-16 ***
## areasMASSACHUSETTS 37.942308 0.456456 83.124 < 2e-16 ***
## areasMICHIGAN 0.519231 0.456456 1.138 0.255423
## areasMINNESOTA 0.576923 0.456456 1.264 0.206372
## areasMISSISSIPPI -0.538462 0.456456 -1.180 0.238244
## areasMISSOURI -0.538462 0.456456 -1.180 0.238244
## areasMONTANA -0.538462 0.456456 -1.180 0.238244
## areasNEBRASKA -0.538462 0.456456 -1.180 0.238244
## areasNEVADA -0.538462 0.456456 -1.180 0.238244
## areasNEW HAMPSHIRE 3.961538 0.456456 8.679 < 2e-16 ***
## areasNEW JERSEY 39.500000 0.456456 86.536 < 2e-16 ***
## areasNEW MEXICO -0.538462 0.456456 -1.180 0.238244
## areasNEW YORK 40.846154 0.456456 89.486 < 2e-16 ***
## areasNORTH CAROLINA -0.384615 0.456456 -0.843 0.399522
## areasNORTH DAKOTA -0.538462 0.456456 -1.180 0.238244
## areasOHIO 1.211538 0.456456 2.654 0.007997 **
## areasOKLAHOMA -0.538462 0.456456 -1.180 0.238244
## areasOREGON 0.307692 0.456456 0.674 0.500314
## areasPENNSYLVANIA 88.596154 0.456456 194.096 < 2e-16 ***
## areasRHODE ISLAND 10.076923 0.456456 22.076 < 2e-16 ***
## areasSOUTH CAROLINA 0.423077 0.456456 0.927 0.354078
## areasSOUTH DAKOTA -0.538462 0.456456 -1.180 0.238244
## areasTENNESSEE -0.538462 0.456456 -1.180 0.238244
## areasTEXAS -0.403846 0.456456 -0.885 0.376377
## areasUTAH -0.538462 0.456456 -1.180 0.238244
## areasVERMONT 6.115385 0.456456 13.398 < 2e-16 ***
## areasVIRGINIA 18.750000 0.456456 41.077 < 2e-16 ***
## areasWASHINGTON -0.538462 0.456456 -1.180 0.238244
## areasWEST VIRGINIA 1.307692 0.456456 2.865 0.004205 **
## areasWISCONSIN 8.807692 0.456456 19.296 < 2e-16 ***
## areasWYOMING -0.538462 0.456456 -1.180 0.238244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 5.417143)
##
## Null deviance: 659682 on 2651 degrees of freedom
## Residual deviance: 14085 on 2600 degrees of freedom
## AIC: 12060
##
## Number of Fisher Scoring iterations: 2
par(mfrow = c(2,2))
plot(fit1)
fit111 <- lm(counts ~ week + areas, data=lymefits1a)
summary(fit111)
##
## Call:
## lm(formula = counts ~ week + areas, data = lymefits1a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.409 -0.101 -0.002 0.081 32.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.623015 0.332482 1.874 0.061066 .
## week -0.003191 0.003011 -1.060 0.289451
## areasALASKA -0.538462 0.456456 -1.180 0.238244
## areasARIZONA -0.538462 0.456456 -1.180 0.238244
## areasARKANSAS -0.538462 0.456456 -1.180 0.238244
## areasCALIFORNIA 0.019231 0.456456 0.042 0.966398
## areasCOLORADO -0.538462 0.456456 -1.180 0.238244
## areasCONNECTICUT 24.884615 0.456456 54.517 < 2e-16 ***
## areasDELAWARE 4.903846 0.456456 10.743 < 2e-16 ***
## areasDIST. OF COL. 0.423077 0.456456 0.927 0.354078
## areasFLORIDA 1.730769 0.456456 3.792 0.000153 ***
## areasGEORGIA -0.538462 0.456456 -1.180 0.238244
## areasHAWAII -0.538462 0.456456 -1.180 0.238244
## areasIDAHO -0.538462 0.456456 -1.180 0.238244
## areasILLINOIS 0.730769 0.456456 1.601 0.109506
## areasINDIANA -0.019231 0.456456 -0.042 0.966398
## areasIOWA 0.596154 0.456456 1.306 0.191651
## areasKANSAS -0.538462 0.456456 -1.180 0.238244
## areasKENTUCKY -0.538462 0.456456 -1.180 0.238244
## areasLOUISIANA -0.538462 0.456456 -1.180 0.238244
## areasMAINE 13.576923 0.456456 29.744 < 2e-16 ***
## areasMARYLAND 19.538462 0.456456 42.805 < 2e-16 ***
## areasMASSACHUSETTS 37.942308 0.456456 83.124 < 2e-16 ***
## areasMICHIGAN 0.519231 0.456456 1.138 0.255423
## areasMINNESOTA 0.576923 0.456456 1.264 0.206372
## areasMISSISSIPPI -0.538462 0.456456 -1.180 0.238244
## areasMISSOURI -0.538462 0.456456 -1.180 0.238244
## areasMONTANA -0.538462 0.456456 -1.180 0.238244
## areasNEBRASKA -0.538462 0.456456 -1.180 0.238244
## areasNEVADA -0.538462 0.456456 -1.180 0.238244
## areasNEW HAMPSHIRE 3.961538 0.456456 8.679 < 2e-16 ***
## areasNEW JERSEY 39.500000 0.456456 86.536 < 2e-16 ***
## areasNEW MEXICO -0.538462 0.456456 -1.180 0.238244
## areasNEW YORK 40.846154 0.456456 89.486 < 2e-16 ***
## areasNORTH CAROLINA -0.384615 0.456456 -0.843 0.399522
## areasNORTH DAKOTA -0.538462 0.456456 -1.180 0.238244
## areasOHIO 1.211538 0.456456 2.654 0.007997 **
## areasOKLAHOMA -0.538462 0.456456 -1.180 0.238244
## areasOREGON 0.307692 0.456456 0.674 0.500314
## areasPENNSYLVANIA 88.596154 0.456456 194.096 < 2e-16 ***
## areasRHODE ISLAND 10.076923 0.456456 22.076 < 2e-16 ***
## areasSOUTH CAROLINA 0.423077 0.456456 0.927 0.354078
## areasSOUTH DAKOTA -0.538462 0.456456 -1.180 0.238244
## areasTENNESSEE -0.538462 0.456456 -1.180 0.238244
## areasTEXAS -0.403846 0.456456 -0.885 0.376377
## areasUTAH -0.538462 0.456456 -1.180 0.238244
## areasVERMONT 6.115385 0.456456 13.398 < 2e-16 ***
## areasVIRGINIA 18.750000 0.456456 41.077 < 2e-16 ***
## areasWASHINGTON -0.538462 0.456456 -1.180 0.238244
## areasWEST VIRGINIA 1.307692 0.456456 2.865 0.004205 **
## areasWISCONSIN 8.807692 0.456456 19.296 < 2e-16 ***
## areasWYOMING -0.538462 0.456456 -1.180 0.238244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.327 on 2600 degrees of freedom
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9782
## F-statistic: 2337 on 51 and 2600 DF, p-value: < 2.2e-16
fit10 <- glm(counts ~ areas, data=lymefits1a)
summary(fit10)
##
## Call:
## glm(formula = counts ~ areas, data = lymefits1a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.481 -0.135 0.000 0.038 32.865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53846 0.32277 1.668 0.095387 .
## areasALASKA -0.53846 0.45647 -1.180 0.238255
## areasARIZONA -0.53846 0.45647 -1.180 0.238255
## areasARKANSAS -0.53846 0.45647 -1.180 0.238255
## areasCALIFORNIA 0.01923 0.45647 0.042 0.966399
## areasCOLORADO -0.53846 0.45647 -1.180 0.238255
## areasCONNECTICUT 24.88462 0.45647 54.516 < 2e-16 ***
## areasDELAWARE 4.90385 0.45647 10.743 < 2e-16 ***
## areasDIST. OF COL. 0.42308 0.45647 0.927 0.354089
## areasFLORIDA 1.73077 0.45647 3.792 0.000153 ***
## areasGEORGIA -0.53846 0.45647 -1.180 0.238255
## areasHAWAII -0.53846 0.45647 -1.180 0.238255
## areasIDAHO -0.53846 0.45647 -1.180 0.238255
## areasILLINOIS 0.73077 0.45647 1.601 0.109515
## areasINDIANA -0.01923 0.45647 -0.042 0.966399
## areasIOWA 0.59615 0.45647 1.306 0.191661
## areasKANSAS -0.53846 0.45647 -1.180 0.238255
## areasKENTUCKY -0.53846 0.45647 -1.180 0.238255
## areasLOUISIANA -0.53846 0.45647 -1.180 0.238255
## areasMAINE 13.57692 0.45647 29.744 < 2e-16 ***
## areasMARYLAND 19.53846 0.45647 42.804 < 2e-16 ***
## areasMASSACHUSETTS 37.94231 0.45647 83.122 < 2e-16 ***
## areasMICHIGAN 0.51923 0.45647 1.138 0.255434
## areasMINNESOTA 0.57692 0.45647 1.264 0.206383
## areasMISSISSIPPI -0.53846 0.45647 -1.180 0.238255
## areasMISSOURI -0.53846 0.45647 -1.180 0.238255
## areasMONTANA -0.53846 0.45647 -1.180 0.238255
## areasNEBRASKA -0.53846 0.45647 -1.180 0.238255
## areasNEVADA -0.53846 0.45647 -1.180 0.238255
## areasNEW HAMPSHIRE 3.96154 0.45647 8.679 < 2e-16 ***
## areasNEW JERSEY 39.50000 0.45647 86.534 < 2e-16 ***
## areasNEW MEXICO -0.53846 0.45647 -1.180 0.238255
## areasNEW YORK 40.84615 0.45647 89.483 < 2e-16 ***
## areasNORTH CAROLINA -0.38462 0.45647 -0.843 0.399533
## areasNORTH DAKOTA -0.53846 0.45647 -1.180 0.238255
## areasOHIO 1.21154 0.45647 2.654 0.007999 **
## areasOKLAHOMA -0.53846 0.45647 -1.180 0.238255
## areasOREGON 0.30769 0.45647 0.674 0.500324
## areasPENNSYLVANIA 88.59615 0.45647 194.091 < 2e-16 ***
## areasRHODE ISLAND 10.07692 0.45647 22.076 < 2e-16 ***
## areasSOUTH CAROLINA 0.42308 0.45647 0.927 0.354089
## areasSOUTH DAKOTA -0.53846 0.45647 -1.180 0.238255
## areasTENNESSEE -0.53846 0.45647 -1.180 0.238255
## areasTEXAS -0.40385 0.45647 -0.885 0.376388
## areasUTAH -0.53846 0.45647 -1.180 0.238255
## areasVERMONT 6.11538 0.45647 13.397 < 2e-16 ***
## areasVIRGINIA 18.75000 0.45647 41.076 < 2e-16 ***
## areasWASHINGTON -0.53846 0.45647 -1.180 0.238255
## areasWEST VIRGINIA 1.30769 0.45647 2.865 0.004206 **
## areasWISCONSIN 8.80769 0.45647 19.295 < 2e-16 ***
## areasWYOMING -0.53846 0.45647 -1.180 0.238255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 5.417399)
##
## Null deviance: 659682 on 2651 degrees of freedom
## Residual deviance: 14091 on 2601 degrees of freedom
## AIC: 12059
##
## Number of Fisher Scoring iterations: 2
fit11 <- glm(counts ~ week, data=lymefits1a)
summary(fit11)
##
## Call:
## glm(formula = counts ~ week, data = lymefits1a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.740 -6.664 -6.587 -2.666 115.422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.743678 0.621699 10.847 <2e-16 ***
## week -0.003191 0.020414 -0.156 0.876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 248.9343)
##
## Null deviance: 659682 on 2651 degrees of freedom
## Residual deviance: 659676 on 2650 degrees of freedom
## AIC: 22162
##
## Number of Fisher Scoring iterations: 2
Answer: 97.82% of the variation in the observations is explained by this model (adjusted R-squared = 0.9782). Consequently, there is an association between the Lyme disease death counts and the area plus the week. The variable “area” is significant, whereas “week” is not (p = 0.289). Not all the states are significant. The significant ones are: Connecticut, Delaware, Florida, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Ohio, Pennsylvania, Rhode Island, Vermont, Virginia, West Virginia, and Wisconsin. We can also say, based on the Normal Q-Q plot, that the model residuals for Lyme disease death cases in the US states are not normally distributed (categorical variables).
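The `glm()` and `lm()` summaries above show identical coefficients because `glm()` defaults to the Gaussian family with an identity link, which is ordinary least squares. A small sketch on simulated data illustrates this; the data frame `toy` and its values are made up for illustration and are not taken from `lymefits1.csv`.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-in data: three areas observed over 52 weeks
toy <- data.frame(
  week  = rep(1:52, times = 3),
  areas = rep(c("A", "B", "C"), each = 52)
)
area_effect <- c(A = 1, B = 10, C = 40)
toy$counts <- unname(area_effect[toy$areas]) + rnorm(nrow(toy), sd = 2)

fit_glm <- glm(counts ~ week + areas, data = toy)  # family = gaussian by default
fit_lm  <- lm(counts ~ week + areas, data = toy)

# The two fits give the same coefficient estimates
all.equal(coef(fit_glm), coef(fit_lm))
```

Because the fits coincide, the `lm()` summary is mainly useful here for the R-squared values, while the `glm()` summary reports the AIC.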
library(tidyverse)
setwd("C:/Users/auria/OneDrive/Bureau/MATH 217/Final Project")
lymefits2 <- read.csv("lymefits2.csv")
glimpse (lymefits2)
## Rows: 520
## Columns: 5
## $ Reporting_Area <chr> "E.N. CENTRAL", "E.S. CENTRAL", ...
## $ MMWR.Year <int> 2015, 2015, 2015, 2015, 2015, 20...
## $ MMWR_Week <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Lyme.disease..Previous.52.week.Med <int> 13, 1, 186, 1, 124, 2, 44, 392, ...
## $ Lyme.disease..Cum.2014 <int> 8, 1, 77, 1, 53, 2, 11, 156, 3, ...
View(lymefits2)
lymefits2a <- lymefits2 %>%
rename(
areas = Reporting_Area,
counts = `Lyme.disease..Previous.52.week.Med`,
week = MMWR_Week
)
fit2 <- glm(counts ~ week + areas, data=lymefits2a)
summary(fit2)
##
## Call:
## glm(formula = counts ~ week + areas, data = lymefits2a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -38.622 -1.323 -0.100 0.987 42.423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.87534 1.46461 9.474 < 2e-16 ***
## week 0.01414 0.02694 0.525 0.6
## areasE.S. CENTRAL -12.75000 1.80846 -7.050 5.85e-12 ***
## areasMID. ATLANTIC 184.98077 1.80846 102.286 < 2e-16 ***
## areasMOUNTAIN -13.25000 1.80846 -7.327 9.31e-13 ***
## areasNEW ENGLAND 94.03846 1.80846 51.999 < 2e-16 ***
## areasPACIFIC -12.11538 1.80846 -6.699 5.56e-11 ***
## areasS. ATLANTIC 44.05769 1.80846 24.362 < 2e-16 ***
## areasUNITED STATES 401.09615 1.80846 221.789 < 2e-16 ***
## areasW.N. CENTRAL -9.28846 1.80846 -5.136 4.00e-07 ***
## areasW.S. CENTRAL -14.11538 1.80846 -7.805 3.40e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 85.03365)
##
## Null deviance: 8505772 on 519 degrees of freedom
## Residual deviance: 43282 on 509 degrees of freedom
## AIC: 3799
##
## Number of Fisher Scoring iterations: 2
par(mfrow = c(2,2))
plot(fit2)
fit3 <- lm(counts ~ week + areas, data=lymefits2a)
summary(fit3)
##
## Call:
## lm(formula = counts ~ week + areas, data = lymefits2a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.622 -1.323 -0.100 0.987 42.423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.87534 1.46461 9.474 < 2e-16 ***
## week 0.01414 0.02694 0.525 0.6
## areasE.S. CENTRAL -12.75000 1.80846 -7.050 5.85e-12 ***
## areasMID. ATLANTIC 184.98077 1.80846 102.286 < 2e-16 ***
## areasMOUNTAIN -13.25000 1.80846 -7.327 9.31e-13 ***
## areasNEW ENGLAND 94.03846 1.80846 51.999 < 2e-16 ***
## areasPACIFIC -12.11538 1.80846 -6.699 5.56e-11 ***
## areasS. ATLANTIC 44.05769 1.80846 24.362 < 2e-16 ***
## areasUNITED STATES 401.09615 1.80846 221.789 < 2e-16 ***
## areasW.N. CENTRAL -9.28846 1.80846 -5.136 4.00e-07 ***
## areasW.S. CENTRAL -14.11538 1.80846 -7.805 3.40e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.221 on 509 degrees of freedom
## Multiple R-squared: 0.9949, Adjusted R-squared: 0.9948
## F-statistic: 9952 on 10 and 509 DF, p-value: < 2.2e-16
fit20 <- glm(counts ~ areas, data=lymefits2a)
summary(fit20)
##
## Call:
## glm(formula = counts ~ areas, data = lymefits2a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -38.346 -1.250 -0.135 0.865 42.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.250 1.278 11.151 < 2e-16 ***
## areasE.S. CENTRAL -12.750 1.807 -7.055 5.65e-12 ***
## areasMID. ATLANTIC 184.981 1.807 102.359 < 2e-16 ***
## areasMOUNTAIN -13.250 1.807 -7.332 8.97e-13 ***
## areasNEW ENGLAND 94.038 1.807 52.036 < 2e-16 ***
## areasPACIFIC -12.115 1.807 -6.704 5.39e-11 ***
## areasS. ATLANTIC 44.058 1.807 24.379 < 2e-16 ***
## areasUNITED STATES 401.096 1.807 221.947 < 2e-16 ***
## areasW.N. CENTRAL -9.288 1.807 -5.140 3.93e-07 ***
## areasW.S. CENTRAL -14.115 1.807 -7.811 3.26e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 84.91282)
##
## Null deviance: 8505772 on 519 degrees of freedom
## Residual deviance: 43306 on 510 degrees of freedom
## AIC: 3797.2
##
## Number of Fisher Scoring iterations: 2
fit21 <- glm(counts ~ week, data=lymefits2a)
summary(fit21)
##
## Call:
## glm(formula = counts ~ week, data = lymefits2a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -80.81 -78.95 -69.64 30.65 375.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.14072 11.40287 7.028 6.63e-12 ***
## week 0.01414 0.37442 0.038 0.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 16420.36)
##
## Null deviance: 8505772 on 519 degrees of freedom
## Residual deviance: 8505748 on 518 degrees of freedom
## AIC: 6527
##
## Number of Fisher Scoring iterations: 2
Answer: 99.48% of the variation in the observations is explained by this model (adjusted R-squared = 0.9948). Consequently, there is an association between the Lyme disease death counts and the area plus the week. The variable “area” is significant, whereas “week” is not (p = 0.6). Every region in this model is significant. We can also say, based on the Normal Q-Q plot, that the model residuals for Lyme disease death cases in the US and its regions are not normally distributed (categorical variables).
There is a difference in the proportion of Lyme death cases between US regions and within some US regions. The regions where there is no evidence of a difference in the proportion of Lyme death cases are W. N. Central (p-value = 0.9595), E. S. Central (p-value = 0.3916), W. S. Central (p-value = 1), Mountain (p-value = 0.9391), and Pacific (p-value = 0.9306). The regions where there is a difference in the proportion of Lyme death cases are New England (p-value = 8.992e-11), Mid. Atlantic (p-value = 0.0001513), E. N. Central (p-value = 0.0001186), and S. Atlantic (p-value = 9.17e-10).
The odds of dying of Lyme disease in the Mid Atlantic region are 7.14 times as high as for the US as a whole. The odds of dying of Lyme disease in the Pacific region are 0.04 times those for the US as a whole.
For the US, its regions, and its states, more than 97% of the variation in the observations is explained by the linear regression model (for the states: adjusted R-squared = 0.9782; for the US plus its regions: adjusted R-squared = 0.9948). There is an association between the Lyme disease death counts and the area plus the week: the variable “area” is significant, whereas “week” is not (for the states: AIC = 12060; for the US plus its regions: AIC = 3799). If we use only the variable “area”, the AIC goes down slightly (for the states: AIC = 12059; for the US plus its regions: AIC = 3797.2); consequently the model is improved. On the contrary, if we use only the variable “week”, the AIC goes up (for the states: AIC = 22162; for the US plus its regions: AIC = 6527). Both full models have the form counts ~ week + areas. Most of the eastern states are significant in the first model (Connecticut, Delaware, Florida, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Ohio, Pennsylvania, Rhode Island, Vermont, Virginia, West Virginia, and Wisconsin). Every region is significant in the second model.
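The AIC comparison described here can be carried out in one call with `AIC()`, which accepts several fitted models and returns one row per model. The sketch below uses simulated data, so its numbers will not match those reported in this project; `toy`, `m_full`, `m_area`, and `m_week` are illustrative names.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Simulated stand-in data with a strong area effect and no week effect
toy <- data.frame(
  week  = rep(1:52, times = 3),
  areas = rep(c("A", "B", "C"), each = 52)
)
area_effect <- c(A = 1, B = 10, C = 40)
toy$counts <- unname(area_effect[toy$areas]) + rnorm(nrow(toy), sd = 2)

m_full <- glm(counts ~ week + areas, data = toy)
m_area <- glm(counts ~ areas, data = toy)
m_week <- glm(counts ~ week, data = toy)

# One row per model; a lower AIC indicates a better trade-off
# between goodness of fit and model complexity
aic_table <- AIC(m_full, m_area, m_week)
aic_table
```

In this simulated setting the week-only model has by far the largest AIC, which mirrors the pattern reported for the Lyme data.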
I think we need more data (American Samoa, C.N.M.I., Guam, Puerto Rico, Virgin Islands) to have a better idea of the number of deaths due to Lyme disease in the U.S.
The model is a good model in which all the US regions and many of the states are significant (including both those with the smallest and those with the highest Lyme disease death reports). Also, the adjusted R-squared values are 99.48% and 97.82%, respectively, but the diagnostic plots indicate that this may not be the most appropriate model because the relationship may not be linear. Setting the diagnostic plots aside, these are good models.
https://www.kaggle.com/cdc/nndss-lyme-disease-to-meningococcal?select=nndss-table-ii.-lyme-disease-to-meningococcal.csv
https://www.cdc.gov/mmwr/about.html#:~:text=The%20Morbidity%20and%20Mortality%20Weekly%20Report%20(MMWR%20)%20series%20is%20prepared,Control%20and%20Prevention%20(CDC)