In December 2019, a novel coronavirus, later named SARS-CoV-2, was identified in an outbreak in Wuhan, China; the disease it causes became known as COVID-19. The World Health Organization (WHO) declared the outbreak a pandemic in March 2020. In December 2020, the FDA granted emergency use authorization to two mRNA COVID-19 vaccines, Pfizer-BioNTech and Moderna, and in 2021 it granted emergency use authorization to the Janssen/Johnson & Johnson COVID-19 vaccine. As of October 7, 2022, the pandemic had caused more than 620 million cases and 6.55 million confirmed deaths, making it one of the deadliest in history.
We obtained the “Rates of COVID-19 Cases and Deaths by Vaccination Status” data set, collected from March 2021 to July 2022, from the Centers for Disease Control and Prevention (CDC) website to study how effective COVID-19 vaccinations are in preventing COVID-19 deaths within different age groups. After removing rows with excessive missing values, the data set consists of 1,295 observations and 16 variables; the dependent variable is the COVID-19 outcome. The independent variables we focus on are age group, vaccine product, and fully vaccinated population. Please refer to the data set section below for a description of each variable. Our analysis consists of data wrangling, exploratory data analysis, a chi-squared test, and a logistic regression model.
https://www.cdc.gov/museum/timeline/covid19.html https://www.yalemedicine.org/news/covid-timeline
Are COVID-19 vaccinations effective in preventing COVID-19 deaths within age groups?
- Null Hypothesis: COVID-19 vaccinations are not effective in preventing COVID-19 deaths within age groups.
- Alternative Hypothesis: COVID-19 vaccinations are effective in preventing COVID-19 deaths within age groups.
- Outcome: COVID-19 Case or Death [case; death]
- Month: Calendar month and year corresponding to MMWR week value
- MMWR_week: MMWR epidemiological year and week [YYYYWW format; e.g. 202101]
- Age_group: Age Group [5-11 years; 12-17 years; 18-29 years; 30-49 years; 50-64 years; 65-79 years; 80+ years; all_ages_adj]
- Vaccine_product: FDA-authorized COVID-19 vaccine product name [Janssen; Moderna; Pfizer; all_types]
- Vaccinated_with_outcome: Weekly count of individuals vaccinated with at least a primary series with the corresponding outcome
- Fully_vaccinated_population: Cumulative weekly count of the population vaccinated with at least a primary series
- Unvaccinated_with_outcome: Weekly count of unvaccinated individuals with the corresponding outcome
- Unvaccinated_population: Cumulative weekly estimated count of the unvaccinated population
- Crude_vax_IR: Unadjusted incidence rate of the corresponding outcome among the population vaccinated with at least a primary series (per 100,000 population)
- Crude_unvax_IR: Unadjusted incidence rate of the corresponding outcome among the unvaccinated population (per 100,000 population)
- Crude_IRR: Unadjusted incidence rate ratio (unvaccinated: vaccinated with at least a primary series)
- Age_adjusted_vax_IR: Age-standardized incidence rate of the corresponding outcome among the population vaccinated with at least a primary series (per 100,000 population)
- Age_adjusted_unvax_IR: Age-standardized incidence rate of the corresponding outcome among the unvaccinated population (per 100,000 population)
- Age_adjusted_IRR: Age-standardized incidence rate ratio (unvaccinated rate: vaccinated with at least a primary series rate)
- Continuity_correction: Flag for whether continuity correction was applied for one or more jurisdictions in the strata [1=Yes, 0=No].
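As a rough illustration of how the rate variables relate to the count variables, the sketch below recomputes a crude incidence rate per 100,000 population and a crude incidence rate ratio from counts. The counts are made-up numbers, not rows from the CDC file, and `crude_ir` is a hypothetical helper name introduced only for this example.

```r
# Hypothetical helper: crude incidence rate per 100,000 population
crude_ir <- function(with_outcome, population) {
  with_outcome / population * 100000
}

# Made-up weekly counts for one stratum (not taken from the CDC data)
vax_ir   <- crude_ir(with_outcome = 50,  population = 1000000)  # 5 per 100,000
unvax_ir <- crude_ir(with_outcome = 200, population = 500000)   # 40 per 100,000

# Incidence rate ratio, unvaccinated : vaccinated
crude_irr <- unvax_ir / vax_ir
crude_irr  # 8
```

An IRR above 1 means the outcome rate is higher among the unvaccinated than among the vaccinated in that stratum.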
We begin by screening the data set to make sure it is ready for analysis. First, we convert all categorical variables to factors by setting each variable as a factor and adding labels. Next, the summary of the data shows no errors but some missing values. We remove rows with more than 20% of their values missing and impute the remaining missing values with the mice package (method cart). We then handle outliers with a three-step method based on leverage, Cook's distance, and Mahalanobis distance. Rows flagged by at least two of the three checks are removed, and the resulting subset is used for further analysis.
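One detail of factor conversion is worth a small sketch. When `factor()` is given `labels` without `levels`, the labels are attached to the sorted unique values, which can silently relabel groups. The toy vector below is illustrative only; whether this matters for the CDC file depends on the raw values in the CSV, which are not shown here.

```r
# Toy vector standing in for an age-group column (values are illustrative)
age_raw <- c("5-11", "12-17", "18-29", "12-17", "5-11", "5-11")

# With only `labels`, R sorts the unique values first ("12-17", "18-29", "5-11")
# and assigns the labels in that order, so the groups get mislabelled.
risky <- factor(age_raw, labels = c("5-11", "12-17", "18-29"))

# Giving `levels` explicitly fixes both the mapping and the display order.
safe <- factor(age_raw, levels = c("5-11", "12-17", "18-29"))

table(raw = age_raw, risky = risky)  # off-diagonal counts reveal the mismatch
table(safe)
```

Cross-tabulating the raw column against the converted factor, as in the last lines, is a quick way to confirm that each raw value kept its intended label.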
setwd("/Users/billy/OneDrive/Documents/ANLY 502/")
data <- read.csv("Rates_of_COVID-19_Cases_or_Deaths_by_Age_Group_and_Vaccination_Status.csv")
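## Converting character variables
# Note: without `levels=`, factor() assigns `labels` to the sorted unique values of each column, so the label order must match that sorted order.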
data$outcome <- factor(data$outcome,
labels = c("case", "death"))
data$month <- factor(data$month,
labels = c("Apr-21", "May-21", "Jun-21", "Jul-21", "Aug-21", "Sep-21", "Oct-21", "Nov-21", "Dec-21", "Jan-22", "Feb-22", "Mar-22", "Apr-22", "May-22", "Jun-22", "Jul-22"))
data$`Age_group` <- factor(data$Age_group,
labels = c("5-11", "12-17", "18-29", "30-49", "50-64", "65-79", "80+", "all_ages_adj"))
data$`Vaccine_product` <- factor(data$Vaccine_product,
labels = c("all_types", "Janssen", "Moderna", "Pfizer"))
apply(data[ , c("outcome", "month", "Age_group", "Vaccine_product")], 2, table)
## $outcome
##
## case death
## 713 680
##
## $month
##
## Apr-21 Apr-22 Aug-21 Dec-21 Feb-22 Jan-22 Jul-21 Jul-22 Jun-21 Jun-22 Mar-22
## 80 55 100 100 88 88 80 88 108 110 88
## May-21 May-22 Nov-21 Oct-21 Sep-21
## 80 88 80 80 80
##
## $Age_group
##
## 12-17 18-29 30-49 5-11 50-64 65-79
## 133 133 133 63 133 133
## 80+ all_ages_adj
## 133 532
##
## $Vaccine_product
##
## all_types Janssen Moderna Pfizer
## 994 133 133 133
# Categorical variables have been converted to factors for further analysis.
## Checking errors
#summary(data)
# When running summary, we can see that there are no negative numbers or any errors that need to be handled within our data set so we will proceed to check for missing data.
## Checking missing data
notypos <- data
apply(notypos, 2, function(x) { sum(is.na(x))})
## outcome month
## 0 0
## MMWR_week Age_group
## 0 0
## Vaccine_product Vaccinated_with_outcome
## 0 0
## Fully_vaccinated_population Unvaccinated_with_outcome
## 0 0
## Unvaccinated_population Crude_vax_IR
## 0 0
## Crude_unvax_IR Crude_IRR
## 0 98
## Age_adjusted_vax_IR Age_adjusted_unvax_IR
## 861 861
## Age_adjusted_IRR Continuity_correction
## 861 0
percentmiss <- function(x){ sum(is.na(x))/length(x) * 100}
missing <- apply(notypos, 1, percentmiss)
table(missing)
## missing
## 0 18.75 25
## 532 763 98
# As shown above, some rows have missing data; we remove rows with more than 20% missing.
replace_rows <- subset(notypos, missing <= 20)
noreplace_row <- subset(notypos, missing > 20)
nrow(notypos)
## [1] 1393
nrow(replace_rows)
## [1] 1295
nrow(noreplace_row)
## [1] 98
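The row-wise filter above can be sketched on a toy data frame. With 16 columns, 3 missing values in a row give 3/16 = 18.75% and 4 give 25%, which is where the two non-zero values in `table(missing)` come from. The data and the 25% threshold below are illustrative assumptions, not the values used in our analysis.

```r
# Toy data frame with 4 columns (illustrative values only)
toy <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(NA, 2, NA, 4),
  c = c(1, 2, 3, 4),
  d = c(1, NA, 3, 4)
)

# Percent of missing cells in a vector
percentmiss <- function(x) sum(is.na(x)) / length(x) * 100

row_missing <- apply(toy, 1, percentmiss)
row_missing  # 25 25 50 0

# Keep rows with at most 25% missing (the threshold is a choice)
kept <- toy[row_missing <= 25, ]
nrow(kept)  # 3
```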
apply(replace_rows, 2, percentmiss)
## outcome month
## 0.00000 0.00000
## MMWR_week Age_group
## 0.00000 0.00000
## Vaccine_product Vaccinated_with_outcome
## 0.00000 0.00000
## Fully_vaccinated_population Unvaccinated_with_outcome
## 0.00000 0.00000
## Unvaccinated_population Crude_vax_IR
## 0.00000 0.00000
## Crude_unvax_IR Crude_IRR
## 0.00000 0.00000
## Age_adjusted_vax_IR Age_adjusted_unvax_IR
## 58.91892 58.91892
## Age_adjusted_IRR Continuity_correction
## 58.91892 0.00000
replace_columns <- replace_rows[ , -c(1, 2, 4, 5)]
noreplace_columns <- replace_rows[ , c(1, 2, 4, 5)]
#install.packages("mice", repos = "https://cran.us.r-project.org/")
library(mice) # provides mice() and complete()
temp_no_miss <- mice(replace_columns, method='cart')
##
## iter imp variable
## 1 1 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 1 2 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 1 3 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 1 4 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 1 5 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 2 1 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 2 2 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 2 3 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 2 4 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 2 5 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 3 1 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 3 2 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 3 3 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 3 4 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 3 5 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 4 1 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 4 2 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 4 3 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 4 4 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 4 5 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 5 1 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 5 2 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 5 3 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 5 4 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## 5 5 Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## Warning: Number of logged events: 75
nomiss <- complete(temp_no_miss, 1)
dim(notypos)
## [1] 1393 16
dim(nomiss)
## [1] 1295 12
all_colunms <- cbind(noreplace_columns, nomiss)
dim(all_colunms)
## [1] 1295 16
summary(all_colunms)
## outcome month Age_group Vaccine_product MMWR_week
## case :713 Jun-21 :100 all_ages_adj:532 all_types:896 Min. :202114
## death:582 Jun-22 : 98 50-64 :133 Janssen :133 1st Qu.:202131
## Dec-21 : 97 65-79 :133 Moderna :133 Median :202148
## Aug-21 : 90 80+ :133 Pfizer :133 Mean :202168
## Mar-22 : 86 30-49 :132 3rd Qu.:202212
## Feb-22 : 83 18-29 :110 Max. :202229
## (Other):741 (Other) :122
## Vaccinated_with_outcome Fully_vaccinated_population Unvaccinated_with_outcome
## Min. : 1.0 Min. : 38107 Min. : 0.0
## 1st Qu.: 156.5 1st Qu.: 10175936 1st Qu.: 786.5
## Median : 1887.0 Median : 25046957 Median : 6783.0
## Mean : 36746.2 Mean : 36814796 Mean : 76799.9
## 3rd Qu.: 23606.5 3rd Qu.: 45899113 3rd Qu.: 58787.5
## Max. :1982037.0 Max. :150041139 Max. :1880066.0
##
## Unvaccinated_population Crude_vax_IR Crude_unvax_IR
## Min. : 987580 Min. : 0.0038 Min. : 0.000
## 1st Qu.: 6644913 1st Qu.: 0.6134 1st Qu.: 5.243
## Median : 16536440 Median : 11.0721 Median : 73.630
## Mean : 32570854 Mean : 92.3167 Mean : 271.348
## 3rd Qu.: 55867478 3rd Qu.: 100.4499 3rd Qu.: 335.083
## Max. :122905181 Max. :2057.0984 Max. :3887.401
##
## Crude_IRR Age_adjusted_vax_IR Age_adjusted_unvax_IR Age_adjusted_IRR
## Min. : 0.000 Min. : 0.0421 Min. : 0.684 Min. : 1.108
## 1st Qu.: 2.549 1st Qu.: 0.6904 1st Qu.: 14.160 1st Qu.: 3.957
## Median : 4.931 Median : 19.3546 Median : 93.633 Median : 6.636
## Mean : 7.741 Mean : 84.7374 Mean : 268.010 Mean : 9.329
## 3rd Qu.: 8.509 3rd Qu.: 92.2640 3rd Qu.: 379.078 3rd Qu.:13.382
## Max. :120.369 Max. :1590.8089 Max. :3330.522 Max. :34.793
##
## Continuity_correction
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6795
## 3rd Qu.:1.0000
## Max. :1.0000
##
## Outliers
#leverage
model1 <- lm(Vaccinated_with_outcome ~ MMWR_week + Fully_vaccinated_population + Unvaccinated_with_outcome + Unvaccinated_population + Crude_vax_IR + Crude_unvax_IR + Crude_IRR + Age_adjusted_vax_IR + Age_adjusted_unvax_IR + Age_adjusted_IRR + Continuity_correction, data = all_colunms)
k <- 16 # Value used for the cutoffs reported below; note that model1 has 11 predictors
leverage <- hatvalues(model1)
cutleverage <- (2*k+2) /nrow(all_colunms)
badleverage <- as.numeric(leverage > cutleverage)
table(badleverage)
## badleverage
## 0 1
## 1218 77
#Cooks
cooks <- cooks.distance(model1)
cutcooks <- 4 / (nrow(all_colunms) - k - 1)
badcooks <- as.numeric(cooks > cutcooks)
table(badcooks)
## badcooks
## 0 1
## 1235 60
#Mahal
mahal <- mahalanobis(all_colunms[ , -c(1, 2, 3, 4)],
colMeans(all_colunms[ , -c(1, 2, 3, 4)]),
cov(all_colunms[ , -c(1, 2, 3, 4)]),
tol=1e-20)
cutmahal <- qchisq(1-.001, ncol(all_colunms[ , -c(1, 2, 3, 4)]))
badmahal <- as.numeric(mahal > cutmahal)
table(badmahal)
## badmahal
## 0 1
## 1216 79
# Overall
totalout <- badmahal + badleverage + badcooks
table(totalout)
## totalout
## 0 1 2 3
## 1211 5 26 53
noout <- subset(all_colunms, totalout < 2)
### Data is now clean. NA's and outliers have been dealt with. The clean dataset is now named noout.
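The three-check outlier screen can be sketched on simulated data as follows. This is a minimal illustration under stated assumptions: the data are simulated rather than taken from the CDC file, the object names (`sim`, `fit`, `total_flags`, `clean`) are hypothetical, and `k` is derived from the fitted model instead of being typed in by hand.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated data: one outcome and three predictors (not the CDC data)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = sim)
k <- length(coef(fit)) - 1  # number of predictors, here 3

# Leverage: flag hat values above (2k + 2) / n
bad_leverage <- as.numeric(hatvalues(fit) > (2 * k + 2) / n)

# Cook's distance: flag values above 4 / (n - k - 1)
bad_cooks <- as.numeric(cooks.distance(fit) > 4 / (n - k - 1))

# Mahalanobis distance: flag values beyond the chi-squared cutoff at p < .001
m <- mahalanobis(sim, colMeans(sim), cov(sim))
bad_mahal <- as.numeric(m > qchisq(1 - .001, df = ncol(sim)))

# Keep rows flagged by fewer than two of the three checks
total_flags <- bad_leverage + bad_cooks + bad_mahal
clean <- subset(sim, total_flags < 2)
table(total_flags)
```

Deriving `k` from `coef(fit)` keeps the leverage and Cook's cutoffs tied to the model that was actually fitted.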
The next phase of our analysis is checking assumptions. We begin with additivity, examining the correlations between the numeric variables in our data set. Based on the correlation matrix and plot produced by the code below, we consider the assumption of additivity met, although Age_adjusted_vax_IR and Age_adjusted_unvax_IR are strongly correlated (r = 0.91). Next is linearity: this assumption is not met, since the points in the Q-Q plot do not follow the line between -2 and 2. Next is normality, where we check whether the standardized residuals are normally distributed; based on the histogram below, we consider this assumption met. Finally, we check homogeneity and homoscedasticity; both assumptions are met, since the points in the scatter plot are spread evenly from top to bottom and from left to right.
### Checking Assumptions
## Additivity
cor(noout[ , -c(1, 2, 3, 4)])
## MMWR_week Vaccinated_with_outcome
## MMWR_week 1.00000000 0.2075690
## Vaccinated_with_outcome 0.20756904 1.0000000
## Fully_vaccinated_population 0.14033860 0.4526271
## Unvaccinated_with_outcome -0.10838637 0.6051674
## Unvaccinated_population -0.23748871 0.1193802
## Crude_vax_IR 0.24885161 0.6753920
## Crude_unvax_IR 0.02296960 0.5672159
## Crude_IRR -0.46982210 -0.2479043
## Age_adjusted_vax_IR 0.15902480 0.5684938
## Age_adjusted_unvax_IR 0.00317378 0.4635185
## Age_adjusted_IRR -0.47967604 -0.3566683
## Continuity_correction 0.37405311 0.1836956
## Fully_vaccinated_population
## MMWR_week 0.14033860
## Vaccinated_with_outcome 0.45262707
## Fully_vaccinated_population 1.00000000
## Unvaccinated_with_outcome 0.23490007
## Unvaccinated_population 0.43737388
## Crude_vax_IR 0.03022509
## Crude_unvax_IR -0.01235813
## Crude_IRR -0.15828346
## Age_adjusted_vax_IR 0.09541926
## Age_adjusted_unvax_IR 0.03160017
## Age_adjusted_IRR -0.08098918
## Continuity_correction 0.39028275
## Unvaccinated_with_outcome Unvaccinated_population
## MMWR_week -0.1083864 -0.237488708
## Vaccinated_with_outcome 0.6051674 0.119380211
## Fully_vaccinated_population 0.2349001 0.437373883
## Unvaccinated_with_outcome 1.0000000 0.384408500
## Unvaccinated_population 0.3844085 1.000000000
## Crude_vax_IR 0.5061012 -0.099881906
## Crude_unvax_IR 0.5927061 -0.110923660
## Crude_IRR -0.1730923 -0.165619288
## Age_adjusted_vax_IR 0.5487665 -0.013231386
## Age_adjusted_unvax_IR 0.6303294 -0.031781499
## Age_adjusted_IRR -0.3028509 0.003218947
## Continuity_correction 0.1753270 0.409079348
## Crude_vax_IR Crude_unvax_IR Crude_IRR
## MMWR_week 0.24885161 0.02296960 -0.46982210
## Vaccinated_with_outcome 0.67539197 0.56721594 -0.24790428
## Fully_vaccinated_population 0.03022509 -0.01235813 -0.15828346
## Unvaccinated_with_outcome 0.50610124 0.59270605 -0.17309232
## Unvaccinated_population -0.09988191 -0.11092366 -0.16561929
## Crude_vax_IR 1.00000000 0.84964878 -0.32795439
## Crude_unvax_IR 0.84964878 1.00000000 -0.24259127
## Crude_IRR -0.32795439 -0.24259127 1.00000000
## Age_adjusted_vax_IR 0.45970780 0.38487619 -0.16484736
## Age_adjusted_unvax_IR 0.42729410 0.46785522 -0.09448754
## Age_adjusted_IRR -0.45010910 -0.36401336 0.61616154
## Continuity_correction 0.05052707 0.02905750 -0.33500489
## Age_adjusted_vax_IR Age_adjusted_unvax_IR
## MMWR_week 0.159024796 0.00317378
## Vaccinated_with_outcome 0.568493847 0.46351850
## Fully_vaccinated_population 0.095419256 0.03160017
## Unvaccinated_with_outcome 0.548766489 0.63032940
## Unvaccinated_population -0.013231386 -0.03178150
## Crude_vax_IR 0.459707805 0.42729410
## Crude_unvax_IR 0.384876185 0.46785522
## Crude_IRR -0.164847364 -0.09448754
## Age_adjusted_vax_IR 1.000000000 0.90519206
## Age_adjusted_unvax_IR 0.905192061 1.00000000
## Age_adjusted_IRR -0.363266948 -0.31261364
## Continuity_correction 0.002687837 -0.04658412
## Age_adjusted_IRR Continuity_correction
## MMWR_week -0.479676038 0.374053111
## Vaccinated_with_outcome -0.356668283 0.183695624
## Fully_vaccinated_population -0.080989177 0.390282749
## Unvaccinated_with_outcome -0.302850892 0.175326964
## Unvaccinated_population 0.003218947 0.409079348
## Crude_vax_IR -0.450109102 0.050527071
## Crude_unvax_IR -0.364013355 0.029057499
## Crude_IRR 0.616161543 -0.335004893
## Age_adjusted_vax_IR -0.363266948 0.002687837
## Age_adjusted_unvax_IR -0.312613645 -0.046584121
## Age_adjusted_IRR 1.000000000 -0.153616389
## Continuity_correction -0.153616389 1.000000000
library(corrplot) # provides corrplot()
corrplot(cor(noout[ , -c(1, 2, 3, 4)]))
## Linearity
random <- rchisq(nrow(noout), 7)
fake <- lm(random ~ .,
data = noout)
standardized <- rstudent(fake)
fitvaules <- scale(fake$fitted.values)
{qqnorm(standardized)
abline(0,1)}
plot(fake, 2)
## Normality
library(moments) # assumed source of skewness() and kurtosis()
skewness(noout[ , -c(1, 2, 3, 4)])
## MMWR_week Vaccinated_with_outcome
## 0.2236250 4.4973390
## Fully_vaccinated_population Unvaccinated_with_outcome
## 1.6225035 3.0018507
## Unvaccinated_population Crude_vax_IR
## 0.6741116 3.8043972
## Crude_unvax_IR Crude_IRR
## 2.6620519 2.2766928
## Age_adjusted_vax_IR Age_adjusted_unvax_IR
## 3.1014201 2.0529742
## Age_adjusted_IRR Continuity_correction
## 0.8398245 -0.8131734
kurtosis(noout[ , -c(1, 2, 3, 4)]) - 3
## MMWR_week Vaccinated_with_outcome
## -1.7330934 27.6200593
## Fully_vaccinated_population Unvaccinated_with_outcome
## 2.0091120 10.5656546
## Unvaccinated_population Crude_vax_IR
## -0.6321728 23.0086850
## Crude_unvax_IR Crude_IRR
## 11.7945146 5.8732446
## Age_adjusted_vax_IR Age_adjusted_unvax_IR
## 15.7396173 6.8746818
## Age_adjusted_IRR Continuity_correction
## -0.2709831 -1.3387491
hist(standardized, breaks=15)
length(standardized)
## [1] 1216
## Homogeneity/Homoscedasticity
{plot(fitvaules, standardized)
abline(0,0)
abline(v = 0)}
- Assumption of additivity has been met.
- Assumption of linearity has not been met, as the points in the Q-Q plot do not follow the line between -2 and 2.
- Assumption of normality has been met because the standardized residuals are approximately normally distributed, with a slight right skew.
- Assumptions of homogeneity and homoscedasticity have been met.
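For readers who want to see what the skewness and kurtosis figures measure, the sketch below computes moment-based skewness and excess kurtosis in base R. The helper names `skew` and `excess_kurt` are hypothetical, the two sample vectors are made up, and we assume (without having verified it against the package source) that this is the same moment-based definition the functions used above follow.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second, minus 3
excess_kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric <- c(-2, -1, 0, 1, 2)
right_skewed <- c(1, 1, 1, 2, 10)

skew(symmetric)       # 0 for a symmetric sample
skew(right_skewed)    # positive for a long right tail
excess_kurt(symmetric)
```

Values near 0 on both measures are consistent with a normal shape, while large positive values indicate a long right tail or heavy tails.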
library(ggplot2) # provides ggplot() and the geoms used below
ggplot(noout, aes(x=outcome)) +
geom_bar(fill="steelblue") +
theme_minimal()+
labs(x = "Outcome",
y = "Count",
title = "Outcome of vaccine")
The outcome variable has 653 case records and 563 death records.
ggplot(noout, aes(x=Age_group)) +
geom_bar(fill="steelblue") +
theme_minimal()+
labs(x = "Age group",
y = "Count",
title = "Age distribution")
Record counts are similar across the age groups from 30-49 upward, smaller for the younger groups (5-11, 12-17, and 18-29), and much larger for ‘all_ages_adj’.
ggplot(noout, aes(x=Vaccine_product)) +
geom_bar(fill="steelblue") +
theme_minimal()+
labs(x = "Vaccine product",
y = "Count",
title = "Types of Vaccine product")
The distribution of vaccine products is uniform across the three individual products; ‘all_types’, which combines all vaccine products including these three, has more records, as expected.
library(dplyr) # provides %>% and filter()
noout2 <- noout %>% filter(Age_group !='all_ages_adj' )
ggplot(noout2, aes(Age_group, fill = outcome)) +
geom_bar( position = "dodge")+
theme_minimal()+
labs(x = "Age group",
y = "Count",
title = "Plot of outcome among the age group")
In the distribution of outcome by age group, death records are fewer than case records for the age groups under 30; for the age groups 30 and older, the two counts are approximately equal.
noout3 <- noout %>% filter(Vaccine_product != 'all_types' )
ggplot(noout3, aes(y=Fully_vaccinated_population , x=MMWR_week, color=Vaccine_product)) +
geom_point(size = .9)+
theme_minimal()+
scale_y_continuous( labels = scales::comma)+
labs(x = "year and week (YYYYWW)",
y = "Fully vaccinated population",
title = "Plot of vaccinated population ")
From the plot we can see that Pfizer was the most common vaccine received by the population.
ggplot(noout2, aes(y=Unvaccinated_population , x=MMWR_week, color=Age_group)) +
geom_point()+
theme_minimal()+
scale_y_continuous( labels = scales::comma)+
labs(x = "year and week (YYYYWW)",
y = "Unvaccinated population",
title = "Plot of Unvaccinated population ")
The plot above shows the unvaccinated population by week, split by age group.
ggplot(noout2, aes(y=Fully_vaccinated_population , x=MMWR_week, color=Age_group)) +
geom_point()+
theme_minimal()+
scale_y_continuous( labels = scales::comma)+
labs(x = "year and week (YYYYWW)",
y = "Fully vaccinated population",
title = "Plot of Fully vaccinated population ")
The plot above shows the fully vaccinated population by week, split by age group.
We performed a chi-squared test of independence for the Age_group and outcome variables.
H0: Age and outcome are independent; there is no relationship between the two categorical variables. Knowing the value of one variable does not help to predict the value of the other variable.
H1: Age and outcome are dependent; there is a relationship between the two categorical variables. Knowing the value of one variable helps to predict the value of the other variable.
table(noout$Age_group,noout$outcome)
##
## case death
## 5-11 29 6
## 12-17 60 13
## 18-29 61 36
## 30-49 62 60
## 50-64 64 63
## 65-79 59 62
## 80+ 64 64
## all_ages_adj 254 259
chisq.test(noout$outcome, noout$Age_group)
##
## Pearson's Chi-squared test
##
## data: noout$outcome and noout$Age_group
## X-squared = 45.57, df = 7, p-value = 1.06e-07
# Doing the test by removing 'all_ages_adj'
noout2 <- noout %>% filter(Age_group !='all_ages_adj' )
table(noout2$Age_group,noout2$outcome)
##
## case death
## 5-11 29 6
## 12-17 60 13
## 18-29 61 36
## 30-49 62 60
## 50-64 64 63
## 65-79 59 62
## 80+ 64 64
## all_ages_adj 0 0
chisq.test(noout2$Age_group,noout2$outcome )
##
## Pearson's Chi-squared test
##
## data: noout2$Age_group and noout2$outcome
## X-squared = 39.822, df = 6, p-value = 4.937e-07
We can reject the null hypothesis since p < 0.01: age group and outcome are dependent, i.e. there is a relationship between the two categorical variables.
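To make the test statistic less of a black box, the sketch below recomputes the Pearson chi-squared statistic by hand from the case/death record counts in the second table above (with `all_ages_adj` excluded). The object names are hypothetical; the counts are copied from that table.

```r
# Case/death record counts by age group, copied from the table above
# (all_ages_adj excluded)
counts <- matrix(
  c(29,  6,
    60, 13,
    61, 36,
    62, 60,
    64, 63,
    59, 62,
    64, 64),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Age_group = c("5-11", "12-17", "18-29", "30-49", "50-64", "65-79", "80+"),
    outcome = c("case", "death")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic, degrees of freedom, and p-value
x_squared <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

round(x_squared, 3)  # 39.822, matching chisq.test() above
df                   # 6
```

Most of the statistic comes from the 5-11 and 12-17 rows, where death records are far fewer than independence would predict.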
glm(formula = outcome ~ Age_group + Vaccine_product + Fully_vaccinated_population +
Unvaccinated_population, family = "binomial", data = all_colunms)
##
## Call: glm(formula = outcome ~ Age_group + Vaccine_product + Fully_vaccinated_population +
## Unvaccinated_population, family = "binomial", data = all_colunms)
##
## Coefficients:
## (Intercept) Age_group12-17
## -2.711e-01 7.418e-06
## Age_group18-29 Age_group30-49
## 2.416e+00 4.283e+00
## Age_group50-64 Age_group65-79
## 3.229e+00 2.247e+00
## Age_group80+ Age_groupall_ages_adj
## 8.513e-01 1.447e+01
## Vaccine_productJanssen Vaccine_productModerna
## -8.677e+00 -5.921e+00
## Vaccine_productPfizer Fully_vaccinated_population
## -4.116e+00 -7.603e-08
## Unvaccinated_population
## -7.624e-08
##
## Degrees of Freedom: 1294 Total (i.e. Null); 1282 Residual
## Null Deviance: 1782
## Residual Deviance: 1657 AIC: 1683
From using the Generalized Linear Model function to perform our logistic regression, we obtained the results shown above. At a significance level of 0.05, three variables can be considered significant predictors, based on the “Pr(>|z|)” values reported by the model summary:
- Vaccine product, as all three products have “Pr(>|z|)” values smaller than 0.05.
- Fully_vaccinated_population, whose “Pr(>|z|)” value is also smaller than 0.05.
- Age group 12-17, which also has a “Pr(>|z|)” value smaller than 0.05.
We therefore conclude that being fully vaccinated with one of the three authorized vaccine products (Pfizer, Moderna, and Janssen) is effective at preventing COVID-19 deaths within certain age groups (in our case, the 12-17 age group).
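The “Pr(>|z|)” values referred to above come from the coefficient table of the model summary. The sketch below shows, on simulated data, how that table and the corresponding odds ratios can be extracted from a fitted logistic regression. The data, the predictor names (`group`, `x`), and the object names are all assumptions made for illustration and are unrelated to the CDC variables.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated binary outcome with one categorical and one numeric predictor
# (illustrative only; not the CDC data)
n <- 500
sim <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x = rnorm(n)
)
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x))

fit <- glm(y ~ group + x, family = binomial, data = sim)

# Coefficient table: estimate, standard error, z value, and Pr(>|z|)
coef_table <- summary(fit)$coefficients
coef_table

# Predictors whose p-value falls below the 0.05 significance level
significant <- rownames(coef_table)[coef_table[, "Pr(>|z|)"] < 0.05]

# Odds ratios: exponentiated log-odds coefficients
odds_ratios <- exp(coef(fit))
```

An odds ratio below 1 indicates lower odds of the modelled outcome relative to the reference level, which is usually easier to interpret than the raw log-odds coefficient.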
We used R to interpret and visualize US COVID-19 data to better understand how COVID-19 cases and deaths vary with age and population. The pandemic took different forms across the US and across age ranges. It affected the vaccinated and unvaccinated populations differently, its impact varied by age group, and it was accompanied by socio-economic inequities. The 50-64 age group has one of the highest numbers of cases and COVID-19 deaths.