Coronavirus Disease(Covid-19), caused by Severe Acute Respiratory Syndrome Coronavirus 2(SARS-COV-2) was first identified in December 2019 in Wuhan, a rail and aviation hub in the Hubei Province of China. However since its discovery, there has been over 95 million cases of covid-19 recorded with almost 1 million fatalities. However regardless of these high number of cases in united states since its discovery, almost every single country around the globe has been affected from this pandemic with cases been recorded in all borders, making prevention of its spread a top priority for governments, medical and political leaders. The associated symptoms of the virus include persistent cough, fever,fatigue and loss of smell or taste(Grant et al,2020).
Despite movement restrictions imposed in all 50 states during the peak period of covid-19,there has been cases mortality documented across all states in United States. A lot of residents in various states however remain susceptible to Covid-19, highlighting the need for urgent and effective vaccine to curb the spread of the virus. The evolving spread of the virus on the entire population of US has been very critical with the CDC committing much effort into research with private stakeholders on the development of a vaccine.
To mitigate the spread of this virus, there has been an increase pace in vaccine development. In December 2020, efficacy results from vaccine clinal trials were reported,boosting the confidence in the healthcare system in United States. Since its inception into the borders of united states, there has been so many studies and research conducted on covid-19 using visualization and statistical methods to drive meaning from the rising number of cases and its associated mortality rates.
In the united states , covid-19 has been a test for the healthcare system across every state with severe threat to the public healthcare system. This resulted in most hospitals and healthcare expert been stretched to their maximum during the peak days of virus. However since the introduction and development vaccine to boost the immune system of individuals against Covid-19, there has been a drastic decline in the infection rate, although there are still cases of covid-19 been recorded every across every state. This is as a result of many people declining to get vaccine with their concern been that the vaccine development was so quick and cannot be trusted. However, prior studies suggest that health workers and individuals aged 65 and older were prioritized for vaccination(Gary,2020). This prioritization however significantly relied on the existence of pre-underlying health conditions such as diabetes and hypertension with high tendency for them develop severe disease (Wang et al, 2020).
This project aims to conduct statistical analysis on the covid-19 occurrence in the united states whiles using various statistical tools from the R-statistical software to visualize the dataset to drive meaning from these occurrences. This study is however organised as follows; data collection, descriptive statistics, exploratory analysis, methodology , results discussion and conclusion.
To get answers to the purpose intended for this project and all corresponding hypothesis that will be established, this research employs two dataset from the CDC labeled “Provisional Covid-19 Death by Sex and Age” and “Vaccine_gov_covid_19”. Both dataset used in this study are publicly made available on the CDC website.This project employs R statistical software on a gathered panel to run the statistical analysis and visualization. The provisional data made available from the CDC website is at both state and country level which makes provision for biographic information such as age and sex. The initial dataset label provisional covid-19 death by age and sex consisted of 104652 observations with 16 variables whiles vaccination dataset consisted of 468596 observations of 33 variables. Below is the initial view of the dataset used for this project.
library(readr)
Provisional_COVID_19_Deaths_by_Sex_and_Age_4_ <- read_csv("Provisional_COVID-19_Deaths_by_Sex_and_Age (4).csv")
## Rows: 104652 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Data As Of, Start Date, End Date, Group, State, Sex, Age Group, Foo...
## dbl (8): Year, Month, COVID-19 Deaths, Total Deaths, Pneumonia Deaths, Pneum...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(Provisional_COVID_19_Deaths_by_Sex_and_Age_4_)
covid19_data <- Provisional_COVID_19_Deaths_by_Sex_and_Age_4_
dim(covid19_data)
## [1] 104652 16
names(covid19_data)
## [1] "Data As Of"
## [2] "Start Date"
## [3] "End Date"
## [4] "Group"
## [5] "Year"
## [6] "Month"
## [7] "State"
## [8] "Sex"
## [9] "Age Group"
## [10] "COVID-19 Deaths"
## [11] "Total Deaths"
## [12] "Pneumonia Deaths"
## [13] "Pneumonia and COVID-19 Deaths"
## [14] "Influenza Deaths"
## [15] "Pneumonia, Influenza, or COVID-19 Deaths"
## [16] "Footnote"
library(readr)
Vaccines_gov_COVID_19_vaccinating_provider_locations <- read_csv("Vaccines.gov__COVID-19_vaccinating_provider_locations.csv")
## Rows: 467734 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): provider_location_guid, loc_store_no, loc_phone, loc_name, loc_ad...
## dbl (5): supply_level, latitude, longitude, min_age_months, min_age_years
## lgl (5): insurance_accepted, walkins_accepted, in_stock, Unnamed Column, o...
## date (1): quantity_last_updated
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(Vaccines_gov_COVID_19_vaccinating_provider_locations)
vaccination_data <- Vaccines_gov_COVID_19_vaccinating_provider_locations
dim(vaccination_data)
## [1] 467734 33
names(vaccination_data)
## [1] "provider_location_guid" "loc_store_no" "loc_phone"
## [4] "loc_name" "loc_admin_street1" "loc_admin_street2"
## [7] "loc_admin_city" "loc_admin_state" "loc_admin_zip"
## [10] "sunday_hours" "monday_hours" "tuesday_hours"
## [13] "wednesday_hours" "thursday_hours" "friday_hours"
## [16] "saturday_hours" "web_address" "pre_screen"
## [19] "insurance_accepted" "walkins_accepted" "provider_notes"
## [22] "ndc" "med_name" "in_stock"
## [25] "supply_level" "quantity_last_updated" "latitude"
## [28] "longitude" "Category" "Unnamed Column"
## [31] "offers_free_masks" "min_age_months" "min_age_years"
### Removing of missing incidents
new_data <- vaccination_data[!is.na(vaccination_data$latitude),]
#summary(new_data)
library(raster)
## Loading required package: sp
wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
vaccination_spt= SpatialPointsDataFrame(vaccination_data[,c('longitude','latitude')], vaccination_data, proj4string = wgs84)
plot(vaccination_spt)
Below is a descriptive analysis of the dataset used for this project.The study employs R statistical software to run summary on both data set coupled with chatting the missing cases with these dataset.Below reports the total number of missing cases from vaccination data and covid19_data to be 2303113 and 199737 respectively. The cleaning of the dataset resulted in a final data for covid19_data as 45157 observation with 10 variables.
library(naniar)
summary(covid19_data)
## Data As Of Start Date End Date Group
## Length:104652 Length:104652 Length:104652 Length:104652
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Year Month State Sex
## Min. :2020 Min. : 1.000 Length:104652 Length:104652
## 1st Qu.:2020 1st Qu.: 3.000 Class :character Class :character
## Median :2021 Median : 6.000 Mode :character Mode :character
## Mean :2021 Mean : 6.206
## 3rd Qu.:2022 3rd Qu.: 9.000
## Max. :2022 Max. :12.000
## NA's :2754 NA's :11016
## Age Group COVID-19 Deaths Total Deaths Pneumonia Deaths
## Length:104652 Min. : 0.0 Min. : 0 Min. : 0.0
## Class :character 1st Qu.: 0.0 1st Qu.: 43 1st Qu.: 0.0
## Mode :character Median : 11.0 Median : 156 Median : 19.0
## Mean : 376.1 Mean : 2862 Mean : 366.5
## 3rd Qu.: 67.0 3rd Qu.: 678 3rd Qu.: 82.0
## Max. :1062929.0 Max. :9401189 Max. :980051.0
## NA's :28371 NA's :14881 NA's :32565
## Pneumonia and COVID-19 Deaths Influenza Deaths
## Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0 Median : 0.000
## Mean : 189.4 Mean : 3.585
## 3rd Qu.: 31.0 3rd Qu.: 0.000
## Max. :539118.0 Max. :12475.000
## NA's :27419 NA's :18414
## Pneumonia, Influenza, or COVID-19 Deaths Footnote
## Min. : 0.0 Length:104652
## 1st Qu.: 0.0 Class :character
## Median : 27.0 Mode :character
## Mean : 559.7
## 3rd Qu.: 121.0
## Max. :1514510.0
## NA's :31774
gg_miss_upset(covid19_data)
sum(is.na(covid19_data))
## [1] 200150
#summary(vaccination_data)
gg_miss_upset(vaccination_data)
sum(is.na(vaccination_data))
## [1] 2295957
###cleaning dataset of covid-19 data
COVID_19_new <- subset(covid19_data, select = -c(`Pneumonia Deaths`,`Influenza Deaths`,Group,Year,Footnote,`Data As Of`) )
covid.data<-COVID_19_new
covid.data<-COVID_19_new
covid19_data<-na.omit(covid.data)
names(covid19_data)
## [1] "Start Date"
## [2] "End Date"
## [3] "Month"
## [4] "State"
## [5] "Sex"
## [6] "Age Group"
## [7] "COVID-19 Deaths"
## [8] "Total Deaths"
## [9] "Pneumonia and COVID-19 Deaths"
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"
head(covid19_data)
## # A tibble: 6 × 10
## `Start Date` `End Date` Month State Sex `Age Group` `COVID-19 Deat…`
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 01/01/2020 01/31/2020 1 United States All … All Ages 6
## 2 01/01/2020 01/31/2020 1 United States All … Under 1 ye… 0
## 3 01/01/2020 01/31/2020 1 United States All … 0-17 years 0
## 4 01/01/2020 01/31/2020 1 United States All … 1-4 years 0
## 5 01/01/2020 01/31/2020 1 United States All … 5-14 years 0
## 6 01/01/2020 01/31/2020 1 United States All … 15-24 years 0
## # … with 3 more variables: `Total Deaths` <dbl>,
## # `Pneumonia and COVID-19 Deaths` <dbl>,
## # `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>
dim(covid19_data)
## [1] 37764 10
##convert the column to a factor
library(plyr)
covid19_data$Sex.code <- revalue(covid19_data$Sex, c("Male"="1", "Female"="2", "All Sexes"="3"))
covid19_data$Sex.code <- mapvalues(covid19_data$Sex, from = c("Male", "Female", "All Sexes"), to = c("1", "2", "3"))
covid19_data
## # A tibble: 37,764 × 11
## `Start Date` `End Date` Month State Sex `Age Group` `COVID-19 Deat…`
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 01/01/2020 01/31/2020 1 United Stat… All … All Ages 6
## 2 01/01/2020 01/31/2020 1 United Stat… All … Under 1 ye… 0
## 3 01/01/2020 01/31/2020 1 United Stat… All … 0-17 years 0
## 4 01/01/2020 01/31/2020 1 United Stat… All … 1-4 years 0
## 5 01/01/2020 01/31/2020 1 United Stat… All … 5-14 years 0
## 6 01/01/2020 01/31/2020 1 United Stat… All … 15-24 years 0
## 7 01/01/2020 01/31/2020 1 United Stat… All … 18-29 years 0
## 8 01/01/2020 01/31/2020 1 United Stat… All … 25-34 years 0
## 9 01/01/2020 01/31/2020 1 United Stat… All … 30-39 years 0
## 10 01/01/2020 01/31/2020 1 United Stat… All … 35-44 years 0
## # … with 37,754 more rows, and 4 more variables: `Total Deaths` <dbl>,
## # `Pneumonia and COVID-19 Deaths` <dbl>,
## # `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Sex.code <chr>
names(covid19_data)
## [1] "Start Date"
## [2] "End Date"
## [3] "Month"
## [4] "State"
## [5] "Sex"
## [6] "Age Group"
## [7] "COVID-19 Deaths"
## [8] "Total Deaths"
## [9] "Pneumonia and COVID-19 Deaths"
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"
## [11] "Sex.code"
unique(covid19_data[c('State')])
## # A tibble: 54 × 1
## State
## <chr>
## 1 United States
## 2 Alabama
## 3 Alaska
## 4 Arizona
## 5 Arkansas
## 6 California
## 7 Colorado
## 8 Connecticut
## 9 Delaware
## 10 District of Columbia
## # … with 44 more rows
colSums(is.na(covid19_data))
## Start Date
## 0
## End Date
## 0
## Month
## 0
## State
## 0
## Sex
## 0
## Age Group
## 0
## COVID-19 Deaths
## 0
## Total Deaths
## 0
## Pneumonia and COVID-19 Deaths
## 0
## Pneumonia, Influenza, or COVID-19 Deaths
## 0
## Sex.code
## 0
covid19_data[duplicated(covid19_data),]
## # A tibble: 0 × 11
## # … with 11 variables: Start Date <chr>, End Date <chr>, Month <dbl>,
## # State <chr>, Sex <chr>, Age Group <chr>, COVID-19 Deaths <dbl>,
## # Total Deaths <dbl>, Pneumonia and COVID-19 Deaths <dbl>,
## # Pneumonia, Influenza, or COVID-19 Deaths <dbl>, Sex.code <chr>
table(covid19_data$Sex)
##
## All Sexes Female Male
## 13836 11348 12580
barplot(table(covid19_data$Sex))
This defines the respective age groups related to the cases been recorded when it comes to covid-19. Below is the count distribution of Age Group as reported in the used dataset.
table(covid19_data$`Age Group`)
##
## 0-17 years 1-4 years 15-24 years 18-29 years
## 1682 1504 1479 1290
## 25-34 years 30-39 years 35-44 years 40-49 years
## 1146 1159 1283 1530
## 45-54 years 5-14 years 50-64 years 55-64 years
## 1966 1242 3093 2854
## 65-74 years 75-84 years 85 years and over All Ages
## 3455 3616 3506 4741
## Under 1 year
## 2218
barplot(table(covid19_data$`Age Group`))
This variable describe the moralities recorded resulting from covid-19 infections
This also defines the total number of death recorded resulting from covid-19 and other underlying illness.
Below is the exploratory analysis of the variables described above within the covid19_data to further explore the relationship with other triggering variables that will be used in analysis.
library(ggplot2)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Loading required package: viridisLite
#### Sex and Covid-19 deaths
ggplot(covid19_data, aes(x=`COVID-19 Deaths`, y= `Age Group` , fill=Sex, stat = "identity")) +
geom_boxplot()
## Age groups and Covid-19 deaths
ggplot(covid19_data, aes(x=Sex, y= `Total Deaths`, fill=`Age Group`, stat = "identity")) +
geom_boxplot()
months = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
ggplot(covid19_data, aes(x= Month, y= `Total Deaths`)) + geom_bar(stat="identity") +
theme_classic() +
labs(title = "Total Deaths confirmed By Month", x= "MONTH",caption = "Fig.3", y= "Total Deaths confirmed", fill= months) + scale_x_discrete(limits = months)+
theme(plot.title = element_text(hjust = 0.5))
###Covid death by month across States
months = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
p2 <- ggplot(covid19_data[!duplicated(covid19_data),], aes(fill=State, y=log(`COVID-19 Deaths`), x= Month)) +
geom_bar( stat="identity") +
facet_wrap(~ State) +
labs(title = "Monthly deaths per State", caption = "Fig.4") +
xlab("Month") + ylab("Deaths")+
scale_x_discrete(limits = months)
p2 <- p2 + theme(axis.text.x = element_text(angle = 90)) +
theme(legend.position = "none")
options(scipen=5)
p2
## Warning: Removed 13101 rows containing missing values (`geom_bar()`).
It can be observed that there that major differences with regards to the trends in the data, with the occurrence of mortalities recorded in every state across United States.
months = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
# plot data
ggplot(data = covid19_data, aes(x = Month, y = `COVID-19 Deaths`)) +
geom_bar(stat='identity') +
scale_x_discrete(limits = months) +
labs(title = "COVID-19 Deaths by Month",
caption = "Fig.5",
x = "Month",
y = "COVID-19 Deaths cases",
size = 5,fill= months)
newcovid19_dta <- covid19_data[-(1:51),]
Below is report on the top 10 states with high covid-19 death
# group data
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:raster':
##
## intersect, select, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
newcovid19_dta <- subset(newcovid19_dta, State != "United States")
names(newcovid19_dta)
## [1] "Start Date"
## [2] "End Date"
## [3] "Month"
## [4] "State"
## [5] "Sex"
## [6] "Age Group"
## [7] "COVID-19 Deaths"
## [8] "Total Deaths"
## [9] "Pneumonia and COVID-19 Deaths"
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"
## [11] "Sex.code"
top_State_by_total_cases =newcovid19_dta %>%
group_by(State)%>%
summarize(total_cases = max(`COVID-19 Deaths`)) %>%
top_n(10, total_cases)
# see result
top_State_by_total_cases
## # A tibble: 10 × 2
## State total_cases
## <chr> <dbl>
## 1 Arizona 4032
## 2 California 20064
## 3 Florida 10406
## 4 Illinois 4182
## 5 New Jersey 9010
## 6 New York 7288
## 7 New York City 14926
## 8 Ohio 6051
## 9 Pennsylvania 6546
## 10 Texas 10724
ggplot(top_State_by_total_cases, aes(reorder(State, total_cases),total_cases, group= State, fill=State)) +
geom_bar(stat='identity') +
ylab("Cumulative Death ") +
coord_flip() +
labs(title = "Cummulative Death for Top 10 States ",
caption = "Fig.6")
From the above plot, it can be suggested from the data that California has high record of covid-19 death in recent times followed by New York City.
Below reports the the bottom 10 states by covid-19 death in the united states
# group data
#install.packages("dplyr")
library(dplyr)
newcovid19_dta <- subset(newcovid19_dta, State != "United States")
names(newcovid19_dta)
## [1] "Start Date"
## [2] "End Date"
## [3] "Month"
## [4] "State"
## [5] "Sex"
## [6] "Age Group"
## [7] "COVID-19 Deaths"
## [8] "Total Deaths"
## [9] "Pneumonia and COVID-19 Deaths"
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"
## [11] "Sex.code"
Bottom_State_by_total_cases =newcovid19_dta %>%
group_by(State)%>%
summarize(total_cases = max(`COVID-19 Deaths`)) %>%
slice_min(total_cases, n=10)
# see result
Bottom_State_by_total_cases
## # A tibble: 10 × 2
## State total_cases
## <chr> <dbl>
## 1 Vermont 71
## 2 Wyoming 210
## 3 Alaska 220
## 4 Hawaii 251
## 5 District of Columbia 272
## 6 New Hampshire 312
## 7 Maine 338
## 8 Delaware 340
## 9 Montana 444
## 10 Rhode Island 511
ggplot(Bottom_State_by_total_cases, aes(reorder(State, total_cases),total_cases, group= State, fill=State)) +
geom_bar(stat='identity') +
coord_flip() +
ylab("Cumulative Death ") +
labs(subtitle = "Cummulative Death for 10 Bottom States ",
caption = "Fig.7")
From the above plot as shown visually from the data suggest that Vermont as the low rate of covid-19 death followed by Wyoming.
The aggregate measures on covid-19 cases have resulted in several statistics emerging depicting differences that are noticeable in the spread of the infectious disease(COVID-19).These disparities have been noticed across age and sex. This study however investigates and examine all triggering factors associated with covid-19 mortality in USA.The following analysis method are employed on the above dataset to gain more insight in Covid-19 happening in United States.
1.Linear Regression Model
2.Anova Test
3.Time Series
4.Spatial Analysis
To explore the relationship between covid-19 death and Total Death, a linear regression model is employed to stimulate the trend to enable the prediction of future condition. The linear regression model can determine the value of one response variable based on the value of the given predictor variable.Below is a report on regression model.
#build regression model using lm()
dat_lm1 <- lm(`Total Deaths`~`COVID-19 Deaths`, data=covid19_data)
options(digits=2)
summary(dat_lm1)
##
## Call:
## lm(formula = `Total Deaths` ~ `COVID-19 Deaths`, data = covid19_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -202271 -745 -698 -258 263899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 745.4889 33.7721 22.1 <2e-16 ***
## `COVID-19 Deaths` 5.4492 0.0192 283.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6500 on 37762 degrees of freedom
## Multiple R-squared: 0.681, Adjusted R-squared: 0.681
## F-statistic: 8.05e+04 on 1 and 37762 DF, p-value: <2e-16
#build regression model using lm()
dat_lm2 <- lm(`COVID-19 Deaths`~ Sex, data=covid19_data)
options(digits=2)
summary(dat_lm2)
##
## Call:
## lm(formula = `COVID-19 Deaths` ~ Sex, data = covid19_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -341 -231 -180 -113 105210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 341.4 14.8 23.09 < 2e-16 ***
## SexFemale -161.5 22.0 -7.33 2.3e-13 ***
## SexMale -133.8 21.4 -6.24 4.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1740 on 37761 degrees of freedom
## Multiple R-squared: 0.00169, Adjusted R-squared: 0.00164
## F-statistic: 32 on 2 and 37761 DF, p-value: 1.25e-14
#plot the models
ggplot(dat_lm1, aes(x = `COVID-19 Deaths`, y = `Total Deaths`)) +
geom_point(col ="red" )+
stat_smooth(method = "lm", col = "cornflowerblue", caption = "Fig.8")
## Warning in stat_smooth(method = "lm", col = "cornflowerblue", caption =
## "Fig.8"): Ignoring unknown parameters: `caption`
## `geom_smooth()` using formula = 'y ~ x'
From the above statistical output, there is a suggestion that the p-value is smaller than 0.05, hence making the linear regression model exhibit a positive relationship between the total number of death and covid-19 Deaths. There is also the assertion from the regression that sex has p-values very small, closer to zero, hence making it statistically significant at 0.05 level.
The data samples explored for this research are extracted from the Center for disease control (CDC) makes available provisional data on Covid-19 Deaths by sex, age, and state. Key variables included age sex, covid-19 deaths, and Total deaths. These however form the basis for conducting statistical analysis on the data with the formed hypothesis. ANOVA test is conducted on the data to confirm or reject the set hypothesis.
Ho: there is no difference in covid-19 deaths recorded regarding respective age groups and sex
H1: there is a difference in covid-19 deaths recorded regarding respective age groups and sex
A one-way ANOVA tests
First, a separate one-way ANOVA test is conducted on age groups and sex from the dataset using the statistical software R. The model lists the independent variable being tested in the model Age group and model residuals. The second test on sex also list the independent variable and the model residuals. Also in display is the degree of freedom and sum of squares for age and sex respectively. The two separate test reports a p-value less than 0.001, making Age group and sex have a significant impact covid deaths recorded. This however conforms with prior studies that have suggested ages of patients as a core trigger when it comes to Covid-19 deaths. Older ages they suggest have a higher chance of getting infected with the virus and being fatal as well. Hence the null hypothesis (Ho) is rejected.
###One way anova
one.way <- aov(`COVID-19 Deaths` ~ `Age Group`, data = covid19_data)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Age Group` 16 2787432112 174214507 58.9 <2e-16 ***
## Residuals 37747 111622464656 2957121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
one.way_sex <- aov(`COVID-19 Deaths` ~ Sex , data = covid19_data)
summary(one.way_sex)
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 2 193801339 96900670 32 1.3e-14 ***
## Residuals 37761 114216095429 3024711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A two-way ANOVA test (without interaction)
A further test is however conducted by modeling covid-deaths as a function of age groups and sex. It can be observed from the test results below that both age groups and sex (male and female) are statistically significant with p-values less than 0.001.
##Two-way ANOVA without interaction of two independent variables
two.way <- aov(`COVID-19 Deaths` ~ `Age Group` + Sex, data = covid19_data)
summary(two.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Age Group` 16 2787432112 174214507 59.0 < 2e-16 ***
## Sex 2 185165313 92582657 31.4 2.5e-14 ***
## Residuals 37745 111437299343 2952372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A two-way ANOVA test (Interaction)
Moreover, the study also employs another statistical model to find out whether the independent variables tested earlier have an interaction rather than the only addictive effect. The test reports that Age Group: sex to have an f-value of 3.13 and a p-value closer to zero(less than 0.001). This suggests that there are not many variations explained by the interaction between age group and sex.
##Two-way ANOVA with interaction of two independent variables
interaction <- aov(`COVID-19 Deaths` ~ `Age Group`*Sex, data = covid19_data)
summary(interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Age Group` 16 2787432112 174214507 59.11 < 2e-16 ***
## Sex 2 185165313 92582657 31.41 2.3e-14 ***
## `Age Group`:Sex 32 287429510 8982172 3.05 1.6e-08 ***
## Residuals 37713 111149869833 2947256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above models three models, the best model is sought out for the study. This is done by using Akaike Information Criterion (AIC) which is a good test for model fit. Considering the results after running the test, the two-way model with interaction is suggested as the best fit model for this study. This model has the lowest AIC value of 674013 and AIC weight of 1, which means that it explains 100% of the total variation in the dependent variable that can be explianed by the full model
library(AICcmodavg)
###model fit for all the three models
model.set <- list(one.way, one.way_sex, two.way, interaction)
model.names <- c("one.way","one.way_sex", "two.way", "interaction")
aictab(model.set, modnames = model.names)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## interaction 52 669770 0 1 1 -334833
## two.way 20 669803 33 0 1 -334882
## one.way 18 669862 92 0 1 -334913
## one.way_sex 4 670701 932 0 1 -335347
Homoscedasticity in model fit
Further check on the model to determine whether it fits the assumptions of homoscedasticity is conducted by plotting the model diagnostic in R statistical software. Each plot gives a specific piece of information about the model fit chosen. The normal Q-Q plot plots a regression between the theoretical residuals of the model. This Q-Q plot has a little bit of deviation.
par(mfrow=c(2,2))
plot(two.way)
Time series as a statistical technique allows for analysis of a time series data or trend with the goal of extracting meaningful statistics. For the purpose of this project time series forecasting is employed to predict values based on previous observed data, which represent variables that take on different times.
From the above exploratory analysis, it can be observed that California in recent times has the top number of fatalities with regards to the deadly virus(Covid-19). A deeper look is done to predict the number of fatalities for the top four states.
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:raster':
##
## intersect, union
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# group by date
str(newcovid19_dta)
## tibble [36,030 × 11] (S3: tbl_df/tbl/data.frame)
## $ Start Date : chr [1:36030] "01/01/2020" "01/01/2020" "01/01/2020" "01/01/2020" ...
## $ End Date : chr [1:36030] "01/31/2020" "01/31/2020" "01/31/2020" "01/31/2020" ...
## $ Month : num [1:36030] 1 1 1 1 1 1 1 1 1 1 ...
## $ State : chr [1:36030] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Sex : chr [1:36030] "All Sexes" "All Sexes" "All Sexes" "All Sexes" ...
## $ Age Group : chr [1:36030] "Under 1 year" "18-29 years" "40-49 years" "45-54 years" ...
## $ COVID-19 Deaths : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total Deaths : num [1:36030] 31 72 199 297 1034 ...
## $ Pneumonia and COVID-19 Deaths : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
## $ Pneumonia, Influenza, or COVID-19 Deaths: num [1:36030] 0 0 13 21 75 91 65 0 0 0 ...
## $ Sex.code : chr [1:36030] "3" "3" "3" "3" ...
## - attr(*, "na.action")= 'omit' Named int [1:66888] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:66888] "1" "2" "3" "4" ...
#Converting factor datatypes to character
newcovid19_dta[["State"]] <- as.character(newcovid19_dta[["State"]] )
newcovid19_dta$`Start Date`<- as.Date(newcovid19_dta$`Start Date`, format = "%m/%d/%Y")
str(newcovid19_dta)
## tibble [36,030 × 11] (S3: tbl_df/tbl/data.frame)
## $ Start Date : Date[1:36030], format: "2020-01-01" "2020-01-01" ...
## $ End Date : chr [1:36030] "01/31/2020" "01/31/2020" "01/31/2020" "01/31/2020" ...
## $ Month : num [1:36030] 1 1 1 1 1 1 1 1 1 1 ...
## $ State : chr [1:36030] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Sex : chr [1:36030] "All Sexes" "All Sexes" "All Sexes" "All Sexes" ...
## $ Age Group : chr [1:36030] "Under 1 year" "18-29 years" "40-49 years" "45-54 years" ...
## $ COVID-19 Deaths : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total Deaths : num [1:36030] 31 72 199 297 1034 ...
## $ Pneumonia and COVID-19 Deaths : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
## $ Pneumonia, Influenza, or COVID-19 Deaths: num [1:36030] 0 0 13 21 75 91 65 0 0 0 ...
## $ Sex.code : chr [1:36030] "3" "3" "3" "3" ...
## - attr(*, "na.action")= 'omit' Named int [1:66888] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:66888] "1" "2" "3" "4" ...
coronaData_California <- newcovid19_dta %>%
select(`Start Date`, State, `COVID-19 Deaths`, `Total Deaths`) %>%
group_by(`Start Date`) %>%
filter(State == "California") %>%
summarise(totalConfirmed = sum(`COVID-19 Deaths`,na.rm=TRUE), totalDeaths = sum(`Total Deaths`,na.rm=TRUE))
tail(coronaData_California)
## # A tibble: 6 × 3
## `Start Date` totalConfirmed totalDeaths
## <date> <dbl> <dbl>
## 1 2022-05-01 2458 97449
## 2 2022-06-01 3792 95804
## 3 2022-07-01 6391 105223
## 4 2022-08-01 5521 103480
## 5 2022-09-01 3425 93351
## 6 2022-10-01 830 33418
The Prophet model is employed in this instance, which require columns to be named as “ds” and variable column to be named “y” to perform forecast
# get coronavirus confirmed cases data for New York state
coronaData_confirmed_California <- coronaData_California %>%
select(`Start Date`, totalDeaths)
# get coronavirus deaths data for New York state
coronaData_death_california <- coronaData_California %>%
select(`Start Date`, totalConfirmed)
# transforming data for forecasting
names(coronaData_confirmed_California) <- c("ds", "y")
names(coronaData_death_california) <- c("ds", "y")
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
library(lubridate)
#Predicting covid-19 confirmed Death of California State using predict()
preCA <- prophet(coronaData_confirmed_California)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
futureCA <- make_future_dataframe(preCA,periods = 45)
forecastCA <- predict(preCA,futureCA)
tail(forecastCA[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
## ds yhat yhat_lower yhat_upper
## 74 2022-11-10 -998930 -1019505 -979964
## 75 2022-11-11 -1017583 -1038619 -998387
## 76 2022-11-12 -1011638 -1030874 -991128
## 77 2022-11-13 -982288 -1002161 -963416
## 78 2022-11-14 -931301 -951262 -910955
## 79 2022-11-15 -860977 -880965 -841636
#broken down the prediction into trend and yearly seasonality and plot
prophet_plot_components(preCA,forecastCA)
From the above plots, it can be suggested that there was initial high trends in relation to covid-19 death resulting from covid-19 in the State of California as seen from the data. However, the trends seems to decline over the subsequent year periods(2022 and beyond).The main attributing factor for this decline been availability of proven and tested vaccines for covid and also with a higher number of population getting vaccinated.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
colnames(covid19_data)[1] = "Date"
newcovid19_dta <- covid19_data %>%
subset(State == "United States") %>%
dplyr::select(Date, `COVID-19 Deaths`, `Total Deaths`) %>%
mutate(Date = mdy(Date))
plot1 <- ggplot(newcovid19_dta, aes(x=Date, y=`COVID-19 Deaths`)) +
geom_line()
plot2 <- ggplot(newcovid19_dta, aes(x=Date, y=`Total Deaths`)) +
geom_line()
grid.arrange(plot1, plot2, nrow = 2)
From the above graph it can be suggested that Covid-19 death since its presence in united states has had an up trends in 2020 and 2021, However tends to have a declining trend in 2022.
newcovid19_dta_cts <- ts(
data = newcovid19_dta$`COVID-19 Deaths`,
start = min(newcovid19_dta$Date),
frequency = 7
)
plot(newcovid19_dta_cts, main = "COVID-19 Deaths")
newcovid19_dta_cts <- ts(
data = newcovid19_dta$`Total Deaths`,
start = min(newcovid19_dta$Date),
frequency = 7
)
plot(newcovid19_dta_cts, main = "Total Deaths")
From the above plots, it can be observed that our data model flatuates and has up and down trends. The model assumes that the data does not really have an increasing trend.
In this step the data is been split into train and test data. A closer look of the data is observed from the last 7 days as train data and the rest as test data.
test <- tail(newcovid19_dta_cts, 7)
train <- head(newcovid19_dta_cts, length(newcovid19_dta_cts) - length(test))
plot(train)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
train_ets <- ets(y = train, model = "ZZZ")
train_ets
## ETS(M,N,N)
##
## Call:
## ets(y = train, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.7866
##
## Initial states:
## l = 130174.0713
##
## sigma: 1.3
##
## AIC AICc BIC
## 45237 45237 45253
From the above auto modelling it can be observed that our model has no seasonal pattern in trends.
checkresiduals(train_ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 561, df = 8, p-value <2e-16
##
## Model df: 2. Total lags used: 10
plot(train_ets)
train_fore <- forecast(train_ets, 7)
train_fore
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1728 967 -595 2530 -1423 3358
## 1729 967 -1553 3488 -2888 4823
## 1730 967 -2789 4724 -4778 6713
## 1731 967 -4464 6399 -7339 9274
## 1732 967 -6779 8714 -10879 12814
## 1733 967 -10010 11944 -15820 17755
## 1734 967 -14539 16474 -22748 24683
plot(train_fore)
accuracy(train_fore, test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -95 55889 23591 -1259 1312 1.04 -0.32
## Test set 7831 10369 7831 76 76 0.35 NA
The model as suggested above has a mean absolute percentage error of 76% and MASE OF 0.35%
Spatial Analysis is the process of examining the locations, attributes and relationships of a spatial data through overlay whiles employing other analytical techniques to address or gain useful knowledge. Thus, this involves extracting or creating new information about a set of geographic features to perform routine examination, assessment, evaluation, analysis or modeling of data in a geographic area based on pre-established and computerized criteria and standards.
##subset dataset
## data wrangling
newdata<- new_data %>%
filter(quantity_last_updated > "2021-08-01" & quantity_last_updated < "2022-08-01")
#glimpse(newdata)
REMOVAL OF MISSING INCIDENTS The dataset is prepossessed to remove all incidents without any latitude and longitude
####Remove all incidents with no Lat/Long
new_data <- newdata[!is.na(newdata$latitude),]
The easiest way in getting this is to download the united states map using the map_data command from the maps package, which is plotted using the geom_polygon within ggplot.
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:viridis':
##
## unemp
## The following object is masked from 'package:plyr':
##
## ozone
USA_States <- map_data("state") ## downloading usa map
##ploting the data using ggplot
ggplot() + geom_polygon(data= USA_States, aes(x=long, y= lat, group = group), fill= NA, colour= "black") +
coord_fixed(1.3)
The cleaned dataset has longitude and latitude coordinates whic can sometimes be called point data. Below is a display of these coordinates
library(ggspatial)
library(usmap)
ggplot() + geom_polygon(data = USA_States, aes(x=long, y= lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3) +
geom_point( data= new_data, aes(x = longitude, y = latitude)) + labs(title = " The Locations of Covid-19 vaccination coordinates in the United States", x= "Longitude", y= "Latitude", caption="Fig.9")
Below is also a view of the coordinates of the cleaned data on the world map
world_map <- map_data("world")
##visualizing our data
ggplot() + geom_polygon(data= world_map, aes(x=long, y= lat, group = group),fill= NA, colour= "black") +
coord_fixed(1.3) +
geom_point(data= new_data, aes(x = longitude, y = latitude), colour = "red")+
labs(title = " The Locations of Covid-19 vaccination in the United States on World map", x= "Longitude", y= "Latitude", caption="Fig.10")
Making changes to the map view of the coordinates used in the dataset which gives a nice clean background of our generated map.
ggplot() + geom_polygon(data = USA_States, aes(x=long, y= lat, group = group), fill = NA, colour = "black") +
coord_fixed(1.3) +
geom_point( data= new_data, aes(x = longitude, y = latitude, colour= "red"), alpha= 1/15 ) +
theme_minimal() + theme(legend.position = "none") +
labs(title = " The Locations of Covid-19 vaccination in the United States", x= "Longitude", y= "Latitude",caption="Fig.11")
Further advanced map can be generated using the map view to illustrate the coordinates in the vaccination data for the state of indiana and USA. Below illustrates the visual mapview on these coordinates.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.7 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::%@%() masks rlang::%@%()
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ purrr::as_function() masks rlang::as_function()
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::desc() masks plyr::desc()
## ✖ tidyr::extract() masks raster::extract()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks rlang::flatten()
## ✖ purrr::flatten_chr() masks rlang::flatten_chr()
## ✖ purrr::flatten_dbl() masks rlang::flatten_dbl()
## ✖ purrr::flatten_int() masks rlang::flatten_int()
## ✖ purrr::flatten_lgl() masks rlang::flatten_lgl()
## ✖ purrr::flatten_raw() masks rlang::flatten_raw()
## ✖ dplyr::id() masks plyr::id()
## ✖ lubridate::intersect() masks raster::intersect(), base::intersect()
## ✖ purrr::invoke() masks rlang::invoke()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::map() masks maps::map()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::select() masks raster::select()
## ✖ lubridate::setdiff() masks base::setdiff()
## ✖ purrr::splice() masks rlang::splice()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ✖ lubridate::union() masks raster::union(), base::union()
library(sf)
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(mapview)
###Map data for Indiana
VaccineIN <- new_data %>%
filter(loc_admin_state == "IN")
#VaccineNC %>%
# glimpse()
## mapview for Indiana with coordinates from the clean data
mapview(VaccineIN, xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE)
###map view for United States with Coordinates
mapview(new_data, xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE)