Covid-19 Statistical Analysis, Evidence from USA

Introduction

Coronavirus Disease(Covid-19), caused by Severe Acute Respiratory Syndrome Coronavirus 2(SARS-COV-2) was first identified in December 2019 in Wuhan, a rail and aviation hub in the Hubei Province of China. However since its discovery, there has been over 95 million cases of covid-19 recorded with almost 1 million fatalities. However regardless of these high number of cases in united states since its discovery, almost every single country around the globe has been affected from this pandemic with cases been recorded in all borders, making prevention of its spread a top priority for governments, medical and political leaders. The associated symptoms of the virus include persistent cough, fever,fatigue and loss of smell or taste(Grant et al,2020).

Despite movement restrictions imposed in all 50 states during the peak period of covid-19,there has been cases mortality documented across all states in United States. A lot of residents in various states however remain susceptible to Covid-19, highlighting the need for urgent and effective vaccine to curb the spread of the virus. The evolving spread of the virus on the entire population of US has been very critical with the CDC committing much effort into research with private stakeholders on the development of a vaccine.

To mitigate the spread of this virus, there has been an increase pace in vaccine development. In December 2020, efficacy results from vaccine clinal trials were reported,boosting the confidence in the healthcare system in United States. Since its inception into the borders of united states, there has been so many studies and research conducted on covid-19 using visualization and statistical methods to drive meaning from the rising number of cases and its associated mortality rates.

In the united states , covid-19 has been a test for the healthcare system across every state with severe threat to the public healthcare system. This resulted in most hospitals and healthcare expert been stretched to their maximum during the peak days of virus. However since the introduction and development vaccine to boost the immune system of individuals against Covid-19, there has been a drastic decline in the infection rate, although there are still cases of covid-19 been recorded every across every state. This is as a result of many people declining to get vaccine with their concern been that the vaccine development was so quick and cannot be trusted. However, prior studies suggest that health workers and individuals aged 65 and older were prioritized for vaccination(Gary,2020). This prioritization however significantly relied on the existence of pre-underlying health conditions such as diabetes and hypertension with high tendency for them develop severe disease (Wang et al, 2020).

This project aims to conduct statistical analysis on the covid-19 occurrence in the united states whiles using various statistical tools from the R-statistical software to visualize the dataset to drive meaning from these occurrences. This study is however organised as follows; data collection, descriptive statistics, exploratory analysis, methodology , results discussion and conclusion.

Data Collection

To get answers to the purpose intended for this project and all corresponding hypothesis that will be established, this research employs two dataset from the CDC labeled “Provisional Covid-19 Death by Sex and Age” and “Vaccine_gov_covid_19”. Both dataset used in this study are publicly made available on the CDC website.This project employs R statistical software on a gathered panel to run the statistical analysis and visualization. The provisional data made available from the CDC website is at both state and country level which makes provision for biographic information such as age and sex. The initial dataset label provisional covid-19 death by age and sex consisted of 104652 observations with 16 variables whiles vaccination dataset consisted of 468596 observations of 33 variables. Below is the initial view of the dataset used for this project.

library(readr)
Provisional_COVID_19_Deaths_by_Sex_and_Age_4_ <- read_csv("Provisional_COVID-19_Deaths_by_Sex_and_Age (4).csv")

## Rows: 104652 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Data As Of, Start Date, End Date, Group, State, Sex, Age Group, Foo...
## dbl (8): Year, Month, COVID-19 Deaths, Total Deaths, Pneumonia Deaths, Pneum...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#View(Provisional_COVID_19_Deaths_by_Sex_and_Age_4_)
covid19_data <- Provisional_COVID_19_Deaths_by_Sex_and_Age_4_

 dim(covid19_data)

## [1] 104652     16

 names(covid19_data)

##  [1] "Data As Of"                              
##  [2] "Start Date"                              
##  [3] "End Date"                                
##  [4] "Group"                                   
##  [5] "Year"                                    
##  [6] "Month"                                   
##  [7] "State"                                   
##  [8] "Sex"                                     
##  [9] "Age Group"                               
## [10] "COVID-19 Deaths"                         
## [11] "Total Deaths"                            
## [12] "Pneumonia Deaths"                        
## [13] "Pneumonia and COVID-19 Deaths"           
## [14] "Influenza Deaths"                        
## [15] "Pneumonia, Influenza, or COVID-19 Deaths"
## [16] "Footnote"

library(readr)
Vaccines_gov_COVID_19_vaccinating_provider_locations <- read_csv("Vaccines.gov__COVID-19_vaccinating_provider_locations.csv")

## Rows: 467734 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (22): provider_location_guid, loc_store_no, loc_phone, loc_name, loc_ad...
## dbl   (5): supply_level, latitude, longitude, min_age_months, min_age_years
## lgl   (5): insurance_accepted, walkins_accepted, in_stock, Unnamed Column, o...
## date  (1): quantity_last_updated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#View(Vaccines_gov_COVID_19_vaccinating_provider_locations)
vaccination_data <- Vaccines_gov_COVID_19_vaccinating_provider_locations


dim(vaccination_data)

## [1] 467734     33

names(vaccination_data)

##  [1] "provider_location_guid" "loc_store_no"           "loc_phone"             
##  [4] "loc_name"               "loc_admin_street1"      "loc_admin_street2"     
##  [7] "loc_admin_city"         "loc_admin_state"        "loc_admin_zip"         
## [10] "sunday_hours"           "monday_hours"           "tuesday_hours"         
## [13] "wednesday_hours"        "thursday_hours"         "friday_hours"          
## [16] "saturday_hours"         "web_address"            "pre_screen"            
## [19] "insurance_accepted"     "walkins_accepted"       "provider_notes"        
## [22] "ndc"                    "med_name"               "in_stock"              
## [25] "supply_level"           "quantity_last_updated"  "latitude"              
## [28] "longitude"              "Category"               "Unnamed Column"        
## [31] "offers_free_masks"      "min_age_months"         "min_age_years"

### Removing of missing incidents
new_data <- vaccination_data[!is.na(vaccination_data$latitude),]

#summary(new_data)

library(raster)

## Loading required package: sp

wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
vaccination_spt= SpatialPointsDataFrame(vaccination_data[,c('longitude','latitude')], vaccination_data, proj4string = wgs84)
plot(vaccination_spt)

Descriptive Statistics

Below is a descriptive analysis of the dataset used for this project.The study employs R statistical software to run summary on both data set coupled with chatting the missing cases with these dataset.Below reports the total number of missing cases from vaccination data and covid19_data to be 2303113 and 199737 respectively. The cleaning of the dataset resulted in a final data for covid19_data as 45157 observation with 10 variables.

library(naniar)
summary(covid19_data)

##   Data As Of         Start Date          End Date            Group          
##  Length:104652      Length:104652      Length:104652      Length:104652     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       Year          Month           State               Sex           
##  Min.   :2020   Min.   : 1.000   Length:104652      Length:104652     
##  1st Qu.:2020   1st Qu.: 3.000   Class :character   Class :character  
##  Median :2021   Median : 6.000   Mode  :character   Mode  :character  
##  Mean   :2021   Mean   : 6.206                                        
##  3rd Qu.:2022   3rd Qu.: 9.000                                        
##  Max.   :2022   Max.   :12.000                                        
##  NA's   :2754   NA's   :11016                                         
##   Age Group         COVID-19 Deaths      Total Deaths     Pneumonia Deaths  
##  Length:104652      Min.   :      0.0   Min.   :      0   Min.   :     0.0  
##  Class :character   1st Qu.:      0.0   1st Qu.:     43   1st Qu.:     0.0  
##  Mode  :character   Median :     11.0   Median :    156   Median :    19.0  
##                     Mean   :    376.1   Mean   :   2862   Mean   :   366.5  
##                     3rd Qu.:     67.0   3rd Qu.:    678   3rd Qu.:    82.0  
##                     Max.   :1062929.0   Max.   :9401189   Max.   :980051.0  
##                     NA's   :28371       NA's   :14881     NA's   :32565     
##  Pneumonia and COVID-19 Deaths Influenza Deaths   
##  Min.   :     0.0              Min.   :    0.000  
##  1st Qu.:     0.0              1st Qu.:    0.000  
##  Median :     0.0              Median :    0.000  
##  Mean   :   189.4              Mean   :    3.585  
##  3rd Qu.:    31.0              3rd Qu.:    0.000  
##  Max.   :539118.0              Max.   :12475.000  
##  NA's   :27419                 NA's   :18414      
##  Pneumonia, Influenza, or COVID-19 Deaths   Footnote        
##  Min.   :      0.0                        Length:104652     
##  1st Qu.:      0.0                        Class :character  
##  Median :     27.0                        Mode  :character  
##  Mean   :    559.7                                          
##  3rd Qu.:    121.0                                          
##  Max.   :1514510.0                                          
##  NA's   :31774

gg_miss_upset(covid19_data)

sum(is.na(covid19_data))

## [1] 200150

#summary(vaccination_data)
gg_miss_upset(vaccination_data)

sum(is.na(vaccination_data))

## [1] 2295957

Data Cleaning

###cleaning dataset of covid-19 data
COVID_19_new <- subset(covid19_data, select = -c(`Pneumonia Deaths`,`Influenza Deaths`,Group,Year,Footnote,`Data As Of`) )
covid.data<-COVID_19_new

covid.data<-COVID_19_new

covid19_data<-na.omit(covid.data)
names(covid19_data)

##  [1] "Start Date"                              
##  [2] "End Date"                                
##  [3] "Month"                                   
##  [4] "State"                                   
##  [5] "Sex"                                     
##  [6] "Age Group"                               
##  [7] "COVID-19 Deaths"                         
##  [8] "Total Deaths"                            
##  [9] "Pneumonia and COVID-19 Deaths"           
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"

head(covid19_data)

## # A tibble: 6 × 10
##   `Start Date` `End Date` Month State         Sex   `Age Group` `COVID-19 Deat…`
##   <chr>        <chr>      <dbl> <chr>         <chr> <chr>                  <dbl>
## 1 01/01/2020   01/31/2020     1 United States All … All Ages                   6
## 2 01/01/2020   01/31/2020     1 United States All … Under 1 ye…                0
## 3 01/01/2020   01/31/2020     1 United States All … 0-17 years                 0
## 4 01/01/2020   01/31/2020     1 United States All … 1-4 years                  0
## 5 01/01/2020   01/31/2020     1 United States All … 5-14 years                 0
## 6 01/01/2020   01/31/2020     1 United States All … 15-24 years                0
## # … with 3 more variables: `Total Deaths` <dbl>,
## #   `Pneumonia and COVID-19 Deaths` <dbl>,
## #   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>

dim(covid19_data)

## [1] 37764    10

##convert the column to a factor
library(plyr)
covid19_data$Sex.code <- revalue(covid19_data$Sex, c("Male"="1", "Female"="2", "All Sexes"="3"))
covid19_data$Sex.code <- mapvalues(covid19_data$Sex, from = c("Male", "Female", "All Sexes"), to = c("1", "2", "3"))
covid19_data

## # A tibble: 37,764 × 11
##    `Start Date` `End Date` Month State        Sex   `Age Group` `COVID-19 Deat…`
##    <chr>        <chr>      <dbl> <chr>        <chr> <chr>                  <dbl>
##  1 01/01/2020   01/31/2020     1 United Stat… All … All Ages                   6
##  2 01/01/2020   01/31/2020     1 United Stat… All … Under 1 ye…                0
##  3 01/01/2020   01/31/2020     1 United Stat… All … 0-17 years                 0
##  4 01/01/2020   01/31/2020     1 United Stat… All … 1-4 years                  0
##  5 01/01/2020   01/31/2020     1 United Stat… All … 5-14 years                 0
##  6 01/01/2020   01/31/2020     1 United Stat… All … 15-24 years                0
##  7 01/01/2020   01/31/2020     1 United Stat… All … 18-29 years                0
##  8 01/01/2020   01/31/2020     1 United Stat… All … 25-34 years                0
##  9 01/01/2020   01/31/2020     1 United Stat… All … 30-39 years                0
## 10 01/01/2020   01/31/2020     1 United Stat… All … 35-44 years                0
## # … with 37,754 more rows, and 4 more variables: `Total Deaths` <dbl>,
## #   `Pneumonia and COVID-19 Deaths` <dbl>,
## #   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Sex.code <chr>

names(covid19_data)

##  [1] "Start Date"                              
##  [2] "End Date"                                
##  [3] "Month"                                   
##  [4] "State"                                   
##  [5] "Sex"                                     
##  [6] "Age Group"                               
##  [7] "COVID-19 Deaths"                         
##  [8] "Total Deaths"                            
##  [9] "Pneumonia and COVID-19 Deaths"           
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"
## [11] "Sex.code"

unique(covid19_data[c('State')])

## # A tibble: 54 × 1
##    State               
##    <chr>               
##  1 United States       
##  2 Alabama             
##  3 Alaska              
##  4 Arizona             
##  5 Arkansas            
##  6 California          
##  7 Colorado            
##  8 Connecticut         
##  9 Delaware            
## 10 District of Columbia
## # … with 44 more rows

colSums(is.na(covid19_data))

##                               Start Date 
##                                        0 
##                                 End Date 
##                                        0 
##                                    Month 
##                                        0 
##                                    State 
##                                        0 
##                                      Sex 
##                                        0 
##                                Age Group 
##                                        0 
##                          COVID-19 Deaths 
##                                        0 
##                             Total Deaths 
##                                        0 
##            Pneumonia and COVID-19 Deaths 
##                                        0 
## Pneumonia, Influenza, or COVID-19 Deaths 
##                                        0 
##                                 Sex.code 
##                                        0

covid19_data[duplicated(covid19_data),]

## # A tibble: 0 × 11
## # … with 11 variables: Start Date <chr>, End Date <chr>, Month <dbl>,
## #   State <chr>, Sex <chr>, Age Group <chr>, COVID-19 Deaths <dbl>,
## #   Total Deaths <dbl>, Pneumonia and COVID-19 Deaths <dbl>,
## #   Pneumonia, Influenza, or COVID-19 Deaths <dbl>, Sex.code <chr>

Variable Definition

table(covid19_data$Sex)

## 
## All Sexes    Female      Male 
##     13836     11348     12580

barplot(table(covid19_data$Sex))

Age Group

This defines the respective age groups related to the cases been recorded when it comes to covid-19. Below is the count distribution of Age Group as reported in the used dataset.

table(covid19_data$`Age Group`)

## 
##        0-17 years         1-4 years       15-24 years       18-29 years 
##              1682              1504              1479              1290 
##       25-34 years       30-39 years       35-44 years       40-49 years 
##              1146              1159              1283              1530 
##       45-54 years        5-14 years       50-64 years       55-64 years 
##              1966              1242              3093              2854 
##       65-74 years       75-84 years 85 years and over          All Ages 
##              3455              3616              3506              4741 
##      Under 1 year 
##              2218

barplot(table(covid19_data$`Age Group`))

COVID-19 Deaths

This variable describe the moralities recorded resulting from covid-19 infections

Total Deaths

This also defines the total number of death recorded resulting from covid-19 and other underlying illness.

Exploratory Data Analysis

Below is the exploratory analysis of the variables described above within the covid19_data to further explore the relationship with other triggering variables that will be used in analysis.

library(ggplot2)
library(hrbrthemes)

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(viridis)

## Loading required package: viridisLite

#### Sex and Covid-19 deaths
ggplot(covid19_data, aes(x=`COVID-19 Deaths`, y= `Age Group` , fill=Sex, stat = "identity")) + 
  geom_boxplot()

## Age groups and Covid-19 deaths
ggplot(covid19_data, aes(x=Sex, y= `Total Deaths`, fill=`Age Group`, stat = "identity")) + 
  geom_boxplot()

months = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

ggplot(covid19_data, aes(x= Month, y= `Total Deaths`)) + geom_bar(stat="identity") +
  theme_classic() +
  labs(title = "Total Deaths confirmed By Month", x=  "MONTH",caption = "Fig.3", y= "Total Deaths confirmed", fill= months) + scale_x_discrete(limits = months)+
  theme(plot.title = element_text(hjust = 0.5))

###Covid death by month across States
months = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

p2 <- ggplot(covid19_data[!duplicated(covid19_data),], aes(fill=State, y=log(`COVID-19 Deaths`), x= Month)) +
  geom_bar( stat="identity") +
  facet_wrap(~ State) +
  labs(title = "Monthly deaths per State", caption = "Fig.4") +
  xlab("Month") + ylab("Deaths")+
  scale_x_discrete(limits = months)
p2 <- p2 + theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none")

options(scipen=5)

p2

## Warning: Removed 13101 rows containing missing values (`geom_bar()`).

It can be observed that there that major differences with regards to the trends in the data, with the occurrence of mortalities recorded in every state across United States.

months = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

# plot data
ggplot(data = covid19_data, aes(x = Month, y = `COVID-19 Deaths`)) +
    geom_bar(stat='identity') +
  scale_x_discrete(limits = months) +
    labs(title = "COVID-19 Deaths by Month",
         caption = "Fig.5",
         x = "Month",
         y = "COVID-19 Deaths cases",
         size = 5,fill= months)

newcovid19_dta <- covid19_data[-(1:51),]

Top 10 Covid-19 Death by State

Below is report on the top 10 states with high covid-19 death

# group data
#install.packages("dplyr") 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:raster':
## 
##     intersect, select, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

newcovid19_dta <- subset(newcovid19_dta, State != "United States")
names(newcovid19_dta)

##  [1] "Start Date"                              
##  [2] "End Date"                                
##  [3] "Month"                                   
##  [4] "State"                                   
##  [5] "Sex"                                     
##  [6] "Age Group"                               
##  [7] "COVID-19 Deaths"                         
##  [8] "Total Deaths"                            
##  [9] "Pneumonia and COVID-19 Deaths"           
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"
## [11] "Sex.code"

top_State_by_total_cases =newcovid19_dta %>%
    group_by(State)%>%
    summarize(total_cases = max(`COVID-19 Deaths`)) %>%
    top_n(10, total_cases)

# see result
top_State_by_total_cases

## # A tibble: 10 × 2
##    State         total_cases
##    <chr>               <dbl>
##  1 Arizona              4032
##  2 California          20064
##  3 Florida             10406
##  4 Illinois             4182
##  5 New Jersey           9010
##  6 New York             7288
##  7 New York City       14926
##  8 Ohio                 6051
##  9 Pennsylvania         6546
## 10 Texas               10724

ggplot(top_State_by_total_cases, aes(reorder(State, total_cases),total_cases, group= State, fill=State)) +
geom_bar(stat='identity') +
ylab("Cumulative Death ")  +
  coord_flip() +
    labs(title = "Cummulative Death for Top 10 States ",
         caption = "Fig.6")

From the above plot, it can be suggested from the data that California has high record of covid-19 death in recent times followed by New York City.

Bottom 10 Covid-19 Death by State

Below reports the the bottom 10 states by covid-19 death in the united states

# group data
#install.packages("dplyr") 
library(dplyr)
newcovid19_dta <- subset(newcovid19_dta, State != "United States")
names(newcovid19_dta)

##  [1] "Start Date"                              
##  [2] "End Date"                                
##  [3] "Month"                                   
##  [4] "State"                                   
##  [5] "Sex"                                     
##  [6] "Age Group"                               
##  [7] "COVID-19 Deaths"                         
##  [8] "Total Deaths"                            
##  [9] "Pneumonia and COVID-19 Deaths"           
## [10] "Pneumonia, Influenza, or COVID-19 Deaths"
## [11] "Sex.code"

Bottom_State_by_total_cases =newcovid19_dta %>%
    group_by(State)%>%
    summarize(total_cases = max(`COVID-19 Deaths`)) %>%
   slice_min(total_cases, n=10)

# see result
Bottom_State_by_total_cases

## # A tibble: 10 × 2
##    State                total_cases
##    <chr>                      <dbl>
##  1 Vermont                       71
##  2 Wyoming                      210
##  3 Alaska                       220
##  4 Hawaii                       251
##  5 District of Columbia         272
##  6 New Hampshire                312
##  7 Maine                        338
##  8 Delaware                     340
##  9 Montana                      444
## 10 Rhode Island                 511

ggplot(Bottom_State_by_total_cases, aes(reorder(State, total_cases),total_cases, group= State, fill=State)) +
geom_bar(stat='identity') + 
    coord_flip() +
ylab("Cumulative Death ")  +
    labs(subtitle = "Cummulative Death for 10 Bottom States ",
         caption = "Fig.7")

From the above plot as shown visually from the data suggest that Vermont as the low rate of covid-19 death followed by Wyoming.

Analysis Methods

The aggregate measures on covid-19 cases have resulted in several statistics emerging depicting differences that are noticeable in the spread of the infectious disease(COVID-19).These disparities have been noticed across age and sex. This study however investigates and examine all triggering factors associated with covid-19 mortality in USA.The following analysis method are employed on the above dataset to gain more insight in Covid-19 happening in United States.

1.Linear Regression Model

2.Anova Test

3.Time Series

4.Spatial Analysis

Linear Regression Model

To explore the relationship between covid-19 death and Total Death, a linear regression model is employed to stimulate the trend to enable the prediction of future condition. The linear regression model can determine the value of one response variable based on the value of the given predictor variable.Below is a report on regression model.

#build regression model using lm()
dat_lm1 <- lm(`Total Deaths`~`COVID-19 Deaths`, data=covid19_data)
options(digits=2)
summary(dat_lm1)

## 
## Call:
## lm(formula = `Total Deaths` ~ `COVID-19 Deaths`, data = covid19_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -202271    -745    -698    -258  263899 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       745.4889    33.7721    22.1   <2e-16 ***
## `COVID-19 Deaths`   5.4492     0.0192   283.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6500 on 37762 degrees of freedom
## Multiple R-squared:  0.681,  Adjusted R-squared:  0.681 
## F-statistic: 8.05e+04 on 1 and 37762 DF,  p-value: <2e-16

#build regression model using lm()
dat_lm2 <- lm(`COVID-19 Deaths`~ Sex, data=covid19_data)
options(digits=2)
summary(dat_lm2)

## 
## Call:
## lm(formula = `COVID-19 Deaths` ~ Sex, data = covid19_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -341   -231   -180   -113 105210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    341.4       14.8   23.09  < 2e-16 ***
## SexFemale     -161.5       22.0   -7.33  2.3e-13 ***
## SexMale       -133.8       21.4   -6.24  4.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1740 on 37761 degrees of freedom
## Multiple R-squared:  0.00169,    Adjusted R-squared:  0.00164 
## F-statistic:   32 on 2 and 37761 DF,  p-value: 1.25e-14

#plot the models
ggplot(dat_lm1, aes(x = `COVID-19 Deaths`, y = `Total Deaths`)) + 
  geom_point(col ="red" )+
  stat_smooth(method = "lm", col = "cornflowerblue",  caption = "Fig.8")

## Warning in stat_smooth(method = "lm", col = "cornflowerblue", caption =
## "Fig.8"): Ignoring unknown parameters: `caption`

## `geom_smooth()` using formula = 'y ~ x'

Interpretation

From the above statistical output, there is a suggestion that the p-value is smaller than 0.05, hence making the linear regression model exhibit a positive relationship between the total number of death and covid-19 Deaths. There is also the assertion from the regression that sex has p-values very small, closer to zero, hence making it statistically significant at 0.05 level.

Anova Test

The data samples explored for this research are extracted from the Center for disease control (CDC) makes available provisional data on Covid-19 Deaths by sex, age, and state. Key variables included age sex, covid-19 deaths, and Total deaths. These however form the basis for conducting statistical analysis on the data with the formed hypothesis. ANOVA test is conducted on the data to confirm or reject the set hypothesis.

Ho: there is no difference in covid-19 deaths recorded regarding respective age groups and sex

H1: there is a difference in covid-19 deaths recorded regarding respective age groups and sex

A one-way ANOVA tests

First, a separate one-way ANOVA test is conducted on age groups and sex from the dataset using the statistical software R. The model lists the independent variable being tested in the model Age group and model residuals. The second test on sex also list the independent variable and the model residuals. Also in display is the degree of freedom and sum of squares for age and sex respectively. The two separate test reports a p-value less than 0.001, making Age group and sex have a significant impact covid deaths recorded. This however conforms with prior studies that have suggested ages of patients as a core trigger when it comes to Covid-19 deaths. Older ages they suggest have a higher chance of getting infected with the virus and being fatal as well. Hence the null hypothesis (Ho) is rejected.

###One way anova 
one.way <- aov(`COVID-19 Deaths`  ~ `Age Group`, data = covid19_data)
summary(one.way)

##                Df       Sum Sq   Mean Sq F value Pr(>F)    
## `Age Group`    16   2787432112 174214507    58.9 <2e-16 ***
## Residuals   37747 111622464656   2957121                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

one.way_sex <- aov(`COVID-19 Deaths`  ~ Sex , data = covid19_data)
summary(one.way_sex)

##                Df       Sum Sq  Mean Sq F value  Pr(>F)    
## Sex             2    193801339 96900670      32 1.3e-14 ***
## Residuals   37761 114216095429  3024711                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A two-way ANOVA test (without interaction)

A further test is however conducted by modeling covid-deaths as a function of age groups and sex. It can be observed from the test results below that both age groups and sex (male and female) are statistically significant with p-values less than 0.001.

##Two-way ANOVA without interaction of two independent variables 
two.way <- aov(`COVID-19 Deaths` ~ `Age Group` + Sex, data = covid19_data)

summary(two.way)

##                Df       Sum Sq   Mean Sq F value  Pr(>F)    
## `Age Group`    16   2787432112 174214507    59.0 < 2e-16 ***
## Sex             2    185165313  92582657    31.4 2.5e-14 ***
## Residuals   37745 111437299343   2952372                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A two-way ANOVA test (Interaction)

Moreover, the study also employs another statistical model to find out whether the independent variables tested earlier have an interaction rather than the only addictive effect. The test reports that Age Group: sex to have an f-value of 3.13 and a p-value closer to zero(less than 0.001). This suggests that there are not many variations explained by the interaction between age group and sex.

##Two-way ANOVA with interaction of two independent variables 
interaction <- aov(`COVID-19 Deaths` ~ `Age Group`*Sex, data =  covid19_data)

summary(interaction)

##                    Df       Sum Sq   Mean Sq F value  Pr(>F)    
## `Age Group`        16   2787432112 174214507   59.11 < 2e-16 ***
## Sex                 2    185165313  92582657   31.41 2.3e-14 ***
## `Age Group`:Sex    32    287429510   8982172    3.05 1.6e-08 ***
## Residuals       37713 111149869833   2947256                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the above models three models, the best model is sought out for the study. This is done by using Akaike Information Criterion (AIC) which is a good test for model fit. Considering the results after running the test, the two-way model with interaction is suggested as the best fit model for this study. This model has the lowest AIC value of 674013 and AIC weight of 1, which means that it explains 100% of the total variation in the dependent variable that can be explianed by the full model

library(AICcmodavg)
###model fit for all the three models
model.set <- list(one.way, one.way_sex, two.way, interaction)
model.names <- c("one.way","one.way_sex", "two.way", "interaction")

aictab(model.set, modnames = model.names)

## 
## Model selection based on AICc:
## 
##              K   AICc Delta_AICc AICcWt Cum.Wt      LL
## interaction 52 669770          0      1      1 -334833
## two.way     20 669803         33      0      1 -334882
## one.way     18 669862         92      0      1 -334913
## one.way_sex  4 670701        932      0      1 -335347

Homoscedasticity in model fit

Further check on the model to determine whether it fits the assumptions of homoscedasticity is conducted by plotting the model diagnostic in R statistical software. Each plot gives a specific piece of information about the model fit chosen. The normal Q-Q plot plots a regression between the theoretical residuals of the model. This Q-Q plot has a little bit of deviation.

par(mfrow=c(2,2))
plot(two.way)

Time Series Analysis

Time series as a statistical technique allows for analysis of a time series data or trend with the goal of extracting meaningful statistics. For the purpose of this project time series forecasting is employed to predict values based on previous observed data, which represent variables that take on different times.

Extracting Covid-19 Data for California

From the above exploratory analysis, it can be observed that California in recent times has the top number of fatalities with regards to the deadly virus(Covid-19). A deeper look is done to predict the number of fatalities for the top four states.

library(lubridate)

## Loading required package: timechange

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:raster':
## 
##     intersect, union

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# group by date
str(newcovid19_dta)

## tibble [36,030 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Start Date                              : chr [1:36030] "01/01/2020" "01/01/2020" "01/01/2020" "01/01/2020" ...
##  $ End Date                                : chr [1:36030] "01/31/2020" "01/31/2020" "01/31/2020" "01/31/2020" ...
##  $ Month                                   : num [1:36030] 1 1 1 1 1 1 1 1 1 1 ...
##  $ State                                   : chr [1:36030] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Sex                                     : chr [1:36030] "All Sexes" "All Sexes" "All Sexes" "All Sexes" ...
##  $ Age Group                               : chr [1:36030] "Under 1 year" "18-29 years" "40-49 years" "45-54 years" ...
##  $ COVID-19 Deaths                         : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total Deaths                            : num [1:36030] 31 72 199 297 1034 ...
##  $ Pneumonia and COVID-19 Deaths           : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Pneumonia, Influenza, or COVID-19 Deaths: num [1:36030] 0 0 13 21 75 91 65 0 0 0 ...
##  $ Sex.code                                : chr [1:36030] "3" "3" "3" "3" ...
##  - attr(*, "na.action")= 'omit' Named int [1:66888] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:66888] "1" "2" "3" "4" ...

#Converting factor datatypes to character
newcovid19_dta[["State"]] <- as.character(newcovid19_dta[["State"]] )

newcovid19_dta$`Start Date`<- as.Date(newcovid19_dta$`Start Date`, format = "%m/%d/%Y")
str(newcovid19_dta)

## tibble [36,030 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Start Date                              : Date[1:36030], format: "2020-01-01" "2020-01-01" ...
##  $ End Date                                : chr [1:36030] "01/31/2020" "01/31/2020" "01/31/2020" "01/31/2020" ...
##  $ Month                                   : num [1:36030] 1 1 1 1 1 1 1 1 1 1 ...
##  $ State                                   : chr [1:36030] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Sex                                     : chr [1:36030] "All Sexes" "All Sexes" "All Sexes" "All Sexes" ...
##  $ Age Group                               : chr [1:36030] "Under 1 year" "18-29 years" "40-49 years" "45-54 years" ...
##  $ COVID-19 Deaths                         : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total Deaths                            : num [1:36030] 31 72 199 297 1034 ...
##  $ Pneumonia and COVID-19 Deaths           : num [1:36030] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Pneumonia, Influenza, or COVID-19 Deaths: num [1:36030] 0 0 13 21 75 91 65 0 0 0 ...
##  $ Sex.code                                : chr [1:36030] "3" "3" "3" "3" ...
##  - attr(*, "na.action")= 'omit' Named int [1:66888] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:66888] "1" "2" "3" "4" ...

coronaData_California <- newcovid19_dta %>%
  select(`Start Date`, State, `COVID-19 Deaths`, `Total Deaths`)  %>% 
  
  group_by(`Start Date`) %>%
  
  filter(State == "California") %>%
  
  summarise(totalConfirmed = sum(`COVID-19 Deaths`,na.rm=TRUE), totalDeaths = sum(`Total Deaths`,na.rm=TRUE))

tail(coronaData_California)

## # A tibble: 6 × 3
##   `Start Date` totalConfirmed totalDeaths
##   <date>                <dbl>       <dbl>
## 1 2022-05-01             2458       97449
## 2 2022-06-01             3792       95804
## 3 2022-07-01             6391      105223
## 4 2022-08-01             5521      103480
## 5 2022-09-01             3425       93351
## 6 2022-10-01              830       33418

The Prophet model is employed in this instance, which require columns to be named as “ds” and variable column to be named “y” to perform forecast

# get coronavirus confirmed cases data for New York state

coronaData_confirmed_California <- coronaData_California  %>%
  select(`Start Date`, totalDeaths) 

# get coronavirus deaths data for New York state
coronaData_death_california <- coronaData_California %>%
  select(`Start Date`, totalConfirmed)

# transforming data for forecasting    
names(coronaData_confirmed_California) <- c("ds", "y")
names(coronaData_death_california) <- c("ds", "y")

library(prophet)

## Loading required package: Rcpp

## Loading required package: rlang

library(lubridate)
#Predicting covid-19 confirmed Death of California State  using predict()  
preCA <- prophet(coronaData_confirmed_California)

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

futureCA <- make_future_dataframe(preCA,periods = 45)

forecastCA <- predict(preCA,futureCA)

tail(forecastCA[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])

##            ds     yhat yhat_lower yhat_upper
## 74 2022-11-10  -998930   -1019505    -979964
## 75 2022-11-11 -1017583   -1038619    -998387
## 76 2022-11-12 -1011638   -1030874    -991128
## 77 2022-11-13  -982288   -1002161    -963416
## 78 2022-11-14  -931301    -951262    -910955
## 79 2022-11-15  -860977    -880965    -841636

#broken down the prediction into trend and yearly seasonality and plot
prophet_plot_components(preCA,forecastCA)

From the above plots, it can be suggested that there was initial high trends in relation to covid-19 death resulting from covid-19 in the State of California as seen from the data. However, the trends seems to decline over the subsequent year periods(2022 and beyond).The main attributing factor for this decline been availability of proven and tested vaccines for covid and also with a higher number of population getting vaccinated.

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

colnames(covid19_data)[1] = "Date"

newcovid19_dta <- covid19_data %>%
  subset(State == "United States") %>%
 dplyr::select(Date, `COVID-19 Deaths`, `Total Deaths`) %>%
  mutate(Date = mdy(Date))


plot1 <- ggplot(newcovid19_dta, aes(x=Date, y=`COVID-19 Deaths`)) +
  geom_line()

plot2 <- ggplot(newcovid19_dta, aes(x=Date, y=`Total Deaths`)) +
  geom_line()
grid.arrange(plot1, plot2, nrow = 2)

From the above graph it can be suggested that Covid-19 death since its presence in united states has had an up trends in 2020 and 2021, However tends to have a declining trend in 2022.

Modelling

newcovid19_dta_cts <- ts(
  data = newcovid19_dta$`COVID-19 Deaths`,
  start = min(newcovid19_dta$Date),
  frequency = 7
)

plot(newcovid19_dta_cts, main = "COVID-19 Deaths")

newcovid19_dta_cts <- ts(
  data = newcovid19_dta$`Total Deaths`,
  start = min(newcovid19_dta$Date),
  frequency = 7
)

plot(newcovid19_dta_cts, main = "Total Deaths")

From the above plots, it can be observed that our data model flatuates and has up and down trends. The model assumes that the data does not really have an increasing trend.

Cross Validation

In this step the data is been split into train and test data. A closer look of the data is observed from the last 7 days as train data and the rest as test data.

test <- tail(newcovid19_dta_cts, 7)
train <- head(newcovid19_dta_cts, length(newcovid19_dta_cts) - length(test))

plot(train)

Time Series Modeling

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

train_ets <- ets(y = train, model = "ZZZ")
train_ets

## ETS(M,N,N) 
## 
## Call:
##  ets(y = train, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.7866 
## 
##   Initial states:
##     l = 130174.0713 
## 
##   sigma:  1.3
## 
##   AIC  AICc   BIC 
## 45237 45237 45253

From the above auto modelling it can be observed that our model has no seasonal pattern in trends.

checkresiduals(train_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 561, df = 8, p-value <2e-16
## 
## Model df: 2.   Total lags used: 10

plot(train_ets)

Forecasting

train_fore <- forecast(train_ets, 7)
train_fore

##      Point Forecast  Lo 80 Hi 80  Lo 95 Hi 95
## 1728            967   -595  2530  -1423  3358
## 1729            967  -1553  3488  -2888  4823
## 1730            967  -2789  4724  -4778  6713
## 1731            967  -4464  6399  -7339  9274
## 1732            967  -6779  8714 -10879 12814
## 1733            967 -10010 11944 -15820 17755
## 1734            967 -14539 16474 -22748 24683

plot(train_fore)

accuracy(train_fore, test)

##                ME  RMSE   MAE   MPE MAPE MASE  ACF1
## Training set  -95 55889 23591 -1259 1312 1.04 -0.32
## Test set     7831 10369  7831    76   76 0.35    NA

The model as suggested above has a mean absolute percentage error of 76% and MASE OF 0.35%

Spatial Analysis

Spatial Analysis is the process of examining the locations, attributes and relationships of a spatial data through overlay whiles employing other analytical techniques to address or gain useful knowledge. Thus, this involves extracting or creating new information about a set of geographic features to perform routine examination, assessment, evaluation, analysis or modeling of data in a geographic area based on pre-established and computerized criteria and standards.

##subset dataset
## data wrangling
newdata<- new_data  %>%
filter(quantity_last_updated > "2021-08-01" & quantity_last_updated < "2022-08-01")
#glimpse(newdata)

REMOVAL OF MISSING INCIDENTS The dataset is prepossessed to remove all incidents without any latitude and longitude

####Remove all incidents with no Lat/Long
new_data <- newdata[!is.na(newdata$latitude),]

Getting a map of United States

The easiest way in getting this is to download the united states map using the map_data command from the maps package, which is plotted using the geom_polygon within ggplot.

library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:viridis':
## 
##     unemp

## The following object is masked from 'package:plyr':
## 
##     ozone

USA_States <- map_data("state") ## downloading usa map

##ploting the data using ggplot
ggplot() + geom_polygon(data= USA_States, aes(x=long, y= lat, group = group), fill= NA, colour= "black") +
  coord_fixed(1.3)

Visualising the Cleaned Dataset

The cleaned dataset has longitude and latitude coordinates whic can sometimes be called point data. Below is a display of these coordinates

library(ggspatial)
library(usmap)
 ggplot() + geom_polygon(data = USA_States, aes(x=long, y= lat, group = group), fill = NA, colour = "black") +
   coord_fixed(1.3) +
   geom_point( data= new_data, aes(x = longitude, y = latitude)) + labs(title = " The Locations of Covid-19 vaccination coordinates in the United States", x= "Longitude", y= "Latitude", caption="Fig.9")

Below is also a view of the coordinates of the cleaned data on the world map

world_map <- map_data("world")
##visualizing our data
ggplot() + geom_polygon(data= world_map, aes(x=long, y= lat, group = group),fill= NA, colour= "black") +
  coord_fixed(1.3) +
  geom_point(data= new_data, aes(x = longitude, y = latitude), colour = "red")+
  labs(title = " The Locations of Covid-19 vaccination in the United States on World map", x= "Longitude", y= "Latitude", caption="Fig.10")

Making changes to the map view of the coordinates used in the dataset which gives a nice clean background of our generated map.

ggplot() + geom_polygon(data = USA_States, aes(x=long, y= lat, group = group), fill = NA, colour = "black") +
   coord_fixed(1.3) +
   geom_point( data= new_data, aes(x = longitude, y = latitude, colour= "red"), alpha= 1/15 ) +
  theme_minimal() + theme(legend.position = "none") +
  labs(title = " The Locations of Covid-19 vaccination in the United States", x= "Longitude", y= "Latitude",caption="Fig.11")

Covid-19 vaccination locations-Indiana/USA

Further advanced map can be generated using the map view to illustrate the coordinates in the vaccination data for the state of indiana and USA. Below illustrates the visual mapview on these coordinates.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.7     ✔ stringr 1.4.1
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ purrr   0.3.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::%@%()             masks rlang::%@%()
## ✖ dplyr::arrange()         masks plyr::arrange()
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ purrr::as_function()     masks rlang::as_function()
## ✖ gridExtra::combine()     masks dplyr::combine()
## ✖ purrr::compact()         masks plyr::compact()
## ✖ dplyr::count()           masks plyr::count()
## ✖ lubridate::date()        masks base::date()
## ✖ dplyr::desc()            masks plyr::desc()
## ✖ tidyr::extract()         masks raster::extract()
## ✖ dplyr::failwith()        masks plyr::failwith()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ purrr::flatten()         masks rlang::flatten()
## ✖ purrr::flatten_chr()     masks rlang::flatten_chr()
## ✖ purrr::flatten_dbl()     masks rlang::flatten_dbl()
## ✖ purrr::flatten_int()     masks rlang::flatten_int()
## ✖ purrr::flatten_lgl()     masks rlang::flatten_lgl()
## ✖ purrr::flatten_raw()     masks rlang::flatten_raw()
## ✖ dplyr::id()              masks plyr::id()
## ✖ lubridate::intersect()   masks raster::intersect(), base::intersect()
## ✖ purrr::invoke()          masks rlang::invoke()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ purrr::map()             masks maps::map()
## ✖ dplyr::mutate()          masks plyr::mutate()
## ✖ dplyr::rename()          masks plyr::rename()
## ✖ dplyr::select()          masks raster::select()
## ✖ lubridate::setdiff()     masks base::setdiff()
## ✖ purrr::splice()          masks rlang::splice()
## ✖ dplyr::summarise()       masks plyr::summarise()
## ✖ dplyr::summarize()       masks plyr::summarize()
## ✖ lubridate::union()       masks raster::union(), base::union()

library(sf)

## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE

library(mapview)
 ###Map data for Indiana
VaccineIN <- new_data  %>% 
  filter(loc_admin_state == "IN")
#VaccineNC %>% 
 # glimpse()

## mapview for Indiana with coordinates from the clean data
mapview(VaccineIN, xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE)

 ###map view for United States with Coordinates
mapview(new_data, xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE)

Conclusion

The findings in this project suggest that age groups attribute to different count when it comes to Covid-19 mortality in the United States. Higher mortality numbers in the early days of the pandemic were among older adults who exhibit vulnerable risk factors which convey greater risk of Covid-19. There is climax in covid-19 death as seen from the project during cold seasons, from December through January and February. The findings in this project conforms with other findings that suggest that age and sex(female and male) relates more to covid-19 infections as it proves to be statistically significant. The inception of covid-19 vaccination in the United States have more disperse locations through out all states, aimed towards curbing the spread. Visual plots as projected in this study suggest a significant decline in covid-19 mortalities in fore coming year(2023) and beyond. This project utilize findings from LR, ANOVA, Time series and Spatial Analysis which add to previous research and analysis in the area of Covid-19 in USA. As such it be suggested that there is more room for further analysis on the CDC provisional and vaccination data to drive more insight. From the above, in overall it can suggested that continuous and effective vaccination efforts across all suburbs and cities in United States is encouraged to slow and possibly eliminate Covid-19.

Reference

National Center for Health Statistics. Provisional COVID-19 Death Counts by Sex, Age, and State | Data | Centers for Disease Control and Prevention [Internet]. Available from: https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Sex-Age-and-S/9bhg-hcku

Center for Disease Control and Prevention. Vaccines.gov: COVID-19 vaccinating provider locations [Internet]. Available from: https://data.cdc.gov/Vaccinations/Vaccines-gov-COVID-19-vaccinating-provider-locatio/5jp2-pgaw

Wang B, Li R, Lu Z, Huang Y. Does comorbidity increase the risk of patients with COVID-19: evidence from meta-analysis. Aging (Albany NY). 2020;12(7):6049-6057. doi:10.18632/aging.103000

Grant MC, Geoghegan L, Arbyn M, et al. The prevalence of symptoms in 24,410 adults infected by the novel coronavirus (SARS-CoV-2; COVID-19): A systematic review and meta-analysis of 148 studies from 9 countries. PLoS One. 2020;15(6):e0234765. Published 2020 Jun 23. doi:10.1371/journal.pone.0234765

Garg S. Hospitalization Rates and Characteristics of Patients Hospitalized with Laboratory-Confirmed Coronavirus Disease 2019 — COVID-NET, 14 States, March 1–30, 2020. MMWR Morb Mortal Wkly Rep 2020; 69. Available at: https://www.cdc.gov/mmwr/volumes/69/wr/mm6915e3.htm.

Appendix

Setting work directory setwd(“C:/Users/owner/Desktop/DATA ANALYSIS FOR SOCIAL SCIENCE”)

Covid-19 Statistical Analysis, Evidence from USA

loading required libraries

library(maps)

library(ggmap)

library(tidyverse)

library(sf)

library(mapview)

library(naniar)

library(dplyr)

library(raster)

library(broom)

library(AICcmodavg)

library(hrbrthemes)

library(prophet)

library(gridExtra)

library(lubridate)

library(gganimate)

library(ggplot2)

library(viridis)

Loading Data files

library(readr)

Provisional_COVID_19_Deaths_by_Sex_and_Age_4_ <- read_csv(“Provisional_COVID-19_Deaths_by_Sex_and_Age (4).csv”)

View(Provisional_COVID_19_Deaths_by_Sex_and_Age_4_)

covid19_data <- Provisional_COVID_19_Deaths_by_Sex_and_Age_4_

dim(covid19_data)

names(covid19_data)

library(readr)

Vaccines_gov_COVID_19_vaccinating_provider_locations <-read_csv(“Vaccines.gov__COVID-19_vaccinating_provider_locations.csv”)

View(Vaccines_gov_COVID_19_vaccinating_provider_locations)

vaccination_data <- Vaccines_gov_COVID_19_vaccinating_provider_locations

dim(vaccination_data)

names(vaccination_data)

Removing of missing incidents

new_data <- vaccination_data[!is.na(vaccination_data$latitude),]

summary(new_data)

creating object file

library(raster)

wgs84 = CRS(“+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs”)

vaccination_spt= SpatialPointsDataFrame(vaccination_data[,c(‘longitude’,‘latitude’)], vaccination_data, proj4string = wgs84)

plot(vaccination_spt)

Descriptive Statistics

library(naniar)

summary(covid19_data)

gg_miss_upset(covid19_data)

sum(is.na(covid19_data))

summary(vaccination_data)

gg_miss_upset(vaccination_data)

sum(is.na(vaccination_data))

#########Data Cleaning#####

#########cleaning dataset of covid-19 data

COVID_19_new <- subset(covid19_data, select = -c(Pneumonia Deaths,Influenza Deaths,Group,Year,Footnote,Data As Of) )

covid.data<-COVID_19_new

covid19_data<-na.omit(covid.data)

names(covid19_data)

head(covid19_data)

dim(covid19_data)

#######convert the column to a factor

library(plyr)

covid19_data$Sex.code <- revalue(covid19_data$Sex, c(“Male”=“1”, “Female”=“2”, “All Sexes”=“3”))

covid19_data$Sex.code <- mapvalues(covid19_data$Sex, from = c(“Male”, “Female”, “All Sexes”), to = c(“1”, “2”, “3”))

covid19_data

names(covid19_data)

checking uniqueness in dataset

unique(covid19_data[c(‘State’)])

colSums(is.na(covid19_data))

covid19_data[duplicated(covid19_data),]

#######variables description ##

sex

table(covid19_data$Sex)

barplot(table(covid19_data$Sex))

age group

table(covid19_data$Age Group)

barplot(table(covid19_data$Age Group))

########Exploratory Data Analysis#####

#######Sex and Covid-19 deaths

ggplot(covid19_data, aes(x=COVID-19 Deaths, y= Age Group , fill=Sex, stat = “identity”)) + geom_boxplot()

Age groups and Covid-19 deaths

ggplot(covid19_data, aes(x=Sex, y= Total Deaths, fill=Age Group, stat = “identity”)) + geom_boxplot()

months = c(“January”, “February”, “March”, “April”, “May”, “June”, “July”, “August”, “September”, “October”, “November”,“December”)

ggplot(covid19_data, aes(x= Month, y= Total Deaths)) + geom_bar(stat=“identity”) + theme_classic() + labs(title = “Total Deaths confirmed By Month”, x= “MONTH”,caption = “Fig.3”, y= “Total Deaths confirmed”, fill= months) + scale_x_discrete(limits = months)+ theme(plot.title = element_text(hjust = 0.5))

#########Covid death by month across States

months = c(“January”, “February”, “March”, “April”, “May”, “June”, “July”, “August”, “September”, “October”, “November”, “December”)

p2 <- ggplot(covid19_data[!duplicated(covid19_data),], aes(fill=State, y=log(COVID-19 Deaths), x= Month)) + geom_bar( stat=“identity”) + facet_wrap(~ State) + labs(title = “Monthly deaths per State”, caption = “Fig.4”) + xlab(“Month”) + ylab(“Deaths”)+ scale_x_discrete(limits = months) p2 <- p2 + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = “none”)

options(scipen=5)

#########covid 19 death by month months = c(“January”, “February”, “March”, “April”, “May”, “June”, “July”, “August”, “September”, “October”, “November”, “December”)

plot data

ggplot(data = covid19_data, aes(x = Month, y = COVID-19 Deaths)) + geom_bar(stat=‘identity’) + scale_x_discrete(limits = months) + labs(title = “COVID-19 Deaths by Month”, caption = “Fig.5”, x = “Month”, y = “COVID-19 Deaths cases”, size = 5,fill= months)

#########Top 10 Covid-19 Death by State###

#########group data

#########install.packages(“dplyr”)

library(dplyr)

newcovid19_dta <- subset(newcovid19_dta, State != “United States”) names(newcovid19_dta) top_State_by_total_cases =newcovid19_dta %>% group_by(State)%>% summarize(total_cases = max(COVID-19 Deaths)) %>% top_n(10, total_cases)

see result

top_State_by_total_cases

ggplot(top_State_by_total_cases, aes(reorder(State, total_cases),total_cases, group= State, fill=State)) + geom_bar(stat=‘identity’) + ylab(“Cumulative Death”) + coord_flip() + labs(title = “Cummulative Death for Top 10 States”, caption = “Fig.6”)

#########Bottom 10 Covid-19 Death by State ###

#########group data

#########install.packages(“dplyr”)

library(dplyr)

newcovid19_dta <- subset(newcovid19_dta, State != “United States”) names(newcovid19_dta) Bottom_State_by_total_cases =newcovid19_dta %>% group_by(State)%>% summarize(total_cases = max(COVID-19 Deaths)) %>% slice_min(total_cases, n=10)

see result

Bottom_State_by_total_cases

ggplot(Bottom_State_by_total_cases, aes(reorder(State, total_cases),total_cases, group= State, fill=State)) + geom_bar(stat=‘identity’) + coord_flip() + ylab(“Cumulative Death”) + labs(subtitle = “Cummulative Death for 10 Bottom States”, caption = “Fig.7”)

#########Analysis Methods

#########1.Linear Regression Model

#########2.Anova Test

#########3.Time Series

#########4.Spatial Analysis

#########Linear Regression Model####

#########build regression model using lm()

dat_lm1 <- lm(Total Deaths~COVID-19 Deaths, data=covid19_data)

options(digits=2)

summary(dat_lm1)

#########build regression model using lm()

dat_lm2 <- lm(COVID-19 Deaths~ Sex, data=covid19_data)

options(digits=2)

summary(dat_lm2)

#########plot the models

ggplot(dat_lm1, aes(x = COVID-19 Deaths, y = Total Deaths)) + geom_point(col =“red” )+ stat_smooth(method = “lm”, col = “cornflowerblue”, caption = “Fig.8”)

#########Anova Test######

#########One way anova

one.way <- aov(COVID-19 Deaths ~ Age Group, data = covid19_data)

summary(one.way)

one.way_sex <- aov(COVID-19 Deaths ~ Sex , data = covid19_data)

summary(one.way_sex)

#########Two-way ANOVA without interaction of two independent variables two.way <- aov(COVID-19 Deaths ~ Age Group + Sex, data = covid19_data)

summary(two.way)

#########Two-way ANOVA with interaction of two independent variables

interaction <- aov(COVID-19 Deaths ~ Age Group*Sex, data = covid19_data)

summary(interaction)

library(AICcmodavg)

#########model fit for all the three models

model.set <- list(one.way, two.way, interaction)

model.names <- c(“one.way”,“one.way_sex”, “two.way”, “interaction”)

aictab(model.set, modnames = model.names)

#########Homoscedasticity in model fit ####

par(mfrow=c(2,2))

plot(two.way)

#########Time Series Analysis###

#########Extracting Covid-19 Data for California

#########From the above exploratory analysis, it can be observed that California in recent times has the top number of

#########fatalities with regards to the deadly virus(Covid-19).A deeper look is done to predict the number of fatalities for #########the top four states.

library(lubridate)

group by date

str(newcovid19_dta)

#########Converting factor datatypes to character

newcovid19_dta[[“State”]] <- as.character(newcovid19_dta[[“State”]] )

newcovid19_dta$`Start Date`<- as.Date(newcovid19_dta$Start Date, format = “%m/%d/%Y”)

str(newcovid19_dta)

coronaData_California <- newcovid19_dta %>% select(Start Date, State, COVID-19 Deaths, Total Deaths) %>%

group_by(Start Date) %>%

filter(State == “California”) %>%

summarise(totalConfirmed = sum(COVID-19 Deaths,na.rm=TRUE), totalDeaths = sum(Total Deaths,na.rm=TRUE))

tail(coronaData_California)

get coronavirus confirmed cases data for New York state

coronaData_confirmed_California <- coronaData_California %>% select(Start Date, totalDeaths)

get coronavirus deaths data for New York state

coronaData_death_california <- coronaData_California %>% select(Start Date, totalConfirmed)

transforming data for forecasting

names(coronaData_confirmed_California) <- c(“ds”, “y”)

names(coronaData_death_california) <- c(“ds”, “y”)

#########Predicting covid-19 confirmed Death of California State using predict()

preCA <- prophet(coronaData_confirmed_California)

futureCA <- make_future_dataframe(preCA,periods = 45)

forecastCA <- predict(preCA,futureCA)

tail(forecastCA[c(‘ds’, ‘yhat’, ‘yhat_lower’, ‘yhat_upper’)])

#########broken down the prediction into trend and yearly seasonality and plot

prophet_plot_components(preCA,forecastCA)

library(gridExtra)

colnames(covid19_data)[1] = “Date”

newcovid19_dta <- covid19_data %>% subset(State == “United States”) %>% dplyr::select(Date, COVID-19 Deaths, Total Deaths) %>% mutate(Date = mdy(Date))

plot1 <- ggplot(newcovid19_dta, aes(x=Date, y=COVID-19 Deaths)) + geom_line()

plot2 <- ggplot(newcovid19_dta, aes(x=Date, y=Total Deaths)) + geom_line()

grid.arrange(plot1, plot2, nrow = 2)

#########Modelling####

#########COVID-19 Deaths

newcovid19_dta_cts <- ts( data = newcovid19_dta$`COVID-19 Deaths`, start = min(newcovid19_dta$Date), frequency = 7)

plot(newcovid19_dta_cts, main = “COVID-19 Deaths”)

#########Total Deaths

newcovid19_dta_cts <- ts( data = newcovid19_dta$`Total Deaths`, start = min(newcovid19_dta$Date), frequency = 7 )

plot(newcovid19_dta_cts, main = “Total Deaths”)

test <- tail(newcovid19_dta_cts, 7)

train <- head(newcovid19_dta_cts, length(newcovid19_dta_cts) - length(test))

plot(train)

library(forecast)

train_ets <- ets(y = train, model = “ZZZ”)

train_ets

checkresiduals(train_ets)

plot(train_ets)

#########Forecasting

train_fore <- forecast(train_ets, 7)

train_fore

plot(train_fore)

#########accuracy

accuracy(train_fore, test)

#########Spatial Analysis

#########subset dataset

data wrangling

newdata<- new_data %>% filter(quantity_last_updated > “2021-08-01” & quantity_last_updated < “2022-08-01”)

#########glimpse(newdata)

#########Remove all incidents with no Lat/Long

new_data <- newdata[!is.na(newdata$latitude),]

#########Getting a map of United States

library(maps)

USA_States <- map_data(“state”) ## downloading usa map

#########ploting the data using ggplot

ggplot() + geom_polygon(data= USA_States, aes(x=long, y= lat, group = group), fill= NA, colour= “black”) + coord_fixed(1.3)

#########Visualising the Cleaned Dataset

library(ggspatial)

library(usmap)

ggplot() + geom_polygon(data = USA_States, aes(x=long, y= lat, group = group), fill = NA, colour = “black”) + coord_fixed(1.3) + geom_point( data= new_data, aes(x = longitude, y = latitude)) + labs(title = ” The Locations of Covid-19 vaccination coordinates in the United States”, x= “Longitude”, y= “Latitude”, caption=“Fig.9”)

#########Below is also a view of the coordinates of the cleaned data on the world map

world_map <- map_data(“world”)

#########visualizing our data ggplot() + geom_polygon(data= world_map, aes(x=long, y= lat, group = group),fill= NA, colour= “black”) + coord_fixed(1.3) + geom_point(data= new_data, aes(x = longitude, y = latitude), colour = “red”)+ labs(title = ” The Locations of Covid-19 vaccination in the United States on World map”, x= “Longitude”, y= “Latitude”, caption=“Fig.10”)

#########clear mapview of coordinates

ggplot() + geom_polygon(data = USA_States, aes(x=long, y= lat, group = group), fill = NA, colour = “black”) + coord_fixed(1.3) + geom_point( data= new_data, aes(x = longitude, y = latitude, colour= “red”), alpha= 1/15 ) + theme_minimal() + theme(legend.position = “none”) + labs(title = ” The Locations of Covid-19 vaccination in the United States”, x= “Longitude”, y= “Latitude”,caption=“Fig.11”)

Covid-19 vaccination locations-Indiana/USA

#########Map data for Indiana

VaccineIN <- new_data %>% filter(loc_admin_state == “IN”) VaccineNC %>% glimpse()

mapview for Indiana with coordinates from the clean data

mapview(VaccineIN, xcol = “longitude”, ycol = “latitude”, crs = 4269, grid = FALSE)

#########map view for United States with Coordinates

mapview(new_data, xcol = “longitude”, ycol = “latitude”, crs = 4269, grid = FALSE)

Covid-19 Statistical Analysis, Evidence from USA

CHARLES WIREDU

12-10-2022

Introduction

Data Collection

Descriptive Statistics

Data Cleaning

Variable Definition

Exploratory Data Analysis

Top 10 Covid-19 Death by State

Bottom 10 Covid-19 Death by State

Analysis Methods

Linear Regression Model

Interpretation

Anova Test

Time Series Analysis

Extracting Covid-19 Data for California

Modelling

Cross Validation

Time Series Modeling

Forecasting

Spatial Analysis

Getting a map of United States

Visualising the Cleaned Dataset

Covid-19 vaccination locations-Indiana/USA

Conclusion

Reference

Appendix