COVID-19 has immensely affected lives for every people around the globe. It has affected more than 30 million people considering to become a global pandemic. As of September 2020, more than 6 million people were tested positive alone in the United States.
This report attempts to gather data from John Hopkins University [1] followed with necessary data preprocessing preparing further for analysis and modelling. This data set is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. The report is divided into different sections.It begins with data collection and initial cleaning to create a tidy structure followed by understanding the data types and variables. Two different datasets were merged into one and structured so that each row represented observation and columns represented variables converting necessary variable into appropriate data types. Subsequently, a new factor variable is created that represents number of confirmed cases into different bins from high to low and also creating another variable that explains mortality rate of the United States. Then, missing values and outliers in the data are identified and handled according to the assumption based on confirmed cases in the counties. Finally, one of the numerical variables is transformed which was skewed to the right to achieve normality.
The dataset contains information related to coronavirus cases and deaths recorded in the United States. The data is published by John Hopkins University [1] and is made publicly available. Data consists of following variables and their description.
library(tidyverse)
library(lubridate)
library(MVN)
us_confirmed <- read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv')
head(us_confirmed)
us_deaths <- read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv')
head(us_deaths)
Gazing at the initial view of the data, it does not follow the tidy principles therefore necessary steps are taken to structure the dataset. Also, as there are two different data source for confirmed cases and deaths, the dataset is merged for the ease of analysis. Undesirable columns such as UID,iso2,iso3,code3 and removed from the resulting data frame.
# Tidy the dataset
us_confirmed_df <- us_confirmed %>% gather(`1/22/20`,`9/28/20`,key = 'Date',value = "Confirmed Cases")
columns = c("FIPS","Admin2","Province_State","Country_Region","Lat","Long_","Combined_Key","Date","Confirmed Cases")
us_confirmed_df <- us_confirmed_df %>% select(columns)
us_confirmed_df
us_deaths_df <- us_deaths %>% gather(`1/22/20`,`9/28/20`,key = 'Date',value = "Deaths")
columns = c("FIPS","Admin2","Province_State","Country_Region","Lat","Long_","Combined_Key","Population","Date","Deaths")
us_deaths_df <- us_deaths_df %>% select(columns)
head(us_deaths_df)
#Merge two dataframes
us_covid <- us_confirmed_df %>% left_join(us_deaths_df)
## Joining, by = c("FIPS", "Admin2", "Province_State", "Country_Region", "Lat", "Long_", "Combined_Key", "Date")
head(us_covid)
After successfully reshaping and merging the dataset, basic summary statistics is calculated. The data frame consists of 6680 observations with 11 different variables representing the case number as per the date. Taking a glimpse at the data and their types, variables with inappropriate data types are converted into their correct formats. For example, variables like FIPS is converted into a character as it explains the unique code but not a number and date variable is parsed into its appropriate format representing a month, day and year.
While the case numbers are a cumulative sum, the latest date i.e September 28, 2020, is taken to avoid data inconsistencies and errors. The resulting data frame now consists of county-level information with the total number of confirmed cases and deaths. Based on the range of the number of confirmed cases, a new factor variable is introduced which bins them into categories such as Low cases, Medium Cases, High Cases and Very High Cases.
glimpse(us_covid)
## Observations: 6,680
## Variables: 11
## $ FIPS <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 10...
## $ Admin2 <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Bl...
## $ Province_State <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ Country_Region <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_ <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", ...
## $ `Confirmed Cases` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Population <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 1...
## $ Deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#Convert Province_State to factors, Date to date object
us_covid$FIPS <- as.character(us_covid$FIPS)
us_covid$Date <- as.Date(parse_date_time(us_covid$Date,"mdy"))
glimpse(us_covid)
## Observations: 6,680
## Variables: 11
## $ FIPS <chr> "1001", "1003", "1005", "1007", "1009", "101...
## $ Admin2 <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Bl...
## $ Province_State <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ Country_Region <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_ <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01...
## $ `Confirmed Cases` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Population <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 1...
## $ Deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#Filter the latest date
latest_date <- max(us_covid$Date,na.rm=TRUE)
data <- us_covid %>% filter(Date == latest_date)
#Create bins for Confirmed Cases
case_bins <- cut(data$`Confirmed Cases`, breaks=c(0,10000,50000,100000,max(data$`Confirmed Cases`)), labels=c("Low Cases","Medium Cases","High Cases","Very High Cases"),
include.lowest=T)
data$Case_Category <- case_bins
glimpse(data)
## Observations: 3,340
## Variables: 12
## $ FIPS <chr> "1001", "1003", "1005", "1007", "1009", "101...
## $ Admin2 <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Bl...
## $ Province_State <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ Country_Region <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_ <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date <date> 2020-09-28, 2020-09-28, 2020-09-28, 2020-09...
## $ `Confirmed Cases` <dbl> 1785, 5588, 886, 657, 1618, 607, 914, 3548, ...
## $ Population <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 1...
## $ Deaths <dbl> 25, 50, 7, 10, 15, 14, 39, 44, 42, 13, 30, 1...
## $ Case_Category <fct> Low Cases, Low Cases, Low Cases, Low Cases, ...
A quick statistical summary to identify whether the dataset contains missing values reveals that there are few missing values in columns FIPS, Admin2, Population and Deaths.
#Dealing missing values
colSums(is.na(data))
## FIPS Admin2 Province_State Country_Region
## 10 6 0 0
## Lat Long_ Combined_Key Date
## 0 0 0 0
## Confirmed Cases Population Deaths Case_Category
## 0 305 305 0
While looking carefully at the missing data for the county, values were reported from the islands or territories that belong to the United States which do not represent counties. According to the US Geological Survey website[2], the missing county names were the same as the state name and hence, same values were imputed. For missing population and deaths, the average number is used to impute the population whereas, for deaths, a condition is applied such that if there are no confirmed cases, there are no deaths and if there are cases recorded, the average value of death is imputed to such missing values. Finally, missing values for FIPS are coded as constant as it does not affect our analysis.
After treating the missing values, a quick sanity check is performed across all columns to see whether there is any special or infinite range of values.
#Code missing county state as same as the state name to not loose data
data$Admin2 <- ifelse(is.na(data$Admin2),data$Province_State,data$Admin2)
data$Province_State <- as.factor(data$Province_State)
data$Admin2 <- as.factor(data$Admin2)
#Mean imputation for population, deaths and FIPS
data$Population[is.na(data$Population)] <- mean(data$Population, na.rm=T)
data$Deaths[is.na(data$Deaths)] <- ifelse(data$`Confirmed Cases` =='0',0,mean(data$Deaths,na.rm=T))
data$Deaths <- ifelse(data$`Confirmed Cases`==0,0,data$Deaths)
data$FIPS[is.na(data$FIPS)] <- '0000'
colSums(is.na(data))
## FIPS Admin2 Province_State Country_Region
## 0 0 0 0
## Lat Long_ Combined_Key Date
## 0 0 0 0
## Confirmed Cases Population Deaths Case_Category
## 0 0 0 0
sum_special <- function(x){
sum((is.infinite(x)|is.nan(x)))
}
sapply(data,sum_special)
## FIPS Admin2 Province_State Country_Region
## 0 0 0 0
## Lat Long_ Combined_Key Date
## 0 0 0 0
## Confirmed Cases Population Deaths Case_Category
## 0 0 0 0
A new variable ‘mortality rate’ is created as the ratio of deaths per confirmed cases as a percentage. Some observations contained missing values which sare replaced as zero with the assumption that if there were no confirmed cases and no deaths for a particular county, the mortality rate is also zero.
#Mutate mortality rate
data <- data %>% mutate("Mortality Rate" = (Deaths/`Confirmed Cases`)*100)
data$`Mortality Rate`[is.na(data$`Mortality Rate`)] <- 0
sapply(data,sum_special)
## FIPS Admin2 Province_State Country_Region
## 0 0 0 0
## Lat Long_ Combined_Key Date
## 0 0 0 0
## Confirmed Cases Population Deaths Case_Category
## 0 0 0 0
## Mortality Rate
## 0
glimpse(data)
## Observations: 3,340
## Variables: 13
## $ FIPS <chr> "1001", "1003", "1005", "1007", "1009", "101...
## $ Admin2 <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bul...
## $ Province_State <fct> Alabama, Alabama, Alabama, Alabama, Alabama,...
## $ Country_Region <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_ <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date <date> 2020-09-28, 2020-09-28, 2020-09-28, 2020-09...
## $ `Confirmed Cases` <dbl> 1785, 5588, 886, 657, 1618, 607, 914, 3548, ...
## $ Population <dbl> 55869.0, 223234.0, 24686.0, 22394.0, 57826.0...
## $ Deaths <dbl> 25.00000, 50.00000, 7.00000, 10.00000, 15.00...
## $ Case_Category <fct> Low Cases, Low Cases, Low Cases, Low Cases, ...
## $ `Mortality Rate` <dbl> 1.4005602, 0.8947745, 0.7900677, 1.5220700, ...
Outliers are the most extreme points in the observation that represents abnormality. However, some values can be useful and therefore represent the correct information. After looking through the summary, boxplots and Chi-square QQ-plot for investigating the outliers in the numerical columns, there are more than 50% values detected as Outliers. While examining further, it is evident that these values represent the confirmed cases, deaths, population and the mortality rate of the counties in the United States due to the coronavirus pandemic. The variability on this dataset explains most of the people in areas such as New York, King, Los Angeles counties are immensely affected by the virus although some counties had quite less to no recorded cases. The detected anomalies are kept for this analysis.
#Scan outliers
#Confirmed, Population, Deaths, Mortality Rate
numeric_data <- data %>% select(`Confirmed Cases`,Deaths,Population,`Mortality Rate`)
numeric_data %>% summary()
## Confirmed Cases Deaths Population Mortality Rate
## Min. : 0 Min. : 0.00 Min. : 0 Min. : 0.000
## 1st Qu.: 118 1st Qu.: 1.00 1st Qu.: 11929 1st Qu.: 0.575
## Median : 387 Median : 8.00 Median : 29918 Median : 1.629
## Mean : 2141 Mean : 57.51 Mean : 93391 Mean : 15.945
## 3rd Qu.: 1200 3rd Qu.: 42.00 3rd Qu.: 93391 3rd Qu.: 3.311
## Max. :268455 Max. :7250.00 Max. :5150233 Max. :5801.549
par(mar = c(4,4,.7,.7))
boxplot(numeric_data$`Confirmed Cases`,xlab = "Confirmed Cases")
boxplot(numeric_data$Deaths,xlab = "Deaths")
boxplot(numeric_data$Population,xlab = "Population")
boxplot(numeric_data$`Mortality Rate`,xlab = "Mortality Rate")
results <- mvn(data = numeric_data, multivariateOutlierMethod = "quan", showOutliers = TRUE)
While inspecting further on confirmed cases, the distribution is right-skewed with most of the values lie below the median. Hence, to achieve normality, a logarithmic transformation is applied to this column.
par(mar = c(4,4,.7,.7))
hist(data$`Confirmed Cases`,col="cyan",main="Histogram of Confirmed Cases",xlab = "Confirmed Cases")
log_cases <- log10(data$`Confirmed Cases`)
hist(log_cases,col = "light blue",main = "Histogtram of Logarithmic Confirmed Cases",xlab="Log Confirmed Cases")