MATH 2349 Data Wrangling Assignment 2

COVID-19 Impact in the United States

Executive Summary

COVID-19 has immensely affected lives for every people around the globe. It has affected more than 30 million people considering to become a global pandemic. As of September 2020, more than 6 million people were tested positive alone in the United States.

This report attempts to gather data from John Hopkins University [1] followed with necessary data preprocessing preparing further for analysis and modelling. This data set is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. The report is divided into different sections.It begins with data collection and initial cleaning to create a tidy structure followed by understanding the data types and variables. Two different datasets were merged into one and structured so that each row represented observation and columns represented variables converting necessary variable into appropriate data types. Subsequently, a new factor variable is created that represents number of confirmed cases into different bins from high to low and also creating another variable that explains mortality rate of the United States. Then, missing values and outliers in the data are identified and handled according to the assumption based on confirmed cases in the counties. Finally, one of the numerical variables is transformed which was skewed to the right to achieve normality.

Data

The dataset contains information related to coronavirus cases and deaths recorded in the United States. The data is published by John Hopkins University [1] and is made publicly available. Data consists of following variables and their description.

UID: Unique Identifier for each row entry.
iso2: Officialy assigned country code identifiers
iso3: Officialy assigned country code identifiers
code3: Unique code representing united state from other countries
FIPS: Federal Information Processing Standards code that uniquely identifies counties within the USA.
Admin2: County name. US only.
Province_State:Province, state or dependency name.
Country_Region: Country, region or sovereignty name.
Last Update: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).
Lat and Long_: Latitude and Longitudes
Combined Key: County name, State name and country
Population: Population for each county
Confirmed: Total number of confirmed cases
Deaths: Total number of deaths

library(tidyverse)
library(lubridate)
library(MVN)

us_confirmed <- read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv')
head(us_confirmed)

us_deaths <- read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv')
head(us_deaths)

Tidy and Manipulate I

Gazing at the initial view of the data, it does not follow the tidy principles therefore necessary steps are taken to structure the dataset. Also, as there are two different data source for confirmed cases and deaths, the dataset is merged for the ease of analysis. Undesirable columns such as UID,iso2,iso3,code3 and removed from the resulting data frame.

# Tidy the dataset

us_confirmed_df <- us_confirmed %>% gather(`1/22/20`,`9/28/20`,key = 'Date',value = "Confirmed Cases")
columns = c("FIPS","Admin2","Province_State","Country_Region","Lat","Long_","Combined_Key","Date","Confirmed Cases")
us_confirmed_df <- us_confirmed_df %>% select(columns)
us_confirmed_df

us_deaths_df <- us_deaths %>% gather(`1/22/20`,`9/28/20`,key = 'Date',value = "Deaths")
columns = c("FIPS","Admin2","Province_State","Country_Region","Lat","Long_","Combined_Key","Population","Date","Deaths")
us_deaths_df <- us_deaths_df %>% select(columns)
head(us_deaths_df)

#Merge two dataframes
us_covid <- us_confirmed_df %>% left_join(us_deaths_df)

## Joining, by = c("FIPS", "Admin2", "Province_State", "Country_Region", "Lat", "Long_", "Combined_Key", "Date")

head(us_covid)

Understand

After successfully reshaping and merging the dataset, basic summary statistics is calculated. The data frame consists of 6680 observations with 11 different variables representing the case number as per the date. Taking a glimpse at the data and their types, variables with inappropriate data types are converted into their correct formats. For example, variables like FIPS is converted into a character as it explains the unique code but not a number and date variable is parsed into its appropriate format representing a month, day and year.

While the case numbers are a cumulative sum, the latest date i.e September 28, 2020, is taken to avoid data inconsistencies and errors. The resulting data frame now consists of county-level information with the total number of confirmed cases and deaths. Based on the range of the number of confirmed cases, a new factor variable is introduced which bins them into categories such as Low cases, Medium Cases, High Cases and Very High Cases.

glimpse(us_covid)

## Observations: 6,680
## Variables: 11
## $ FIPS              <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 10...
## $ Admin2            <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Bl...
## $ Province_State    <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ Country_Region    <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat               <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_             <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key      <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date              <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", ...
## $ `Confirmed Cases` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Population        <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 1...
## $ Deaths            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

#Convert Province_State to factors, Date to date object

us_covid$FIPS <- as.character(us_covid$FIPS)
us_covid$Date <- as.Date(parse_date_time(us_covid$Date,"mdy"))
glimpse(us_covid)

## Observations: 6,680
## Variables: 11
## $ FIPS              <chr> "1001", "1003", "1005", "1007", "1009", "101...
## $ Admin2            <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Bl...
## $ Province_State    <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ Country_Region    <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat               <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_             <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key      <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date              <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01...
## $ `Confirmed Cases` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Population        <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 1...
## $ Deaths            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

#Filter the latest date

latest_date <- max(us_covid$Date,na.rm=TRUE)

data <- us_covid %>% filter(Date == latest_date)

#Create bins for Confirmed Cases

case_bins <- cut(data$`Confirmed Cases`, breaks=c(0,10000,50000,100000,max(data$`Confirmed Cases`)), labels=c("Low Cases","Medium Cases","High Cases","Very High Cases"),
    include.lowest=T)
data$Case_Category <- case_bins

glimpse(data)

## Observations: 3,340
## Variables: 12
## $ FIPS              <chr> "1001", "1003", "1005", "1007", "1009", "101...
## $ Admin2            <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Bl...
## $ Province_State    <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ Country_Region    <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat               <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_             <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key      <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date              <date> 2020-09-28, 2020-09-28, 2020-09-28, 2020-09...
## $ `Confirmed Cases` <dbl> 1785, 5588, 886, 657, 1618, 607, 914, 3548, ...
## $ Population        <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 1...
## $ Deaths            <dbl> 25, 50, 7, 10, 15, 14, 39, 44, 42, 13, 30, 1...
## $ Case_Category     <fct> Low Cases, Low Cases, Low Cases, Low Cases, ...

Scan I

A quick statistical summary to identify whether the dataset contains missing values reveals that there are few missing values in columns FIPS, Admin2, Population and Deaths.

#Dealing missing values
colSums(is.na(data))

##            FIPS          Admin2  Province_State  Country_Region 
##              10               6               0               0 
##             Lat           Long_    Combined_Key            Date 
##               0               0               0               0 
## Confirmed Cases      Population          Deaths   Case_Category 
##               0             305             305               0

While looking carefully at the missing data for the county, values were reported from the islands or territories that belong to the United States which do not represent counties. According to the US Geological Survey website[2], the missing county names were the same as the state name and hence, same values were imputed. For missing population and deaths, the average number is used to impute the population whereas, for deaths, a condition is applied such that if there are no confirmed cases, there are no deaths and if there are cases recorded, the average value of death is imputed to such missing values. Finally, missing values for FIPS are coded as constant as it does not affect our analysis.

After treating the missing values, a quick sanity check is performed across all columns to see whether there is any special or infinite range of values.

#Code missing county state as same as the state name to not loose data

data$Admin2 <- ifelse(is.na(data$Admin2),data$Province_State,data$Admin2)

data$Province_State <- as.factor(data$Province_State)
data$Admin2 <- as.factor(data$Admin2)

#Mean imputation for population, deaths and FIPS
data$Population[is.na(data$Population)] <- mean(data$Population, na.rm=T)

data$Deaths[is.na(data$Deaths)] <- ifelse(data$`Confirmed Cases` =='0',0,mean(data$Deaths,na.rm=T))
data$Deaths <- ifelse(data$`Confirmed Cases`==0,0,data$Deaths)

data$FIPS[is.na(data$FIPS)] <- '0000'
colSums(is.na(data))

##            FIPS          Admin2  Province_State  Country_Region 
##               0               0               0               0 
##             Lat           Long_    Combined_Key            Date 
##               0               0               0               0 
## Confirmed Cases      Population          Deaths   Case_Category 
##               0               0               0               0

sum_special <- function(x){
  sum((is.infinite(x)|is.nan(x)))
}
sapply(data,sum_special)

##            FIPS          Admin2  Province_State  Country_Region 
##               0               0               0               0 
##             Lat           Long_    Combined_Key            Date 
##               0               0               0               0 
## Confirmed Cases      Population          Deaths   Case_Category 
##               0               0               0               0

Tidy and Manipulate II

A new variable ‘mortality rate’ is created as the ratio of deaths per confirmed cases as a percentage. Some observations contained missing values which sare replaced as zero with the assumption that if there were no confirmed cases and no deaths for a particular county, the mortality rate is also zero.

#Mutate mortality rate
data <- data %>% mutate("Mortality Rate" = (Deaths/`Confirmed Cases`)*100)
data$`Mortality Rate`[is.na(data$`Mortality Rate`)] <- 0
sapply(data,sum_special)

##            FIPS          Admin2  Province_State  Country_Region 
##               0               0               0               0 
##             Lat           Long_    Combined_Key            Date 
##               0               0               0               0 
## Confirmed Cases      Population          Deaths   Case_Category 
##               0               0               0               0 
##  Mortality Rate 
##               0

glimpse(data)

## Observations: 3,340
## Variables: 13
## $ FIPS              <chr> "1001", "1003", "1005", "1007", "1009", "101...
## $ Admin2            <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bul...
## $ Province_State    <fct> Alabama, Alabama, Alabama, Alabama, Alabama,...
## $ Country_Region    <chr> "US", "US", "US", "US", "US", "US", "US", "U...
## $ Lat               <dbl> 32.53953, 30.72775, 31.86826, 32.99642, 33.9...
## $ Long_             <dbl> -86.64408, -87.72207, -85.38713, -87.12511, ...
## $ Combined_Key      <chr> "Autauga, Alabama, US", "Baldwin, Alabama, U...
## $ Date              <date> 2020-09-28, 2020-09-28, 2020-09-28, 2020-09...
## $ `Confirmed Cases` <dbl> 1785, 5588, 886, 657, 1618, 607, 914, 3548, ...
## $ Population        <dbl> 55869.0, 223234.0, 24686.0, 22394.0, 57826.0...
## $ Deaths            <dbl> 25.00000, 50.00000, 7.00000, 10.00000, 15.00...
## $ Case_Category     <fct> Low Cases, Low Cases, Low Cases, Low Cases, ...
## $ `Mortality Rate`  <dbl> 1.4005602, 0.8947745, 0.7900677, 1.5220700, ...

Scan II

Outliers are the most extreme points in the observation that represents abnormality. However, some values can be useful and therefore represent the correct information. After looking through the summary, boxplots and Chi-square QQ-plot for investigating the outliers in the numerical columns, there are more than 50% values detected as Outliers. While examining further, it is evident that these values represent the confirmed cases, deaths, population and the mortality rate of the counties in the United States due to the coronavirus pandemic. The variability on this dataset explains most of the people in areas such as New York, King, Los Angeles counties are immensely affected by the virus although some counties had quite less to no recorded cases. The detected anomalies are kept for this analysis.

#Scan outliers

#Confirmed, Population, Deaths, Mortality Rate
numeric_data <- data %>% select(`Confirmed Cases`,Deaths,Population,`Mortality Rate`)

numeric_data %>% summary()

##  Confirmed Cases      Deaths          Population      Mortality Rate    
##  Min.   :     0   Min.   :   0.00   Min.   :      0   Min.   :   0.000  
##  1st Qu.:   118   1st Qu.:   1.00   1st Qu.:  11929   1st Qu.:   0.575  
##  Median :   387   Median :   8.00   Median :  29918   Median :   1.629  
##  Mean   :  2141   Mean   :  57.51   Mean   :  93391   Mean   :  15.945  
##  3rd Qu.:  1200   3rd Qu.:  42.00   3rd Qu.:  93391   3rd Qu.:   3.311  
##  Max.   :268455   Max.   :7250.00   Max.   :5150233   Max.   :5801.549

par(mar = c(4,4,.7,.7))
boxplot(numeric_data$`Confirmed Cases`,xlab = "Confirmed Cases")
boxplot(numeric_data$Deaths,xlab = "Deaths")
boxplot(numeric_data$Population,xlab = "Population")
boxplot(numeric_data$`Mortality Rate`,xlab = "Mortality Rate")

results <- mvn(data = numeric_data, multivariateOutlierMethod = "quan", showOutliers = TRUE)

Transform

While inspecting further on confirmed cases, the distribution is right-skewed with most of the values lie below the median. Hence, to achieve normality, a logarithmic transformation is applied to this column.

par(mar = c(4,4,.7,.7))
hist(data$`Confirmed Cases`,col="cyan",main="Histogram of Confirmed Cases",xlab = "Confirmed Cases")
log_cases <- log10(data$`Confirmed Cases`)
hist(log_cases,col = "light blue",main = "Histogtram of Logarithmic Confirmed Cases",xlab="Log Confirmed Cases")

MATH 2349 Data Wrangling Assignment 2

Nischay Bikram Thapa

S3819491

COVID-19 Impact in the United States

Executive Summary

Data

Tidy and Manipulate I

Understand

Scan I

Tidy and Manipulate II

Scan II

Transform

References

[1] COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) https://github.com/CSSEGISandData/COVID-19

[2] United States Geological Survey: https://www.usgs.gov/

[3] Course Website http://rare-phoenix-161610.appspot.com/secured/index.html