Data Wrangling (COVID-19 Data Preprocessing)

Required packages

library(readr)

## Warning: package 'readr' was built under R version 4.0.5

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.0.5

library(MVN)

## Warning: package 'MVN' was built under R version 4.0.5

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## sROC 0.1-2 loaded

library(forecast)

## Warning: package 'forecast' was built under R version 4.0.5

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(outliers)
library(stringr)

Executive Summary

This report provides a technical practice based on the COVID-19 daily cumulative data of data wrangling.

The methods of this practice follow the five major tasks of data preprocessing, including Get, Understand, Tidy and Manipulate, Scan and Transform. Specifically, retrieving data from open data source. Then, getting a better understanding of each elements in the selected data set. Thirdly, manipulating the untidy data for following analysis. Also, checking the plausibility of any values in the data set. Finally, performing data transformation.

All data cleaning results can be found in the explanation and R codes in each section.

Recommendations include increasing the size of the selected data set for statistical analysis, increasing the different types of variables for further manipulation and improving the quality of visualisation in terms of scanning numeric data.

This report also finds the fact that the data cleaning procedure has limitations. Some limitations include that the selected data set probably is not suitable as variables are not enough to fulfil the minimum requirements. Also, the quality of this report is not high enough due to the basic understanding towards knowledge and skills data preprocessing.

Data

The COVID-19 data preprocessing based on the daily cumulative data collected by Anthony Goldbloom, Devrishi, Jacky Wang, Peijen Lin and Timo Bozsolik (2021) and located at https://www.kaggle.com/antgoldbloom/covid19-data-from-john-hopkins-university?select=CONVENIENT_global_confirmed_cases.csv.This report mainly focuses on the analysis of cumulative daily-updated COVID-19 data as at 9th May 2021.

The data sets used in this report include two raw data sources, confirmed cases and deaths in the USA, which are licensed under the Creative Commons Attribution 4.0 Internatioanl (CC BY 4.0) by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU).

The us_merge data set which has 3342 observations and 7 variables combining both numeric data and categorical data is used to conduct the COVID-19 data preprocessing after merger of two raw data sets (confirmed cases and deaths in the USA).

Specifically, the us_merge data set contains the following variables:

Province_State: Province, state or dependency name.

County: County name.

Country_Region: Country or region name.

FIPS: Federal Information Processing Standards code that uniquely identifies counties within the USA.

Population: Population in every county.

Dead: Counts include confirmed and probable (where reported).

Confirmed: Counts include confirmed and probable (where reported).

us_dead <- read_csv("RAW_us_deaths.csv") # Retrieve raw data set of deaths in us as us_death

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Province_State = col_character(),
##   Admin2 = col_character(),
##   iso2 = col_character(),
##   iso3 = col_character(),
##   Country_Region = col_character(),
##   Combined_Key = col_character()
## )
## i Use `spec()` for the full column specifications.

us_confirmed <- read_csv("RAW_us_confirmed_cases.csv") # Retrieve raw data set of confirmed cases in us as us_confirmed

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Province_State = col_character(),
##   Admin2 = col_character(),
##   iso2 = col_character(),
##   iso3 = col_character(),
##   Country_Region = col_character(),
##   Combined_Key = col_character()
## )
## i Use `spec()` for the full column specifications.

us_dead_merge <- us_dead[, c("Province_State", "Admin2", "Population", "05/09/2021")] # Select useful columns for as us_dead_merge
us_confirmed_merge <- us_confirmed[, c("iso3", "FIPS", "05/09/2021")] # Select useful columns as us_confirmed_merge
us_merge <- bind_cols(us_dead_merge, us_confirmed_merge) # Use bind_cols function to merge two data sets as us_merge for following analysis

## New names:
## * `05/09/2021` -> `05/09/2021...4`
## * `05/09/2021` -> `05/09/2021...7`

names(us_merge)[names(us_merge) == "Admin2"] <- "County" # This and the following three steps are renaming the column names
names(us_merge)[names(us_merge) == "iso3"] <- "Country_Region"
names(us_merge)[names(us_merge) == "05/09/2021...4"] <- "Dead"
names(us_merge)[names(us_merge) == "05/09/2021...7"] <- "Confirmed"
us_merge <- us_merge %>% relocate(Country_Region, .after = County) # This and the following step are changing the order of columns
us_merge <- us_merge %>% relocate(FIPS, .after = Country_Region)
head(us_merge)

Understand

The data set us_merge contains 3342 observations of 7 variables, including:

Three qualitative variables: multi-nominal variable - Province_State, nominal variable - County, Country_Region

Four quantitative variables: discrete quantitative variable - FIPS, Population, Dead, Confirmed

The summary of the types of 7 variables is:

Numeric (double) - FIPS, Population, Dead, Confirmed

Character - Province_State, County, Country_Region

The factor is created for variable Country_Region and all numeric (double) variables are converted to integer as the number of population, dead and confirmed are integer.

head(str(us_merge)) # Check the structure of the data set us_merge

## tibble [3,342 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Province_State: chr [1:3342] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ County        : chr [1:3342] "Autauga" "Baldwin" "Barbour" "Bibb" ...
##  $ Country_Region: chr [1:3342] "USA" "USA" "USA" "USA" ...
##  $ FIPS          : num [1:3342] 1001 1003 1005 1007 1009 ...
##  $ Population    : num [1:3342] 55869 223234 24686 22394 57826 ...
##  $ Dead          : num [1:3342] 108 308 57 64 137 41 69 314 123 45 ...
##  $ Confirmed     : num [1:3342] 6920 21131 2308 2607 6660 ...

## NULL

dim(us_merge) # Check the dimensions of the data set us_merge

## [1] 3342    7

class(us_merge$Province_State) # This and the following steps are checking the type and class of each variable

## [1] "character"

typeof(us_merge$Province_State)

## [1] "character"

class(us_merge$County)

## [1] "character"

typeof(us_merge$County)

## [1] "character"

class(us_merge$Country_Region)

## [1] "character"

typeof(us_merge$Country_Region)

## [1] "character"

class(us_merge$FIPS)

## [1] "numeric"

typeof(us_merge$FIPS)

## [1] "double"

class(us_merge$Population)

## [1] "numeric"

typeof(us_merge$Population)

## [1] "double"

class(us_merge$`Dead`)

## [1] "numeric"

typeof(us_merge$`Dead`)

## [1] "double"

class(us_merge$`Confirmed`)

## [1] "numeric"

typeof(us_merge$`Confirmed`)

## [1] "double"

us_merge$Country_Region <- factor(us_merge$Country_Region, levels = c("ASM", "GUM", "MNP", "PRI", "USA", "VIR"),
                                  labels = c("American Samoa", "Guam", "Northern Mariana Islands", "Puerto Rico", "United States of America", "VIrgin Islans"), ordered = TRUE) # Create, label and order factor for variable Country_Region
levels(us_merge$Country_Region) # Check the level of variable Country_Region

## [1] "American Samoa"           "Guam"                    
## [3] "Northern Mariana Islands" "Puerto Rico"             
## [5] "United States of America" "VIrgin Islans"

us_merge$FIPS <- as.integer(us_merge$FIPS) # This and the following steps are converting the double data type to integer
us_merge$Population <- as.integer(us_merge$Population)
us_merge$`Dead` <- as.integer(us_merge$`Dead`)
us_merge$`Confirmed` <- as.integer(us_merge$`Confirmed`)
head(us_merge)

Tidy & Manipulate Data I

The variables Dead and Confirmed contain all information about the number, and whether it is dead or confirmed case, which may break the tidy data rule as each variable must have its column (Wickham and Grolemund 2016). Therefore, reshaping the data set us_merge to a long format is essential.

us_merge_tidy <- us_merge %>% gather(Catogory, Number, 6:7) # Reshape the data set us_merge to a long layout as us_merge_tidy
class(us_merge_tidy$Catogory) # This and the following three steps are checking the class and type of reshaped columns

## [1] "character"

typeof(us_merge_tidy$Catogory)

## [1] "character"

class(us_merge_tidy$Number)

## [1] "integer"

typeof(us_merge_tidy$Number)

## [1] "integer"

Tidy & Manipulate Data II

According to the estimating mortality from COVID-19 from World Health Organization (2020), the Case Fatality Ratio (CFR) and Infection Ratio (IR) are two critical indicators when evaluating the severity of COVID-19. The case fatality ratio is the percentage of the number of dead and the number of confirmed while the infection ratio equals that confirmed divide by population basen on this data set.

Therefore, the two new variables CFR and IR are created from the existing variables. Also, creating a new variable Province_State_Index for string manipulation.

us_merge <- us_merge %>% mutate(us_merge, "CFR (%)" = (Dead/Confirmed) * 100) # This and following three steps are creating two new variables 
us_merge$`CFR (%)` <- round(us_merge$`CFR (%)`, digits = 2)
us_merge <- us_merge %>% mutate(us_merge, "IR (%)" = (Confirmed/Population) * 100)
us_merge$`IR (%)` <- round(us_merge$`IR (%)`, digits = 2)
us_merge <- us_merge %>% mutate(us_merge, Province_State_Index = abbreviate(Province_State, minlength = 3)) %>% relocate(Province_State_Index, .after = Province_State) # Create a new string variable 
us_merge$Province_State_Index <- str_to_upper(us_merge$Province_State_Index) # Covert all lower case characters to upper case
head(us_merge)

Scan I

The solutions for missing values, special values and other errors after scanning the data set us_merge are shown below.

The variable County contains 6 missing values as some of the province or state does not have counties and recoding the missing values with “Unassigned”.

The variable FIPS indicates the unique code for every county in the USA. It is found that some of the counties don’t have this code and only 10 missing values of this variable are detected (less than 5% of the total amount of this variable). The solution is excluding these small amount of missing data when analyzing this variable.

The variable CFR contains both missing values and infinite values (0/0) as the calculation of CFR is based on the variable Dead and the variable Confirmed, such as 0 for Dead and 0 for Confirmed. Meanwhile, the amount of missing values or infinite values are less than 5% of the total data for this variable (2.78% and 0.18% respectively). This explanation also applies to the variable IR. However, 0 for Dead and 0 for Confirmed may indicate good precaution from COVID-19 in reality. Thus, it would be better to exclude the zero values in variables Population which are miss-counted.

is.na(us_merge$Province_State) %>% sum() # This and the following steps are find the total number of missing values in each variable

## [1] 0

is.na(us_merge$County) %>% sum()

## [1] 6

is.na(us_merge$Country_Region) %>% sum()

## [1] 0

is.na(us_merge$FIPS) %>% sum()

## [1] 10

is.na(us_merge$Population) %>% sum()

## [1] 0

is.na(us_merge$Dead) %>% sum()

## [1] 0

is.na(us_merge$Confirmed) %>% sum()

## [1] 0

is.na(us_merge$`CFR (%)`) %>% sum()

## [1] 93

is.na(us_merge$`IR (%)`) %>% sum()

## [1] 70

us_merge$County[which(us_merge$County == "NA")] <- "Unassigned" # Recode the missing values with "Unassigned"
numeric_variable <- us_merge[, c("FIPS", "Population", "Dead", "Confirmed", "CFR (%)", "IR (%)")] # Create a list of 6 numeric variables
inf_nan_na <- function(x){
  if(is.numeric(x))(is.infinite(x) | is.nan(x) | is.na(x))
}
sapply(numeric_variable, function(x) sum(is.na(x))) # Use function to identify missing values or special values in general

##       FIPS Population       Dead  Confirmed    CFR (%)     IR (%) 
##         10          0          0          0         93         70

which(is.na(us_merge$FIPS)) # Identify the location of missing values for variable FIPS

##  [1] 1270 1307 1339 1594 2957 2962 2981 2982 2985 2993

which(is.infinite(us_merge$`CFR (%)`)) # Identify the location of infinite values for variable CFR

## [1]  719  813 1262 1942 2119 2205

which(is.na(us_merge$`CFR (%)`)) # Identify the location of missing values for variable CFR

##  [1]   53   65   74   82   92   99  103  114  118  173  228  305  330  336  341
## [16]  342  392  577  581  618  625  791  898  999 1026 1127 1193 1230 1256 1269
## [31] 1276 1278 1426 1512 1532 1620 1653 1694 1714 1785 1827 1830 1840 1860 1888
## [46] 2035 2102 2187 2273 2320 2325 2381 2392 2484 2526 2532 2587 2599 2882 2931
## [61] 2958 2959 2960 2961 2963 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974
## [76] 2975 2976 2979 2980 2986 2990 2991 2992 3004 3102 3135 3170 3222 3236 3288
## [91] 3307 3332 3340

which(is.infinite(us_merge$`IR (%)`)) # Identify the location of infinite values for variable IR

##  [1]   98  191  254  320  332  338  339  406  521  556  573  580  699  914 1147
## [16] 1212 1236 1270 1281 1307 1339 1355 1365 1449 1807 1844 1865 1900 1960 2057
## [31] 2288 2456 2473 2486 2670 2690 2957 2962 2981 2982 2985 2987 2993 3006 3181

which(is.na(us_merge$`IR (%)`)) # Identify the location of missing values for variable IR

##  [1]   53   65   92  114  118  173  228  305  330  336  341  342  392  581  618
## [16]  625  719  791  813  898  999 1026 1127 1193 1230 1256 1262 1278 1426 1512
## [31] 1532 1620 1653 1694 1714 1785 1827 1830 1840 1860 1888 1942 2035 2102 2119
## [46] 2187 2205 2273 2320 2325 2381 2392 2484 2526 2532 2587 2599 2882 2931 2974
## [61] 3004 3102 3135 3170 3222 3236 3288 3307 3332 3340

is.infinite(us_merge$`CFR (%)`) %>% sum() # Find the total number of infinite values for variable CFR

## [1] 6

is.infinite(us_merge$`IR (%)`) %>% sum() # Find the total number of infinite values for variable IR

## [1] 45

us_merge <- filter(us_merge, Population > 0) # Exclude the miss-counted amount of population for each county
head(us_merge)

Scan II

There are six numeric variables in the data set us_merge, including FIPS, Population, Dead, Confirmed, CFR and IR. The outlier detection for these six variables is conducted separately due to their realistic meanings.

The variable FIPS is an artificial setting so that the outlier of this variable is meaningless.

The variable Population reflects the amount of people in each county and it is assumed that some large counties may be outliers. Using z-score to detect ouliers for this variable, which may reflects the distributions of population. According to the z-score method, the variable Population has 41 outliers, which may indicate that 41 counties have large amount of people as the negative number and zero number of this variable are excluded from the data set.

It is known that people infected by COVID-19 have chances to death. Thus, the variables Dead and Confirmed are considered to have relationships, the same assumption applys to the variables CFR and IR. Scatter plot is used to detect the outliers for Dead and Confirmed while the outliers of CFR and IR is detected by the Mahalanobis distance.

According to the scatter plot, there is a possible outlier on the higher right. That may be a county with larger amount of death and confirmed in reality.

As it is shown from the QQ plot, the Mahalonobis distance suggests that there are 260 outliers for this subset. Although the number of outliers is large, it is not reasonable to exclude or delete any outliers as they are meaningful in this case.Some outlier might indicate a county with a high infection rate and a high fatality rate.

population_z <- us_merge$Population %>% scores(type = "z") # Calculate z-score for the variable Population as population_z
population_z %>% summary() # Summarize z-score, the minimum is -0.31 and the maximum is 30.17

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3130 -0.2794 -0.2336  0.0000 -0.1083 30.1695

length(which(abs(population_z) > 3)) # Find the total numer of outliers

## [1] 41

plot(Dead ~ Confirmed, data = us_merge, main = "Dead by Confirmed") # Create the scatter plot for the variables Dead and Confirmed

CFR_IR <- us_merge[, c("CFR (%)", "IR (%)")] # Create a subset as CFR_IR
CFR_IR_result <- mvn(data = CFR_IR, multivariateOutlierMethod = "quan", showOutliers = TRUE) # Draw QQ plot to detect outliers

CFR_IR_result$multivariateOutliers # Select the results related to outliers

Transform

As it is shown from the histogram of the variable CFR, the variable CFR is non-normal data (right skewness in the distribution). In order to reduce the right skewness, diferent methods, including the log transformation, the square root transformation, reciprocal transformation and BoxCox transformation, are used in this case.

hist(us_merge$`CFR (%)`) # Draw histogram of the variable CFR

log_CFR <- log10(us_merge$`CFR (%)`) # This and following three steps are doing the log transformation
hist(log_CFR)

ln_CFR <- log(us_merge$`CFR (%)`)
hist(ln_CFR)

sqrt_CFR <- sqrt(us_merge$`CFR (%)`) # This and the following steps are doing square root transformation 
hist(sqrt_CFR)

CFR_recip <- 1/us_merge$`CFR (%)` # This and the following steps are doing reciprocal transformation
hist(CFR_recip)

CFR_BoxCox <- BoxCox(us_merge$`CFR (%)`, lambda = "auto") # This and the following steps are doing BoxCox transformation

## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.

hist(CFR_BoxCox)

Normalise

As the Case Fatality Ratio (CFR) is calculated by the number of dead and the number of confirmed cases, there is a wide range between these two numbers, which may have a huge impact on the variable CFR. Hence, the variable CFR will be normalized.

center_CFR <- scale(us_merge$`CFR (%)`, center = TRUE, scale = FALSE) # Apply centering only and plot the histogram
hist(center_CFR)

scale_CFR <- scale(us_merge$`CFR (%)`, center = FALSE, scale = TRUE) # Scale without centering and plot the histogram
hist(scale_CFR)

z_CFR <- scale(us_merge$`CFR (%)`, center = TRUE, scale = TRUE) # Apply z-score standardization and plot the histogram
hist(z_CFR)

Referrence

Goldbloom A, Devrishi, Wang J, Lin P, Bozsolik T, 2021, COVID-19 Data from John Hopkins University datasets, viewed 10 May 2021, https://www.kaggle.com/antgoldbloom/covid19-data-from-john-hopkins-university?select=CONVENIENT_global_confirmed_cases.csv

World Health Organizaiton. (2020). Estimating mortality from COVID-19: scientific brief, 4 August 2020. World Health Organization. < https://apps.who.int/iris/handle/10665/333642>. License: CC BY-NC-SA 3.0 IGO