Loading Packages


#Loading Packages
library(readr)
library(lubridate)
library(dplyr)
library(outliers)
library(ggplot2)
library(caret)
library(knitr)


Executive Summary


This report presents data wrangling approaches used to prepare data for the analysis stage. The datasets used as case studies relate to the COVID-19 pandemic; data for the most recent 30 days, starting September 10th 2020, are considered for all countries around the world. Data tidying, data subsetting, merging of datasets, variable type conversion, variable creation, handling of null and special entries, and standardization are applied in preparing the datasets. The final dataset is composed of variables from the three initial datasets in a format that can be easily analyzed. Through the processes applied to the COVID-19 pandemic datasets, it is shown that data wrangling provides a reliable approach to the cleaning and preparation of data.


Data


Three datasets are collected and used for the data wrangling in this report: the COVID-19 Confirmed cases data (Source: Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE), 2020), the COVID-19 Deaths data (Source: Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE), 2020) and the Continents data (Source: Chaitanya, 2017).

The COVID-19 Confirmed cases data contained 266 attributes and 266 observations. The first four variables in the dataset were Province/State (where applicable), Country/Region, Lat (latitude) and Long (longitude). The remaining variables, from variable 5 to variable 266, were dates in time series format from 22nd January 2020 to 9th October 2020, each holding the confirmed COVID-19 case counts recorded for that date.


#Importing Datasets

#Importing COVID-19 Confirmed Cases Data
covid19_confirmed <- read_csv("time_series_covid19_confirmed_global.csv")
head(covid19_confirmed)
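

As a quick structural check (a minimal sketch, not part of the original workflow), the dimensions of the imported data can be confirmed against the description above; the same check applies to the deaths and continents files imported below.


#Checking the reported dimensions (observations and attributes)
dim(covid19_confirmed)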


The COVID-19 Deaths data similarly contained 266 attributes and 266 observations. The first four variables in the dataset were Province/State (where applicable), Country/Region, Lat (latitude) and Long (longitude). The remaining variables, from variable 5 to variable 266, were dates in time series format from 22nd January 2020 to 9th October 2020, each holding the COVID-19 death counts recorded for that date.


#Importing COVID-19 Deaths Data
covid19_deaths <- read_csv("time_series_covid19_deaths_global.csv")
head(covid19_deaths)


The Continents dataset contained 9 attributes and 249 observations. The variables in the dataset were country, code 2, code 3, country code, iso, continent, sub region, region code and sub region code.


#Importing Countries and Continents Data
countryContinent <- read_csv("countryContinent.csv")
head(countryContinent)


Sub-setting I


In case selection, a 30-day period was considered for the COVID-19 Confirmed cases data and the COVID-19 Deaths data. This captured the most recent data on the COVID-19 pandemic, which would then be useful for short-term predictions and analyses, and formed the first stage of subsetting in the data wrangling process. Subsetting by case is an approach in data wrangling that selects the observations of interest for a study through the selection of rows (Witten, 2011; O’Neil & Schutt, 2013). The two datasets, however, are untidy: the cases are spread across the date columns from variable 5 to variable 266.


#Subsetting
#Selecting the most recent 30-day period
covid19_confirmed <- covid19_confirmed[, c(1:4, 
                                           (ncol(covid19_confirmed)-29):
                                             ncol(covid19_confirmed))]
covid19_deaths <- covid19_deaths[, c(1:4, 
                                     (ncol(covid19_deaths)-29):
                                       ncol(covid19_deaths))]
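

For reference, the same 30-day selection can be written with dplyr (a minimal sketch using select and tidyselect's all_of; the result is assigned to a separate object so that the base R subsetting above remains the version used in this report).


#Equivalent dplyr formulation of the 30-day column selection (a sketch)
covid19_confirmed_alt <- covid19_confirmed %>% 
  select(1:4, all_of(tail(names(covid19_confirmed), 30)))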


Tidy and Manipulate Data I


The tidying process for the COVID-19 Confirmed cases data and the COVID-19 Deaths data involves rearranging the cases stored under the date variables. Untidy data can be described as data that is not in the standard data frame format with respect to the arrangement of observations and variables, or that is incomplete as a result of the data entry process (Shaffer, 2011; Arif & Mujtaba, 2015; Revels & Nussbaumer, 2013). In the first stage of the tidying process, two vectors were created to store the country names and the dates. The country names were replicated as many times as there were date variables, and the date variables were replicated as many times as there were observations.


#Making datasets Tidy
#Converting the time series date columns into a single dates variable
#Confirmed cases
#Replicating the country names by the number of dates in dataset
Country_rep1 <- rep(covid19_confirmed$`Country/Region`, 
                    ncol(covid19_confirmed)-4)

#Sorting the countries
Country_rep1 <- sort(Country_rep1, decreasing = F)

#Replicating the date column names by the number of observations in the dataset
Dates_rep1 <- rep(colnames(covid19_confirmed)[5:ncol(covid19_confirmed)], 
                  nrow(covid19_confirmed))


In the second stage of the tidying process, an additional vector, actual data, was created to store the transposed entries of the dates-only portion of the data frame.


#Transposing and storing as vector the actual data in the dataset
actual_data1 <- as.vector(t(covid19_confirmed[,c(-1,-2,-3,-4)]))


In the final stage, the three vectors created above (country names, dates and actual data) were combined to form the new tidy confirmed cases dataset, with the variables Country, Dates and Confirmed_Cases.


#Compiling the New confirmed cases data and renaming columns
covid19_confirmed_new <- data.frame(Country_rep1, 
                                    Dates_rep1, actual_data1)
colnames(covid19_confirmed_new) <- c("Country", "Dates", 
                                     "Confirmed_Cases")


The three replication and compilation stages above were then repeated for the Deaths data to create the new tidy deaths dataset, with the variables Country, Dates and Deaths.


#Deaths
#Replicating the country names by the number of dates in dataset
Country_rep2 <- rep(covid19_deaths$`Country/Region`, 
                    ncol(covid19_deaths)-4)

#Sorting the countries
Country_rep2 <- sort(Country_rep2, decreasing = F)

#Replicating the date column names by the number of observations in the dataset
Dates_rep2 <- rep(colnames(covid19_deaths)[5:ncol(covid19_deaths)], 
                  nrow(covid19_deaths))

#Transposing and storing as vector the actual data in the dataset
actual_data2 <- as.vector(t(covid19_deaths[,c(-1,-2,-3,-4)]))

#Compiling the new deaths data and renaming columns
covid19_deaths_new <- data.frame(Country_rep2, 
                                 Dates_rep2, actual_data2)
colnames(covid19_deaths_new) <- c("Country", "Dates", "Deaths")


Sub-setting II


In variable selection, two variables were considered from the continents dataset: country and continent. This makes it possible for the countries under the country variable in the COVID-19 Confirmed cases data and COVID-19 Deaths data to be identified by continent, and formed the second stage of subsetting in the data wrangling process. Subsetting by variable is an approach in data wrangling that selects the variables of interest for a study through the selection of columns (Albright, 2015; Albright & Winston, 2014). This reduced the number of variables in the continents dataset from 9 to 2.


#Subsetting 
#Variable selection and renaming columns in country continent data
countryContinent <- countryContinent %>% 
  select(country, continent)
colnames(countryContinent) <- c("Country", "Continent")
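

As a small idiomatic note (a sketch only), select can rename columns while selecting them, which makes the separate colnames call unnecessary; here the file is re-read into a separate object so the version above is unaffected.


#Selecting and renaming in one step (equivalent to the two steps above)
countryContinent_alt <- read_csv("countryContinent.csv") %>% 
  select(Country = country, Continent = continent)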


Combining Datasets


The subsetting and tidying stages prepared the three datasets for merging. Merging refers to combining datasets that are perceived to be related so that the variables in the different datasets can be analyzed together (Ulf-Dietrich & Uwe, 2014; Zekic-Susac & Has, 2015; Vicenc, 2017). The merging process was completed in two stages; the first stage combined the COVID-19 Confirmed cases data and the COVID-19 Deaths data by appending the Deaths column of the latter to the former.


#Merging datasets
#Merging COVID-19 datasets and renaming columns
Full_Data <- data.frame(covid19_confirmed_new, 
                        covid19_deaths_new$Deaths)
colnames(Full_Data)[4] <- c("Deaths")
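

Binding the Deaths column onto the confirmed cases data in this way assumes that the two tidy data frames share exactly the same row order. A quick check (a minimal sketch) confirms that the Country and Dates columns line up row for row before the bind is relied upon.


#Verifying that the key columns match row-for-row across the two datasets
identical(covid19_confirmed_new[, c("Country", "Dates")],
          covid19_deaths_new[, c("Country", "Dates")])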


The second stage combined the merged COVID-19 dataset with the continents data by matching the Country columns in the two datasets. Countries without a match in the continents data were dropped by the merge, leaving a combined dataset of 5 attributes and 6930 observations: Country, Dates, Confirmed_Cases, Deaths and Continent.


#Merging Covid Data with Continents Data
Full_Data <- merge(Full_Data, countryContinent)
kable(head(Full_Data))
Country Dates Confirmed_Cases Deaths Continent
Afghanistan 9/26/20 39192 1453 Asia
Afghanistan 9/13/20 38716 1420 Asia
Afghanistan 9/25/20 39186 1451 Asia
Afghanistan 10/3/20 39297 1462 Asia
Afghanistan 10/4/20 39341 1462 Asia
Afghanistan 9/10/20 38572 1420 Asia
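

The merge above keeps only countries that appear in both datasets (the default behavior of merge), which is why the row count falls at this step. A quick check (a minimal sketch) lists the country names in the COVID-19 data that have no match in the continents file, for example where the two sources spell a country differently.


#Country names in the COVID-19 data with no match in the continents data
setdiff(unique(as.character(covid19_confirmed_new$Country)),
        countryContinent$Country)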


Understand

The combined dataset had two factor variables (Country and Dates), two numeric variables (Confirmed_Cases and Deaths) and one character variable (Continent). This was checked using the str function, which displays the structure of the data.


#Checking variable types
str(Full_Data)
## 'data.frame':    6930 obs. of  5 variables:
##  $ Country        : Factor w/ 188 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Dates          : Factor w/ 30 levels "10/1/20","10/2/20",..: 26 13 25 3 4 10 2 7 20 12 ...
##  $ Confirmed_Cases: num  39192 38716 39186 39297 39341 ...
##  $ Deaths         : num  1453 1420 1451 1462 1462 ...
##  $ Continent      : chr  "Asia" "Asia" "Asia" "Asia" ...


The Dates and Continent variables are incorrectly typed as factor and character respectively. To correct this, the mdy function from lubridate was first used to convert the Dates variable from factor to Date. In the second step, the Continent variable was converted from character to factor using the factor function, which simultaneously assigned the factor levels for the continents.


#Converting variable types
#Date Variable
Full_Data$Dates <- mdy(Full_Data$Dates)

#Continent Variable
Full_Data$Continent <- factor(Full_Data$Continent)


In the final step, the arrange function was used to sort the observations in the dataset by Country and Dates in ascending order.


#Arranging dataset using country and dates variables
Full_Data <- Full_Data %>% arrange(Country, Dates)


Tidy and Manipulate Data II


The Dates variable was mutated to create two additional variables, Week and Day, which allow the trends in confirmed cases and deaths due to COVID-19 to be observed on a day-by-day and week-by-week basis. This was completed using the week and day functions from lubridate, increasing the number of variables in the dataset from 5 to 7.


#Creating week and day variables from the dates variable
Full_Data <- Full_Data %>% mutate(Week = week(Dates),
                                    Day = day(Dates))
kable(head(Full_Data))
Country Dates Confirmed_Cases Deaths Continent Week Day
Afghanistan 2020-09-10 38572 1420 Asia 37 10
Afghanistan 2020-09-11 38606 1420 Asia 37 11
Afghanistan 2020-09-12 38641 1420 Asia 37 12
Afghanistan 2020-09-13 38716 1420 Asia 37 13
Afghanistan 2020-09-14 38772 1425 Asia 37 14
Afghanistan 2020-09-15 38815 1426 Asia 37 15
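

Note that lubridate's week function returns the number of complete seven-day periods between the date and 1st January, plus one, so the values are not necessarily ISO calendar weeks; the two conventions can differ, particularly near year boundaries. A minimal sketch compares the ISO 8601 week numbers for the same dates.


#Comparing week-numbering conventions (a sketch)
head(isoweek(Full_Data$Dates))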


Scan I

Scanning the dataset for missing entries, special values and obvious errors returned no cases. This was completed using the is.na and is.infinite functions. Therefore, no further action was necessary at this stage.


#Scanning Dataset for null entries
which(is.na(Full_Data))
## integer(0)
#Scanning Dataset for infinite values
which(is.infinite(Full_Data$Confirmed_Cases))
## integer(0)
which(is.infinite(Full_Data$Deaths))
## integer(0)
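

A per-column breakdown of the same checks (a minimal sketch) makes it easier to see where any missing or infinite values would sit.


#Counting missing values per column and infinite values in the numeric columns
colSums(is.na(Full_Data))
sapply(Full_Data[, c("Confirmed_Cases", "Deaths")],
       function(x) sum(is.infinite(x)))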


Scan II


The first step in the scan for outliers involved plotting boxplots for the two numerical variables in the dataset: Confirmed_Cases and Deaths. Boxplots are visualizations used to identify outliers in a dataset by representing measures of both central tendency and spread (Kirk, 2016; Chambers, 2017; Minelli, Chambers, & Diraj, 2013). The boxplots below for both the confirmed cases and the deaths revealed the existence of extreme outliers in the data.


#Outliers
#Checking for Outliers
#Confirmed Cases
ggplot(Full_Data) +
  aes(x = "", y = Confirmed_Cases) +
  geom_boxplot(fill = "#0c4c8a") +
  labs(title = "Boxplot for Confirmed Cases", 
       subtitle = "Outliers Check", 
       caption = "COVID-19 Data") +
  theme_minimal()

#Deaths
ggplot(Full_Data) +
  aes(x = "", y = Deaths) +
  geom_boxplot(fill = "#0c4c8a") +
  labs(title = "Boxplot for Deaths", 
       subtitle = "Outliers Check", 
       caption = "COVID-19 Data") +
  theme_minimal()
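

Before any removal, an alternative screen can cross-check the boxplot findings. The outliers package loaded at the start of the report (but not otherwise used) provides z-scores through its scores function; a minimal sketch counts the values lying more than three standard deviations from the mean, a different rule from the boxplot's 1.5 x IQR fences.


#Alternative outlier screen using z-scores from the outliers package (a sketch)
z_conf <- abs(scores(Full_Data$Confirmed_Cases, type = "z")) > 3
z_deaths <- abs(scores(Full_Data$Deaths, type = "z")) > 3
sum(z_conf)
sum(z_deaths)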


The second step in the scan for outliers involved handling the outliers identified in the plots above. Given that the incidence of COVID-19 is unique to every country, replacing the outliers would be erroneous and would misrepresent the actual situation on the ground. Since factors such as false figures reported by countries may lead to incorrect values, removing the outliers presents the best alternative in this case. The $out component returned by the boxplot function was used to identify the outliers, and the corresponding rows were used to exclude the outlying observations in both the Confirmed_Cases and Deaths variables. The final dataset had 5071 observations, down from the initial 6930.


#Handling Outliers
#Confirmed Cases
Outliers1 <- boxplot(Full_Data$Confirmed_Cases, plot = F)$out
Full_Data <- Full_Data[-which(Full_Data$Confirmed_Cases %in% Outliers1),]

#Deaths
Outliers2 <- boxplot(Full_Data$Deaths, plot = F)$out
Full_Data <- Full_Data[-which(Full_Data$Deaths %in% Outliers2),]
kable(head(Full_Data))
Country Dates Confirmed_Cases Deaths Continent Week Day
31 Albania 2020-09-10 10860 324 Europe 37 10
32 Albania 2020-09-11 11021 327 Europe 37 11
33 Albania 2020-09-12 11185 330 Europe 37 12
34 Albania 2020-09-13 11353 334 Europe 37 13
35 Albania 2020-09-14 11520 338 Europe 37 14
36 Albania 2020-09-15 11672 340 Europe 37 15


Data Transformation

Data transformation refers to the conversion of data through mathematical computations with the aim of either scaling or normalizing it (Barbara & Susan, 2014; Everitt & Skrondal, 2010). In this instance, the first step created a centering and scaling model for the Confirmed_Cases and Deaths variables using the preProcess function from caret. The second step applied the predict function with this model to the two variables to obtain their normalized forms. The final step replaced the two variables in the dataset with the normalized output from predict.


#Data Transformation
#Normalizing numerical variable in dataset
Prep <- preProcess(Full_Data[, 3:4], method = c("center","scale"))
normData <- predict(Prep, Full_Data[, 3:4])
kable(head(normData))
Confirmed_Cases Deaths
31 0.3417674 1.741032
32 0.3537501 1.762705
33 0.3659561 1.784378
34 0.3784598 1.813276
35 0.3908891 1.842173
36 0.4022019 1.856622
#Replacing in Data
Full_Data$Confirmed_Cases <- normData$Confirmed_Cases
Full_Data$Deaths <- normData$Deaths
kable(head(Full_Data))
Country Dates Confirmed_Cases Deaths Continent Week Day
31 Albania 2020-09-10 0.3417674 1.741032 Europe 37 10
32 Albania 2020-09-11 0.3537501 1.762705 Europe 37 11
33 Albania 2020-09-12 0.3659561 1.784378 Europe 37 12
34 Albania 2020-09-13 0.3784598 1.813276 Europe 37 13
35 Albania 2020-09-14 0.3908891 1.842173 Europe 37 14
36 Albania 2020-09-15 0.4022019 1.856622 Europe 37 15
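

Centering and scaling with preProcess is the classic standardization, (x - mean(x)) / sd(x), applied to each column. A quick check (a minimal sketch) confirms that the replaced columns now have a mean of approximately 0 and a standard deviation of 1.


#Confirming the transformed columns are centered on 0 with unit standard deviation
sapply(Full_Data[, c("Confirmed_Cases", "Deaths")],
       function(x) c(Mean = mean(x), SD = sd(x)))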

References


Albright, C. S., & Winston, W. L. (2014). Business Analytics: Data Analysis & Decision Making (1st ed.). New York: Cengage Learning.

Albright, S. (2015). Business Analytics: Data Analysis and Decision Making (1st ed.). Stamford, CT: Cengage Learning.

Arif, M., & Mujtaba, G. (2015). A survey: data warehouse architecture. International journal of hybrid information technology, 8(5), 349-356.

Barbara, I., & Susan, D. (2014). Introductory Statistics (1st ed.). New York: OpenStax CNX.

Chaitanya, G. (2017). country to continent. Retrieved from Kaggle: https://www.kaggle.com/statchaitya/country-to-continent

Chambers, J. M. (2017). Graphical Methods for Data Analysis: (1st ed.). New York: Chapman and Hall/CRC.

Everitt, B. S., & Skrondal, A. (2010). Cambridge Dictionary of Statistics (4th ed.). London: Cambridge University Press.

Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE). (2020). Novel Coronavirus (COVID-19) Cases Data. Retrieved from Humanitarian Data Exchange: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

Kirk, A. (2016). Data Visualization: A Handbook for Data Driven Design (2nd ed.). Thousand Oaks, CA: Sage Publications, Ltd.

Minelli, M., Chambers, M., & Diraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st ed.). New York: Wiley.

O’Neil, C., & Schutt, R. (2013). Doing Data Science (3rd ed.). London: O’Reilly.

Revels, M., & Nussbaumer, H. (2013). Data mining and data warehousing in the airline industry. Academy of Business Research Journal, 3, 69-82.

Shaffer, C. A. (2011). Data Structures and Algorithms Analysis. Mineola: Dover.

Ulf-Dietrich, R., & Uwe, M. (2014). Mining “Big Data” Using Big Data Services. International Journal of Internet Science, 1(1), 1-8.

Vicenc, T. (2017). Studies in Big Data (1st ed.). Chicago: Springer International Publishing.

Witten, I. H. (2011). Data Mining: Practical Machine Learning Tools (3rd ed.). Sydney : Morgan Kaufmann.

Zekic-Susac, M., & Has, A. (2015). Data mining as support to knowledge management in marketing. Business Systems Research, 6(2), 18-30.