library(readr)
library(dplyr)
library(tidyr)
library(forecast)
First the three datasets were downloaded. The datasets were then loaded to R Studio using read_csv() function of “readr” package. Out of 3 datasets, 2 datasets doesn’t comply with the tidy data principles. The untidy datasets were then cleaned using gather() function of “tidyr” package.The variables were analyzed and then variable of interest were chosen. The chosen variables were then converted to appropriate data types and appropriate labels were added to factor variable, then the 3 datasets were merged. The final dataset had 6 variables, the dataset was then subset using select() and filter() function based on need and a new variables were created using mutate() function of “dplyr” package. All the variables were then scanned for missing values, the numeric variables were scanned for special values and obvious errors. In the end the numeric variables were also scanned for outlier.
The three datasets I have used are COVID-19 datasets, downloaded from https://ourworldindata.org/coronavirus-source-data. These datasets are based on the data from the European Centre for Disease Prevention and Control related to confirmed cases and deaths. The description for three datasets are as follows:
Variable of interest:
Reading all datsets using read_csv() and displaying the first few observations using head().
# Reading data from locations.csv
locations <- read_csv("locations.csv")
# View first few observations
head(locations)
# Reading data from total_cases.csv
total_cases <- read_csv("total_cases.csv")
# View first few observations
head(total_cases)
# Reading data from total_cases.csv
new_cases <- read_csv("new_cases.csv")
# View first few observations
head(new_cases)
Understanding locations dataset:
From this dataset only 3 variables are of interest, so only those variables were selected using select(). On checking the structure, “continent” variable was found to be character so converted it to factor and then labeled. The “population” variable was converted to integer from double. The “location” variable was left to be character.
# Checking the structure of locations dataframe
str(locations)
## tibble [210 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ countriesAndTerritories: chr [1:210] "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ location : chr [1:210] "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ continent : chr [1:210] "Asia" "Europe" "Africa" "Europe" ...
## $ population_year : num [1:210] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ population : num [1:210] 38928341 2877800 43851043 77265 32866268 ...
## - attr(*, "spec")=
## .. cols(
## .. countriesAndTerritories = col_character(),
## .. location = col_character(),
## .. continent = col_character(),
## .. population_year = col_double(),
## .. population = col_double()
## .. )
# Selecting only the required variables
locations <- locations %>% select(location, continent, population)
# Converting continent variable to factor and adding labels
locations$continent <- locations$continent %>% factor(levels = c("Africa", "Asia", "Europe", "North America", "Oceania", "South America"))
# Checking levels
levels(locations$continent)
## [1] "Africa" "Asia" "Europe" "North America"
## [5] "Oceania" "South America"
# Checking the type of population variable
typeof(locations$population)
## [1] "double"
# Converting the population variable to integer
locations$population <- locations$population %>% as.integer()
Understanding total_cases & new_cases datasets:
These two datasets are untidy. The structure of the dataframes were checked using str(), but the outputs are hidden as it had lots of columns.Both the datasets has “location”/“countries” variable spread among 200 columns which is supposed to be character, one “date” variable and “Total Cases”/“New Cases” variable which is supposed to be an integer. Therefore, both the datasets has 3 variables each. Tidying these datasets is covered in the next section “Tidy & Manipulate Data I”.
# Checking the structure of locations dataframe
str(total_cases)
str(new_cases)
Removing the column “World”, this column contains the number of total cases and number of new cases in the whole world. It is not our variable of interest and as we have the information of all the countries we can simply add them and get the numbers for this if we want it later.
# Removing the column "World" from both datasets
total_cases <- total_cases %>% select(-World)
new_cases <- new_cases %>% select(-World)
The total_cases and new_cases dataset is untidy because it violates the first condition of “Tidy Data Principles”. The first condition states that each variable must have its own column, but both the datasets has “Countries” variable spread among 200 column headers.
Tidying datasets
First the values in 200 column header were gathered into “Countries” variable i.e. transformed data from wide format to long format. Then checked the structure of total_cases and new_cases datasets. The new variables “Total Cases” and “New Cases”, so converted it to integer. The “date” variable was already in a date format and “Countries” are character. After transforming we are left with 3 variables for both the datasets.
# Applying gather() function to create a "Countries" and "Total Cases" Variable in total_cases dataset
total_cases <- total_cases %>% gather((Afghanistan:Zimbabwe), key = "Countries", value = "Total Cases")
# Cheking the structure of total_cases dataset after gather()
str(total_cases)
## tibble [62,034 x 3] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:62034], format: "2019-12-31" "2020-01-01" ...
## $ Countries : chr [1:62034] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Total Cases: num [1:62034] NA NA NA NA NA NA NA NA NA NA ...
# Checking the type of "Total Cases" variable
typeof(total_cases$`Total Cases`)
## [1] "double"
# Coverting the "Total Cases" to integer from double
total_cases$`Total Cases` <- total_cases$`Total Cases` %>% as.integer()
# View the first few observations of total_cases dataset after tidying
head(total_cases)
# Applying gather() function to create a "Countries" and "Total Cases" Variable in new_cases dataset
new_cases <- new_cases %>% gather((Afghanistan:Zimbabwe), key = "Countries", value = "New Cases")
# Cheking the structure of new_cases dataset after gather()
str(new_cases)
## tibble [62,034 x 3] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:62034], format: "2019-12-31" "2020-01-01" ...
## $ Countries: chr [1:62034] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ New Cases: num [1:62034] 0 0 0 0 0 0 0 0 0 0 ...
# Checking the type of "New Cases" variable
typeof(new_cases$`New Cases`)
## [1] "double"
# Coverting the "Total Cases" to integer from double
new_cases$`New Cases` <- new_cases$`New Cases` %>% as.integer()
# View the first few observations of new_cases dataset after tidying
head(new_cases)
The datasets are tidy, we can merge all the 3 datasets. First the total_cases and new_cases datasets are merged based on “date” and “Countries” variables using full_join (retains all values, all rows from both datasets). With this dataset we get information about the total cases and new cases on a particular date for a given country. Now, this merged dataset can be merged with the locations dataset based on “Countries” variable which is the same as “location” variable in locations dataset using left_join (retains all observations from merged_cases). Now, the final merged dataset contains additional variables, i.e “Continent” and “Population”.
All the variables in the final merged dataset are in the desired format because the conversion has already been done before. The final dataset now contains 6 variables. The details of these variables has already been discussed in the “Data” section.
# Merging total_cases and new_cases
merged_cases <- full_join(total_cases, new_cases, by=c("date", "Countries"))
# Merging the previous merged dataset with the locations dataset
merged_dataset <- merged_cases %>% left_join(locations, c("Countries" = "location"))
# View the first few observations of the final merged dataset
head(merged_dataset)
# Checking the structure of the final merged dataset
str(merged_dataset)
## tibble [62,034 x 6] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:62034], format: "2019-12-31" "2020-01-01" ...
## $ Countries : chr [1:62034] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Total Cases: int [1:62034] NA NA NA NA NA NA NA NA NA NA ...
## $ New Cases : int [1:62034] 0 0 0 0 0 0 0 0 0 0 ...
## $ continent : Factor w/ 6 levels "Africa","Asia",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ population : int [1:62034] 38928341 38928341 38928341 38928341 38928341 38928341 38928341 38928341 38928341 38928341 ...
Now, we will focus only on the data from Australia and dates from 1/09/2020, so sub-setting the data using filter() function. Two additional variables are created using mutate() function.
# Creating a subset of data for only Australia
aus <- merged_dataset %>% filter(Countries == "Australia")
# Filtering dates > 1/09/2020
aus <- aus %>% filter(date >= "2020-09-01")
# Creating a new variable "Cases increased in %" based on "Total Cases" and "New Cases" variables
aus <- aus %>% mutate(`Cases increased in %` = round((`New Cases`/(`Total Cases` - `New Cases`)) * 100, 3))
# Creating a new variable "Population Affected in %" based on "Total Cases" and "population" variables
aus <- aus %>% mutate(`Population Affected in %` = round((`Total Cases`/population) * 100, 3))
# View the first few observations of the dataset
head(aus)
First, missing values are scanned for all the 6 variables. Secondly, for numeric variables the special values were scanned. Thirdly, a function to check negative numbers is written and then checked for negative numbers for all the numeric variables because the values for all the numeric variable must not be negative. As there were not missing values, special values and obvious errors, handling them was not needed.
# Checking whether there are any missing values in each columns
colSums(is.na(aus))
## date Countries Total Cases
## 0 0 0
## New Cases continent population
## 0 0 0
## Cases increased in % Population Affected in %
## 0 0
# Function that checks presence of special values like -Inf, Inf or NaN
# Reference: Code from Module 5
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
# Checking whether there are any special values in "Total Cases" column
sum(is.special(aus$`Total Cases`))
## [1] 0
# Checking whether there are any special values in "New Cases" column
sum(is.special(aus$`New Cases`))
## [1] 0
# Checking whether there are any special values in "Cases increased in %" column
sum(is.special(aus$`Cases increased in %`))
## [1] 0
# Checking whether there are any special values in "Population Affected in %" column
sum(is.special(aus$`Population Affected in %`))
## [1] 0
# Function that checks presenceo of negative numbers
is.negative <- function(x){
if (is.numeric(x)) (x < 0)
}
# Checking whether there are negative numbers in "Total Cases" column
sum(is.negative(aus$`Total Cases`))
## [1] 0
# Checking whether there are negative numbers in "New Cases" column
sum(is.negative(aus$`New Cases`))
## [1] 0
# Checking whether there are negative numbers in "Cases increased in %" column
sum(is.negative(aus$`Cases increased in %`))
## [1] 0
# Checking whether there are negative numbers in "Population in %" column
sum(is.negative(aus$`Population Affected in %`))
## [1] 0
All the numeric variables were plotted using histogram to check whether the data is distributed normally or not. No variables were normally distributed, therefore z-score method is not preferred for outlier detection.
A box plot has been plotted for all the numeric variables to check outliers. There are found to be 1 to 2 outliers for three variables but on looking at the box plot we can say that there are not outliers but the possible values, which doesn’t lie in between −1.5×IQR and 1.5×IQR. So, it doesn’t require to be handled.
# Check distribution of "Total Cases"
hist(aus$`Total Cases`)
# Check distribution of "New Cases"
hist(aus$`New Cases`)
# Check distribution of "Cases increased in %"
hist(aus$`Cases increased in %`)
# Check disstribution of "Population Affected in %"
hist(aus$`Population Affected in %`)
# Plot a box plot form "Total Cases" column to check outliers
boxplot(aus$`Total Cases`, ylab = "Total Cases")
# Plot a box plot form "New Cases" column to check outliers
boxplot(aus$`New Cases`, ylab = "New Cases")
# Plot a box plot form "Cases increased in %" column to check outliers
boxplot(aus$`Cases increased in %`, ylab = "Cases Increased (%)")
# Plot a box plot form "Population Affected in %" column to check outliers
boxplot(aus$`Population Affected in %`, ylab = "Population Affected (%)")
Looking at the histogram of “Cases increase in %” we can see it’s right skewed. In order to convert the distribution to normal log10, log and BoxCox transformation was applied. However log transformation gave the best out of three.
hist(aus$`Cases increased in %`, main = "Histogram of 'Cases Increase in %' before Transformation")
transformed1 <- log10(aus$`Cases increased in %`)
hist(transformed1, main = "Histogram of 'Cases Increase in %' after log10 Transformation")
transformed2 <- log(aus$`Cases increased in %`)
hist(transformed2, main = "Histogram of 'Cases Increase in %' after log Transformation")
transformed3 <- BoxCox(aus$`Cases increased in %`, lambda = "auto")
hist(transformed3, main = "Histogram of 'Cases Increase in %' after BoxCox Transformation")