Coronavirus Disease 2019 (COVID-19) has become a major health problem causing severe acute respiratory illness in humans. It has spread rapidly around the globe since its outbreak.The incidence of COVID-19 continues to increase with millions confirmed cases and deaths worldwide. In this assignment I have tried to analyze the impact of covid19 in US using R programming language.Data collected from Kaggle websites.Datasets provides information about covid19 related counts and related hospitals and beds availability over different parts of US.Both the datasets are very large dataset .I tried to implements all key concepts of Data Wrangling like tidying the messy data ,creating new variables using mutate ,scanning missing values ,outliers ,applying transformations and visualization.
# This is the R chunk for the required packages
library(readr)
library(Hmisc)
library(readxl)
library(dplyr)
library(tidyr)
library(Hmisc)
library(outliers)
library(forecast)
library(gdata)
library(rvest)
library(ggplot2)
library(knitr)
library(car)
library(lubridate)
library(magrittr)
This report has been prepared by analyzing Novel CoronaVirus cases for country US.I have used datasets from Kaggle. First dataset is the covid daily cases of US and Second dataset is details of avalability of hospitals beds and related data in US .Merging two dataset , and selecting some of the important columns ,I have applied all wrangling techniques.As the data has been collected from different sources,in various shapes and forms,therefore it is necessary to apply relevant methods to prepare the collection of data by using pre-processing techniques before analyzing and visualizing it.
First dataset ‘us_states_covid19_daily.csv’ contains missing values whereas second dataset ‘hospital_beds_USA_v1.csv’ does not contain any missing values. The following steps has been implemented for this entire assignment:
-> Download datasets into RMarkdown
-> Merging two datasets.
-> Selecting the columns from the merged dataset which are required to carry out further assignment.
-> Data type conversions like converting characters to factors.
-> Tidying up data:
-> Scanning:
-> Transformation using appropriate R function
First dataset ‘us_states_covid19_daily.csv’ contains state wise and date wise daily covid cases counts,confirmed deaths,positive cases and other coulmns which gives a comprehensive details of covid cases in USA for all states.It consists of 54 coulmns .Some of the columns contains missing values and is untidy dataset. Second dataset ‘hospital_beds_USA_v1.csv’ contains details of types of beds ,number of beds ,population and others important variables with state wise details of hospitals in US. This dataset contains 12 coulmns .There are no missing data in this dataset. Both the datasets has been merged using left join by state.Please find the links of two dataset below.
https://www.kaggle.com/sudalairajkumar/covid19-in-usa?select=us_states_covid19_daily.csv
# First dataset
US_covid_data <- read_csv("us_states_covid19_daily.csv")
Parsed with column specification:
cols(
.default = col_double(),
state = col_character(),
dataQualityGrade = col_character(),
lastUpdateEt = col_character(),
dateModified = col_datetime(format = ""),
checkTimeEt = col_character(),
dateChecked = col_datetime(format = ""),
fips = col_character(),
totalTestResultsSource = col_character(),
hash = col_character(),
grade = col_logical()
)
See spec(...) for full column specifications.
head(US_covid_data)
# second dataset
hospital_beds_USA_v1 <- read_csv("hospital_beds_USA_v1.csv")
Parsed with column specification:
cols(
country = col_character(),
state = col_character(),
county = col_character(),
lat = col_double(),
lng = col_double(),
type = col_character(),
measure = col_character(),
beds = col_double(),
population = col_double(),
year = col_double(),
source = col_character(),
source_url = col_character()
)
head(hospital_beds_USA_v1)
First I merged two above dataset using left join by state.The next steps I took is to select the columns with which I want to work from the merged dataset.Therefore I have filtered out the required columns using select function which is a part of dplyr package.Total 18 columns have been selected.Also the dataset seems to have very large number of observations ,so I have decided to work with 75000 observations.
After selection of required columns, I have checked data types of columns. Dataset have various datatypes like character,numeric. I found some columns are in characters which needs to be converted into factor. Therefore I converted them into factors by providing labels and levels. The structure of filtered merged dataset after selecting required columns have been studied using str().
# This is the R chunk for the Understand Section
#MERGE
merged <- left_join( hospital_beds_USA_v1 , US_covid_data , by ="state" )
# SELECT ONLY REQUIRED COLUMNS FROM MERGED DATASET
filtered_merged <-merged %>% select(country,state,type,measure,beds,population,year,date ,positive ,negative,totalTestResults,
hospitalizedCurrently,inIcuCurrently,
onVentilatorCurrently, recovered,death ,hospitalized,
deathConfirmed )
# take 75000 observation fron large dataset
filtered_merged <- filtered_merged [1:75000 ,]
str(filtered_merged)
tibble [75,000 x 18] (S3: tbl_df/tbl/data.frame)
$ country : chr [1:75000] "US" "US" "US" "US" ...
$ state : chr [1:75000] "AK" "AK" "AK" "AK" ...
$ type : chr [1:75000] "ICU" "ICU" "ICU" "ICU" ...
$ measure : chr [1:75000] "1000HAB" "1000HAB" "1000HAB" "1000HAB" ...
$ beds : num [1:75000] 0 0 0 0 0 0 0 0 0 0 ...
$ population : num [1:75000] 3338 3338 3338 3338 3338 ...
$ year : num [1:75000] 2019 2019 2019 2019 2019 ...
$ date : num [1:75000] 20200927 20200926 20200925 20200924 20200923 ...
$ positive : num [1:75000] 8431 8315 8202 7941 7941 ...
$ negative : num [1:75000] 434554 434554 433130 425257 425257 ...
$ totalTestResults : num [1:75000] 442869 442869 441332 433198 433198 ...
$ hospitalizedCurrently: num [1:75000] 43 43 43 43 43 43 47 43 41 36 ...
$ inIcuCurrently : num [1:75000] NA NA NA NA NA NA NA NA NA NA ...
$ onVentilatorCurrently: num [1:75000] 14 14 14 14 14 14 13 13 13 13 ...
$ recovered : num [1:75000] 3502 3267 3042 2731 2731 ...
$ death : num [1:75000] 56 52 51 45 45 45 45 45 45 45 ...
$ hospitalized : num [1:75000] NA NA NA NA NA NA NA NA NA NA ...
$ deathConfirmed : num [1:75000] 56 52 51 45 45 45 45 45 45 45 ...
#DATA TYPE CONVERSION
filtered_merged$type <-factor(filtered_merged$type, levels= c("ACUTE" , "ICU" ,"OTHER" ,"PSYCHIATRIC") ,labels=c("ACUTE" , "ICU" ,"OTHER" ,"PSYCHIATRIC") ,ordered=TRUE)
levels(filtered_merged$type)
[1] "ACUTE" "ICU" "OTHER" "PSYCHIATRIC"
filtered_merged$state <- as.factor(filtered_merged$state)
levels(filtered_merged$state)
[1] "AK" "AL" "AR" "AZ" "CA"
filtered_merged$measure <- as.factor(filtered_merged$measure)
levels(filtered_merged$measure)
[1] "1000HAB"
filtered_merged$country <- as.factor(filtered_merged$country)
levels(filtered_merged$country)
[1] "US"
Dataset ‘filtered_merged’ which was formed after merging two dataset and selection required columns was found to be in improper format and have many missing values.Secondly I aim to find out covid cases details along with hospital beds details for each states.I have found multiple variables like hospital bed types have been stored in rows .These types needs to be converted into columns so that we can know the counts of each type of bed.That is why I have used spread() function to wide format by generating new columns from ‘type’ column .The columns so formed after reshaping are ‘ACUTE’ ,‘ICU’,‘OTHER’,‘PSYCHIATRIC’. This reshaped columns will give the counts of different types of beds available in different states of US .
# This is the R chunk for the Tidy & Manipulate Data I
#RESHAPE
reshape_covid_data <-spread(filtered_merged ,key =type ,value =beds)
head(reshape_covid_data)
NA
I have created two new columns using mutate() function which is a part of dplyr package.
-The first new column is the percentage of positive covid cases which has been calculated using formula : (positive tested cases / total tested cases) *100 .
-The second new column is the percentage of negative covid cases which uses the formula : (negative tested cases / total tested cases) *100 .
# This is the R chunk for the Tidy & Manipulate Data II
#MUTATE
# create new column with percentage of positive covid case = (negative /totalTestResults)*100
reshape_covid_data <-reshape_covid_data %>% mutate(Positive_case_percentage = (positive /totalTestResults)*100 )
# create new column with percentage of negative covid case = (negative /totalTestResults)*100
reshape_covid_data <-reshape_covid_data %>% mutate(Negative_case_percentage = (negative /totalTestResults)*100 )
# After mutate
head(reshape_covid_data)
NA
First phase of scanning involves Scan of the data for missing values, special values and obvious errors (i.e. inconsistencies).
1.First I checked for missing values from original two datasets and then merged filtered dataset. We found some columns with missing values.
2.Imputing some of the columns like ACUTE ,ICU,OTHER ,PSYCHIATRIC with mean value using impute() which is a part of Hmisc package.Assuming as numbers of missing is high so imputing with mean value would ensure consistency and will reduce overall variance. For other missing columns, I have replaced the missing values with mean of that column with mean() so as to retain overall consistent data.
3.Next step was to check for special values like infinite values ,nan values or numeric values. None of the columns has been found with special values.
4.Finally I checked for any inconsistencies and error such as negative values in columns. No such columns has been found with such criteria.
# This is the R chunk for the Scan I
#Scanning for missing values
any(is.na(reshape_covid_data))
[1] TRUE
colSums(is.na(reshape_covid_data))
country state measure population year date
0 0 0 0 0 0
positive negative totalTestResults hospitalizedCurrently inIcuCurrently onVentilatorCurrently
0 375 0 6893 55260 35597
recovered death hospitalized deathConfirmed ACUTE ICU
17784 3196 16488 37043 40770 25120
OTHER PSYCHIATRIC Positive_case_percentage Negative_case_percentage
60326 59868 500 500
# Action on missing values on refined merged data
reshape_covid_data$ACUTE <- impute(reshape_covid_data$ACUTE, fun = mean)
reshape_covid_data$ICU <- impute(reshape_covid_data$ICU, fun = mean)
reshape_covid_data$OTHER <- impute(reshape_covid_data$OTHER, fun = mean)
reshape_covid_data$PSYCHIATRIC <- impute(reshape_covid_data$PSYCHIATRIC, fun = mean)
reshape_covid_data$negative[is.na(reshape_covid_data$negative)] <- mean(reshape_covid_data$negative ,na.rm = TRUE)
reshape_covid_data$hospitalizedCurrently[is.na(reshape_covid_data$hospitalizedCurrently)] <- mean(reshape_covid_data$hospitalizedCurrently ,na.rm = TRUE)
reshape_covid_data$inIcuCurrently[is.na(reshape_covid_data$inIcuCurrently)] <- mean(reshape_covid_data$inIcuCurrently ,na.rm = TRUE)
reshape_covid_data$onVentilatorCurrently[is.na(reshape_covid_data$onVentilatorCurrently)] <-mean(reshape_covid_data$onVentilatorCurrently ,na.rm = TRUE)
reshape_covid_data$recovered[is.na(reshape_covid_data$recovered)] <- mean(reshape_covid_data$recovered ,na.rm = TRUE)
reshape_covid_data$hospitalized[is.na(reshape_covid_data$hospitalized)] <- mean(reshape_covid_data$hospitalized,na.rm = TRUE)
reshape_covid_data$death[is.na(reshape_covid_data$death)] <-mean(reshape_covid_data$death ,na.rm = TRUE)
reshape_covid_data$deathConfirmed[is.na(reshape_covid_data$deathConfirmed)] <- mean(reshape_covid_data$deathConfirmed ,na.rm = TRUE)
reshape_covid_data$Negative_case_percentage[is.na(reshape_covid_data$Negative_case_percentage)] <-mean(reshape_covid_data$Negative_case_percentage,na.rm = TRUE)
reshape_covid_data$Positive_case_percentage[is.na(reshape_covid_data$Positive_case_percentage)] <-mean(reshape_covid_data$Positive_case_percentage,na.rm = TRUE)
colSums(is.na(reshape_covid_data))
country state measure population year date
0 0 0 0 0 0
positive negative totalTestResults hospitalizedCurrently inIcuCurrently onVentilatorCurrently
0 0 0 0 0 0
recovered death hospitalized deathConfirmed ACUTE ICU
0 0 0 0 0 0
OTHER PSYCHIATRIC Positive_case_percentage Negative_case_percentage
0 0 0 0
#Check for special values
#Define function
special_values <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sp_value<-sapply(reshape_covid_data, FUN = special_values)
sapply(reshape_covid_data, function(x){if (is.numeric(x)) sum(special_values(x))})
$country
NULL
$state
NULL
$measure
NULL
$population
[1] 0
$year
[1] 0
$date
[1] 0
$positive
[1] 0
$negative
[1] 0
$totalTestResults
[1] 0
$hospitalizedCurrently
[1] 0
$inIcuCurrently
[1] 0
$onVentilatorCurrently
[1] 0
$recovered
[1] 0
$death
[1] 0
$hospitalized
[1] 0
$deathConfirmed
[1] 0
$ACUTE
[1] 0
$ICU
[1] 0
$OTHER
[1] 0
$PSYCHIATRIC
[1] 0
$Positive_case_percentage
[1] 0
$Negative_case_percentage
[1] 0
##Check for inconsistencies or obvious errors for numerical columns
nonnegative<- function(x){x<0}
nonnegative_values<-sapply(reshape_covid_data[,7:22], FUN = nonnegative)
colSums(nonnegative_values)
positive negative totalTestResults hospitalizedCurrently inIcuCurrently onVentilatorCurrently
0 0 0 0 0 0
recovered death hospitalized deathConfirmed ACUTE ICU
0 0 0 0 0 0
OTHER PSYCHIATRIC Positive_case_percentage Negative_case_percentage
0 0 0 0
Next phase of scanning involves detecting outliers and how to handle them. As the dataset is very large ,therefore there are possibilities of large number of outliers.
First I have taken subsets from above dataset by taking some of the numerical columns to do a better visual inspection of outliers and later transformations.
Then I checked for outliers using boxplot.I found columns ‘recovered’,‘positive,’death’ and ‘hospitalizedCurrently’ have few outliers in it.
Capping or winsorising involves replacing the outliers with the nearest neighbours that are not outliers instead of removing outliers.Since it is not always necessary that outliers are generated because of errors .Sometimes outliers can be legitimate observations.For example positive or death have outlier might mean that positive covid cases or covid death cases are high. Therefore I decided to use cap() instead of removing outliers.
I created cap() function to replace those data points lying outside the lower limit and above the upper limit with the value of 5th percentile and 95th percentile.
Summary of the subset data has been computed before and after capping .
#subset numerical columns into dataset
covid_subset<- reshape_covid_data %>% select( positive ,negative ,totalTestResults,hospitalizedCurrently ,recovered ,Positive_case_percentage ,death)
#CHECK OUTLIERS USING BOXPLOTS
# boxplot for finding outliers
par(mfrow=c(1,2))
boxplot(covid_subset$recovered ,main ="recovered cases")
grid(col = "grey")
boxplot(covid_subset$positive ,main ="positive cases")
grid(col = "grey")
boxplot(covid_subset$hospitalizedCurrently ,main ="hospitalizedCurrently cases")
grid(col = "grey")
boxplot(covid_subset$death ,main ="death cases")
grid(col = "grey")
NA
NA
#CAPPING
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
#Descriptive statistics of the subset
summary(covid_subset)
positive negative totalTestResults hospitalizedCurrently recovered Positive_case_percentage death
Min. : 0 Min. : 0 Min. : 0 Min. : 7.0 Min. : 11 Min. : 0.000 Min. : 0
1st Qu.: 3006 1st Qu.: 45124 1st Qu.: 47398 1st Qu.: 224.0 1st Qu.: 7116 1st Qu.: 6.490 1st Qu.: 85
Median : 16957 Median : 254491 Median : 264492 Median : 509.0 Median :22820 Median : 8.498 Median : 517
Mean : 53391 Mean : 568349 Mean : 617287 Mean : 846.8 Mean :22820 Mean : 9.650 Mean : 1106
3rd Qu.: 66757 3rd Qu.: 602939 3rd Qu.: 673704 3rd Qu.: 846.8 3rd Qu.:30682 3rd Qu.:12.303 3rd Qu.: 1181
Max. :802308 Max. :13531190 Max. :14333498 Max. :8820.0 Max. :72602 Max. :81.579 Max. :15587
#Applying cap() to the numeric variables
positive_capped <- covid_subset$positive %>%cap()
recovered_capped<- covid_subset$recovered %>% cap()
hospitalized_capped <- covid_subset$hospitalizedCurrently %>%cap()
positivePercent_cap<- covid_subset$Positive_case_percentage %>% cap()
totalTestResults_capped <- covid_subset$totalTestResults %>%cap()
death_capped <- covid_subset$death %>%cap()
Covid_capped <- sapply(covid_subset, FUN = cap)
#Descriptive statistics of the capped dataset
summary(Covid_capped)
positive negative totalTestResults hospitalizedCurrently recovered Positive_case_percentage death
Min. : 0 Min. : 0 Min. : 0 Min. : 7.0 Min. : 11 Min. : 0.000 Min. : 0.0
1st Qu.: 3006 1st Qu.: 45124 1st Qu.: 47398 1st Qu.: 224.0 1st Qu.: 7116 1st Qu.: 6.490 1st Qu.: 85.0
Median : 16957 Median : 254491 Median : 264492 Median : 509.0 Median :22820 Median : 8.498 Median : 517.0
Mean : 43113 Mean : 350384 Mean : 389790 Mean : 763.3 Mean :22701 Mean : 8.952 Mean : 947.8
3rd Qu.: 66757 3rd Qu.: 602939 3rd Qu.: 673704 3rd Qu.: 846.8 3rd Qu.:30682 3rd Qu.:12.303 3rd Qu.:1181.0
Max. :185053 Max. :1424675 Max. :1582745 Max. :3454.0 Max. :66003 Max. :20.091 Max. :4626.0
Transformation is required in Data wrangling for scaling,standardizing,normalizing ,reducing skewness of distribution.I have applied different types of transformation like log , log10 ,Box-Cox transformation to the capped data which I got from above steps after capping outliers .After experimenting with different transformation types,I found the best types suitable for each of the variables.
Below are the observations after applying transformation.
hospitalized_capped was right skewed .This variable does not have any negative values .So applying log10 transformation, graph produce more symmetric distribution.
positive_capped was shown as extremely right skewed. This variable also does not have any negative values Applying log transformation ,distribution look approximately more symmetrical.
positivePercent_cap was unevenly distributed and not was symmetric .Here I applied Box-Cox transformation rather than log transformation which produced more symmetric distribution.
death_capped was rightly skewed.Applying Box-Cox transformation produced the best result than other transformation and resulted in approximately symmetrical distribution.
par(mfrow=c(1,2))
hist(hospitalized_capped ,main = " Covid Hospitalized ", xlab = "Covid Hospitalized ")
grid(col = "grey")
# It is highly right skewed
# Applying log10 transformation
log_hospitalized <- log10(hospitalized_capped)
# Histogram after transformation
hist(log_hospitalized ,main = "Log10 transformation", xlab = "Covid Hospitalized " ,col = "bisque" )
grid(col = "grey")
par(mfrow=c(1,2))
hist(positive_capped,main = "Number of covid positive cases ", xlab = "covid_positive_cases")
grid(col = "grey")
# It is highly right skewed
# Applying log transformation
log_positive <- log(positive_capped)
# Histogram after transformation
hist(log_positive ,main = "Log transformation ", xlab = "covid_positive_cases" ,col = "bisque")
grid(col = "grey")
NA
NA
par(mfrow=c(1,2))
hist(positivePercent_cap ,main ="Percentage of positive cases", xlab = "positive case percentage")
grid(col = "grey")
# Applying Box-Cox transformation
boxcox_positive<- BoxCox(positivePercent_cap,lambda = "auto")
hist(boxcox_positive ,main ="Box-Cox transformation", xlab = "positive case perecentage" ,col = "bisque")
grid(col = "grey")
NA
NA
par(mfrow=c(1,2))
hist(death_capped ,main = "Number of Covid deaths ", xlab = "number of Covid deaths ")
grid(col = "grey")
# It is highly right skewed
# Applying boxcox transformation
boxcox_death<- BoxCox(death_capped,lambda = "auto")
# Histogram after transformation
hist(boxcox_death ,main = "BoxCox transformation ", xlab = "number of Covid deaths " ,col = "bisque")
grid(col = "grey")
NA
NA
-From the above assignment,we can see how large messy data sets and be pre-processed using Data Wrangling techniques in R.
-Tidying up data ensures variable have its own columns and observations have its own rows.
-Scanning missing data,special values and outliers is necessary before any statistical analysis so that clean data can be used as input into a variety of other functions that may transform, model, or visualize the data.
-Also we can see that it is not always good idea to eliminate missing data or outliers as it can remove important data from the dataset. Techniques like Imputation ,capping are some of the methods which can be applied to make data more efficent for further use.
-To make better-organized and properly formatted data, transformation is necessary.Not always same transformation type is applicable to all data variables .Each variable can be experiment by applying different transformation method and the best one can be found which ensure maximum possible symmetry and normality.
-Overall this assignment helped me to gain insights into practical aspects of Data Wrangling and understand real world problems.