library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
There are two data sets chosen for the analysis purpose. The first data set represents the daily COVID cases of different countries whereas the second dataset represents the GDP of different countries. The columns with all null values and unnecessary columns were dropped from both data sets. All the variables in both data sets were converted into different classes.
In the first data set, the new variable active cases was mutated by subtracting confirmed, death and recovered cases. The negative values of variable active cases were filled by subtracting all present values with previous values. Moreover, the new variable of quarter was mutated from the date variable.
The second dataset had repeated characters in a string variable which were ignored using R base functions. Also, the dataset was filtered to match 2020 entries since COVID data set is only applicable for 2020th year.
Spread function was used to spread the key category to value to tidy the data in wider format so that all entries could be matched between both data sets for merging. Finally, both the data sets were merged using inner join to match entries from both sides. Box plot was used to detect the outliers and histogram was used for visual check on using the log transformation.
The first data set is sourced from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series and is maintained by Johns Hopkins University.
The second data set is sourced from https://stats.oecd.org/index.aspx?queryid=33940 which is maintained by Organisation for Economic Co-operation and Development.
Both the data sets are untidy as they include some variables which are not useful for analysis. Both the data sets have repeated values which increases the number of observations and so are transformed into proper structure or format.
The first data set includes:-
Date: Date for the daily no of cases.
Country : The names of the different countries.
Lat and long: Latitude and longitude of the countries.
type: Different types of cases i.e. Confirmed, Death or Recovered.
Cases: The number of cases.
The second data set includes:-
Location: Acronym’s of different countries
Country: The names of the different countries
Measure: GPSA or GYSA
where GPSA = Economic growth rate compared to previous quarter
GYSA = Economic growth rate compared to same quarter of the previous year
Frequency: Represents value as annual or quarterly
Time/Period: The year and the quarter of that year
Value: The GDP value
# Importing first data set
covid <- read_csv("C:/Users/winuser/Downloads/coronavirus.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## province = col_logical(),
## country = col_character(),
## lat = col_double(),
## long = col_double(),
## type = col_character(),
## cases = col_double()
## )
## Warning: 62059 parsing failures.
## row col expected actual file
## 50407 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50408 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50409 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50410 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50411 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## ..... ........ .................. ....... ............................................
## See problems(...) for more details.
head(covid)
## # A tibble: 6 x 7
## date province country lat long type cases
## <date> <lgl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 2020-01-22 NA Afghanistan 33.9 67.7 confirmed 0
## 2 2020-01-23 NA Afghanistan 33.9 67.7 confirmed 0
## 3 2020-01-24 NA Afghanistan 33.9 67.7 confirmed 0
## 4 2020-01-25 NA Afghanistan 33.9 67.7 confirmed 0
## 5 2020-01-26 NA Afghanistan 33.9 67.7 confirmed 0
## 6 2020-01-27 NA Afghanistan 33.9 67.7 confirmed 0
# dropping not needed columns
covid <- select(covid, -matches("province|lat|long"))
head(covid)
## # A tibble: 6 x 4
## date country type cases
## <date> <chr> <chr> <dbl>
## 1 2020-01-22 Afghanistan confirmed 0
## 2 2020-01-23 Afghanistan confirmed 0
## 3 2020-01-24 Afghanistan confirmed 0
## 4 2020-01-25 Afghanistan confirmed 0
## 5 2020-01-26 Afghanistan confirmed 0
## 6 2020-01-27 Afghanistan confirmed 0
# Importing second data set
GDPWorld_un <- read_csv("C:/Users/winuser/Downloads/QNA_19102020172848076.csv")
## Parsed with column specification:
## cols(
## LOCATION = col_character(),
## Country = col_character(),
## SUBJECT = col_character(),
## Subject = col_character(),
## MEASURE = col_character(),
## Measure = col_character(),
## FREQUENCY = col_character(),
## Frequency = col_character(),
## TIME = col_character(),
## Period = col_character(),
## `Unit Code` = col_character(),
## Unit = col_character(),
## `PowerCode Code` = col_double(),
## PowerCode = col_character(),
## `Reference Period Code` = col_logical(),
## `Reference Period` = col_logical(),
## Value = col_double(),
## `Flag Codes` = col_character(),
## Flags = col_character()
## )
head(GDPWorld_un)
## # A tibble: 6 x 19
## LOCATION Country SUBJECT Subject MEASURE Measure FREQUENCY Frequency TIME
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 AUS Austra~ B1_GE Gross ~ GPSA Growth~ A Annual 2009
## 2 AUS Austra~ B1_GE Gross ~ GPSA Growth~ A Annual 2010
## 3 AUS Austra~ B1_GE Gross ~ GPSA Growth~ A Annual 2011
## 4 AUS Austra~ B1_GE Gross ~ GPSA Growth~ A Annual 2012
## 5 AUS Austra~ B1_GE Gross ~ GPSA Growth~ A Annual 2013
## 6 AUS Austra~ B1_GE Gross ~ GPSA Growth~ A Annual 2014
## # ... with 10 more variables: Period <chr>, `Unit Code` <chr>, Unit <chr>,
## # `PowerCode Code` <dbl>, PowerCode <chr>, `Reference Period Code` <lgl>,
## # `Reference Period` <lgl>, Value <dbl>, `Flag Codes` <chr>, Flags <chr>
# Dropping unnecessary columns including columns with null values
GDPWorld_un <- select(GDPWorld_un, matches("Country|MEASURE|TIME|Value", ignore.case = FALSE))
# converting column headers to lower case
names(GDPWorld_un) <- tolower(names(GDPWorld_un))
head(GDPWorld_un)
## # A tibble: 6 x 4
## country measure time value
## <chr> <chr> <chr> <dbl>
## 1 Australia GPSA 2009 1.89
## 2 Australia GPSA 2010 2.45
## 3 Australia GPSA 2011 2.73
## 4 Australia GPSA 2012 3.93
## 5 Australia GPSA 2013 2.10
## 6 Australia GPSA 2014 2.55
The first data had some negative values,
Confirmed cases have negative values which means no of patients tested and found are not confirmed so those values would be 0.
Recovered cases have negative values which means no of patients tested again could be covid positive (confirmed).
Death cases have negative values which means it could be possible that negative deaths are patients who are not dead are still alive with covid severe condition(confirmed).
Likewise, the second data set had negative GDP values. Considering GDP growth could have negative values the negative values weren’t changed.
The variables were converted into date, factor and numeric class. The factor levels were checked to see if they are properly ordered.
# First Data set [COVID]
# total days with negative numbers reported
count <- covid[covid$cases < 0, ]
count %>% summarise(total_negative_cases = n())
## # A tibble: 1 x 1
## total_negative_cases
## <int>
## 1 359
# checking format of negative cases with respect to types
covid %>% filter(date == "2020-10-08") %>% filter(country == "Angola")
## # A tibble: 3 x 4
## date country type cases
## <date> <chr> <chr> <dbl>
## 1 2020-10-08 Angola confirmed 233
## 2 2020-10-08 Angola death -3
## 3 2020-10-08 Angola recovered 37
covid %>% filter(date == "2020-08-28") %>% filter(country == "Angola")
## # A tibble: 3 x 4
## date country type cases
## <date> <chr> <chr> <dbl>
## 1 2020-08-28 Angola confirmed 56
## 2 2020-08-28 Angola death 1
## 3 2020-08-28 Angola recovered -307
# Changing data types to factor, dates and numeric
covid$date <- as.Date(covid$date)
covid$country <- as.factor(covid$country)
covid$type <- as.factor(covid$type)
covid[, 4] <- sapply(covid[, 4], as.numeric)
# checking the data types
sapply(covid, class)
## date country type cases
## "Date" "factor" "factor" "numeric"
# Checking if factors are labeled correctly
levels(covid$country)
## [1] "Afghanistan" "Albania"
## [3] "Algeria" "Andorra"
## [5] "Angola" "Antigua and Barbuda"
## [7] "Argentina" "Armenia"
## [9] "Australia" "Austria"
## [11] "Azerbaijan" "Bahamas"
## [13] "Bahrain" "Bangladesh"
## [15] "Barbados" "Belarus"
## [17] "Belgium" "Belize"
## [19] "Benin" "Bhutan"
## [21] "Bolivia" "Bosnia and Herzegovina"
## [23] "Botswana" "Brazil"
## [25] "Brunei" "Bulgaria"
## [27] "Burkina Faso" "Burma"
## [29] "Burundi" "Cabo Verde"
## [31] "Cambodia" "Cameroon"
## [33] "Canada" "Central African Republic"
## [35] "Chad" "Chile"
## [37] "China" "Colombia"
## [39] "Comoros" "Congo (Brazzaville)"
## [41] "Congo (Kinshasa)" "Costa Rica"
## [43] "Cote d'Ivoire" "Croatia"
## [45] "Cuba" "Cyprus"
## [47] "Czechia" "Denmark"
## [49] "Diamond Princess" "Djibouti"
## [51] "Dominica" "Dominican Republic"
## [53] "Ecuador" "Egypt"
## [55] "El Salvador" "Equatorial Guinea"
## [57] "Eritrea" "Estonia"
## [59] "Eswatini" "Ethiopia"
## [61] "Fiji" "Finland"
## [63] "France" "Gabon"
## [65] "Gambia" "Georgia"
## [67] "Germany" "Ghana"
## [69] "Greece" "Grenada"
## [71] "Guatemala" "Guinea"
## [73] "Guinea-Bissau" "Guyana"
## [75] "Haiti" "Holy See"
## [77] "Honduras" "Hungary"
## [79] "Iceland" "India"
## [81] "Indonesia" "Iran"
## [83] "Iraq" "Ireland"
## [85] "Israel" "Italy"
## [87] "Jamaica" "Japan"
## [89] "Jordan" "Kazakhstan"
## [91] "Kenya" "Korea, South"
## [93] "Kosovo" "Kuwait"
## [95] "Kyrgyzstan" "Laos"
## [97] "Latvia" "Lebanon"
## [99] "Lesotho" "Liberia"
## [101] "Libya" "Liechtenstein"
## [103] "Lithuania" "Luxembourg"
## [105] "Madagascar" "Malawi"
## [107] "Malaysia" "Maldives"
## [109] "Mali" "Malta"
## [111] "Mauritania" "Mauritius"
## [113] "Mexico" "Moldova"
## [115] "Monaco" "Mongolia"
## [117] "Montenegro" "Morocco"
## [119] "Mozambique" "MS Zaandam"
## [121] "Namibia" "Nepal"
## [123] "Netherlands" "New Zealand"
## [125] "Nicaragua" "Niger"
## [127] "Nigeria" "North Macedonia"
## [129] "Norway" "Oman"
## [131] "Pakistan" "Panama"
## [133] "Papua New Guinea" "Paraguay"
## [135] "Peru" "Philippines"
## [137] "Poland" "Portugal"
## [139] "Qatar" "Romania"
## [141] "Russia" "Rwanda"
## [143] "Saint Kitts and Nevis" "Saint Lucia"
## [145] "Saint Vincent and the Grenadines" "San Marino"
## [147] "Sao Tome and Principe" "Saudi Arabia"
## [149] "Senegal" "Serbia"
## [151] "Seychelles" "Sierra Leone"
## [153] "Singapore" "Slovakia"
## [155] "Slovenia" "Solomon Islands"
## [157] "Somalia" "South Africa"
## [159] "South Sudan" "Spain"
## [161] "Sri Lanka" "Sudan"
## [163] "Suriname" "Sweden"
## [165] "Switzerland" "Syria"
## [167] "Taiwan*" "Tajikistan"
## [169] "Tanzania" "Thailand"
## [171] "Timor-Leste" "Togo"
## [173] "Trinidad and Tobago" "Tunisia"
## [175] "Turkey" "Uganda"
## [177] "Ukraine" "United Arab Emirates"
## [179] "United Kingdom" "Uruguay"
## [181] "US" "Uzbekistan"
## [183] "Venezuela" "Vietnam"
## [185] "West Bank and Gaza" "Western Sahara"
## [187] "Yemen" "Zambia"
## [189] "Zimbabwe"
levels(covid$type)
## [1] "confirmed" "death" "recovered"
#check structure and attributes
str(covid)
## tibble [213,548 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:213548], format: "2020-01-22" "2020-01-23" ...
## $ country: Factor w/ 189 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 3 levels "confirmed","death",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ cases : num [1:213548] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "problems")= tibble [62,059 x 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:62059] 50407 50408 50409 50410 50411 50412 50413 50414 50415 50416 ...
## ..$ col : chr [1:62059] "province" "province" "province" "province" ...
## ..$ expected: chr [1:62059] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
## ..$ actual : chr [1:62059] "Alberta" "Alberta" "Alberta" "Alberta" ...
## ..$ file : chr [1:62059] "'C:/Users/winuser/Downloads/coronavirus.csv'" "'C:/Users/winuser/Downloads/coronavirus.csv'" "'C:/Users/winuser/Downloads/coronavirus.csv'" "'C:/Users/winuser/Downloads/coronavirus.csv'" ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. province = col_logical(),
## .. country = col_character(),
## .. lat = col_double(),
## .. long = col_double(),
## .. type = col_character(),
## .. cases = col_double()
## .. )
# Second data set [GDP]
# checking the out put before making some changes
head(GDPWorld_un)
## # A tibble: 6 x 4
## country measure time value
## <chr> <chr> <chr> <dbl>
## 1 Australia GPSA 2009 1.89
## 2 Australia GPSA 2010 2.45
## 3 Australia GPSA 2011 2.73
## 4 Australia GPSA 2012 3.93
## 5 Australia GPSA 2013 2.10
## 6 Australia GPSA 2014 2.55
# changing the time column name to quarter
names(GDPWorld_un)[names(GDPWorld_un) == "time"] <- "quarter"
head(GDPWorld_un)
## # A tibble: 6 x 4
## country measure quarter value
## <chr> <chr> <chr> <dbl>
## 1 Australia GPSA 2009 1.89
## 2 Australia GPSA 2010 2.45
## 3 Australia GPSA 2011 2.73
## 4 Australia GPSA 2012 3.93
## 5 Australia GPSA 2013 2.10
## 6 Australia GPSA 2014 2.55
# Changing data types to factor, dates and numeric
GDPWorld_un$country <- as.factor(GDPWorld_un$country)
GDPWorld_un$measure <- as.factor(GDPWorld_un$measure)
# checking the data types
sapply(GDPWorld_un, class)
## country measure quarter value
## "factor" "factor" "character" "numeric"
# Checking if factors are labeled correctly
levels(GDPWorld_un$country)
## [1] "Argentina"
## [2] "Australia"
## [3] "Brazil"
## [4] "Canada"
## [5] "China (People's Republic of)"
## [6] "Euro area (19 countries)"
## [7] "European Union – 27 countries (from 01/02/2020)"
## [8] "France"
## [9] "G20"
## [10] "Germany"
## [11] "India"
## [12] "Indonesia"
## [13] "Italy"
## [14] "Japan"
## [15] "Korea"
## [16] "Mexico"
## [17] "Russia"
## [18] "Saudi Arabia"
## [19] "South Africa"
## [20] "Turkey"
## [21] "United Kingdom"
## [22] "United States"
#check structure and attributes
str(GDPWorld_un)
## tibble [2,443 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country: Factor w/ 22 levels "Argentina","Australia",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ measure: Factor w/ 2 levels "GPSA","GYSA": 1 1 1 1 1 1 1 1 1 1 ...
## $ quarter: chr [1:2443] "2009" "2010" "2011" "2012" ...
## $ value : num [1:2443] 1.89 2.45 2.73 3.93 2.1 ...
## - attr(*, "spec")=
## .. cols(
## .. LOCATION = col_character(),
## .. Country = col_character(),
## .. SUBJECT = col_character(),
## .. Subject = col_character(),
## .. MEASURE = col_character(),
## .. Measure = col_character(),
## .. FREQUENCY = col_character(),
## .. Frequency = col_character(),
## .. TIME = col_character(),
## .. Period = col_character(),
## .. `Unit Code` = col_character(),
## .. Unit = col_character(),
## .. `PowerCode Code` = col_double(),
## .. PowerCode = col_character(),
## .. `Reference Period Code` = col_logical(),
## .. `Reference Period` = col_logical(),
## .. Value = col_double(),
## .. `Flag Codes` = col_character(),
## .. Flags = col_character()
## .. )
Merging the data sets required common unique values in both the data sets. The first data set had dates which was converted mutating it into quarterly values. The data was grouped by country, type and quarter to sum all the number of cases. After summarizing the cases the type category was spread with cases value. In short, the longer data was transformed to wider data. A new active variable was mutated by subtracting confirmed, death and recovered.
There were some negative values. This happens because the past active patients might have been recovered in present/future and so recovered counts are higher. So, present cases were subtracted by the previous cases.
# Since dates are used in covid dataset sub setting the dates into quarterly for GDP dataset
# sub setting data quarterly
covid$quarter <- quarters(covid$date)
head(covid)
## # A tibble: 6 x 5
## date country type cases quarter
## <date> <fct> <fct> <dbl> <chr>
## 1 2020-01-22 Afghanistan confirmed 0 Q1
## 2 2020-01-23 Afghanistan confirmed 0 Q1
## 3 2020-01-24 Afghanistan confirmed 0 Q1
## 4 2020-01-25 Afghanistan confirmed 0 Q1
## 5 2020-01-26 Afghanistan confirmed 0 Q1
## 6 2020-01-27 Afghanistan confirmed 0 Q1
# Wider format
# spreading the cases type and mutating active column
covidW <- covid %>%
select(country, type, cases, quarter) %>%
group_by(country, type, quarter) %>%
summarise(cases = sum(cases)) %>%
spread(key = type, value = cases) %>%
mutate(active = confirmed - death - recovered) %>%
arrange(country)
head(covidW)
## # A tibble: 6 x 6
## # Groups: country [2]
## country quarter confirmed death recovered active
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Q1 174 4 5 165
## 2 Afghanistan Q2 31343 742 14126 16475
## 3 Afghanistan Q3 7751 712 18658 -11619
## 4 Afghanistan Q4 932 34 825 73
## 5 Albania Q1 243 15 52 176
## 6 Albania Q2 2292 47 1407 838
# active cases are negative. This happens because the past active patients might have been recovered in present/future and so recovered counts are higher.
# subtracting previous to current active cases.
covidW <- covidW %>%
mutate(active = abs((active) - lag(active, default = first(country))))
head(covidW)
## # A tibble: 6 x 6
## # Groups: country [2]
## country quarter confirmed death recovered active
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Q1 174 4 5 164
## 2 Afghanistan Q2 31343 742 14126 16310
## 3 Afghanistan Q3 7751 712 18658 28094
## 4 Afghanistan Q4 932 34 825 11692
## 5 Albania Q1 243 15 52 174
## 6 Albania Q2 2292 47 1407 662
# changing quarter to factor
covidW$quarter <- as.factor(covidW$quarter)
The time variable was needed to be in proper format to match with the quarter variable of first data set. The data set had entries of multiple years. Thus, filtering was done using grep function to get the entries of 2020 year as COVID cases are of 2020 year. Now, the extracted entries had repeated 2020 which were ignored using substring function.
Each measure GPSA or GYSA had 2 different quarter entries for every country which would be 4. So, there would have been multiple entries while merging as it is with first data set. To avoid multiple entries, the key measure was spread with the GDP value using spread function.
# filtering only 2020 year's related entries
GDPWorld_un <- GDPWorld_un %>% filter(grepl("2020-", quarter)) %>% arrange(country)
head(GDPWorld_un)
## # A tibble: 6 x 4
## country measure quarter value
## <fct> <fct> <chr> <dbl>
## 1 Argentina GPSA 2020-Q1 -4.16
## 2 Argentina GPSA 2020-Q2 -16.2
## 3 Argentina GYSA 2020-Q1 -4.65
## 4 Argentina GYSA 2020-Q2 -19.8
## 5 Australia GPSA 2020-Q1 -0.262
## 6 Australia GPSA 2020-Q2 -7.00
# now changing it into quarter format
GDPWorld_un$quarter <- substr(GDPWorld_un$quarter, 6, 7)
head(GDPWorld_un)
## # A tibble: 6 x 4
## country measure quarter value
## <fct> <fct> <chr> <dbl>
## 1 Argentina GPSA Q1 -4.16
## 2 Argentina GPSA Q2 -16.2
## 3 Argentina GYSA Q1 -4.65
## 4 Argentina GYSA Q2 -19.8
## 5 Australia GPSA Q1 -0.262
## 6 Australia GPSA Q2 -7.00
# now changing quarter to factor
GDPWorld_un$quarter <- as.factor(GDPWorld_un$quarter)
# Since there are 2 measures spreading the measure values
# GPSA = Economic growth rate compared to previous quarter
# GYSA = Economic growth rate compared to same quarter of the previous year
GDPWorld_un <- GDPWorld_un %>%
select(country, measure, quarter, value) %>%
group_by(country, quarter) %>%
spread(key = measure, value = value)
Null values were checked using sapply function and no null values were found.
Box plots are a great way to visualize the outliers and so box plots were used in this case to detect the outliers.
There were outliers in numeric columns. Using box plot the vector of outliers was created. Then the rows containing those outlier values were displayed. Furthermore, the box plot without outliers was plotted to check if the outliers were removed. The outliers would be eliminated further if model has to be build.
Cap function is used to cap the outliers with percentiles.
# checking for total null values
sapply(covidW, function(x) sum(is.na(x)))
## country quarter confirmed death recovered active
## 0 0 0 0 0 0
# detecting outliers. Found in 4 types of cases
boxplot(covidW$confirmed)$out
## [1] 37623 63476 686471 238679 44106 145432 217996 61966 48652
## [10] 57025 103801 102092 1396324 3408894 413427 97570 55010 39697
## [19] 276549 183598 82279 96940 731833 129893 72301 58809 103122
## [28] 79641 54192 80615 67601 34887 69522 52827 149236 401968
## [37] 272311 71808 123610 97495 75758 73650 57342 584084 5727103
## [46] 1237689 54857 230623 74859 44605 183057 229557 73161 48415
## [55] 313872 63653 220250 57615 105792 134786 74283 99380 64976
## [64] 85671 45906 58987 41163 37856 224874 517127 108011 36429
## [73] 111120 49979 64253 54429 37816 75505 108215 39878 58515
## [82] 211352 99336 79045 38537 284148 526555 53781 35430 274180
## [91] 44924 57121 84252 95307 100602 52816 644592 523870 220025
## [100] 189260 143782 42981 149856 523130 95923 153348 519917 167372
## [109] 63090 186375 118757 44609 169192 92855 48003 45523 38815
## [118] 246401 170630 269446 192091 2441780 4599171 921552 48214 69290
## [127] 37471
# Extracting those outliers in a vector
co <- boxplot(covidW$confirmed, plot=FALSE)$out
# Getting rows where those outliers are present. Since there are lot of observations the line is commented
# covidW[which(covidW$confirmed %in% co),]
#checking box plot without outliers
boxplot(which(covidW$confirmed %in% co))
# using cap function to remove outliers
cap <- function(x){
quantiles <- quantile( x, c(0.05, 0.25, 0.75, 0.95))
iqr <- IQR(x)
x[x < quantiles[2] - 1.5*iqr] <- quantiles[1]
x[x > quantiles[3] + 1.5*iqr] <- quantiles[4]
x}
# Replacing extreme values with percentiles using cap function
covidL = cap(covidW$confirmed)
Null values were checked using sapply function and no null values were found.
Box plots are a great way to visualize the outliers and so box plots were used in this case to detect the outliers.
There were no outliers detected in second data set.
# checking for total null values
sapply(GDPWorld_un, function(x) sum(is.na(x)))
## country quarter GPSA GYSA
## 0 0 0 0
# detecting outliers in GPSA and GYSA. No outliers found
boxplot(GDPWorld_un$GPSA)
boxplot(GDPWorld_un$GYSA)
Merging the data sets using common attributes as country and quarter. The entries only matching both the data sets are joined using inner join.
# checking the structure of both data frames
head(covidW)
## # A tibble: 6 x 6
## # Groups: country [2]
## country quarter confirmed death recovered active
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Q1 174 4 5 164
## 2 Afghanistan Q2 31343 742 14126 16310
## 3 Afghanistan Q3 7751 712 18658 28094
## 4 Afghanistan Q4 932 34 825 11692
## 5 Albania Q1 243 15 52 174
## 6 Albania Q2 2292 47 1407 662
head(GDPWorld_un)
## # A tibble: 6 x 4
## # Groups: country, quarter [6]
## country quarter GPSA GYSA
## <fct> <fct> <dbl> <dbl>
## 1 Argentina Q1 -4.16 -4.65
## 2 Argentina Q2 -16.2 -19.8
## 3 Australia Q1 -0.262 1.56
## 4 Australia Q2 -7.00 -6.26
## 5 Brazil Q1 -2.45 -1.43
## 6 Brazil Q2 -9.69 -11.4
# Merging the data sets
World <- inner_join(covidW, GDPWorld_un, by = c("country" = "country", "quarter" = "quarter"))
## Warning: Column `country` joining factors with different levels, coercing to
## character vector
## Warning: Column `quarter` joining factors with different levels, coercing to
## character vector
head(World)
## # A tibble: 6 x 8
## # Groups: country [3]
## country quarter confirmed death recovered active GPSA GYSA
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Argentina Q1 1054 27 240 780 -4.16 -4.65
## 2 Argentina Q2 63476 1280 21788 39621 -16.2 -19.8
## 3 Australia Q1 4559 18 358 4174 -0.262 1.56
## 4 Australia Q2 3361 86 6682 7590 -7.00 -6.26
## 5 Brazil Q1 5717 201 127 5365 -2.45 -1.43
## 6 Brazil Q2 1396324 59393 788191 543351 -9.69 -11.4
# Data sets are merged successfully
The histogram of confirmed cases is rightly skewed. Transforming the distribution might be done for better understanding.
Recommended transformation to correct for right skew include:
Log transformation
Natural log transformation
Square root transformation
Reciprocal transformation
In first data set, since most rows are excluded while merging (inner join) data sets, transformation wouldn’t affect the underlying structure of data. Out of all, log transformation values appear more closer to mean.
Since second dataset contains several negative GDP values, Normalization is not done.
# Normal histogram
hist(World$confirmed)
# Using log transformation to check histogram for normalization.
# Since most rows are excluded transformation wouldn't affect the underlying structure
hist(log(World$confirmed))
hist(log10(World$confirmed))
hist(sqrt(World$confirmed))
hist(1/(World$confirmed))
# applying log transformation
World$confirmed <- log(World$confirmed)
# Normal histogram for measures. Normalization is not applied since it contains negative GDP values.
hist(World$GPSA)
[1]“CSSEGISandData/COVID-19”, GitHub, 2020. [Online].
Available: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series.
[Accessed: 20- Oct- 2020].
[2]“Quarterly National Accounts : G20 - Quarterly Growth Rates of GDP in volume”, Stats.oecd.org, 2020. [Online].
Available: https://stats.oecd.org/index.aspx?queryid=33940.
[Accessed: 20- Oct- 2020].