Data Wrangling A2

Available on Rpubs: https://rpubs.com/NightxWalker/678631

Required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyr)
library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

Executive Summary

There are two data sets chosen for the analysis purpose. The first data set represents the daily COVID cases of different countries whereas the second dataset represents the GDP of different countries. The columns with all null values and unnecessary columns were dropped from both data sets. All the variables in both data sets were converted into different classes.

In the first data set, the new variable active cases was mutated by subtracting confirmed, death and recovered cases. The negative values of variable active cases were filled by subtracting all present values with previous values. Moreover, the new variable of quarter was mutated from the date variable.

The second dataset had repeated characters in a string variable which were ignored using R base functions. Also, the dataset was filtered to match 2020 entries since COVID data set is only applicable for 2020th year.

Spread function was used to spread the key category to value to tidy the data in wider format so that all entries could be matched between both data sets for merging. Finally, both the data sets were merged using inner join to match entries from both sides. Box plot was used to detect the outliers and histogram was used for visual check on using the log transformation.

Data Description and Importing Data

The first data set is sourced from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series and is maintained by Johns Hopkins University.

The second data set is sourced from https://stats.oecd.org/index.aspx?queryid=33940 which is maintained by Organisation for Economic Co-operation and Development.

Both the data sets are untidy as they include some variables which are not useful for analysis. Both the data sets have repeated values which increases the number of observations and so are transformed into proper structure or format.

The first data set includes:-

Date: Date for the daily no of cases.
Country : The names of the different countries.
Lat and long: Latitude and longitude of the countries.
type: Different types of cases i.e. Confirmed, Death or Recovered.
Cases: The number of cases.

The second data set includes:-

Location: Acronym’s of different countries
Country: The names of the different countries
Measure: GPSA or GYSA
where GPSA = Economic growth rate compared to previous quarter
GYSA = Economic growth rate compared to same quarter of the previous year
Frequency: Represents value as annual or quarterly
Time/Period: The year and the quarter of that year
Value: The GDP value

# Importing first data set
covid <- read_csv("C:/Users/winuser/Downloads/coronavirus.csv")

## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   province = col_logical(),
##   country = col_character(),
##   lat = col_double(),
##   long = col_double(),
##   type = col_character(),
##   cases = col_double()
## )

## Warning: 62059 parsing failures.
##   row      col           expected  actual                                         file
## 50407 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50408 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50409 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50410 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## 50411 province 1/0/T/F/TRUE/FALSE Alberta 'C:/Users/winuser/Downloads/coronavirus.csv'
## ..... ........ .................. ....... ............................................
## See problems(...) for more details.

head(covid)

## # A tibble: 6 x 7
##   date       province country       lat  long type      cases
##   <date>     <lgl>    <chr>       <dbl> <dbl> <chr>     <dbl>
## 1 2020-01-22 NA       Afghanistan  33.9  67.7 confirmed     0
## 2 2020-01-23 NA       Afghanistan  33.9  67.7 confirmed     0
## 3 2020-01-24 NA       Afghanistan  33.9  67.7 confirmed     0
## 4 2020-01-25 NA       Afghanistan  33.9  67.7 confirmed     0
## 5 2020-01-26 NA       Afghanistan  33.9  67.7 confirmed     0
## 6 2020-01-27 NA       Afghanistan  33.9  67.7 confirmed     0

# dropping not needed columns
covid <- select(covid, -matches("province|lat|long"))
head(covid)

## # A tibble: 6 x 4
##   date       country     type      cases
##   <date>     <chr>       <chr>     <dbl>
## 1 2020-01-22 Afghanistan confirmed     0
## 2 2020-01-23 Afghanistan confirmed     0
## 3 2020-01-24 Afghanistan confirmed     0
## 4 2020-01-25 Afghanistan confirmed     0
## 5 2020-01-26 Afghanistan confirmed     0
## 6 2020-01-27 Afghanistan confirmed     0

# Importing second data set
GDPWorld_un <- read_csv("C:/Users/winuser/Downloads/QNA_19102020172848076.csv")

## Parsed with column specification:
## cols(
##   LOCATION = col_character(),
##   Country = col_character(),
##   SUBJECT = col_character(),
##   Subject = col_character(),
##   MEASURE = col_character(),
##   Measure = col_character(),
##   FREQUENCY = col_character(),
##   Frequency = col_character(),
##   TIME = col_character(),
##   Period = col_character(),
##   `Unit Code` = col_character(),
##   Unit = col_character(),
##   `PowerCode Code` = col_double(),
##   PowerCode = col_character(),
##   `Reference Period Code` = col_logical(),
##   `Reference Period` = col_logical(),
##   Value = col_double(),
##   `Flag Codes` = col_character(),
##   Flags = col_character()
## )

head(GDPWorld_un)

## # A tibble: 6 x 19
##   LOCATION Country SUBJECT Subject MEASURE Measure FREQUENCY Frequency TIME 
##   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <chr>     <chr>
## 1 AUS      Austra~ B1_GE   Gross ~ GPSA    Growth~ A         Annual    2009 
## 2 AUS      Austra~ B1_GE   Gross ~ GPSA    Growth~ A         Annual    2010 
## 3 AUS      Austra~ B1_GE   Gross ~ GPSA    Growth~ A         Annual    2011 
## 4 AUS      Austra~ B1_GE   Gross ~ GPSA    Growth~ A         Annual    2012 
## 5 AUS      Austra~ B1_GE   Gross ~ GPSA    Growth~ A         Annual    2013 
## 6 AUS      Austra~ B1_GE   Gross ~ GPSA    Growth~ A         Annual    2014 
## # ... with 10 more variables: Period <chr>, `Unit Code` <chr>, Unit <chr>,
## #   `PowerCode Code` <dbl>, PowerCode <chr>, `Reference Period Code` <lgl>,
## #   `Reference Period` <lgl>, Value <dbl>, `Flag Codes` <chr>, Flags <chr>

# Dropping unnecessary columns including columns with null values
GDPWorld_un <- select(GDPWorld_un, matches("Country|MEASURE|TIME|Value", ignore.case = FALSE))

# converting column headers to lower case
names(GDPWorld_un) <- tolower(names(GDPWorld_un))
head(GDPWorld_un)

## # A tibble: 6 x 4
##   country   measure time  value
##   <chr>     <chr>   <chr> <dbl>
## 1 Australia GPSA    2009   1.89
## 2 Australia GPSA    2010   2.45
## 3 Australia GPSA    2011   2.73
## 4 Australia GPSA    2012   3.93
## 5 Australia GPSA    2013   2.10
## 6 Australia GPSA    2014   2.55

Understand

The first data had some negative values,
Confirmed cases have negative values which means no of patients tested and found are not confirmed so those values would be 0.
Recovered cases have negative values which means no of patients tested again could be covid positive (confirmed).
Death cases have negative values which means it could be possible that negative deaths are patients who are not dead are still alive with covid severe condition(confirmed).

Likewise, the second data set had negative GDP values. Considering GDP growth could have negative values the negative values weren’t changed.

The variables were converted into date, factor and numeric class. The factor levels were checked to see if they are properly ordered.

# First Data set [COVID]

# total days with negative numbers reported
count <- covid[covid$cases < 0, ]
count %>% summarise(total_negative_cases = n())

## # A tibble: 1 x 1
##   total_negative_cases
##                  <int>
## 1                  359

# checking format of negative cases with respect to types
covid %>% filter(date == "2020-10-08") %>% filter(country == "Angola")

## # A tibble: 3 x 4
##   date       country type      cases
##   <date>     <chr>   <chr>     <dbl>
## 1 2020-10-08 Angola  confirmed   233
## 2 2020-10-08 Angola  death        -3
## 3 2020-10-08 Angola  recovered    37

covid %>% filter(date == "2020-08-28") %>% filter(country == "Angola")

## # A tibble: 3 x 4
##   date       country type      cases
##   <date>     <chr>   <chr>     <dbl>
## 1 2020-08-28 Angola  confirmed    56
## 2 2020-08-28 Angola  death         1
## 3 2020-08-28 Angola  recovered  -307

# Changing data types to factor, dates and numeric
covid$date <- as.Date(covid$date)
covid$country <- as.factor(covid$country)
covid$type <- as.factor(covid$type)
covid[, 4] <- sapply(covid[, 4], as.numeric)

# checking the data types
sapply(covid, class)

##      date   country      type     cases 
##    "Date"  "factor"  "factor" "numeric"

# Checking if factors are labeled correctly
levels(covid$country)

##   [1] "Afghanistan"                      "Albania"                         
##   [3] "Algeria"                          "Andorra"                         
##   [5] "Angola"                           "Antigua and Barbuda"             
##   [7] "Argentina"                        "Armenia"                         
##   [9] "Australia"                        "Austria"                         
##  [11] "Azerbaijan"                       "Bahamas"                         
##  [13] "Bahrain"                          "Bangladesh"                      
##  [15] "Barbados"                         "Belarus"                         
##  [17] "Belgium"                          "Belize"                          
##  [19] "Benin"                            "Bhutan"                          
##  [21] "Bolivia"                          "Bosnia and Herzegovina"          
##  [23] "Botswana"                         "Brazil"                          
##  [25] "Brunei"                           "Bulgaria"                        
##  [27] "Burkina Faso"                     "Burma"                           
##  [29] "Burundi"                          "Cabo Verde"                      
##  [31] "Cambodia"                         "Cameroon"                        
##  [33] "Canada"                           "Central African Republic"        
##  [35] "Chad"                             "Chile"                           
##  [37] "China"                            "Colombia"                        
##  [39] "Comoros"                          "Congo (Brazzaville)"             
##  [41] "Congo (Kinshasa)"                 "Costa Rica"                      
##  [43] "Cote d'Ivoire"                    "Croatia"                         
##  [45] "Cuba"                             "Cyprus"                          
##  [47] "Czechia"                          "Denmark"                         
##  [49] "Diamond Princess"                 "Djibouti"                        
##  [51] "Dominica"                         "Dominican Republic"              
##  [53] "Ecuador"                          "Egypt"                           
##  [55] "El Salvador"                      "Equatorial Guinea"               
##  [57] "Eritrea"                          "Estonia"                         
##  [59] "Eswatini"                         "Ethiopia"                        
##  [61] "Fiji"                             "Finland"                         
##  [63] "France"                           "Gabon"                           
##  [65] "Gambia"                           "Georgia"                         
##  [67] "Germany"                          "Ghana"                           
##  [69] "Greece"                           "Grenada"                         
##  [71] "Guatemala"                        "Guinea"                          
##  [73] "Guinea-Bissau"                    "Guyana"                          
##  [75] "Haiti"                            "Holy See"                        
##  [77] "Honduras"                         "Hungary"                         
##  [79] "Iceland"                          "India"                           
##  [81] "Indonesia"                        "Iran"                            
##  [83] "Iraq"                             "Ireland"                         
##  [85] "Israel"                           "Italy"                           
##  [87] "Jamaica"                          "Japan"                           
##  [89] "Jordan"                           "Kazakhstan"                      
##  [91] "Kenya"                            "Korea, South"                    
##  [93] "Kosovo"                           "Kuwait"                          
##  [95] "Kyrgyzstan"                       "Laos"                            
##  [97] "Latvia"                           "Lebanon"                         
##  [99] "Lesotho"                          "Liberia"                         
## [101] "Libya"                            "Liechtenstein"                   
## [103] "Lithuania"                        "Luxembourg"                      
## [105] "Madagascar"                       "Malawi"                          
## [107] "Malaysia"                         "Maldives"                        
## [109] "Mali"                             "Malta"                           
## [111] "Mauritania"                       "Mauritius"                       
## [113] "Mexico"                           "Moldova"                         
## [115] "Monaco"                           "Mongolia"                        
## [117] "Montenegro"                       "Morocco"                         
## [119] "Mozambique"                       "MS Zaandam"                      
## [121] "Namibia"                          "Nepal"                           
## [123] "Netherlands"                      "New Zealand"                     
## [125] "Nicaragua"                        "Niger"                           
## [127] "Nigeria"                          "North Macedonia"                 
## [129] "Norway"                           "Oman"                            
## [131] "Pakistan"                         "Panama"                          
## [133] "Papua New Guinea"                 "Paraguay"                        
## [135] "Peru"                             "Philippines"                     
## [137] "Poland"                           "Portugal"                        
## [139] "Qatar"                            "Romania"                         
## [141] "Russia"                           "Rwanda"                          
## [143] "Saint Kitts and Nevis"            "Saint Lucia"                     
## [145] "Saint Vincent and the Grenadines" "San Marino"                      
## [147] "Sao Tome and Principe"            "Saudi Arabia"                    
## [149] "Senegal"                          "Serbia"                          
## [151] "Seychelles"                       "Sierra Leone"                    
## [153] "Singapore"                        "Slovakia"                        
## [155] "Slovenia"                         "Solomon Islands"                 
## [157] "Somalia"                          "South Africa"                    
## [159] "South Sudan"                      "Spain"                           
## [161] "Sri Lanka"                        "Sudan"                           
## [163] "Suriname"                         "Sweden"                          
## [165] "Switzerland"                      "Syria"                           
## [167] "Taiwan*"                          "Tajikistan"                      
## [169] "Tanzania"                         "Thailand"                        
## [171] "Timor-Leste"                      "Togo"                            
## [173] "Trinidad and Tobago"              "Tunisia"                         
## [175] "Turkey"                           "Uganda"                          
## [177] "Ukraine"                          "United Arab Emirates"            
## [179] "United Kingdom"                   "Uruguay"                         
## [181] "US"                               "Uzbekistan"                      
## [183] "Venezuela"                        "Vietnam"                         
## [185] "West Bank and Gaza"               "Western Sahara"                  
## [187] "Yemen"                            "Zambia"                          
## [189] "Zimbabwe"

levels(covid$type)

## [1] "confirmed" "death"     "recovered"

#check structure and attributes
str(covid)

## tibble [213,548 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date   : Date[1:213548], format: "2020-01-22" "2020-01-23" ...
##  $ country: Factor w/ 189 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ type   : Factor w/ 3 levels "confirmed","death",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cases  : num [1:213548] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "problems")= tibble [62,059 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ row     : int [1:62059] 50407 50408 50409 50410 50411 50412 50413 50414 50415 50416 ...
##   ..$ col     : chr [1:62059] "province" "province" "province" "province" ...
##   ..$ expected: chr [1:62059] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
##   ..$ actual  : chr [1:62059] "Alberta" "Alberta" "Alberta" "Alberta" ...
##   ..$ file    : chr [1:62059] "'C:/Users/winuser/Downloads/coronavirus.csv'" "'C:/Users/winuser/Downloads/coronavirus.csv'" "'C:/Users/winuser/Downloads/coronavirus.csv'" "'C:/Users/winuser/Downloads/coronavirus.csv'" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   province = col_logical(),
##   ..   country = col_character(),
##   ..   lat = col_double(),
##   ..   long = col_double(),
##   ..   type = col_character(),
##   ..   cases = col_double()
##   .. )

# Second data set [GDP]

# checking the out put before making some changes
head(GDPWorld_un)

## # A tibble: 6 x 4
##   country   measure time  value
##   <chr>     <chr>   <chr> <dbl>
## 1 Australia GPSA    2009   1.89
## 2 Australia GPSA    2010   2.45
## 3 Australia GPSA    2011   2.73
## 4 Australia GPSA    2012   3.93
## 5 Australia GPSA    2013   2.10
## 6 Australia GPSA    2014   2.55

# changing the time column name to quarter
names(GDPWorld_un)[names(GDPWorld_un) == "time"] <- "quarter"
head(GDPWorld_un)

## # A tibble: 6 x 4
##   country   measure quarter value
##   <chr>     <chr>   <chr>   <dbl>
## 1 Australia GPSA    2009     1.89
## 2 Australia GPSA    2010     2.45
## 3 Australia GPSA    2011     2.73
## 4 Australia GPSA    2012     3.93
## 5 Australia GPSA    2013     2.10
## 6 Australia GPSA    2014     2.55

# Changing data types to factor, dates and numeric
GDPWorld_un$country <- as.factor(GDPWorld_un$country)
GDPWorld_un$measure <- as.factor(GDPWorld_un$measure)

# checking the data types
sapply(GDPWorld_un, class)

##     country     measure     quarter       value 
##    "factor"    "factor" "character"   "numeric"

# Checking if factors are labeled correctly
levels(GDPWorld_un$country)

##  [1] "Argentina"                                      
##  [2] "Australia"                                      
##  [3] "Brazil"                                         
##  [4] "Canada"                                         
##  [5] "China (People's Republic of)"                   
##  [6] "Euro area (19 countries)"                       
##  [7] "European Union – 27 countries (from 01/02/2020)"
##  [8] "France"                                         
##  [9] "G20"                                            
## [10] "Germany"                                        
## [11] "India"                                          
## [12] "Indonesia"                                      
## [13] "Italy"                                          
## [14] "Japan"                                          
## [15] "Korea"                                          
## [16] "Mexico"                                         
## [17] "Russia"                                         
## [18] "Saudi Arabia"                                   
## [19] "South Africa"                                   
## [20] "Turkey"                                         
## [21] "United Kingdom"                                 
## [22] "United States"

#check structure and attributes
str(GDPWorld_un)

## tibble [2,443 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country: Factor w/ 22 levels "Argentina","Australia",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ measure: Factor w/ 2 levels "GPSA","GYSA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ quarter: chr [1:2443] "2009" "2010" "2011" "2012" ...
##  $ value  : num [1:2443] 1.89 2.45 2.73 3.93 2.1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   LOCATION = col_character(),
##   ..   Country = col_character(),
##   ..   SUBJECT = col_character(),
##   ..   Subject = col_character(),
##   ..   MEASURE = col_character(),
##   ..   Measure = col_character(),
##   ..   FREQUENCY = col_character(),
##   ..   Frequency = col_character(),
##   ..   TIME = col_character(),
##   ..   Period = col_character(),
##   ..   `Unit Code` = col_character(),
##   ..   Unit = col_character(),
##   ..   `PowerCode Code` = col_double(),
##   ..   PowerCode = col_character(),
##   ..   `Reference Period Code` = col_logical(),
##   ..   `Reference Period` = col_logical(),
##   ..   Value = col_double(),
##   ..   `Flag Codes` = col_character(),
##   ..   Flags = col_character()
##   .. )

Tidy & Manipulate Data I

Merging the data sets required common unique values in both the data sets. The first data set had dates which was converted mutating it into quarterly values. The data was grouped by country, type and quarter to sum all the number of cases. After summarizing the cases the type category was spread with cases value. In short, the longer data was transformed to wider data. A new active variable was mutated by subtracting confirmed, death and recovered.

There were some negative values. This happens because the past active patients might have been recovered in present/future and so recovered counts are higher. So, present cases were subtracted by the previous cases.

# Since dates are used in covid dataset sub setting the dates into quarterly for GDP dataset
# sub setting data quarterly
covid$quarter <- quarters(covid$date)
head(covid)

## # A tibble: 6 x 5
##   date       country     type      cases quarter
##   <date>     <fct>       <fct>     <dbl> <chr>  
## 1 2020-01-22 Afghanistan confirmed     0 Q1     
## 2 2020-01-23 Afghanistan confirmed     0 Q1     
## 3 2020-01-24 Afghanistan confirmed     0 Q1     
## 4 2020-01-25 Afghanistan confirmed     0 Q1     
## 5 2020-01-26 Afghanistan confirmed     0 Q1     
## 6 2020-01-27 Afghanistan confirmed     0 Q1

# Wider format
# spreading the cases type and mutating active column
covidW <- covid %>% 
  select(country, type, cases, quarter) %>% 
  group_by(country, type, quarter) %>% 
  summarise(cases = sum(cases)) %>% 
  spread(key = type, value = cases) %>% 
  mutate(active = confirmed - death - recovered) %>% 
  arrange(country) 

head(covidW)

## # A tibble: 6 x 6
## # Groups:   country [2]
##   country     quarter confirmed death recovered active
##   <fct>       <chr>       <dbl> <dbl>     <dbl>  <dbl>
## 1 Afghanistan Q1            174     4         5    165
## 2 Afghanistan Q2          31343   742     14126  16475
## 3 Afghanistan Q3           7751   712     18658 -11619
## 4 Afghanistan Q4            932    34       825     73
## 5 Albania     Q1            243    15        52    176
## 6 Albania     Q2           2292    47      1407    838

# active cases are negative. This happens because the past active patients might have been recovered in present/future and so recovered counts are higher. 
# subtracting previous to current active cases.
covidW <- covidW %>% 
  mutate(active = abs((active) - lag(active, default = first(country))))

head(covidW)

## # A tibble: 6 x 6
## # Groups:   country [2]
##   country     quarter confirmed death recovered active
##   <fct>       <chr>       <dbl> <dbl>     <dbl>  <dbl>
## 1 Afghanistan Q1            174     4         5    164
## 2 Afghanistan Q2          31343   742     14126  16310
## 3 Afghanistan Q3           7751   712     18658  28094
## 4 Afghanistan Q4            932    34       825  11692
## 5 Albania     Q1            243    15        52    174
## 6 Albania     Q2           2292    47      1407    662

# changing quarter to factor
covidW$quarter <- as.factor(covidW$quarter)

Tidy & Manipulate Data II

The time variable was needed to be in proper format to match with the quarter variable of first data set. The data set had entries of multiple years. Thus, filtering was done using grep function to get the entries of 2020 year as COVID cases are of 2020 year. Now, the extracted entries had repeated 2020 which were ignored using substring function.

Each measure GPSA or GYSA had 2 different quarter entries for every country which would be 4. So, there would have been multiple entries while merging as it is with first data set. To avoid multiple entries, the key measure was spread with the GDP value using spread function.

# filtering only 2020 year's related entries
GDPWorld_un <- GDPWorld_un %>%  filter(grepl("2020-", quarter)) %>% arrange(country)
head(GDPWorld_un)

## # A tibble: 6 x 4
##   country   measure quarter   value
##   <fct>     <fct>   <chr>     <dbl>
## 1 Argentina GPSA    2020-Q1  -4.16 
## 2 Argentina GPSA    2020-Q2 -16.2  
## 3 Argentina GYSA    2020-Q1  -4.65 
## 4 Argentina GYSA    2020-Q2 -19.8  
## 5 Australia GPSA    2020-Q1  -0.262
## 6 Australia GPSA    2020-Q2  -7.00

# now changing it into quarter format
GDPWorld_un$quarter <- substr(GDPWorld_un$quarter, 6, 7)
head(GDPWorld_un)

## # A tibble: 6 x 4
##   country   measure quarter   value
##   <fct>     <fct>   <chr>     <dbl>
## 1 Argentina GPSA    Q1       -4.16 
## 2 Argentina GPSA    Q2      -16.2  
## 3 Argentina GYSA    Q1       -4.65 
## 4 Argentina GYSA    Q2      -19.8  
## 5 Australia GPSA    Q1       -0.262
## 6 Australia GPSA    Q2       -7.00

# now changing quarter to factor
GDPWorld_un$quarter <- as.factor(GDPWorld_un$quarter)

# Since there are 2 measures spreading the measure values
# GPSA = Economic growth rate compared to previous quarter
# GYSA = Economic growth rate compared to same quarter of the previous year
GDPWorld_un <- GDPWorld_un %>% 
  select(country, measure, quarter, value) %>% 
  group_by(country, quarter) %>% 
  spread(key = measure, value = value)

Scan I

Null values were checked using sapply function and no null values were found.
Box plots are a great way to visualize the outliers and so box plots were used in this case to detect the outliers.

There were outliers in numeric columns. Using box plot the vector of outliers was created. Then the rows containing those outlier values were displayed. Furthermore, the box plot without outliers was plotted to check if the outliers were removed. The outliers would be eliminated further if model has to be build.

Cap function is used to cap the outliers with percentiles.

# checking for total null values
sapply(covidW, function(x) sum(is.na(x)))

##   country   quarter confirmed     death recovered    active 
##         0         0         0         0         0         0

# detecting outliers. Found in 4 types of cases
boxplot(covidW$confirmed)$out

##   [1]   37623   63476  686471  238679   44106  145432  217996   61966   48652
##  [10]   57025  103801  102092 1396324 3408894  413427   97570   55010   39697
##  [19]  276549  183598   82279   96940  731833  129893   72301   58809  103122
##  [28]   79641   54192   80615   67601   34887   69522   52827  149236  401968
##  [37]  272311   71808  123610   97495   75758   73650   57342  584084 5727103
##  [46] 1237689   54857  230623   74859   44605  183057  229557   73161   48415
##  [55]  313872   63653  220250   57615  105792  134786   74283   99380   64976
##  [64]   85671   45906   58987   41163   37856  224874  517127  108011   36429
##  [73]  111120   49979   64253   54429   37816   75505  108215   39878   58515
##  [82]  211352   99336   79045   38537  284148  526555   53781   35430  274180
##  [91]   44924   57121   84252   95307  100602   52816  644592  523870  220025
## [100]  189260  143782   42981  149856  523130   95923  153348  519917  167372
## [109]   63090  186375  118757   44609  169192   92855   48003   45523   38815
## [118]  246401  170630  269446  192091 2441780 4599171  921552   48214   69290
## [127]   37471

# Extracting those outliers in a vector
co <- boxplot(covidW$confirmed, plot=FALSE)$out

# Getting rows where those outliers are present. Since there are lot of observations the line is commented
# covidW[which(covidW$confirmed %in% co),]

#checking box plot without outliers
boxplot(which(covidW$confirmed %in% co))

# using cap function to remove outliers
cap <- function(x){
    quantiles <- quantile( x, c(0.05, 0.25, 0.75, 0.95))
    iqr <- IQR(x)
    x[x < quantiles[2] - 1.5*iqr] <- quantiles[1]
    x[x > quantiles[3] + 1.5*iqr] <- quantiles[4]
  x}

# Replacing extreme values with percentiles using cap function
covidL = cap(covidW$confirmed)

Scan II

Null values were checked using sapply function and no null values were found.
Box plots are a great way to visualize the outliers and so box plots were used in this case to detect the outliers.

There were no outliers detected in second data set.

# checking for total null values
sapply(GDPWorld_un, function(x) sum(is.na(x)))

## country quarter    GPSA    GYSA 
##       0       0       0       0

# detecting outliers in GPSA and GYSA. No outliers found
boxplot(GDPWorld_un$GPSA)

boxplot(GDPWorld_un$GYSA)

Merge

Merging the data sets using common attributes as country and quarter. The entries only matching both the data sets are joined using inner join.

# checking the structure of both data frames
head(covidW)

## # A tibble: 6 x 6
## # Groups:   country [2]
##   country     quarter confirmed death recovered active
##   <fct>       <fct>       <dbl> <dbl>     <dbl>  <dbl>
## 1 Afghanistan Q1            174     4         5    164
## 2 Afghanistan Q2          31343   742     14126  16310
## 3 Afghanistan Q3           7751   712     18658  28094
## 4 Afghanistan Q4            932    34       825  11692
## 5 Albania     Q1            243    15        52    174
## 6 Albania     Q2           2292    47      1407    662

head(GDPWorld_un)

## # A tibble: 6 x 4
## # Groups:   country, quarter [6]
##   country   quarter    GPSA   GYSA
##   <fct>     <fct>     <dbl>  <dbl>
## 1 Argentina Q1       -4.16   -4.65
## 2 Argentina Q2      -16.2   -19.8 
## 3 Australia Q1       -0.262   1.56
## 4 Australia Q2       -7.00   -6.26
## 5 Brazil    Q1       -2.45   -1.43
## 6 Brazil    Q2       -9.69  -11.4

# Merging the data sets
World <- inner_join(covidW, GDPWorld_un, by = c("country" = "country", "quarter" = "quarter"))

## Warning: Column `country` joining factors with different levels, coercing to
## character vector

## Warning: Column `quarter` joining factors with different levels, coercing to
## character vector

head(World)

## # A tibble: 6 x 8
## # Groups:   country [3]
##   country   quarter confirmed death recovered active    GPSA   GYSA
##   <chr>     <chr>       <dbl> <dbl>     <dbl>  <dbl>   <dbl>  <dbl>
## 1 Argentina Q1           1054    27       240    780  -4.16   -4.65
## 2 Argentina Q2          63476  1280     21788  39621 -16.2   -19.8 
## 3 Australia Q1           4559    18       358   4174  -0.262   1.56
## 4 Australia Q2           3361    86      6682   7590  -7.00   -6.26
## 5 Brazil    Q1           5717   201       127   5365  -2.45   -1.43
## 6 Brazil    Q2        1396324 59393    788191 543351  -9.69  -11.4

# Data sets are merged successfully

Transform

The histogram of confirmed cases is rightly skewed. Transforming the distribution might be done for better understanding.

Recommended transformation to correct for right skew include:
Log transformation
Natural log transformation
Square root transformation
Reciprocal transformation

In first data set, since most rows are excluded while merging (inner join) data sets, transformation wouldn’t affect the underlying structure of data. Out of all, log transformation values appear more closer to mean.

Since second dataset contains several negative GDP values, Normalization is not done.

# Normal histogram
hist(World$confirmed)

# Using log transformation to check histogram for normalization. 
# Since most rows are excluded transformation wouldn't affect the underlying structure

hist(log(World$confirmed))

hist(log10(World$confirmed))

hist(sqrt(World$confirmed))

hist(1/(World$confirmed))

# applying log transformation
World$confirmed <- log(World$confirmed)

# Normal histogram for measures. Normalization is not applied since it contains negative GDP values.
hist(World$GPSA)

References

[1]“CSSEGISandData/COVID-19”, GitHub, 2020. [Online].
Available: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series.
[Accessed: 20- Oct- 2020].

[2]“Quarterly National Accounts : G20 - Quarterly Growth Rates of GDP in volume”, Stats.oecd.org, 2020. [Online].
Available: https://stats.oecd.org/index.aspx?queryid=33940.
[Accessed: 20- Oct- 2020].