Part A:

Task 1: Data wrangling and integration

-Requirement - We want to explore the three datasets and identify the features correlated with the spread of the Coronavirus. To achieve this goal, you are required to apply the following procedures:

1.Familiarise yourself with the three datasets (i.e., the attributes, number of instances, number of missing records).

2.Read and open the three .csv files in R.

3.Wrangle the datasets (e.g., deal with missing data) and explain the wrangling techniques you have applied.

4.Pre-process the three datasets (e.g., make your datasets consistent in format), and state specifically what changes you have made in the pre-processing step. Note that the “date” attribute comes in different formats, and you should fix that before moving on to the following tasks.

5.Join the three datasets properly.

6.Tidy the joined dataset (i.e., check for duplicates). Describe the techniques used to tidy the data.

The expected joined dataset should include the following columns: location, date, total_cases, new_cases, total_deaths, new_deaths, gdp_per_capita, population, lockdown_date, total doses administered, % of population fully vaccinated

My Answers for Task 1

Package Installation & Load (tidyverse and ggplot)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# ggplot2 is already attached by library(tidyverse) above, so no separate installation step is needed here.

library(ggplot2)

1.Familiarise yourself with the three datasets (i.e., the attributes, number of instances, number of missing records). & 2.Read and open the three .csv files in R.

#load the "readr" package for reading CSV files (already attached by tidyverse)
library(readr)

#import 3 datasets from CSV files 
Covid_data <- read_csv("Covid-data.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 1575 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): location
## dbl  (6): total_cases, new_cases, total_deaths, new_deaths, gdp_per_capita, ...
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

CountryLockdowndates <-read_csv("CountryLockdowndates.csv")

## Rows: 307 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Country/Region, Province, Date, Type, Reference
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

WorldwideVaccineData <-read_csv("WorldwideVaccineData.csv")

## Rows: 187 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (4): Doses administered per 100 people, Total doses administered, % of p...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Covid_data

## # A tibble: 1,575 × 8
##    location  date       total_cases new_cases total_deaths new_deaths
##    <chr>     <date>           <dbl>     <dbl>        <dbl>      <dbl>
##  1 Australia 2019-12-31           0         0            0          0
##  2 Australia 2020-01-01           0         0            0          0
##  3 Australia 2020-01-02           0         0            0          0
##  4 Australia 2020-01-03           0         0            0          0
##  5 Australia 2020-01-04           0         0            0          0
##  6 Australia 2020-01-05           0         0            0          0
##  7 Australia 2020-01-06           0         0            0          0
##  8 Australia 2020-01-07           0         0            0          0
##  9 Australia 2020-01-08           0         0            0          0
## 10 Australia 2020-01-09           0         0            0          0
## # ℹ 1,565 more rows
## # ℹ 2 more variables: gdp_per_capita <dbl>, population <dbl>

CountryLockdowndates

## # A tibble: 307 × 5
##    `Country/Region`    Province                     Date       Type  Reference  
##    <chr>               <chr>                        <chr>      <chr> <chr>      
##  1 Afghanistan         <NA>                         24/03/2020 Full  https://ww…
##  2 Albania             <NA>                         08/03/2020 Full  https://en…
##  3 Algeria             <NA>                         24/03/2020 Full  https://ww…
##  4 Andorra             <NA>                         16/03/2020 Full  https://en…
##  5 Angola              <NA>                         24/03/2020 Full  https://en…
##  6 Antigua and Barbuda <NA>                         <NA>       None  <NA>       
##  7 Argentina           <NA>                         20/03/2020 Full  https://ww…
##  8 Armenia             <NA>                         24/03/2020 Full  https://ww…
##  9 Australia           Australian Capital Territory <NA>       None  https://en…
## 10 Australia           New South Wales              <NA>       None  https://en…
## # ℹ 297 more rows

WorldwideVaccineData

## # A tibble: 187 × 5
##    Country  Doses administered p…¹ Total doses administ…² % of population vacc…³
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 Afghani…                     17                6445359                     15
##  2 Albania                     102                2906126                     46
##  3 Algeria                      35               15205854                     19
##  4 Angola                       64               20397115                     41
##  5 Argenti…                    237              106474858                     92
##  6 Armenia                      73                2150112                     38
##  7 Aruba                       162                 172216                     84
##  8 Austral…                    229               57988175                     88
##  9 Austria                     207               18418001                     77
## 10 Azerbai…                    137               13772531                     53
## # ℹ 177 more rows
## # ℹ abbreviated names: ¹`Doses administered per 100 people`,
## #   ²`Total doses administered`, ³`% of population vaccinated`
## # ℹ 1 more variable: `% of population fully vaccinated` <dbl>

#Get an overview of the 3 datasets
glimpse(Covid_data)

## Rows: 1,575
## Columns: 8
## $ location       <chr> "Australia", "Australia", "Australia", "Australia", "Au…
## $ date           <date> 2019-12-31, 2020-01-01, 2020-01-02, 2020-01-03, 2020-0…
## $ total_cases    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ new_cases      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_deaths   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ new_deaths     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gdp_per_capita <dbl> 44648.71, 44648.71, 44648.71, 44648.71, 44648.71, 44648…
## $ population     <dbl> 25499881, 25499881, 25499881, 25499881, 25499881, 25499…

glimpse(CountryLockdowndates)

## Rows: 307
## Columns: 5
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Ango…
## $ Province         <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Australian Capital T…
## $ Date             <chr> "24/03/2020", "08/03/2020", "24/03/2020", "16/03/2020…
## $ Type             <chr> "Full", "Full", "Full", "Full", "Full", "None", "Full…
## $ Reference        <chr> "https://www.thestatesman.com/world/afghan-govt-impos…

glimpse(WorldwideVaccineData)

## Rows: 187
## Columns: 5
## $ Country                             <chr> "Afghanistan", "Albania", "Algeria…
## $ `Doses administered per 100 people` <dbl> 17, 102, 35, 64, 237, 73, 162, 229…
## $ `Total doses administered`          <dbl> 6445359, 2906126, 15205854, 203971…
## $ `% of population vaccinated`        <dbl> 15, 46, 19, 41, 92, 38, 84, 88, 77…
## $ `% of population fully vaccinated`  <dbl> 13.0, 44.0, 16.0, 22.0, 84.0, 33.0…

summary(Covid_data)

##    location              date             total_cases        new_cases     
##  Length:1575        Min.   :2019-12-31   Min.   :      0   Min.   :-29726  
##  Class :character   1st Qu.:2020-02-17   1st Qu.:     22   1st Qu.:     1  
##  Mode  :character   Median :2020-04-07   Median :  58226   Median :   205  
##                     Mean   :2020-04-06   Mean   : 180452   Mean   :  2971  
##                     3rd Qu.:2020-05-25   3rd Qu.: 173133   3rd Qu.:  1880  
##                     Max.   :2020-07-14   Max.   :3363056   Max.   : 66625  
##                     NA's   :173                                            
##   total_deaths      new_deaths      gdp_per_capita    population       
##  Min.   :     0   Min.   :-1918.0   Min.   :15309   Min.   :2.550e+07  
##  1st Qu.:     0   1st Qu.:    0.0   1st Qu.:26677   1st Qu.:6.046e+07  
##  Median :  2837   Median :    5.0   Median :38606   Median :6.789e+07  
##  Mean   : 14060   Mean   :  183.8   Mean   :35140   Mean   :2.652e+08  
##  3rd Qu.: 25100   3rd Qu.:  149.0   3rd Qu.:42201   3rd Qu.:2.075e+08  
##  Max.   :135605   Max.   : 4928.0   Max.   :54225   Max.   :1.439e+09  
##  NA's   :6        NA's   :7

summary(CountryLockdowndates)

##  Country/Region       Province             Date               Type          
##  Length:307         Length:307         Length:307         Length:307        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##   Reference        
##  Length:307        
##  Class :character  
##  Mode  :character

summary(WorldwideVaccineData)

##    Country          Doses administered per 100 people Total doses administered
##  Length:187         Min.   :  0                       Min.   :1.714e+04       
##  Class :character   1st Qu.: 62                       1st Qu.:1.810e+06       
##  Mode  :character   Median :130                       Median :8.179e+06       
##                     Mean   :131                       Mean   :6.493e+07       
##                     3rd Qu.:199                       3rd Qu.:2.865e+07       
##                     Max.   :343                       Max.   :3.408e+09       
##  % of population vaccinated % of population fully vaccinated
##  Min.   : 0.10              Min.   : 0.10                   
##  1st Qu.:36.50              1st Qu.:29.00                   
##  Median :62.00              Median :55.00                   
##  Mean   :56.91              Mean   :51.94                   
##  3rd Qu.:80.00              3rd Qu.:75.00                   
##  Max.   :99.00              Max.   :99.00

Overview of missing values in each of the 3 datasets: which datasets contain missing values, and how many?

library(naniar)
vis_miss(Covid_data)

vis_miss(CountryLockdowndates)

vis_miss(WorldwideVaccineData)

Closer examination: count the missing values per column in each dataset.

library("simputation")

## 
## Attaching package: 'simputation'

## The following object is masked from 'package:naniar':
## 
##     impute_median

glimpse_na(Covid_data)

## 
## na count: 186

##        columns nNA
## 1         date 173
## 2   new_deaths   7
## 3 total_deaths   6

glimpse_na(CountryLockdowndates)

## 
## na count: 324

##     columns nNA
## 1  Province 178
## 2      Date  77
## 3 Reference  69

glimpse_na(WorldwideVaccineData)

## 
## No NA's.
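
The per-column counts above can be cross-checked without any extra packages. The following is a minimal sketch in base R; `toy` is a made-up data frame used only for illustration, not one of the assignment files.

```r
# Hypothetical toy data standing in for one of the datasets
toy <- data.frame(
  location = c("A", "A", "B", "B"),
  date = as.Date(c("2020-01-01", NA, "2020-01-01", "2020-01-02")),
  new_deaths = c(0, NA, NA, 3)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Total number of missing cells in the whole table
total_na <- sum(is.na(toy))
print(total_na)
```

Applied to Covid_data, the same two calls should agree with the counts reported by glimpse_na().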

3 & 4.Wrangle and Pre-process the datasets

3.1 Dealing with missing values: load the packages

library(naniar)
library(simputation)
library(tidyverse)

For Covid_data: apply the “impute_mean_at” function to the variables “date”, “new_deaths” and “total_deaths”, replacing each missing value with the mean of its column.

Covid_data_new <- Covid_data %>%
  impute_mean_at(vars(date,
                      new_deaths,total_deaths))
Covid_data_new

## # A tibble: 1,575 × 8
##    location  date       total_cases new_cases total_deaths new_deaths
##    <chr>     <date>           <dbl>     <dbl>        <dbl>      <dbl>
##  1 Australia 2019-12-31           0         0            0          0
##  2 Australia 2020-01-01           0         0            0          0
##  3 Australia 2020-01-02           0         0            0          0
##  4 Australia 2020-01-03           0         0            0          0
##  5 Australia 2020-01-04           0         0            0          0
##  6 Australia 2020-01-05           0         0            0          0
##  7 Australia 2020-01-06           0         0            0          0
##  8 Australia 2020-01-07           0         0            0          0
##  9 Australia 2020-01-08           0         0            0          0
## 10 Australia 2020-01-09           0         0            0          0
## # ℹ 1,565 more rows
## # ℹ 2 more variables: gdp_per_capita <dbl>, population <dbl>

glimpse_na(Covid_data_new)

## 
## No NA's.
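
Mean imputation gives every missing date the same mid-range day, which may not match the row's position in the time series. One possible alternative is to rebuild each missing date from its row position. This is only a sketch, and it assumes that the rows of each country are consecutive daily records in order, which has not been verified for Covid_data. The `toy` data and the helper columns `anchor_row` and `expected` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: daily rows per country with some dates missing
toy <- tibble(
  location = c("A", "A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2020-01-01", NA, "2020-01-03", NA,
                   "2020-02-10", "2020-02-11", NA))
)

rebuilt <- toy %>%
  group_by(location) %>%
  mutate(
    # Row position of the first known date within the country
    anchor_row = which(!is.na(date))[1],
    # First known date plus the offset in days from that row
    expected = date[anchor_row] + (row_number() - anchor_row),
    # Fill only the gaps; known dates are left untouched
    date = if_else(is.na(date), expected, date)
  ) %>%
  ungroup() %>%
  select(location, date)

print(rebuilt)
```

If the daily-sequence assumption does not hold, dropping the rows with missing dates may be a safer choice than either approach.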

For CountryLockdowndates: no imputation is applied. Although the “Province” variable has missing values, it is categorical and is not one of the columns required by the assessment, so it is excluded from the analysis.

First, the “Date” variable needs to be converted to the Year-Month-Day format so that it is consistent with the Covid_data dataset.

#Fixing the “date” attribute

#Convert the Date variable from class 'character' to class 'Date' with the lubridate package.
#The strings are in day/month/year order (e.g. "24/03/2020"), which ymd() cannot parse: the call below raises the "All formats failed to parse" warning and every value becomes NA. dmy() is the lubridate parser that matches this order.
#An NA for a country that never had a lockdown is expected and does not represent a missing value.
library(lubridate)
library(tidyverse)

CountryLockdowndates$Date <- ymd(CountryLockdowndates$Date)

## Warning: All formats failed to parse. No formats found.

CountryLockdowndates

## # A tibble: 307 × 5
##    `Country/Region`    Province                     Date   Type  Reference      
##    <chr>               <chr>                        <date> <chr> <chr>          
##  1 Afghanistan         <NA>                         NA     Full  https://www.th…
##  2 Albania             <NA>                         NA     Full  https://en.wik…
##  3 Algeria             <NA>                         NA     Full  https://www.ga…
##  4 Andorra             <NA>                         NA     Full  https://en.wik…
##  5 Angola              <NA>                         NA     Full  https://en.wik…
##  6 Antigua and Barbuda <NA>                         NA     None  <NA>           
##  7 Argentina           <NA>                         NA     Full  https://www.bl…
##  8 Armenia             <NA>                         NA     Full  https://www.az…
##  9 Australia           Australian Capital Territory NA     None  https://en.wik…
## 10 Australia           New South Wales              NA     None  https://en.wik…
## # ℹ 297 more rows

glimpse(CountryLockdowndates)

## Rows: 307
## Columns: 5
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Ango…
## $ Province         <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Australian Capital T…
## $ Date             <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Type             <chr> "Full", "Full", "Full", "Full", "Full", "None", "Full…
## $ Reference        <chr> "https://www.thestatesman.com/world/afghan-govt-impos…

#check result for fixing the "date" attribute
class(CountryLockdowndates$Date)

## [1] "Date"
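
The class is now "Date", but the values were lost in the conversion because the input is written day/month/year. A minimal sketch of parsing that layout with lubridate's dmy() follows; `raw_dates` is a made-up vector that copies the format of the Date column.

```r
library(lubridate)

# Example strings in the same day/month/year layout as the Date column
raw_dates <- c("24/03/2020", "08/03/2020", NA)

# dmy() reads day-month-year input; a missing string stays NA
parsed <- dmy(raw_dates)

print(parsed)
class(parsed)
```

With this parser, only the countries that have no recorded lockdown date would remain NA.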

3.2 Drop unnecessary variables in accordance with the assessment requirement

For CountryLockdowndates, drop 3 variables: “Province”, “Type” and “Reference”.

CountryLockdowndates_new = subset(CountryLockdowndates, select = -c(Province,Type,Reference) )
glimpse(CountryLockdowndates_new)

## Rows: 307
## Columns: 2
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Ango…
## $ Date             <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

For WorldwideVaccineData, drop 2 variables: “Doses administered per 100 people” and “% of population vaccinated”.

WorldwideVaccineData_new = subset(WorldwideVaccineData, select = -c(`Doses administered per 100 people`,`% of population vaccinated`) )
glimpse(WorldwideVaccineData_new)

## Rows: 187
## Columns: 3
## $ Country                            <chr> "Afghanistan", "Albania", "Algeria"…
## $ `Total doses administered`         <dbl> 6445359, 2906126, 15205854, 2039711…
## $ `% of population fully vaccinated` <dbl> 13.0, 44.0, 16.0, 22.0, 84.0, 33.0,…

3.3 Modify variable names for consistency and clarity

# View the names for all variables in 3 datasets
names(Covid_data_new)

## [1] "location"       "date"           "total_cases"    "new_cases"     
## [5] "total_deaths"   "new_deaths"     "gdp_per_capita" "population"

names(CountryLockdowndates)

## [1] "Country/Region" "Province"       "Date"           "Type"          
## [5] "Reference"

names(WorldwideVaccineData_new)

## [1] "Country"                          "Total doses administered"        
## [3] "% of population fully vaccinated"

3.3.1. Rename in dataset: Covid_data_new

# Rename "location" to "Country" to ensure consistency across the 3 datasets
names(Covid_data_new)[names(Covid_data_new) == "location"] <- "Country"

# View the new names
names(Covid_data_new)

## [1] "Country"        "date"           "total_cases"    "new_cases"     
## [5] "total_deaths"   "new_deaths"     "gdp_per_capita" "population"

3.3.2 Rename in dataset: CountryLockdowndates_new

# Rename "Date" to "Lockdown_date" and "Country/Region" to "Country" to ensure consistency across the 3 datasets.
names(CountryLockdowndates_new)[names(CountryLockdowndates) == "Date"] <- "Lockdown_date"

## Warning: The `value` argument of `names<-` must have the same length as `x` as of tibble
## 3.0.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

names(CountryLockdowndates_new)[names(CountryLockdowndates) == "Country/Region"] <- "Country"
names(CountryLockdowndates_new)[names(CountryLockdowndates) == "Date"] <- "Lockdown_date"

names(CountryLockdowndates_new)

## [1] "Country" "Date"
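
The output still shows "Date" because the index is built from names(CountryLockdowndates), which has five entries, while CountryLockdowndates_new has only two columns. The match for "Date" sits at position 3, where the two-column table has nothing to rename. A minimal sketch of a name-based rename with dplyr follows; `lockdown` is a hypothetical stand-in with the same two columns.

```r
library(dplyr)

# Hypothetical stand-in with the same two columns as CountryLockdowndates_new
lockdown <- tibble(
  `Country/Region` = c("Albania", "Angola"),
  Date = as.Date(c("2020-03-08", "2020-03-24"))
)

# rename(new_name = old_name) matches columns by name, not by position
lockdown_renamed <- lockdown %>%
  rename(Country = `Country/Region`, lockdown_date = Date)

names(lockdown_renamed)
```

Because both renames happen in one call on the same table, there is no second table whose names could drift out of step.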

3.3.3 Rename in dataset: WorldwideVaccineData_new

# Rename "Total doses administered" to "total doses administered" to match the column name expected in the joined dataset
names(WorldwideVaccineData_new)[names(WorldwideVaccineData_new) == "Total doses administered"] <- "total doses administered"

# View the new names
names(WorldwideVaccineData_new)

## [1] "Country"                          "total doses administered"        
## [3] "% of population fully vaccinated"

5.Join the 3 datasets into 1 dataset, using full_join() from dplyr.

#overview of variables for all 3 datasets
names(Covid_data_new)

## [1] "Country"        "date"           "total_cases"    "new_cases"     
## [5] "total_deaths"   "new_deaths"     "gdp_per_capita" "population"

names(CountryLockdowndates_new)

## [1] "Country" "Date"

names(WorldwideVaccineData_new)

## [1] "Country"                          "total doses administered"        
## [3] "% of population fully vaccinated"

#Firstly, full-join the first 2 datasets
library(dplyr)
data<-full_join(Covid_data_new,CountryLockdowndates_new,by='Country')

## Warning in full_join(Covid_data_new, CountryLockdowndates_new, by = "Country"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 9 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

names(data)

## [1] "Country"        "date"           "total_cases"    "new_cases"     
## [5] "total_deaths"   "new_deaths"     "gdp_per_capita" "population"    
## [9] "Date"
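
The many-to-many warning arises because the lockdown table has several rows for some countries (one per province, as for Australia), so each daily Covid row is repeated once per province. One way to avoid this is to collapse the lockdown table to one row per country before joining. The sketch below assumes that a single country-level date is wanted and that the earliest provincial date is an acceptable summary. Both are assumptions, and `covid`, `lockdown` and `lockdown_by_country` are hypothetical names.

```r
library(dplyr)

# Hypothetical stand-ins for the two tables being joined
covid <- tibble(
  Country = c("X", "X", "Y"),
  date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01")),
  new_cases = c(1, 2, 5)
)
lockdown <- tibble(
  Country = c("X", "X", "Y"),
  Province = c("North", "South", NA),
  Lockdown_date = as.Date(c("2020-03-20", "2020-03-15", NA))
)

# One row per country: keep the earliest non-missing lockdown date,
# or NA when the country has no recorded date at all
lockdown_by_country <- lockdown %>%
  group_by(Country) %>%
  summarise(
    Lockdown_date = if (all(is.na(Lockdown_date))) as.Date(NA)
                    else min(Lockdown_date, na.rm = TRUE),
    .groups = "drop"
  )

# Each Covid row now matches at most one lockdown row
joined <- full_join(covid, lockdown_by_country, by = "Country")
nrow(joined)
```

With one lockdown row per country, the join keeps the row count of the daily table instead of multiplying it.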

#Then, full-join the "data" with the last dataset.
library(dplyr)
Final_data<-full_join(data,WorldwideVaccineData_new,by='Country')
names(Final_data)

##  [1] "Country"                          "date"                            
##  [3] "total_cases"                      "new_cases"                       
##  [5] "total_deaths"                     "new_deaths"                      
##  [7] "gdp_per_capita"                   "population"                      
##  [9] "Date"                             "total doses administered"        
## [11] "% of population fully vaccinated"

view(Final_data)

#rename the “Date” variable to “lockdown_date” to avoid any confusion with “date”

names(Final_data)[names(Final_data) == "Date"] <- "lockdown_date"
names(Final_data)

##  [1] "Country"                          "date"                            
##  [3] "total_cases"                      "new_cases"                       
##  [5] "total_deaths"                     "new_deaths"                      
##  [7] "gdp_per_capita"                   "population"                      
##  [9] "lockdown_date"                    "total doses administered"        
## [11] "% of population fully vaccinated"

#replace spaces in variable names with underscores (_) and replace symbols like % with P

names(Final_data)[names(Final_data) == "total doses administered"] <- "total_doses_administered"
names(Final_data)

##  [1] "Country"                          "date"                            
##  [3] "total_cases"                      "new_cases"                       
##  [5] "total_deaths"                     "new_deaths"                      
##  [7] "gdp_per_capita"                   "population"                      
##  [9] "lockdown_date"                    "total_doses_administered"        
## [11] "% of population fully vaccinated"

names(Final_data)[names(Final_data) == "% of population fully vaccinated"] <- "P_of_population_fully_vaccinated"
names(Final_data)

##  [1] "Country"                          "date"                            
##  [3] "total_cases"                      "new_cases"                       
##  [5] "total_deaths"                     "new_deaths"                      
##  [7] "gdp_per_capita"                   "population"                      
##  [9] "lockdown_date"                    "total_doses_administered"        
## [11] "P_of_population_fully_vaccinated"

6.Tidy the dataset.

#fix the typos in country names
#iran-->Iran, Itly-->Italy, UnitedKingdom-->United Kingdom,
#United Stats-->United States, United State-->United States
library(stringr)

Final_data$Country[Final_data$Country == 'iran'] <- 'Iran'
Final_data$Country[Final_data$Country == 'Itly'] <- 'Italy'
Final_data$Country[Final_data$Country == 'UnitedKingdom'] <- 'United Kingdom'
Final_data$Country[Final_data$Country == 'United Stats'] <- 'United States'
Final_data$Country[Final_data$Country == 'United State'] <- 'United States'

print(Final_data)

## # A tibble: 12,918 × 11
##    Country   date       total_cases new_cases total_deaths new_deaths
##    <chr>     <date>           <dbl>     <dbl>        <dbl>      <dbl>
##  1 Australia 2019-12-31           0         0            0          0
##  2 Australia 2019-12-31           0         0            0          0
##  3 Australia 2019-12-31           0         0            0          0
##  4 Australia 2019-12-31           0         0            0          0
##  5 Australia 2019-12-31           0         0            0          0
##  6 Australia 2019-12-31           0         0            0          0
##  7 Australia 2019-12-31           0         0            0          0
##  8 Australia 2019-12-31           0         0            0          0
##  9 Australia 2020-01-01           0         0            0          0
## 10 Australia 2020-01-01           0         0            0          0
## # ℹ 12,908 more rows
## # ℹ 5 more variables: gdp_per_capita <dbl>, population <dbl>,
## #   lockdown_date <date>, total_doses_administered <dbl>,
## #   P_of_population_fully_vaccinated <dbl>

view(Final_data)
glimpse(Final_data)

## Rows: 12,918
## Columns: 11
## $ Country                          <chr> "Australia", "Australia", "Australia"…
## $ date                             <date> 2019-12-31, 2019-12-31, 2019-12-31, …
## $ total_cases                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ new_cases                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_deaths                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ new_deaths                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gdp_per_capita                   <dbl> 44648.71, 44648.71, 44648.71, 44648.7…
## $ population                       <dbl> 25499881, 25499881, 25499881, 2549988…
## $ lockdown_date                    <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ total_doses_administered         <dbl> 57988175, 57988175, 57988175, 5798817…
## $ P_of_population_fully_vaccinated <dbl> 86, 86, 86, 86, 86, 86, 86, 86, 86, 8…
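
The printed rows above appear to repeat the same Australia record several times, so the duplicate check requested in step 6 is still worth running. A minimal sketch of counting and removing exact duplicate rows follows; `toy` is a made-up table, not Final_data.

```r
library(dplyr)

# Hypothetical toy data containing one exact duplicate row
toy <- tibble(
  Country = c("X", "X", "X", "Y"),
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-01")),
  new_cases = c(1, 1, 2, 5)
)

# Count rows that are exact copies of an earlier row
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)

# Keep one copy of each distinct row
toy_unique <- distinct(toy)
print(toy_unique)
```

distinct() compares all columns by default, so two rows that differ in any column are both kept.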

Task 2: Data visualisation and analysis

Part1: Create a plot (e.g., line plot, scatter plot, histogram) to investigate the trend of new cases for each country. The following image is an example of a plot, you should create such a graph for each country, and address your findings.

section1: PLOT

#step1: keep only the variables needed for the plot
Final_data_p1 <- select(Final_data, date, Country, new_cases)
Final_data_p1

## # A tibble: 12,918 × 3
##    date       Country   new_cases
##    <date>     <chr>         <dbl>
##  1 2019-12-31 Australia         0
##  2 2019-12-31 Australia         0
##  3 2019-12-31 Australia         0
##  4 2019-12-31 Australia         0
##  5 2019-12-31 Australia         0
##  6 2019-12-31 Australia         0
##  7 2019-12-31 Australia         0
##  8 2019-12-31 Australia         0
##  9 2020-01-01 Australia         0
## 10 2020-01-01 Australia         0
## # ℹ 12,908 more rows

#step2: make a stacked area graph

#Reference: R Graphics Cookbook, recipe 4.7 "Making a Stacked Area Graph"
#https://r-graphics.org/recipe-line-graph-stacked-area

p1<-ggplot(Final_data_p1, aes(x = date, y = new_cases, fill = Country)) +
  geom_area()+ ggtitle("Task-1-p1-Stacked Area Graph- New Cases for Each Country")
p1

## Warning: Removed 276 rows containing non-finite values (`stat_align()`).
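
The task asks for one graph per country, which a stacked area chart does not show directly. A possible alternative is a line plot with one panel per country. The sketch below uses made-up data in the same shape as Final_data_p1; the object names and the choice of free y-axis scales are assumptions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data in the same shape as Final_data_p1
toy <- tibble(
  date = rep(as.Date("2020-03-01") + 0:4, times = 2),
  Country = rep(c("X", "Y"), each = 5),
  new_cases = c(1, 3, 6, 4, 2, 0, 2, 2, 5, 9)
)

p_by_country <- toy %>%
  distinct() %>%                               # drop repeated rows before plotting
  ggplot(aes(x = date, y = new_cases)) +
  geom_line() +
  facet_wrap(~ Country, scales = "free_y") +   # one panel per country
  labs(title = "New cases per day, by country",
       x = "Date", y = "New cases")

p_by_country
```

Free y-axis scales make each country's trend readable even when case counts differ by orders of magnitude, at the cost of making panels harder to compare in absolute terms.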

Section2: Write an exploratory statement for your findings, you should include:

Q1. Some research into the events surrounding significant points on the plots (rather than simply describing your figures). For example, an investigation into events that may have caused the peak or a sharp reduction in cases in a country.

My Answers A1:

-New cases start in China in early January, followed by a sharp drop there in February-March.
-Globally, new cases grow rapidly from February-March until early April.
-Overall, there are major drops in May for many countries.
-From mid-to-late June, new cases bounce back quickly to a new peak in July.

Q2.Is the trend similar among different countries?

My Answers A2:

-Overall, the trend varies among countries. Most countries follow a broadly similar growth-peak-decline pattern from March to May, but the patterns differ noticeably in January-February and June-July.

-Between January and early February, new cases are recorded in China but hardly anywhere else. China then shows a sharp drop in February-March. From March onwards, the US accounts for a larger share of the world's new cases than China.

-Some countries, e.g. Spain, show a steady drop from late May and early June.

-Unlike the rest of the world, the UK drops sharply in July and even records negative new cases. A negative daily count cannot be a real number of infections, so it most likely reflects a correction to previously reported figures.

Q3. What is your understanding of the changes in the trend?

My Answers A3:

-The earliest reported Covid cases were in China, which is why China has the earliest records in this analysis.

-It would be interesting to investigate what caused the sharp drop in China (February), the negative new cases in the UK (July), and the steady drop in countries like Spain since May.

An exploratory statement: an investigation into the series of events that may have caused the rise in new cases, the sharp drop in China in February, the sharp drop in the UK in July, and the steady drop in countries like Spain since May.

Part2: Next, we want to observe the relation between the death rate or new case numbers with the GDP of each country. You are expected to:

My Answers:

1.Calculate the mean GDP of all the countries.

#https://sparkbyexamples.com/r-programming/calculate-mean-or-average-in-r/

mean_gdp_of_all<-mean(Final_data$gdp_per_capita,na.rm=TRUE)
print(mean_gdp_of_all)

## [1] 27362.19
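
Taking mean() over every row of the joined table weights each country by its number of rows (daily records plus any rows repeated by the join). If the aim is the average across countries, one option is to reduce the data to one row per country first. The sketch below uses made-up numbers; `toy`, `mean_over_rows` and `mean_over_countries` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: country X has three rows, Z has no GDP value
toy <- tibble(
  Country = c("X", "X", "X", "Y", "Z"),
  gdp_per_capita = c(10000, 10000, 10000, 40000, NA)
)

# Row-weighted mean: country X counts three times
mean_over_rows <- mean(toy$gdp_per_capita, na.rm = TRUE)

# Country-level mean: each country counts once
mean_over_countries <- toy %>%
  distinct(Country, gdp_per_capita) %>%
  summarise(mean_gdp = mean(gdp_per_capita, na.rm = TRUE)) %>%
  pull(mean_gdp)

print(mean_over_rows)
print(mean_over_countries)
```

The two results differ whenever countries contribute unequal numbers of rows, so the choice affects where the High/Low GDP threshold falls.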

2&3.Add another column in your joined dataset as ‘GDP_Status’. Determine whether each country has a GDP:
-->higher than or equal to the average, or
-->lower than the average.

My Answer:

#Default every row to "High_GDP", then relabel rows below the mean computed above
Final_data$GDP_Status <- "High_GDP"
Final_data$GDP_Status[Final_data$gdp_per_capita < mean_gdp_of_all] <- "Low_GDP"

#Verify that the conditional column was added correctly
unique(Final_data$GDP_Status)

## [1] "High_GDP" "Low_GDP"
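
With the assignment above, rows whose gdp_per_capita is NA keep the initial "High_GDP" label, because the comparison returns NA for them and they are never relabelled. A possible variant that leaves unknown GDP as NA is sketched below with dplyr's case_when(); `toy` and `mean_gdp` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data with one unknown GDP value
toy <- tibble(
  Country = c("X", "Y", "Z"),
  gdp_per_capita = c(10000, 40000, NA)
)

# Threshold computed from the data rather than typed in by hand
mean_gdp <- mean(toy$gdp_per_capita, na.rm = TRUE)

toy <- toy %>%
  mutate(GDP_Status = case_when(
    is.na(gdp_per_capita) ~ NA_character_,      # unknown GDP stays unknown
    gdp_per_capita >= mean_gdp ~ "High_GDP",    # higher than or equal to the mean
    TRUE ~ "Low_GDP"                            # everything else is below the mean
  ))

print(toy)
```

Whether unknown GDP should be NA or dropped altogether depends on how the later analysis treats those rows.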

Final_data

## # A tibble: 12,918 × 12
##    Country   date       total_cases new_cases total_deaths new_deaths
##    <chr>     <date>           <dbl>     <dbl>        <dbl>      <dbl>
##  1 Australia 2019-12-31           0         0            0          0
##  2 Australia 2019-12-31           0         0            0          0
##  3 Australia 2019-12-31           0         0            0          0
##  4 Australia 2019-12-31           0         0            0          0
##  5 Australia 2019-12-31           0         0            0          0
##  6 Australia 2019-12-31           0         0            0          0
##  7 Australia 2019-12-31           0         0            0          0
##  8 Australia 2019-12-31           0         0            0          0
##  9 Australia 2020-01-01           0         0            0          0
## 10 Australia 2020-01-01           0         0            0          0
## # ℹ 12,908 more rows
## # ℹ 6 more variables: gdp_per_capita <dbl>, population <dbl>,
## #   lockdown_date <date>, total_doses_administered <dbl>,
## #   P_of_population_fully_vaccinated <dbl>, GDP_Status <chr>
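One detail of the labelling step is worth illustrating: it assigns "High_GDP" to every row first and then overwrites the rows below the mean, so a row whose gdp_per_capita is NA would keep the default label. The sketch below is a minimal base R alternative on made-up toy data (the `toy` data frame and its values are assumptions for illustration, not part of Final_data); it uses `ifelse()`, which leaves unknown GDP as NA.

```r
# Toy data standing in for Final_data (values are invented for illustration).
toy <- data.frame(
  Country        = c("A", "B", "C", "D"),
  gdp_per_capita = c(50000, 10000, NA, 30000)
)

# Use the computed mean instead of a rounded, hard-coded number.
mean_gdp <- mean(toy$gdp_per_capita, na.rm = TRUE)

# ifelse() propagates NA, so a country with unknown GDP is not
# silently labelled "High_GDP".
toy$GDP_Status <- ifelse(toy$gdp_per_capita >= mean_gdp, "High_GDP", "Low_GDP")

print(toy)
```

Whether unknown GDP should stay NA or be dropped is a judgment call; the point is only that the label then reflects the data rather than the default.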

4.Calculate the daily infected case rate and daily death rate (new_case or new_death divided by population in each day for each country).

My Answer:

names(Final_data)

##  [1] "Country"                          "date"                            
##  [3] "total_cases"                      "new_cases"                       
##  [5] "total_deaths"                     "new_deaths"                      
##  [7] "gdp_per_capita"                   "population"                      
##  [9] "lockdown_date"                    "total_doses_administered"        
## [11] "P_of_population_fully_vaccinated" "GDP_Status"

4.1 The daily infected case rate

Final_data$Daily_infected_case_rate <-Final_data$new_cases/Final_data$population * 100
Final_data

## # A tibble: 12,918 × 13
##    Country   date       total_cases new_cases total_deaths new_deaths
##    <chr>     <date>           <dbl>     <dbl>        <dbl>      <dbl>
##  1 Australia 2019-12-31           0         0            0          0
##  2 Australia 2019-12-31           0         0            0          0
##  3 Australia 2019-12-31           0         0            0          0
##  4 Australia 2019-12-31           0         0            0          0
##  5 Australia 2019-12-31           0         0            0          0
##  6 Australia 2019-12-31           0         0            0          0
##  7 Australia 2019-12-31           0         0            0          0
##  8 Australia 2019-12-31           0         0            0          0
##  9 Australia 2020-01-01           0         0            0          0
## 10 Australia 2020-01-01           0         0            0          0
## # ℹ 12,908 more rows
## # ℹ 7 more variables: gdp_per_capita <dbl>, population <dbl>,
## #   lockdown_date <date>, total_doses_administered <dbl>,
## #   P_of_population_fully_vaccinated <dbl>, GDP_Status <chr>,
## #   Daily_infected_case_rate <dbl>

4.2 The daily death rate

Final_data$Daily_death_rate <-Final_data$new_deaths/Final_data$population * 100
Final_data

## # A tibble: 12,918 × 14
##    Country   date       total_cases new_cases total_deaths new_deaths
##    <chr>     <date>           <dbl>     <dbl>        <dbl>      <dbl>
##  1 Australia 2019-12-31           0         0            0          0
##  2 Australia 2019-12-31           0         0            0          0
##  3 Australia 2019-12-31           0         0            0          0
##  4 Australia 2019-12-31           0         0            0          0
##  5 Australia 2019-12-31           0         0            0          0
##  6 Australia 2019-12-31           0         0            0          0
##  7 Australia 2019-12-31           0         0            0          0
##  8 Australia 2019-12-31           0         0            0          0
##  9 Australia 2020-01-01           0         0            0          0
## 10 Australia 2020-01-01           0         0            0          0
## # ℹ 12,908 more rows
## # ℹ 8 more variables: gdp_per_capita <dbl>, population <dbl>,
## #   lockdown_date <date>, total_doses_administered <dbl>,
## #   P_of_population_fully_vaccinated <dbl>, GDP_Status <chr>,
## #   Daily_infected_case_rate <dbl>, Daily_death_rate <dbl>

names(Final_data)

##  [1] "Country"                          "date"                            
##  [3] "total_cases"                      "new_cases"                       
##  [5] "total_deaths"                     "new_deaths"                      
##  [7] "gdp_per_capita"                   "population"                      
##  [9] "lockdown_date"                    "total_doses_administered"        
## [11] "P_of_population_fully_vaccinated" "GDP_Status"                      
## [13] "Daily_infected_case_rate"         "Daily_death_rate"
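Both rate columns are multiplied by 100, so they are percentages of the population per day. As a small sketch on invented toy data (the `toy` data frame and the `cases_per_100k` column are assumptions added here for illustration), the same calculation can be checked by hand, and a per-100,000 scale is shown alongside because it is often easier to read for small rates.

```r
# Toy data standing in for Final_data (values are invented for illustration).
toy <- data.frame(
  Country    = c("A", "A", "B"),
  new_cases  = c(10, 20, 5),
  new_deaths = c(1, 0, NA),
  population = c(1000, 1000, 500)
)

# Same formulas as in the report: daily count divided by population, in percent.
toy$Daily_infected_case_rate <- toy$new_cases / toy$population * 100
toy$Daily_death_rate         <- toy$new_deaths / toy$population * 100

# Alternative scale (an addition, not in the report): cases per 100,000 people.
toy$cases_per_100k <- toy$new_cases / toy$population * 1e5

print(toy)
```

A missing count stays NA in the rate column, which is why the later plots warn about removed rows.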

5.Create a plot to show the relationship between the daily infected case rate within different GDP groups (greater or equal to the average, or less than the average).

p2 <- ggplot(Final_data, aes(x =GDP_Status , y =Daily_infected_case_rate , fill =GDP_Status)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette="Dark2")+ ggtitle("Task-2-p2- Daily infected case rate and GDP status")
p2

## Warning: Removed 276 rows containing missing values (`position_stack()`).

The daily infected case rate is higher in the High GDP group than in the Low GDP group.

6.Create a plot to show the relationship between the daily death rate within different GDP groups (greater or equal to the average, or less than the average).

p3 <- ggplot(Final_data, aes(x =GDP_Status , y =Daily_death_rate , fill =GDP_Status)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette="Dark2")+ ggtitle("Task-2-p3- Daily death rate and GDP status")
p3

## Warning: Removed 276 rows containing missing values (`position_stack()`).

The daily death rate is higher in the High GDP group than in the Low GDP group.

Interpret and justify your findings.

Q: Are the number of newly infected cases and death cases higher in the High GDP group or the Low GDP group? And why?

Interpretation: Both the daily infected case rate and the daily death rate are higher in the High GDP group than in the Low GDP group.

Yes, new infections and deaths are higher in the High GDP group. Because the comparison uses rates (new cases or new deaths divided by population), differences in population size between the two groups, i.e. the denominators, are already accounted for.

For verification, see the raw counts in p5 and p6 below.

p5 <- ggplot(Final_data, aes(x =GDP_Status , y =new_cases , fill =GDP_Status)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette="Dark2")+ ggtitle("Task-2-p5- No. of newly infected cases and GDP status")
p5

## Warning: Removed 276 rows containing missing values (`position_stack()`).

p6 <- ggplot(Final_data, aes(x =GDP_Status , y =new_deaths , fill =GDP_Status)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette="Dark2")+ ggtitle("Task-2-p6- No. of new deaths and GDP status")
p6

## Warning: Removed 276 rows containing missing values (`position_stack()`).
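A note on reading the bar charts above: geom_bar(stat = "identity", position = "stack") stacks the value of every row, so each bar's height is the sum over all rows in that group and depends on how many rows the group has. The sketch below uses invented toy data (the `toy` data frame is an assumption for illustration) to show that the group sum and the group mean can rank the two groups differently, so the mean per group may be a useful cross-check.

```r
# Toy data: the High_GDP group has more rows but smaller individual rates.
toy <- data.frame(
  GDP_Status = c("High_GDP", "High_GDP", "High_GDP", "High_GDP", "Low_GDP"),
  Daily_infected_case_rate = c(0.01, 0.02, 0.03, NA, 0.04)
)

# aggregate() with a formula drops rows with NA by default (na.action = na.omit).
group_sum  <- aggregate(Daily_infected_case_rate ~ GDP_Status, data = toy, FUN = sum)
group_mean <- aggregate(Daily_infected_case_rate ~ GDP_Status, data = toy, FUN = mean)

print(group_sum)   # what a stacked identity bar chart displays
print(group_mean)  # average daily rate per group
```

In this toy example the stacked total is larger for High_GDP while the mean is larger for Low_GDP. This does not say which holds for Final_data; it only shows why checking the mean is worthwhile.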

Task 3: In this section, you will need to analyse and interpret the feature correlations among diversified attributes. Feature causation and correlation concepts were covered in Module 3.

We want to explore the relationships among newly infected cases, new deaths, total doses administered, and % of population fully vaccinated.

Plot the correlation matrix for those attributes, and justify your findings:

My Answers :

First, prepare the data.

names(Final_data)

##  [1] "Country"                          "date"                            
##  [3] "total_cases"                      "new_cases"                       
##  [5] "total_deaths"                     "new_deaths"                      
##  [7] "gdp_per_capita"                   "population"                      
##  [9] "lockdown_date"                    "total_doses_administered"        
## [11] "P_of_population_fully_vaccinated" "GDP_Status"                      
## [13] "Daily_infected_case_rate"         "Daily_death_rate"

Plot the correlation matrix for the below variables: “new_cases”,“new_deaths”,“total doses_administered”,and “%_of_population_fully_vaccinated”.

library(tidyverse)

#Step 1. Prepare a data frame with only the variables relevant to the analysis: "new_cases", "new_deaths", "total doses administered", and "% of population fully vaccinated"

Final_data_correlation <-select(Final_data,new_cases,new_deaths,total_doses_administered,P_of_population_fully_vaccinated)
Final_data_correlation

## # A tibble: 12,918 × 4
##    new_cases new_deaths total_doses_administered P_of_population_fully_vaccina…¹
##        <dbl>      <dbl>                    <dbl>                           <dbl>
##  1         0          0                 57988175                              86
##  2         0          0                 57988175                              86
##  3         0          0                 57988175                              86
##  4         0          0                 57988175                              86
##  5         0          0                 57988175                              86
##  6         0          0                 57988175                              86
##  7         0          0                 57988175                              86
##  8         0          0                 57988175                              86
##  9         0          0                 57988175                              86
## 10         0          0                 57988175                              86
## # ℹ 12,908 more rows
## # ℹ abbreviated name: ¹P_of_population_fully_vaccinated

#Correlation Matrix Plot for the above dataframe with 4 selected variables

#Step1: Create the Correlation Matrix Table

Final_data_correlation.cor = cor(Final_data_correlation)

library("Hmisc")

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:simputation':
## 
##     impute

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

Final_data_correlation.rcorr = rcorr(as.matrix(Final_data_correlation))
Final_data_correlation.rcorr

##                                  new_cases new_deaths total_doses_administered
## new_cases                             1.00       0.58                     0.65
## new_deaths                            0.58       1.00                     0.44
## total_doses_administered              0.65       0.44                     1.00
## P_of_population_fully_vaccinated     -0.45      -0.35                    -0.25
##                                  P_of_population_fully_vaccinated
## new_cases                                                   -0.45
## new_deaths                                                  -0.35
## total_doses_administered                                    -0.25
## P_of_population_fully_vaccinated                             1.00
## 
## n
##                                  new_cases new_deaths total_doses_administered
## new_cases                            12642      12642                     4294
## new_deaths                           12642      12642                     4294
## total_doses_administered              4294       4294                     4490
## P_of_population_fully_vaccinated      4294       4294                     4490
##                                  P_of_population_fully_vaccinated
## new_cases                                                    4294
## new_deaths                                                   4294
## total_doses_administered                                     4490
## P_of_population_fully_vaccinated                             4490
## 
## P
##                                  new_cases new_deaths total_doses_administered
## new_cases                                   0          0                      
## new_deaths                        0                    0                      
## total_doses_administered          0         0                                 
## P_of_population_fully_vaccinated  0         0          0                      
##                                  P_of_population_fully_vaccinated
## new_cases                         0                              
## new_deaths                        0                              
## total_doses_administered          0                              
## P_of_population_fully_vaccinated

Clean the dataset for correlation analysis: Final_data_correlation (drop missing values). I dropped the missing values instead of replacing them with the mean or with k-nearest neighbours, because these missing values appeared after joining the datasets. Replacing them with any value would be inaccurate, since not all of them represent truly missing records, e.g. not all countries had a lockdown, and not all countries had vaccine records for the same period of time. Dropping is therefore the most appropriate technique here for accuracy.

#Drop the rows that contain NA in any variable (I don't think replacing with the mean or k-nearest neighbours is a good idea, since these NAs appeared after I joined the 3 datasets)

#Import the tidyr package                 
library("tidyr")

#Remove rows with NA's using drop_na()
Final_data_correlation_clean <- Final_data_correlation %>% drop_na()

#Load the dplyr package                      
library("dplyr") 

#Remove rows that contain only NA's (a no-op after drop_na(), kept as a safeguard)
Final_data_correlation_clean <- filter(Final_data_correlation_clean, rowSums(is.na(Final_data_correlation_clean)) != ncol(Final_data_correlation_clean))

library(simputation)
glimpse_na(Final_data_correlation_clean)

## 
## No NA's.
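The first rcorr() output reported a different n for each pair of variables (pairwise deletion), whereas the cleaned data frame keeps only rows that are complete in all four variables. The sketch below, on invented toy data in base R (the `toy` data frame is an assumption for illustration), shows that the two approaches can give different coefficients for the same pair, and that dropping incomplete rows first is equivalent to `use = "complete.obs"`.

```r
# Toy data: z is missing in the first two rows.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),
  z = c(NA, NA, 1, 3, 2, 5)
)

# Pairwise deletion: each pair uses every row where both values are present.
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise deletion: only rows complete in all columns are used.
r_complete <- cor(toy, use = "complete.obs")

# Dropping incomplete rows first gives the same matrix as "complete.obs".
r_dropped <- cor(na.omit(toy))

print(round(r_pairwise, 3))
print(round(r_complete, 3))
```

This is why a coefficient can shift slightly between the first matrix and the one computed after dropping rows.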

# correlation for all variables
round(cor(Final_data_correlation_clean),
  digits = 3 # rounded to 3 decimals
)

##                                  new_cases new_deaths total_doses_administered
## new_cases                            1.000      0.609                    0.651
## new_deaths                           0.609      1.000                    0.441
## total_doses_administered             0.651      0.441                    1.000
## P_of_population_fully_vaccinated    -0.451     -0.354                   -0.819
##                                  P_of_population_fully_vaccinated
## new_cases                                                  -0.451
## new_deaths                                                 -0.354
## total_doses_administered                                   -0.819
## P_of_population_fully_vaccinated                            1.000

#Step1: Create the Correlation Matrix Table

Final_data_correlation_clean.cor = cor(Final_data_correlation_clean)

library("Hmisc")

Final_data_correlation_clean.rcorr = rcorr(as.matrix(Final_data_correlation_clean))
Final_data_correlation_clean.rcorr

##                                  new_cases new_deaths total_doses_administered
## new_cases                             1.00       0.61                     0.65
## new_deaths                            0.61       1.00                     0.44
## total_doses_administered              0.65       0.44                     1.00
## P_of_population_fully_vaccinated     -0.45      -0.35                    -0.82
##                                  P_of_population_fully_vaccinated
## new_cases                                                   -0.45
## new_deaths                                                  -0.35
## total_doses_administered                                    -0.82
## P_of_population_fully_vaccinated                             1.00
## 
## n= 4294 
## 
## 
## P
##                                  new_cases new_deaths total_doses_administered
## new_cases                                   0          0                      
## new_deaths                        0                    0                      
## total_doses_administered          0         0                                 
## P_of_population_fully_vaccinated  0         0          0                      
##                                  P_of_population_fully_vaccinated
## new_cases                         0                              
## new_deaths                        0                              
## total_doses_administered          0                              
## P_of_population_fully_vaccinated

#To extract the above values from this object into a useable data structure, you can use the following syntax:
Final_data_correlation_clean.coeff = Final_data_correlation_clean.rcorr$r
Final_data_correlation_clean.p = Final_data_correlation_clean.rcorr$P

Final_data_correlation_clean.p

##                                  new_cases new_deaths total_doses_administered
## new_cases                               NA          0                        0
## new_deaths                               0         NA                        0
## total_doses_administered                 0          0                       NA
## P_of_population_fully_vaccinated         0          0                        0
##                                  P_of_population_fully_vaccinated
## new_cases                                                       0
## new_deaths                                                      0
## total_doses_administered                                        0
## P_of_population_fully_vaccinated                               NA

# improved correlation matrix
library(corrplot)

## corrplot 0.92 loaded

corrplot(cor(Final_data_correlation_clean),
  method = "number"
)

library(correlation)

correlation::correlation(Final_data_correlation_clean,
  include_factors = TRUE, method = "auto"
)

## # Correlation Matrix (auto-method)
## 
## Parameter1               |                       Parameter2 |     r |         95% CI | t(4292) |         p
## ----------------------------------------------------------------------------------------------------------
## new_cases                |                       new_deaths |  0.61 | [ 0.59,  0.63] |   50.29 | < .001***
## new_cases                |         total_doses_administered |  0.65 | [ 0.63,  0.67] |   56.22 | < .001***
## new_cases                | P_of_population_fully_vaccinated | -0.45 | [-0.47, -0.43] |  -33.08 | < .001***
## new_deaths               |         total_doses_administered |  0.44 | [ 0.42,  0.46] |   32.19 | < .001***
## new_deaths               | P_of_population_fully_vaccinated | -0.35 | [-0.38, -0.33] |  -24.80 | < .001***
## total_doses_administered | P_of_population_fully_vaccinated | -0.82 | [-0.83, -0.81] |  -93.39 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 4294
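The t column in the table above follows from r and the number of observations: for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, which is why the header reads t(4292) for 4294 observations. A minimal sketch on simulated data (the vectors `x` and `y` are assumptions for illustration) checks the formula against `cor.test()`.

```r
# Simulated data purely for illustration.
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

res <- cor.test(x, y, method = "pearson")

r <- unname(res$estimate)  # Pearson correlation coefficient
n <- length(x)

# t statistic computed by hand from r and n.
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

print(c(t_manual = t_manual, t_cor_test = unname(res$statistic)))
print(res$parameter)  # degrees of freedom, n - 2
```

Because t grows with sqrt(n - 2), a large sample makes even moderate correlations highly significant, so the size of r is the better guide to strength.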

Correlograms

## plot with statistical results
# do not edit
corrplot2 <- function(data,
                      method = "pearson",
                      sig.level = 0.05,
                      order = "original",
                      diag = FALSE,
                      type = "upper",
                      tl.srt = 90,
                      number.font = 1,
                      number.cex = 1,
                      mar = c(0, 0, 0, 0)) {
  library(corrplot)
  data_incomplete <- data
  data <- data[complete.cases(data), ]
  mat <- cor(data, method = method)
  cor.mtest <- function(mat, method) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        tmp <- cor.test(mat[, i], mat[, j], method = method)
        p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      }
    }
    colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
    p.mat
  }
  p.mat <- cor.mtest(data, method = method)
  col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
  corrplot(mat,
    method = "color", col = col(200), number.font = number.font,
    mar = mar, number.cex = number.cex,
    type = type, order = order,
    addCoef.col = "black", # add correlation coefficient
    tl.col = "black", tl.srt = tl.srt, # rotation of text labels
    # combine with significance level
    p.mat = p.mat, sig.level = sig.level, insig = "blank",
    # hide correlation coefficients on the diagonal
    diag = diag
  )
}

# edit from here
corrplot2(
  data =Final_data_correlation_clean ,
  method = "pearson",
  sig.level = 0.05,
  order = "original",
  diag = FALSE,
  type = "upper",
  tl.srt = 75
)

View Correlation in pairs among these variables

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggstatsplot)

## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167

ggpairs(Final_data_correlation_clean)

Q1.Which features are strongly correlated with each other? What hypotheses can you draw from the results?

Answer to Q1.

*Analysing the correlations in pairs: Looking only at the pairwise results, “total_doses_administered” and “% of population fully vaccinated” are strongly negatively correlated, with r = -0.82 (the closest to r = -1) and t = -93.39 (the furthest from 0). The smaller the % of the population fully vaccinated, the larger total_doses_administered. This makes logical and practical sense: more doses were being administered to turn a small % of full vaccination into a larger one during the initial phases in 2020, when the data was recorded.

*Analysing the correlation matrix as a whole to form a hypothesis (prediction): All 4 variables appear to be correlated with one another: “P_of_population_fully_vaccinated”, “total_doses_administered”, “new_deaths” and “new_cases”.

“new_cases” is quite strongly positively correlated with “total_doses_administered” (r = 0.65, t = 56.22), somewhat strongly positively correlated with “new_deaths” (r = 0.61, t = 50.29), and moderately negatively correlated with “% of population fully vaccinated” (r = -0.45, t = -33.08). In addition, “total_doses_administered” is very strongly negatively correlated with “% of population fully vaccinated” (r = -0.82). Together, these 4 variables lead to the hypothesis statement below.

Hypothesis statement

I can formulate a multiple linear regression model (the hypothesis) with 4 variables:

DV = new_cases, IV1 = total_doses_administered, IV2 = % of population fully vaccinated, and IV3 = new_deaths

-Multiple_linear_regression.1-

               new_cases ~ total_doses_administered + P_of_population_fully_vaccinated + new_deaths

“new_cases” can be predicted by “total_doses_administered”,“% of population fully vaccinated” and “new_deaths”.

regression.1 <- lm( new_cases ~ total_doses_administered + P_of_population_fully_vaccinated + new_deaths,  
                     data = Final_data_correlation_clean )

print( regression.1 )

## 
## Call:
## lm(formula = new_cases ~ total_doses_administered + P_of_population_fully_vaccinated + 
##     new_deaths, data = Final_data_correlation_clean)
## 
## Coefficients:
##                      (Intercept)          total_doses_administered  
##                       -2.273e+04                         3.142e-05  
## P_of_population_fully_vaccinated                        new_deaths  
##                        2.370e+02                         6.509e+00

summary( regression.1)

## 
## Call:
## lm(formula = new_cases ~ total_doses_administered + P_of_population_fully_vaccinated + 
##     new_deaths, data = Final_data_correlation_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13847   -595    165    550  49460 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -2.273e+04  1.472e+03  -15.45   <2e-16 ***
## total_doses_administered          3.142e-05  8.453e-07   37.17   <2e-16 ***
## P_of_population_fully_vaccinated  2.370e+02  1.706e+01   13.89   <2e-16 ***
## new_deaths                        6.509e+00  1.824e-01   35.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3256 on 4290 degrees of freedom
## Multiple R-squared:  0.5718, Adjusted R-squared:  0.5715 
## F-statistic:  1910 on 3 and 4290 DF,  p-value: < 2.2e-16

plot( regression.1)

Hypothesis statement (fitted model)

Using the estimated coefficients, the multiple linear regression model (DV = new_cases, IV1 = total_doses_administered, IV2 = % of population fully vaccinated, IV3 = new_deaths) is:

new_cases = -2.273e+04 + 3.142e-05 * total_doses_administered + 2.370e+02 * P_of_population_fully_vaccinated + 6.509e+00 * new_deaths
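To show how such a fitted equation is used, the sketch below fits a small linear model on R's built-in mtcars data (a stand-in chosen only so the example runs anywhere; the variables `mpg`, `wt` and `hp` have nothing to do with the COVID data) and confirms that `predict()` returns the intercept plus each coefficient times its predictor value.

```r
# Built-in mtcars data is a stand-in for Final_data_correlation_clean.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# A hypothetical new observation to predict for.
new_obs <- data.frame(wt = 3, hp = 110)

# Prediction by hand: intercept + coefficient * value for each predictor.
b <- coef(fit)
manual <- b["(Intercept)"] + b["wt"] * new_obs$wt + b["hp"] * new_obs$hp

# Prediction from the model object.
pred <- predict(fit, newdata = new_obs)

print(c(manual = unname(manual), predict = unname(pred)))
```

The same pattern would apply to regression.1 with a data frame holding total_doses_administered, P_of_population_fully_vaccinated and new_deaths.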

Q2.Which features are the least influential with each other? Are the results surprising to you? Please explain.

ANSWERS TO Q2:

Correlograms visualization

# edit from here
corrplot2(
  data =Final_data_correlation_clean ,
  method = "pearson",
  sig.level = 0.05,
  order = "original",
  diag = FALSE,
  type = "upper",
  tl.srt = 75
)

The correlogram above shows that “new_deaths” and “% of population fully vaccinated” are the least influential on each other, with r = -0.35 (the furthest from r = -1 or 1) and t = -24.80 (the closest to 0). They are negatively correlated, but more weakly than any other pair.

Before doing this assessment I expected vaccination to be quite strongly correlated with a falling number of deaths, so I was a bit surprised. The results show that the % of population fully vaccinated is only moderately negatively correlated with new_deaths, less strongly than with new_cases, and in the regression below total_doses_administered is not even a significant predictor of new_deaths (p = 0.164).

I then ran a second multiple linear regression with new_deaths as the dependent variable. It yields a very small t value and a non-significant p value for total_doses_administered (t = -1.391, p = 0.164).

-Multiple_linear_regression.2-

new_deaths ~ total_doses_administered + P_of_population_fully_vaccinated + new_cases

regression.2 <- lm( new_deaths ~ total_doses_administered + P_of_population_fully_vaccinated + new_cases,  
                     data = Final_data_correlation_clean )

print( regression.2 )

## 
## Call:
## lm(formula = new_deaths ~ total_doses_administered + P_of_population_fully_vaccinated + 
##     new_cases, data = Final_data_correlation_clean)
## 
## Coefficients:
##                      (Intercept)          total_doses_administered  
##                        6.778e+02                        -9.939e-08  
## P_of_population_fully_vaccinated                         new_cases  
##                       -7.362e+00                         3.518e-02

summary( regression.2)

## 
## Call:
## lm(formula = new_deaths ~ total_doses_administered + P_of_population_fully_vaccinated + 
##     new_cases, data = Final_data_correlation_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1940.2   -81.7   -39.7   -37.5  3742.3 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       6.778e+02  1.107e+02   6.124 9.96e-10 ***
## total_doses_administered         -9.939e-08  7.144e-08  -1.391    0.164    
## P_of_population_fully_vaccinated -7.362e+00  1.277e+00  -5.763 8.84e-09 ***
## new_cases                         3.518e-02  9.856e-04  35.692  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239.4 on 4290 degrees of freedom
## Multiple R-squared:  0.379,  Adjusted R-squared:  0.3786 
## F-statistic: 872.9 on 3 and 4290 DF,  p-value: < 2.2e-16

plot(regression.2)

Conclusion:

So far, it seems that in this dataset the number of vaccine doses administered and the share of the population fully vaccinated have less influence on reducing new_deaths than on reducing new_cases. Vaccination and full-dose coverage might be effective at curbing the spread, but comparatively less so at preventing deaths.

Limitation:

Within the scope of this assessment it is far too early to draw a conclusion. I have not yet done model checking, ANOVA, or performance evaluation, so the findings are not conclusive and remain open to further analysis.
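As a rough sketch of what those follow-up checks could look like, the code below uses R's built-in mtcars data as a stand-in (the models `small` and `full` are assumptions for illustration, not the regressions fitted above). It compares two nested models with `anova()` and computes a variance inflation factor by hand, which is relevant here because two of the predictors in the report correlate at r = -0.82.

```r
# Built-in mtcars data is a stand-in for Final_data_correlation_clean.
small <- lm(mpg ~ wt, data = mtcars)
full  <- lm(mpg ~ wt + hp, data = mtcars)

# F-test for nested models: does adding hp reduce the residual error significantly?
cmp <- anova(small, full)
print(cmp)

# Variance inflation factor for wt: 1 / (1 - R^2), where R^2 comes from
# regressing wt on the other predictor(s). Values well above 1 signal
# collinear predictors and unstable coefficient estimates.
r2_wt  <- summary(lm(wt ~ hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
print(vif_wt)
```

The residual plots from plot(regression.1) and plot(regression.2) would complete this kind of check.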

Task 4: Linear/Polynomial Regression Model

Regression models were covered in Module 4; this task tests your ability to fit a model and make predictions. You are expected to:

1.From the above tasks 1, 2, 3, and 4, choose a potential country or a GDP group for further analysis

My choice: High GDP group
Final_data
DV = new_cases
IV1 = date
IV2 = P_of_population_fully_vaccinated
IV3 = total_doses_administered

2.Plot your choice and draw a polynomial / linear line for your selection

#step1 dataset preparation 

Final_data_task4 <-select(Final_data,date,GDP_Status,new_cases,total_doses_administered,P_of_population_fully_vaccinated)

Final_data_task4

## # A tibble: 12,918 × 5
##    date       GDP_Status new_cases total_doses_administ…¹ P_of_population_full…²
##    <date>     <chr>          <dbl>                  <dbl>                  <dbl>
##  1 2019-12-31 High_GDP           0               57988175                     86
##  2 2019-12-31 High_GDP           0               57988175                     86
##  3 2019-12-31 High_GDP           0               57988175                     86
##  4 2019-12-31 High_GDP           0               57988175                     86
##  5 2019-12-31 High_GDP           0               57988175                     86
##  6 2019-12-31 High_GDP           0               57988175                     86
##  7 2019-12-31 High_GDP           0               57988175                     86
##  8 2019-12-31 High_GDP           0               57988175                     86
##  9 2020-01-01 High_GDP           0               57988175                     86
## 10 2020-01-01 High_GDP           0               57988175                     86
## # ℹ 12,908 more rows
## # ℹ abbreviated names: ¹total_doses_administered,
## #   ²P_of_population_fully_vaccinated

#filter only the relevant GDP group
df_task4<-Final_data_task4 %>% 
  filter(GDP_Status == "High_GDP")
df_task4

## # A tibble: 6,220 × 5
##    date       GDP_Status new_cases total_doses_administ…¹ P_of_population_full…²
##    <date>     <chr>          <dbl>                  <dbl>                  <dbl>
##  1 2019-12-31 High_GDP           0               57988175                     86
##  2 2019-12-31 High_GDP           0               57988175                     86
##  3 2019-12-31 High_GDP           0               57988175                     86
##  4 2019-12-31 High_GDP           0               57988175                     86
##  5 2019-12-31 High_GDP           0               57988175                     86
##  6 2019-12-31 High_GDP           0               57988175                     86
##  7 2019-12-31 High_GDP           0               57988175                     86
##  8 2019-12-31 High_GDP           0               57988175                     86
##  9 2020-01-01 High_GDP           0               57988175                     86
## 10 2020-01-01 High_GDP           0               57988175                     86
## # ℹ 6,210 more rows
## # ℹ abbreviated names: ¹total_doses_administered,
## #   ²P_of_population_fully_vaccinated

#View Missing Values

glimpse_na(df_task4)

## 
## na count: 4382

##                            columns  nNA
## 1 P_of_population_fully_vaccinated 1915
## 2         total_doses_administered 1915
## 3                        new_cases  276
## 4                             date  276

#Replace missing values with the mean value of each column (so that no rows are dropped before modelling)

library(naniar)
library(simputation)
library(tidyverse)

df_task4_clean <- df_task4 %>%
  impute_mean_at(vars(date,
                      new_cases,total_doses_administered,P_of_population_fully_vaccinated))
df_task4_clean

## # A tibble: 6,220 × 5
##    date       GDP_Status new_cases total_doses_administ…¹ P_of_population_full…²
##    <date>     <chr>          <dbl>                  <dbl>                  <dbl>
##  1 2019-12-31 High_GDP           0               57988175                     86
##  2 2019-12-31 High_GDP           0               57988175                     86
##  3 2019-12-31 High_GDP           0               57988175                     86
##  4 2019-12-31 High_GDP           0               57988175                     86
##  5 2019-12-31 High_GDP           0               57988175                     86
##  6 2019-12-31 High_GDP           0               57988175                     86
##  7 2019-12-31 High_GDP           0               57988175                     86
##  8 2019-12-31 High_GDP           0               57988175                     86
##  9 2020-01-01 High_GDP           0               57988175                     86
## 10 2020-01-01 High_GDP           0               57988175                     86
## # ℹ 6,210 more rows
## # ℹ abbreviated names: ¹total_doses_administered,
## #   ²P_of_population_fully_vaccinated

glimpse_na(df_task4_clean)

## 
## No NA's.
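To make the imputation step concrete, here is a minimal base R sketch of mean imputation. The data frame `toy` and the helper `impute_mean()` are made up for illustration and are not part of the assignment data or of naniar; the sketch only shows the idea of replacing each NA with the mean of the observed values in that column.

```r
# Toy data standing in for df_task4 (values are invented for illustration)
toy <- data.frame(
  date      = as.Date(c("2020-01-01", "2020-01-02", NA, "2020-01-04")),
  new_cases = c(10, NA, 30, 50)
)

# Hypothetical helper: replace NA with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_clean <- toy
toy_clean$new_cases <- impute_mean(toy_clean$new_cases)
# Dates are stored as day counts, so the "mean date" is the average day
toy_clean$date <- impute_mean(toy_clean$date)

sum(is.na(toy_clean))  # number of missing values left after imputation
```

One caveat with this technique: an imputed mean date is not a real observation date, so dropping the rows with a missing date could be a reasonable alternative for the date column.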

#Now fit a preliminary linear model to estimate the coefficient and check the relationship between the relevant variables

used objects:

- df_task4_clean

- DV = new_cases

- IV1 = date

unused objects:

- IV2= total_doses_administered

- IV3= P_of_population_fully_vaccinated

Task4_Polynomial_degree1<- lm( new_cases ~ date,  
                     data = df_task4_clean )

print(Task4_Polynomial_degree1)

## 
## Call:
## lm(formula = new_cases ~ date, data = df_task4_clean)
## 
## Coefficients:
## (Intercept)         date  
##  -218079.09        11.95

summary(Task4_Polynomial_degree1)

## 
## Call:
## lm(formula = new_cases ~ date, data = df_task4_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32148  -1374   -701   -211  64107 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.181e+05  1.963e+04  -11.11   <2e-16 ***
## date         1.195e+01  1.069e+00   11.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4520 on 6218 degrees of freedom
## Multiple R-squared:  0.01971,    Adjusted R-squared:  0.01955 
## F-statistic:   125 on 1 and 6218 DF,  p-value: < 2.2e-16

#Adjusted R-squared:  0.01955
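The large negative intercept can look odd at first. A short sketch, assuming the rounded coefficients printed above, shows where it comes from: R stores a Date as the number of days since 1970-01-01, so the model multiplies the slope by a day number of roughly 18,000. The variable names here are illustrative only.

```r
# R stores a Date as the number of days since 1970-01-01
day_number <- as.numeric(as.Date("2019-12-31"))

# Rounded coefficients copied from the print(Task4_Polynomial_degree1) output
intercept <- -218079.09
slope     <- 11.95

# Fitted new_cases for that day: intercept + slope * day_number
fitted_value <- intercept + slope * day_number
fitted_value
```

With these rounded values, the fitted number of new cases on the first day of the data is about 140, and the slope means the line rises by about 12 new cases per calendar day.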

Preliminary Polynomial degree=1 Plot

ggplot(df_task4_clean, aes(date, new_cases) ) + geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, 1, raw = TRUE)) + ggtitle("Task4_Preliminary Polynomial_degree1")

plot(Task4_Polynomial_degree1)

Make predictions on the newly infected case number or death cases for the next five to seven days.

3.1 Apply training and testing splits to evaluate your predictions. Which training and testing split ratio (e.g., 7:3, 8:2, 9:1) have you followed?

# Importing required library
library(tidyverse)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

theme_set(theme_classic())
 
# Load the data
df_task4_clean

## # A tibble: 6,220 × 5
##    date       GDP_Status new_cases total_doses_administ…¹ P_of_population_full…²
##    <date>     <chr>          <dbl>                  <dbl>                  <dbl>
##  1 2019-12-31 High_GDP           0               57988175                     86
##  2 2019-12-31 High_GDP           0               57988175                     86
##  3 2019-12-31 High_GDP           0               57988175                     86
##  4 2019-12-31 High_GDP           0               57988175                     86
##  5 2019-12-31 High_GDP           0               57988175                     86
##  6 2019-12-31 High_GDP           0               57988175                     86
##  7 2019-12-31 High_GDP           0               57988175                     86
##  8 2019-12-31 High_GDP           0               57988175                     86
##  9 2020-01-01 High_GDP           0               57988175                     86
## 10 2020-01-01 High_GDP           0               57988175                     86
## # ℹ 6,210 more rows
## # ℹ abbreviated names: ¹total_doses_administered,
## #   ²P_of_population_fully_vaccinated

# Split the data into training and test set 8:2

set.seed(123)
training.samples <- df_task4_clean$new_cases %>%
  createDataPartition(p = 0.8, list = FALSE)

train.data<- df_task4_clean[training.samples, ]
test.data<- df_task4_clean[-training.samples, ]

#Verify the split by reviewing the dimensions of each split set
dim(train.data)

## [1] 4978    5

dim(test.data)

## [1] 1242    5

# Build the model in train data

model <- lm(new_cases ~ poly(date, 1, raw = TRUE), data = train.data)
summary(model)

## 
## Call:
## lm(formula = new_cases ~ poly(date, 1, raw = TRUE), data = train.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32117  -1369   -709   -197  64141 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.126e+05  2.216e+04  -9.593   <2e-16 ***
## poly(date, 1, raw = TRUE)  1.165e+01  1.207e+00   9.654   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4555 on 4976 degrees of freedom
## Multiple R-squared:  0.01839,    Adjusted R-squared:  0.01819 
## F-statistic:  93.2 on 1 and 4976 DF,  p-value: < 2.2e-16

#Visualise the prediction model trained on the train data, with a red fitted line
ggplot(train.data, aes(date, new_cases) ) + geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, 1, raw = TRUE),aes(color="Train_line"))+ggtitle("Model_Creation_Fitted_Train_Line_Polynomial_degree1")

# Make predictions on test data
predictions <- model %>% predict(test.data)

# Model performance (extract RMSE=4375.069   and R2=0.02568286)
data.frame(RMSE = RMSE(predictions, test.data$new_cases),
           R2 = R2(predictions, test.data$new_cases))

##       RMSE         R2
## 1 4375.069 0.02568286
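For readers who want to see what these two metrics measure, the sketch below recomputes them by hand on invented numbers. It assumes caret's default for `R2()`, which is the squared correlation between predictions and observations; `observed` and `predicted` are illustrative vectors, not the assignment data.

```r
# Invented observed and predicted values (not the assignment data)
observed  <- c(0, 5, 12, 20, 31)
predicted <- c(2, 6, 10, 22, 27)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((predicted - observed)^2))

# R2 as the squared Pearson correlation between predictions and observations
r2 <- cor(predicted, observed)^2

data.frame(RMSE = rmse, R2 = r2)
```

Because a squared correlation does not change when the predictions are rescaled linearly, an R2 computed this way for a straight-line model equals the Multiple R-squared of a straight line refitted on the same test rows.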

# Visualise the test data with a fitted degree-1 line (compare its position with the line in the train data)
 
ggplot(test.data, aes(date, new_cases)) + geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, 1, raw = TRUE),aes(color="test_line"))+ggtitle("Model_Prediction_Fitted_Test_Line_Polynomial_degree1")

# Refit the same model form on the test data only, to compare its coefficients with the train fit
test_fit <- lm(new_cases ~ poly(date, 1, raw = TRUE), data = test.data)
summary(test_fit)

## 
## Call:
## lm(formula = new_cases ~ poly(date, 1, raw = TRUE), data = test.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32270  -1392   -689   -154  60368 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.397e+05  4.217e+04  -5.684 1.64e-08 ***
## poly(date, 1, raw = TRUE)  1.313e+01  2.297e+00   5.717 1.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4378 on 1240 degrees of freedom
## Multiple R-squared:  0.02568,    Adjusted R-squared:  0.0249 
## F-statistic: 32.69 on 1 and 1240 DF,  p-value: 1.356e-08

used variables are:

- df_task4_clean

- model

- DV = new_cases

- IV1 = date (degree 1)

- train.data

- test.data

unused variables are:

- IV2 = total_doses_administered

- IV3 = P_of_population_fully_vaccinated
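The task asks for predictions for the next five to seven days. A minimal sketch of how a fitted straight-line model could produce them is given here on a synthetic series. The objects `toy_train`, `toy_model`, `future` and `forecast` are assumptions for illustration, and the date is converted to an explicit day number so the example does not depend on how a Date column is handled inside a formula.

```r
# Synthetic daily series standing in for train.data (values are invented)
set.seed(1)
toy_train <- data.frame(date = seq(as.Date("2020-03-01"), by = "day", length.out = 60))
toy_train$day <- as.numeric(toy_train$date)   # days since 1970-01-01
toy_train$new_cases <- 50 + 3 * seq_len(60) + rnorm(60, sd = 10)

# Straight-line model in the day number (a degree-1 polynomial)
toy_model <- lm(new_cases ~ day, data = toy_train)

# Build the next seven calendar days after the last observed date
future <- data.frame(date = max(toy_train$date) + 1:7)
future$day <- as.numeric(future$date)

# Point forecasts with 95% prediction intervals
forecast <- cbind(future, predict(toy_model, newdata = future, interval = "prediction"))
forecast
```

The `lwr` and `upr` columns give a rough sense of how uncertain each daily forecast is; with an R squared as low as the one reported above, such intervals would be expected to be very wide on the real data.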

(2) Apply K-fold Cross-Validation (CV) to evaluate your predictions.

K-fold Cross-Validation (CV) k=9

#k=9 K-fold Cross-Validation (CV)
library(caret)

#specify the cross-validation method to choose k=9
ctrl <- trainControl(method = "cv", number = 9)

#fit a regression model and use k-fold CV to evaluate performance
k_fold_cv_model <- train(new_cases ~ date, data =df_task4_clean, method = "lm", trControl = ctrl)

#view summary of k-fold CV               
print(k_fold_cv_model)

## Linear Regression 
## 
## 6220 samples
##    1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (9 fold) 
## Summary of sample sizes: 5529, 5529, 5529, 5529, 5529, 5528, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   4485.303  0.02190759  1757.344
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

summary(k_fold_cv_model)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32148  -1374   -701   -211  64107 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.181e+05  1.963e+04  -11.11   <2e-16 ***
## date         1.195e+01  1.069e+00   11.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4520 on 6218 degrees of freedom
## Multiple R-squared:  0.01971,    Adjusted R-squared:  0.01955 
## F-statistic:   125 on 1 and 6218 DF,  p-value: < 2.2e-16
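The coefficients in this summary are identical to the earlier full-data fit because caret refits the final model on all rows; the cross-validated figures are the ones in the "Resampling results" table. The sketch below shows, on synthetic data, roughly what happens inside k-fold CV. It is a hand-written illustration under assumed names (`toy`, `fold_id`, `fold_rmse`), not caret's actual implementation.

```r
# Synthetic data standing in for df_task4_clean (values are invented)
set.seed(123)
n <- 180
toy <- data.frame(day = seq_len(n))
toy$new_cases <- 20 + 2 * toy$day + rnorm(n, sd = 25)

k <- 9
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  train_rows <- toy[fold_id != i, ]   # k - 1 folds used for fitting
  held_out   <- toy[fold_id == i, ]   # remaining fold used for testing
  fit  <- lm(new_cases ~ day, data = train_rows)
  pred <- predict(fit, newdata = held_out)
  fold_rmse[i] <- sqrt(mean((pred - held_out$new_cases)^2))
}

mean(fold_rmse)  # cross-validated RMSE: average over the k held-out folds
```

Every row is held out exactly once, which is why the averaged RMSE depends less on a single lucky or unlucky split than a one-off train/test partition does.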

Question: Is there any difference between the results generated from the simple training and testing split and the CV results?

My Answers:

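Yes. The CV results (RMSE = 4485.303, R squared = 0.02190759, averaged over 9 folds) differ from the simple training-and-testing split results (RMSE = 4375.069, R squared = 0.02568286). The t-values for the date coefficient also differ: t = 11.18 in the model reported by the CV summary, which is fitted on all 6,220 rows, versus t = 9.654 in the model fitted on the training split. The CV results are a fairer indication of how the model actually performs, because every observation is used for testing exactly once instead of relying on a single random split. Both the RMSE and the R squared from CV indicate that the model does not perform as well as the simple training-and-testing split suggested.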

4. Observe the results predicted by your model and then conduct some research to see if the results align with the actual situation. It’s okay if the model performs poorly; justify your reasons, and think from the perspective of the over-fitting and under-fitting phenomena:

(1) Which category does your model belong to?

My answers:

From a mathematician’s viewpoint, my model is a polynomial regression model with degree = 1, which is the same as a simple linear regression model.

From an analyst’s viewpoint, my model belongs to predictive modelling.

(2) Does the CV technique help to reduce over-fitting? If so, justify your findings.

My answers:

Yes.

My case is a high-bias (under-fitting) scenario: I created a polynomial model with degree = 1, which is a simple linear regression model, and the training error (residual standard error = 4555) is only slightly bigger than the test error (RMSE = 4375.069), while R squared stays around 0.02 on both. The CV summary yields a greater t-value (t = 11.18, from the model fitted on all rows) but a smaller R squared (0.02190759) than the simple training-and-testing split (t = 9.654, R squared = 0.02568286). The CV results therefore give a more careful error estimate: both the RMSE and the R squared indicate that the model does not perform as well as the simple training-and-testing split suggested.

I also observed that the results change with the value of k: the bigger the k, the more folds there are and the more experiments are run. The R squared value gradually increased with k, although the RMSE increased as well, so a bigger k may give a somewhat improved, but not clear-cut, picture of model performance.

Therefore CV helped with the high-bias issue in my case, in the sense that it exposed the under-fitting rather than changing the fitted model itself. I would assume the same applies to a high-variance (over-fitting) scenario: CV gives a fairer estimate of performance, which helps in choosing a model that over-fits less.
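The effect of k can be checked with a small experiment. The sketch below is an illustration on synthetic data, with a hypothetical helper `cv_rmse()` that is not part of caret; it repeats k-fold CV for several values of k and collects the averaged RMSE.

```r
# Synthetic data (values are invented for illustration)
set.seed(42)
n <- 200
toy <- data.frame(day = seq_len(n))
toy$new_cases <- 20 + 2 * toy$day + rnorm(n, sd = 25)

# Hypothetical helper: cross-validated RMSE of a straight-line model
cv_rmse <- function(data, k) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  errs <- vapply(seq_len(k), function(i) {
    fit  <- lm(new_cases ~ day, data = data[fold_id != i, ])
    pred <- predict(fit, newdata = data[fold_id == i, ])
    sqrt(mean((pred - data$new_cases[fold_id == i])^2))
  }, numeric(1))
  mean(errs)
}

k_values <- c(3, 5, 9, 10)
results <- data.frame(
  k    = k_values,
  RMSE = vapply(k_values, function(k) cv_rmse(toy, k), numeric(1))
)
results
```

Part of any difference between rows comes from the random fold assignment itself, so small changes in RMSE or R squared across k are best read with caution; repeating the experiment with several seeds would show how much of the change is noise.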

(3)Are there any assumptions to be made for enhancing the model’s performance (e.g., add more data? Apply feature selection?). If so, justify your choices.

My Answers:

Yes. In this date/time-series forecasting scenario, my model involves only one independent variable (the “date”) with degree = 1 in the polynomial regression (that is, a simple linear regression model), which leads to a high-bias issue. Performance could therefore be enhanced by adding more independent variables (e.g. total doses administered and/or % of population fully vaccinated) and/or by increasing the polynomial degree of the model, turning the equation into a quadratic, cubic, etc. Increasing the sample size with more data would also likely help the model’s performance overall.
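As a rough illustration of the "increase the degree" idea, the sketch below fits polynomials of degree 1 to 3 to a synthetic curved series and compares them. The data and object names (`toy`, `fits`, `comparison`) are assumptions for illustration; the real data may behave quite differently.

```r
# Synthetic series with a rise followed by a fall (values are invented)
set.seed(7)
n <- 150
toy <- data.frame(day = seq_len(n))
toy$new_cases <- 100 + 8 * toy$day - 0.05 * toy$day^2 + rnorm(n, sd = 30)

# Fit polynomials of degree 1 to 3; poly() uses orthogonal terms by default,
# which avoids the numerical problems of raw powers of large day numbers
fits <- lapply(1:3, function(d) lm(new_cases ~ poly(day, d), data = toy))

# Compare the fits: higher adjusted R-squared and lower AIC are better
comparison <- data.frame(
  degree = 1:3,
  adj_r2 = vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1)),
  AIC    = vapply(fits, AIC, numeric(1))
)
comparison
```

Adjusted R-squared and AIC both penalise extra terms, so they give some protection against simply preferring the highest degree; comparing the candidates with the same train/test split or k-fold CV used earlier would be a further check against over-fitting.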

#Save the joined and cleaned dataset as a CSV, export to my desktop.

write.csv(Final_data,  "C:\\Users\\Krystal\\Desktop\\Final_data.csv", row.names=TRUE)

Ass1

(Krystal) Ke Zhao

2023-09-24

