# load packages and function
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(outliers)
library(forecast)
library(infotheo)
is.special <- function(x) {sum(is.nan(x)|is.infinite(x))}
The purpose of this data prepocesing was to compare how covid cases to see the correlation of the movement of people at the state and council level with the increase of covid cases. Raw data of .csv are imported into R using the readr package with the given name. Once brought in the structure of each dataset is checked by str() function, and also the output of other information such as the datasets dimensions, variable type, etc. I created an transformation to tidy up the AU_Mob dataset to tidy form using the gather function to be ready for the merge in the later steps. Once I applied the appropriate transformation data type conversions were applied for changing characters to factors as well as character to dates, etc. In order to tidy up the data for easier data analysis, data manipulations were done on both the datasets using (gathering, select, mutation and merging). New columns were mutated from date and state to create a key column for both datasets to merge cleanly. Column names were also renamed appropriately due to them being too long. Data scans were done to remove outliers and missing values found in the dataset and decided to remove missing values confirming they were irrelevant and kept outliers due to the dataset being so large. Once applying the appropriate normalization method more analysis will be done with the outcome of the Mathematical Operations technique.
The COVID AU State dataset was obtained from the https://www.covid19data.com.au/ website The dataset contains 2,136 observations and 19 variables associated with COVID19 data for each state of Australia. Specifically, the dataset contains the following variables:
date: The date of observations captured.
state, state_abbrev: These variables include the full spelling of the states as well as the abbreviated versions of the states.
confirmed, confirmed_cum (cumulative): The number of confirmed cases related to COVID19 as well as the cumulative amount.
deaths, deaths_cum (cumulative): The number of deaths related to COVID19 as well as the cumulative amount.
tests, test_cum (cumulative): The number of tests related to COVID19 as well as the cumulative amount.
positives, positives_cum (cumulative): The number of positive cases related to COVID19 as well as the cumulative amount.
recovered, recovered_cum (cumulative): The number of recovered cases related to COVID19 as well as the cumulative amount.
hosp, hosp_cum (cumulative): The number of hospitalizations related to COVID19 as well as the cumulative amount.
icu, icu_cum (cumulative): The number of patients in ICU related to COVID19 as well as the cumulative amount.
vent, vent_cum(cumulative): The number of patients on ventilations related to COVID19 as well as the cumulative amount.
The Global Mobility report was obtained from the https://www.google.com/covid19/mobility/ website. The dataset contains 62,596 observations and 14 variables associated with the impact of movement due to COVID19. This dataset is untidy as 6 of the 14 variables are observations and should be combined to create movement type and movement percentage from baseline. Specifically, the dataset contains the following variables:
country region code, country region, sub region 1, sub region 2, metro area: These observations are all country and regional codes related to the specific country of the data as well at the state/city level.
iso 3166 2 code: This is the standard of short and unique alphanumeric codes to respresent the administrative divsions and dependant territories of all countrie.
census fips code: These are unique codes within nations to both identify stat and nesting counties. date: dates of observations captured.
retail_and_recreation_percent_change_from_baseline: percentage difference of retail and recreation movement visits from baseline 0.
grocery_and_pharmacy_percent_change_from_baseline: percentage difference of grocery and pharmacy visits from baseline 0.
parks_percent_change_from_baseline: percentage difference of park visits from baseline 0.
transit_stations_percent_change_from_baseline: percentage difference of transit movements from baseline 0.
workplaces_percent_change_from_baseline: percentage difference of workplace movements from baseline 0.
residential_percent_change_from_baseline: percentage difference of residential movements from baseline 0.
#the library to readr the csv has already been loaded above
#read csv from working directory and save it as a dataframe by the same name (without .extension)
#I will be doing the merge in the Tidy & Manipulate Data II step. Reason for this is that I want to tidy both datasets first as I am dealing with
COVID_AU_state <- read_csv("COVID_AU_state.csv")
Parsed with column specification:
cols(
date = col_date(format = ""),
state = col_character(),
state_abbrev = col_character(),
confirmed = col_double(),
confirmed_cum = col_double(),
deaths = col_double(),
deaths_cum = col_double(),
tests = col_double(),
tests_cum = col_double(),
positives = col_double(),
positives_cum = col_double(),
recovered = col_double(),
recovered_cum = col_double(),
hosp = col_double(),
hosp_cum = col_double(),
icu = col_double(),
icu_cum = col_double(),
vent = col_double(),
vent_cum = col_double()
)
head(COVID_AU_state) #view the head
tail(COVID_AU_state) #view the tail
AU_Mob <- read_csv("2020_AU_Region_Mobility_Report.csv")
Parsed with column specification:
cols(
country_region_code = col_character(),
country_region = col_character(),
sub_region_1 = col_character(),
sub_region_2 = col_character(),
metro_area = col_logical(),
iso_3166_2_code = col_character(),
census_fips_code = col_logical(),
date = col_character(),
retail_and_recreation_percent_change_from_baseline = col_double(),
grocery_and_pharmacy_percent_change_from_baseline = col_double(),
parks_percent_change_from_baseline = col_double(),
transit_stations_percent_change_from_baseline = col_double(),
workplaces_percent_change_from_baseline = col_double(),
residential_percent_change_from_baseline = col_double()
)
head(AU_Mob) #view the head
tail(AU_Mob) #view the tail
NA
This dataset is untidy for what is needed as the observations are in variable form from 9-14 these will need to use the gather function to create 2 columns. I will also be subsetting the data as well as renaming variable to be in a workable form for the merging section.
#before applying any more rules I will only be selecting 11 variables out of the 14.
df1 <- AU_Mob %>% select(country_region_code, country_region,sub_region_1, sub_region_2, date, retail_and_recreation_percent_change_from_baseline, grocery_and_pharmacy_percent_change_from_baseline, parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline)
str(df1) #view the overall structure of df1 to check I have selected the correct columns.
tibble [62,596 x 11] (S3: tbl_df/tbl/data.frame)
$ country_region_code : chr [1:62596] "AU" "AU" "AU" "AU" ...
$ country_region : chr [1:62596] "Australia" "Australia" "Australia" "Australia" ...
$ sub_region_1 : chr [1:62596] NA NA NA NA ...
$ sub_region_2 : chr [1:62596] NA NA NA NA ...
$ date : chr [1:62596] "15/02/2020" "16/02/2020" "17/02/2020" "18/02/2020" ...
$ retail_and_recreation_percent_change_from_baseline: num [1:62596] 4 3 -1 -3 -1 0 3 5 3 -1 ...
$ grocery_and_pharmacy_percent_change_from_baseline : num [1:62596] 3 5 0 -2 -1 1 4 4 4 1 ...
$ parks_percent_change_from_baseline : num [1:62596] -2 9 -6 -13 -6 5 -1 10 9 -10 ...
$ transit_stations_percent_change_from_baseline : num [1:62596] 3 3 7 7 8 9 12 8 4 8 ...
$ workplaces_percent_change_from_baseline : num [1:62596] 3 -1 17 14 13 13 16 3 -2 17 ...
$ residential_percent_change_from_baseline : num [1:62596] 0 0 -2 -1 -1 -2 -3 -1 0 -1 ...
df1 <- df1 %>% rename(state = sub_region_1,council = sub_region_2, retail_and_recreation = retail_and_recreation_percent_change_from_baseline, grocery_and_pharmacy = grocery_and_pharmacy_percent_change_from_baseline, parks = parks_percent_change_from_baseline, transit_stations = transit_stations_percent_change_from_baseline, workplaces= workplaces_percent_change_from_baseline, residential = residential_percent_change_from_baseline)
colnames(df1) #checking if the columns were renamed
[1] "country_region_code" "country_region" "state" "council"
[5] "date" "retail_and_recreation" "grocery_and_pharmacy" "parks"
[9] "transit_stations" "workplaces" "residential"
df2 <- df1 %>% gather("movement type", "percentage_from_baseline", 6:11)
head(df2) #checking the transformation is correct
Looking at the data frame, we can see categorical attributes in the rows as well as rows that summarize a set of observations. the tail also showed some text which were not observations so in order to wrangle the data, we:
str(COVID_AU_state) #view the structure of COVID_AU_state
tibble [2,136 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:2136], format: "2020-01-25" "2020-01-25" "2020-01-25" "2020-01-25" ...
$ state : chr [1:2136] "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
$ state_abbrev : chr [1:2136] "ACT" "NSW" "NT" "QLD" ...
$ confirmed : num [1:2136] 0 3 0 0 0 0 1 0 0 0 ...
$ confirmed_cum: num [1:2136] 0 3 0 0 0 0 1 0 0 3 ...
$ deaths : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ deaths_cum : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ tests : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ tests_cum : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ positives : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ positives_cum: num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ recovered : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ recovered_cum: num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ hosp : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ hosp_cum : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ icu : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ icu_cum : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ vent : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
$ vent_cum : num [1:2136] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "spec")=
.. cols(
.. date = col_date(format = ""),
.. state = col_character(),
.. state_abbrev = col_character(),
.. confirmed = col_double(),
.. confirmed_cum = col_double(),
.. deaths = col_double(),
.. deaths_cum = col_double(),
.. tests = col_double(),
.. tests_cum = col_double(),
.. positives = col_double(),
.. positives_cum = col_double(),
.. recovered = col_double(),
.. recovered_cum = col_double(),
.. hosp = col_double(),
.. hosp_cum = col_double(),
.. icu = col_double(),
.. icu_cum = col_double(),
.. vent = col_double(),
.. vent_cum = col_double()
.. )
class(COVID_AU_state$date) #date is format type
[1] "Date"
COVID_AU_state$date <- as.Date(COVID_AU_state$date, "%d/%m/%y") #converting data type to date.
class(COVID_AU_state$date) #it is now converted to date type
[1] "Date"
class(COVID_AU_state$state) #state is character type
[1] "character"
COVID_AU_state$state <- as.factor(COVID_AU_state$state) #leveling the nominal data
levels(COVID_AU_state$state) #checking the levels. the data is not ordinal so it can stay factored and not ordered.
[1] "Australian Capital Territory" "New South Wales" "Northern Territory"
[4] "Queensland" "South Australia" "Tasmania"
[7] "Victoria" "Western Australia"
class(COVID_AU_state$state_abbrev) #state is character type
[1] "character"
COVID_AU_state$state_abbrev <- as.factor(COVID_AU_state$state_abbrev) #leveling the nominal data
levels(COVID_AU_state$state_abbrev) #checking the levels. the data is not ordinal so it can stay factored and not ordered.
[1] "ACT" "NSW" "NT" "QLD" "SA" "TAS" "VIC" "WA"
class(COVID_AU_state$state_abbrev)
[1] "factor"
class(COVID_AU_state$confirmed) #numeric type
[1] "numeric"
str(df2) #view the structure of COVID_AU_state
tibble [375,576 x 7] (S3: tbl_df/tbl/data.frame)
$ country_region_code : chr [1:375576] "AU" "AU" "AU" "AU" ...
$ country_region : chr [1:375576] "Australia" "Australia" "Australia" "Australia" ...
$ state : chr [1:375576] NA NA NA NA ...
$ council : chr [1:375576] NA NA NA NA ...
$ date : chr [1:375576] "15/02/2020" "16/02/2020" "17/02/2020" "18/02/2020" ...
$ movement type : chr [1:375576] "retail_and_recreation" "retail_and_recreation" "retail_and_recreation" "retail_and_recreation" ...
$ percentage_from_baseline: num [1:375576] 4 3 -1 -3 -1 0 3 5 3 -1 ...
class(df2$date) #date is character type will need to converge it to date
[1] "character"
df2$date <- as.Date(df2$date, "%d/%m/%y") #converting data type to date.
class(COVID_AU_state$date) #it is now converted to date type
[1] "Date"
class(df2$country_region_code) #character type
[1] "character"
class(df2$country_region) #character type
[1] "character"
class(df2$state) #character type
[1] "character"
df2$state <- as.factor(df2$state) #leveling the nominal data
levels(df2$state) #checking the levels. the data is not ordinal so it can stay factored and not ordered.
[1] "Australian Capital Territory" "New South Wales" "Northern Territory"
[4] "Queensland" "South Australia" "Tasmania"
[7] "Victoria" "Western Australia"
class(df2$`movement type`)
[1] "character"
df2$`movement type`<- as.factor(df2$`movement type`)
levels(df2$`movement type`)
[1] "grocery_and_pharmacy" "parks" "residential" "retail_and_recreation"
[5] "transit_stations" "workplaces"
class(df2$percentage_from_baseline) #numeric type
[1] "numeric"
typeof(df2$percentage_from_baseline) #happy for it to stay as double
[1] "double"
dim(COVID_AU_state)
[1] 2136 19
dim(df2)
[1] 375576 7
In this step I merged both the datasets and selected what I needed for the final dataset. During the merged there were 20 variables and at the final stage of the dataset I ended with 7 variables. After applying the column selection I mutated confirmed_cum - recovered_cum to create an active_cases column. This column will be used for analysis later on for the mobility percentage.
df2$key = paste(df2$date, "-", df2$state) #created this to merge with COVID_AU_state dataset
COVID_AU_state$key = paste(COVID_AU_state$date, "-", COVID_AU_state$state) #created this to merge with COVID_AU_state dataset
covid <- COVID_AU_state %>% left_join(df2, by = "key") #merging both datasets to one linked by the key
covid_mob <- covid %>%
select(date.x, state_abbrev, council,confirmed_cum, recovered_cum, `movement type`, percentage_from_baseline) %>%
mutate(active_cases = confirmed_cum - recovered_cum) # Created active_cases column
head(covid_mob)
First I identified any missing data within the covid_mob dataset. Using the colSum function I found that The council column had a high number of NAs.I did a simple technique of removing the NAs as I am wanting to see at the council level how covid affected movement city wide.I also applied and is.special to find any special values in the dataset and the dataset had no special values within the observational rows.
colSums(is.na(covid_mob)) # Looks like council will be the main dataset we will remove NAs from as I want to know the movements from the council level.
date.x state_abbrev council confirmed_cum
0 0 11816 0
recovered_cum movement type percentage_from_baseline active_cases
0 200 114058 0
df3 <- covid_mob %>% drop_na() #removing NAs from All
colSums(is.na(df3)) #checking if any other column has NA. I have NAs in percentage_from_baseline I will not remove or change the data as I have no assumption on what the missing default data is so will keep the NAs in.
date.x state_abbrev council confirmed_cum
0 0 0 0
recovered_cum movement type percentage_from_baseline active_cases
0 0 0 0
sapply(df3, is.special) #No special values found
date.x state_abbrev council confirmed_cum
0 0 0 0
recovered_cum movement type percentage_from_baseline active_cases
0 0 0 0
head(df3) #checking the final dataset is good for scan II
NA
Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
df3 %>%
group_by(state_abbrev) %>%
summarise(Min = min(percentage_from_baseline,na.rm = TRUE),
Q1 = quantile(percentage_from_baseline,probs = .25,na.rm = TRUE),
Median = median(percentage_from_baseline, na.rm = TRUE),
Q3 = quantile(percentage_from_baseline,probs = .75,na.rm = TRUE),
Max = max(percentage_from_baseline,na.rm = TRUE),
Mean = mean(percentage_from_baseline, na.rm = TRUE),
SD = sd(percentage_from_baseline, na.rm = TRUE),
n = n(),
Missing = sum(is.na(percentage_from_baseline)),
.groups = 'drop')
#summarizing by the state level of percentage of movement to find reasoning for outliers
df3 %>% boxplot(percentage_from_baseline ~ state_abbrev,data = ., main = "Boxplot of movement by state",
ylab="percent movement", xlab="state", horizontal=TRUE)
#After applying the multivariate outlier detection method. There seems to be a large amount of missing data for each state besides TAS with far fewer.The reason I chose this method as it is the easiest method to find outliers visually. I will need further analysis as this does not give me a key indication on where outliers may appear.
#According the scatterplot, there seem to outliers on the top left corner of the graph which is very few so will decide to keep the outliers
df3 %>% plot(percentage_from_baseline ~ active_cases, data = ., main ="percent movement by active cases")
I will apply the data transformation via mathematical operations to achieve normality. This method will help me distinguish what mathematical method to use for further analysis of the dataset.I have filtered out the appropriate mathematical operations with log10, square root and reciprocal. After analyzing the histograms of each method the Reciprocal transformation had the best outcome with a much more normalized distribution for further analysis.
hist(df3$percentage_from_baseline) # data appears to right skewed
recper <- 1/df3$percentage_from_baseline #applying reciprocal transformation
hist(recper) #This seemed to be one that normalized the data well.
df3 <- df3 %>% mutate(rec_percentage = 1/df3$percentage_from_baseline) #Column added in for comparison for further analysis
head(df3)
NA
https://www.google.com/covid19/mobility/
https://github.com/M3IT/COVID-19_Data/blob/master/Data/COVID_AU_state.csv