Data Wrangling Assessment

Required packages

library(openxlsx)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(outliers)

Data

Two data sets chosen were chosen to investigate the relationship between coronavirus case numbers and vaccination rates in Australia by age and gender. The data was pulled on 26th September 2021 and relates to the statistics as they stood at that time.

Data Set 1 was accessed from the Australian Government Department of Health Website at: https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/coronavirus-covid-19-case-numbers-and-statistics

It contains data related to coronavirus case numbers and deaths by gender and age group. Variables include:

Age Group
Cases-Male
Cases-Female
Deaths-Male
Deaths-Female

Dataset 2 was accessed from the Australian Bureau of Statistics website at: https://www.health.gov.au/resources/publications/covid-19-vaccination-vaccination-data-25-september-2021

It contains data related to vaccinations by state, age group and gender. In its current state the dataset contains two variables:

Measure name
Value

However, once tidied the following variables will be extracted:

Age Group
Gender
Fully Vaccinated Population
Total Population

Note: The data is quite untidy so the data frames will both need to undergo extensive tidying before I can merge them together at a later stage.

#Read in datasets and run the head functions
age_gender <- read.xlsx("Covid Cases.xlsx", sheet = "By age group")
vax <- read.xlsx("covid-19-vaccination-vaccination-data-25-september-2021.xlsx", sheet = "Vaccination data")

head(age_gender)

head(vax)

Understand

For each of the dataframes, the variable classes and structures were checked and converted where necessary. Character classes were converted to factors, and the age group levels were ordered.

# Check age_gender variable classes and convert where necessary
sapply(age_gender, class)

##   Age.Group     Cases-M     Cases-F    Deaths-M    Deaths-F 
## "character"   "numeric"   "numeric"   "numeric"   "numeric"

age_gender$Age.Group<-as.factor(age_gender$Age.Group)
sapply(age_gender, class)

## Age.Group   Cases-M   Cases-F  Deaths-M  Deaths-F 
##  "factor" "numeric" "numeric" "numeric" "numeric"

# Check vax variable classes and convert where necessary
sapply(vax, class)

## Measure.Name        Value 
##  "character"    "numeric"

vax$Measure.Name<-as.factor(vax$Measure.Name)
sapply(vax, class)

## Measure.Name        Value 
##     "factor"    "numeric"

# Check the structure of each data frame
str(age_gender)

## 'data.frame':    10 obs. of  5 variables:
##  $ Age.Group: Factor w/ 10 levels "0-9","10-19",..: 1 2 3 4 5 6 7 8 9 10
##  $ Cases-M  : num  5496 6600 10619 8914 6198 ...
##  $ Cases-F  : num  5170 6230 9776 7818 5558 ...
##  $ Deaths-M : num  0 1 5 6 9 25 56 145 236 134
##  $ Deaths-F : num  0 0 2 3 7 21 23 82 239 214

str(vax)

## 'data.frame':    290 obs. of  2 variables:
##  $ Measure.Name: Factor w/ 290 levels "ACT - Administration state - Daily increase doses administered by jurisdictions - commonwealth",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Value       : num  3997 4374 8371 329384 307951 ...

# Order age group variable
age_gender$Age.Group<-factor(age_gender$Age.Group,
              levels=c('0-9','10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90+'),
              ordered=TRUE)

# Check ordered levels now applied to age group
str(age_gender)

## 'data.frame':    10 obs. of  5 variables:
##  $ Age.Group: Ord.factor w/ 10 levels "0-9"<"10-19"<..: 1 2 3 4 5 6 7 8 9 10
##  $ Cases-M  : num  5496 6600 10619 8914 6198 ...
##  $ Cases-F  : num  5170 6230 9776 7818 5558 ...
##  $ Deaths-M : num  0 1 5 6 9 25 56 145 236 134
##  $ Deaths-F : num  0 0 2 3 7 21 23 82 239 214

Tidy & Manipulate Data I

The data does not conform to tidy data principles. There are occurrences where variables don’t have their own columns, observations don’t have their own rows, and values don’t have their own cells. This will be corrected in the following steps.

For the vax data frame, the Measure.Name variable needed to be separated out into 5 separate columns, refactored and have white space removed. It was then subset into two separate dataframes:

vaxagepop which summarised the total population by gender and age group.
vaxage which summarised the total population that has been fully vaccinated, by gender and age group.

Both subsets then required Measure1 and Measure2 columns to be reunited to create an age group column, and age groups that were not consistent with the age_gender dataframe were filtered out.

Once this had been done, I was then able to join both subsets together.

#separate Measure Name by separator
vax<-vax%>%
   separate(Measure.Name, c("Grouping", "Measure1", "Measure2", "Measure3", "Measure4"), sep="-")
head(vax)

#Check classes
sapply(vax, class)

##    Grouping    Measure1    Measure2    Measure3    Measure4       Value 
## "character" "character" "character" "character" "character"   "numeric"

#Convert new variables from character to factor class
vax$Measure1<-as.factor(vax$Measure1)
vax$Measure2<-as.factor(vax$Measure2)
vax$Measure3<-as.factor(vax$Measure3)
vax$Measure4<-as.factor(vax$Measure4)

#remove whitespace
vax$Grouping<-str_trim(vax$Grouping)
vax$Measure1<-str_trim(vax$Measure1)
vax$Measure2<-str_trim(vax$Measure2)
vax$Measure3<-str_trim(vax$Measure3)
vax$Measure4<-str_trim(vax$Measure4)

#subset vax to vaxage data frame

#create new dataset with Grouping filtered only to State Jurisdiction (exclude National)
vaxage<-vax %>% 
  filter(Grouping %in% "Age group")

#create dataset for total population
vaxagepop<-vaxage %>% 
  filter(Measure4 %in% "Population") %>% 
  unite(Age.Group, Measure1, Measure2, sep="-") %>% 
   filter(!Age.Group %in% c("16-19", "90-94")) %>% 
   rename(Gender=Measure3, Type=Measure4, Total=Value) %>% 
   select(-(Grouping))

 vaxagepop<-pivot_wider(data = vaxagepop, names_from = "Type", values_from = "Total") 


# filter to Number of People Fully Vaccinated in measure4 to capture age and gender 
vaxage<-vaxage %>% 
  filter(Measure4 %in% "Number of people fully vaccinated") %>% 
  select(-(Grouping))

#Unite Measure1 and Measure2 as age group and filter out ages below 20 to enable like-for-like merge with age_gender

vaxage<-vaxage %>% 
  unite(Age.Group, Measure1, Measure2, sep="-") %>% 
  filter(!Age.Group %in% c("16-19", "90-94")) %>% 
  rename(Gender=Measure3, Type=Measure4, Total=Value) 

 vaxage<-pivot_wider(data = vaxage, names_from = "Type", values_from = "Total") %>% 
   rename(PopFullyVaxxed="Number of people fully vaccinated")

# merge vaxagepop and vaxage dataframes
vaxage<-vaxage %>% 
  left_join(vaxagepop, by=c("Age.Group", "Gender")) 

head(vaxage)

For the age_gender dataframe, tidying included pivoting longer so each observation could have its own row. I then separated the “Variable” column into separate columns for Type and Gender. As with vaxage, age groups that were not consistent with the vaxage dataframe were filtered out.The data was then pivoted wider to best showcase the total number of cases and deaths by age group and gender.

age_gender <- pivot_longer(data = age_gender, names_to = "Variable", values_to = "Total", 2:5)

# separate Variable column into Type and Gender
age_gender<-age_gender%>%
  separate(Variable, c("Type", "Gender"), sep="-") %>% 
  filter(!Age.Group %in% c("0-9", "10-19", "90+")) %>% 
  select(Age.Group, Gender, Type, Total)

age_gender<-pivot_wider(data = age_gender, names_from = "Type", values_from = "Total") 

head(age_gender)

Tidy & Manipulate Data II

Before merging the two age dataframes, vaxage age groups needed to be adjusted so they are consistent with the age_gender data frame.This was done by creating a new age group variable (x) that aligns with age_gender age groups. The data was then summarising according to this new age group variable.

#Create a new variable x in vaxage to assign age groups to brackets consistent with age_gender dataframe
vaxage$x <-ifelse(vaxage$Age.Group=="20-24", "20-29", 
                  ifelse(vaxage$Age.Group=="25-29", "20-29",
                     ifelse(vaxage$Age.Group=="30-34", "30-39",   
                            ifelse(vaxage$Age.Group=="35-39", "30-39",
                                   ifelse(vaxage$Age.Group=="40-44", "40-49",
                                          ifelse(vaxage$Age.Group=="45-49", "40-49",
                                                 ifelse(vaxage$Age.Group=="50-54", "50-59",
                                                        ifelse(vaxage$Age.Group=="55-59", "50-59",
                                                               ifelse(vaxage$Age.Group=="60-64", "60-69",
                                                                      ifelse(vaxage$Age.Group=="65-69", "60-69",
                                                                             ifelse(vaxage$Age.Group=="70-74", "70-79",
                                                                                    ifelse(vaxage$Age.Group=="75-79", "70-79",
                                                                                           ifelse(vaxage$Age.Group=="80-84", "80-89",
                                                                                                  ifelse(vaxage$Age.Group=="85-89", "80-89",""))))))))))))))


#summarise df by new agegroup variable
vaxage<-vaxage %>% select(-(Age.Group)) %>% group_by(x, Gender) %>% summarise(PopFullyVaxxed = sum(PopFullyVaxxed), Population= sum(Population))%>% rename(Age.Group=x)

head(vaxage)

Both data frames were then merged into the final single data frame and character variables were once again converted to factors.

finaldf<-age_gender %>% 
  full_join(vaxage, by=c("Age.Group", "Gender"))

finaldf$Age.Group<-as.factor(finaldf$Age.Group)
finaldf$Gender<-as.factor(finaldf$Gender)

head(finaldf)

Three new variables were created from existing variables to demonstrate the percentage of the total population that have been fully vaccinated, covid cases as a percentage of the total population, and covid deaths as a percentage of the total population.

#create new variable - Percentage of population fully vaxxed Population
finaldf$FullVaxPercentOfPop<-finaldf$PopFullyVaxxed/finaldf$Population*100 
finaldf$FullVaxPercentOfPop<-round(finaldf$FullVaxPercentOfPop, digits=0)

#create new variable - Covid Cases as Percentage of Population
finaldf$CasesPercentOfPop<-finaldf$Cases/finaldf$Population*100 
finaldf$CasesPercentOfPop<-round(finaldf$CasesPercentOfPop, digits=2)

#create new variable - Covid Deaths as Percentage of Population 
finaldf$DeathsPercentOfPop<-finaldf$Deaths/finaldf$Population*100 
finaldf$DeathsPercentOfPop<-round(finaldf$DeathsPercentOfPop, digits=2)
head(finaldf)

Scan I

The data frame and individual columns were reviewed for missing values using is.na functions, and none were found. If there had been missing values, the impute function would have been used, most likely using average or median values for that column.

Any potential inconsistencies and errors have previously been addressed as part of the data cleansing process and will be further investigated and analysed when reviewing potential outliers in the next section.

# No missing values were found in the final data frame.
sum(is.na(finaldf))

## [1] 0

colSums(is.na(finaldf))

##           Age.Group              Gender               Cases              Deaths 
##                   0                   0                   0                   0 
##      PopFullyVaxxed          Population FullVaxPercentOfPop   CasesPercentOfPop 
##                   0                   0                   0                   0 
##  DeathsPercentOfPop 
##                   0

Scan II

A multivariate analysis was first used to detect potential outliers. There were 3 potential outliers identified.

sub<-finaldf %>% select(Cases, Deaths, PopFullyVaxxed, Population, FullVaxPercentOfPop, CasesPercentOfPop, DeathsPercentOfPop)

results<- sub %>% 
  MVN::mvn(multivariateOutlierMethod = "quan", 
           showOutliers = TRUE)

results$multivariateOutliers

The data frame was filtered down to the observation numbers identified in the results. It was found that these outliers were the two youngest male age groups and the youngest female age group.

#subset data to observation numbers found in the results stage
finaldf %>% filter(row(finaldf)==1|row(finaldf)==2|row(finaldf)==3)

I then ran scatter plots for all variables by age group but nothing stood out as unique for the youngest age groups.

At this stage I decided to retain the proposed outliers, noting that the youngest age groups appeared to somehow behave significantly different from the other age groups. This will be revisited at a later point when a visualisation is created to explore the relationship between case and vaccination rates.

#Example 1 for scatter plots - individual variables by age group
ggplot(data=finaldf, 
       mapping=aes(x=Age.Group, y=Cases, color=Gender))+
  labs(title= "Cases by Age Group",
       x="Age Group",
       y="Cases")+
  geom_point()

By running box plot visualisations for all variables, possible outliers were also detected in relation to female deaths and deaths as a percent of the total population for both genders.

ggplot(data=finaldf, 
       mapping=aes(x=Gender, y=Deaths))+
  labs(title= "Deaths by Gender",
       x="Gender",
       y="Deaths")+
  geom_boxplot()

ggplot(data=finaldf, 
       mapping=aes(x=Gender, y=DeathsPercentOfPop))+
  labs(title= "Death Rate by Gender",
       x="Gender",
       y="Death %")+
  geom_boxplot()

However, by using the outliers package it was confirmed that these points do in fact fall within an acceptable range and are not outliers after all.

# Z scores for deaths - Max=2.1026 which is below the min of 3
z.scores <- finaldf$Deaths %>%  
  outliers::scores(type = "z")
z.scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7026 -0.6523 -0.4658  0.0000  0.1674  2.1026

#There are no scores that qualify as outliers
length( which( abs(z.scores) >3 ))

## [1] 0

# Z scores for deaths by percentage of population - Max=2.5 which is below the min of 3
z.scores <- finaldf$DeathsPercentOfPop %>%  
  outliers::scores(type = "z")
z.scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.500  -0.500  -0.500   0.000  -0.125   2.500

#There are no scores that qualify as outliers
length( which( abs(z.scores) >3 ))

## [1] 0

When viewed in a scatterplot the data appears to demonstrate a significant exponential relationship between age groups and number of deaths, with the number of deaths significantly higher for both genders in the 80-89 Age Group.

Males also seem to have a substantially higher risk of death than females within the ages of 60-69 and 70-79 year old Age Groups.

ggplot(data=finaldf, 
       mapping=aes(x=Age.Group, y=Deaths, color=Gender))+
  labs(title= "Total number of covid deaths by age and gender",
       x="Age Group", 
       y="Deaths")+
  geom_point()

When deaths are viewed as a percentage of the population, the risk appears to increase exponentially for the 70+ age groups, with males experiencing a far greater prorated risk than females.

ggplot(data=finaldf, 
       mapping=aes(x=Age.Group, y=DeathsPercentOfPop, color=Gender))+
  labs(title= "Covid deaths by age and gender as a percentage of population",
       x="Age Group", 
       y="Death %")+
  geom_point()

It also appears that the age groups most at risk of death are also the most likely to get vaccinated.

ggplot(data=finaldf, 
       mapping=aes(x=FullVaxPercentOfPop, y=DeathsPercentOfPop, color=Gender, label=Age.Group))+
  labs(title= "Vaccination Rates by Covid Death Rates",
       x="Fully Vaccinated %",
       y="Death %")+
  geom_text(aes(label=Age.Group), size=3, hjust=0, vjust=-0.5)+
  geom_point()

However, quite inversely, the populations with the highest percentage of cases are also the least likely to be fully vaccinated.

Incidentally, the 3 data points with the strongest disassociation between cases and vaccination rates are the same 3 that were detected as outliers in the initial multivariate analysis.

This raises potential research questions about whether it is a disproportionate disinclination or a lack of availability that reduces the rate of vaccination for younger populations.

ggplot(data=finaldf, 
       mapping=aes(x=FullVaxPercentOfPop, y=CasesPercentOfPop, color=Gender, label=Age.Group))+
  labs(title= "Vaccination Rates by Covid Case Rates",
       x="Fully Vaccinated %",
       y="Case %")+
  geom_text(aes(label=Age.Group), size=3, hjust=0, vjust=-0.5)+
  geom_point()

Transform

When looking at the histogram of the number of deaths, a right skewed distribution can be observed

hist(finaldf$Deaths,
     main = "Histogram of Deaths",
     xlab = "Deaths")

By using a logarithmic transformation, the distribution has been adjusted to a more symmetrical shape.

log_deaths <-log10(finaldf$Deaths)

hist(log_deaths,
     main="Histogram of base 10 log of Deaths",
     xlab="Base 10 log of Deaths")