Exploring vaccination data - EDA

Let’s do some EDA on some of the data The Vaccinators group has gathered over the last week.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(readr)
library(readxl)
library(ggplot2)

About the datasets

DATASET 1- PHNData from https://beta.health.gov.au/file/4066/download?token=0HQuFDlF DATASET 2- DataPostcode, is from 2016 coverage by postcode from Healthy Communities DATASET 3- HealthyCommunities_Tab2 DATASET 4- PHN and Postal Area Code csv is from PHN concordances http://www.health.gov.au/internet/main/publishing.nsf/Content/PHN-Concordances

The 2017 PHN data by age group and % coverage

#reading in the first lot of data:
PHNData <- read.csv(file="PHNvaccination.csv", header=TRUE, sep=",")
summary(PHNData)
##    PHN.Number                                  PHN.Name 
##  NM     : 3   Adelaide                             : 3  
##  PHN101 : 3   Australian Capital Territory         : 3  
##  PHN102 : 3   Brisbane North                       : 3  
##  PHN103 : 3   Brisbane South                       : 3  
##  PHN104 : 3   Central and Eastern Sydney           : 3  
##  PHN105 : 3   Central Queensland and Sunshine Coast: 3  
##  (Other):78   (Other)                              :78  
##          Age.Group      X.DTP          X.Polio          X.HIB      
##  12-<15 Months:32   Min.   :88.90   Min.   :91.38   Min.   : 0.00  
##  24-<27 Months:32   1st Qu.:93.07   1st Qu.:94.27   1st Qu.: 0.00  
##  60-<63 Months:32   Median :94.22   Median :95.50   Median :94.33  
##                     Mean   :94.07   Mean   :95.30   Mean   :63.37  
##                     3rd Qu.:95.09   3rd Qu.:96.44   3rd Qu.:95.48  
##                     Max.   :97.41   Max.   :97.93   Max.   :97.49  
##                                                                    
##      X.HEP           X.MMR          X.Pneumo         X.MenC     
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00  
##  Median :94.80   Median : 0.00   Median : 0.00   Median : 0.00  
##  Mean   :63.73   Mean   :31.21   Mean   :31.48   Mean   :31.88  
##  3rd Qu.:96.06   3rd Qu.:92.59   3rd Qu.:93.68   3rd Qu.:94.81  
##  Max.   :98.04   Max.   :95.58   Max.   :96.34   Max.   :97.70  
##                                                                 
##   X.Varicella       X..Fully    
##  Min.   : 0.00   Min.   :87.12  
##  1st Qu.: 0.00   1st Qu.:91.54  
##  Median : 0.00   Median :93.42  
##  Mean   :31.01   Mean   :93.03  
##  3rd Qu.:92.07   3rd Qu.:94.64  
##  Max.   :95.13   Max.   :97.00  
## 
str(PHNData)
## 'data.frame':    96 obs. of  12 variables:
##  $ PHN.Number : Factor w/ 32 levels "NM","PHN101",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ PHN.Name   : Factor w/ 32 levels "Adelaide","Australian Capital Territory",..: 23 5 21 32 17 28 27 30 14 18 ...
##  $ Age.Group  : Factor w/ 3 levels "12-<15 Months",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ X.DTP      : num  93.8 94.4 95.1 94.2 95.1 ...
##  $ X.Polio    : num  93.7 94.3 95 94.1 95.1 ...
##  $ X.HIB      : num  93.6 93.9 94.8 93.8 94.9 ...
##  $ X.HEP      : num  93.8 94.2 94.9 94 95 ...
##  $ X.MMR      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X.Pneumo   : num  93.3 93.8 94.5 93.6 95 ...
##  $ X.MenC     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X.Varicella: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X..Fully   : num  92.9 93.2 94 93 94.7 ...
names(PHNData)
##  [1] "PHN.Number"  "PHN.Name"    "Age.Group"   "X.DTP"       "X.Polio"    
##  [6] "X.HIB"       "X.HEP"       "X.MMR"       "X.Pneumo"    "X.MenC"     
## [11] "X.Varicella" "X..Fully"

Let’s try plotting this data out

#let's take another look
ggplot(PHNData, aes(y = X..Fully, x=Age.Group)) + 
  geom_boxplot() +
  xlab("Age Group") +
  ylab("Percentage of bla")

#let's take another look
ggplot(PHNData) + geom_boxplot(aes(y=PHN.Name,x=X..Fully)) 

This shows that Sydney’s inner city suburbs and the NSW North Coast are the PHNs with the lowest full % coverage of the population sampled in the data. If we could get postcode level data, we could explore how poor Mullumbimby is at vaccinating their children (see this http://www.abc.net.au/news/2017-06-08/childhood-vaccination-rates-still-around-50-percent/8600196)

2016 rates by postcode

#let's take a look at ANOTHER dataset we will call DataPostcode, which is from healthy communities
DataPostcode <- read.csv(file="Healthy_Communities_Immunised_2016_POSTCODE.csv", header=TRUE, sep=",")
summary(DataPostcode)
##      State         Postcode   
##  VIC    :1994   Min.   : 800  
##  NSW    :1817   1st Qu.:2840  
##  QLD    :1274   Median :3875  
##  WA     : 968   Mean   :4099  
##  SA     : 952   3rd Qu.:5169  
##  TAS    : 324   Max.   :7470  
##  (Other): 185                 
##                                                    Associated.residential.areas
##  Gordon                                                          :   6         
##  Highbury                                                        :   6         
##  Hopetoun                                                        :   6         
##  Abbotsford                                                      :   3         
##  Aberdeen, Rouchel, Dartbrook, Davis Creek, Rossgole and 3 others:   3         
##  Acacia Ridge, Heathwood, Pallara, Willawong, Larapinta          :   3         
##  (Other)                                                         :7487         
##  Reporting.Year   Age.group    Percent.fully.immunised....
##  2016-17:7514   1 year :2502   NP        :3079            
##                 2 years:2496   95.0-100.0:1323            
##                 5 years:2516   92.5-94.9 :1291            
##                                90.0-92.4 : 926            
##                                85.0-89.9 : 671            
##                                80.0-84.9 : 155            
##                                (Other)   :  69            
##  Interpret.with.caution.....
##   :5809                     
##  #:1705                     
##                             
##                             
##                             
##                             
## 
str(DataPostcode)
## 'data.frame':    7514 obs. of  7 variables:
##  $ State                       : Factor w/ 12 levels "ACT","ACT/NSW",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Postcode                    : int  800 810 812 820 822 828 830 832 834 835 ...
##  $ Associated.residential.areas: Factor w/ 2586 levels "Abbotsford","Aberdeen, Rouchel, Dartbrook, Davis Creek, Rossgole and 3 others",..: 662 1685 1164 2149 749 209 2499 1980 2340 1083 ...
##  $ Reporting.Year              : Factor w/ 1 level "2016-17": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age.group                   : Factor w/ 3 levels "1 year","2 years",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Percent.fully.immunised.... : Factor w/ 9 levels "<70.0","70.0-74.9",..: 6 7 6 6 7 9 7 7 9 8 ...
##  $ Interpret.with.caution..... : Factor w/ 2 levels " ","#": 2 1 1 1 1 1 1 1 1 2 ...
names(DataPostcode)
## [1] "State"                        "Postcode"                    
## [3] "Associated.residential.areas" "Reporting.Year"              
## [5] "Age.group"                    "Percent.fully.immunised...." 
## [7] "Interpret.with.caution....."
#let's see if I can plot this
ggplot(DataPostcode, aes(y = Percent.fully.immunised...., x = Age.group)) + 
  geom_col()

I wish my ggplot skills were better. Le sigh.

#not all that interesting ... hmm
ggplot(DataPostcode) + geom_boxplot(aes(y=State,x=Age.group)) 

#it looks like some of the postcodes cross states
#not all that interesting ... hmm
ggplot(data = DataPostcode) + geom_point(aes(y=Age.group,x=Percent.fully.immunised...., color=State)) +
 facet_wrap(~ Percent.fully.immunised...., nrow = 2)

#it looks like some of the postcodes cross states

I really am not very good at this plotting business, mostly because the variables in this data suck.

ggplot(data = DataPostcode) + geom_smooth(mapping = aes(x=Postcode, y=Percent.fully.immunised....))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

#very stupid visual

DataTab2 data with more variables and detail

Let’s look at more data #reading in the first lot of data:

DataTab2 <- read.csv(file="Healthy_CommunitiesTab2.csv", header=TRUE, sep=",")
summary(DataTab2)
##      State        PHN.code  
##  NSW    :150   PHN101 : 15  
##  QLD    :105   PHN102 : 15  
##  VIC    : 75   PHN103 : 15  
##  WA     : 45   PHN104 : 15  
##  SA     : 30   PHN105 : 15  
##  ACT    : 15   PHN106 : 15  
##  (Other): 45   (Other):375  
##                                          PHN.area.name Reporting.Year
##  Adelaide                                       : 15   2012-13:93    
##  Australian Capital Territory                   : 15   2013-14:93    
##  Brisbane North                                 : 15   2014-15:93    
##  Brisbane South                                 : 15   2015-16:93    
##  Central and Eastern Sydney                     : 15   2016-17:93    
##  Central Queensland, Wide Bay and Sunshine Coast: 15                 
##  (Other)                                        :375                 
##    Age.group   Number.of.registered.children Number.fully.immunised
##  1 year :155   1,116  :  2                   1,085  :  2           
##  2 years:155   4,310  :  2                   10,936 :  2           
##  5 years:155   5,040  :  2                   11,153 :  2           
##                5,548  :  2                   12,026 :  2           
##                5,613  :  2                   12,604 :  2           
##                6,080  :  2                   2,849  :  2           
##                (Other):453                   (Other):453           
##  Number.not.fully.immunised Percent.fully.immunised....
##  464    :  3                Min.   :86.10              
##  489    :  3                1st Qu.:90.80              
##  988    :  3                Median :92.30              
##  1,040  :  2                Mean   :92.06              
##  1,065  :  2                3rd Qu.:93.40              
##  1,081  :  2                Max.   :96.20              
##  (Other):450
str(DataTab2)
## 'data.frame':    465 obs. of  9 variables:
##  $ State                        : Factor w/ 9 levels "ACT","NSW","NT",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ PHN.code                     : Factor w/ 31 levels "PHN101","PHN102",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ PHN.area.name                : Factor w/ 31 levels "Adelaide","Australian Capital Territory",..: 5 20 30 16 26 25 28 13 17 15 ...
##  $ Reporting.Year               : Factor w/ 5 levels "2012-13","2013-14",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Age.group                    : Factor w/ 3 levels "1 year","2 years",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Number.of.registered.children: Factor w/ 455 levels "1,031","1,051",..: 193 21 116 293 104 359 265 133 295 205 ...
##  $ Number.fully.immunised       : Factor w/ 457 levels "1,017","1,019",..: 173 449 100 280 96 347 248 124 265 193 ...
##  $ Number.not.fully.immunised   : Factor w/ 407 levels "1,001","1,010",..: 73 340 21 191 403 200 154 332 279 139 ...
##  $ Percent.fully.immunised....  : num  92.9 93.4 92.5 94.4 93.1 95.1 95.3 95.5 90.1 96.2 ...
#This is a much better dataset as it has numbers and percents

Let’s do some tidying though

#rename columns
colnames(DataTab2) = c("State","PHN","PHNarea","Year","Age","Nchild","Nfully","Nnotfully","Percfully")

Visualising DataTab2

adsg

#visualise more
Tab2_plot <- ggplot(DataTab2, aes(x = Year, y = Percfully))
Tab2_plot + geom_point(aes(color = Percfully))

#visualise more
ggplot(DataTab2, aes(x=Percfully,y=Year,z=Nfully)) + facet_wrap(~Age) + geom_point()

Set a new variable #rename columns colnames(DataTab2) = c(“State”,“PHN”,“PHNarea”,“Year”,“Age”,“Nchild”,“Nfully”,“Nnotfully”,“Percfully”)