Let’s do some EDA on some of the data The Vaccinators group has gathered over the last week.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(readr)
library(readxl)
library(ggplot2)
DATASET 1- PHNData from https://beta.health.gov.au/file/4066/download?token=0HQuFDlF DATASET 2- DataPostcode, is from 2016 coverage by postcode from Healthy Communities DATASET 3- HealthyCommunities_Tab2 DATASET 4- PHN and Postal Area Code csv is from PHN concordances http://www.health.gov.au/internet/main/publishing.nsf/Content/PHN-Concordances
#reading in the first lot of data:
PHNData <- read.csv(file="PHNvaccination.csv", header=TRUE, sep=",")
summary(PHNData)
## PHN.Number PHN.Name
## NM : 3 Adelaide : 3
## PHN101 : 3 Australian Capital Territory : 3
## PHN102 : 3 Brisbane North : 3
## PHN103 : 3 Brisbane South : 3
## PHN104 : 3 Central and Eastern Sydney : 3
## PHN105 : 3 Central Queensland and Sunshine Coast: 3
## (Other):78 (Other) :78
## Age.Group X.DTP X.Polio X.HIB
## 12-<15 Months:32 Min. :88.90 Min. :91.38 Min. : 0.00
## 24-<27 Months:32 1st Qu.:93.07 1st Qu.:94.27 1st Qu.: 0.00
## 60-<63 Months:32 Median :94.22 Median :95.50 Median :94.33
## Mean :94.07 Mean :95.30 Mean :63.37
## 3rd Qu.:95.09 3rd Qu.:96.44 3rd Qu.:95.48
## Max. :97.41 Max. :97.93 Max. :97.49
##
## X.HEP X.MMR X.Pneumo X.MenC
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median :94.80 Median : 0.00 Median : 0.00 Median : 0.00
## Mean :63.73 Mean :31.21 Mean :31.48 Mean :31.88
## 3rd Qu.:96.06 3rd Qu.:92.59 3rd Qu.:93.68 3rd Qu.:94.81
## Max. :98.04 Max. :95.58 Max. :96.34 Max. :97.70
##
## X.Varicella X..Fully
## Min. : 0.00 Min. :87.12
## 1st Qu.: 0.00 1st Qu.:91.54
## Median : 0.00 Median :93.42
## Mean :31.01 Mean :93.03
## 3rd Qu.:92.07 3rd Qu.:94.64
## Max. :95.13 Max. :97.00
##
str(PHNData)
## 'data.frame': 96 obs. of 12 variables:
## $ PHN.Number : Factor w/ 32 levels "NM","PHN101",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ PHN.Name : Factor w/ 32 levels "Adelaide","Australian Capital Territory",..: 23 5 21 32 17 28 27 30 14 18 ...
## $ Age.Group : Factor w/ 3 levels "12-<15 Months",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ X.DTP : num 93.8 94.4 95.1 94.2 95.1 ...
## $ X.Polio : num 93.7 94.3 95 94.1 95.1 ...
## $ X.HIB : num 93.6 93.9 94.8 93.8 94.9 ...
## $ X.HEP : num 93.8 94.2 94.9 94 95 ...
## $ X.MMR : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X.Pneumo : num 93.3 93.8 94.5 93.6 95 ...
## $ X.MenC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X.Varicella: num 0 0 0 0 0 0 0 0 0 0 ...
## $ X..Fully : num 92.9 93.2 94 93 94.7 ...
names(PHNData)
## [1] "PHN.Number" "PHN.Name" "Age.Group" "X.DTP" "X.Polio"
## [6] "X.HIB" "X.HEP" "X.MMR" "X.Pneumo" "X.MenC"
## [11] "X.Varicella" "X..Fully"
Let’s try plotting this data out
#let's take another look
ggplot(PHNData, aes(y = X..Fully, x=Age.Group)) +
geom_boxplot() +
xlab("Age Group") +
ylab("Percentage of bla")
#let's take another look
ggplot(PHNData) + geom_boxplot(aes(y=PHN.Name,x=X..Fully))
This shows that Sydney’s inner city suburbs and the NSW North Coast are the PHNs with the lowest full % coverage of the population sampled in the data. If we could get postcode level data, we could explore how poor Mullumbimby is at vaccinating their children (see this http://www.abc.net.au/news/2017-06-08/childhood-vaccination-rates-still-around-50-percent/8600196)
#let's take a look at ANOTHER dataset we will call DataPostcode, which is from healthy communities
DataPostcode <- read.csv(file="Healthy_Communities_Immunised_2016_POSTCODE.csv", header=TRUE, sep=",")
summary(DataPostcode)
## State Postcode
## VIC :1994 Min. : 800
## NSW :1817 1st Qu.:2840
## QLD :1274 Median :3875
## WA : 968 Mean :4099
## SA : 952 3rd Qu.:5169
## TAS : 324 Max. :7470
## (Other): 185
## Associated.residential.areas
## Gordon : 6
## Highbury : 6
## Hopetoun : 6
## Abbotsford : 3
## Aberdeen, Rouchel, Dartbrook, Davis Creek, Rossgole and 3 others: 3
## Acacia Ridge, Heathwood, Pallara, Willawong, Larapinta : 3
## (Other) :7487
## Reporting.Year Age.group Percent.fully.immunised....
## 2016-17:7514 1 year :2502 NP :3079
## 2 years:2496 95.0-100.0:1323
## 5 years:2516 92.5-94.9 :1291
## 90.0-92.4 : 926
## 85.0-89.9 : 671
## 80.0-84.9 : 155
## (Other) : 69
## Interpret.with.caution.....
## :5809
## #:1705
##
##
##
##
##
str(DataPostcode)
## 'data.frame': 7514 obs. of 7 variables:
## $ State : Factor w/ 12 levels "ACT","ACT/NSW",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Postcode : int 800 810 812 820 822 828 830 832 834 835 ...
## $ Associated.residential.areas: Factor w/ 2586 levels "Abbotsford","Aberdeen, Rouchel, Dartbrook, Davis Creek, Rossgole and 3 others",..: 662 1685 1164 2149 749 209 2499 1980 2340 1083 ...
## $ Reporting.Year : Factor w/ 1 level "2016-17": 1 1 1 1 1 1 1 1 1 1 ...
## $ Age.group : Factor w/ 3 levels "1 year","2 years",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Percent.fully.immunised.... : Factor w/ 9 levels "<70.0","70.0-74.9",..: 6 7 6 6 7 9 7 7 9 8 ...
## $ Interpret.with.caution..... : Factor w/ 2 levels " ","#": 2 1 1 1 1 1 1 1 1 2 ...
names(DataPostcode)
## [1] "State" "Postcode"
## [3] "Associated.residential.areas" "Reporting.Year"
## [5] "Age.group" "Percent.fully.immunised...."
## [7] "Interpret.with.caution....."
#let's see if I can plot this
ggplot(DataPostcode, aes(y = Percent.fully.immunised...., x = Age.group)) +
geom_col()
I wish my ggplot skills were better. Le sigh.
#not all that interesting ... hmm
ggplot(DataPostcode) + geom_boxplot(aes(y=State,x=Age.group))
#it looks like some of the postcodes cross states
#not all that interesting ... hmm
ggplot(data = DataPostcode) + geom_point(aes(y=Age.group,x=Percent.fully.immunised...., color=State)) +
facet_wrap(~ Percent.fully.immunised...., nrow = 2)
#it looks like some of the postcodes cross states
I really am not very good at this plotting business, mostly because the variables in this data suck.
ggplot(data = DataPostcode) + geom_smooth(mapping = aes(x=Postcode, y=Percent.fully.immunised....))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#very stupid visual
Let’s look at more data #reading in the first lot of data:
DataTab2 <- read.csv(file="Healthy_CommunitiesTab2.csv", header=TRUE, sep=",")
summary(DataTab2)
## State PHN.code
## NSW :150 PHN101 : 15
## QLD :105 PHN102 : 15
## VIC : 75 PHN103 : 15
## WA : 45 PHN104 : 15
## SA : 30 PHN105 : 15
## ACT : 15 PHN106 : 15
## (Other): 45 (Other):375
## PHN.area.name Reporting.Year
## Adelaide : 15 2012-13:93
## Australian Capital Territory : 15 2013-14:93
## Brisbane North : 15 2014-15:93
## Brisbane South : 15 2015-16:93
## Central and Eastern Sydney : 15 2016-17:93
## Central Queensland, Wide Bay and Sunshine Coast: 15
## (Other) :375
## Age.group Number.of.registered.children Number.fully.immunised
## 1 year :155 1,116 : 2 1,085 : 2
## 2 years:155 4,310 : 2 10,936 : 2
## 5 years:155 5,040 : 2 11,153 : 2
## 5,548 : 2 12,026 : 2
## 5,613 : 2 12,604 : 2
## 6,080 : 2 2,849 : 2
## (Other):453 (Other):453
## Number.not.fully.immunised Percent.fully.immunised....
## 464 : 3 Min. :86.10
## 489 : 3 1st Qu.:90.80
## 988 : 3 Median :92.30
## 1,040 : 2 Mean :92.06
## 1,065 : 2 3rd Qu.:93.40
## 1,081 : 2 Max. :96.20
## (Other):450
str(DataTab2)
## 'data.frame': 465 obs. of 9 variables:
## $ State : Factor w/ 9 levels "ACT","NSW","NT",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ PHN.code : Factor w/ 31 levels "PHN101","PHN102",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ PHN.area.name : Factor w/ 31 levels "Adelaide","Australian Capital Territory",..: 5 20 30 16 26 25 28 13 17 15 ...
## $ Reporting.Year : Factor w/ 5 levels "2012-13","2013-14",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Age.group : Factor w/ 3 levels "1 year","2 years",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Number.of.registered.children: Factor w/ 455 levels "1,031","1,051",..: 193 21 116 293 104 359 265 133 295 205 ...
## $ Number.fully.immunised : Factor w/ 457 levels "1,017","1,019",..: 173 449 100 280 96 347 248 124 265 193 ...
## $ Number.not.fully.immunised : Factor w/ 407 levels "1,001","1,010",..: 73 340 21 191 403 200 154 332 279 139 ...
## $ Percent.fully.immunised.... : num 92.9 93.4 92.5 94.4 93.1 95.1 95.3 95.5 90.1 96.2 ...
#This is a much better dataset as it has numbers and percents
Let’s do some tidying though
#rename columns
colnames(DataTab2) = c("State","PHN","PHNarea","Year","Age","Nchild","Nfully","Nnotfully","Percfully")
adsg
#visualise more
Tab2_plot <- ggplot(DataTab2, aes(x = Year, y = Percfully))
Tab2_plot + geom_point(aes(color = Percfully))
#visualise more
ggplot(DataTab2, aes(x=Percfully,y=Year,z=Nfully)) + facet_wrap(~Age) + geom_point()
Set a new variable #rename columns colnames(DataTab2) = c(“State”,“PHN”,“PHNarea”,“Year”,“Age”,“Nchild”,“Nfully”,“Nnotfully”,“Percfully”)