Today we’ll use a new dataset for schools alongside the school roll
data, and some new data from the Ministry of Health. We’ll start by
loading in the roll and schools data, along with loading the
readxl
package for reading data directly from an excel
spreadsheet:
library(tidyverse)
library(readxl)
roll <- read_csv("http://www.massey.ac.nz/~jcmarsha/data/schools/roll.csv")
schools <- read_csv("http://www.massey.ac.nz/~jcmarsha/data/schools/schools.csv")
clean <- roll |> mutate(Level = as.numeric(substring(Level, 6)))
clean
## # A tibble: 66,261 × 4
## School EthnicGroup Level Students
## <chr> <chr> <dbl> <dbl>
## 1 3H School International Asian 10 1
## 2 3H School International Asian 7 2
## 3 3H School International European 7 1
## 4 3H School International Pacific 9 1
## 5 3H School International Asian 9 1
## 6 A1 Student Limited European 10 1
## 7 A1 Student Limited European 11 1
## 8 A1 Student Limited European 4 2
## 9 A1 Student Limited Asian 5 1
## 10 A1 Student Limited European 5 2
## # ℹ 66,251 more rows
We’ll be looking at how to summarise by more than one group, and how to turn that into a table. In addition, we’ll look at how we can join datasets together to summarise student level information by school level information (e.g. by region).
Make sure you can Knit this document successfully before you make changes.
In workshop 1 you produced a graph of the total number of students by year level. You should have got something like this:
year_totals <- clean |> group_by(Level) |> summarise(Total = sum(Students))
ggplot(year_totals) +
geom_col(mapping=aes(x=Level, y=Total))
This is an interesting graph as there is an unexpected peak at level 7 (caused by a reclassification of students at year 7/8) and it is also interesting to see the decrease as students leave school early. We might be interested to see if there are any patterns in other variables, such as in ethnicity (or perhaps by region - we’ll see how to get that information later).
To do this, we want to produce the same plot for each ethnicity. Now,
we could just go ahead and create separate plots by using
filter
fill
.In the last exercise you produced a graph of the total number of
students by year level and EthnicGroup. The problem with this chart is
that even if you play with the position
argument of
geom_col()
it is hard to see if the patterns in total
number of students by year level are similar or different across
different ethnicities.
An alternate is to produce a plot for each ethnicity. We could just
go ahead and create separate plots by using filter
eu_year_by_level <- clean |> filter(EthnicGroup == "European") |> group_by(Level) |> summarise(Total = sum(Students))
ggplot(eu_year_by_level) +
geom_col(mapping=aes(x=Level, y=Total))
as_year_by_level <- clean |> filter(EthnicGroup == "Asian") |> group_by(Level) |> summarise(Total = sum(Students))
ggplot(as_year_by_level) +
geom_col(mapping=aes(x=Level, y=Total))
But that gets quite awkward! Fortunately, ggplot
can do
grouping for us via the facet_wrap
or
facet_grid
commands. These produce what is known as ‘small
multiple’ plots - basically divide the plot up into subplots, and in
each subplot we show a different group with consistent style/axes
etc.
To do this, we’ll need to group by EthnicGroup as well as Level, and
then use facet_wrap
: as well as Level and Gender, and then
use facet_wrap
:
ethnicity_by_year <- clean |> group_by(EthnicGroup, Level) |> summarise(Total = sum(Students))
## `summarise()` has grouped output by 'EthnicGroup'. You can override using the
## `.groups` argument.
ggplot(ethnicity_by_year) +
geom_col(mapping=aes(x=Level, y=Total)) +
facet_wrap(vars(EthnicGroup))
The vars
helper function here is there so that
facet_wrap
knows to look for the named column in the data
frame.
facet_wrap
.schools
dataStart by taking a look at the schools
data:
schools
## # A tibble: 2,534 × 12
## School Sector Type Authority SchoolGender AffiliationType Affiliation
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3H School In… Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 2 A1 Student L… Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 3 ACG Parnell … Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 4 ACG Strathal… Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 5 ACG Sunderla… Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 6 ACG Tauranga Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 7 ADDI Enrichm… Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 8 AGE School Compo… Comp… Private:… Co-Ed No Affiliation Not Applic…
## 9 Abbotsford S… Prima… Full… State: N… Co-Ed No Affiliation Not Applic…
## 10 Aberdeen Sch… Prima… Cont… State: N… Co-Ed No Affiliation Not Applic…
## # ℹ 2,524 more rows
## # ℹ 5 more variables: MāoriAffiliation <chr>, RegionalCouncil <chr>,
## # TerritorialAuthority <chr>, Takiwa <chr>, EQI <dbl>
count(SchoolGender)
for this.left_join
If we want to summarise the student information across schools, we’ll need to join the two datasets together. Generally this is done by matching one or more “key” columns in both datasets. The most common way to do this is via a left join, where every row in the dataset we supply on the left is kept, and we match it to rows in the dataset on the right that share the same information across the key columns.
The following will do this:
all_data <- clean |> left_join(schools)
## Joining with `by = join_by(School)`
all_data
## # A tibble: 66,261 × 15
## School EthnicGroup Level Students Sector Type Authority SchoolGender
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 3H School Int… Asian 10 1 Compo… Comp… Private:… Co-Ed
## 2 3H School Int… Asian 7 2 Compo… Comp… Private:… Co-Ed
## 3 3H School Int… European 7 1 Compo… Comp… Private:… Co-Ed
## 4 3H School Int… Pacific 9 1 Compo… Comp… Private:… Co-Ed
## 5 3H School Int… Asian 9 1 Compo… Comp… Private:… Co-Ed
## 6 A1 Student Li… European 10 1 Compo… Comp… Private:… Co-Ed
## 7 A1 Student Li… European 11 1 Compo… Comp… Private:… Co-Ed
## 8 A1 Student Li… European 4 2 Compo… Comp… Private:… Co-Ed
## 9 A1 Student Li… Asian 5 1 Compo… Comp… Private:… Co-Ed
## 10 A1 Student Li… European 5 2 Compo… Comp… Private:… Co-Ed
## # ℹ 66,251 more rows
## # ℹ 7 more variables: AffiliationType <chr>, Affiliation <chr>,
## # MāoriAffiliation <chr>, RegionalCouncil <chr>, TerritorialAuthority <chr>,
## # Takiwa <chr>, EQI <dbl>
Notice that the left_join
function automatically found
which column to match on (based on matching column names) and that we
end up with the same number of rows in this case, as each row in the
clean roll dataset only matches a single school. We do, ofcourse, get
the additional variables from the schools
dataset.
We can now use the all_data
dataset to answer questions
using the variables from both datasets:
facet_wrap()
to produce
side by side plots by Takiwa.We can facet with more than one variable. e.g. we could alter the level by Takiwa plot to include ethnic group as follows:
all_data |>
filter(EthnicGroup != "International fee paying") |>
group_by(Takiwa, EthnicGroup, Level) |>
summarise(Total = sum(Students)) |>
ggplot() +
geom_col(mapping=aes(x=Level, y=Total)) +
facet_wrap(vars(Takiwa, EthnicGroup), ncol=6)
## `summarise()` has grouped output by 'Takiwa', 'EthnicGroup'. You can override
## using the `.groups` argument.
While this works, we end up with a lot of space used for all the
headings. An alternate is to use facet_grid()
which uses
one variable for rows and another for columns.
facet_grid()
.scales
parameter to facet_grid()
to
produce something that shows all subgroups reasonably well.pivot_wider
to turn tidy data into wide dataUp to now, all the data we’ve been playing with has been tidy: each row consists of a single observation and each column is a separate variable.
e.g. for this data we have a column for ethnic group and another column for number of students: we don’t have separate columns counting each ethnic group - the counts are all in one column, and which groups that count applies to are denoted in the other columns.
This makes it easy to do data manipulation, and also makes it easy to
plot stuff, as ggplot2
and dplyr
(indeed, the
whole tidyverse
suite of packages) works best if the data
are tidy.
But, sometimes we want things in other forms. The following example tables up the total number of students by ethnicity for each takiwa:
ethnicity_by_takiwa <- all_data |>
group_by(Takiwa, EthnicGroup) |>
summarise(Students = sum(Students))
## `summarise()` has grouped output by 'Takiwa'. You can override using the
## `.groups` argument.
ethnicity_by_takiwa
## # A tibble: 21 × 3
## # Groups: Takiwa [3]
## Takiwa EthnicGroup Students
## <chr> <chr> <dbl>
## 1 Te Tai Raro Asian 84332
## 2 Te Tai Raro European 100770
## 3 Te Tai Raro International fee paying 3130
## 4 Te Tai Raro MELAA 11391
## 5 Te Tai Raro Māori 62260
## 6 Te Tai Raro Other 2272
## 7 Te Tai Raro Pacific 56733
## 8 Te Tai Runga Asian 34371
## 9 Te Tai Runga European 158433
## 10 Te Tai Runga International fee paying 1816
## # ℹ 11 more rows
It would be nicer if we could make this into a table so that we had
one row for each ethnic group, and separate columns for the takiwa
totals. We can do this using pivot_wider
. The key arguments
are names_from
which is the column that provides the new
column names, and values_from
which is the column that
provides the values that should go in the new columns.
ethnicity_by_takiwa |>
pivot_wider(names_from = Takiwa, values_from = Students)
## # A tibble: 7 × 4
## EthnicGroup `Te Tai Raro` `Te Tai Runga` `Te Tai Whenua`
## <chr> <dbl> <dbl> <dbl>
## 1 Asian 84332 34371 21205
## 2 European 100770 158433 106268
## 3 International fee paying 3130 1816 1282
## 4 MELAA 11391 7866 5007
## 5 Māori 62260 54877 88986
## 6 Other 2272 2629 1829
## 7 Pacific 56733 16504 9077
Notice we have the exact same information, it’s just in a more
human-readable form. It’s not in a form that’s easier to plot though!
ggplot
would struggle with these data, as we don’t have a
single “Count” column for the y-axis to use.
group_by
Takiwa and year level then
summarise
. Once done, pivot_wider
to a
table).pivot_wider
by using a
mutate
with sum(Students)
.mutate()
and get the percentage by using
Students/sum(Students)
within each takiwa.Sometimes1 the data we’re given isn’t in the nice, tidy form we’ve seen so far. Often it’s on a spreadsheet, and that spreadsheet will include a bunch of junk that might be useful if a person is looking at it (e.g. notes, totals, averages etc) but is useless for actually working with the data.
As an example, let’s look at some real-world data by downloading the 1 July 2021 to 30 September 2021 immunisation data from here:
https://www.tewhatuora.govt.nz/for-the-health-sector/vaccine-information/immunisation-coverage/
Click on the link and go to the “Immunisation coverage data –
three-month reporting period” drop down list at the bottom. Download the
excel sheet for 1 July 2021 to 30 September 2021 and load it into Excel.
Let’s suppose we want the information in the Ethnicity
sheet (sheet 2).
Notice that there are several issues with these data:
This is a lot of problems to deal with. We’re going to keep things quite simple. Our goal today is to read in the “Total” data for the all milestones across the DHBs (i.e. we’re going to not worry about the Ethnicity data).
We’ll start by automatically downloading the file we want and loading
it into R using the readxl
package:
download.file(url = "https://www.tewhatuora.govt.nz/assets/For-the-health-sector/Health-sector-guidance/Vaccine-information-for-healthcare-professionals/Immunisation-coverage-data-three-month-reporting-period/1-July-2021-to-30-September-2021-Excel-67-KB.xlsx",
destfile = "immunisation.xlsx",
mode = "wb")
excel_sheets("immunisation.xlsx")
## [1] "Deprivation" "Ethnicity"
imm <- read_excel("immunisation.xlsx", sheet="Ethnicity")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
glimpse(imm)
## Rows: 160
## Columns: 20
## $ `Childhood Immunisation Coverage by milestone - fully immunised - 1 July-30 September 2021 (report from Qlik obtained 11 October 2021)` <chr> …
## $ ...2 <chr> …
## $ ...3 <chr> …
## $ ...4 <chr> …
## $ ...5 <chr> …
## $ ...6 <chr> …
## $ ...7 <chr> …
## $ ...8 <chr> …
## $ ...9 <chr> …
## $ ...10 <chr> …
## $ ...11 <chr> …
## $ ...12 <chr> …
## $ ...13 <chr> …
## $ ...14 <chr> …
## $ ...15 <chr> …
## $ ...16 <chr> …
## $ ...17 <chr> …
## $ ...18 <chr> …
## $ ...19 <chr> …
## $ ...20 <chr> …
Notice that this is basically junk. All the column types are
character, and are unnamed except for the first, which is named after
the notice at the top. We can skip some fo the junk using the
skip
parameter. We’re going to skip 3 rows to try and get
something useful in the first row of data:
imm <- read_excel("immunisation.xlsx", skip=3)
## New names:
## • `# Eligible` -> `# Eligible...3`
## • `# Fully immunised` -> `# Fully immunised...4`
## • `% Fully immunised` -> `% Fully immunised...5`
## • `# Eligible` -> `# Eligible...6`
## • `# Fully immunised` -> `# Fully immunised...7`
## • `% Fully immunised` -> `% Fully immunised...8`
## • `# Eligible` -> `# Eligible...9`
## • `# Fully immunised` -> `# Fully immunised...10`
## • `% Fully immunised` -> `% Fully immunised...11`
## • `# Eligible` -> `# Eligible...12`
## • `# Fully immunised` -> `# Fully immunised...13`
## • `% Fully immunised` -> `% Fully immunised...14`
## • `# Eligible` -> `# Eligible...15`
## • `# Fully immunised` -> `# Fully immunised...16`
## • `% Fully immunised` -> `% Fully immunised...17`
## • `# Eligible` -> `# Eligible...18`
## • `# Fully immunised` -> `# Fully immunised...19`
## • `% Fully immunised` -> `% Fully immunised...20`
## • `# Eligible` -> `# Eligible...21`
## • `# Fully immunised` -> `# Fully immunised...22`
## • `% Fully immunised` -> `% Fully immunised...23`
imm
## # A tibble: 158 × 23
## `Milestone age` `DHB of residence` `# Eligible...3` # Fully immunised...…¹
## <chr> <chr> <chr> <dbl>
## 1 All Milstone Cove… <NA> 101628 81517
## 2 6 months National Total 15151 11004
## 3 6 months Auckland 1274 994
## 4 6 months Bay of Plenty 834 539
## 5 6 months Canterbury 1673 1400
## 6 6 months Capital and Coast 821 647
## 7 6 months Counties Manukau 2111 1438
## 8 6 months Hawke's Bay 534 358
## 9 6 months Hutt Valley 547 444
## 10 6 months Lakes 398 251
## # ℹ 148 more rows
## # ℹ abbreviated name: ¹`# Fully immunised...4`
## # ℹ 19 more variables: `% Fully immunised...5` <dbl>, `# Eligible...6` <dbl>,
## # `# Fully immunised...7` <dbl>, `% Fully immunised...8` <dbl>,
## # `# Eligible...9` <dbl>, `# Fully immunised...10` <dbl>,
## # `% Fully immunised...11` <dbl>, `# Eligible...12` <chr>,
## # `# Fully immunised...13` <chr>, `% Fully immunised...14` <chr>, …
Ok, this is getting somewhere. We have generally got the data we want there, but notice the names aren’t that useful, and we have all the ethnicity information which would be useful to have, but we’re going to keep things simple and just keep the first 5 columns. We can ofcourse do that with select. In this case, the names to use aren’t all that useful, so we might instead use numbers:
imm_cleanish <- imm |> select(Milestone = 1, DHB = 2, Eligible = 3, Immunised = 4, Proportion = 5)
imm_cleanish
## # A tibble: 158 × 5
## Milestone DHB Eligible Immunised Proportion
## <chr> <chr> <chr> <dbl> <dbl>
## 1 All Milstone Coverage <NA> 101628 81517 0.802
## 2 6 months National Total 15151 11004 0.726
## 3 6 months Auckland 1274 994 0.780
## 4 6 months Bay of Plenty 834 539 0.646
## 5 6 months Canterbury 1673 1400 0.837
## 6 6 months Capital and Coast 821 647 0.788
## 7 6 months Counties Manukau 2111 1438 0.681
## 8 6 months Hawke's Bay 534 358 0.670
## 9 6 months Hutt Valley 547 444 0.812
## 10 6 months Lakes 398 251 0.631
## # ℹ 148 more rows
This isn’t too bad! Our next step will be to filter some things out.
e.g. we don’t want the “All milstone Coverage”2 row, and we don’t want
the “National Total” rows, so let’s filter those out. To see how we did,
we’ll use count
:
imm_cleaner <- imm_cleanish |> filter(Milestone != "All Milstone Coverage",
DHB != "National Total")
imm_cleaner |> count(DHB)
## # A tibble: 20 × 2
## DHB n
## <chr> <int>
## 1 Auckland 7
## 2 Bay of Plenty 7
## 3 Canterbury 7
## 4 Capital and Coast 7
## 5 Counties Manukau 7
## 6 Hawke's Bay 7
## 7 Hutt Valley 7
## 8 Lakes 7
## 9 MidCentral 7
## 10 Nelson Marlborough 7
## 11 Northland 7
## 12 South Canterbury 7
## 13 Southern 7
## 14 Tairawhiti 7
## 15 Taranaki 7
## 16 Waikato 7
## 17 Wairarapa 7
## 18 Waitemata 7
## 19 West Coast 7
## 20 Whanganui 7
imm_cleaner |> count(Milestone)
## # A tibble: 7 × 2
## Milestone n
## <chr> <int>
## 1 12 months 20
## 2 18 months 20
## 3 24 months 20
## 4 5 years 20
## 5 54 months 20
## 6 6 months 20
## 7 8 months 20
Yay! This is looking really nice now - we have 20 DHBs, each with 7
rows, and 7 milestones each with 20 DHBs. Looks great! One thing we note
though is the ‘Eligible’ column is of type chr
(short for
character) instead of numbers. Let’s fix that:
imm_clean <- imm_cleaner |> mutate(Eligible = as.numeric(Eligible))
imm_clean
## # A tibble: 140 × 5
## Milestone DHB Eligible Immunised Proportion
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 6 months Auckland 1274 994 0.780
## 2 6 months Bay of Plenty 834 539 0.646
## 3 6 months Canterbury 1673 1400 0.837
## 4 6 months Capital and Coast 821 647 0.788
## 5 6 months Counties Manukau 2111 1438 0.681
## 6 6 months Hawke's Bay 534 358 0.670
## 7 6 months Hutt Valley 547 444 0.812
## 8 6 months Lakes 398 251 0.631
## 9 6 months MidCentral 562 391 0.696
## 10 6 months Nelson Marlborough 391 299 0.765
## # ℹ 130 more rows
Nice. As a final check, let’s see if the Proportion column is correct, by computing it ourselves:
imm_clean |> mutate(PropCheck = Immunised/Eligible) |>
filter(abs(PropCheck - Proportion) > 1e-6)
## # A tibble: 0 × 6
## # ℹ 6 variables: Milestone <chr>, DHB <chr>, Eligible <dbl>, Immunised <dbl>,
## # Proportion <dbl>, PropCheck <dbl>
Neat - the proportion seems to be accurate. Now we can do some charting.
Your goal is to use the imm_clean
dataset to produce the
plot below:
knitr::include_graphics("https://www.massey.ac.nz/~jcmarsha/161324/labs/immunisation.png")
Some hints:
fct_rev
command is useful to reverse the order of a
factor variable.fct_relevel
command is useful for redefining the
order of a factor variable.#C582B2
(pink) and #B7B7B2
(grey).alpha=0.9
.theme_minimal
.