Workshop 2: Pivoting, joining, tidying, data visualisation

Today we’ll use a new dataset for schools alongside the school roll data, and some new data from the Ministry of Health. We’ll start by loading in the roll and schools data, along with loading the readxl package for reading data directly from an excel spreadsheet:

library(tidyverse)
library(readxl)

roll <- read_csv("http://www.massey.ac.nz/~jcmarsha/data/schools/roll.csv")
schools <- read_csv("http://www.massey.ac.nz/~jcmarsha/data/schools/schools.csv")

clean <- roll |> mutate(Level = as.numeric(substring(Level, 6)))
clean

## # A tibble: 66,261 × 4
##    School                  EthnicGroup Level Students
##    <chr>                   <chr>       <dbl>    <dbl>
##  1 3H School International Asian          10        1
##  2 3H School International Asian           7        2
##  3 3H School International European        7        1
##  4 3H School International Pacific         9        1
##  5 3H School International Asian           9        1
##  6 A1 Student Limited      European       10        1
##  7 A1 Student Limited      European       11        1
##  8 A1 Student Limited      European        4        2
##  9 A1 Student Limited      Asian           5        1
## 10 A1 Student Limited      European        5        2
## # ℹ 66,251 more rows

We’ll be looking at how to summarise by more than one group, and how to turn that into a table. In addition, we’ll look at how we can join datasets together to summarise student level information by school level information (e.g. by region).

Make sure you can Knit this document successfully before you make changes.

Small multiple plots

In workshop 1 you produced a graph of the total number of students by year level. You should have got something like this:

year_totals <- clean |> group_by(Level) |> summarise(Total = sum(Students))
ggplot(year_totals) +
  geom_col(mapping=aes(x=Level, y=Total))

This is an interesting graph as there is an unexpected peak at level 7 (caused by a reclassification of students at year 7/8) and it is also interesting to see the decrease as students leave school early. We might be interested to see if there are any patterns in other variables, such as in ethnicity (or perhaps by region - we’ll see how to get that information later).

To do this, we want to produce the same plot for each ethnicity. Now, we could just go ahead and create separate plots by using filter

Try yourself

Alter the chart above so as to include ethnic group on the chart. Hint: You’ll need to group by two variables for this, and utilise another aesthetic such as fill.

In the last exercise you produced a graph of the total number of students by year level and EthnicGroup. The problem with this chart is that even if you play with the position argument of geom_col() it is hard to see if the patterns in total number of students by year level are similar or different across different ethnicities.

An alternate is to produce a plot for each ethnicity. We could just go ahead and create separate plots by using filter

eu_year_by_level <- clean |> filter(EthnicGroup == "European") |> group_by(Level) |> summarise(Total = sum(Students))
ggplot(eu_year_by_level) +
  geom_col(mapping=aes(x=Level, y=Total))

as_year_by_level <- clean |> filter(EthnicGroup == "Asian") |> group_by(Level) |> summarise(Total = sum(Students))
ggplot(as_year_by_level) +
  geom_col(mapping=aes(x=Level, y=Total))

But that gets quite awkward! Fortunately, ggplot can do grouping for us via the facet_wrap or facet_grid commands. These produce what is known as ‘small multiple’ plots - basically divide the plot up into subplots, and in each subplot we show a different group with consistent style/axes etc.

To do this, we’ll need to group by EthnicGroup as well as Level, and then use facet_wrap: as well as Level and Gender, and then use facet_wrap:

ethnicity_by_year <- clean |> group_by(EthnicGroup, Level) |> summarise(Total = sum(Students))

## `summarise()` has grouped output by 'EthnicGroup'. You can override using the
## `.groups` argument.

ggplot(ethnicity_by_year) +
  geom_col(mapping=aes(x=Level, y=Total)) +
  facet_wrap(vars(EthnicGroup))

The vars helper function here is there so that facet_wrap knows to look for the named column in the data frame.

Try yourself

Try altering the above plot so that it uses a different y-axis for each plot. This can be useful when there’s differing numbers of students in each group. Hint: see the help for facet_wrap.
Redo the plot without the International fee paying students. You might want to see how you can set the number of rows and columns as well.

Looking at the `schools` data

Start by taking a look at the schools data:

schools

## # A tibble: 2,534 × 12
##    School        Sector Type  Authority SchoolGender AffiliationType Affiliation
##    <chr>         <chr>  <chr> <chr>     <chr>        <chr>           <chr>      
##  1 3H School In… Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  2 A1 Student L… Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  3 ACG Parnell … Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  4 ACG Strathal… Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  5 ACG Sunderla… Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  6 ACG Tauranga  Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  7 ADDI Enrichm… Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  8 AGE School    Compo… Comp… Private:… Co-Ed        No Affiliation  Not Applic…
##  9 Abbotsford S… Prima… Full… State: N… Co-Ed        No Affiliation  Not Applic…
## 10 Aberdeen Sch… Prima… Cont… State: N… Co-Ed        No Affiliation  Not Applic…
## # ℹ 2,524 more rows
## # ℹ 5 more variables: MāoriAffiliation <chr>, RegionalCouncil <chr>,
## #   TerritorialAuthority <chr>, Takiwa <chr>, EQI <dbl>

Try yourself

How many schools are there in the Palmerston North Territorial Authority?
Of the schools in Palmerston North, how many are co-ed, single-sex girls or single-sex boys? You can use count(SchoolGender) for this.
Produce a chart that shows the breakdown in secondary schools by school gender (co-ed, single-sex girls, single-sex boys) for each of the three Takiwa (education regions). You may wish to remove the two schools that have senior co-ed and junior boys only.

Joining datasets with `left_join`

If we want to summarise the student information across schools, we’ll need to join the two datasets together. Generally this is done by matching one or more “key” columns in both datasets. The most common way to do this is via a left join, where every row in the dataset we supply on the left is kept, and we match it to rows in the dataset on the right that share the same information across the key columns.

The following will do this:

all_data <- clean |> left_join(schools)

## Joining with `by = join_by(School)`

all_data

## # A tibble: 66,261 × 15
##    School         EthnicGroup Level Students Sector Type  Authority SchoolGender
##    <chr>          <chr>       <dbl>    <dbl> <chr>  <chr> <chr>     <chr>       
##  1 3H School Int… Asian          10        1 Compo… Comp… Private:… Co-Ed       
##  2 3H School Int… Asian           7        2 Compo… Comp… Private:… Co-Ed       
##  3 3H School Int… European        7        1 Compo… Comp… Private:… Co-Ed       
##  4 3H School Int… Pacific         9        1 Compo… Comp… Private:… Co-Ed       
##  5 3H School Int… Asian           9        1 Compo… Comp… Private:… Co-Ed       
##  6 A1 Student Li… European       10        1 Compo… Comp… Private:… Co-Ed       
##  7 A1 Student Li… European       11        1 Compo… Comp… Private:… Co-Ed       
##  8 A1 Student Li… European        4        2 Compo… Comp… Private:… Co-Ed       
##  9 A1 Student Li… Asian           5        1 Compo… Comp… Private:… Co-Ed       
## 10 A1 Student Li… European        5        2 Compo… Comp… Private:… Co-Ed       
## # ℹ 66,251 more rows
## # ℹ 7 more variables: AffiliationType <chr>, Affiliation <chr>,
## #   MāoriAffiliation <chr>, RegionalCouncil <chr>, TerritorialAuthority <chr>,
## #   Takiwa <chr>, EQI <dbl>

Notice that the left_join function automatically found which column to match on (based on matching column names) and that we end up with the same number of rows in this case, as each row in the clean roll dataset only matches a single school. We do, ofcourse, get the additional variables from the schools dataset.

We can now use the all_data dataset to answer questions using the variables from both datasets:

Try yourself

How many students are there in each Regional Council?
How many students of each ethnicity are there in each Takiwa?
How many students are in schools with a religious affiliation?
Produce a chart showing the total number of students of each year level within each Takiwa by using facet_wrap() to produce side by side plots by Takiwa.

Facetting with two variables

We can facet with more than one variable. e.g. we could alter the level by Takiwa plot to include ethnic group as follows:

all_data |>
  filter(EthnicGroup != "International fee paying") |>
  group_by(Takiwa, EthnicGroup, Level) |>
  summarise(Total = sum(Students)) |>
  ggplot() +
  geom_col(mapping=aes(x=Level, y=Total)) +
  facet_wrap(vars(Takiwa, EthnicGroup), ncol=6)

## `summarise()` has grouped output by 'Takiwa', 'EthnicGroup'. You can override
## using the `.groups` argument.

While this works, we end up with a lot of space used for all the headings. An alternate is to use facet_grid() which uses one variable for rows and another for columns.

Try yourself

Alter the above chart to show how the number of students in each year level differs by ethnicity in each Takiwa using facet_grid().
Play around with which variable is in rows and which is in columns and the scales parameter to facet_grid() to produce something that shows all subgroups reasonably well.

Using `pivot_wider` to turn tidy data into wide data

Up to now, all the data we’ve been playing with has been tidy: each row consists of a single observation and each column is a separate variable.

e.g. for this data we have a column for ethnic group and another column for number of students: we don’t have separate columns counting each ethnic group - the counts are all in one column, and which groups that count applies to are denoted in the other columns.

This makes it easy to do data manipulation, and also makes it easy to plot stuff, as ggplot2 and dplyr (indeed, the whole tidyverse suite of packages) works best if the data are tidy.

But, sometimes we want things in other forms. The following example tables up the total number of students by ethnicity for each takiwa:

ethnicity_by_takiwa <- all_data |>
  group_by(Takiwa, EthnicGroup) |>
  summarise(Students = sum(Students))

## `summarise()` has grouped output by 'Takiwa'. You can override using the
## `.groups` argument.

ethnicity_by_takiwa

## # A tibble: 21 × 3
## # Groups:   Takiwa [3]
##    Takiwa       EthnicGroup              Students
##    <chr>        <chr>                       <dbl>
##  1 Te Tai Raro  Asian                       84332
##  2 Te Tai Raro  European                   100770
##  3 Te Tai Raro  International fee paying     3130
##  4 Te Tai Raro  MELAA                       11391
##  5 Te Tai Raro  Māori                       62260
##  6 Te Tai Raro  Other                        2272
##  7 Te Tai Raro  Pacific                     56733
##  8 Te Tai Runga Asian                       34371
##  9 Te Tai Runga European                   158433
## 10 Te Tai Runga International fee paying     1816
## # ℹ 11 more rows

It would be nicer if we could make this into a table so that we had one row for each ethnic group, and separate columns for the takiwa totals. We can do this using pivot_wider. The key arguments are names_from which is the column that provides the new column names, and values_from which is the column that provides the values that should go in the new columns.

ethnicity_by_takiwa |>
  pivot_wider(names_from = Takiwa, values_from = Students)

## # A tibble: 7 × 4
##   EthnicGroup              `Te Tai Raro` `Te Tai Runga` `Te Tai Whenua`
##   <chr>                            <dbl>          <dbl>           <dbl>
## 1 Asian                            84332          34371           21205
## 2 European                        100770         158433          106268
## 3 International fee paying          3130           1816            1282
## 4 MELAA                            11391           7866            5007
## 5 Māori                            62260          54877           88986
## 6 Other                             2272           2629            1829
## 7 Pacific                          56733          16504            9077

Notice we have the exact same information, it’s just in a more human-readable form. It’s not in a form that’s easier to plot though! ggplot would struggle with these data, as we don’t have a single “Count” column for the y-axis to use.

Try yourself

Create a table with total number of students in each year level in each Takiwa (Hint: group_by Takiwa and year level then summarise. Once done, pivot_wider to a table).
Create a table with the number of each ethnic group in each year level.
Try adding a “Total” column to the ethnic group by takiwa table. You could do this before the pivot_wider by using a mutate with sum(Students).
Try creating a table with the percentage of students of each ethnic group within each Takiwa. You can add a new column with mutate() and get the percentage by using Students/sum(Students) within each takiwa.

Tidying data

Sometimes¹ the data we’re given isn’t in the nice, tidy form we’ve seen so far. Often it’s on a spreadsheet, and that spreadsheet will include a bunch of junk that might be useful if a person is looking at it (e.g. notes, totals, averages etc) but is useless for actually working with the data.

As an example, let’s look at some real-world data by downloading the 1 July 2021 to 30 September 2021 immunisation data from here:

https://www.tewhatuora.govt.nz/for-the-health-sector/vaccine-information/immunisation-coverage/

Click on the link and go to the “Immunisation coverage data – three-month reporting period” drop down list at the bottom. Download the excel sheet for 1 July 2021 to 30 September 2021 and load it into Excel. Let’s suppose we want the information in the Ethnicity sheet (sheet 2).

Notice that there are several issues with these data:

There are merged cells with Ethnicity.
The same information (counts of immunisations, counts of eligible, percentage information) is repeated across multiple columns.
There is information rows at the top.
Column names are distributed across multiple rows.
The Milestone age is only filled in on some rows, and is otherwise to be inferred from the lines/boxes on the spreadsheet (i.e. data is encoded in the presentation).
There are summaries in rows as well as data (e.g. the Total row alongside the DHB rows, the “All Milestones” row).
There is ‘key’ and other note information down the bottom of the spreadsheet.
There is a special entry ‘n/s’ for suppressed data. There is also ‘-’ for missing data.

This is a lot of problems to deal with. We’re going to keep things quite simple. Our goal today is to read in the “Total” data for the all milestones across the DHBs (i.e. we’re going to not worry about the Ethnicity data).

We’ll start by automatically downloading the file we want and loading it into R using the readxl package:

download.file(url = "https://www.tewhatuora.govt.nz/assets/For-the-health-sector/Health-sector-guidance/Vaccine-information-for-healthcare-professionals/Immunisation-coverage-data-three-month-reporting-period/1-July-2021-to-30-September-2021-Excel-67-KB.xlsx",
              destfile = "immunisation.xlsx",
              mode = "wb")
excel_sheets("immunisation.xlsx")

## [1] "Deprivation" "Ethnicity"

imm <- read_excel("immunisation.xlsx", sheet="Ethnicity")

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`

glimpse(imm)

## Rows: 160
## Columns: 20
## $ `Childhood Immunisation Coverage by milestone - fully immunised - 1 July-30 September 2021 (report from Qlik obtained 11 October 2021)` <chr> …
## $ ...2                                                                                                                                    <chr> …
## $ ...3                                                                                                                                    <chr> …
## $ ...4                                                                                                                                    <chr> …
## $ ...5                                                                                                                                    <chr> …
## $ ...6                                                                                                                                    <chr> …
## $ ...7                                                                                                                                    <chr> …
## $ ...8                                                                                                                                    <chr> …
## $ ...9                                                                                                                                    <chr> …
## $ ...10                                                                                                                                   <chr> …
## $ ...11                                                                                                                                   <chr> …
## $ ...12                                                                                                                                   <chr> …
## $ ...13                                                                                                                                   <chr> …
## $ ...14                                                                                                                                   <chr> …
## $ ...15                                                                                                                                   <chr> …
## $ ...16                                                                                                                                   <chr> …
## $ ...17                                                                                                                                   <chr> …
## $ ...18                                                                                                                                   <chr> …
## $ ...19                                                                                                                                   <chr> …
## $ ...20                                                                                                                                   <chr> …

Notice that this is basically junk. All the column types are character, and are unnamed except for the first, which is named after the notice at the top. We can skip some fo the junk using the skip parameter. We’re going to skip 3 rows to try and get something useful in the first row of data:

imm <- read_excel("immunisation.xlsx", skip=3)

## New names:
## • `# Eligible` -> `# Eligible...3`
## • `# Fully immunised` -> `# Fully immunised...4`
## • `% Fully immunised` -> `% Fully immunised...5`
## • `# Eligible` -> `# Eligible...6`
## • `# Fully immunised` -> `# Fully immunised...7`
## • `% Fully immunised` -> `% Fully immunised...8`
## • `# Eligible` -> `# Eligible...9`
## • `# Fully immunised` -> `# Fully immunised...10`
## • `% Fully immunised` -> `% Fully immunised...11`
## • `# Eligible` -> `# Eligible...12`
## • `# Fully immunised` -> `# Fully immunised...13`
## • `% Fully immunised` -> `% Fully immunised...14`
## • `# Eligible` -> `# Eligible...15`
## • `# Fully immunised` -> `# Fully immunised...16`
## • `% Fully immunised` -> `% Fully immunised...17`
## • `# Eligible` -> `# Eligible...18`
## • `# Fully immunised` -> `# Fully immunised...19`
## • `% Fully immunised` -> `% Fully immunised...20`
## • `# Eligible` -> `# Eligible...21`
## • `# Fully immunised` -> `# Fully immunised...22`
## • `% Fully immunised` -> `% Fully immunised...23`

imm

## # A tibble: 158 × 23
##    `Milestone age`    `DHB of residence` `# Eligible...3` # Fully immunised...…¹
##    <chr>              <chr>              <chr>                             <dbl>
##  1 All Milstone Cove… <NA>               101628                            81517
##  2 6 months           National Total     15151                             11004
##  3 6 months           Auckland           1274                                994
##  4 6 months           Bay of Plenty      834                                 539
##  5 6 months           Canterbury         1673                               1400
##  6 6 months           Capital and Coast  821                                 647
##  7 6 months           Counties Manukau   2111                               1438
##  8 6 months           Hawke's Bay        534                                 358
##  9 6 months           Hutt Valley        547                                 444
## 10 6 months           Lakes              398                                 251
## # ℹ 148 more rows
## # ℹ abbreviated name: ¹`# Fully immunised...4`
## # ℹ 19 more variables: `% Fully immunised...5` <dbl>, `# Eligible...6` <dbl>,
## #   `# Fully immunised...7` <dbl>, `% Fully immunised...8` <dbl>,
## #   `# Eligible...9` <dbl>, `# Fully immunised...10` <dbl>,
## #   `% Fully immunised...11` <dbl>, `# Eligible...12` <chr>,
## #   `# Fully immunised...13` <chr>, `% Fully immunised...14` <chr>, …

Ok, this is getting somewhere. We have generally got the data we want there, but notice the names aren’t that useful, and we have all the ethnicity information which would be useful to have, but we’re going to keep things simple and just keep the first 5 columns. We can ofcourse do that with select. In this case, the names to use aren’t all that useful, so we might instead use numbers:

imm_cleanish <- imm |> select(Milestone = 1, DHB = 2, Eligible = 3, Immunised = 4, Proportion = 5)
imm_cleanish

## # A tibble: 158 × 5
##    Milestone             DHB               Eligible Immunised Proportion
##    <chr>                 <chr>             <chr>        <dbl>      <dbl>
##  1 All Milstone Coverage <NA>              101628       81517      0.802
##  2 6 months              National Total    15151        11004      0.726
##  3 6 months              Auckland          1274           994      0.780
##  4 6 months              Bay of Plenty     834            539      0.646
##  5 6 months              Canterbury        1673          1400      0.837
##  6 6 months              Capital and Coast 821            647      0.788
##  7 6 months              Counties Manukau  2111          1438      0.681
##  8 6 months              Hawke's Bay       534            358      0.670
##  9 6 months              Hutt Valley       547            444      0.812
## 10 6 months              Lakes             398            251      0.631
## # ℹ 148 more rows

This isn’t too bad! Our next step will be to filter some things out. e.g. we don’t want the “All milstone Coverage”² row, and we don’t want the “National Total” rows, so let’s filter those out. To see how we did, we’ll use count:

imm_cleaner <- imm_cleanish |> filter(Milestone != "All Milstone Coverage",
                               DHB != "National Total")
imm_cleaner |> count(DHB)

## # A tibble: 20 × 2
##    DHB                    n
##    <chr>              <int>
##  1 Auckland               7
##  2 Bay of Plenty          7
##  3 Canterbury             7
##  4 Capital and Coast      7
##  5 Counties Manukau       7
##  6 Hawke's Bay            7
##  7 Hutt Valley            7
##  8 Lakes                  7
##  9 MidCentral             7
## 10 Nelson Marlborough     7
## 11 Northland              7
## 12 South Canterbury       7
## 13 Southern               7
## 14 Tairawhiti             7
## 15 Taranaki               7
## 16 Waikato                7
## 17 Wairarapa              7
## 18 Waitemata              7
## 19 West Coast             7
## 20 Whanganui              7

imm_cleaner |> count(Milestone)

## # A tibble: 7 × 2
##   Milestone     n
##   <chr>     <int>
## 1 12 months    20
## 2 18 months    20
## 3 24 months    20
## 4 5 years      20
## 5 54 months    20
## 6 6 months     20
## 7 8 months     20

Yay! This is looking really nice now - we have 20 DHBs, each with 7 rows, and 7 milestones each with 20 DHBs. Looks great! One thing we note though is the ‘Eligible’ column is of type chr (short for character) instead of numbers. Let’s fix that:

imm_clean <- imm_cleaner |> mutate(Eligible = as.numeric(Eligible))
imm_clean

## # A tibble: 140 × 5
##    Milestone DHB                Eligible Immunised Proportion
##    <chr>     <chr>                 <dbl>     <dbl>      <dbl>
##  1 6 months  Auckland               1274       994      0.780
##  2 6 months  Bay of Plenty           834       539      0.646
##  3 6 months  Canterbury             1673      1400      0.837
##  4 6 months  Capital and Coast       821       647      0.788
##  5 6 months  Counties Manukau       2111      1438      0.681
##  6 6 months  Hawke's Bay             534       358      0.670
##  7 6 months  Hutt Valley             547       444      0.812
##  8 6 months  Lakes                   398       251      0.631
##  9 6 months  MidCentral              562       391      0.696
## 10 6 months  Nelson Marlborough      391       299      0.765
## # ℹ 130 more rows

Nice. As a final check, let’s see if the Proportion column is correct, by computing it ourselves:

imm_clean |> mutate(PropCheck = Immunised/Eligible) |>
  filter(abs(PropCheck - Proportion) > 1e-6)

## # A tibble: 0 × 6
## # ℹ 6 variables: Milestone <chr>, DHB <chr>, Eligible <dbl>, Immunised <dbl>,
## #   Proportion <dbl>, PropCheck <dbl>

Neat - the proportion seems to be accurate. Now we can do some charting.

Try yourself

Your goal is to use the imm_clean dataset to produce the plot below:

knitr::include_graphics("https://www.massey.ac.nz/~jcmarsha/161324/labs/immunisation.png")

Some hints:

The fct_rev command is useful to reverse the order of a factor variable.
The fct_relevel command is useful for redefining the order of a factor variable.
The line is at 90% - this is the goal for each of these milestones.
The bars are coloured based on whether the proportion has reached or surpassed the 90% goal.
The colours are #C582B2 (pink) and #B7B7B2 (grey).
The bars are slightly transparent with alpha=0.9.
The chart uses theme_minimal.
Don’t worry about font size as this is display-specific.

Almost all the time.↩︎
Real data has typos All. The. Time.↩︎

Workshop 2: Pivoting, joining, tidying, data visualisation

Small multiple plots

Try yourself

Try yourself

Looking at the schools data

Try yourself

Joining datasets with left_join

Try yourself

Facetting with two variables

Try yourself

Using pivot_wider to turn tidy data into wide data

Try yourself

Tidying data

Try yourself

Looking at the `schools` data

Joining datasets with `left_join`

Using `pivot_wider` to turn tidy data into wide data