MATH2349 Data Wrangling

Submission Steps:

1 . Prepare Assignment 2 report using this R Markdown template. 2 . Upload the report as a PDF file via the File Upload tab under the Assignment 2 page in CANVAS (see instructions file for details) after you attach the file click Submit assignment. 3 . Publish the report to RPubs

If you have any questions regarding the assignment instructions and the R Markdown template, please post it on discussion board. NOTE: Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections.

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
library(readr)
library(xlsx)
library(readxl)
library(foreign)
library(gdata)

## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

##

## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:gdata':
## 
##     combine, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

The objective of this assignment is to demonstrate Data Wrangling skills acquired over Semester 2, 2020. We will use national data relating to Schools sourced from the Australian Curriculum, Assessment and Reporting Authority (ACARA) and match it up with a Socio-Economic Index for Areas (SEIFA) sourced from the Australian Bureau of Statistics (ABS). The large dataframe can then form the basis for analysis across a number of school variables against socio-economic data at the Local Government Area (LGA) level. For the purpose of this assignment however we will create the initial larger database, then dwelve into the SEIFA data as it relates to Student-Teacher ratios only.

We note that a proper analysis would require the development of both a null, and an alternative, hypotheseis. However this is beyond the scope of this assignment.

There are three data sets that will be used:

-(1) ACARA School Profiles;

-(2) ACARA School Location;and,

-(3) SEIFA.

We also note that while ACARA’s School Profile dataset does contain a socio-economic index (ICSEA), we will not use this, and instead look to use the ABS-sourced SEIFA data. There are three key differences:

The ICSEA data is created by ACARA, whereas the SEIFA data is created by the ABS. That is, we will examine the school variable (Student-Teacher ratio) using a socio-economic methodology created by an organisation external to ACARA.
The ICSEA data is at the school level. On the other hand, the SEIFA data is at the LGA level.
The SEIFA rankings are based on deciles of 1 to 10, whereas ACARA’s ICSEA number ranges from 100 to 1,278. This makes the ABS SEIFA decile rankings easier to collate into Low, Medium, and High groupings for our analysis.

There is also a mild temporal mismatch between the ACARA data (2019) vs the SEIFA data (2016). However, importantly, the LGA data from both datasets is contemporaneous as it is from 2016. The temporal mismatch of 3 years that relates to the socioeconomic analysis only, a period which is too small to affect our analysis.

Data

1. ACARA School Profiles

The ACARA School Profiles data is produced by the Australian Curriculum, Assessment and Reporting Authority.

It contains School Profiles for the year 2019, covering 32 variables and 9,562 observations.

Data source: www.acara.edu.au > Contact Us > ACARA Data Access Program > Data Available > School Profile 2019 https://www.acara.edu.au/docs/default-source/default-document-library/school-profile-2019d55613404c94637ead88ff00003e0139.xlsx

Although the ACARA School Profiles dataset contains 32 variables, ultimately only five are relevant for the purposes of this analysis. These variables are:

ACARA.SML.ID: An ID variable that uniquely identifies each school (i.e. each observation). That is, every row in the data frame has a unique ID.
School.Name.x: The official name of the school.
Geolocation: A character variable that classifies the locations of schools into Very Remote, Remote, Outer Regional, Inner Regional, and Major Cities.
Full.Time.Equivalent.Teaching.Staff: A numerical variable that quantifies the full-time equivalent of full-time and part-time teaching staff at each school.
Full.Time.Equivalent.Enrolments: The number of students enrolled at the school expressed as full-time students.

2. ACARA School Locations dataset

The ACARA School Profiles data will be matched up with ACARA School Location data. This is because the Locations data matches schools against Local Government Area (LGA) codes, which we can then use to match against the SEIFA data.

Data source: www.acara.edu.au > Contact Us > ACARA Data Access Program > Data Available > School location 2019 https://www.acara.edu.au/docs/default-source/default-document-library/school-location-2019195713404c94637ead88ff00003e0139.xlsx?sfvrsn=a0c37007_0

The ACARA School Locations dataset contains 25 variables, however only three are relevant for the purposes of this analysis. These variables are:

ACARA.SML.ID: An ID variable that uniquely identifies each school (i.e. each observation). This is the same identifier in the ACARA Schools Profiles dataset.
Local Government Area: A code identifying the Local Government Area, sourced from the ABS Australian Statistical Geography Standard (ASGS) Allocation Files (2016). This is the same identifier that exists in the SEIFA dataset.
Local Government Area Name: Name of Local Government Area.

3. SEIFA (Table 2) dataset

The SEIFA (Table 2) dataset is produced by the Australian Bureau of Statistics and is current as at 2016. It ranks areas throughout Australia according to relative socio-economic advantage and disadvantage. It includes Local Government Area (LGA) codes, which we can use to match against the ACARA data.

Source: https://www.abs.gov.au/ausstats/abs@.nsf/Lookup/by%20Subject/2033.0.55.001~2016~Main%20Features~SOCIO-ECONOMIC%20INDEXES%20FOR%20AREAS%20(SEIFA)%202016~1

The SEIFA Table 2 dataset contains 15 variables, however only three are relevant for the purposes of this analysis. These variables are:

2016 Local Government Area (LGA) Code: A code identifying the Local Government Area, sourced from the ABS Australian Statistical Geography Standard (ASGS) Allocation Files (2016). This is the same identifier that exists in the ACARA Schools Location dataset.
2016 Local Government Area (LGA) Name: Name of Local Government Area.
Decile Ranking of the LGA within Australia: This column gives the decile ranking of the LGA based on the LGA level SEIFA score.

# Read in the ACARA Schools data
ACARA <- read.xlsx("ACARA_Schools.xlsx", sheetName = "SchoolProfile 2019")

# Read the head of ACARA 
head(ACARA)

# Provide the structure of ACARA
str(ACARA)

## 'data.frame':    9562 obs. of  32 variables:
##  $ Calendar.Year                             : num  2019 2019 2019 2019 2019 ...
##  $ ACARA.SML.ID                              : num  40000 40001 40002 40003 40004 ...
##  $ AGE.ID                                    : num  3 4 5 7 9 10 11 12 13 18 ...
##  $ School.Name                               : chr  "Corpus Christi Catholic School" "Fahan School" "Geneva Christian College" "Holy Rosary Catholic School" ...
##  $ Suburb                                    : chr  "Bellerive" "Sandy Bay" "Latrobe" "Claremont" ...
##  $ State                                     : chr  "TAS" "TAS" "TAS" "TAS" ...
##  $ Postcode                                  : chr  "7018" "7005" "7307" "7011" ...
##  $ School.Sector                             : chr  "Catholic" "Independent" "Independent" "Catholic" ...
##  $ School.Type                               : chr  "Primary" "Combined" "Combined" "Primary" ...
##  $ Campus.Type                               : chr  "School Single Entity" "School Single Entity" "School Single Entity" "School Single Entity" ...
##  $ Rolled.Reporting.Description              : chr  "Individual Reporting" "Individual Reporting" "Individual Reporting" "Individual Reporting" ...
##  $ School.URL                                : chr  "http://www.corpuschristi.tas.edu.au" "http://www.fahan.tas.edu.au" "http://www.geneva.tas.edu.au" "http://www.holyrosary.tas.edu.au" ...
##  $ Governing.Body                            : chr  "Tasmanian Catholic Education Office" "Independent Schools Tasmania" "Independent Schools Tasmania" "Tasmanian Catholic Education Office" ...
##  $ Governing.Body.URL                        : chr  "http://www.catholic.tas.edu.au" "http://www.independentschools.tas.edu.au" "http://www.independentschools.tas.edu.au" "http://www.catholic.tas.edu.au" ...
##  $ Year.Range                                : chr  "Prep-6" "Prep-12" "Prep-12" "Prep-6" ...
##  $ Geolocation                               : chr  "Inner Regional" "Inner Regional" "Outer Regional" "Inner Regional" ...
##  $ ICSEA                                     : num  1061 1136 995 997 992 ...
##  $ ICSEA.Percentile                          : num  73 93 44 44 42 71 47 84 8 55 ...
##  $ Bottom.SEA.Quarter....                    : num  11 3 29 27 32 11 26 6 55 21 ...
##  $ Lower.Middle.SEA.Quarter....              : num  29 10 33 39 28 29 38 17 29 31 ...
##  $ Upper.Middle.SEA.Quarter....              : num  33 24 27 25 23 38 23 32 12 28 ...
##  $ Top.SEA.Quarter....                       : num  28 63 11 9 16 23 12 46 4 20 ...
##  $ Teaching.Staff                            : num  27 41 23 29 15 13 18 94 30 29 ...
##  $ Full.Time.Equivalent.Teaching.Staff       : num  21.8 36.3 17 25.1 12.4 10 12.4 82.1 24.6 23 ...
##  $ Non.Teaching.Staff                        : num  19 29 42 21 15 13 16 66 21 31 ...
##  $ Full.Time.Equivalent.Non.Teaching.Staff   : num  11.4 19.3 24.2 11.3 7.1 4.9 5.9 49.9 11.8 14.4 ...
##  $ Total.Enrolments                          : num  395 404 235 392 202 105 189 753 260 411 ...
##  $ Girls.Enrolments                          : num  185 404 110 179 110 56 99 369 124 201 ...
##  $ Boys.Enrolments                           : num  210 0 125 213 92 49 90 384 136 210 ...
##  $ Full.Time.Equivalent.Enrolments           : num  395 404 235 392 202 ...
##  $ Indigenous.Enrolments....                 : num  2 1 6 6 11 0 6 3 37 9 ...
##  $ Language.Background.Other.Than.English....: num  5 10 3 1 14 1 4 9 0 4 ...

# Read in the Schools_LGA data
LGA_Sch <- read.xlsx("LGA_Schools.xlsx", sheetName = "School Location 2019")

# Read the head of LGA_Sch
head(LGA_Sch)

# Provide the structure of LGA_Sch
str(LGA_Sch)

## 'data.frame':    10507 obs. of  25 variables:
##  $ Calendar.Year               : num  2019 2019 2019 2019 2019 ...
##  $ ACARA.SML.ID                : num  40000 40001 40002 40003 40004 ...
##  $ AGE.ID                      : num  3 4 5 7 9 10 11 12 13 18 ...
##  $ School.Name                 : chr  "Corpus Christi Catholic School" "Fahan School" "Geneva Christian College" "Holy Rosary Catholic School" ...
##  $ Suburb                      : chr  "BELLERIVE" "SANDY BAY" "LATROBE" "CLAREMONT" ...
##  $ State                       : chr  "TAS" "TAS" "TAS" "TAS" ...
##  $ Postcode                    : chr  "7018" "7005" "7307" "7011" ...
##  $ School.Sector               : chr  "Catholic" "Independent" "Independent" "Catholic" ...
##  $ School.Type                 : chr  "Primary" "Combined" "Combined" "Primary" ...
##  $ Campus.Type                 : chr  "School Single Entity" "School Single Entity" "School Single Entity" "School Single Entity" ...
##  $ Rolled.Reporting.Description: chr  "Individual Reporting" "Individual Reporting" "Individual Reporting" "Individual Reporting" ...
##  $ Latitude                    : num  -42.9 -42.9 -41.2 -42.8 -42.9 ...
##  $ Longitude                   : num  147 147 146 147 147 ...
##  $ ABS.Remoteness.Area         : num  1 1 2 1 1 1 1 1 2 1 ...
##  $ ABS.Remoteness.Area.Name    : chr  "Inner Regional" "Inner Regional" "Outer Regional" "Inner Regional" ...
##  $ Meshblock                   : num  6.00e+10 6.01e+10 6.01e+10 6.00e+10 6.01e+10 ...
##  $ Statistical.Area.1          : num  6.01e+10 6.01e+10 6.04e+10 6.01e+10 6.01e+10 ...
##  $ Statistical.Area.2          : num  6.01e+08 6.01e+08 6.04e+08 6.01e+08 6.01e+08 ...
##  $ Statistical.Area.2.Name     : chr  "Bellerive - Rosny" "Sandy Bay" "Latrobe" "Claremont (Tas.)" ...
##  $ Statistical.Area.3          : num  60102 60105 60402 60103 60105 ...
##  $ Statistical.Area.3.Name     : chr  "Hobart - North East" "Hobart Inner" "Devonport" "Hobart - North West" ...
##  $ Statistical.Area.4          : num  601 601 604 601 601 602 602 602 603 604 ...
##  $ Statistical.Area.4.Name     : chr  "Hobart" "Hobart" "West and North West" "Hobart" ...
##  $ Local.Government.Area       : num  61410 62810 63810 62610 62810 ...
##  $ Local.Government.Area.Name  : chr  "Clarence (C)" "Hobart (C)" "Latrobe (M) (Tas.)" "Glenorchy (C)" ...

# Read in SEIFA Decile rankings into R
 SEIFA <- read.xlsx("SEIFA.xls", sheetName = "Table 2")

# Check head of SEIFA
head(SEIFA)

# Check structure of SEIFA
str(SEIFA)

## 'data.frame':    2186 obs. of  16 variables:
##  $ X2016.Local.Government.Area..LGA..Code                                    : chr  NA "10050" "10130" "10250" ...
##  $ X2016.Local.Government.Area..LGA..Name.                                   : chr  NA "Albury (C)" "Armidale Regional (A)" "Ballina (A)" ...
##  $ Decile.ranking.of.the.LGA.within.Australia                                : chr  NA "5" "7" "8" ...
##  $ NA.                                                                       : logi  NA NA NA NA NA NA ...
##  $ Decile.distribution.of.the.Statistical.Area.Level.1s..SA1s..within.the.LGA: chr  "Decile 1 SA1s" "28" "8" "10" ...
##  $ NA..1                                                                     : chr  "Decile 2 SA1s" "13" "11" "12" ...
##  $ NA..2                                                                     : chr  "Decile 3 SA1s" "16" "7" "11" ...
##  $ NA..3                                                                     : chr  "Decile 4 SA1s" "10" "7" "12" ...
##  $ NA..4                                                                     : chr  "Decile 5 SA1s" "12" "11" "13" ...
##  $ NA..5                                                                     : chr  "Decile 6 SA1s" "11" "8" "16" ...
##  $ NA..6                                                                     : chr  "Decile 7 SA1s" "13" "5" "16" ...
##  $ NA..7                                                                     : chr  "Decile 8 SA1s" "10" "9" "13" ...
##  $ NA..8                                                                     : chr  "Decile 9 SA1s" "6" "6" "7" ...
##  $ NA..9                                                                     : chr  "Decile 10 SA1s" "2" "0" "0" ...
##  $ Number.of.SA1s.with.a.SEIFA.score                                         : num  NA 121 72 110 6 103 97 41 23 740 ...
##  $ Number.of.SA1s.without.a.SEIFA.score                                      : num  NA 6 1 5 0 6 9 1 0 34 ...

Understand (1)

Please note that minimum requirements 2-4 are spread throughout this assignment and not confined to the section immediately below.

The data sets and variables of interest have been explained in the previous Data section.

Our datasets do contain multiple data types, particularly numerics and characters.

Data conversions will occur. For example: (-) The LGA Codes in the ACARA data will be converted from numeric to character, as they are used as identifiers, and have no numeric value. (-) The SEIFA Decile rankings will be converted from character to numeric, to allow them to order according to number, where needed.

An ordered factor variable will be created and labelled. Specifically, the SEIFA Deciles will be grouped into LowDec, MidDec, and HighDec. This is detailed further down under section ## Understand (2).

# Please note that minimum requirements 2-4 are spread throughout this assignment and not confined to this immediate section. Refer to the comments section above for more detail.

Tidy & Manipulate Data I

Presently, the data is Untidy because:

(i). We do not have a common identifier (i.e. LGA codes) in the School Profiles dataset that will allow us to match up observations with the SEIFA dataset. To correct this, we need to match LGA codes from the School Locations dataset to School Profiles using the identifier ACARA.SML.ID which is common to both.

(ii). There are variables need to be converted. For example LGA codes need to convert from character to numeric. The SEIFA Decile rankings codes need to convert from character to numeric.

(iii). The names for several variables are lengthy. Concise names make for efficient coding.

(iv). We need to join the SEIFA data to the ACARA data using the common identifier LGA_code.

(v). The dataframes contains variables which are superfluous to our analysis. They need to be removed.

# This is the R chunk for the Tidy & Manipulate Data I 

# JOIN ACARA and LGA_Sch using ACARA.SML.ID as the key
ACARA_LGA <- ACARA %>% left_join(LGA_Sch, by = "ACARA.SML.ID")

# Provide the structure of ACARA_LGA
str(ACARA_LGA)

## 'data.frame':    9562 obs. of  56 variables:
##  $ Calendar.Year.x                           : num  2019 2019 2019 2019 2019 ...
##  $ ACARA.SML.ID                              : num  40000 40001 40002 40003 40004 ...
##  $ AGE.ID.x                                  : num  3 4 5 7 9 10 11 12 13 18 ...
##  $ School.Name.x                             : chr  "Corpus Christi Catholic School" "Fahan School" "Geneva Christian College" "Holy Rosary Catholic School" ...
##  $ Suburb.x                                  : chr  "Bellerive" "Sandy Bay" "Latrobe" "Claremont" ...
##  $ State.x                                   : chr  "TAS" "TAS" "TAS" "TAS" ...
##  $ Postcode.x                                : chr  "7018" "7005" "7307" "7011" ...
##  $ School.Sector.x                           : chr  "Catholic" "Independent" "Independent" "Catholic" ...
##  $ School.Type.x                             : chr  "Primary" "Combined" "Combined" "Primary" ...
##  $ Campus.Type.x                             : chr  "School Single Entity" "School Single Entity" "School Single Entity" "School Single Entity" ...
##  $ Rolled.Reporting.Description.x            : chr  "Individual Reporting" "Individual Reporting" "Individual Reporting" "Individual Reporting" ...
##  $ School.URL                                : chr  "http://www.corpuschristi.tas.edu.au" "http://www.fahan.tas.edu.au" "http://www.geneva.tas.edu.au" "http://www.holyrosary.tas.edu.au" ...
##  $ Governing.Body                            : chr  "Tasmanian Catholic Education Office" "Independent Schools Tasmania" "Independent Schools Tasmania" "Tasmanian Catholic Education Office" ...
##  $ Governing.Body.URL                        : chr  "http://www.catholic.tas.edu.au" "http://www.independentschools.tas.edu.au" "http://www.independentschools.tas.edu.au" "http://www.catholic.tas.edu.au" ...
##  $ Year.Range                                : chr  "Prep-6" "Prep-12" "Prep-12" "Prep-6" ...
##  $ Geolocation                               : chr  "Inner Regional" "Inner Regional" "Outer Regional" "Inner Regional" ...
##  $ ICSEA                                     : num  1061 1136 995 997 992 ...
##  $ ICSEA.Percentile                          : num  73 93 44 44 42 71 47 84 8 55 ...
##  $ Bottom.SEA.Quarter....                    : num  11 3 29 27 32 11 26 6 55 21 ...
##  $ Lower.Middle.SEA.Quarter....              : num  29 10 33 39 28 29 38 17 29 31 ...
##  $ Upper.Middle.SEA.Quarter....              : num  33 24 27 25 23 38 23 32 12 28 ...
##  $ Top.SEA.Quarter....                       : num  28 63 11 9 16 23 12 46 4 20 ...
##  $ Teaching.Staff                            : num  27 41 23 29 15 13 18 94 30 29 ...
##  $ Full.Time.Equivalent.Teaching.Staff       : num  21.8 36.3 17 25.1 12.4 10 12.4 82.1 24.6 23 ...
##  $ Non.Teaching.Staff                        : num  19 29 42 21 15 13 16 66 21 31 ...
##  $ Full.Time.Equivalent.Non.Teaching.Staff   : num  11.4 19.3 24.2 11.3 7.1 4.9 5.9 49.9 11.8 14.4 ...
##  $ Total.Enrolments                          : num  395 404 235 392 202 105 189 753 260 411 ...
##  $ Girls.Enrolments                          : num  185 404 110 179 110 56 99 369 124 201 ...
##  $ Boys.Enrolments                           : num  210 0 125 213 92 49 90 384 136 210 ...
##  $ Full.Time.Equivalent.Enrolments           : num  395 404 235 392 202 ...
##  $ Indigenous.Enrolments....                 : num  2 1 6 6 11 0 6 3 37 9 ...
##  $ Language.Background.Other.Than.English....: num  5 10 3 1 14 1 4 9 0 4 ...
##  $ Calendar.Year.y                           : num  2019 2019 2019 2019 2019 ...
##  $ AGE.ID.y                                  : num  3 4 5 7 9 10 11 12 13 18 ...
##  $ School.Name.y                             : chr  "Corpus Christi Catholic School" "Fahan School" "Geneva Christian College" "Holy Rosary Catholic School" ...
##  $ Suburb.y                                  : chr  "BELLERIVE" "SANDY BAY" "LATROBE" "CLAREMONT" ...
##  $ State.y                                   : chr  "TAS" "TAS" "TAS" "TAS" ...
##  $ Postcode.y                                : chr  "7018" "7005" "7307" "7011" ...
##  $ School.Sector.y                           : chr  "Catholic" "Independent" "Independent" "Catholic" ...
##  $ School.Type.y                             : chr  "Primary" "Combined" "Combined" "Primary" ...
##  $ Campus.Type.y                             : chr  "School Single Entity" "School Single Entity" "School Single Entity" "School Single Entity" ...
##  $ Rolled.Reporting.Description.y            : chr  "Individual Reporting" "Individual Reporting" "Individual Reporting" "Individual Reporting" ...
##  $ Latitude                                  : num  -42.9 -42.9 -41.2 -42.8 -42.9 ...
##  $ Longitude                                 : num  147 147 146 147 147 ...
##  $ ABS.Remoteness.Area                       : num  1 1 2 1 1 1 1 1 2 1 ...
##  $ ABS.Remoteness.Area.Name                  : chr  "Inner Regional" "Inner Regional" "Outer Regional" "Inner Regional" ...
##  $ Meshblock                                 : num  6.00e+10 6.01e+10 6.01e+10 6.00e+10 6.01e+10 ...
##  $ Statistical.Area.1                        : num  6.01e+10 6.01e+10 6.04e+10 6.01e+10 6.01e+10 ...
##  $ Statistical.Area.2                        : num  6.01e+08 6.01e+08 6.04e+08 6.01e+08 6.01e+08 ...
##  $ Statistical.Area.2.Name                   : chr  "Bellerive - Rosny" "Sandy Bay" "Latrobe" "Claremont (Tas.)" ...
##  $ Statistical.Area.3                        : num  60102 60105 60402 60103 60105 ...
##  $ Statistical.Area.3.Name                   : chr  "Hobart - North East" "Hobart Inner" "Devonport" "Hobart - North West" ...
##  $ Statistical.Area.4                        : num  601 601 604 601 601 602 602 602 603 604 ...
##  $ Statistical.Area.4.Name                   : chr  "Hobart" "Hobart" "West and North West" "Hobart" ...
##  $ Local.Government.Area                     : num  61410 62810 63810 62610 62810 ...
##  $ Local.Government.Area.Name                : chr  "Clarence (C)" "Hobart (C)" "Latrobe (M) (Tas.)" "Glenorchy (C)" ...

# Remove unwanted columns
ACARA_14 = select(ACARA_LGA, -1, -3, -5,-6, -10:-14, -18:-23, -25:-29, -33:-54)

# Create ACARA14rn where we rename columns from ACARA_14
ACARA14rn <- ACARA_14 %>% rename(
School = School.Name.x,
Postcode = Postcode.x,
Sector = School.Sector.x,
Type = School.Type.x,
Years = Year.Range,
Region = Geolocation,
Teachers = Full.Time.Equivalent.Teaching.Staff,
Students = Full.Time.Equivalent.Enrolments,
Indigenous = Indigenous.Enrolments....,
LBOTE = Language.Background.Other.Than.English....,
LGA = Local.Government.Area,
LGA_Name = Local.Government.Area.Name
)         

# Check the structure of ACARA14rn
str(ACARA14rn)

## 'data.frame':    9562 obs. of  14 variables:
##  $ ACARA.SML.ID: num  40000 40001 40002 40003 40004 ...
##  $ School      : chr  "Corpus Christi Catholic School" "Fahan School" "Geneva Christian College" "Holy Rosary Catholic School" ...
##  $ Postcode    : chr  "7018" "7005" "7307" "7011" ...
##  $ Sector      : chr  "Catholic" "Independent" "Independent" "Catholic" ...
##  $ Type        : chr  "Primary" "Combined" "Combined" "Primary" ...
##  $ Years       : chr  "Prep-6" "Prep-12" "Prep-12" "Prep-6" ...
##  $ Region      : chr  "Inner Regional" "Inner Regional" "Outer Regional" "Inner Regional" ...
##  $ ICSEA       : num  1061 1136 995 997 992 ...
##  $ Teachers    : num  21.8 36.3 17 25.1 12.4 10 12.4 82.1 24.6 23 ...
##  $ Students    : num  395 404 235 392 202 ...
##  $ Indigenous  : num  2 1 6 6 11 0 6 3 37 9 ...
##  $ LBOTE       : num  5 10 3 1 14 1 4 9 0 4 ...
##  $ LGA         : num  61410 62810 63810 62610 62810 ...
##  $ LGA_Name    : chr  "Clarence (C)" "Hobart (C)" "Latrobe (M) (Tas.)" "Glenorchy (C)" ...

# Convert ACARA14rn$LGA into character
ACARALGAchar <- as.character(ACARA14rn$LGA)

# Bind ACARALGAchar to ACARA14rn
ACARALGAc <- cbind(ACARA14rn, ACARALGAchar)

# Drop the 13th column
ACARAm14  = select(ACARALGAc, -13)

# Rename the final column
ACARALGA14 <- ACARAm14 %>% rename(
LGA = ACARALGAchar
)

# Remove unwanted columns from SEIFA
SEIFA3 = select(SEIFA, -4:-16)

# Create SEIFA3rn where we rename columns from SEIFA3
SEIFA3rn <- SEIFA3 %>% rename(
LGA = X2016.Local.Government.Area..LGA..Code,
LGA_Name = X2016.Local.Government.Area..LGA..Name.,
SEIFA_Dec = Decile.ranking.of.the.LGA.within.Australia
)  

# Check structure of SEIFA3rn
str(SEIFA3rn)

## 'data.frame':    2186 obs. of  3 variables:
##  $ LGA      : chr  NA "10050" "10130" "10250" ...
##  $ LGA_Name : chr  NA "Albury (C)" "Armidale Regional (A)" "Ballina (A)" ...
##  $ SEIFA_Dec: chr  NA "5" "7" "8" ...

# Convert SEIFA3rn$SEIFA_Dec from character to numeric
SEIFADec_num <- as.numeric(SEIFA3rn$SEIFA_Dec)

## Warning: NAs introduced by coercion

# Bind SEIFADec_num to SEIFA3rn
SEIFADecnum <- cbind(SEIFA3rn, SEIFADec_num)

# Drop the 3rd column
SEIFADec  = select(SEIFADecnum, -3)

# Join SEIFADec to ACARALGA14
ACARASEIFA1 <- ACARALGA14 %>% left_join(SEIFADec, by = "LGA")

# Check the structure of ACARASEIFA1 
str(ACARASEIFA1)

## 'data.frame':    9562 obs. of  16 variables:
##  $ ACARA.SML.ID: num  40000 40001 40002 40003 40004 ...
##  $ School      : chr  "Corpus Christi Catholic School" "Fahan School" "Geneva Christian College" "Holy Rosary Catholic School" ...
##  $ Postcode    : chr  "7018" "7005" "7307" "7011" ...
##  $ Sector      : chr  "Catholic" "Independent" "Independent" "Catholic" ...
##  $ Type        : chr  "Primary" "Combined" "Combined" "Primary" ...
##  $ Years       : chr  "Prep-6" "Prep-12" "Prep-12" "Prep-6" ...
##  $ Region      : chr  "Inner Regional" "Inner Regional" "Outer Regional" "Inner Regional" ...
##  $ ICSEA       : num  1061 1136 995 997 992 ...
##  $ Teachers    : num  21.8 36.3 17 25.1 12.4 10 12.4 82.1 24.6 23 ...
##  $ Students    : num  395 404 235 392 202 ...
##  $ Indigenous  : num  2 1 6 6 11 0 6 3 37 9 ...
##  $ LBOTE       : num  5 10 3 1 14 1 4 9 0 4 ...
##  $ LGA_Name.x  : chr  "Clarence (C)" "Hobart (C)" "Latrobe (M) (Tas.)" "Glenorchy (C)" ...
##  $ LGA         : chr  "61410" "62810" "63810" "62610" ...
##  $ LGA_Name.y  : chr  "Clarence (C)" "Hobart (C)" "Latrobe (M) (Tas.)" "Glenorchy (C)" ...
##  $ SEIFADec_num: num  7 9 4 2 9 3 3 3 4 2 ...

# Drop the 15th column because it is a duplicate
ACARASEIFA15 = select(ACARASEIFA1, -15)

Tidy & Manipulate Data II

We will create the new variable ST_ratio, which will be a Student/Teacher ratio derived from the relevant variables in the ACARA15 dataset.

We will then use cbind the new variable to the ACARA15 dataset. As the cbind is immediate, it should be a lossless bind, with the rows matching precisely.

# Create Student to Teacher Ratio (ST_ratio) vector
ST_ratio <- ACARASEIFA15$Students / ACARASEIFA15$Teachers

# Bind ST_ratio to ACARASEIFA15
ACARASEIFA16 <- cbind(ACARASEIFA15, ST_ratio)

Scan I

ACARASEIFA16 is our main dataframe with several variables of interest (e.g. Region, Indigenous, LBOTE, ST_ratio, SEIFA_Dec) which we can use for future analyses.

Now, we will setup a more focused dataframe to look at Student-Teacher ratios (ST_ratios) across SEIFA Deciles and Local Government Areas (LGAs).

In turn, this will be further confined to Major Cities only, so as to avoid any unique biases (for now) that may occur due to geographical areas (such as Remote areas).

We will scan for missing values. We will find that less than 5% are missing values, and so we can simply eliminate these observations from our dataframe.

Understand (2) - Additional Comments

With our missing values taken care of, we can then create SEIFA groupings and an ordered factor according to Low Decile, Medium Decile, and High Deciles.

# Setup a more focused dataframe to look at Student-Teacher ratios (ST_ratios) across SEIFA Deciles and Local Government Areas (LGAs)
ACARASEIFA14 = select(ACARASEIFA16, -11:-12) 

# Let us focus on Student-Teacher ratios in Major Cities only for now. Regional areas would be analysed separately.  Filter the schools in Major Cities.
School_Cities <- ACARASEIFA14 %>% filter(Region == "Major Cities")  

# Identify sum of missing values across variables
colSums(is.na(School_Cities))

## ACARA.SML.ID       School     Postcode       Sector         Type        Years 
##            0            0            0            0            0            0 
##       Region        ICSEA     Teachers     Students   LGA_Name.x          LGA 
##            0          119            0           26            0            0 
## SEIFADec_num     ST_ratio 
##           41           26

# Remove all the NA entries as they are < 5% of observations (160 / 5159 obs = 3%) 
SchoolCities_clean <- School_Cities[complete.cases(School_Cities),]

# Filter all the Low SEIFA Deciles (<5) and save as "LowDec" 
LowDec <- SchoolCities_clean %>% filter(SEIFADec_num <5)

# Filter all the Middle SEIFA Deciles (5 to 7) and save as "MidDec" 
MidDec <- SchoolCities_clean %>% filter(SEIFADec_num >4 & SEIFADec_num <8)

# Filter all the High SEIFA Deciles (8 to 10) and save as "HighDec"
HighDec <- SchoolCities_clean %>% filter(SEIFADec_num >7)

#  Create an ordered factor using the SEIFA Deciles
SEIFADec_factor <- factor(c("LowDec", "MidDec", "HighDec"), 
levels=c("LowDec", "MidDec", "HighDec"),
labels=c("LowDec", "MidDec", "HighDec"))

levels(SEIFADec_factor)

## [1] "LowDec"  "MidDec"  "HighDec"

class(SEIFADec_factor)

## [1] "factor"

Scan II

Using a boxplot, we will detect the outliers for Student-to-Teacher ratios across LowDec, MidDec, HighDec.

We will then define a function to cap the outliers using Tukey’s method. As the data is nonparametric, the cap can apply consistently across LowDec, MidDec, HighDec, and it will help remove outliers outside the defined cap of 1.5 x IQR.

We will then re-apply the boxplot on the capped data.

# Detect outliers in LowDec, MidDec, HighDec
boxplot(LowDec$ST_ratio, MidDec$ST_ratio, HighDec$ST_ratio, 
main = "Schools in Major Aus Cities: Student-Teacher Ratio", 
names=c("Low","Medium","High"), 
xlab = "Socio-Economic Area",
col = "orange")

# Define a  function to cap the outliers (Tukey's method).  
cap <-function(x){ 
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) ) 
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1] 
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4] 
x 
}

# Apply the cap (Tukey's method) to ST_ratios for LowDec, MidDec, HighDec
LDST_capped <- LowDec$ST_ratio %>% cap()
MDST_capped <- MidDec$ST_ratio %>% cap()
HDST_capped <- HighDec$ST_ratio %>% cap()

# Compare ST_ratios across capped values for LowDec, MidDec, HighDec 
boxplot(LDST_capped, MDST_capped, HDST_capped, 
main = "Schools in Major Aus Cities: Student-Teacher Ratio", 
names=c("Low","Medium","High"), 
xlab = "Socio-Economic Area",
ylab = "Class size - Students",
col = "skyblue")

Transform

We will check the frequency distribution for a single variable (HDST_capped) using a histogram.

We will find that HDST_scapped is left-skewed. We will apply the Box_Cox transformation to normalise the distribution. This is a commonly accepted method to transform non-normal data into a normal distribution.

The Box_Cox transformation will successfully skew distributions for HDST_capped into a more symmetric arrangement, to be saved as boxcox_HDSTr.

We have not included the raw data for the Box_Cox transformation as it is 9 pages and will take us over the 25-page word limit.

This same process can be applied to analyse frequency distributions for MDST_capped and LDST_capped, and create Box-Cox symmetric distributions as needed.

By converting the data into normal distribution frequency, further analysis can be applied such as parametric hypothesis testing, linear regression, and in other data sets, time-series analysis.

# Check the frequency distribution for HDST_capped using a histogram
hist(HDST_capped)

# Apply the Box_Cox transformation to normalise the distribution.
boxcox_HDSTr <- BoxCox(HDST_capped, lambda = "auto")
hist(boxcox_HDSTr)

References

Australian Bureau of Statistics (2020). 2033.0.55.001 Socio-Economic Indexes for Areas (SEIFA), 2016, Local Government Area, SA1 Distributions, SEIFA 2016. Located at https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/2033.0.55.0012016 Accessed on 17/10/2020.
Australian Curriculum Assessment and Reporting Authority (2020). ACARA Data Access Program: School Profile 2019. Located at www.acara.edu.au/contact-us/acara-data-access. Accessed on 17/10/2020.
Australian Curriculum Assessment and Reporting Authority (2020). ACARA Data Access Program: School location 2019. Located at www.acara.edu.au/contact-us/acara-data-access. Accessed on 17/10/2020.
Dolgun, A (2020). Week 2 Demonstration - Get: Importing, Scraping and Exporting Data with R. Delivered on 28/07/2020 by Dr Sona Taheri as part the Lecture Series for MATH2349: Data Wrangling, at RMIT University Melbourne, Australia.
Dolgun, A (2020). Week 3 Demonstration – Understand: Data Types and Data Structures. Delivered on 04/08/2020 by Dr Sona Taheri as part the Lecture Series for MATH2349: Data Wrangling, at RMIT University Melbourne, Australia.
Dolgun, A (2020). Week 4 Demonstration – Tidy & Manipulate: Part I - Tidy. Delivered on 11/08/2020 by Dr Sona Taheri as part the Lecture Series for MATH2349: Data Wrangling, at RMIT University Melbourne, Australia.
Dolgun, A (2020). Week 5 Demonstration – Tidy & Manipulate: Part II - Manipulate. Delivered on 18/08/2020 by Dr Sona Taheri as part the Lecture Series for MATH2349: Data Wrangling, at RMIT University Melbourne, Australia.
Dolgun, A (2020). Week 6 Demonstration – Scan: Missing Values. Delivered on 25/08/2020 by Dr Sona Taheri as part the Lecture Series for MATH2349: Data Wrangling, at RMIT University Melbourne, Australia.
Dolgun, A (2020). Week 8 Demonstration – Scan: Outliers. Delivered on 15/09/2020 by Dr Sona Taheri as part the Lecture Series for MATH2349: Data Wrangling, at RMIT University Melbourne, Australia.
Dolgun, A (2020). Week 9 Demonstration – Transform: Data Transformation, Standardisation, and Reduction. Delivered on 22/09/2020 by Dr Sona Taheri as part the Lecture Series for MATH2349: Data Wrangling, at RMIT University Melbourne, Australia.