For this analysis I have used the following libraries, they are loaded silently message = FALSE with no console output. The output from the library loading indicates masking and conflicts, choosing not to display this improve the readability of the report.
# This is the R chunk for the required packages
library(readxl) #from module 2
library(rvest) #from module 2
library(dplyr) #from module 4
library(tidyr) #from module 4
library(stringr) #from module 8
library(knitr) #from r bootcamp 4 in applied analytics
library(outliers) #from module 6
To explore the connection between student population, age of university and equity group (focusing on students with disabilities) participation. This report details the steps involved in pre-processing this data (Figure 1).
The location and age of Australian universities was extracted from Wikipedia (2020). The Department of Education, Skills and Employment published Selected Higher Education Statistics - 2019 Student Data in 2020. This dataset includes information on equity group participation in higher education. The combined dataset allows for the exploration of different equity groups in Australian higher education with respect to age and state.
As the data from the Department of Education, Skills and Employment was untidy, it was cleaned, joined to location and age data and then examined for outliers using box plots and a Z-score transformation. The percent of student population with a reported disability was calculated and graphed according to university age.
University location data was scraped using rvest from Wikipedia (2020) - Dataset 1.
Data on students who belong to equity groups was retrieved from the Australian Government Department of Education, Skills and Employment and Training (2020) and imported using read_excel - Dataset 2.
This data is untidy, in that there are multiple observations in some cells (i.e. campus contains a list) but for the information I require from this table it tidy once the unnecessary columns have been removed. The list of campus locations has not been expanded to allow it to be tidy, as this information is not required for the analysis.
“University” - Accredited universities in Australia “Type” - Factor with three levels indicating type of University (Public, Private(Australian), Private (International))
“Campus[4]” - List containing location of university campuses
“State/Territory[4]” - Home state (base of major campus) “Established” - Year University was established “University status” - Year University was granted University Status “Times rank[5]” - Ranking or band for 2016 according to Times Higher Education - World University Rankings “ARWU rank[6]” - Ranking or band for 2016 according to Shanghai Ranking - ARWU University Rankings “QS rank[7]” - Ranking or band for 2020 according to QS World University Rankings - Top Universities “CWTS rank[note 1]” - Ranking or band for 2016 according to Centre for Science and Technology Studies - CWRS Leiden Ranking
#Dataset 1 - University locations
#library(rvest)
unis <- read_html("https://en.wikipedia.org/wiki/List_of_universities_in_Australia")
all_tables <- html_nodes(unis,"table")
uni_state1 <- html_table(all_tables[[2]], fill=TRUE)
kable(head(uni_state1), caption = "Example (first 6 rows) of University location data (Dataset 1)")
| University | Type | Campus[4] | State/Territory[4] | Established | University status | Times rank[5] | ARWU rank[6] | QS rank[7] | US News rank[8] | CWTS rank[note 1] |
|---|---|---|---|---|---|---|---|---|---|---|
| Australian Catholic University | Public | Sydney, Brisbane, Canberra, Ballarat, Melbourne | National | 1991 | 1991 | 251-300 | 501-600 | 801-1000 | 624= | 1000+ |
| Australian National University | Public | Canberra | ACT | 1946 | 1946 | 59 | 67 | 31= | 69= | 209 |
| Bond University | Private (Australian) | Gold Coast | QLD | 1987 | 1987 | 501-600 | 901-1000 | 414 | 1000+ | N/A |
| Carnegie Mellon University (Australia campus) | Private (International) | Adelaide | SA | 2006 | 2006 | - | - | - | - | - |
| Central Queensland University | Public | Rockhampton, Mackay, Brisbane, Sydney, Perth, Townsville, Melbourne, Adelaide, Bundaberg, Gladstone, Noosa | QLD | 1967 | 1992 | 801-1000 | N/A | 601–650 | 936= | N/A |
| Charles Darwin University | Public | Darwin | NT | 1989 | 2003 | 501-600 | N/A | 551–600 | 841= | N/A |
This data was downloaded from the Department of Education, Skills and Employment website on 12th October, 2020. As the dataset is an Excel document, it has been imported using read_excel skipping the first three rows of data.
This dataset is information about Australian student commencing higher education in 2019 presented by state, institution and recognised equity groups. This data is untidy because the first column “state/institution” contains both state data and institution name. The presence of column totals and interpretation notes also makes this data untidy on first import. Beyond this the data is tidy in that there is one row per observation, each variable has its own column.
“State/Institution” - Name of Australian University
“Students from a Non English speaking background” - Number of students from a Non English speaking background “Students with a disability” - Number of students with a disability “Women in Non-Traditional Area” - Number of Women who are studying in non-Traditional areas “Indigenous” - Number of students who identify of Indigenous of of Torres Strait Islander “Low SES postcode measure(b)” - Number of students who are from a Low SES area, when using Post Code “Low SES by SA1(b,c)” - Number of students who are from a low SES area when using Statistical Area 1 boundaries “First Address Low SES by SA1(b,c,d)” - Number of students with a first address classified as Low SES either by postcode or SA1 boundaries “Regional(e)” - Number of students from Regional as determined by ASGS based on students permanent home address “First Address Regional(d,e)” - Number of students from Regional as determined by ASGS based on students first address “Remote(e)” - Number of students from Remote as determined by ASGS based on students permanent home address “First Address Remote(d,e)” - Number of students from Remote as determined by ASGS based on students first address “All Commencing Domestic Students” - Total number of students who commenced in 2019 “…14” - empty column “Total 2018” - Total number of students who commenced in 2018
#Dataset 2 - Student Equity data
equity_19 <- read_excel("../Sheryl/2019_section_11_-_equity_groups.xls",
sheet = "3", skip = 3)
## New names:
## * `` -> ...14
kable(head (equity_19), caption = "Example (first 6 rows) of the Student Equity data (Dataset 2)")
| State/Institution | Students from a Non English speaking background | Students with a disability | Women in Non-Traditional Area | Indigenous | Low SES postcode measure(b) | Low SES by SA1(b,c) | First Address Low SES by SA1(b,c,d) | Regional(e) | First Address Regional(d,e) | Remote(e) | First Address Remote(d,e) | All Commencing Domestic Students | …14 | Total 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| New South Wales | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Charles Sturt University | 166 | 505 | 1312 | 641 | 3318 | 3285 | 3329 | 5868 | 5706 | 169 | 177 | 14015 | NA | 13529 |
| Macquarie University | 478 | 829 | 2292 | 139 | 1055 | 1048 | 1085 | 532 | 640 | 26 | 32 | 11304 | NA | 10896 |
| Southern Cross University | 87 | 489 | 615 | 270 | 1376 | 1393 | 1391 | 2619 | 2560 | 65 | 55 | 6015 | NA | 5852 |
| The University of New England | 121 | 1212 | 1526 | 469 | 2364 | 2142 | 2164 | 3728 | 3560 | 174 | 174 | 9410 | NA | 8578 |
| The University of Newcastle | 199 | 889 | 1142 | 535 | 2921 | 2751 | 2809 | 2357 | 2627 | 46 | 54 | 11526 | NA | 11542 |
#Rename the column with University names to facilitate joining.
equity_19 <-rename(equity_19, name = `State/Institution`)
This data was scraped from Wikipedia, using the rvest library. The imported data was checked and the data types corrected. All data was processed, even data which was removed prior to final analysis. Base R functions, were used to correct the data types were preserved.
To confirm the data was imported correctly, and stored in the correct formats I used nrow and str in combination the with sample data print from above. The data appears to have correctly formatted, with no obvious errors.
# This is the R chunk for the Understand Section
#identify variables
nrow(uni_state1)
## [1] 42
str(uni_state1)
## 'data.frame': 42 obs. of 11 variables:
## $ University : chr "Australian Catholic University" "Australian National University" "Bond University" "Carnegie Mellon University (Australia campus)" ...
## $ Type : chr "Public" "Public" "Private (Australian)" "Private (International)" ...
## $ Campus[4] : chr "Sydney, Brisbane, Canberra, Ballarat, Melbourne" "Canberra" "Gold Coast" "Adelaide" ...
## $ State/Territory[4]: chr "National" "ACT" "QLD" "SA" ...
## $ Established : int 1991 1946 1987 2006 1967 1989 1948 1966 1974 1902 ...
## $ University status : int 1991 1946 1987 2006 1992 2003 1989 1986 1974 1991 ...
## $ Times rank[5] : chr "251-300" "59" "501-600" "-" ...
## $ ARWU rank[6] : chr "501-600" "67" "901-1000" "-" ...
## $ QS rank[7] : chr "801-1000" "31=" "414" "-" ...
## $ US News rank[8] : chr "624=" "69=" "1000+" "-" ...
## $ CWTS rank[note 1] : chr "1000+" "209" "N/A" "-" ...
As the data had the desired information including expected number of observations and columns factor was used to create relevant factors for the Type, State/Territory, and the various University Ranking columns. For the University ranking columns, unique was used as a comprimise. Ideally the rankings would be presented as ordered factors, but due to the dynamic nature of the datasource, manually coding the factors will likely result in the introduction of errors. To demonstrate the consequence of manually coding the data, one example of a hand coded ordered factors has been retained and compared the with the derived factor creation using set_diff. The manual input, was accurate for the data source on October 5th 2020, but by the 17th October this code introduced NAs into the analysis, at which point the more generic, but unordered approach was implemented.
To ensure the dataset contained an ordered factor, a new variable - Age was created based on the year an institute was granted university status, by the Australian Government.
#there are three types of universities within Australia, as such these are presented as factors, there is no order to these categories.
uni_state1$Type <- factor (uni_state1$Type,
levels = c ("Public", "Private (Australian)", "Private (International)"),
labels = c ("Public", "Private (Au)", "Private (Int)"))
#states should be presented as a factor. It is not necessary to order the states factor as presenting it alphabetically is appropriate.
uni_state1$location <- factor (uni_state1$`State/Territory[4]`,
levels = c ("National", "ACT", "QLD", "SA" ,"NT", "NSW", "WA", "VIC", "National, New Zealand", "TAS" ),
labels = c("AU", "ACT" , "QLD", "SA", "NT", "NSW" ,"WA" , "VIC", "AU/NZ", "TAS" ))
#converting the ranking data to ordered factors, this data is not currently relevant for the analysis, but has been conducted to allow the dataset to be re-used in the future.
#As the data is being scraped, the opportunity for data to be modified, between preparing the code and the final run of the analysis is high, particularly with the annual ranking data.
#example of manually processing ARWU ranking column - but this is risky as the dataset is 'live' and scraped for each analysis.
uni_state1$`ARWU rank Oct 5`<- factor(uni_state1$`ARWU rank[6]`,
levels = c("40","55","77=","79=","82","96=","101-150","201-300","301-400","401-500", "701-800","N/A","-"),
labels = c("40","55","77=","79=","82","96=","101-150","201-300","301-400","401-500","701-800","unranked", "unranked"),
ordered = TRUE )
#ARWU rank as an unordered factor
ARWU <- unique(uni_state1$`ARWU rank[6]`)
uni_state1$`ARWU rank[6]` <- factor(uni_state1$`ARWU rank[6]`, levels = ARWU, labels = ARWU)
#What is the current different between deriving the factors and manually setting the factors?
setdiff( uni_state1$`ARWU rank[6]`, uni_state1$`ARWU rank Oct 5`)
## [1] "501-600" "67" "901-1000" "-" "N/A" "601-700"
## [7] "85" "151-200" "35" "74" "301–400" "54"
## [13] "801-900" "74=" "201–300"
#based on this substantial difference (within 1 week), accurate data is more important than ordered data, thus the ranks are imported as unordered factors.
# Times rank an unordered factor
Times<- unique(uni_state1$`Times rank[5]`)
uni_state1$`Times rank[5]` <- factor (uni_state1$`Times rank[5]`, levels = Times, labels = Times)
# QS rank as an unordered factor
Qs<- unique(uni_state1$`QS rank[7]`)
uni_state1$`QS rank[7]` <- factor (uni_state1$`QS rank[7]`, levels = Qs, labels = Qs)
#Create a new factored variable grouping universities based on age, and order the factor.
uni_state1$age <- cut(uni_state1$`University status`, 3, labels = c("Older", "Middle", "Newer"), ordered_result = TRUE)
# confirm structure
str(uni_state1)
## 'data.frame': 42 obs. of 14 variables:
## $ University : chr "Australian Catholic University" "Australian National University" "Bond University" "Carnegie Mellon University (Australia campus)" ...
## $ Type : Factor w/ 3 levels "Public","Private (Au)",..: 1 1 2 3 1 1 1 1 1 1 ...
## $ Campus[4] : chr "Sydney, Brisbane, Canberra, Ballarat, Melbourne" "Canberra" "Gold Coast" "Adelaide" ...
## $ State/Territory[4]: chr "National" "ACT" "QLD" "SA" ...
## $ Established : int 1991 1946 1987 2006 1967 1989 1948 1966 1974 1902 ...
## $ University status : int 1991 1946 1987 2006 1992 2003 1989 1986 1974 1991 ...
## $ Times rank[5] : Factor w/ 24 levels "251-300","59",..: 1 2 3 4 5 3 5 6 7 8 ...
## $ ARWU rank[6] : Factor w/ 18 levels "501-600","67",..: 1 2 3 4 5 5 5 6 6 7 ...
## $ QS rank[7] : Factor w/ 37 levels "801-1000","31=",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ US News rank[8] : chr "624=" "69=" "1000+" "-" ...
## $ CWTS rank[note 1] : chr "1000+" "209" "N/A" "-" ...
## $ location : Factor w/ 10 levels "AU","ACT","QLD",..: 1 2 3 4 3 5 6 7 8 7 ...
## $ ARWU rank Oct 5 : Ord.factor w/ 12 levels "40"<"55"<"77="<..: NA NA NA 12 12 12 12 8 8 NA ...
## $ age : Ord.factor w/ 3 levels "Older"<"Middle"<..: 3 2 3 3 3 3 3 3 3 3 ...
Dataset 1 (university location data) now has the appropriate attributes and structures, as confirmed using str.
This data was imported using the read_xls library. The imported data was checked and the data types corrected. All data was processed, even data which will be removed prior to final analysis. Base R functions, were used to correct the data types.
To confirm the data was imported correctly, and stored in the correct formats I used nrow and str.
#check the data
nrow(equity_19)
## [1] 70
str(equity_19)
## tibble [70 x 15] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:70] "New South Wales" "Charles Sturt University" "Macquarie University" "Southern Cross University" ...
## $ Students from a Non English speaking background: num [1:70] NA 166 478 87 121 199 761 711 537 128 ...
## $ Students with a disability : num [1:70] NA 505 829 489 1212 ...
## $ Women in Non-Traditional Area : num [1:70] NA 1312 2292 615 1526 ...
## $ Indigenous : chr [1:70] NA "641" "139" "270" ...
## $ Low SES postcode measure(b) : chr [1:70] NA "3318" "1055" "1376" ...
## $ Low SES by SA1(b,c) : chr [1:70] NA "3285" "1048" "1393" ...
## $ First Address Low SES by SA1(b,c,d) : chr [1:70] NA "3329" "1085" "1391" ...
## $ Regional(e) : chr [1:70] NA "5868" "532" "2619" ...
## $ First Address Regional(d,e) : chr [1:70] NA "5706" "640" "2560" ...
## $ Remote(e) : chr [1:70] NA "169" "26" "65" ...
## $ First Address Remote(d,e) : chr [1:70] NA "177" "32" "55" ...
## $ All Commencing Domestic Students : chr [1:70] NA "14015" "11304" "6015" ...
## $ ...14 : logi [1:70] NA NA NA NA NA NA ...
## $ Total 2018 : chr [1:70] NA "13529" "10896" "5852" ...
Columns 1 to 4 have imported correctly, and have the right attribute types. Column 14 is not required and has been removed using the subset function.
Incomplete rows were removed using not complete cases (!complete.cases) at this point because, when the remaining columns are forced to be numbers, the process will introduce NAs, thus introduce new ‘incomplete cases’. Therefore, removing the incomplete cases at this point, minimised the risk of valuable data being accidentally discarded.
Many of the columns of interest (columns 4-12) had imported as characters, rather than numbers. This was due to the presence of annotations, or the use of <5 to maintain student privacy and prevent identifiable release of information. Using as.numeric coerced any non-numeric data to NA(further explored in the next section) warning messages are not printed for this code block.
#confirm all columns have correct structure
str(equity_19b)
## tibble [50 x 14] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:50] "Charles Sturt University" "Macquarie University" "Southern Cross University" "The University of New England" ...
## $ Students from a Non English speaking background: num [1:50] 166 478 87 121 199 ...
## $ Students with a disability : num [1:50] 505 829 489 1212 889 ...
## $ Women in Non-Traditional Area : num [1:50] 1312 2292 615 1526 1142 ...
## $ Indigenous : num [1:50] 641 139 270 469 535 129 152 132 199 274 ...
## $ Low SES postcode measure(b) : num [1:50] 3318 1055 1376 2364 2921 ...
## $ Low SES by SA1(b,c) : num [1:50] 3285 1048 1393 2142 2751 ...
## $ First Address Low SES by SA1(b,c,d) : num [1:50] 3329 1085 1391 2164 2809 ...
## $ Regional(e) : num [1:50] 5868 532 2619 3728 2357 ...
## $ First Address Regional(d,e) : num [1:50] 5706 640 2560 3560 2627 ...
## $ Remote(e) : num [1:50] 169 26 65 174 46 17 38 17 15 19 ...
## $ First Address Remote(d,e) : num [1:50] 177 32 55 174 54 24 38 16 18 25 ...
## $ All Commencing Domestic Students : num [1:50] 14015 11304 6015 9410 11526 ...
## $ Total 2018 : num [1:50] 13529 10896 5852 8578 11542 ...
The attributes for all columns are now correct. Further data processing can now be completed.
The data extracted from Wikipedia, was mostly tidy. As mentioned earlier the campus column contains as a list, thus strictly this data does not meet the tidy principles. However, the campus column will not be used as part of this analysis thus once that column has been removed, this remaining data set does not require further cleaning as it meets the tidy data principles of each variable in its own column and each observation in its own row (Wickham, 2014).
#remove campus column because it contains a list of data and is not tidy.
uni_state1 <- uni_state1[,-3]
clean_uni <- uni_state1[,c(1:2, 11, 13)]
#confirm data is complete
rowSums(is.na(clean_uni))
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0
colSums(is.na(clean_uni))
## University Type location age
## 0 0 0 0
#rename institute column to help with joining
clean_uni <- clean_uni %>% rename(name = University)
#preview data
kable (head(clean_uni), caption = "University Location and Type data, in tidy format)")
| name | Type | location | age |
|---|---|---|---|
| Australian Catholic University | Public | AU | Newer |
| Australian National University | Public | ACT | Middle |
| Bond University | Private (Au) | QLD | Newer |
| Carnegie Mellon University (Australia campus) | Private (Int) | SA | Newer |
| Central Queensland University | Public | QLD | Newer |
| Charles Darwin University | Public | NT | Newer |
Minimal tidying - specifically the removal of one column and rows with no data or only headings - was performed as part of the initial data import. This was conducted before the type conversion of the rows, as NAs are introduced during the coercion increasing the number of ‘incomplete’ cases, thus increasing the potential of accidentally removing valid data. The use of complete cases allowed effective subsetting of the data, prior to coercion.
To produce a tidy dataframe for the information on student equity groups NAs were identified using is.na across rows and columns. The columns which were converted from characters to numbers are expected to have NAs in the dataframe. Impacted institutes were identified and further investigated using complete.cases and filter.
From the initial scan of all columns and rows, there is a small amount of missing data. When focusing on the two columns of particular interest for this analysis - it appears there are only two rows which have NAs. In an effort to determine if the data is critical for the overall analysis, the raw data for the two identified institutes was visually scanned.
# This is the R chunk for the Tidy & Manipulate Data II
#data is in correct format, determine how many data points are missing or NAs
colSums(is.na(equity_19b))
## name
## 0
## Students from a Non English speaking background
## 0
## Students with a disability
## 0
## Women in Non-Traditional Area
## 0
## Indigenous
## 2
## Low SES postcode measure(b)
## 2
## Low SES by SA1(b,c)
## 2
## First Address Low SES by SA1(b,c,d)
## 2
## Regional(e)
## 2
## First Address Regional(d,e)
## 2
## Remote(e)
## 2
## First Address Remote(d,e)
## 2
## All Commencing Domestic Students
## 2
## Total 2018
## 2
rowSums(is.na(equity_19b))
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 2 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
## [39] 0 0 0 0 8 0 0 0 0 0 0 0
#A small number of missing values, likely to be due to data not supplied occurrences of less than 5 students in an equity group
sum(is.na(equity_19b$`Students with a disability`))
## [1] 0
sum(is.na(equity_19b$`All Commencing Domestic Students`))
## [1] 2
#Find which institutes have NAs
CoercedNas <-equity_19b[!complete.cases(equity_19b),]
kable(head(c(CoercedNas), caption = "List of institutes which have NAs introduce do the coersion"))
|
|
|
|
|
|
# missing enrolment data for 2019 from two institutes - Batchelor Institute in the Northern Territory and Divinity University
#Find what the original input (prior to coercion) was for these institutes.
A<- filter(equity_19, name == "Batchelor Institute of Indigenous Tertiary Education")
B <-filter(equity_19, name == "University of Divinity")
#display relevant data to make judgement on inclusion
A$`All Commencing Domestic Students` #small number remove
## [1] "< 5"
t(B) #large number, but no data available to substitute
## [,1]
## name "University of Divinity"
## Students from a Non English speaking background "16"
## Students with a disability "33"
## Women in Non-Traditional Area "0"
## Indigenous "np"
## Low SES postcode measure(b) "92"
## Low SES by SA1(b,c) "95"
## First Address Low SES by SA1(b,c,d) "106"
## Regional(e) "134"
## First Address Regional(d,e) "140"
## Remote(e) "< 5"
## First Address Remote(d,e) "< 5"
## All Commencing Domestic Students "np"
## ...14 NA
## Total 2018 "np"
#create a new dataframe, removing the two universities with NA's in the total column
equity_19c<- equity_19b %>% filter(name !="Batchelor Institute of Indigenous Tertiary Education" & name !="University of Divinity")
Batchelor institutes had <5 students who met a number of equity metrics, to prevent individual identification of students the specific data is not publicly available. The total data for the prior year (2018) is provided, however this also is less than 5. Less than 5 students represents a tiny fraction of the dataset. As such the Batchelor institute data has been removed prior to further analysis.
The second institute The University of Divinity, has a much higher student population >140 students but has not published their total enrolment figures for 2018 or 2019. Replacement data was not available from other datasets. Based on other data within the dataset, if the ratio of total students to students with a disability holds, this institute would have around 1000 students. A small faction of Domestic students within the Australian higher education sector as such the data from this institute is removed. In total it is estimated less than 5% of observations have been excluded.
The forced coercion of numeric columns earlier using as.numeric, converted any characters to NA. As explored above, there were multiple examples (<5, np) or characters in the numeric field. Using summary the numeric values of the dataset were explored. Initial results revealed very high maximum values more than a factor of 10 above the 3rd quartile readings. This type of result is indicative of a total column being included in the dataset. The presence of such a column was confirmed using stringr::str_detect and then this data was removed using filter. Upon further inspection the data looks better, although specific exploration for outliers of interested data will be required prior to further analysis.
# This is the R chunk for the Scan I
##check the values for obvious errors
summary(equity_19c)
## name Students from a Non English speaking background
## Length:48 Min. : 18.0
## Class :character 1st Qu.: 133.2
## Mode :character Median : 233.5
## Mean : 636.5
## 3rd Qu.: 503.0
## Max. :15283.0
##
## Students with a disability Women in Non-Traditional Area Indigenous
## Min. : 12.0 Min. : 0.0 Min. : 6.00
## 1st Qu.: 344.2 1st Qu.: 609.2 1st Qu.: 96.75
## Median : 556.5 Median : 1141.5 Median : 155.00
## Mean : 1115.4 Mean : 2587.2 Mean : 391.08
## 3rd Qu.: 799.5 3rd Qu.: 2113.0 3rd Qu.: 310.50
## Max. :26786.0 Max. :62095.0 Max. :9389.00
##
## Low SES postcode measure(b) Low SES by SA1(b,c)
## Min. : 141.0 Min. : 128
## 1st Qu.: 873.5 1st Qu.: 961
## Median : 1399.0 Median : 1327
## Mean : 3179.0 Mean : 2904
## 3rd Qu.: 2238.5 3rd Qu.: 2114
## Max. :74798.0 Max. :68336
## NA's :1 NA's :1
## First Address Low SES by SA1(b,c,d) Regional(e)
## Min. : 123.0 Min. : 139.0
## 1st Qu.: 972.5 1st Qu.: 632.5
## Median : 1377.0 Median : 1468.0
## Mean : 3004.5 Mean : 3431.7
## 3rd Qu.: 2222.0 3rd Qu.: 2344.5
## Max. :70694.0 Max. :80728.0
## NA's :1 NA's :1
## First Address Regional(d,e) Remote(e) First Address Remote(d,e)
## Min. : 164 Min. : 5.0 Min. : 5.0
## 1st Qu.: 684 1st Qu.: 24.0 1st Qu.: 25.5
## Median : 1651 Median : 47.0 Median : 55.0
## Mean : 3588 Mean : 167.7 Mean : 173.3
## 3rd Qu.: 2594 3rd Qu.: 144.0 3rd Qu.: 135.5
## Max. :84404 Max. :3944.0 Max. :4074.0
## NA's :1 NA's :1 NA's :1
## All Commencing Domestic Students Total 2018
## Min. : 594 Min. : 726
## 1st Qu.: 4976 1st Qu.: 5034
## Median : 9072 Median : 8642
## Mean : 16828 Mean : 16886
## 3rd Qu.: 12810 3rd Qu.: 13175
## Max. :404148 Max. :405666
##
#the max values seem substantially higher than other observations, it is possible a total column is still in the data set.
equity_19c$name %>% str_to_upper() %>% str_detect( "TOTAL", negate = FALSE)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#remove total from dataset
equity_19d <-filter(equity_19c, name != "TOTAL")
summary(equity_19d)
## name Students from a Non English speaking background
## Length:47 Min. : 18.0
## Class :character 1st Qu.: 131.5
## Mode :character Median : 228.0
## Mean : 324.8
## 3rd Qu.: 499.0
## Max. :1146.0
##
## Students with a disability Women in Non-Traditional Area Indigenous
## Min. : 12.0 Min. : 0.0 Min. : 6.0
## 1st Qu.: 340.5 1st Qu.: 603.5 1st Qu.: 96.5
## Median : 551.0 Median :1141.0 Median :152.0
## Mean : 569.2 Mean :1321.1 Mean :199.6
## 3rd Qu.: 790.5 3rd Qu.:1934.0 3rd Qu.:291.0
## Max. :1329.0 Max. :3454.0 Max. :641.0
##
## Low SES postcode measure(b) Low SES by SA1(b,c)
## Min. : 141.0 Min. : 128.0
## 1st Qu.: 850.2 1st Qu.: 931.5
## Median :1389.0 Median :1276.5
## Mean :1622.1 Mean :1482.1
## 3rd Qu.:2099.8 3rd Qu.:2085.8
## Max. :4425.0 Max. :4192.0
## NA's :1 NA's :1
## First Address Low SES by SA1(b,c,d) Regional(e)
## Min. : 123.0 Min. : 139.0
## 1st Qu.: 942.2 1st Qu.: 631.8
## Median :1334.5 Median :1429.5
## Mean :1533.0 Mean :1751.3
## 3rd Qu.:2206.8 3rd Qu.:2316.5
## Max. :4266.0 Max. :8096.0
## NA's :1 NA's :1
## First Address Regional(d,e) Remote(e) First Address Remote(d,e)
## Min. : 164 Min. : 5.00 Min. : 5.00
## 1st Qu.: 683 1st Qu.: 23.50 1st Qu.: 25.25
## Median :1592 Median : 47.00 Median : 54.50
## Mean :1831 Mean : 85.61 Mean : 88.52
## 3rd Qu.:2460 3rd Qu.:122.00 3rd Qu.:128.50
## Max. :7550 Max. :433.00 Max. :381.00
## NA's :1 NA's :1 NA's :1
## All Commencing Domestic Students Total 2018
## Min. : 594 Min. : 726
## 1st Qu.: 4892 1st Qu.: 4962
## Median : 8733 Median : 8578
## Mean : 8587 Mean : 8614
## 3rd Qu.:12582 3rd Qu.:12651
## Max. :18154 Max. :18058
##
#Overall values seems reasonable, investigate specific variables of interest in more data after joining dataset.
clean_equity <- equity_19d %>% select (name, `Students with a disability`,`All Commencing Domestic Students`)
kable(head(clean_equity), caption = "Example of data contained in the cleaned equity dataset")
| name | Students with a disability | All Commencing Domestic Students |
|---|---|---|
| Charles Sturt University | 505 | 14015 |
| Macquarie University | 829 | 11304 |
| Southern Cross University | 489 | 6015 |
| The University of New England | 1212 | 9410 |
| The University of Newcastle | 889 | 11526 |
| The University of Sydney | 789 | 12367 |
With the removal of obvious errors, the two datasets have now been combined to allow a more detailed exploration of the columns of interest. The datasets were joined though an inner join using institute name. Matching on a string can lead to unexpected result, thus visual validation of alignment after each processing step was critical. The dataset was visually inspected to confirm alignment and accurate matching of columns. An example output of the first attempt at a join (using left_join) is provided to demonstrate the misalignment in datasets. After manipulating strings to remove white space, tabs and “The” from the names fields of both datasets, the quality of the join was considered acceptable. A sample of the fully aligned data is also provided as confirmation of the data cleaning undertaken.
Confirming the accuracy of the join the data which hasn’t been included in the new dataset was inspected. The anti_join function allowed this dataset to be created (although it does not contain a full list of column, as there is no data for these institutes in Dataset 1). The data contained in this table is equity data from unmatched institutes, upon visual inspected this data represents non-university providers and aggregate state based information.
#join tables
equity_state <- left_join(clean_equity, clean_uni, by = "name")
kable(head(equity_state), caption = "Example of data contained in the joined dataset")
| name | Students with a disability | All Commencing Domestic Students | Type | location | age |
|---|---|---|---|---|---|
| Charles Sturt University | 505 | 14015 | Public | NSW | Newer |
| Macquarie University | 829 | 11304 | Public | NSW | Newer |
| Southern Cross University | 489 | 6015 | Public | NSW | Newer |
| The University of New England | 1212 | 9410 | NA | NA | NA |
| The University of Newcastle | 889 | 11526 | Public | NSW | Newer |
| The University of Sydney | 789 | 12367 | NA | NA | NA |
#terrible alignment, there are many rows with NAs - try and clean up white space and non-printing characters to improve accuracy of name recognition.
#use base R to remove white spaces from both ends.
clean_equity$name <- trimws(clean_equity$name)
clean_uni$name <- trimws(clean_uni$name)
#use stringer to remove white spaces
clean_equity$name <- str_trim(clean_equity$name, side="both")
clean_uni$name <- str_trim(clean_uni$name, side="both")
#specifically remove tab characters
clean_equity$name <- str_replace_all(clean_equity$name, pattern="\t", replacement = "" )
clean_uni$name <- str_replace_all(clean_uni$name, pattern="\t", replacement = "" )
#Remove the word "the" present in some uni titles in one data set and not in the same title in the second dataset
clean_equity$name <- str_replace_all(clean_equity$name, pattern="The ", replacement = "" )
clean_uni$name <- str_replace_all(clean_uni$name, pattern="The ", replacement = "" )
#use anti_join a filtering join to see what records will not be matched
equity_state_dropped <- anti_join(clean_equity, clean_uni, by = "name")
kable(head (equity_state_dropped), caption = "Universities which did not have a name match in both datasets, and will be dropped from further anlysis")
| name | Students with a disability | All Commencing Domestic Students |
|---|---|---|
| Non-University Higher Education Institutions | 966 | 18154 |
| Non-University Higher Education Institutions | 286 | 3754 |
| Non-University Higher Education Institutions | 127 | 2615 |
| Non-University Higher Education Institutions | 43 | 594 |
| Private Universities (Table C) and Non-University Higher Education Institutions | 66 | 1762 |
| Non-University Higher Education Institutions | 12 | 919 |
#visually confirmed these are small non-university providers and aggregates thus the results can be dropped
#join tables (use an inner join to ensure all universities found in both tables are mapped), because we don't want unmatched records
equity_state <- inner_join(clean_equity, clean_uni, by = "name")
kable(head(equity_state), caption = "Example of data contained in the dataset after combining")
| name | Students with a disability | All Commencing Domestic Students | Type | location | age |
|---|---|---|---|---|---|
| Charles Sturt University | 505 | 14015 | Public | NSW | Newer |
| Macquarie University | 829 | 11304 | Public | NSW | Newer |
| Southern Cross University | 489 | 6015 | Public | NSW | Newer |
| University of New England | 1212 | 9410 | Public | NSW | Middle |
| University of Newcastle | 889 | 11526 | Public | NSW | Newer |
| University of Sydney | 789 | 12367 | Public | NSW | Older |
#data appears to be tidy and ready for further processing.
Confirm the final dataset does not contain any special characters or missing data using colSums and is.na. There shouldn’t be any special characters in the numeric columns from the equity dataset, because as.numeric forced coercion earlier, thus any characters or non-numeric outputs would have been converted to NA at that time, but this was checked using a custom function.
#Confirm the three variables selected for further analysis do not contain outliers
colSums(is.na(equity_state))
## name Students with a disability
## 0 0
## All Commencing Domestic Students Type
## 0 0
## location age
## 0 0
#The data set is complete
#used coercion earlier values should be numeric
is.specialorNA<- function (x) {
if (is.numeric(x)) (is.infinite(x)|is.nan(x)|is.na(x))
}
sapply(equity_state, function (x) sum(is.specialorNA(x)))
## name Students with a disability
## 0 0
## All Commencing Domestic Students Type
## 0 0
## location age
## 0 0
There are no special characters in the dataset.
To see if there are any outliers using “Tukey’s method of outlier detection”, this involved creating a box plot for each variable. On the box plot the outlier fence is drawn at 1.5 times the interquartile range (IQR), and values which appear outside this range are individually plotted and potentially considered outlier are considered.
#explore the combined dataset, to check there are no outliers or errors in dates or other columns
summary(equity_state)
## name Students with a disability All Commencing Domestic Students
## Length:40 Min. : 90.0 Min. : 1233
## Class :character 1st Qu.: 465.8 1st Qu.: 6121
## Mode :character Median : 569.0 Median : 9690
## Mean : 630.0 Mean : 9369
## 3rd Qu.: 799.5 3rd Qu.:12810
## Max. :1329.0 Max. :16171
##
## Type location age
## Public :37 NSW :10 Older : 4
## Private (Au) : 2 QLD : 8 Middle: 6
## Private (Int): 1 VIC : 8 Newer :30
## WA : 5
## SA : 3
## ACT : 2
## (Other): 4
#par(mfrow = c(2,1))
boxplot(equity_state$`Students with a disability`, equity_state$`All Commencing Domestic Students`,
names = c("students with a disability", "domestic students"),
main= "Higher Education 2019 enrolments",
ylab = 'Students')
There are no outliers ## create a new variable
The ordered factor of university age was created when data was imported. To further explore this dataset, a new variable %disable has been created using manually defined formula.
#Create a new variable (% of disabled students per university)----
equity_state$`%disable` <- (equity_state$`Students with a disability`/equity_state$`All Commencing Domestic Students`)*100
To additionally check of outliers a Z-score transformation was applied to the All Commending Domestic Students, Students with a disability and %disable columns. To confirm the data for these variables was normally distributed and thus able to be transformed using a z-score, the distribution was visually assessed by plotting a histogram and mathematically calculated using the Kolmogorov-Smirov test and the Shapiro-Wilk test. For the mathematical assessment the H0 is that the data is normally distributed, as such the p results of >0.05 indicate a lack of evidence to reject this hypothesis. As such there is no evidence the data for the three variable is anything other than normally distributed, and therefore a z-score transformation is appropriate. The transformation is completed using the scores function from the outliers package. Typically an outlier is a value which has a Z score of more than 3.
From the summary function, there are no minimum or maximum values of more than 3, thus according to the standard interpretation of a Z-score there are no outliers in this dataset. Additionally there are only three values which are more than 2 standard deviations away from the mean. Thus all data is retained for further analysis.
par(mfrow = c(1,3))
#check if the all commencing domestic students variable is normally distributed
hist(equity_state$`All Commencing Domestic Students`,
main = "Domestic Students \n commencing 2019",
xlab = "Students",
ylab = "Frequency")
shapiro.test(equity_state$`All Commencing Domestic Students`)
##
## Shapiro-Wilk normality test
##
## data: equity_state$`All Commencing Domestic Students`
## W = 0.96175, p-value = 0.1922
ks.test(equity_state$`All Commencing Domestic Students`, "pnorm", mean = mean(equity_state$`All Commencing Domestic Students`), sd = sd(equity_state$`All Commencing Domestic Students`))
##
## One-sample Kolmogorov-Smirnov test
##
## data: equity_state$`All Commencing Domestic Students`
## D = 0.12013, p-value = 0.5695
## alternative hypothesis: two-sided
#check if the students with disability variable is normally distributed
hist(equity_state$`Students with a disability`,
main = " Students with a Disability",
xlab = "Students with a disablity",
ylab = "Frequency")
shapiro.test(equity_state$`Students with a disability`)
##
## Shapiro-Wilk normality test
##
## data: equity_state$`Students with a disability`
## W = 0.96039, p-value = 0.1727
ks.test(equity_state$`Students with a disability`, "pnorm", mean = mean(equity_state$`Students with a disability`), sd = sd(equity_state$`Students with a disability`))
##
## One-sample Kolmogorov-Smirnov test
##
## data: equity_state$`Students with a disability`
## D = 0.15655, p-value = 0.2531
## alternative hypothesis: two-sided
#check to see if the new %disable variable is normally distributed
hist(equity_state$`%disable`,
main = "%Disabled",
xlab = "%students with a disability",
ylab = "Frequency")
shapiro.test(equity_state$`%disable`)
##
## Shapiro-Wilk normality test
##
## data: equity_state$`%disable`
## W = 0.97994, p-value = 0.6872
ks.test(equity_state$`%disable`, "pnorm", mean = mean(equity_state$`%disable`), sd = sd(equity_state$`%disable`))
##
## One-sample Kolmogorov-Smirnov test
##
## data: equity_state$`%disable`
## D = 0.086372, p-value = 0.9015
## alternative hypothesis: two-sided
#as the data is normally distributed it can be transformed using z scores
equity_state$zdom <- scores(equity_state$`All Commencing Domestic Students`, type = "z")
equity_state$zdis <- scores(equity_state$`Students with a disability`, type = "z")
equity_state$`z%` <- scores(equity_state$`%disable`, type = "z")
summary(equity_state$zdom)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.02402 -0.80808 0.07973 0.00000 0.85602 1.69214
summary(equity_state$zdis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8297 -0.5565 -0.2066 0.0000 0.5744 2.3686
summary(equity_state$`z%`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0469 -0.4556 -0.1538 0.0000 0.6235 2.8091
length(which(abs(equity_state$zdom)>2))
## [1] 1
length(which(abs(equity_state$zdis)>2))
## [1] 2
length(which(abs(equity_state$`z%`)>2))
## [1] 2
To more readily interpret the student number variable, it is converted into ‘thousands of students’. This involves dividing the All Commencing Domestic Students by 1000.
#transform student numbers to '000's of students.
equity_state$students <- equity_state$`All Commencing Domestic Students`/1000
The datset has been cleaned and is ready for further investigation. Examples of questions which could be answered with this data are included below.
plot(equity_state$`%disable` ~ equity_state$location,
main = "Does state influence equity student enrolment?",
xlab = "Location",
ylab = "Student Identifing as disabled",
ylim = c(0,25))
abline (h =20, col = "Blue", lwd =2)
text (5, 22, "National average (disability)", col = "blue")
text (6, 15, "UNE 12%")
plot(equity_state$`%disable` ~ equity_state$students,
main = "Does size influence equity student enrolment?",
xlab = "Student number ('000s)",
ylab = "Student Identifing as disabled",
ylim = c(0,25))
abline (h =20, col = "Blue", lwd =2)
text (5, 22, "National average (disability)", col = "blue")
plot(equity_state$students ~ equity_state$location,
main = "Where are larger universities located?",
xlab = "Location",
ylab = "Student number ('000s)")
# ylim = c(0,25))
# abline (h =20, col = "Blue", lwd =2)
# text (5, 22, "National average (disability)", col = "blue")
# text (6, 15, "UNE 12%")
#Colour by state or age
List of universities in Australia. (2020). In Wikipedia. https://en.wikipedia.org/w/index.php?title=List_of_universities_in_Australia&oldid=982617437 Selected Higher Education Statistics – 2019 Student data | Department of Education, Skills and Employment. (n.d.). Retrieved 1 October 2020, from https://www.education.gov.au/selected-higher-education-statistics-2019-student-data Wickham, H. (2014). Tidy data. The American Statistician, 14. https://doi.org/10.18637/jss.v059.i10