MATH2349 Data Wrangling

Required packages

For this analysis I have used the following libraries, they are loaded silently message = FALSE with no console output. The output from the library loading indicates masking and conflicts, choosing not to display this improve the readability of the report.

# This is the R chunk for the required packages

library(readxl) #from module 2
library(rvest) #from module 2
library(dplyr) #from module 4 
library(tidyr) #from module 4
library(stringr) #from module 8
library(knitr) #from r bootcamp 4 in applied analytics
library(outliers) #from module 6

Executive Summary

To explore the connection between student population, age of university and equity group (focusing on students with disabilities) participation. This report details the steps involved in pre-processing this data (Figure 1).

The location and age of Australian universities was extracted from Wikipedia (2020). The Department of Education, Skills and Employment published Selected Higher Education Statistics - 2019 Student Data in 2020. This dataset includes information on equity group participation in higher education. The combined dataset allows for the exploration of different equity groups in Australian higher education with respect to age and state.

As the data from the Department of Education, Skills and Employment was untidy, it was cleaned, joined to location and age data and then examined for outliers using box plots and a Z-score transformation. The percent of student population with a reported disability was calculated and graphed according to university age.

Figure 1: Overview of data processing methodology used

Data

University location data was scraped using rvest from Wikipedia (2020) - Dataset 1.

Data on students who belong to equity groups was retrieved from the Australian Government Department of Education, Skills and Employment and Training (2020) and imported using read_excel - Dataset 2.

Data set 1: University location data

This data is untidy, in that there are multiple observations in some cells (i.e. campus contains a list) but for the information I require from this table it tidy once the unnecessary columns have been removed. The list of campus locations has not been expanded to allow it to be tidy, as this information is not required for the analysis.

List of Australian University- Data description

“University” - Accredited universities in Australia “Type” - Factor with three levels indicating type of University (Public, Private(Australian), Private (International))
“Campus[4]” - List containing location of university campuses
“State/Territory[4]” - Home state (base of major campus) “Established” - Year University was established “University status” - Year University was granted University Status “Times rank[5]” - Ranking or band for 2016 according to Times Higher Education - World University Rankings “ARWU rank[6]” - Ranking or band for 2016 according to Shanghai Ranking - ARWU University Rankings “QS rank[7]” - Ranking or band for 2020 according to QS World University Rankings - Top Universities “CWTS rank[note 1]” - Ranking or band for 2016 according to Centre for Science and Technology Studies - CWRS Leiden Ranking

#Dataset 1 - University locations 
#library(rvest) 
unis <- read_html("https://en.wikipedia.org/wiki/List_of_universities_in_Australia")
all_tables <- html_nodes(unis,"table")
uni_state1 <- html_table(all_tables[[2]], fill=TRUE)
kable(head(uni_state1), caption = "Example (first 6 rows) of University location data (Dataset 1)")

Example (first 6 rows) of University location data (Dataset 1)
University	Type	Campus[4]	State/Territory[4]	Established	University status	Times rank[5]	ARWU rank[6]	QS rank[7]	US News rank[8]	CWTS rank[note 1]
Australian Catholic University	Public	Sydney, Brisbane, Canberra, Ballarat, Melbourne	National	1991	1991	251-300	501-600	801-1000	624=	1000+
Australian National University	Public	Canberra	ACT	1946	1946	59	67	31=	69=	209
Bond University	Private (Australian)	Gold Coast	QLD	1987	1987	501-600	901-1000	414	1000+	N/A
Carnegie Mellon University (Australia campus)	Private (International)	Adelaide	SA	2006	2006	-	-	-	-	-
Central Queensland University	Public	Rockhampton, Mackay, Brisbane, Sydney, Perth, Townsville, Melbourne, Adelaide, Bundaberg, Gladstone, Noosa	QLD	1967	1992	801-1000	N/A	601–650	936=	N/A
Charles Darwin University	Public	Darwin	NT	1989	2003	501-600	N/A	551–600	841=	N/A

Data set 2: Student Equity data

This data was downloaded from the Department of Education, Skills and Employment website on 12th October, 2020. As the dataset is an Excel document, it has been imported using read_excel skipping the first three rows of data.

This dataset is information about Australian student commencing higher education in 2019 presented by state, institution and recognised equity groups. This data is untidy because the first column “state/institution” contains both state data and institution name. The presence of column totals and interpretation notes also makes this data untidy on first import. Beyond this the data is tidy in that there is one row per observation, each variable has its own column.

Selected Higher Education Statistics 2019 Student - Data description

“State/Institution” - Name of Australian University
“Students from a Non English speaking background” - Number of students from a Non English speaking background “Students with a disability” - Number of students with a disability “Women in Non-Traditional Area” - Number of Women who are studying in non-Traditional areas “Indigenous” - Number of students who identify of Indigenous of of Torres Strait Islander “Low SES postcode measure(b)” - Number of students who are from a Low SES area, when using Post Code “Low SES by SA1(b,c)” - Number of students who are from a low SES area when using Statistical Area 1 boundaries “First Address Low SES by SA1(b,c,d)” - Number of students with a first address classified as Low SES either by postcode or SA1 boundaries “Regional(e)” - Number of students from Regional as determined by ASGS based on students permanent home address “First Address Regional(d,e)” - Number of students from Regional as determined by ASGS based on students first address “Remote(e)” - Number of students from Remote as determined by ASGS based on students permanent home address “First Address Remote(d,e)” - Number of students from Remote as determined by ASGS based on students first address “All Commencing Domestic Students” - Total number of students who commenced in 2019 “…14” - empty column “Total 2018” - Total number of students who commenced in 2018

#Dataset 2 - Student Equity data
 equity_19 <- read_excel("../Sheryl/2019_section_11_-_equity_groups.xls", 
                        sheet = "3", skip = 3)

## New names:
## * `` -> ...14

kable(head (equity_19), caption = "Example (first 6 rows) of the Student Equity data (Dataset 2)")

Example (first 6 rows) of the Student Equity data (Dataset 2)
State/Institution	Students from a Non English speaking background	Students with a disability	Women in Non-Traditional Area	Indigenous	Low SES postcode measure(b)	Low SES by SA1(b,c)	First Address Low SES by SA1(b,c,d)	Regional(e)	First Address Regional(d,e)	Remote(e)	First Address Remote(d,e)	All Commencing Domestic Students	…14	Total 2018
New South Wales	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Charles Sturt University	166	505	1312	641	3318	3285	3329	5868	5706	169	177	14015	NA	13529
Macquarie University	478	829	2292	139	1055	1048	1085	532	640	26	32	11304	NA	10896
Southern Cross University	87	489	615	270	1376	1393	1391	2619	2560	65	55	6015	NA	5852
The University of New England	121	1212	1526	469	2364	2142	2164	3728	3560	174	174	9410	NA	8578
The University of Newcastle	199	889	1142	535	2921	2751	2809	2357	2627	46	54	11526	NA	11542

#Rename the column with University names to facilitate joining. 
equity_19 <-rename(equity_19, name = `State/Institution`)

Understand

Data processing: Dataset 1:

This data was scraped from Wikipedia, using the rvest library. The imported data was checked and the data types corrected. All data was processed, even data which was removed prior to final analysis. Base R functions, were used to correct the data types were preserved.

To confirm the data was imported correctly, and stored in the correct formats I used nrow and str in combination the with sample data print from above. The data appears to have correctly formatted, with no obvious errors.

# This is the R chunk for the Understand Section
#identify variables
nrow(uni_state1)

## [1] 42

str(uni_state1)

## 'data.frame':    42 obs. of  11 variables:
##  $ University        : chr  "Australian Catholic University" "Australian National University" "Bond University" "Carnegie Mellon University (Australia campus)" ...
##  $ Type              : chr  "Public" "Public" "Private (Australian)" "Private (International)" ...
##  $ Campus[4]         : chr  "Sydney, Brisbane, Canberra, Ballarat, Melbourne" "Canberra" "Gold Coast" "Adelaide" ...
##  $ State/Territory[4]: chr  "National" "ACT" "QLD" "SA" ...
##  $ Established       : int  1991 1946 1987 2006 1967 1989 1948 1966 1974 1902 ...
##  $ University status : int  1991 1946 1987 2006 1992 2003 1989 1986 1974 1991 ...
##  $ Times rank[5]     : chr  "251-300" "59" "501-600" "-" ...
##  $ ARWU rank[6]      : chr  "501-600" "67" "901-1000" "-" ...
##  $ QS rank[7]        : chr  "801-1000" "31=" "414" "-" ...
##  $ US News rank[8]   : chr  "624=" "69=" "1000+" "-" ...
##  $ CWTS rank[note 1] : chr  "1000+" "209" "N/A" "-" ...

As the data had the desired information including expected number of observations and columns factor was used to create relevant factors for the Type, State/Territory, and the various University Ranking columns. For the University ranking columns, unique was used as a comprimise. Ideally the rankings would be presented as ordered factors, but due to the dynamic nature of the datasource, manually coding the factors will likely result in the introduction of errors. To demonstrate the consequence of manually coding the data, one example of a hand coded ordered factors has been retained and compared the with the derived factor creation using set_diff. The manual input, was accurate for the data source on October 5th 2020, but by the 17th October this code introduced NAs into the analysis, at which point the more generic, but unordered approach was implemented.

To ensure the dataset contained an ordered factor, a new variable - Age was created based on the year an institute was granted university status, by the Australian Government.

#there are three types of universities within Australia, as such these are presented as factors, there is no order to these categories. 

uni_state1$Type <- factor (uni_state1$Type,
                            levels = c ("Public", "Private (Australian)", "Private (International)"), 
                            labels = c ("Public", "Private (Au)", "Private (Int)"))

#states should be presented as a factor. It is not necessary to order the states factor as presenting it alphabetically is appropriate. 

uni_state1$location <- factor (uni_state1$`State/Territory[4]`, 
                           levels = c ("National", "ACT", "QLD", "SA" ,"NT",   "NSW",   "WA",  "VIC",  "National,  New Zealand", "TAS"   ), 
                           labels = c("AU",  "ACT" , "QLD",  "SA", "NT", "NSW" ,"WA"  ,  "VIC", "AU/NZ", "TAS" ))

#converting the ranking data to ordered factors, this data is not currently relevant for the analysis, but has been conducted to allow the dataset to be re-used in the future.  
#As the data is being scraped, the opportunity for data to be modified, between preparing the code and the final run of the analysis is high, particularly  with the annual ranking data.

#example of manually processing ARWU ranking column - but this is risky as the dataset is 'live' and scraped for each analysis. 
uni_state1$`ARWU rank Oct 5`<- factor(uni_state1$`ARWU rank[6]`,
                                          levels = c("40","55","77=","79=","82","96=","101-150","201-300","301-400","401-500", "701-800","N/A","-"),
                                          labels = c("40","55","77=","79=","82","96=","101-150","201-300","301-400","401-500","701-800","unranked", "unranked"),
                                          ordered = TRUE )
#ARWU rank as an unordered factor
ARWU <- unique(uni_state1$`ARWU rank[6]`)
uni_state1$`ARWU rank[6]` <- factor(uni_state1$`ARWU rank[6]`, levels = ARWU, labels = ARWU)


#What is the current different between deriving the factors and manually setting the factors?
setdiff( uni_state1$`ARWU rank[6]`, uni_state1$`ARWU rank Oct 5`)

##  [1] "501-600"  "67"       "901-1000" "-"        "N/A"      "601-700" 
##  [7] "85"       "151-200"  "35"       "74"       "301–400"  "54"      
## [13] "801-900"  "74="      "201–300"

#based on this substantial difference (within 1 week), accurate data is more important than ordered data, thus the ranks are imported as unordered factors.

# Times rank an unordered factor
Times<- unique(uni_state1$`Times rank[5]`)
uni_state1$`Times rank[5]` <- factor (uni_state1$`Times rank[5]`, levels = Times, labels = Times)

# QS rank as an unordered factor
Qs<- unique(uni_state1$`QS rank[7]`)
uni_state1$`QS rank[7]` <- factor (uni_state1$`QS rank[7]`, levels = Qs, labels = Qs)

#Create a new factored variable grouping universities based on age, and order the factor. 
uni_state1$age <- cut(uni_state1$`University status`, 3, labels = c("Older", "Middle", "Newer"), ordered_result = TRUE)
# confirm structure
str(uni_state1)

## 'data.frame':    42 obs. of  14 variables:
##  $ University        : chr  "Australian Catholic University" "Australian National University" "Bond University" "Carnegie Mellon University (Australia campus)" ...
##  $ Type              : Factor w/ 3 levels "Public","Private (Au)",..: 1 1 2 3 1 1 1 1 1 1 ...
##  $ Campus[4]         : chr  "Sydney, Brisbane, Canberra, Ballarat, Melbourne" "Canberra" "Gold Coast" "Adelaide" ...
##  $ State/Territory[4]: chr  "National" "ACT" "QLD" "SA" ...
##  $ Established       : int  1991 1946 1987 2006 1967 1989 1948 1966 1974 1902 ...
##  $ University status : int  1991 1946 1987 2006 1992 2003 1989 1986 1974 1991 ...
##  $ Times rank[5]     : Factor w/ 24 levels "251-300","59",..: 1 2 3 4 5 3 5 6 7 8 ...
##  $ ARWU rank[6]      : Factor w/ 18 levels "501-600","67",..: 1 2 3 4 5 5 5 6 6 7 ...
##  $ QS rank[7]        : Factor w/ 37 levels "801-1000","31=",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ US News rank[8]   : chr  "624=" "69=" "1000+" "-" ...
##  $ CWTS rank[note 1] : chr  "1000+" "209" "N/A" "-" ...
##  $ location          : Factor w/ 10 levels "AU","ACT","QLD",..: 1 2 3 4 3 5 6 7 8 7 ...
##  $ ARWU rank Oct 5   : Ord.factor w/ 12 levels "40"<"55"<"77="<..: NA NA NA 12 12 12 12 8 8 NA ...
##  $ age               : Ord.factor w/ 3 levels "Older"<"Middle"<..: 3 2 3 3 3 3 3 3 3 3 ...

Dataset 1 (university location data) now has the appropriate attributes and structures, as confirmed using str.

Data processing: Dataset 2:

This data was imported using the read_xls library. The imported data was checked and the data types corrected. All data was processed, even data which will be removed prior to final analysis. Base R functions, were used to correct the data types.

To confirm the data was imported correctly, and stored in the correct formats I used nrow and str.

#check the data 
nrow(equity_19)

## [1] 70

str(equity_19)

## tibble [70 x 15] (S3: tbl_df/tbl/data.frame)
##  $ name                                           : chr [1:70] "New South Wales" "Charles Sturt University" "Macquarie University" "Southern Cross University" ...
##  $ Students from a Non English speaking background: num [1:70] NA 166 478 87 121 199 761 711 537 128 ...
##  $ Students with a disability                     : num [1:70] NA 505 829 489 1212 ...
##  $ Women in Non-Traditional Area                  : num [1:70] NA 1312 2292 615 1526 ...
##  $ Indigenous                                     : chr [1:70] NA "641" "139" "270" ...
##  $ Low SES postcode measure(b)                    : chr [1:70] NA "3318" "1055" "1376" ...
##  $ Low SES by SA1(b,c)                            : chr [1:70] NA "3285" "1048" "1393" ...
##  $ First Address Low SES by SA1(b,c,d)            : chr [1:70] NA "3329" "1085" "1391" ...
##  $ Regional(e)                                    : chr [1:70] NA "5868" "532" "2619" ...
##  $ First Address Regional(d,e)                    : chr [1:70] NA "5706" "640" "2560" ...
##  $ Remote(e)                                      : chr [1:70] NA "169" "26" "65" ...
##  $ First Address Remote(d,e)                      : chr [1:70] NA "177" "32" "55" ...
##  $ All Commencing Domestic Students               : chr [1:70] NA "14015" "11304" "6015" ...
##  $ ...14                                          : logi [1:70] NA NA NA NA NA NA ...
##  $ Total 2018                                     : chr [1:70] NA "13529" "10896" "5852" ...

Columns 1 to 4 have imported correctly, and have the right attribute types. Column 14 is not required and has been removed using the subset function.

Incomplete rows were removed using not complete cases (!complete.cases) at this point because, when the remaining columns are forced to be numbers, the process will introduce NAs, thus introduce new ‘incomplete cases’. Therefore, removing the incomplete cases at this point, minimised the risk of valuable data being accidentally discarded.

Many of the columns of interest (columns 4-12) had imported as characters, rather than numbers. This was due to the presence of annotations, or the use of <5 to maintain student privacy and prevent identifiable release of information. Using as.numeric coerced any non-numeric data to NA(further explored in the next section) warning messages are not printed for this code block.

#confirm all columns have correct structure 
str(equity_19b)

## tibble [50 x 14] (S3: tbl_df/tbl/data.frame)
##  $ name                                           : chr [1:50] "Charles Sturt University" "Macquarie University" "Southern Cross University" "The University of New England" ...
##  $ Students from a Non English speaking background: num [1:50] 166 478 87 121 199 ...
##  $ Students with a disability                     : num [1:50] 505 829 489 1212 889 ...
##  $ Women in Non-Traditional Area                  : num [1:50] 1312 2292 615 1526 1142 ...
##  $ Indigenous                                     : num [1:50] 641 139 270 469 535 129 152 132 199 274 ...
##  $ Low SES postcode measure(b)                    : num [1:50] 3318 1055 1376 2364 2921 ...
##  $ Low SES by SA1(b,c)                            : num [1:50] 3285 1048 1393 2142 2751 ...
##  $ First Address Low SES by SA1(b,c,d)            : num [1:50] 3329 1085 1391 2164 2809 ...
##  $ Regional(e)                                    : num [1:50] 5868 532 2619 3728 2357 ...
##  $ First Address Regional(d,e)                    : num [1:50] 5706 640 2560 3560 2627 ...
##  $ Remote(e)                                      : num [1:50] 169 26 65 174 46 17 38 17 15 19 ...
##  $ First Address Remote(d,e)                      : num [1:50] 177 32 55 174 54 24 38 16 18 25 ...
##  $ All Commencing Domestic Students               : num [1:50] 14015 11304 6015 9410 11526 ...
##  $ Total 2018                                     : num [1:50] 13529 10896 5852 8578 11542 ...

The attributes for all columns are now correct. Further data processing can now be completed.

Tidy & Manipulate Dataset I

The data extracted from Wikipedia, was mostly tidy. As mentioned earlier the campus column contains as a list, thus strictly this data does not meet the tidy principles. However, the campus column will not be used as part of this analysis thus once that column has been removed, this remaining data set does not require further cleaning as it meets the tidy data principles of each variable in its own column and each observation in its own row (Wickham, 2014).

#remove campus column because it contains a list of data and is not tidy. 
uni_state1 <- uni_state1[,-3]
clean_uni <- uni_state1[,c(1:2, 11, 13)]

#confirm data is complete
rowSums(is.na(clean_uni))

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0

colSums(is.na(clean_uni))

## University       Type   location        age 
##          0          0          0          0

#rename institute column to help with joining 
clean_uni <- clean_uni %>% rename(name = University)


#preview data
kable (head(clean_uni), caption = "University Location and Type data, in tidy format)")

University Location and Type data, in tidy format)
name	Type	location	age
Australian Catholic University	Public	AU	Newer
Australian National University	Public	ACT	Middle
Bond University	Private (Au)	QLD	Newer
Carnegie Mellon University (Australia campus)	Private (Int)	SA	Newer
Central Queensland University	Public	QLD	Newer
Charles Darwin University	Public	NT	Newer

Tidy & Manipulate Dataset II

Minimal tidying - specifically the removal of one column and rows with no data or only headings - was performed as part of the initial data import. This was conducted before the type conversion of the rows, as NAs are introduced during the coercion increasing the number of ‘incomplete’ cases, thus increasing the potential of accidentally removing valid data. The use of complete cases allowed effective subsetting of the data, prior to coercion.

To produce a tidy dataframe for the information on student equity groups NAs were identified using is.na across rows and columns. The columns which were converted from characters to numbers are expected to have NAs in the dataframe. Impacted institutes were identified and further investigated using complete.cases and filter.

From the initial scan of all columns and rows, there is a small amount of missing data. When focusing on the two columns of particular interest for this analysis - it appears there are only two rows which have NAs. In an effort to determine if the data is critical for the overall analysis, the raw data for the two identified institutes was visually scanned.

# This is the R chunk for the Tidy & Manipulate Data II 

#data is in correct format, determine how many data points are missing or NAs

colSums(is.na(equity_19b))

##                                            name 
##                                               0 
## Students from a Non English speaking background 
##                                               0 
##                      Students with a disability 
##                                               0 
##                   Women in Non-Traditional Area 
##                                               0 
##                                      Indigenous 
##                                               2 
##                     Low SES postcode measure(b) 
##                                               2 
##                             Low SES by SA1(b,c) 
##                                               2 
##             First Address Low SES by SA1(b,c,d) 
##                                               2 
##                                     Regional(e) 
##                                               2 
##                     First Address Regional(d,e) 
##                                               2 
##                                       Remote(e) 
##                                               2 
##                       First Address Remote(d,e) 
##                                               2 
##                All Commencing Domestic Students 
##                                               2 
##                                      Total 2018 
##                                               2

rowSums(is.na(equity_19b))

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 2 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
## [39] 0 0 0 0 8 0 0 0 0 0 0 0

#A small number of missing values, likely to be due to data not supplied occurrences of less than 5 students in an equity group

sum(is.na(equity_19b$`Students with a disability`))

## [1] 0

sum(is.na(equity_19b$`All Commencing Domestic Students`))

## [1] 2

#Find which institutes have NAs

CoercedNas <-equity_19b[!complete.cases(equity_19b),]
kable(head(c(CoercedNas), caption = "List of institutes which have NAs introduce do the coersion"))

x
University of Divinity
Victoria University
Bond University
Non-University Higher Education Institutions
Batchelor Institute of Indigenous Tertiary Education

x
16
503
18
86
0

x
33
566
90
43
0

x
0
589
174
79
0

x
NA
57
35
6
NA

x
92
1339
NA
147
NA

# missing enrolment data for 2019 from two institutes - Batchelor Institute in the Northern Territory and Divinity University 
#Find what the original input (prior to coercion) was for these institutes. 

A<- filter(equity_19, name == "Batchelor Institute of Indigenous Tertiary Education")
B <-filter(equity_19, name ==  "University of Divinity")

#display relevant data to make judgement on inclusion
A$`All Commencing Domestic Students` #small number remove

## [1] "< 5"

t(B) #large number, but no data available to substitute

##                                                 [,1]                    
## name                                            "University of Divinity"
## Students from a Non English speaking background "16"                    
## Students with a disability                      "33"                    
## Women in Non-Traditional Area                   "0"                     
## Indigenous                                      "np"                    
## Low SES postcode measure(b)                     "92"                    
## Low SES by SA1(b,c)                             "95"                    
## First Address Low SES by SA1(b,c,d)             "106"                   
## Regional(e)                                     "134"                   
## First Address Regional(d,e)                     "140"                   
## Remote(e)                                       "< 5"                   
## First Address Remote(d,e)                       "< 5"                   
## All Commencing Domestic Students                "np"                    
## ...14                                           NA                      
## Total 2018                                      "np"

#create a new dataframe, removing the two universities with NA's in the total column
equity_19c<- equity_19b %>%  filter(name !="Batchelor Institute of Indigenous Tertiary Education" & name !="University of Divinity")

Batchelor institutes had <5 students who met a number of equity metrics, to prevent individual identification of students the specific data is not publicly available. The total data for the prior year (2018) is provided, however this also is less than 5. Less than 5 students represents a tiny fraction of the dataset. As such the Batchelor institute data has been removed prior to further analysis.

The second institute The University of Divinity, has a much higher student population >140 students but has not published their total enrolment figures for 2018 or 2019. Replacement data was not available from other datasets. Based on other data within the dataset, if the ratio of total students to students with a disability holds, this institute would have around 1000 students. A small faction of Domestic students within the Australian higher education sector as such the data from this institute is removed. In total it is estimated less than 5% of observations have been excluded.

Scan I (Dataset 2)

The forced coercion of numeric columns earlier using as.numeric, converted any characters to NA. As explored above, there were multiple examples (<5, np) or characters in the numeric field. Using summary the numeric values of the dataset were explored. Initial results revealed very high maximum values more than a factor of 10 above the 3rd quartile readings. This type of result is indicative of a total column being included in the dataset. The presence of such a column was confirmed using stringr::str_detect and then this data was removed using filter. Upon further inspection the data looks better, although specific exploration for outliers of interested data will be required prior to further analysis.

# This is the R chunk for the Scan I

##check the values for obvious errors
summary(equity_19c)

##      name           Students from a Non English speaking background
##  Length:48          Min.   :   18.0                                
##  Class :character   1st Qu.:  133.2                                
##  Mode  :character   Median :  233.5                                
##                     Mean   :  636.5                                
##                     3rd Qu.:  503.0                                
##                     Max.   :15283.0                                
##                                                                    
##  Students with a disability Women in Non-Traditional Area   Indigenous     
##  Min.   :   12.0            Min.   :    0.0               Min.   :   6.00  
##  1st Qu.:  344.2            1st Qu.:  609.2               1st Qu.:  96.75  
##  Median :  556.5            Median : 1141.5               Median : 155.00  
##  Mean   : 1115.4            Mean   : 2587.2               Mean   : 391.08  
##  3rd Qu.:  799.5            3rd Qu.: 2113.0               3rd Qu.: 310.50  
##  Max.   :26786.0            Max.   :62095.0               Max.   :9389.00  
##                                                                            
##  Low SES postcode measure(b) Low SES by SA1(b,c)
##  Min.   :  141.0             Min.   :  128      
##  1st Qu.:  873.5             1st Qu.:  961      
##  Median : 1399.0             Median : 1327      
##  Mean   : 3179.0             Mean   : 2904      
##  3rd Qu.: 2238.5             3rd Qu.: 2114      
##  Max.   :74798.0             Max.   :68336      
##  NA's   :1                   NA's   :1          
##  First Address Low SES by SA1(b,c,d)  Regional(e)     
##  Min.   :  123.0                     Min.   :  139.0  
##  1st Qu.:  972.5                     1st Qu.:  632.5  
##  Median : 1377.0                     Median : 1468.0  
##  Mean   : 3004.5                     Mean   : 3431.7  
##  3rd Qu.: 2222.0                     3rd Qu.: 2344.5  
##  Max.   :70694.0                     Max.   :80728.0  
##  NA's   :1                           NA's   :1        
##  First Address Regional(d,e)   Remote(e)      First Address Remote(d,e)
##  Min.   :  164               Min.   :   5.0   Min.   :   5.0           
##  1st Qu.:  684               1st Qu.:  24.0   1st Qu.:  25.5           
##  Median : 1651               Median :  47.0   Median :  55.0           
##  Mean   : 3588               Mean   : 167.7   Mean   : 173.3           
##  3rd Qu.: 2594               3rd Qu.: 144.0   3rd Qu.: 135.5           
##  Max.   :84404               Max.   :3944.0   Max.   :4074.0           
##  NA's   :1                   NA's   :1        NA's   :1                
##  All Commencing Domestic Students   Total 2018    
##  Min.   :   594                   Min.   :   726  
##  1st Qu.:  4976                   1st Qu.:  5034  
##  Median :  9072                   Median :  8642  
##  Mean   : 16828                   Mean   : 16886  
##  3rd Qu.: 12810                   3rd Qu.: 13175  
##  Max.   :404148                   Max.   :405666  
##

#the max values seem substantially higher than other observations,  it is possible a total column is still in the data set. 
equity_19c$name %>% str_to_upper() %>% str_detect( "TOTAL", negate = FALSE)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

#remove total from dataset 
equity_19d <-filter(equity_19c, name !=  "TOTAL")

summary(equity_19d)

##      name           Students from a Non English speaking background
##  Length:47          Min.   :  18.0                                 
##  Class :character   1st Qu.: 131.5                                 
##  Mode  :character   Median : 228.0                                 
##                     Mean   : 324.8                                 
##                     3rd Qu.: 499.0                                 
##                     Max.   :1146.0                                 
##                                                                    
##  Students with a disability Women in Non-Traditional Area   Indigenous   
##  Min.   :  12.0             Min.   :   0.0                Min.   :  6.0  
##  1st Qu.: 340.5             1st Qu.: 603.5                1st Qu.: 96.5  
##  Median : 551.0             Median :1141.0                Median :152.0  
##  Mean   : 569.2             Mean   :1321.1                Mean   :199.6  
##  3rd Qu.: 790.5             3rd Qu.:1934.0                3rd Qu.:291.0  
##  Max.   :1329.0             Max.   :3454.0                Max.   :641.0  
##                                                                          
##  Low SES postcode measure(b) Low SES by SA1(b,c)
##  Min.   : 141.0              Min.   : 128.0     
##  1st Qu.: 850.2              1st Qu.: 931.5     
##  Median :1389.0              Median :1276.5     
##  Mean   :1622.1              Mean   :1482.1     
##  3rd Qu.:2099.8              3rd Qu.:2085.8     
##  Max.   :4425.0              Max.   :4192.0     
##  NA's   :1                   NA's   :1          
##  First Address Low SES by SA1(b,c,d)  Regional(e)    
##  Min.   : 123.0                      Min.   : 139.0  
##  1st Qu.: 942.2                      1st Qu.: 631.8  
##  Median :1334.5                      Median :1429.5  
##  Mean   :1533.0                      Mean   :1751.3  
##  3rd Qu.:2206.8                      3rd Qu.:2316.5  
##  Max.   :4266.0                      Max.   :8096.0  
##  NA's   :1                           NA's   :1       
##  First Address Regional(d,e)   Remote(e)      First Address Remote(d,e)
##  Min.   : 164                Min.   :  5.00   Min.   :  5.00           
##  1st Qu.: 683                1st Qu.: 23.50   1st Qu.: 25.25           
##  Median :1592                Median : 47.00   Median : 54.50           
##  Mean   :1831                Mean   : 85.61   Mean   : 88.52           
##  3rd Qu.:2460                3rd Qu.:122.00   3rd Qu.:128.50           
##  Max.   :7550                Max.   :433.00   Max.   :381.00           
##  NA's   :1                   NA's   :1        NA's   :1                
##  All Commencing Domestic Students   Total 2018   
##  Min.   :  594                    Min.   :  726  
##  1st Qu.: 4892                    1st Qu.: 4962  
##  Median : 8733                    Median : 8578  
##  Mean   : 8587                    Mean   : 8614  
##  3rd Qu.:12582                    3rd Qu.:12651  
##  Max.   :18154                    Max.   :18058  
##

#Overall values seems reasonable, investigate specific variables of interest in more data after joining dataset. 

clean_equity <- equity_19d %>% select (name, `Students with a disability`,`All Commencing Domestic Students`)

kable(head(clean_equity), caption = "Example of data contained in the cleaned equity dataset")

Example of data contained in the cleaned equity dataset
name	Students with a disability	All Commencing Domestic Students
Charles Sturt University	505	14015
Macquarie University	829	11304
Southern Cross University	489	6015
The University of New England	1212	9410
The University of Newcastle	889	11526
The University of Sydney	789	12367

With the removal of obvious errors, the two datasets have now been combined to allow a more detailed exploration of the columns of interest. The datasets were joined though an inner join using institute name. Matching on a string can lead to unexpected result, thus visual validation of alignment after each processing step was critical. The dataset was visually inspected to confirm alignment and accurate matching of columns. An example output of the first attempt at a join (using left_join) is provided to demonstrate the misalignment in datasets. After manipulating strings to remove white space, tabs and “The” from the names fields of both datasets, the quality of the join was considered acceptable. A sample of the fully aligned data is also provided as confirmation of the data cleaning undertaken.

Confirming the accuracy of the join the data which hasn’t been included in the new dataset was inspected. The anti_join function allowed this dataset to be created (although it does not contain a full list of column, as there is no data for these institutes in Dataset 1). The data contained in this table is equity data from unmatched institutes, upon visual inspected this data represents non-university providers and aggregate state based information.

#join tables 

equity_state <- left_join(clean_equity, clean_uni, by = "name")

kable(head(equity_state), caption = "Example of data contained in the joined dataset")

Example of data contained in the joined dataset
name	Students with a disability	All Commencing Domestic Students	Type	location	age
Charles Sturt University	505	14015	Public	NSW	Newer
Macquarie University	829	11304	Public	NSW	Newer
Southern Cross University	489	6015	Public	NSW	Newer
The University of New England	1212	9410	NA	NA	NA
The University of Newcastle	889	11526	Public	NSW	Newer
The University of Sydney	789	12367	NA	NA	NA

#terrible alignment, there are many rows with NAs - try and clean up white space and non-printing characters to improve accuracy of name recognition. 

#use base R to remove white spaces from both ends. 
clean_equity$name <- trimws(clean_equity$name)
clean_uni$name <- trimws(clean_uni$name)

#use stringer to remove white spaces
clean_equity$name <- str_trim(clean_equity$name, side="both")
clean_uni$name <- str_trim(clean_uni$name, side="both")

#specifically remove tab characters
clean_equity$name <- str_replace_all(clean_equity$name, pattern="\t", replacement = "" )
clean_uni$name <- str_replace_all(clean_uni$name, pattern="\t", replacement = "" )

#Remove the word "the" present in some uni titles in one data set and not in the same title in the second dataset
clean_equity$name <- str_replace_all(clean_equity$name, pattern="The ", replacement = "" )
clean_uni$name <- str_replace_all(clean_uni$name, pattern="The ", replacement = "" )

#use anti_join a filtering join to see what records will not be matched 
equity_state_dropped <- anti_join(clean_equity, clean_uni, by = "name")
kable(head (equity_state_dropped), caption = "Universities which did not have a name match in both datasets, and will be dropped from further anlysis")

Universities which did not have a name match in both datasets, and will be dropped from further anlysis
name	Students with a disability	All Commencing Domestic Students
Non-University Higher Education Institutions	966	18154
Non-University Higher Education Institutions	286	3754
Non-University Higher Education Institutions	127	2615
Non-University Higher Education Institutions	43	594
Private Universities (Table C) and Non-University Higher Education Institutions	66	1762
Non-University Higher Education Institutions	12	919

#visually confirmed these are small non-university providers and aggregates thus the results can be dropped 

#join tables (use an inner join to ensure all universities found in both tables are mapped), because we don't want unmatched records
equity_state <- inner_join(clean_equity, clean_uni, by = "name")
kable(head(equity_state), caption = "Example of data contained in the dataset after combining")

Example of data contained in the dataset after combining
name	Students with a disability	All Commencing Domestic Students	Type	location	age
Charles Sturt University	505	14015	Public	NSW	Newer
Macquarie University	829	11304	Public	NSW	Newer
Southern Cross University	489	6015	Public	NSW	Newer
University of New England	1212	9410	Public	NSW	Middle
University of Newcastle	889	11526	Public	NSW	Newer
University of Sydney	789	12367	Public	NSW	Older

#data appears to be tidy and ready for further processing.

Scan II (Dataset 1&2)

Confirm the final dataset does not contain any special characters or missing data using colSums and is.na. There shouldn’t be any special characters in the numeric columns from the equity dataset, because as.numeric forced coercion earlier, thus any characters or non-numeric outputs would have been converted to NA at that time, but this was checked using a custom function.

#Confirm the three variables selected for further analysis do not contain outliers

colSums(is.na(equity_state))

##                             name       Students with a disability 
##                                0                                0 
## All Commencing Domestic Students                             Type 
##                                0                                0 
##                         location                              age 
##                                0                                0

#The data set is complete
#used coercion earlier values should be numeric 

is.specialorNA<- function (x) {
  if (is.numeric(x)) (is.infinite(x)|is.nan(x)|is.na(x))
}
sapply(equity_state, function (x) sum(is.specialorNA(x)))

##                             name       Students with a disability 
##                                0                                0 
## All Commencing Domestic Students                             Type 
##                                0                                0 
##                         location                              age 
##                                0                                0

There are no special characters in the dataset.

check of outliers

To see if there are any outliers using “Tukey’s method of outlier detection”, this involved creating a box plot for each variable. On the box plot the outlier fence is drawn at 1.5 times the interquartile range (IQR), and values which appear outside this range are individually plotted and potentially considered outlier are considered.

#explore the combined dataset, to check there are no outliers or errors in dates or other columns
summary(equity_state)

##      name           Students with a disability All Commencing Domestic Students
##  Length:40          Min.   :  90.0             Min.   : 1233                   
##  Class :character   1st Qu.: 465.8             1st Qu.: 6121                   
##  Mode  :character   Median : 569.0             Median : 9690                   
##                     Mean   : 630.0             Mean   : 9369                   
##                     3rd Qu.: 799.5             3rd Qu.:12810                   
##                     Max.   :1329.0             Max.   :16171                   
##                                                                                
##             Type       location      age    
##  Public       :37   NSW    :10   Older : 4  
##  Private (Au) : 2   QLD    : 8   Middle: 6  
##  Private (Int): 1   VIC    : 8   Newer :30  
##                     WA     : 5              
##                     SA     : 3              
##                     ACT    : 2              
##                     (Other): 4

#par(mfrow = c(2,1))
boxplot(equity_state$`Students with a disability`, equity_state$`All Commencing Domestic Students`, 
        names = c("students with a disability", "domestic students"),
        main= "Higher Education 2019 enrolments", 
        ylab = 'Students')

There are no outliers ## create a new variable

The ordered factor of university age was created when data was imported. To further explore this dataset, a new variable %disable has been created using manually defined formula.

#Create a new variable (% of disabled students per university)----
equity_state$`%disable` <- (equity_state$`Students with a disability`/equity_state$`All Commencing Domestic Students`)*100

Transform

Z-score check for outliers

To additionally check of outliers a Z-score transformation was applied to the All Commending Domestic Students, Students with a disability and %disable columns. To confirm the data for these variables was normally distributed and thus able to be transformed using a z-score, the distribution was visually assessed by plotting a histogram and mathematically calculated using the Kolmogorov-Smirov test and the Shapiro-Wilk test. For the mathematical assessment the H0 is that the data is normally distributed, as such the p results of >0.05 indicate a lack of evidence to reject this hypothesis. As such there is no evidence the data for the three variable is anything other than normally distributed, and therefore a z-score transformation is appropriate. The transformation is completed using the scores function from the outliers package. Typically an outlier is a value which has a Z score of more than 3.

From the summary function, there are no minimum or maximum values of more than 3, thus according to the standard interpretation of a Z-score there are no outliers in this dataset. Additionally there are only three values which are more than 2 standard deviations away from the mean. Thus all data is retained for further analysis.

par(mfrow = c(1,3))

#check if the all commencing domestic students variable is normally distributed  
hist(equity_state$`All Commencing Domestic Students`,
        main = "Domestic Students \n commencing 2019", 
        xlab = "Students",
        ylab =  "Frequency")
shapiro.test(equity_state$`All Commencing Domestic Students`)

## 
##  Shapiro-Wilk normality test
## 
## data:  equity_state$`All Commencing Domestic Students`
## W = 0.96175, p-value = 0.1922

ks.test(equity_state$`All Commencing Domestic Students`, "pnorm", mean = mean(equity_state$`All Commencing Domestic Students`), sd = sd(equity_state$`All Commencing Domestic Students`))

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  equity_state$`All Commencing Domestic Students`
## D = 0.12013, p-value = 0.5695
## alternative hypothesis: two-sided

#check if the students with disability variable is normally distributed 
hist(equity_state$`Students with a disability`,
     main = " Students with a Disability", 
     xlab = "Students with a disablity",
     ylab =  "Frequency")
shapiro.test(equity_state$`Students with a disability`)

## 
##  Shapiro-Wilk normality test
## 
## data:  equity_state$`Students with a disability`
## W = 0.96039, p-value = 0.1727

ks.test(equity_state$`Students with a disability`, "pnorm", mean = mean(equity_state$`Students with a disability`), sd = sd(equity_state$`Students with a disability`))

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  equity_state$`Students with a disability`
## D = 0.15655, p-value = 0.2531
## alternative hypothesis: two-sided

#check to see if the new %disable variable is normally distributed 

hist(equity_state$`%disable`, 
     main = "%Disabled", 
     xlab = "%students with a disability",
     ylab =  "Frequency")

shapiro.test(equity_state$`%disable`)

## 
##  Shapiro-Wilk normality test
## 
## data:  equity_state$`%disable`
## W = 0.97994, p-value = 0.6872

ks.test(equity_state$`%disable`, "pnorm", mean = mean(equity_state$`%disable`), sd = sd(equity_state$`%disable`))

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  equity_state$`%disable`
## D = 0.086372, p-value = 0.9015
## alternative hypothesis: two-sided

#as the data is normally distributed it can be transformed using z scores

equity_state$zdom <- scores(equity_state$`All Commencing Domestic Students`, type = "z")
equity_state$zdis <- scores(equity_state$`Students with a disability`, type = "z")
equity_state$`z%` <- scores(equity_state$`%disable`, type = "z")

summary(equity_state$zdom)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.02402 -0.80808  0.07973  0.00000  0.85602  1.69214

summary(equity_state$zdis)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8297 -0.5565 -0.2066  0.0000  0.5744  2.3686

summary(equity_state$`z%`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0469 -0.4556 -0.1538  0.0000  0.6235  2.8091

length(which(abs(equity_state$zdom)>2))

## [1] 1

length(which(abs(equity_state$zdis)>2))

## [1] 2

length(which(abs(equity_state$`z%`)>2))

## [1] 2

Tranform scale

To more readily interpret the student number variable, it is converted into ‘thousands of students’. This involves dividing the All Commencing Domestic Students by 1000.

#transform student numbers to '000's of students. 
equity_state$students <- equity_state$`All Commencing Domestic Students`/1000

Outcome

The datset has been cleaned and is ready for further investigation. Examples of questions which could be answered with this data are included below.

plot(equity_state$`%disable` ~ equity_state$location,
     main = "Does state influence equity student enrolment?",
     xlab = "Location", 
     ylab = "Student Identifing as disabled",
     ylim = c(0,25))
     abline (h =20, col = "Blue", lwd =2)
     text (5, 22, "National average (disability)", col = "blue")
     text (6, 15, "UNE 12%")

     plot(equity_state$`%disable` ~ equity_state$students,
     main = "Does size influence equity student enrolment?",
     xlab = "Student number ('000s)", 
     ylab = "Student Identifing as disabled",
     ylim = c(0,25))
     abline (h =20, col = "Blue", lwd =2)
     text (5, 22, "National average (disability)", col = "blue")

     plot(equity_state$students ~ equity_state$location,
     main = "Where are larger universities located?",
     xlab = "Location", 
     ylab = "Student number ('000s)")

#     ylim = c(0,25))
#     abline (h =20, col = "Blue", lwd =2)
#     text (5, 22, "National average (disability)", col = "blue")
#     text (6, 15, "UNE 12%")
#Colour by state or age

References

List of universities in Australia. (2020). In Wikipedia. https://en.wikipedia.org/w/index.php?title=List_of_universities_in_Australia&oldid=982617437 Selected Higher Education Statistics – 2019 Student data | Department of Education, Skills and Employment. (n.d.). Retrieved 1 October 2020, from https://www.education.gov.au/selected-higher-education-statistics-2019-student-data Wickham, H. (2014). Tidy data. The American Statistician, 14. https://doi.org/10.18637/jss.v059.i10