MATH2349 Data Wrangling

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
# install.packages("forecast")

library(readxl)
library(readr)
library(plyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

Executive Summary

This report involves pre-processing two data sets: the Post School Destinations of Year 12 students from Victorian schools and the the results of Year 12 Students in Victoria. These would be referred to as the Post School and 2018 Outcomes data sets respectively. Upon inspecting the data sets it was determined that the former had column names that were values and was deemed to be an untidy data set while the latter consisted of tidy data.

A list of variable descriptions were created to describe the data sets and allowed desired data types to be inferred. The data, consisting of two separate .xlsx formatted files was read into R consequently were converted to either a factor or numerical type as appropriate per previous description.

The Post School data set was then tidied, converting it from wide to long form. The key variable was identified as the school name and was used to merge the observations of the two data sets together.

The data was then scanned for missing values. Only the four variables that had been merged from the 2018 Outcomes data set had missing values. This lead to the assumption that the missing values were due to the data sets referencing to the same school in different ways resulting in some observations not being matched.

The data was then scanned for outliers using box plots on the numerical variables that originated from the 2018 Outcomes data set. After finding outliers, the capping method was used to replace these values, however there were still some existing outliers in the result which may point to measurement errors for those cases.

Finally, the Perc_Over_40 variable was identified to have outliers using box plots and consequentially normalised using the capping method.

Data

This report consists of data taken from two different sources that observed Year 12 students of 2018:

Post School Destinations of Year 12 or equivalent completers in 2018, Victorian schools (Department of Education and Training 2020)
Senior Secondary Completion and Achievement Information, 2018 (Victorian Curriculum and Assessment Authority 2018)

For ease of reference, they will be referred to as the ‘Post School’ and ‘2018 Outcomes’ data sets respectively throughout this report.

The Post School data set was sourced from conducting the ‘On Track Year 12 completer’ survey between April to May 2019 on school-leavers who attended Year 12, the final year of high school, in a Victorian school in 2018. It captures how they have transitioned from school, particularly in regards to whether or not they had moved into further education and training. The data set treats Victorian schools as observations and was conducted by the Victorian Department of Education and Training.

The 2018 Outcomes data set provides the outcomes of Victorian students that completed their Year 12 examinations (VCE, VET or VCAL) in 2018 and treats Victorian schools as observations as well. It was conducted by the Victorian Curriculum and Assessment Authority.

This report will observe how Victorian students performed at different schools and whether or not their performance translates to what path they transitioned to after completing their final year.

The Post School data set consists of the following 14 variables:

VCAA Code: The unique identifying code that the VCAA uses to reference a school
School Name: The name of a Victorian school
Sector: Identifies the type of school (A = Adult, C = Catholic, I = Independent, G = Government)
Locality: The locality of the school within Victoria
Total Completed Year 12: The number of school-leavers that successfully completed Year 12
On Track Consenters: The number of school-leavers that successfully completed Year 12 and consented to taking the ‘On Track’ survey
On Track Respondents: The number of school-leavers that consented to taking the ‘On Track’ survey and provided responses
Bachelor Enrolled: The percentage of school-leavers that were enrolled in a Bachelor’s Degree and provided responses to the survey
Deferred: The percentage of school-leavers that deferred their univeristy course and provided responses to the survey
TAFE/VET Enrolled: The percentage of school-leavers that were enrolled in a TAFE or VET course and provided responses to the survey
Apprentice/Trainee: The percentage of school-leavers that were undertaking an Apprenticeship or were Trainees and provided responses to the survey
Employed: The percentage of school-leavers that were employed and not undertaking further education or training and provided responses to the survey
Other: The percentage of school-leavers that were not in the labour force, education or employment and could not be determined from the provided information from the survey

The 2018 Outcomes data set consists of 17 variables, but for the purposes of this analysis only the following five will be used:

School: The name of the Victorian School
Percentage of VCE students applying for tertiary places: The percentage of students undertaking the VCE that applied for tertiary places
Percentage of VCE completions in 2018: The percentage of students that completed the VCE in 2018
Median VCE study score: The median study score of a VCE unit, where the maximum is 50 and minimum is 0
Percentage of study scores of 40 and over: The percentage of study scores of 40 and over, where the maximum is 50 and minimum is 0

To assist with the join process, will convert the ‘School’ values of the data set ‘completed_2018’ to upper case to appropriately match cases with the ‘School’ values in the ‘destination_2019’ data set.

The data sets will be matched based on school names, thus based on the ‘School’ values.

The data set in .xlsx file format, renamed as “Destination_19.xlsx” for ease of reference and was read with the read_excel function and saved into the variable destination_2019. Pa

# Read in relevant data and rename columns appropriately 
post_school_2019 <- read_excel("post_school_2019.xlsx", 
                            sheet = 'SCHOOL PUBLICATION TABLE 2019', 
                            range = ("a8:n386"), 
                            col_names = c("VCAA_Code", "School_Name", "Sector", "Locality", "Total_Completed_Year_12", "On_Track_Consenters", "On_Track_Respondents", "Bachelors", "Deferred", "TAFE_VET", "Appr_Trainee", "Employed", "Looking_for_Work", "Other"));

head(post_school_2019);

# Read in 2018 Outcomes and rename col_names to be shorter and readable

outcomes_2018 <- read_excel("outcomes_2018.xlsx", 
                            range = ("a10:q593"),
                            col_names = c("School", "Adult", "Small", "Locality", "Num_of_3_4_Subj", "Num_of_VET_enrolments", "Int_Bacc_Availability", "Num_std_enrolled_1_VCE", "Num_std_enrolled_VET_cert", "Num_std_enrolled_VCAL", "Perc_applying", "Perc_completed_VCE", "Num_VCE_Bacc", "Perc_VET_completed", "Perc_VCAL_completed", "Median_Study_Score", "Perc_Over_40"));

head(outcomes_2018)

# Only select necessary variables and place into new dataframe

outcomes_2018_req <- select(outcomes_2018, `School`, `Perc_applying`, `Perc_completed_VCE`, `Median_Study_Score`, `Perc_Over_40` );

head(outcomes_2018_req);

Understand

By using the str function, we can determine how the variables have been read in appropriately and whether or not data type conversions will need to take place.

First, looking at the post_school_2019 dataframe.

# Check the data types of each variable in the post_school_2019 dataframe
str(post_school_2019)

## tibble [379 × 14] (S3: tbl_df/tbl/data.frame)
##  $ VCAA_Code              : chr [1:379] "01216" "14952" "14924" "11802" ...
##  $ School_Name            : chr [1:379] "ACADEMY OF MARY IMMACULATE" "AL SIRAAT COLLEGE" "ALKIRA SECONDARY COLLEGE" "ANTONINE COLLEGE" ...
##  $ Sector                 : chr [1:379] "C" "I" "G" "C" ...
##  $ Locality               : chr [1:379] "FITZROY" "EPPING" "CRANBOURNE NORTH" "PASCOE VALE SOUTH" ...
##  $ Total_Completed_Year_12: chr [1:379] "96" "21" "165" "58" ...
##  $ On_Track_Consenters    : chr [1:379] "96" "21" "87" "46" ...
##  $ On_Track_Respondents   : chr [1:379] "60" "13" "63" "24" ...
##  $ Bachelors              : chr [1:379] "68" "77" "51" "54" ...
##  $ Deferred               : chr [1:379] "12" "0" "2" "8" ...
##  $ TAFE_VET               : chr [1:379] "15" "15" "21" "25" ...
##  $ Appr_Trainee           : chr [1:379] "3" "8" "5" "0" ...
##  $ Employed               : chr [1:379] "2" "0" "17" "4" ...
##  $ Looking_for_Work       : chr [1:379] "0" "0" "5" "8" ...
##  $ Other                  : chr [1:379] "0" "0" "0" "0" ...

It appears that all variables have been converted into the character data type which is not what was desired. To allow for the appropriate labeling of different variable values, the VCAA_Code, School_Name, Sector and Locality variables need to be converted to type factor. The other variables need to be converted to the numeric data type.

# Convert correct variables to factor type
post_school_2019$VCAA_Code <- as.factor(post_school_2019$VCAA_Code)
post_school_2019$School_Name <- as.factor(post_school_2019$School_Name)
post_school_2019$Sector <- as.factor(post_school_2019$Sector)
post_school_2019$Locality <- as.factor(post_school_2019$Locality)

# Convert correct variables to numeric type
post_school_2019$Total_Completed_Year_12 <- as.numeric(post_school_2019$Total_Completed_Year_12)
post_school_2019$On_Track_Consenters <- as.numeric(post_school_2019$On_Track_Consenters)
post_school_2019$On_Track_Respondents <- as.numeric(post_school_2019$On_Track_Respondents)
post_school_2019$Bachelors <- as.numeric(post_school_2019$Bachelors)
post_school_2019$Deferred <- as.numeric(post_school_2019$Deferred)
post_school_2019$TAFE_VET <- as.numeric(post_school_2019$TAFE_VET)
post_school_2019$Appr_Trainee <- as.numeric(post_school_2019$Appr_Trainee)
post_school_2019$Employed <- as.numeric(post_school_2019$Employed)
post_school_2019$Looking_for_Work <- as.numeric(post_school_2019$Looking_for_Work)
post_school_2019$Other <- as.numeric(post_school_2019$Other)

# Check if converted correctly
str(post_school_2019)

## tibble [379 × 14] (S3: tbl_df/tbl/data.frame)
##  $ VCAA_Code              : Factor w/ 379 levels "01003","01004",..: 101 364 361 342 145 139 97 156 213 4 ...
##  $ School_Name            : Factor w/ 370 levels "ACADEMY OF MARY IMMACULATE",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sector                 : Factor w/ 4 levels "A","C","G","I": 2 4 3 2 3 2 3 3 2 3 ...
##  $ Locality               : Factor w/ 246 levels "ALBERT PARK",..: 81 78 58 176 170 187 4 6 115 100 ...
##  $ Total_Completed_Year_12: num [1:379] 96 21 165 58 16 256 19 39 172 37 ...
##  $ On_Track_Consenters    : num [1:379] 96 21 87 46 15 235 19 26 145 36 ...
##  $ On_Track_Respondents   : num [1:379] 60 13 63 24 11 133 13 20 90 19 ...
##  $ Bachelors              : num [1:379] 68 77 51 54 36 49 15 65 46 58 ...
##  $ Deferred               : num [1:379] 12 0 2 8 36 11 15 10 11 16 ...
##  $ TAFE_VET               : num [1:379] 15 15 21 25 0 15 15 15 17 5 ...
##  $ Appr_Trainee           : num [1:379] 3 8 5 0 9 14 8 5 12 5 ...
##  $ Employed               : num [1:379] 2 0 17 4 18 10 38 5 13 16 ...
##  $ Looking_for_Work       : num [1:379] 0 0 5 8 0 1 8 0 1 0 ...
##  $ Other                  : num [1:379] 0 0 0 0 0 0 0 0 0 0 ...

Now checking the outcomes_2018_req dataframe.

# Check the data types of each variable in the outcomes_2018_req dataframe
str(outcomes_2018_req)

## tibble [584 × 5] (S3: tbl_df/tbl/data.frame)
##  $ School            : chr [1:584] "Academy of Mary Immaculate" "Adass Israel School" "Advance College of Education" "Aitken College" ...
##  $ Perc_applying     : chr [1:584] "95.8" "-" "-" "89.2" ...
##  $ Perc_completed_VCE: chr [1:584] "100" "-" "-" "100" ...
##  $ Median_Study_Score: chr [1:584] "31" "-" "-" "30" ...
##  $ Perc_Over_40      : chr [1:584] "7.1" "-" "-" "4.8" ...

Again, all variables were read in as character variables and need to be converted appropriately. The School variable must be converted to a factor and all other variables to numeric. Noting that School will be the variable that joins the two dataframes, will first convert the case of the School variable before converting to a factor.

# Convert case of School variable to uppercase
outcomes_2018_req$School <- toupper(outcomes_2018_req$School)

# Convert variables into their appropriate data types
outcomes_2018_req$School <- as.factor(outcomes_2018_req$School)
outcomes_2018_req$Perc_applying <- as.numeric(outcomes_2018_req$Perc_applying)

## Warning: NAs introduced by coercion

outcomes_2018_req$Perc_completed_VCE <- as.numeric(outcomes_2018_req$Perc_completed_VCE)

## Warning: NAs introduced by coercion

outcomes_2018_req$Median_Study_Score <- as.numeric(outcomes_2018_req$Median_Study_Score)

## Warning: NAs introduced by coercion

outcomes_2018_req$Perc_Over_40 <- as.numeric(outcomes_2018_req$Perc_Over_40)

## Warning: NAs introduced by coercion

# Check if converted appropriately
str(outcomes_2018_req)

## tibble [584 × 5] (S3: tbl_df/tbl/data.frame)
##  $ School            : Factor w/ 567 levels "ACADEMY OF MARY IMMACULATE",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Perc_applying     : num [1:584] 95.8 NA NA 89.2 100 90.2 85.7 NA 75 38.9 ...
##  $ Perc_completed_VCE: num [1:584] 100 NA NA 100 100 100 100 NA 96 89 ...
##  $ Median_Study_Score: num [1:584] 31 NA NA 30 28 26 31 NA 24 24 ...
##  $ Perc_Over_40      : num [1:584] 7.1 NA NA 4.8 5.4 3.8 8.6 NA 4.2 0 ...

During the conversion of data types into numeric, the ‘-’ character was converted to NA.

Tidy & Manipulate Data I

The Post School data set is untidy as the different types of ‘In Education and Training’ (Bachelors, Deferred, TAFE/VET, Appr_entice/_Trainee) and ‘Not in Education and Training’ (Employed, Looking For Work, Other) are not variables, but are values that form the columns of the data set.

To tidy up the data, the gather() function will be used to transform the data from wide to long format for the variables mentioned above.

The ‘In Education and Training’ related values will be values in the new column (key) titled ‘Type of Education’ and the value will be stored in the column ‘Percentage in Education Type’.

# Using gather() on the 'In Education and Training' related variables
post_school_2019 <- post_school_2019 %>% gather(Bachelors, Deferred, TAFE_VET, Appr_Trainee, key = "Type_of_Education", value = "Percentage_in_Education_Type");

# Check data types of new columns
typeof(post_school_2019$Type_of_Education)

## [1] "character"

typeof(post_school_2019$Percentage_in_Education_Type)

## [1] "double"

# Convert the Type of Education to a factor
post_school_2019$Type_of_Education <- as.factor(post_school_2019$Type_of_Education)

The ‘Not in Education and Training’ related values will be values in the new column (key) titled ‘Type of Non-Education’ and the value will be stored in the column ‘Percentage in Non-Education type’

# Using gather() on the 'In Education and Training' related variables
post_school_2019 <- post_school_2019 %>% gather(Employed, Looking_for_Work, Other,  key = "Type_of_Non_Education", value = "Percentage_in_Non_Education_Type");

# Check data types of new columns
typeof(post_school_2019$Type_of_Non_Education)

## [1] "character"

typeof(post_school_2019$Percentage_in_Non_Education_Type)

## [1] "double"

# Convert the Type of Non-Education to a factor
post_school_2019$Type_of_Non_Education <- as.factor(post_school_2019$Type_of_Non_Education)

Data Revisited (merging data sets)

As the data sets are now both tidy, they will be merged via a left_join, into a new dataframe called school_leavers_2018. The School Name variable of post_school_2019 will be used with the School variable of outcomes_2018_req to the outcomes_2018_req data set variables to the schools that exist in the post_school_2019 dataframe.

# Combining data sets with left_join
school_leavers_2018 <- post_school_2019%>%left_join(outcomes_2018_req, by =c("School_Name"="School"))

# Produce head of new dataframe
head(school_leavers_2018)

Tidy & Manipulate Data II

It may be interesting to know the percentage of respondents from the total amount of consenters. As such, the mutate function will be used to create a new variable called Percentage Responded by dividing the On Track Respondents by the On Track Consenters and multiplying the result by 100% to convert to a percentage.

# Creating Percentage_Responded column using mutate function
# Multiply by 100 to convert to percentage and keep to 1 decimal place to be consistent with other percentages

school_leavers_2018 <- school_leavers_2018 %>% mutate(.,Percentage_Responded = round(On_Track_Respondents/On_Track_Consenters*100,1))

head(school_leavers_2018)

Scan I

The warnings generated during the Understand section of the report when applying type conversions indicate that there was missing data in the 2018 Outcomes data set. Specifically, these were represented by ‘-’ values and were were forcibly coerced into NAs.

Also, during the creation of the variable Percentage Responded, it is possible that the division produced NaN values if there were 0 On_Track_Consenters and 0 On_Track_Respondents. On that point, it may also be useful to check if there are any Inf or -Inf values as well.

To determine how prevalent NA values are within the entire dataframe school_leavers_2018, we can apply the is.na() function and obtain the frequency with the sapply() and sum() functions.

To determine how prevalent NA and special characters are within the combined dataframe school_leavers_2018 we can develop a function that counts the frequency of NA, NaN, Inf and -Inf values for numeric type data.

# Checking if there are any NA values in general across the entire dataframe
# Code sums up all NA values per variable in the dataframe
sapply(school_leavers_2018, function(x) sum(is.na(x)))

##                        VCAA_Code                      School_Name 
##                                0                                0 
##                           Sector                         Locality 
##                                0                                0 
##          Total_Completed_Year_12              On_Track_Consenters 
##                                0                                0 
##             On_Track_Respondents                Type_of_Education 
##                                0                                0 
##     Percentage_in_Education_Type            Type_of_Non_Education 
##                                0                                0 
## Percentage_in_Non_Education_Type                    Perc_applying 
##                                0                             1176 
##               Perc_completed_VCE               Median_Study_Score 
##                             1176                             1176 
##                     Perc_Over_40             Percentage_Responded 
##                             1176                                0

It appears that there are no NA values among the factor type variables. To see if there are any additional special values within the numeric data, we use the following:

# Create function that returns TRUE or FALSE depending if the read value is NA or a special value
is.specialorNA_numeric <- function(x) {
  if(is.numeric(x))(is.nan(x) | is.na(x) | is.infinite(x))
}

# Using sapply to apply across the entire school_leavers_2018 dataframe
# Combining with sum() to generate frequency 
sapply(school_leavers_2018, function(x) sum(is.specialorNA_numeric(x)))

##                        VCAA_Code                      School_Name 
##                                0                                0 
##                           Sector                         Locality 
##                                0                                0 
##          Total_Completed_Year_12              On_Track_Consenters 
##                                0                                0 
##             On_Track_Respondents                Type_of_Education 
##                                0                                0 
##     Percentage_in_Education_Type            Type_of_Non_Education 
##                                0                                0 
## Percentage_in_Non_Education_Type                    Perc_applying 
##                                0                             1176 
##               Perc_completed_VCE               Median_Study_Score 
##                             1176                             1176 
##                     Perc_Over_40             Percentage_Responded 
##                             1176                                0

Given that both checks have found there are 1176 missing values in both cases, it can be concluded that there are no special values within the data set. As the missing values occur across the Perc_applying, Perc_completed_VCE, Median_Study_Score and Perc_Over_40 variables, this is likely related to the left_join conducted earlier. That is, some School_Name variables in the original post_school_2019 dataframe either did not have matching observations as the 2018 Outcomes and Post School data sets may refer to the same school differently or that those schools simply did not exist in the 2018 Outcomes data set. The former is more likely as outcomes_2018_req captures 568 factor levels (schools) and the post_school_2019 dataframe only has 370 factor levels.

While there are total 4704 observations in the school_leavers_2018 dataframe, and that 1176 is greater than 5% of this, it seems necessary to remove observations where all four variables are NA as these lack the meaningful data required for this report’s intended analysis.

As there are no other NA or special values in the data set, we can use the complete.cases() function to remove these values. We can then check the completeness of this new data by assigning it to a new dataframe and applying the is.na() across it.

# Store only complete observations in new dataframe
school_leavers_2018_complete <- school_leavers_2018[complete.cases(school_leavers_2018),]

# Check new dataframe that it is complete
sapply(school_leavers_2018_complete, function(x) sum(is.na(x)))

##                        VCAA_Code                      School_Name 
##                                0                                0 
##                           Sector                         Locality 
##                                0                                0 
##          Total_Completed_Year_12              On_Track_Consenters 
##                                0                                0 
##             On_Track_Respondents                Type_of_Education 
##                                0                                0 
##     Percentage_in_Education_Type            Type_of_Non_Education 
##                                0                                0 
## Percentage_in_Non_Education_Type                    Perc_applying 
##                                0                                0 
##               Perc_completed_VCE               Median_Study_Score 
##                                0                                0 
##                     Perc_Over_40             Percentage_Responded 
##                                0                                0

As there is 0 frequency across every variable, we can conclude that all observations are now complete.

Scan II

For the purposes of this report, the following four numerical variables were chosen to determine whether there were any outliers within the results and behaviours of the students that completed their VCE in 2018:

Perc_applying
Perc_completed_VCE
Median_Study_Score
Perc_Over_40

Boxplots were first used to determine whether any outliers existed for any of these variables.

# Percentage of Students Applying to University Box Plot

boxplot(school_leavers_2018_complete$Perc_applying,
        main = "Percentage of Students applying to University Box Plot")

# Percentage of Students that Successfully Completed VCE Box Plot

boxplot(school_leavers_2018_complete$Perc_completed_VCE,
        main = "Percentage of Students that Successfully Completed VCE Box Plot")

# Median VCE Study Score Box Plot

boxplot(school_leavers_2018_complete$Median_Study_Score,
        main = "Median VCE Study Score Box Plot")

# Percentage of VCE Scores over 40 Box Plot

boxplot(school_leavers_2018_complete$Perc_Over_40,
        main = "Percentage of VCE Scores over 40 Box Plot")

The box plots demonstrate that each variable has outliers. We can see that besides from the final plot where outliers will likely positively skew the distribution of percentage of VCE scores over 40, the outliers will negatively skew the distribution of the other variables.

To reduce the effect of the outliers, we can use the capping method to replace outliers below the lower limit with the 5th percentile value and ourliers above the upper limit with the 95th percentile value of the numerical variables.

To achieve this, we first define the capping function:

# Define capping function
cap <- function(x){
    quantiles <- quantile(x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

Then we create dummy variables to apply capping without effecting the true data:

# Applying capping to the numerical variables

school_leavers_2018_complete$Perc_applying_dummy <- school_leavers_2018_complete$Perc_applying %>% cap()
school_leavers_2018_complete$Perc_completed_VCE_dummy <- school_leavers_2018_complete$Perc_completed_VCE %>% cap()
school_leavers_2018_complete$Median_Study_Score_dummy <- school_leavers_2018_complete$Median_Study_Score %>% cap()
school_leavers_2018_complete$Perc_Over_40_dummy <- school_leavers_2018_complete$Perc_Over_40 %>% cap()

To check the results of capping, we produce the boxplots of the variables again:

# Percentage of Students Applying to University Box Plot
boxplot(school_leavers_2018_complete$Perc_applying_dummy,
        main = "Percentage of Students applying to University Box Plot")

# Percentage of Students that Successfully Completed VCE Box Plot
boxplot(school_leavers_2018_complete$Perc_completed_VCE_dummy,
        main = "Percentage of Students that Successfully Completed VCE Box Plot")

# Median VCE Study Score Box Plot
boxplot(school_leavers_2018_complete$Median_Study_Score_dummy,
        main = "Median VCE Study Score Box Plot")

# Percentage of VCE Scores over 40 Box Plot
boxplot(school_leavers_2018_complete$Perc_Over_40_dummy,
        main = "Percentage of VCE Scores over 40 Box Plot")

This has drastically reduced the number of outliers, but as outliers are still present this may be indicative that some extreme outliers existed which would prompt the need to investigate if they were due to measurement error.

Transform

We first observe the distribution of the variable Perc_Over_40 by plotting its histogram and Q-Q plots.

# Histogram of Percentage of VCE Study Scores over 40
hist(school_leavers_2018_complete$Perc_Over_40, main="Histogram of Percentage of VCE Study Scores over 40",
     xlab = "Percentage of VCE Study Scores over 40")

# Q-Q Plot of Percentage of VCE Study Scores over 40
school_leavers_2018_complete$Perc_Over_40%>%qqPlot(dist="norm")

## [1] 117 411

Judging from the right-skewed histogram and the deviation of the Q-Q plot, the variable is not suitable for data analysis. It would be preferred if it was normalised. As such we can apply the Box-Cox transformation.

boxcox_Perc_Over_40<- BoxCox(school_leavers_2018_complete$Perc_Over_40,lambda = "auto")
hist(boxcox_Perc_Over_40, main="Normalised Histogram of Percentage of VCE Study Scores over 40",
     xlab="Percentage of VCE Study Scores over 40")

Creating a Q-Q plot of the normalised data set to check:

# Q-Q Plot of Percentage of VCE Study Scores over 40
boxcox_Perc_Over_40%>%qqPlot(dist="norm")

## [1] 25 36

As the values hug the line more closely it can be seen that the Box-Cox transformation has effectively normalised the data.

References

State Government of Victoria 2018, 2018 On Track Year 12 or equivalent completers post school destinations, Victoria, data file, Australian Government, Department of Education and Training, Victoria, viewed 20 October 2020, http://education.vic.gov.au/Documents/about/research/datavic/dv293-destinationdata2018.xlsx

Victorian Curriculum and Assessment Authority 2018, Senior Secondary Completion and Achievement Information, 2018, data file, Australian Government, Victorian Curriculum and Assessment Authority, Victoria, viewed 20 October 2020, https://vcaa.vic.edu.au/Documents/statistics/2018/2018_Senior_Secondary_Completion_and_Achievement_Information.xlsx