Data Wrangling (Data Preprocessing)

Load Packages

Firstly, I loaded all the essential packages to get ready for the data wrangling steps.

library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)

Executive Summary

This report focuses on preprocessing language enrolment data stored in Victorian public database. Below steps were taken to the final tidy data:
- Data: Data was sourced on public website and loaded into RStudio;
- Understand: Understand data structure and convert variables into appropriate type;
- Tidy & Manipulate Data: data was reshaped following tidy data principles using tidyr package tools, and manipulated using dplyr package tools to join the two data sets;
- Scan: data was scanned and checked for missing values, special values, and outliers;
- Transformation: data transformation was conducted using transformation techniques.

Data

I found two data sets from www.discover.data.vic.gov.au/dataset website. The data sets contain enrolment of language program in Victorian schools. They were stored in CSV format, so I downloaded and then loaded into RStudio.

Both data sets have below six variables:
- Language: the name of the language program enrolled in Victorian government schools
- SWV, NWV, NEV, SEV: the names of four regions in Victoria
- Total: the total enrolment across four regions in that specific language program

# Load the two data sets skipping the first row
Primary <- read_csv("Data/49-dv primary enrolments 2012.csv", skip = 1)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Language = col_character(),
##   SWV = col_number(),
##   NWV = col_number(),
##   NEV = col_number(),
##   SEV = col_number(),
##   Total = col_number()
## )

Secondary <- read_csv("Data/50-dv secondary enrolments 2012.csv", skip = 1)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Language = col_character(),
##   SWV = col_number(),
##   NWV = col_number(),
##   NEV = col_number(),
##   SEV = col_number(),
##   Total = col_number()
## )

# Save as dataframe
Primary <- as.data.frame(Primary)
Secondary <- as.data.frame(Secondary)

Understand

The data is about the enrolments of language study programs in Victorian schools. Data was collected in 2012 across four regions of Victoria. The “Primary” data records the total enrolments in language study in Victorian government primary schools and the “Secondary” data records the total enrolments in language study in Victorian government Secondary schools.
- “Primary” data has 20 observations and 6 variables
- “Secondary” data has 16 observations and 6 variables.

Data view shows that two data sets have similar structure:
- One character variable showing the list of Language program enrolled
- Four regions variables are numeric showing the count of enrolments in each language program
- “Total” column is the sum of enrolment across four regions in each language program
- “Total” row is the sum of enrolment of all language programs in each region

# View data sets
head(Primary)

head(Secondary)

# Check data structure
str(Primary)

## 'data.frame':    21 obs. of  6 variables:
##  $ Language: chr  "Italian" "Indonesian" "Japanese" "Chinese (Mandarin)" ...
##  $ SWV     : num  11223 9497 8995 3733 1934 ...
##  $ NWV     : num  13611 9431 1641 1380 2185 ...
##  $ NEV     : num  9893 7422 12554 10512 4346 ...
##  $ SEV     : num  8121 9342 11860 3743 6236 ...
##  $ Total   : num  42848 35692 35050 19368 14701 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Language = col_character(),
##   ..   SWV = col_number(),
##   ..   NWV = col_number(),
##   ..   NEV = col_number(),
##   ..   SEV = col_number(),
##   ..   Total = col_number()
##   .. )

str(Secondary)

## 'data.frame':    17 obs. of  6 variables:
##  $ Language: chr  "French" "Italian" "Indonesian" "Japanese" ...
##  $ SWV     : num  3362 6986 5184 5831 2413 ...
##  $ NWV     : num  3341 7856 4414 1527 1612 ...
##  $ NEV     : num  6484 2710 3217 4134 3386 ...
##  $ SEV     : num  5727 947 5560 5753 3538 ...
##  $ Total   : num  18914 18499 18375 17245 10949 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Language = col_character(),
##   ..   SWV = col_number(),
##   ..   NWV = col_number(),
##   ..   NEV = col_number(),
##   ..   SEV = col_number(),
##   ..   Total = col_number()
##   .. )

I converted character variable “Language” to factor and confirmed the class of numeric variables. I don’t need to order the factor levels as the language is ordered by total count of enrolment and its nature does not require order.

# Convert character variable to factor, and confirm the class of numeric variables
Primary$Language <- as.factor(Primary$Language)
class(Primary$Language)

## [1] "factor"

is.numeric(Primary$SWV)

## [1] TRUE

Secondary$Language <- as.factor(Secondary$Language)
class(Secondary$Language)

## [1] "factor"

is.numeric(Secondary$Total)

## [1] TRUE

Tidy & Manipulate Data I

Looking at the both data sets, I believe they are untidy because in this case, a long format would suit wrangling purpose better than a wide format in this case. The 2nd, 3rd, 4th, and 5th columns have headings of four regions but values of enrolments. In other words, the enrolment information is stored in multiple columns (ie. 4 regions and Total), which does not meet definition of a tidy data set (Wickham and Grolemund (2016)) that requires:
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell

So I need to tidy up the two data sets before I join them.

Firstly, I noticed that the lowest total enrolment in the Secondary data is 79, while the lowest total enrolment in the Primary data is 3, and there are more language programs enrolled in the Primary data than in Secondary data. So I decided to remove languages with number of enrolment in Primary data smaller or equal to the minimum number of enrolment in the Secondary data.

# Remove low enrolments in Primary data based on the lowest enrolment in Secondary data
Primary <- Primary %>% filter(Total >= min(Secondary$Total))

Secondly, I need to remove “Total” variable in column, and “Total” observation in row because they are not needed for this study.

# Remove Total in column and row
Primary <- Primary %>% filter(Total < max(Total)) %>% select(-Total)
Secondary <- Secondary %>% filter(Total < max(Total)) %>% select(-Total)

Thirdly, I need to gather four region columns to convert data from wide format to long format, so the data sets are tidy for joining. Therefore, I have two variables - “Language” and “Region”, which will be used to join the two data sets.

# Convert data to tidy version
Primary_sub <- gather(Primary, "Region", "Prim_n", 2:5)

Secondary_sub <- gather(Secondary, "Region", "Second_n", 2:5)

# Convert Region variable to factor
Primary_sub$Region <- factor(Primary_sub$Region)

Secondary_sub$Region <- factor(Secondary_sub$Region)

# View data sets
head(Primary_sub)

head(Secondary_sub)

Lastly, the two data sets are tidy and ready to join. Because the purpose of this analysis is to study the languages enrolled in both primary and secondary programs. So I decided to join the two data sets using inner_join() function in R so it produces data sets containing language enrolled in both primary and secondary schools.

# Join two data sets using dplyr::inner_join()
LangEnrol <- inner_join(Primary_sub, Secondary_sub, by = c("Language", "Region"))

# View data and check data structure
head(LangEnrol)

str(LangEnrol)

## 'data.frame':    52 obs. of  4 variables:
##  $ Language: Factor w/ 23 levels "Aboriginal Languages",..: 11 10 12 4 6 3 7 18 8 20 ...
##  $ Region  : Factor w/ 4 levels "NEV","NWV","SEV",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Prim_n  : num  11223 9497 8995 3733 1934 ...
##  $ Second_n: num  6986 5184 5831 1636 3362 ...

Tidy & Manipulate Data II

To quantify the enrolment in both primary and secondary schools, I need to create two new variables based on the existing enrolment variables in the joined data set. I will use mutate() function in R to create a “Total” variable and a “Drop” rate variable by keeping the original variables in the data set. My “Total” variable denotes the total enrolment in language programs through out primary and secondary schools, while “Drop” rate variable represents the proportion of enrolment dropped out during secondary school during the transition from primary school.

# Create Total enrolment
LangEnrol <- LangEnrol %>% mutate(Total = Second_n + Prim_n)

# Create Drop rate
LangEnrol <- LangEnrol %>% mutate(Drop = ((Second_n - Prim_n) / Second_n) * 100)

# View data set
head(LangEnrol)

Finally, my data set contains 52 observations of 6 variables. I run descriptive analysis just to get an overview of my final data and then arrange the summary statistics by Mean values in descending order. So I conclude that the most popular language programs enrolled in both primary and secondary schools are Italian, Indonesian, Japanese, French, Chinese Mandarin, and German.

# Descriptive analysis by grouping results into language categories
Summary <- LangEnrol %>% group_by(Language) %>% summarise(Min = min(Total, na.rm = TRUE), Q1 = quantile(Total, probs = .25, na.rm = TRUE), Median = median(Total, na.rm = TRUE), Mean = round(mean(Total, na.rm = TRUE), 2), Q3 = quantile(Total, probs = .75, na.rm = TRUE), Max = max(Total, na.rm = TRUE))

# Arrange summary data by mean in descending order
Summary <- Summary %>% arrange(desc(Mean))

Scan I

The summary statistics show that there is great levels of difference between min and max values (ie. enrolments for each language across regions). I assumed it might be due to high level of skewness or inconsistencies in data. So it’s necessary to scan and detect if there are any missing or special values, or outliers in data set.

I created a function and use sapply() to scan data for missing or special values. This is because is.na() function or sapply(df, is.special) function would produce long output of logical values by corresponding to every value in the data set. Also, sapply() can apply any function to a list and data frames have the characteristics of both lists and matrices (Data Wrangling Module 5 (May 2021)). So my data set was checked on every numerical column by using a function “is.special” with sapply() function together.

# Create a function for scanning purpose 
is.special <- function(x){
  if(is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

Apart from Drop rate variable, there is no missing value or special values found. Due to computation, Drop variable shows 8 NaN values and 5 -Inf values. I checked the dataset and understood:
- the -Inf value was due to the calculation of being divided by 0 (ie., 0 value in Secondary enrolment but not 0 value in Primary enrolment)
- the NaN value was due to two 0 values in both Primary and Secondary variables.

# Apply function to check missing value
sapply(LangEnrol, function(x) sum(is.na(x)))

## Language   Region   Prim_n Second_n    Total     Drop 
##        0        0        0        0        0        8

# Apply function to check special values Inf or -Inf
sapply(LangEnrol, function(x) sum(is.infinite(x)))

## Language   Region   Prim_n Second_n    Total     Drop 
##        0        0        0        0        0        5

# Apply function to check special values NaN
sapply(LangEnrol, function(x) sum(is.nan(x)))

## Language   Region   Prim_n Second_n    Total     Drop 
##        0        0        0        0        0        8

Regarding the -Inf value, if I replace this special value with mean or other value, it would distort the true enrolment of language program. I also don’t want to just remove them because I don’t want to lose enrolment information in Primary schools. So I decided to keep them but acknowledge its existence.

Regarding the NaN value, I decided to remove them because there is 0 enrolment of that language program in that region at both Primary and Secondary schools.

## Remove NaN value
LangEnrol <- LangEnrol[!is.nan(LangEnrol$Drop), ]

Scan II

I use Tukey’s method of outlier detection (boxplot) to detect univariate outliers. But I need to ensure the numeric variables are not normal distributions. So I run histogram to give a quick check to see it variables meet the assumption of Tukey’s method.

# Set graphical parameters to put multiple graphs in a single plot
par(mfrow = c(2, 2))

# Generate Histogram to check if distribution is normal
hist(LangEnrol$Prim_n, main = "Histogram of Primary Enrolment", xlab = "Primary Enrolment")
hist(LangEnrol$Second_n, main = "Histogram of Second Enrolment", xlab = "Second Enrolment")
hist(LangEnrol$Total, main = "Histogram of Total Enrolment", xlab = "Total Enrolment")
hist(LangEnrol$Drop, main = "Histogram of Drop Rate", xlab = "Drop Rate")

Histgram charts show all four numeric variables are non-symmetric / non-normal distributions. So it is safe to use Tuke’s method of outlier detection.

# Set graphical parameters to put multiple graphs in a single plot
par(mfrow = c(2, 2))

# Tukey's boxplot to check outliers
LangEnrol$Prim_n %>% boxplot(main = "Boxplot of Primary Enrolment", ylab = "Primary Enrolment")
LangEnrol$Second_n %>% boxplot(main = "Boxplot of Secondary Enrolment", ylab = "Secondary Enrolment")
LangEnrol$Total %>% boxplot(main = "Boxplot of Total Enrolment", ylab = "Total Enrolment")
LangEnrol$Drop %>% boxplot(main = "Boxplot of Drop Rate", ylab = "Drop Rate")

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn

The boxplot charts show the existence of two outliers below lower outlier fence (ie., below -1.5*IQR) in Drop rate variable, but there is no outliers in Prim_n, Second_n, and Total variables.

Some studies suggest that if the outlier is due to data entry error, data processing error or outlier observations are very small in numbers, then leaving out or deleting the outliers (Data Wrangling Module 6 (May 2021)). I understand the two outliers represent a very low number of enrolment in Secondary program but a high number of enrolment in Primary program. I decided to use Capping (a.k.a Winsorising) method to replace the outliers below the lower IQR with the value of 5th percentile.

# Define a function to cap the values outside the limits
cap <- function(x){
quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
x[ x < quantiles[2] - 1.5 *IQR(x)] <- quantiles[1]
x[ x > quantiles[3] + 1.5 *IQR(x)] <- quantiles[4]
x}

# Drop variable capped
Drop_capped <- LangEnrol$Drop %>% cap()

Transform

Data transformation is needed for my data to decrease the skewness and convert the distribution into a normal distribution, because I can see great level of skewness in my data based on the histogram plot in Scan II section of this report. All three enrolment variables (ie., Prim_n, Second_n, and Total enrolment) are clearly right skewed and non-symmetric distributions. Because my Total enrolmet is computed based on Prim_n and Second_n variables, I decided to do transformation on Total variable. So I tried below transformation methods in order to select the best method from the output.

# Set graphical parameters to put multiple graphs in a single plot
par(mfrow = c(2, 2))

# The raw Total variable
hist(LangEnrol$Total, main = "Histogram of Total Enrolment", xlab = "Total Enrolment")

# The Log transformation
log_total <- log10(LangEnrol$Total)
hist(log_total, main = "Histogram of log_total", xlab = "log_total")
# The natural logarithm
ln_total <- log(LangEnrol$Total)
hist(ln_total, main = "Histogram of ln_total", xlab = "ln_total")
# The square root transformation
sqrt_total <- sqrt(LangEnrol$Total)
hist(sqrt_total, main = "Histogram of sqrt_total", xlab = "sqrt_total")

The square root transformation has adjusted the Total variable distribution a little better, while log transformation did not improve Total variable distribution. So, it’s consistent to what I’ve learned about square root transformation is effective for right skewed distribution.

References

Data source: http://www.discover.data.vic.gov.au/dataset

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".

Data Wrangling Module 5 (May 2021) http://rare-phoenix-161610.appspot.com/secured/Module_05.html#Learning_Objectives

Data Wrangling Module 6 (May 2021) http://rare-phoenix-161610.appspot.com/secured/Module_06.html#Excluding_or_Deleting_Outliers