Amit Singh s3868685
Last updated: 28 May, 2023
RPubs URL: https://rpubs.com/amnits/Datavisassign3Final
In recent years, Government of Victoria has branded Victoria as ‘Education state of Australia’.
Often, the key performance index (KPI) of Full Time Enrolled (FTE) students density/ ratio per school, is considered as important criteria to gauge quality of education in Victoria schools and to measure school development facilities. It is a reflection of shared educational, staff, infrastructure, by enrolled students in school.
KPI of FTE Students density/ ratio is evaluated across all 3 category school types ie. Government (gov.), Catholic and Independent and analyzed across all 80 Victorian Local Government Area (LGA) schools for the year 2022.
Recent decline in Victorian school enrollments for catholic schools have been reported in various media.
Through data visualization, it is evident that independent schools generally have highest FTE students ratio, across Victoria LGA’s.
It is not very evident about the FTE students ratio of Government and catholic category school types, and which is having better KPI ratio than other.
Statistics and hypothesis testing is done to evaluate how Catholic schools fare on this KPI criteria of FTE students ratio/ density, in comparison to government schools.
#Data import and scanning followed by data type conversions
schoolsenrol <- read_excel("C:/Users/praam/OneDrive - RMIT University/schoolsenrol.xlsx", sheet = "LGA Data", skip = 9)
#Data re import and removing unnecessary rows and columns
schoolsenrol_use <- read_excel("schoolsenrol_use.xlsx")
schools1 <- schoolsenrol_use[c(-81),]
#Adding proper column names to data set
colnames(schools1) <- c("LGA_name", "Gov_Schools_Enrols", "Gov_Schools_No.", "Catholic_Schools_Enrols", "Catholic_Schools_No.", "Indep_Schools_Enrols", "Indep_Schools_No.", "Total_Schools_Enrols", "Total_Schools_No.")
#Conversion of data types
schools1$"LGA_name" <- as.factor(schools1$"LGA_name")
schools1$"Gov_Schools_Enrols" <- as.numeric(schools1$"Gov_Schools_Enrols")
schools1$"Gov_Schools_No." <- as.numeric(schools1$"Gov_Schools_No.")
schools1$"Catholic_Schools_Enrols" <- as.numeric(schools1$"Catholic_Schools_Enrols")
schools1$"Catholic_Schools_No." <- as.numeric(schools1$"Catholic_Schools_No.")
schools1$"Indep_Schools_Enrols" <- as.numeric(schools1$"Indep_Schools_Enrols")
schools1$"Indep_Schools_No." <- as.numeric(schools1$"Indep_Schools_No.")
schools1[is.na(schools1)] <- 0 # identify NAs in full data frame and imputing them by 0
head(schools1) ##Preparation of data sets and adding columns for ratio/ density ie. to calculate No. of students per school
school_ratio <- schools1 %>% add_column(Gov_Schools_Ratio = (schools1$Gov_Schools_Enrols/ schools1$Gov_Schools_No.),.after = "Gov_Schools_No.")
school_ratio1 <- school_ratio %>% add_column(Catholic_Schools_Ratio = (school_ratio$Catholic_Schools_Enrols/ school_ratio$Catholic_Schools_No.),.after = "Catholic_Schools_No.")
school_ratio2 <- school_ratio1 %>% add_column(Indep_Schools_Ratio = (school_ratio1$Indep_Schools_Enrols/ school_ratio1$Indep_Schools_No.),.after = "Indep_Schools_No.")
school_ratio3 <- school_ratio2 %>% add_column(Total_Schools_Ratio = (school_ratio2$Total_Schools_Enrols/ school_ratio2$Total_Schools_No.),.after = "Total_Schools_No.")
# identify NAs in full data frame and imputing them by 0
school_ratio3[is.na(school_ratio3)] <- 0
#Ranking the LGA's as per descending order of Victoria LGA
school_ratio_rank <- school_ratio3 %>% add_column(Total_Schools_No._Rank = rank(desc(school_ratio3$Total_Schools_No.),ties.method = "min"),.before = "LGA_name")
#Reordering of columns for likewise categories
school_ratio_rank2 <- school_ratio_rank[, c(1,2,13,4,7,10,12,3,6,9,14,5,8,11)]
#Subsetting dataset
school_ratio_rank4 <- school_ratio_rank2[, c(1,2,11,12,13,14)]
school_ratio_rank4_long <- gather(school_ratio_rank4, key = "Variable", value = "Students_Ratio", Total_Schools_Ratio:Indep_Schools_Ratio)
#Conversion of data type
school_ratio_rank4_long$Variable <- as.factor(school_ratio_rank4_long$Variable)
school_ratio_rank4_long$Rank_labels <- c(rank(-school_ratio_rank4$Total_Schools_Ratio,ties.method = "min"),
rank(-school_ratio_rank4$Gov_Schools_Ratio,ties.method = "min"), rank(-school_ratio_rank4$Catholic_Schools_Ratio,ties.method = "min"),
rank(-school_ratio_rank4$Indep_Schools_Ratio,ties.method = "min"))
school_ratio_rank4_long$Variable <- factor(school_ratio_rank4_long$Variable,levels=c("Total_Schools_Ratio","Gov_Schools_Ratio","Catholic_Schools_Ratio","Indep_Schools_Ratio"),labels = c("Average of Total Schools","Government Schools","Catholic Schools", "Independent Schools"))
#Rearranging dataset for statistical testing of government and catholic schools student ratios
school_test <- school_ratio_rank4[, c(4,5)]
school_test_long <- gather(school_test, key = "School_Type", value = "Students_Ratio", Gov_Schools_Ratio:Catholic_Schools_Ratio)
#Rounding off the ratios and assigning factors
school_test_long$Students_Ratio <- round(school_test_long$Students_Ratio,2)
school_test_long$School_Type <- factor(school_test_long$School_Type,labels = c("Catholic","Government"))#Interactive visualization for KPI across all 80 Victorian LGA's, 2022
p1 <- plot_ly(data = school_ratio_rank4_long, x = ~Students_Ratio, y= ~LGA_name,size = ~Students_Ratio , alpha = 2, symbol = ~Variable, color = ~Variable, type = "scatter", mode = "markers", colors= c("#E7298A","#D95F02","#1B9E77","#7570B3")) %>% layout(title ="FTE Students Ratio (categorical), Victoria LGA - 2022",xaxis = list(zeroline = FALSE, title = "Ratio (No. of FTE Students per school)"), yaxis = list(zeroline = FALSE, title = "Victoria LGA"))
p1#Summarizing of test dataset for statistical analysis
school_test_long %>% group_by(School_Type) %>% summarise(Min = min(Students_Ratio,na.rm = TRUE),
Q1 = quantile(Students_Ratio,probs = .25,na.rm = TRUE),Median = median(Students_Ratio, na.rm = TRUE),Q3 = quantile(Students_Ratio,probs = .75,na.rm = TRUE),Max = max(Students_Ratio,na.rm = TRUE),
Mean = mean(Students_Ratio, na.rm = TRUE),SD = sd(Students_Ratio, na.rm = TRUE),n = n(),Missing = sum(is.na(Students_Ratio)))-> table1
knitr::kable(table1)| School_Type | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Catholic | 0 | 165.2500 | 335.115 | 461.6375 | 711.40 | 331.3274 | 193.2953 | 80 | 0 |
| Government | 10 | 143.8375 | 310.705 | 499.5975 | 1039.89 | 336.0165 | 205.7019 | 80 | 0 |
#Statistically visualizing using box plot
school_test_long %>% boxplot(Students_Ratio ~ School_Type, data = ., ylab = "Students_Ratio")#Student ratio distributions for catholic schools using the qqPlot() function.
school_Catholic <- school_test_long %>% filter(School_Type == "Catholic")
school_Catholic$Students_Ratio %>% qqPlot(dist="norm")## [1] 47 64
#Student ratio distributions for government schools using the qqPlot() function.
school_Government <- school_test_long %>% filter(School_Type == "Government")
school_Government$Students_Ratio %>% qqPlot(dist="norm")## [1] 77 45
Visualization and checking of normality using Q−Q plots, for catholic schools indicates all points appear to be following a normal distribution, with data points falling closely to the diagonal line.
For government school, ratio data isn’t so clear, with points falling outside the tails of distribution indicating heavier tails than a normal distribution.
However, using Central Limit Theorem (CLT), we know that the sampling distribution of a mean will be approximately normally distributed, when the sample size is large (i.e. n>30).
Since the data size is large = 80 (ie. n>30), for each catholic and government school category, normality checking can be effectively ignored for government school ratio, and we can proceed with the two-sample t-test Hypothesis Testing.
The two-sample t-test will further help us to consider whether the difference between catholic and gov. schools student ratio is statistically significant or not.
The p-value for the Levene’s test of equal variance for school ratio between catholic and gov. schools was p=0.30, ie. we find p>.05, and therefore, we FAIL to reject H0.In plain language, we are safe to assume equal variance.
The assumption of equal variance is important because it will determine the type of two-sample t-test we will perform, which is a Two-sample t-test - Assuming Equal Variance in this case.
The two-sample t-test has the following statistical hypotheses: \[H_0: \mu_1 - \mu_2 = 0 \] \[H_A: \mu_1 -\mu_2 \ne 0 \] where, μ1 and μ2,refer to the population means of catholic and gov. schools student ratios respectively.
#Two-sample t-test - Assuming Equal Variance
t.test(Students_Ratio ~ School_Type,
data = school_test_long,
var.equal = TRUE,
alternative = "two.sided")##
## Two Sample t-test
##
## data: Students_Ratio by School_Type
## t = -0.14858, df = 158, p-value = 0.8821
## alternative hypothesis: true difference in means between group Catholic and group Government is not equal to 0
## 95 percent confidence interval:
## -67.02056 57.64231
## sample estimates:
## mean in group Catholic mean in group Government
## 331.3274 336.0165
## [1] -0.6760457
As the test statistic t from the two-sample t-test assuming equal variance, is t=−0.148, which is NOT more extreme than −0.676, we FAIL to reject H0. According to the critical value method, there was NOT statistically significant difference between catholic and government schools student ratio means.
The p-value of the two-sample t-test tell us the probability of observing a sample difference between the means of -4.6891, or one more extreme, assuming the difference was 0 in the population (i.e. H0 is true). The two-tailed p-value is reported to be p=.882. According to the p-value method, as p=.882 > α = 0.05, we FAIL to reject H0. Therefore, there was NOT statistically significant difference observed between the means of catholic and government schools student ratio.
The 95% CI of the difference between the means (- 4.689) was calculated ie. H0: μ = -4.689 Since, 95% CI [-67.02 57.64] interval does capture H0, we FAIL to reject it. Once again, there was NOT a significant difference between the means of student ratios of catholic and government schools.
A two-sample t-test was used to test for a significant difference between the means of student ratios of Victoria catholic and gov. schools. The student ratio for gov. schools exhibited evidence of non-normality upon inspection of normal Q-Q plot.
The central limit theorem ensured that t-test could be applied due to large sample size in each group. Also, the Levene’s test of homogeneity of variance indicated that equal variance could be assumed.
Using all 3 approaches of testing, it was deduced that t(df=158)= −0.148, p=.882, 95% CI for the difference in means [-67.02 57.64]. Results of the two-sample t-test assuming equal variance found a NOT statistically significant difference between the mean student density/ ratios of catholic and government schools of Victoria LGA’s.
KPI of student density/ ratio per school shall not be considered as the only criteria to determine quality of education in schools. It is evident from investigation, that in-spite of high student ratio generally observed for independent schools, student enrollments are categorically increasing.[2]
Victoria Education department shall make brisk efforts to plan and bridge school educational gaps, by opening new schools in laggards LGA’s such as Wyndham which are high population growth LGA’s and are having high student density/ ratio KPI per school.
Sincere efforts shall also be made by educational department to improve quality of education in catholic and government schools across LGA’s to stop further decline of student enrollments and to showcase its model of “The Education State” to Australia.
[1] Data Reference: Statistics on Victorian schools. 21/05/2023. Retrieved May 21, 2023 from Statistics on Victorian schools and teaching website https://www.vic.gov.au/statistics-victorian-schools-and-teaching
[2] Falling enrolments put Catholic schools at risk of closure. Cath News article. Accessed 10 May, 2023. https://cathnews.com/2023/05/10/falling-enrolments-put-catholic-schools-at-risk-of-closure/