RPubs link information

RPubs URL: https://rpubs.com/amnits/Datavisassign3Final

Victoria - “The Education State”

In recent years, Government of Victoria has branded Victoria as ‘Education state of Australia’.
Often, the key performance index (KPI) of Full Time Enrolled (FTE) students density/ ratio per school, is considered as important criteria to gauge quality of education in Victoria schools and to measure school development facilities. It is a reflection of shared educational, staff, infrastructure, by enrolled students in school.
KPI of FTE Students density/ ratio is evaluated across all 3 category school types ie. Government (gov.), Catholic and Independent and analyzed across all 80 Victorian Local Government Area (LGA) schools for the year 2022.

Problem Statement

Recent decline in Victorian school enrollments for catholic schools have been reported in various media.
Through data visualization, it is evident that independent schools generally have highest FTE students ratio, across Victoria LGA’s.
It is not very evident about the FTE students ratio of Government and catholic category school types, and which is having better KPI ratio than other.
Statistics and hypothesis testing is done to evaluate how Catholic schools fare on this KPI criteria of FTE students ratio/ density, in comparison to government schools.

Data

Data is collected from open source: Source: Statistics on Victorian schools. 21/05/2023. Retrieved May 21, 2023 from Statistics on Victorian schools and teaching website https://www.vic.gov.au/statistics-victorian-schools-and-teaching.

#Data import and scanning followed by data type conversions
schoolsenrol <- read_excel("C:/Users/praam/OneDrive - RMIT University/schoolsenrol.xlsx", sheet = "LGA Data", skip = 9)
#Data re import and removing unnecessary rows and columns
schoolsenrol_use <- read_excel("schoolsenrol_use.xlsx")
schools1 <- schoolsenrol_use[c(-81),]
#Adding proper column names to data set
colnames(schools1) <- c("LGA_name", "Gov_Schools_Enrols", "Gov_Schools_No.", "Catholic_Schools_Enrols", "Catholic_Schools_No.", "Indep_Schools_Enrols", "Indep_Schools_No.", "Total_Schools_Enrols", "Total_Schools_No.")

#Conversion of data types
schools1$"LGA_name" <- as.factor(schools1$"LGA_name")
schools1$"Gov_Schools_Enrols" <- as.numeric(schools1$"Gov_Schools_Enrols")
schools1$"Gov_Schools_No." <- as.numeric(schools1$"Gov_Schools_No.")
schools1$"Catholic_Schools_Enrols" <- as.numeric(schools1$"Catholic_Schools_Enrols")
schools1$"Catholic_Schools_No." <- as.numeric(schools1$"Catholic_Schools_No.")
schools1$"Indep_Schools_Enrols" <- as.numeric(schools1$"Indep_Schools_Enrols")
schools1$"Indep_Schools_No." <- as.numeric(schools1$"Indep_Schools_No.")
schools1[is.na(schools1)] <- 0 # identify NAs in full data frame and imputing them by 0
head(schools1)

Data Cont.

##Preparation of data sets and adding columns for ratio/ density ie. to calculate No. of students per school
school_ratio <- schools1 %>%  add_column(Gov_Schools_Ratio = (schools1$Gov_Schools_Enrols/ schools1$Gov_Schools_No.),.after = "Gov_Schools_No.") 
school_ratio1 <- school_ratio %>%  add_column(Catholic_Schools_Ratio = (school_ratio$Catholic_Schools_Enrols/ school_ratio$Catholic_Schools_No.),.after = "Catholic_Schools_No.") 
school_ratio2 <- school_ratio1 %>%  add_column(Indep_Schools_Ratio = (school_ratio1$Indep_Schools_Enrols/ school_ratio1$Indep_Schools_No.),.after = "Indep_Schools_No.") 
school_ratio3 <- school_ratio2 %>%  add_column(Total_Schools_Ratio = (school_ratio2$Total_Schools_Enrols/ school_ratio2$Total_Schools_No.),.after = "Total_Schools_No.") 

# identify NAs in full data frame and imputing them by 0
school_ratio3[is.na(school_ratio3)] <- 0
#Ranking the LGA's as per descending order of Victoria LGA
school_ratio_rank <- school_ratio3 %>%  add_column(Total_Schools_No._Rank = rank(desc(school_ratio3$Total_Schools_No.),ties.method = "min"),.before =  "LGA_name") 
#Reordering of columns for likewise categories
school_ratio_rank2 <- school_ratio_rank[, c(1,2,13,4,7,10,12,3,6,9,14,5,8,11)]

#Subsetting dataset
school_ratio_rank4 <- school_ratio_rank2[, c(1,2,11,12,13,14)]
school_ratio_rank4_long <- gather(school_ratio_rank4, key = "Variable", value = "Students_Ratio", Total_Schools_Ratio:Indep_Schools_Ratio)
#Conversion of data type
school_ratio_rank4_long$Variable <- as.factor(school_ratio_rank4_long$Variable)
school_ratio_rank4_long$Rank_labels <- c(rank(-school_ratio_rank4$Total_Schools_Ratio,ties.method = "min"),
                                         rank(-school_ratio_rank4$Gov_Schools_Ratio,ties.method = "min"), rank(-school_ratio_rank4$Catholic_Schools_Ratio,ties.method = "min"),
rank(-school_ratio_rank4$Indep_Schools_Ratio,ties.method = "min"))
school_ratio_rank4_long$Variable <- factor(school_ratio_rank4_long$Variable,levels=c("Total_Schools_Ratio","Gov_Schools_Ratio","Catholic_Schools_Ratio","Indep_Schools_Ratio"),labels = c("Average of Total Schools","Government Schools","Catholic Schools", "Independent Schools"))

#Rearranging dataset for statistical testing of government and catholic schools student ratios
school_test <- school_ratio_rank4[, c(4,5)]
school_test_long <- gather(school_test, key = "School_Type", value = "Students_Ratio", Gov_Schools_Ratio:Catholic_Schools_Ratio)

#Rounding off the ratios and assigning factors
school_test_long$Students_Ratio <- round(school_test_long$Students_Ratio,2)
school_test_long$School_Type <- factor(school_test_long$School_Type,labels = c("Catholic","Government"))

In next visualization, categorical analysis of student density/ Ratio KPI, indicates that Independent schools generally have higher KPI.
Visualization also indicates that both Government (gov.) and Catholic schools categories seems to perform well on KPI.

Descriptive Statistics and Visualisation

#Interactive visualization for KPI across all 80 Victorian LGA's, 2022 
p1 <- plot_ly(data = school_ratio_rank4_long, x = ~Students_Ratio, y= ~LGA_name,size = ~Students_Ratio , alpha = 2, symbol = ~Variable, color = ~Variable, type = "scatter", mode = "markers", colors= c("#E7298A","#D95F02","#1B9E77","#7570B3")) %>% layout(title ="FTE Students Ratio (categorical), Victoria LGA - 2022",xaxis = list(zeroline = FALSE, title = "Ratio (No. of FTE Students per school)"), yaxis = list(zeroline = FALSE, title = "Victoria LGA"))
p1

Descriptive Statistics and Visualisation Cont.

Statistical techniques will be further utilized to infer between the catholic and gov. school categories, since from previous interactive visualization, it is ambiguous which has higher student density/ratio KPI.

#Summarizing of test dataset for statistical analysis
school_test_long %>% group_by(School_Type) %>% summarise(Min = min(Students_Ratio,na.rm = TRUE),
                          Q1 = quantile(Students_Ratio,probs = .25,na.rm = TRUE),Median = median(Students_Ratio, na.rm = TRUE),Q3 = quantile(Students_Ratio,probs = .75,na.rm = TRUE),Max = max(Students_Ratio,na.rm = TRUE),
                          Mean = mean(Students_Ratio, na.rm = TRUE),SD = sd(Students_Ratio, na.rm = TRUE),n = n(),Missing = sum(is.na(Students_Ratio)))-> table1
knitr::kable(table1)

School_Type	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Catholic	0	165.2500	335.115	461.6375	711.40	331.3274	193.2953	80	0
Government	10	143.8375	310.705	499.5975	1039.89	336.0165	205.7019	80	0

#Statistically visualizing using box plot
school_test_long %>% boxplot(Students_Ratio ~ School_Type, data = ., ylab = "Students_Ratio")

Summary statistics and box plots, indicates a close call to decide which of above category is having higher student ratios. An outlier is also observed for gov. school box plot which is correct data representative of Wyndham LGA.

#Student ratio distributions for catholic schools using the qqPlot() function.
school_Catholic <- school_test_long %>% filter(School_Type == "Catholic")
school_Catholic$Students_Ratio %>% qqPlot(dist="norm")

## [1] 47 64

Descriptive Statistics and Visualisation Cont.

#Student ratio distributions for government schools using the qqPlot() function.
school_Government <- school_test_long %>% filter(School_Type == "Government")
school_Government$Students_Ratio %>% qqPlot(dist="norm")

## [1] 77 45

Descriptive Statistics and Visualisation Cont.

Visualization and checking of normality using Q−Q plots, for catholic schools indicates all points appear to be following a normal distribution, with data points falling closely to the diagonal line.
For government school, ratio data isn’t so clear, with points falling outside the tails of distribution indicating heavier tails than a normal distribution.
However, using Central Limit Theorem (CLT), we know that the sampling distribution of a mean will be approximately normally distributed, when the sample size is large (i.e. n>30).
Since the data size is large = 80 (ie. n>30), for each catholic and government school category, normality checking can be effectively ignored for government school ratio, and we can proceed with the two-sample t-test Hypothesis Testing.
The two-sample t-test will further help us to consider whether the difference between catholic and gov. schools student ratio is statistically significant or not.

Hypothesis Testing

Homogeneity of variance, or the assumption of equal variance, is tested using the Levene’s test. The Levene’s test has the following statistical hypotheses: \[H_0: \sigma^2_1 = \sigma^2_2 \] \[H_A: \sigma^2_1 \neq \sigma^2_2 \] where σ1 and σ2 are population variances of catholic schools ratio and gov. schools ratio respectively. Levene’s test reports a p-value that is compared to standard 0.05 significance level.

leveneTest(Students_Ratio ~ School_Type, data = school_test_long)

The p-value for the Levene’s test of equal variance for school ratio between catholic and gov. schools was p=0.30, ie. we find p>.05, and therefore, we FAIL to reject H0.In plain language, we are safe to assume equal variance.
The assumption of equal variance is important because it will determine the type of two-sample t-test we will perform, which is a Two-sample t-test - Assuming Equal Variance in this case.
The two-sample t-test has the following statistical hypotheses: \[H_0: \mu_1 - \mu_2 = 0 \] \[H_A: \mu_1 -\mu_2 \ne 0 \] where, μ1 and μ2,refer to the population means of catholic and gov. schools student ratios respectively.

#Two-sample t-test - Assuming Equal Variance
t.test(Students_Ratio ~ School_Type,
  data = school_test_long,
  var.equal = TRUE,
  alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  Students_Ratio by School_Type
## t = -0.14858, df = 158, p-value = 0.8821
## alternative hypothesis: true difference in means between group Catholic and group Government is not equal to 0
## 95 percent confidence interval:
##  -67.02056  57.64231
## sample estimates:
##   mean in group Catholic mean in group Government 
##                 331.3274                 336.0165

The null hypothesis is simply the difference between two independent population means is 0 ie. the difference between catholic and gov. schools ratios estimated by the sample is 331.3274 - 336.0165 = -4.6891 and test statistic t = −0.148.
The t-statistic is compared to a two-tailed t-critical value t∗ with df: df = n1+n2−2 Assuming α=0.05,and a two-tailed test, we find t∗:

qt(p = 0.25, df = 80 + 80 - 2)

## [1] -0.6760457

Critical value approach

As the test statistic t from the two-sample t-test assuming equal variance, is t=−0.148, which is NOT more extreme than −0.676, we FAIL to reject H0. According to the critical value method, there was NOT statistically significant difference between catholic and government schools student ratio means.

Using p-value method

The p-value of the two-sample t-test tell us the probability of observing a sample difference between the means of -4.6891, or one more extreme, assuming the difference was 0 in the population (i.e. H0 is true). The two-tailed p-value is reported to be p=.882. According to the p-value method, as p=.882 > α = 0.05, we FAIL to reject H0. Therefore, there was NOT statistically significant difference observed between the means of catholic and government schools student ratio.

Confidence Interval approach

The 95% CI of the difference between the means (- 4.689) was calculated ie. H0: μ = -4.689 Since, 95% CI [-67.02 57.64] interval does capture H0, we FAIL to reject it. Once again, there was NOT a significant difference between the means of student ratios of catholic and government schools.

Discussion

A two-sample t-test was used to test for a significant difference between the means of student ratios of Victoria catholic and gov. schools. The student ratio for gov. schools exhibited evidence of non-normality upon inspection of normal Q-Q plot.
The central limit theorem ensured that t-test could be applied due to large sample size in each group. Also, the Levene’s test of homogeneity of variance indicated that equal variance could be assumed.
Using all 3 approaches of testing, it was deduced that t(df=158)= −0.148, p=.882, 95% CI for the difference in means [-67.02 57.64]. Results of the two-sample t-test assuming equal variance found a NOT statistically significant difference between the mean student density/ ratios of catholic and government schools of Victoria LGA’s.
KPI of student density/ ratio per school shall not be considered as the only criteria to determine quality of education in schools. It is evident from investigation, that in-spite of high student ratio generally observed for independent schools, student enrollments are categorically increasing.[2]
Victoria Education department shall make brisk efforts to plan and bridge school educational gaps, by opening new schools in laggards LGA’s such as Wyndham which are high population growth LGA’s and are having high student density/ ratio KPI per school.
Sincere efforts shall also be made by educational department to improve quality of education in catholic and government schools across LGA’s to stop further decline of student enrollments and to showcase its model of “The Education State” to Australia.

MATH 1324: Assignment 2

Evaluation of Students density/ ratio of government and catholic schools across Victoria LGA’s, 2022

RPubs link information

Victoria - “The Education State”

Problem Statement

Data

Data Cont.

Descriptive Statistics and Visualisation

Descriptive Statistics and Visualisation Cont.

Descriptive Statistics and Visualisation Cont.

Descriptive Statistics and Visualisation Cont.

Hypothesis Testing

Critical value approach

Using p-value method

Confidence Interval approach

Discussion

References