In a critically acclaimed research paper, Ludger Woessmann and Eric Hanushek (2008) claimed that the quality of education is a strong predictor of economic growth. The authors argue that increasing educational attainment without also increasing the effectiveness of the education system is fruitless. However, in their analysis Hanushek & Woessmann (2008) rely on achievement scores from Large-Scale Assessments (e.g. PISA, TIMSS), which usually give a narrow perspective on the condition of an education system. In response to this issue, the goal of this project is to explore the similarities between the education systems in the European Union. I apply several clustering techniques, such as Hierarchical Clustering, K-Means, PAM, CLARA and Latent Class Analysis, based on PIRLS 2021 data. In the following parts, I use Principal Component Analysis to reduce the dimensions used for clustering. Finally, I use MDS to compare the features of the educational systems in a two-dimensional space.
The PIRLS 2021 data were downloaded from the website of the International Association for the Evaluation of Educational Achievement (IEA) and merged using the IEA IDB Analyzer. Out of the participating countries, we focus on 23 educational systems in 22 EU countries.
# Aggregate basic data on achievements
dataPIRLS21_agg <- dataPIRLS21 %>%
mutate(IDCNTRY = as_factor(IDCNTRY)) %>%
group_by(IDCNTRY) %>%
summarise(achievements_read = round(mean((ASRREA01 + ASRREA02 + ASRREA03 + ASRREA04 + ASRREA05) / 5), 2),  # average the five plausible values per student, then across students
desire = mean(ACBG11J, na.rm = TRUE)) %>%
arrange(across(achievements_read, desc))
pic1 <- ggplot(dataPIRLS21_agg, aes(x = reorder(IDCNTRY, achievements_read), y = achievements_read, fill = achievements_read))+
geom_bar(stat = "summary", fun = "mean", position = position_stack(reverse = TRUE)) +
theme_minimal() +
scale_fill_viridis(option = "D") +
labs(title = "") +
xlab("Country") +
ylab("Fourth Graders' Achievements in Reading") +
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(linewidth = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(linewidth = 0.5, color = "grey")
)+
coord_flip()
# But at the same time, let's look at students' self-reported desire to do well in school
pic2 <- ggplot(dataPIRLS21_agg, aes(x = reorder(IDCNTRY, desire), y = desire, fill = desire))+
geom_bar(stat = "summary", fun = "mean", position = position_stack(reverse = TRUE)) +
theme_minimal() +
scale_fill_viridis(option = "D") +
labs(title = "") +
xlab("Country") +
ylab("Students' desire to do well in school") +
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(linewidth = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(linewidth = 0.5, color = "grey")
)+
coord_flip()
pic1
We can see that the overlap between countries' positions on reading achievements and on the desire to do well in school is not large. For instance, in Croatia and Ireland the desire is generally low despite high achievements in reading.
I will use several indicators related to school climate and students' mental health to model the clusters. PIRLS generates continuous scales in several domains; hence, the analysis will be based on continuous variables only. The clustering will be based on:
Digital self-efficacy (ASBGSEC)
Students’ sense of school belonging (ASBGSSB)
Students’ engagement in reading lessons (ASBGSB)
Disorderly behaviour during lessons (ASBGERL)
Students like reading (ASBGDRL)
Students’ confidence in reading (ASBGSCR)
Instruction affected by resources shortages (ACBGRRS)
School emphasis on academic success (ACBGEAS)
School discipline (ACBGDAS)
Teachers’ job satisfaction (ATBGTJS)
Teaching limited by students not ready for learning (ATBGSLI)
I add an ID to the original data to track individuals later.
dataPIRLS21$ID <- seq_len(nrow(dataPIRLS21))
data_selected <- dataPIRLS21 %>%
dplyr::select(ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI, ID, IDCNTRY, IDSCHOOL, WGTADJ1) %>%
filter(
if_all(everything(), ~ . != 999999)  # keep rows without the 999999 missing-data code
)
# I need to reweight the data to give each country an equal weight. If not, the results will be dominated by Spain and France
# Calculate weights for each country
country_weights <- data_selected %>%
group_by(IDCNTRY) %>%
mutate(weight = 1/n() * WGTADJ1)
# Almost equal weights
country_weights %>%
group_by(IDCNTRY) %>%
summarise(suma = sum(weight))
## # A tibble: 22 × 2
## IDCNTRY suma
## <dbl+lbl> <dbl>
## 1 40 [Austria] 1
## 2 100 [Bulgaria] 1
## 3 191 [Croatia] 1.04
## 4 196 [Cyprus] 1.01
## 5 203 [Czech Republic] 1.00
## 6 208 [Denmark] 1.09
## 7 246 [Finland] 1
## 8 250 [France] 1.03
## 9 276 [Germany] 1.03
## 10 380 [Italy] 1.01
## # ℹ 12 more rows
# Merge the calculated weights back to the original dataset
data_selected$weight <- country_weights$weight
data_selected %>%
group_by(IDCNTRY) %>%
summarise(suma = sum(weight))
## # A tibble: 22 × 2
## IDCNTRY suma
## <dbl+lbl> <dbl>
## 1 40 [Austria] 1
## 2 100 [Bulgaria] 1
## 3 191 [Croatia] 1.04
## 4 196 [Cyprus] 1.01
## 5 203 [Czech Republic] 1.00
## 6 208 [Denmark] 1.09
## 7 246 [Finland] 1
## 8 250 [France] 1.03
## 9 276 [Germany] 1.03
## 10 380 [Italy] 1.01
## # ℹ 12 more rows
I start the analysis by aggregating the data at the education-system level. Then, I apply Hierarchical Clustering to see the similarities between educational systems.
# Aggregate data
data_country <- data_selected %>%
dplyr::select(IDCNTRY, ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI, weight) %>%
dplyr::group_by(IDCNTRY) %>%
summarise(ASBGSEC_m = weighted.mean(x = ASBGSEC, w = weight, na.rm = T),
ASBGSSB_m = weighted.mean(ASBGSSB, w = weight, na.rm = T),
ASBGSB_m = weighted.mean(ASBGSB, w = weight, na.rm = T),
ASBGERL_m = weighted.mean(ASBGERL, w = weight, na.rm = T),
ASBGDRL_m = weighted.mean(ASBGDRL, w = weight, na.rm = T),
ASBGSCR_m = weighted.mean(ASBGSCR, w = weight, na.rm = T),
ACBGRRS_m = weighted.mean(ACBGRRS, w = weight, na.rm = T),
ACBGEAS_m = weighted.mean(ACBGEAS, w = weight, na.rm = T),
ACBGDAS_m = weighted.mean(ACBGDAS, w = weight, na.rm = T),
ATBGTJS_m = weighted.mean(ATBGTJS, w = weight, na.rm = T),
ATBGSLI_m = weighted.mean(ATBGSLI, w = weight, na.rm = T))
# Scale the variables
data_country2 <- scale(data_country[,-1])
rownames(data_country2) = as_factor(data_country$IDCNTRY)
# Compute the dendrogram
hclusters <- hclust(dist(data_country2), method = "average")  # data_country2 already excludes IDCNTRY
# Evaluate the number of clusters
p1 <- fviz_nbclust(data_country2, FUN = hcut, method = "wss",
k.max = 10) +
ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(data_country2, FUN = hcut, method = "silhouette",
k.max = 10) +
ggtitle("(B) Silhouette method")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, nrow = 1)
# We got 9 clusters, but the silhouette measure for 6 clusters is quite similar
hclust9_graph <- fviz_dend(hclusters, k = 9, # cut into nine groups
cex = 0.5, # label size
k_colors = c("#264653",
"#2A9D8F",
"#BFBCCB",
"#DCEED1",
"#E9C46A",
"#F4A261",
"#E76F51",
"#784F41",
"#B80053"),
color_labels_by_k = TRUE, # color labels by groups
ggtheme = theme_classic()) +
theme(
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12)
) + labs(title = "9 clusters (optimal)")
# Graph for 6 clusters
hclust6_graph <- fviz_dend(hclusters, k = 6, # cut into six groups
cex = 0.5, # label size
k_colors = c("#264653",
"#2A9D8F",
"#BFBCCB",
"#DCEED1",
"#E9C46A",
"#F4A261"),
color_labels_by_k = TRUE, # color labels by groups
ggtheme = theme_classic()) +
theme(
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12)
) + labs(title = "6 clusters")
# Display plots side by side
gridExtra::grid.arrange(hclust6_graph, hclust9_graph, nrow = 1)
The results of hierarchical clustering lead to insightful conclusions. Even without looking at the achievement data, we find similarities between educational systems. Some results, such as the Czech Republic and Slovakia or Denmark and Sweden landing in the same cluster, are unsurprising given the cultural similarities. The case of Poland is interesting. Polish public opinion generally holds that the educational system is poor and should look up to the Finnish system. However, the results clearly show similarities between the two systems.
As the data cover 22 education systems, I set the maximum number of clusters to 22. To limit the computational burden, I will examine only school-level data, as clustering individual students is generally not feasible.
data_school <- data_selected %>%
dplyr::select(IDCNTRY, IDSCHOOL, ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI,weight) %>%
dplyr::group_by(IDCNTRY, IDSCHOOL) %>%
summarise(ASBGSEC_m = weighted.mean(x = ASBGSEC, w = weight, na.rm = T),
ASBGSSB_m = weighted.mean(ASBGSSB, w = weight, na.rm = T),
ASBGSB_m = weighted.mean(ASBGSB, w = weight, na.rm = T),
ASBGERL_m = weighted.mean(ASBGERL, w = weight, na.rm = T),
ASBGDRL_m = weighted.mean(ASBGDRL, w = weight, na.rm = T),
ASBGSCR_m = weighted.mean(ASBGSCR, w = weight, na.rm = T),
ACBGRRS_m = weighted.mean(ACBGRRS, w = weight, na.rm = T),
ACBGEAS_m = weighted.mean(ACBGEAS, w = weight, na.rm = T),
ACBGDAS_m = weighted.mean(ACBGDAS, w = weight, na.rm = T),
ATBGTJS_m = weighted.mean(ATBGTJS, w = weight, na.rm = T),
ATBGSLI_m = weighted.mean(ATBGSLI, w = weight, na.rm = T),
sum_wgt = sum(weight))
# Evaluate the number of clusters
p1_kmeans <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2, 14)])), FUN = stats::kmeans, method ="wss",k.max = 22) + ggtitle("(A) Elbow method")
# Silhouette Evaluation
p2_kmeans <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2, 14)])), FUN = stats::kmeans, method = "silhouette", k.max = 22) +
ggtitle("(B) Silhouette method")
clusters_kmeans <- flexclust::cclust(as.data.frame(scale(data_school[-c(1,2, 14)])), 2, dist = "euclidean", method = "kmeans", weights = data_school$sum_wgt, save.data = TRUE)
# Evaluate the number of clusters
p1_pam <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2)])), FUN = cluster::pam, method ="wss", k.max = 22) + ggtitle("(A) Elbow method")
# Silhouette Evaluation
p2_pam <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2)])), FUN = cluster::pam, method = "silhouette", k.max = 22) + ggtitle("(B) Silhouette method")
# Arrange the evaluation graphs
kmeans_combined_ev <- gridExtra::grid.arrange(p1_kmeans, p2_kmeans)
# Silhouette and WSS suggest 2 clusters
kmeans <- eclust(as.data.frame(scale(data_school[-c(1,2)])), FUNcluster = "kmeans", k = 2)
Based on the graphs, the educational systems look similar. However, it is possible to assess the agreement between the cluster assignments. In this project, I apply the Rand index, defined as:
\[ R = \frac{a+b}{a+b+c+d} \]
where a is the number of pairs of observations placed in the same cluster by both clustering techniques, b is the number of pairs placed in different clusters by both, and the remaining terms, c + d, count the pairs on which the two clusterings disagree.
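As a quick illustration (not part of the original analysis), a toy example with fossil::rand.index on two hand-made partitions of six observations:
library(fossil)
clust_a <- c(1, 1, 1, 2, 2, 2)  # first partition of six observations
clust_b <- c(1, 1, 2, 2, 2, 2)  # second partition
rand.index(clust_a, clust_b)  # 10 of the 15 pairs are treated consistently: R = 0.667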
# Rand Index to assess the similarity between clustering
rand.index(data_school$cluster_kmeans, data_school$cluster_pam)
## [1] 0.6985897
We find that K-means and PAM give the most similar results, while CLARA and PAM give the least similar ones.
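The PAM and CLARA fits themselves are not shown in this chunk. Assuming their assignments were stored analogously to cluster_kmeans (the column cluster_clara below is a hypothetical name), the remaining pairwise Rand indices could be computed as:
# Sketch only: cluster_pam is used later in the document; cluster_clara is an assumed column name
rand.index(data_school$cluster_kmeans, data_school$cluster_clara)
rand.index(data_school$cluster_pam, data_school$cluster_clara)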
Below I summarize the results of the clustering for K-means and PAM.
data_school %>%
group_by(cluster_kmeans) %>%
summarize(Digital_Skills = weighted.mean(ASBGSEC_m, sum_wgt),
Sense_of_School_Belonging = weighted.mean(ASBGSSB_m, sum_wgt),
Engagement = weighted.mean(ASBGSB_m, sum_wgt),
DisorderlyBehavior = weighted.mean(ASBGERL_m, sum_wgt),
LikeReading = weighted.mean(ASBGDRL_m, sum_wgt),
ConfidentReading = weighted.mean(ASBGSCR_m, sum_wgt),
ResourcesShortage = weighted.mean(ACBGRRS_m, sum_wgt),
EmphasisSuccess = weighted.mean(ACBGEAS_m, sum_wgt),
Discipline = weighted.mean(ACBGDAS_m, sum_wgt),
TeacherSatisfaction = weighted.mean(ATBGTJS_m, sum_wgt),
NotReadyForLearning = weighted.mean(ATBGSLI_m, sum_wgt))
## # A tibble: 2 × 12
## cluster_kmeans Digital_Skills Sense_of_School_Belonging Engagement
## <int> <dbl> <dbl> <dbl>
## 1 1 10.8 10.7 10.2
## 2 2 10.1 9.78 9.72
## # ℹ 8 more variables: DisorderlyBehavior <dbl>, LikeReading <dbl>,
## # ConfidentReading <dbl>, ResourcesShortage <dbl>, EmphasisSuccess <dbl>,
## # Discipline <dbl>, TeacherSatisfaction <dbl>, NotReadyForLearning <dbl>
data_school %>%
group_by(cluster_pam) %>%
summarize(Digital_Skills = mean(ASBGSEC_m),
Sense_of_School_Belonging = mean(ASBGSSB_m),
Engagement = mean(ASBGSB_m),
DisorderlyBehavior = mean(ASBGERL_m),
LikeReading = mean(ASBGDRL_m),
ConfidentReading = mean(ASBGSCR_m),
ResourcesShortage = mean(ACBGRRS_m),
EmphasisSuccess = mean(ACBGEAS_m),
Discipline = mean(ACBGDAS_m),
TeacherSatisfaction = mean(ATBGTJS_m),
NotReadyForLearning = mean(ATBGSLI_m))
## # A tibble: 2 × 12
## cluster_pam Digital_Skills Sense_of_School_Belonging Engagement
## <int> <dbl> <dbl> <dbl>
## 1 1 10.7 10.7 10.2
## 2 2 10.2 9.95 9.78
## # ℹ 8 more variables: DisorderlyBehavior <dbl>, LikeReading <dbl>,
## # ConfidentReading <dbl>, ResourcesShortage <dbl>, EmphasisSuccess <dbl>,
## # Discipline <dbl>, TeacherSatisfaction <dbl>, NotReadyForLearning <dbl>
Although there are some differences in the mean values of the calculated variables, in general all measures are higher in one of the clusters, indicating better conditions for learning. Based on these results, let's look at the country composition of each cluster.
data_school %>%
group_by(cluster_kmeans, as_factor(IDCNTRY)) %>%
summarise(n = n(), sum_wgt = sum(sum_wgt)) %>%
mutate(freq = round(n * sum_wgt / sum(n * sum_wgt), 2) * 100) %>%
filter(cluster_kmeans == 1) %>%
arrange(-sum_wgt)
## # A tibble: 22 × 5
## # Groups: cluster_kmeans [1]
## cluster_kmeans `as_factor(IDCNTRY)` n sum_wgt freq
## <int> <fct> <int> <dbl> <dbl>
## 1 1 Bulgaria 120 0.896 16
## 2 1 Finland 150 0.671 15
## 3 1 Portugal 128 0.671 13
## 4 1 Poland 95 0.624 9
## 5 1 Spain 246 0.564 21
## 6 1 Sweden 50 0.492 4
## 7 1 Denmark 54 0.408 3
## 8 1 Cyprus 58 0.398 3
## 9 1 Malta 28 0.390 2
## 10 1 Netherlands 31 0.379 2
## # ℹ 12 more rows
data_school %>%
group_by(cluster_kmeans, as_factor(IDCNTRY)) %>%
summarise(n = n(), sum_wgt = sum(sum_wgt)) %>%
mutate(freq = round(n * sum_wgt / sum(n * sum_wgt), 2) * 100) %>%
filter(cluster_kmeans == 2) %>%
arrange(-sum_wgt)
## # A tibble: 22 × 5
## # Groups: cluster_kmeans [1]
## cluster_kmeans `as_factor(IDCNTRY)` n sum_wgt freq
## <int> <fct> <int> <dbl> <dbl>
## 1 2 Belgium (Flemish) 101 1.01 6
## 2 2 Belgium (French) 139 0.989 8
## 3 2 France 164 0.967 10
## 4 2 Latvia 137 0.933 8
## 5 2 Lithuania 84 0.921 5
## 6 2 Croatia 113 0.877 6
## 7 2 Italy 141 0.875 7
## 8 2 Netherlands 65 0.849 3
## 9 2 Slovenia 98 0.824 5
## 10 2 Slovak Republic 120 0.802 6
## # ℹ 12 more rows
# Merge with achievements
dataPIRLS21_agg_schools <- dataPIRLS21 %>%
group_by(IDCNTRY, IDSCHOOL) %>%
summarise(achievements_read = round(mean((ASRREA01 + ASRREA02 + ASRREA03 + ASRREA04 + ASRREA05) / 5), 2)) %>%
dplyr::select(IDCNTRY, IDSCHOOL, achievements_read)
data_school %>%
left_join(dataPIRLS21_agg_schools, by = c('IDCNTRY', 'IDSCHOOL')) %>%
group_by(cluster_kmeans) %>%
summarise(Reading = weighted.mean(x = achievements_read, w = sum_wgt))
## # A tibble: 2 × 2
## cluster_kmeans Reading
## <int> <dbl>
## 1 1 548.
## 2 2 525.
We find that, in general, the Belgian, Latvian, French, Lithuanian and Italian education systems fall mainly into the cluster with poorer learning conditions. In contrast, Bulgaria and Finland share similarly favourable well-being indicators.
For this part I slightly modify the approach by using indices rather than scales.
data_selected_lca <- dataPIRLS21 %>%
dplyr::select(ASDGSEC, ASDGSSB, ASDGSB, ASDGERL, ASDGDRL, ASDGSCR, ASDGHRL, ATDGTJS, ATDGSLI, ID, IDCNTRY, IDSCHOOL, WGTADJ1) %>%
filter(
if_all(everything(), ~ . != 9)  # keep rows without the missing-data code 9
) %>%
group_by(IDCNTRY) %>%
mutate(weight = 1/n() * WGTADJ1) %>%
ungroup()
# Modification
data_selected_lca$no_dig <- ifelse(data_selected_lca$ASDGSEC == 3, as.integer(1), as.integer(0))
data_selected_lca$no_bel <- ifelse(data_selected_lca$ASDGSSB == 3, as.integer(1), as.integer(0))
data_selected_lca$bullying <- ifelse(data_selected_lca$ASDGSB == 3, as.integer(1), as.integer(0))
data_selected_lca$no_eng <- ifelse(data_selected_lca$ASDGERL == 3, as.integer(1), as.integer(0))
data_selected_lca$disorder <- ifelse(data_selected_lca$ASDGDRL == 3, as.integer(1), as.integer(0))
data_selected_lca$conf_read <- ifelse(data_selected_lca$ASDGSCR == 3, as.integer(1), as.integer(0))
data_selected_lca$res_short <- ifelse(data_selected_lca$ASDGHRL == 3, as.integer(1), as.integer(0))
data_selected_lca$jobsas <- ifelse(data_selected_lca$ATDGTJS == 3, as.integer(1), as.integer(0))
data_selected_lca$notread <- ifelse(data_selected_lca$ATDGSLI == 3, as.integer(1), as.integer(0))
Next, I run the Latent Class Analysis.
data_lca <- data_selected_lca %>%
dplyr::select(no_dig, no_bel, bullying, no_eng, disorder, conf_read, res_short, jobsas, notread, ID, IDCNTRY, IDSCHOOL)
form <- cbind(no_dig, no_bel, bullying, no_eng, disorder, conf_read, res_short, jobsas, notread) ~ 1
M0 <- poLCA(formula = form, data = data_lca[, 1:9] + 1, nclass = 1)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $no_dig
## Pr(1) Pr(2)
## class 1: 0.8878 0.1122
##
## $no_bel
## Pr(1) Pr(2)
## class 1: 0.9311 0.0689
##
## $bullying
## Pr(1) Pr(2)
## class 1: 0.9044 0.0956
##
## $no_eng
## Pr(1) Pr(2)
## class 1: 0.9515 0.0485
##
## $disorder
## Pr(1) Pr(2)
## class 1: 0.8582 0.1418
##
## $conf_read
## Pr(1) Pr(2)
## class 1: 0.8329 0.1671
##
## $res_short
## Pr(1) Pr(2)
## class 1: 0.9761 0.0239
##
## $jobsas
## Pr(1) Pr(2)
## class 1: 0.9182 0.0818
##
## $notread
## Pr(1) Pr(2)
## class 1: 0.9625 0.0375
##
## Estimated class population shares
## 1
##
## Predicted class memberships (by modal posterior prob.)
## 1
##
## =========================================================
## Fit for 1 latent classes:
## =========================================================
## number of observations: 84583
## number of estimated parameters: 9
## residual degrees of freedom: 502
## maximum log-likelihood: -213703
##
## AIC(1): 427424
## BIC(1): 427508.1
## G^2(1): 15713.19 (Likelihood ratio/deviance statistic)
## X^2(1): 50810.25 (Chi-square goodness of fit)
##
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $no_dig
## Pr(1) Pr(2)
## class 1: 0.9052 0.0948
## class 2: 0.8292 0.1708
##
## $no_bel
## Pr(1) Pr(2)
## class 1: 0.9951 0.0049
## class 2: 0.7156 0.2844
##
## $bullying
## Pr(1) Pr(2)
## class 1: 0.9610 0.0390
## class 2: 0.7138 0.2862
##
## $no_eng
## Pr(1) Pr(2)
## class 1: 0.9978 0.0022
## class 2: 0.7957 0.2043
##
## $disorder
## Pr(1) Pr(2)
## class 1: 0.9134 0.0866
## class 2: 0.6726 0.3274
##
## $conf_read
## Pr(1) Pr(2)
## class 1: 0.8851 0.1149
## class 2: 0.6573 0.3427
##
## $res_short
## Pr(1) Pr(2)
## class 1: 0.9842 0.0158
## class 2: 0.9489 0.0511
##
## $jobsas
## Pr(1) Pr(2)
## class 1: 0.9238 0.0762
## class 2: 0.8996 0.1004
##
## $notread
## Pr(1) Pr(2)
## class 1: 0.9680 0.0320
## class 2: 0.9441 0.0559
##
## Estimated class population shares
## 0.7708 0.2292
##
## Predicted class memberships (by modal posterior prob.)
## 0.8388 0.1612
##
## =========================================================
## Fit for 2 latent classes:
## =========================================================
## number of observations: 84583
## number of estimated parameters: 19
## residual degrees of freedom: 492
## maximum log-likelihood: -208506.8
##
## AIC(2): 417051.6
## BIC(2): 417229.2
## G^2(2): 5320.79 (Likelihood ratio/deviance statistic)
## X^2(2): 7580.892 (Chi-square goodness of fit)
##
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $no_dig
## Pr(1) Pr(2)
## class 1: 0.9014 0.0986
## class 2: 0.8654 0.1346
## class 3: 0.7903 0.2097
##
## $no_bel
## Pr(1) Pr(2)
## class 1: 0.9846 0.0154
## class 2: 0.9198 0.0802
## class 3: 0.3695 0.6305
##
## $bullying
## Pr(1) Pr(2)
## class 1: 0.9727 0.0273
## class 2: 0.6332 0.3668
## class 3: 0.7738 0.2262
##
## $no_eng
## Pr(1) Pr(2)
## class 1: 0.9886 0.0114
## class 2: 0.9936 0.0064
## class 3: 0.4479 0.5521
##
## $disorder
## Pr(1) Pr(2)
## class 1: 0.9267 0.0733
## class 2: 0.5805 0.4195
## class 3: 0.7405 0.2595
##
## $conf_read
## Pr(1) Pr(2)
## class 1: 0.8922 0.1078
## class 2: 0.6007 0.3993
## class 3: 0.7115 0.2885
##
## $res_short
## Pr(1) Pr(2)
## class 1: 0.9863 0.0137
## class 2: 0.9271 0.0729
## class 3: 0.9756 0.0244
##
## $jobsas
## Pr(1) Pr(2)
## class 1: 0.9239 0.0761
## class 2: 0.9003 0.0997
## class 3: 0.8973 0.1027
##
## $notread
## Pr(1) Pr(2)
## class 1: 0.9690 0.0310
## class 2: 0.9335 0.0665
## class 3: 0.9574 0.0426
##
## Estimated class population shares
## 0.7698 0.1601 0.0701
##
## Predicted class memberships (by modal posterior prob.)
## 0.8596 0.0831 0.0573
##
## =========================================================
## Fit for 3 latent classes:
## =========================================================
## number of observations: 84583
## number of estimated parameters: 29
## residual degrees of freedom: 482
## maximum log-likelihood: -207047.2
##
## AIC(3): 414152.5
## BIC(3): 414423.5
## G^2(3): 2401.674 (Likelihood ratio/deviance statistic)
## X^2(3): 4184.523 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $no_dig
## Pr(1) Pr(2)
## class 1: 0.9481 0.0519
## class 2: 0.9252 0.0748
## class 3: 0.5275 0.4725
## class 4: 0.8125 0.1875
##
## $no_bel
## Pr(1) Pr(2)
## class 1: 0.8989 0.1011
## class 2: 0.9829 0.0171
## class 3: 0.9389 0.0611
## class 4: 0.1612 0.8388
##
## $bullying
## Pr(1) Pr(2)
## class 1: 0.4951 0.5049
## class 2: 0.9704 0.0296
## class 3: 0.9508 0.0492
## class 4: 0.7577 0.2423
##
## $no_eng
## Pr(1) Pr(2)
## class 1: 0.9693 0.0307
## class 2: 0.9862 0.0138
## class 3: 0.9256 0.0744
## class 4: 0.3974 0.6026
##
## $disorder
## Pr(1) Pr(2)
## class 1: 0.5253 0.4747
## class 2: 0.9177 0.0823
## class 3: 0.8563 0.1437
## class 4: 0.7209 0.2791
##
## $conf_read
## Pr(1) Pr(2)
## class 1: 0.6065 0.3935
## class 2: 0.9070 0.0930
## class 3: 0.5401 0.4599
## class 4: 0.7434 0.2566
##
## $res_short
## Pr(1) Pr(2)
## class 1: 0.9504 0.0496
## class 2: 0.9903 0.0097
## class 3: 0.8837 0.1163
## class 4: 0.9839 0.0161
##
## $jobsas
## Pr(1) Pr(2)
## class 1: 0.9128 0.0872
## class 2: 0.9254 0.0746
## class 3: 0.8721 0.1279
## class 4: 0.9022 0.0978
##
## $notread
## Pr(1) Pr(2)
## class 1: 0.9501 0.0499
## class 2: 0.9714 0.0286
## class 3: 0.9029 0.0971
## class 4: 0.9615 0.0385
##
## Estimated class population shares
## 0.1144 0.7515 0.0874 0.0467
##
## Predicted class memberships (by modal posterior prob.)
## 0.0745 0.842 0.0437 0.0398
##
## =========================================================
## Fit for 4 latent classes:
## =========================================================
## number of observations: 84583
## number of estimated parameters: 39
## residual degrees of freedom: 472
## maximum log-likelihood: -206370.5
##
## AIC(4): 412818.9
## BIC(4): 413183.4
## G^2(4): 1048.111 (Likelihood ratio/deviance statistic)
## X^2(4): 1922.927 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $no_dig
## Pr(1) Pr(2)
## class 1: 0.8874 0.1126
## class 2: 0.9487 0.0513
## class 3: 0.4746 0.5254
## class 4: 0.9219 0.0781
## class 5: 0.8112 0.1888
##
## $no_bel
## Pr(1) Pr(2)
## class 1: 0.9706 0.0294
## class 2: 0.9006 0.0994
## class 3: 0.9303 0.0697
## class 4: 0.9846 0.0154
## class 5: 0.1874 0.8126
##
## $bullying
## Pr(1) Pr(2)
## class 1: 0.9838 0.0162
## class 2: 0.5058 0.4942
## class 3: 0.9402 0.0598
## class 4: 0.9675 0.0325
## class 5: 0.7572 0.2428
##
## $no_eng
## Pr(1) Pr(2)
## class 1: 0.9830 0.0170
## class 2: 0.9727 0.0273
## class 3: 0.9184 0.0816
## class 4: 0.9859 0.0141
## class 5: 0.3988 0.6012
##
## $disorder
## Pr(1) Pr(2)
## class 1: 0.8582 0.1418
## class 2: 0.5222 0.4778
## class 3: 0.8621 0.1379
## class 4: 0.9293 0.0707
## class 5: 0.7223 0.2777
##
## $conf_read
## Pr(1) Pr(2)
## class 1: 0.9507 0.0493
## class 2: 0.6044 0.3956
## class 3: 0.4144 0.5856
## class 4: 0.8954 0.1046
## class 5: 0.7447 0.2553
##
## $res_short
## Pr(1) Pr(2)
## class 1: 0.9766 0.0234
## class 2: 0.9509 0.0491
## class 3: 0.8612 0.1388
## class 4: 0.9909 0.0091
## class 5: 0.9840 0.0160
##
## $jobsas
## Pr(1) Pr(2)
## class 1: 0.8184 0.1816
## class 2: 0.9141 0.0859
## class 3: 0.8763 0.1237
## class 4: 0.9456 0.0544
## class 5: 0.9039 0.0961
##
## $notread
## Pr(1) Pr(2)
## class 1: 0.8587 0.1413
## class 2: 0.9505 0.0495
## class 3: 0.9059 0.0941
## class 4: 0.9924 0.0076
## class 5: 0.9639 0.0361
##
## Estimated class population shares
## 0.1376 0.116 0.0605 0.6374 0.0484
##
## Predicted class memberships (by modal posterior prob.)
## 0.0337 0.0712 0.0385 0.8104 0.0462
##
## =========================================================
## Fit for 5 latent classes:
## =========================================================
## number of observations: 84583
## number of estimated parameters: 49
## residual degrees of freedom: 462
## maximum log-likelihood: -206271.6
##
## AIC(5): 412641.1
## BIC(5): 413099.1
## G^2(5): 850.3277 (Likelihood ratio/deviance statistic)
## X^2(5): 1691.787 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $no_dig
## Pr(1) Pr(2)
## class 1: 0.5553 0.4447
## class 2: 0.5288 0.4712
## class 3: 0.8994 0.1006
## class 4: 0.9333 0.0667
## class 5: 0.8124 0.1876
## class 6: 0.9446 0.0554
##
## $no_bel
## Pr(1) Pr(2)
## class 1: 0.9153 0.0847
## class 2: 0.9621 0.0379
## class 3: 0.9698 0.0302
## class 4: 0.9875 0.0125
## class 5: 0.1084 0.8916
## class 6: 0.8939 0.1061
##
## $bullying
## Pr(1) Pr(2)
## class 1: 0.9751 0.0249
## class 2: 0.8965 0.1035
## class 3: 0.9854 0.0146
## class 4: 0.9652 0.0348
## class 5: 0.7506 0.2494
## class 6: 0.5106 0.4894
##
## $no_eng
## Pr(1) Pr(2)
## class 1: 0.8917 0.1083
## class 2: 0.9635 0.0365
## class 3: 0.9758 0.0242
## class 4: 0.9906 0.0094
## class 5: 0.3597 0.6403
## class 6: 0.9637 0.0363
##
## $disorder
## Pr(1) Pr(2)
## class 1: 0.9022 0.0978
## class 2: 0.8090 0.1910
## class 3: 0.8672 0.1328
## class 4: 0.9400 0.0600
## class 5: 0.7180 0.2820
## class 6: 0.5321 0.4679
##
## $conf_read
## Pr(1) Pr(2)
## class 1: 0.5199 0.4801
## class 2: 0.4932 0.5068
## class 3: 0.9276 0.0724
## class 4: 0.9041 0.0959
## class 5: 0.7491 0.2509
## class 6: 0.6108 0.3892
##
## $res_short
## Pr(1) Pr(2)
## class 1: 0.9649 0.0351
## class 2: 0.7062 0.2938
## class 3: 0.9982 0.0018
## class 4: 0.9859 0.0141
## class 5: 0.9812 0.0188
## class 6: 0.9571 0.0429
##
## $jobsas
## Pr(1) Pr(2)
## class 1: 0.9348 0.0652
## class 2: 0.8062 0.1938
## class 3: 0.8272 0.1728
## class 4: 0.9649 0.0351
## class 5: 0.9030 0.0970
## class 6: 0.9179 0.0821
##
## $notread
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.7570 0.2430
## class 3: 0.9211 0.0789
## class 4: 0.9895 0.0105
## class 5: 0.9586 0.0414
## class 6: 0.9573 0.0427
##
## Estimated class population shares
## 0.0605 0.0275 0.2339 0.5152 0.0407 0.1222
##
## Predicted class memberships (by modal posterior prob.)
## 0.027 0.0123 0.0942 0.7203 0.0344 0.1118
##
## =========================================================
## Fit for 6 latent classes:
## =========================================================
## number of observations: 84583
## number of estimated parameters: 59
## residual degrees of freedom: 452
## maximum log-likelihood: -206209.5
##
## AIC(6): 412536.9
## BIC(6): 413088.3
## G^2(6): 726.0864 (Likelihood ratio/deviance statistic)
## X^2(6): 1129.648 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $no_dig
## Pr(1) Pr(2)
## class 1: 0.9965 0.0035
## class 2: 0.8996 0.1004
## class 3: 0.9151 0.0849
## class 4: 0.9393 0.0607
## class 5: 0.8216 0.1784
## class 6: 0.6179 0.3821
## class 7: 0.4018 0.5982
##
## $no_bel
## Pr(1) Pr(2)
## class 1: 0.9893 0.0107
## class 2: 0.9695 0.0305
## class 3: 0.9840 0.0160
## class 4: 0.8780 0.1220
## class 5: 0.1120 0.8880
## class 6: 0.9453 0.0547
## class 7: 0.9356 0.0644
##
## $bullying
## Pr(1) Pr(2)
## class 1: 0.9211 0.0789
## class 2: 0.9873 0.0127
## class 3: 0.9727 0.0273
## class 4: 0.3638 0.6362
## class 5: 0.7628 0.2372
## class 6: 0.8470 0.1530
## class 7: 0.9518 0.0482
##
## $no_eng
## Pr(1) Pr(2)
## class 1: 0.9740 0.0260
## class 2: 0.9750 0.0250
## class 3: 0.9919 0.0081
## class 4: 0.9633 0.0367
## class 5: 0.4116 0.5884
## class 6: 0.9637 0.0363
## class 7: 0.9221 0.0779
##
## $disorder
## Pr(1) Pr(2)
## class 1: 0.8446 0.1554
## class 2: 0.8436 0.1564
## class 3: 0.9588 0.0412
## class 4: 0.4878 0.5122
## class 5: 0.7211 0.2789
## class 6: 0.7558 0.2442
## class 7: 0.8692 0.1308
##
## $conf_read
## Pr(1) Pr(2)
## class 1: 0.7150 0.2850
## class 2: 0.9346 0.0654
## class 3: 0.9655 0.0345
## class 4: 0.6081 0.3919
## class 5: 0.7402 0.2598
## class 6: 0.5955 0.4045
## class 7: 0.4998 0.5002
##
## $res_short
## Pr(1) Pr(2)
## class 1: 0.9666 0.0334
## class 2: 0.9978 0.0022
## class 3: 0.9924 0.0076
## class 4: 0.9522 0.0478
## class 5: 0.9829 0.0171
## class 6: 0.7289 0.2711
## class 7: 0.9047 0.0953
##
## $jobsas
## Pr(1) Pr(2)
## class 1: 0.9361 0.0639
## class 2: 0.8182 0.1818
## class 3: 0.9608 0.0392
## class 4: 0.9174 0.0826
## class 5: 0.9060 0.0940
## class 6: 0.7220 0.2780
## class 7: 0.9037 0.0963
##
## $notread
## Pr(1) Pr(2)
## class 1: 0.9907 0.0093
## class 2: 0.9216 0.0784
## class 3: 0.9815 0.0185
## class 4: 0.9559 0.0441
## class 5: 0.9612 0.0388
## class 6: 0.0338 0.9662
## class 7: 0.9779 0.0221
##
## Estimated class population shares
## 0.2087 0.1799 0.4076 0.0794 0.045 0.0073 0.0722
##
## Predicted class memberships (by modal posterior prob.)
## 0.1725 0.0791 0.5795 0.0812 0.0435 0.0078 0.0363
##
## =========================================================
## Fit for 7 latent classes:
## =========================================================
## number of observations: 84583
## number of estimated parameters: 69
## residual degrees of freedom: 442
## maximum log-likelihood: -206182.8
##
## AIC(7): 412503.7
## BIC(7): 413148.5
## G^2(7): 672.8392 (Likelihood ratio/deviance statistic)
## X^2(7): 835.1604 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
## [1] 1
## [1] 1
An important note: in the poLCA package, the reported information criteria are positive. Usually, researchers minimise the IC, but in this package we have to maximise it. On that basis, I failed to find any meaningful latent class structure; note also the repeated 'MAXIMUM LIKELIHOOD NOT FOUND' alerts for the larger models. It seems that EU schools are generally similar.
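For reference, here is a minimal sketch (not the original code) of how the 1-7 class solutions reported above could be fitted in a loop, collecting the information criteria of each model; it assumes form and data_lca as defined earlier, and the seed value is arbitrary:
set.seed(2137)  # poLCA draws random starting values
lca_fits <- lapply(1:7, function(k)
poLCA(form, data = data_lca[, 1:9] + 1, nclass = k, nrep = 5, verbose = FALSE))  # nrep restarts guard against local maxima
data.frame(nclass = 1:7,
AIC = sapply(lca_fits, `[[`, "aic"),
BIC = sapply(lca_fits, `[[`, "bic"))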
In this part, I apply dimension reduction to prepare the data for re-clustering. The aim is twofold. First, I apply dimension-reduction algorithms to condense a large set of variables. Second, I test whether the clustering results differ after the reduction. In addition to the prior information on system characteristics, I use data on students. The goal is to assess whether schools are assigned to the same clusters after dimension reduction; if so, that indicates good quality of the reduction.
data_school_dr <- dataPIRLS21 %>%
dplyr::select(IDCNTRY, IDSCHOOL, ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI, ASDGHRL, ASDHSES, ASDHELA, ASDHENA, ASDHELT, ASDHPCS, ASDHPLR, WGTADJ1) %>%
dplyr::group_by(IDCNTRY, IDSCHOOL) %>%
mutate(weight = 1/n() * WGTADJ1) %>%
summarise(ASBGSEC_m = weighted.mean(x = ASBGSEC, w = weight, na.rm = T),
ASBGSSB_m = weighted.mean(ASBGSSB, w = weight, na.rm = T),
ASBGSB_m = weighted.mean(ASBGSB, w = weight, na.rm = T),
ASBGERL_m = weighted.mean(ASBGERL, w = weight, na.rm = T),
ASBGDRL_m = weighted.mean(ASBGDRL, w = weight, na.rm = T),
ASBGSCR_m = weighted.mean(ASBGSCR, w = weight, na.rm = T),
ACBGRRS_m = weighted.mean(ACBGRRS, w = weight, na.rm = T),
ACBGEAS_m = weighted.mean(ACBGEAS, w = weight, na.rm = T),
ACBGDAS_m = weighted.mean(ACBGDAS, w = weight, na.rm = T),
ATBGTJS_m = weighted.mean(ATBGTJS, w = weight, na.rm = T),
ATBGSLI_m = weighted.mean(ATBGSLI, w = weight, na.rm = T),
ASDGHRL_m = weighted.mean(ASDGHRL, w = weight, na.rm = T),
ASDHSES_m = weighted.mean(ASDHSES, w = weight, na.rm = T),
ASDHELA_m = weighted.mean(ASDHELA, w = weight, na.rm = T),
ASDHENA_m = weighted.mean(ASDHENA, w = weight, na.rm = T),
ASDHELT_m = weighted.mean(ASDHELT, w = weight, na.rm = T),
ASDHPCS_m = weighted.mean(ASDHPCS, w = weight, na.rm = T),
ASDHPLR_m = weighted.mean(ASDHPLR, w = weight, na.rm = T),
sum_wgt = sum(weight)) %>%
na.omit()
data_school_dr = apply_labels(data_school_dr,
ASBGSEC_m = "Digital Self-Efficacy",
ASBGSSB_m = "Students' Sense of School Belonging",
ASBGSB_m = "Students' engagement in reading lessons",
ASBGERL_m = "Disorderly behaviour during lessons",
ASBGDRL_m = "Students like reading",
ASBGSCR_m = "Students' confidence in reading",
ACBGRRS_m = "Instruction affected by resources shortages",
ACBGEAS_m = "School's emphasis on academic success",
ACBGDAS_m = "School's Discipline",
ATBGTJS_m = "Teachers' job satisfaction",
ATBGSLI_m = "Teaching limited by students not ready for learning",
ASDGHRL_m = "Home resources for learning" ,
ASDHSES_m = "Home socioeconomic status",
ASDHELA_m = "Early literacy before school",
ASDHENA_m = "Early numeracy before school",
ASDHELT_m = "Early literacy tasks",
ASDHPCS_m = "Parents' perception of school",
ASDHPLR_m = "Parents like reading"
)
schools_PrePCA <- preProcess(data_school_dr[, c(-1, -2, -21)], method = c("center", "scale"))
schools_PrePCA.s <- predict(schools_PrePCA, data_school_dr[, c(-1, -2, -21)])
summary(schools_PrePCA.s)
## ASBGSEC_m ASBGSSB_m ASBGSB_m ASBGERL_m
## Min. : 3.681 Min. : 6.060 Min. : 6.032 Min. : 6.843
## 1st Qu.: 9.875 1st Qu.: 9.536 1st Qu.: 9.431 1st Qu.: 9.275
## Median :10.346 Median :10.164 Median : 9.896 Median : 9.836
## Mean :10.317 Mean :10.166 Mean : 9.889 Mean : 9.938
## 3rd Qu.:10.781 3rd Qu.:10.782 3rd Qu.:10.357 3rd Qu.:10.524
## Max. :13.236 Max. :12.853 Max. :12.567 Max. :13.335
## ASBGDRL_m ASBGSCR_m ACBGRRS_m ACBGEAS_m
## Min. : 6.735 Min. : 6.423 Min. : 3.469 Min. : 3.332
## 1st Qu.: 9.185 1st Qu.: 9.592 1st Qu.: 9.626 1st Qu.: 8.776
## Median : 9.702 Median :10.004 Median :10.400 Median : 9.724
## Mean : 9.754 Mean : 9.997 Mean :10.675 Mean : 9.864
## 3rd Qu.:10.267 3rd Qu.:10.435 3rd Qu.:11.621 3rd Qu.:11.007
## Max. :14.988 Max. :12.607 Max. :14.681 Max. :16.974
## ACBGDAS_m ATBGTJS_m ATBGSLI_m ASDGHRL_m
## Min. : 3.786 Min. : 3.84 Min. : 2.796 Min. :1.000
## 1st Qu.: 9.282 1st Qu.: 8.68 1st Qu.: 9.102 1st Qu.:1.658
## Median :10.289 Median :10.39 Median :10.279 Median :1.800
## Mean :10.330 Mean :10.22 Mean :10.406 Mean :1.796
## 3rd Qu.:11.161 3rd Qu.:11.82 3rd Qu.:11.576 3rd Qu.:1.944
## Max. :12.961 Max. :13.24 Max. :15.080 Max. :3.000
## ASDHSES_m ASDHELA_m ASDHENA_m ASDHELT_m
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.487 1st Qu.:1.455 1st Qu.:1.500 1st Qu.:1.889
## Median :1.714 Median :1.568 Median :1.615 Median :2.167
## Mean :1.750 Mean :1.574 Mean :1.619 Mean :2.155
## 3rd Qu.:2.000 3rd Qu.:1.679 3rd Qu.:1.737 3rd Qu.:2.439
## Max. :3.000 Max. :3.000 Max. :3.000 Max. :3.000
## ASDHPCS_m ASDHPLR_m
## Min. :1.000 Min. :1.000
## 1st Qu.:1.214 1st Qu.:1.692
## Median :1.409 Median :1.846
## Mean :1.432 Mean :1.855
## 3rd Qu.:1.615 3rd Qu.:2.000
## Max. :3.000 Max. :3.000
Correlation between variables
schools.cov <-cor(schools_PrePCA.s)
my_colors <- brewer.pal(5, "Spectral")
my_colors <- colorRampPalette(my_colors)(100)
ord <- order(schools.cov[1, ])
data_ord <- schools.cov[ord, ord]
plotcorr(data_ord, col = my_colors[data_ord*50 + 50], mar = c(1,1,1,1))  # plot the reordered matrix so the colours match it
Most of the correlations are not large. We run Bartlett's test and the KMO criterion to verify whether PCA is justified.
##
## ── Kaiser-Meyer-Olkin criterion (KMO) ──────────────────────────────────────────
##
## ✔ The overall KMO value for your data is middling.
## These data are probably suitable for factor analysis.
##
## Overall: 0.728
##
## For each variable:
## ASBGSEC_m ASBGSSB_m ASBGSB_m ASBGERL_m ASBGDRL_m ASBGSCR_m ACBGRRS_m ACBGEAS_m
## 0.756 0.632 0.676 0.644 0.714 0.803 0.781 0.861
## ACBGDAS_m ATBGTJS_m ATBGSLI_m ASDGHRL_m ASDHSES_m ASDHELA_m ASDHENA_m ASDHELT_m
## 0.799 0.768 0.863 0.724 0.718 0.665 0.597 0.669
## ASDHPCS_m ASDHPLR_m
## 0.817 0.892
## $chisq
## [1] 21515.11
##
## $p.value
## [1] 0
##
## $df
## [1] 153
The results of the tests above support the use of dimension reduction.
Below I apply Principal Component Analysis on the scaled data. I use a scree plot to show the share of variance explained by each of the components.
# Screeplot + PCA
pca2_schools <- princomp(schools_PrePCA.s)
pca1_schools<-prcomp(schools_PrePCA.s, center=FALSE, scale.=FALSE)
fviz_eig(pca2_schools, choice = "eigenvalue", barfill = "#B6163A", barcolor = "#999999", linecolor = "#000000", addlabels = TRUE, main = "Eigenvalues") +
theme_minimal()
Based on Kaiser's rule, we should keep 6 dimensions in our data. Let's see how much of the variance is explained by the first 6 components.
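The eigenvalue table below is presumably produced with factoextra (the call itself is not shown in this chunk); a matching one-liner would be:
factoextra::get_eigenvalue(pca2_schools)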
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 6.060274929 29.36754519 29.36755
## Dim.2 3.694100628 17.90127815 47.26882
## Dim.3 2.759697740 13.37324611 60.64207
## Dim.4 1.997518923 9.67979637 70.32187
## Dim.5 1.944683844 9.42376235 79.74563
## Dim.6 1.500695033 7.27223265 87.01786
## Dim.7 0.828436179 4.01452693 91.03239
## Dim.8 0.616834980 2.98912663 94.02151
## Dim.9 0.350385934 1.69793860 95.71945
## Dim.10 0.292686015 1.41833001 97.13778
## Dim.11 0.220763730 1.06980111 98.20758
## Dim.12 0.122630632 0.59425697 98.80184
## Dim.13 0.101845197 0.49353263 99.29537
## Dim.14 0.057625568 0.27924830 99.57462
## Dim.15 0.041456078 0.20089241 99.77551
## Dim.16 0.027791448 0.13467485 99.91019
## Dim.17 0.009378922 0.04544941 99.95564
## Dim.18 0.009154388 0.04436134 100.00000
We can reduce the number of dimensions from 18 down to 6 and still keep 87% of the total variance!
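The tables of variable contributions (in percent) to each dimension shown below are presumably obtained via factoextra as well; a plausible call (an assumption, not shown in the original chunk) is:
factoextra::get_pca_var(pca2_schools)$contrib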
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## ASBGSEC_m 9.1885050 0.299339535 2.49317550 7.242114e-01 0.883598795
## ASBGSSB_m 8.9419292 0.002324699 6.78231586 2.078221e+00 0.198372933
## ASBGSB_m 8.4449032 0.867167807 3.27927196 1.142761e+00 0.079653169
## ASBGERL_m 8.5380386 0.086716047 8.93848078 1.546169e+00 0.520789399
## ASBGDRL_m 8.2272115 0.963542184 3.34756639 1.175796e+00 0.060387110
## ASBGSCR_m 8.6234988 0.450712817 3.18853669 3.014597e-01 0.572406185
## ACBGRRS_m 9.8908481 9.473546306 0.01042093 9.532732e+00 29.654797001
## ACBGEAS_m 8.4832385 6.111903145 23.95507213 2.588751e+01 0.251437119
## ACBGDAS_m 9.2801823 3.162921244 3.18178594 1.649107e+00 64.776188007
## ATBGTJS_m 9.0927625 76.583544314 0.03618841 1.248331e+01 0.068076985
## ATBGSLI_m 9.4446617 1.865222617 40.95356930 4.328204e+01 2.769851175
## ASDGHRL_m 0.2765315 0.006682829 0.63827396 4.640316e-02 0.017476665
## ASDHSES_m 0.2617039 0.039555252 1.33165775 7.565485e-02 0.059826225
## ASDHELA_m 0.2126205 0.008378159 0.28035294 5.030649e-03 0.008120919
## ASDHENA_m 0.2253101 0.003776428 0.21061902 2.940177e-03 0.015457999
## ASDHELT_m 0.3981433 0.011790795 0.66533226 4.819521e-02 0.002736848
## ASDHPCS_m 0.1753918 0.055501774 0.10828614 1.925799e-06 0.053705967
## ASDHPLR_m 0.2945194 0.007374047 0.59909404 1.845871e-02 0.007117501
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## ASBGSEC_m 6.838004e-01 2.955106753 52.433878546 24.682213120 1.575316e+00
## ASBGSSB_m 2.345477e+00 22.594712928 2.164335882 0.195348496 2.243227e-04
## ASBGSB_m 1.217464e+00 12.463401466 4.790603162 9.442578512 5.521553e+01
## ASBGERL_m 2.035786e+00 31.129109019 0.077091812 0.120353300 1.616446e-01
## ASBGDRL_m 7.903906e-01 13.074581058 31.554860053 28.041944362 1.080575e+01
## ASBGSCR_m 1.834901e+00 10.987238553 6.437088819 33.051063674 2.385109e+01
## ACBGRRS_m 3.889494e+01 1.744879688 0.587038276 0.060295872 3.352916e-02
## ACBGEAS_m 3.338499e+01 0.913358549 0.128018400 0.059376104 4.136407e-02
## ACBGDAS_m 1.704301e+01 0.007617333 0.644914717 0.041600216 1.397126e-01
## ATBGTJS_m 8.752217e-01 0.618632428 0.145721780 0.006474108 1.758253e-02
## ATBGSLI_m 8.365999e-01 0.224403788 0.063441163 0.080055407 1.125790e-02
## ASDGHRL_m 6.890413e-03 0.111204673 0.006169723 0.194442014 7.759106e-01
## ASDHSES_m 1.876075e-02 0.556426925 0.002447572 0.501648115 1.478341e+00
## ASDHELA_m 1.232529e-03 0.059164671 0.041758092 0.038787775 3.162083e-01
## ASDHENA_m 4.635636e-03 0.039100407 0.012633846 0.070203631 1.334955e-01
## ASDHELT_m 2.407853e-02 0.171753780 0.545972195 3.164793681 5.129644e+00
## ASDHPCS_m 7.452997e-04 2.266098455 0.351990764 0.186526703 7.187397e-03
## ASDHPLR_m 1.077812e-03 0.083209525 0.012035197 0.062294911 3.062168e-01
## Dim.11 Dim.12 Dim.13 Dim.14 Dim.15
## ASBGSEC_m 0.63955986 9.368193e-01 1.465769960 9.847803e-01 2.620806e-02
## ASBGSSB_m 15.88172874 3.363204e+01 3.503591850 2.775410e-01 1.359736e+00
## ASBGSB_m 0.53536641 5.646284e-01 1.310026916 4.466863e-01 1.560232e-01
## ASBGERL_m 3.74731085 3.520393e+01 6.197472080 1.102670e+00 5.567268e-01
## ASBGDRL_m 0.15862643 1.647890e+00 0.086636699 2.818033e-03 4.333015e-02
## ASBGSCR_m 0.38166366 1.440126e+00 8.568467278 2.222810e-01 2.647935e-02
## ACBGRRS_m 0.03529059 5.182866e-02 0.001931021 1.462078e-02 9.110109e-03
## ACBGEAS_m 0.72228661 2.976426e-02 0.009873929 2.379757e-03 1.457811e-02
## ACBGDAS_m 0.04253747 1.976623e-02 0.004035794 1.148688e-03 2.309069e-03
## ATBGTJS_m 0.05778415 3.710480e-04 0.010185433 4.680367e-05 3.411296e-04
## ATBGSLI_m 0.45789220 4.499035e-04 0.003794512 4.342789e-03 2.869416e-05
## ASDGHRL_m 14.02554060 3.164835e-01 5.703295312 8.175416e-01 2.356116e+00
## ASDHSES_m 28.53148893 4.992020e-01 20.295397292 5.161209e+00 5.130762e+00
## ASDHELA_m 4.16188714 3.762426e+00 0.114950169 1.471834e+00 2.877349e+01
## ASDHENA_m 1.83036888 3.702574e+00 0.031653175 3.661895e+00 3.868860e+01
## ASDHELT_m 10.88452201 1.375892e+01 47.098852691 1.450219e+01 3.162258e+00
## ASDHPCS_m 2.46530189 1.292754e+00 3.460080604 6.645081e+01 1.792033e+01
## ASDHPLR_m 15.44084357 3.140032e+00 2.133985287 4.875206e+00 1.773572e+00
## Dim.16 Dim.17 Dim.18
## ASBGSEC_m 4.757909e-03 2.130918e-02 1.650060e-03
## ASBGSSB_m 7.061701e-03 2.143028e-03 3.289838e-02
## ASBGSB_m 1.184974e-02 2.871838e-02 3.364711e-03
## ASBGERL_m 1.597608e-02 1.632146e-02 5.418197e-03
## ASBGDRL_m 5.842674e-03 2.617919e-03 1.021092e-02
## ASBGSCR_m 5.628022e-02 6.701704e-03 8.237560e-06
## ACBGRRS_m 3.082123e-03 7.963515e-04 3.081607e-04
## ACBGEAS_m 4.227838e-03 5.814497e-04 4.901645e-05
## ACBGDAS_m 2.513252e-03 5.995264e-04 5.239093e-05
## ATBGTJS_m 2.967592e-03 3.354733e-04 4.517704e-04
## ATBGSLI_m 4.084924e-06 2.215377e-03 1.698563e-04
## ASDGHRL_m 2.931571e+00 6.447222e+01 7.297250e+00
## ASDHSES_m 8.321229e+00 2.410734e+01 3.627347e+00
## ASDHELA_m 3.183599e+00 6.761283e+00 5.079888e+01
## ASDHENA_m 1.009892e+01 4.207374e+00 3.706044e+01
## ASDHELT_m 2.032381e-02 1.338970e-03 4.091554e-01
## ASDHPCS_m 4.832616e+00 3.678893e-01 4.778503e-03
## ASDHPLR_m 7.049718e+01 2.155487e-04 7.475654e-01
Below, I list the most important variables for the first, second and third components.
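The rankings below look like sorted absolute loadings; a sketch of how such a ranking could be produced from the prcomp fit (an assumed call, not the original one):
sort(abs(pca1_schools$rotation[, 1]), decreasing = TRUE)  # repeat with columns 2 and 3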
## ACBGRRS_m ATBGSLI_m ACBGDAS_m ASBGSEC_m ATBGTJS_m ASBGSSB_m ASBGSCR_m
## 0.31449719 0.30732168 0.30463392 0.30312547 0.30154208 0.29903059 0.29365794
## ASBGERL_m ACBGEAS_m ASBGSB_m ASBGDRL_m ASDHELT_m ASDHPLR_m ASDGHRL_m
## 0.29219922 0.29126000 0.29060116 0.28683116 0.06309860 0.05426964 0.05258626
## ASDHSES_m ASDHENA_m ASDHELA_m ASDHPCS_m
## 0.05115701 0.04746684 0.04611079 0.04187980
## ATBGTJS_m ACBGRRS_m ACBGEAS_m ACBGDAS_m ATBGSLI_m ASBGDRL_m
## 0.875120245 0.307791265 0.247222635 0.177846036 0.136573153 0.098160185
## ASBGSB_m ASBGSCR_m ASBGSEC_m ASBGERL_m ASDHPCS_m ASDHSES_m
## 0.093121845 0.067135149 0.054711931 0.029447589 0.023558814 0.019888502
## ASDHELT_m ASDHELA_m ASDHPLR_m ASDGHRL_m ASDHENA_m ASBGSSB_m
## 0.010858543 0.009153228 0.008587227 0.008174857 0.006145265 0.004821513
## ATBGSLI_m ACBGEAS_m ASBGERL_m ASBGSSB_m ASBGDRL_m ASBGSB_m ASBGSCR_m
## 0.63994976 0.48943919 0.29897292 0.26042880 0.18296356 0.18108760 0.17856474
## ACBGDAS_m ASBGSEC_m ASDHSES_m ASDHELT_m ASDGHRL_m ASDHPLR_m ASDHELA_m
## 0.17837561 0.15789793 0.11539748 0.08156790 0.07989205 0.07740117 0.05294837
## ASDHENA_m ASDHPCS_m ATBGTJS_m ACBGRRS_m
## 0.04589325 0.03290686 0.01902325 0.01020830
I find that:
- Instruction affected by resources shortages, Teaching limited by students not ready for learning, School discipline, Digital self-efficacy and Teachers' job satisfaction contribute roughly equally to the first component
- Teachers' job satisfaction dominates the second component
- Teaching limited by students not ready for learning and School emphasis on academic success contribute the most to the third component
In addition, I will apply T-distributed Stochastic Neighbor Embedding (t-SNE). The algorithm differs from PCA and MDS in that it is a stochastic method; hence, it is important to report the random seed, so that other researchers can reproduce the results. t-SNE represents a high-dimensional space in two or three dimensions, which makes it a proper tool for data visualization.
library(Rtsne)
set.seed(2137)  # t-SNE is stochastic; fixing the seed makes the results reproducible (seed value chosen arbitrarily)
tsne_data_school <- Rtsne(data_school_dr[, c(-1, -2, -21)],
pca = FALSE, perplexity = 5,
theta = 0.0
)
tsne_num <- data.frame(
TSNE1 = tsne_data_school$Y[, 1],
TSNE2 = tsne_data_school$Y[, 2],
label = as_factor(data_school_dr$IDCNTRY)
)
ggplot(tsne_num, aes(
x = TSNE1, y = TSNE2, col = label
)) +
geom_point()
The t-SNE results show nothing of interest for this analysis: there is no visible concentration of points by education system. In the following part, I will use PCA to evaluate the quality of the dimension reduction.
# Let's assign the data again to the main data frame
data_school_dr$pr1 <- pca1_schools$x[, 1]
data_school_dr$pr2 <- pca1_schools$x[, 2]
data_school_dr$pr3 <- pca1_schools$x[, 3]
data_school_dr$pr4 <- pca1_schools$x[, 4]
data_school_dr$pr5 <- pca1_schools$x[, 5]
data_school_dr$pr6 <- pca1_schools$x[, 6]
Now we will use K-means to verify whether clusters are assigned similarly as in chapter 3 of this document.
# Evaluate the number of clusters
p1_kmeans <- fviz_nbclust(as.data.frame(scale(data_school_dr[c(22:27)])), FUN = stats::kmeans, method = "wss", k.max = 22) + ggtitle("(A) Elbow method")
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
# Silhouette Evaluation
p2_kmeans <- fviz_nbclust(as.data.frame(scale(data_school_dr[c(22:27)])), FUN = stats::kmeans, method = "silhouette", k.max = 22) +
ggtitle("(B) Silhouette method")
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
clusters_kmeans <- flexclust::cclust(as.data.frame(scale(data_school_dr[c(22:27)])), 2, dist = "euclidean", method = "kmeans", weights = data_school_dr$sum_wgt, save.data = TRUE)
## Warning in flexclust::cclust(as.data.frame(scale(data_school_dr[c(22:27)])), :
## weights can only be used with hard competitive learning
Now let's draw the clusters.
Let's verify whether schools were assigned to the same clusters as in the case without dimension reduction.
# Rand Index to assess the similarity between clustering
data_school_common <- left_join(data_school[c(1, 2, 15)], data_school_dr[c(1,2, 28)], by=c("IDSCHOOL", "IDCNTRY")) %>% na.omit()
rand.index(data_school_common$cluster_kmeans, data_school_common$cluster_dim_red)
## [1] 0.5493772
The Rand index is about 0.55, i.e. roughly 55% of observation pairs are treated consistently by the two clusterings. That indicates moderate quality of the dimension reduction.
As a concluding remark, I apply Multidimensional Scaling to place the observations in a two-dimensional space and allow a comparison of the educational systems. Although the MDS dimensions are not easily interpretable, this will allow me to draw some conclusions about the similarity between the systems.
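The MDS fitting code does not appear in the shown chunk; below is a minimal sketch consistent with the isoMDS trace and the Dim.1/Dim.2 columns used later. The distance input (the six principal components) is an assumption:
library(MASS)
school_dist <- dist(scale(data_school_dr[, c(22:27)]))  # assumed input: the six PCA scores
mds_fit <- isoMDS(school_dist, k = 2)  # produces the iteration trace below
mds <- as_tibble(mds_fit$points)
colnames(mds) <- c("Dim.1", "Dim.2")
summary(mds)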
## initial value 28.977313
## iter 5 value 23.641882
## iter 5 value 23.626423
## iter 5 value 23.611226
## final value 23.611226
## converged
## Dim.1 Dim.2
## Min. :-9.2250 Min. :-8.01623
## 1st Qu.:-1.7184 1st Qu.:-1.34956
## Median :-0.1719 Median : 0.06132
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 1.6435 3rd Qu.: 1.44987
## Max. :11.4278 Max. : 7.68101
Now, let’s plot the relationship
data_school_dr$dim1_mds <- mds$Dim.1
data_school_dr$dim2_mds <- mds$Dim.2
ggplot(data_school_dr, aes(x= dim1_mds, y = dim2_mds, col = as_factor(IDCNTRY))) +
geom_point()
We can't see any clustering at this point, so let's aggregate the data.
mds_data_graph <- data_school_dr %>%
group_by(IDCNTRY) %>%
summarise(dim_1_mean = weighted.mean(dim1_mds, w = sum_wgt),
dim_2_mean = weighted.mean(dim2_mds, w = sum_wgt))
ggplot(mds_data_graph, aes(x= dim_1_mean, y = dim_2_mean, col = as_factor(IDCNTRY))) +
geom_point() +
geom_text(aes(label = as_factor(IDCNTRY)), size=3) +
theme(legend.position = "none")
The MDS results are fairly different from those of the hierarchical clustering. First, more countries stand apart from the rest: Poland, France, Bulgaria, Spain and Italy. At the same time, the smallest distance is often found between countries assigned to different clusters, e.g. the Czech Republic and Finland.
Association rules are usually applied in basket analysis. In particular, we are interested in assessing whether buying certain products is associated with buying other products, which can be useful for designing marketing strategies. However, the application of association rules goes far beyond basket analysis. In this part of the project, I apply association rules to measure the association between high achievements in reading and other factors.
For the analysis of association rules, we usually use three metrics (formalised right after this list):
Support - the share of transactions in the dataset that contain 'product A'
Confidence - the share of transactions containing both 'A' and 'B' among all transactions containing 'A'. It can be used as a measure of the strength of the rule.
Lift - the ratio of the confidence to the expected confidence. A lift above one indicates a positive dependence between products 'A' and 'B'.
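In symbols, for an itemset A and a rule A => B over a set of transactions T:
\[ \text{supp}(A) = \frac{|\{t \in T : A \subseteq t\}|}{|T|}, \qquad \text{conf}(A \Rightarrow B) = \frac{\text{supp}(A \cup B)}{\text{supp}(A)}, \qquad \text{lift}(A \Rightarrow B) = \frac{\text{conf}(A \Rightarrow B)}{\text{supp}(B)} \]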
The goal of this part of the project is to determine the association between school characteristics and the high achievements in reading in the 4th grade.
Since association rules are usually applied to binary variables, I start by modifying the data for the analysis.
dataPIRLS21$mean_read = rowMeans(dataPIRLS21[,496:500], na.rm = TRUE)
dataPIRLS21$q75 <- quantile(dataPIRLS21$mean_read, probs = 0.75)
dataPIRLS21$high_score <- ifelse(dataPIRLS21$mean_read > dataPIRLS21$q75, "HighScore", "LowModScore" )
# Modification
dataPIRLS21$no_dig <- as.factor(ifelse(dataPIRLS21$ASDGSEC == 3, "NoDig", "Dig"))
dataPIRLS21$no_bel <- as.factor(ifelse(dataPIRLS21$ASDGSSB == 3, "NoSenseOfSchoolBelonging", "HighSenseOfSchoolBelonging"))
dataPIRLS21$bullying <- as.factor(ifelse(dataPIRLS21$ASDGSB == 3, "Bullying", "NoBullying"))
dataPIRLS21$no_eng <- as.factor(ifelse(dataPIRLS21$ASDGERL == 3, "NoEarlyLiteracy", "EarlyLiteracy"))
dataPIRLS21$disorder <- as.factor(ifelse(dataPIRLS21$ASDGDRL == 3, "DisorderlyBehavAtSchool", "NoDisorderlyBehavAtSchool"))
dataPIRLS21$conf_read <- as.factor(ifelse(dataPIRLS21$ASDGSCR == 3, "ConfidentInReading", "LackOfConfInReading"))
dataPIRLS21$res_short <- as.factor(ifelse(dataPIRLS21$ASDGHRL == 3, "ResourcesShortages", "NoResourcesShort"))
dataPIRLS21$jobsas <- as.factor(ifelse(dataPIRLS21$ATDGTJS == 3, "SatisfiedTeachers", "NotSatisfiedTeachers"))
dataPIRLS21$notread <- as.factor(ifelse(dataPIRLS21$ATDGSLI == 3, "NotReadyForSchool", "ReadyForSchool"))
table(dataPIRLS21$high_score)
##
## HighScore LowModScore
## 30345 91035
Hence, by construction 25% of the population is classified as high_score (the cut-off is the 75th percentile of the mean reading score). I save the dataset to csv and use it in the following part.
# Clean the data before re-reading them as "transactions"
dataPIRLS21_trans <- dataPIRLS21[,c(1, 532:541)]
write.csv(dataPIRLS21_trans, "achievements.csv", row.names = F)
## Warning in asMethod(object): removing duplicated items in transactions
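The step that re-creates the data as a transactions object is not shown above (the warning comes from that coercion); a minimal sketch, assuming the dataPIRLS21_ar name used below:
library(arules)
trans_df <- read.csv("achievements.csv")
trans_df[] <- lapply(trans_df, as.factor)  # every column must be a factor before coercion
dataPIRLS21_ar <- as(trans_df, "transactions")  # emits the duplicate-items warning above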
## NoResourcesShort EarlyLiteracy
## 0.97448509 0.93527764
## HighSenseOfSchoolBelonging NoBullying
## 0.91464821 0.88638985
## Dig NoDisorderlyBehavAtSchool
## 0.87710496 0.84583127
## ReadyForSchool LackOfConfInReading
## 0.83130664 0.81304993
## NotSatisfiedTeachers LowModScore
## 0.79422475 0.75000000
## HighScore ConfidentInReading
## 0.25000000 0.17593508
## DisorderlyBehavAtSchool NoDig
## 0.14312902 0.11186357
## Bullying NoSenseOfSchoolBelonging
## 0.10258692 0.07432032
## SatisfiedTeachers 724
## 0.07413907 0.07044818
## 246 203
## 0.05881529 0.05467952
## NoEarlyLiteracy 620
## 0.05369089 0.05034602
## 380 250
## 0.04481793 0.04398583
## 348 752
## 0.04376339 0.04263470
## 956 705
## 0.04213215 0.04209919
## 40 276
## 0.04054210 0.03990773
## 703 208
## 0.03988301 0.03971824
## 372 440
## 0.03841654 0.03808700
## 196 NotReadyForSchool
## 0.03780689 0.03705718
## 428 528
## 0.03599440 0.03553304
## 957 616
## 0.03525292 0.03442907
## 100 191
## 0.03330862 0.03243533
## 470 ResourcesShortages
## 0.02496293 0.02006920
## bullying conf_read
## 0.00000000 0.00000000
## disorder high_score
## 0.00000000 0.00000000
## IDCNTRY jobsas
## 0.00000000 0.00000000
## no_bel no_dig
## 0.00000000 0.00000000
## no_eng notread
## 0.00000000 0.00000000
## res_short
## 0.00000000
The most frequent items in the dataset are:
- No resources shortage (97.4%)
- Early literacy activities (93.5%)
- High sense of school belonging (91.5%)
This generally shows that schools in the European Union are characterized by a high quality of pupil support.
Below I use a scatter plot showing the association between the items' support (the share of transactions containing both 'A' and 'B' among all transactions) and confidence (the share of transactions containing both 'A' and 'B' among all transactions containing 'A').
library(arulesViz)
achievements_rules <- apriori(dataPIRLS21_ar, parameter = list(support = 0.1, confidence = 0.15, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 12138
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[44 item(s), 121380 transaction(s)] done [0.11s].
## sorting and recoding items ... [15 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.01s].
## writing ... [8799 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
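The scatter plot itself (which triggers the jitter note above) is presumably drawn with arulesViz's plot method; a minimal call would be:
plot(achievements_rules, measure = c("support", "confidence"), shading = "lift")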
Interestingly, we find three clouds of points. In the lower-left part of the graph, there is a cloud with both low confidence and low support. The second cloud sits in the upper-left part of the graph: high confidence but low support. These are rules whose antecedent appears only moderately often among all transactions, but when it does appear, 'A' and 'B' are usually 'bought' together. The third cloud is the largest and shows high confidence with variable support.
In this part, I use the Apriori algorithm to determine the association rules. Apriori exploits prior knowledge about the most frequent items in the transaction dataset. It relies on the simple property that every subset of a frequent itemset must itself be frequent. Consequently, if an itemset fails the support threshold, all of its supersets automatically fail as well, which prunes the search space.
I look for rules where HighScore is implied by other factors.
# Apriori association rules algorithm
rules.score <- apriori(dataPIRLS21_ar, parameter = list(support = 0.20,
confidence = 0.20),
appearance = list(rhs = "HighScore"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 24276
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[44 item(s), 121380 transaction(s)] done [0.13s].
## sorting and recoding items ... [11 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.00s].
## writing ... [106 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
I find 106 rules. Below I inspect only the five with the highest confidence.
# Order rules by confidence
rules.score.byconf <- sort(rules.score, by = "confidence",
decreasing = T)
inspect(rules.score.byconf[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Dig,
## EarlyLiteracy,
## HighSenseOfSchoolBelonging,
## LackOfConfInReading,
## NoBullying,
## NoResourcesShort} => {HighScore} 0.2028423 0.3318910 0.6111715 1.327564 24621
## [2] {Dig,
## HighSenseOfSchoolBelonging,
## LackOfConfInReading,
## NoBullying,
## NoResourcesShort} => {HighScore} 0.2068875 0.3311356 0.6247817 1.324542 25112
## [3] {EarlyLiteracy,
## HighSenseOfSchoolBelonging,
## LackOfConfInReading,
## NoBullying,
## NoDisorderlyBehavAtSchool,
## NoResourcesShort} => {HighScore} 0.2003048 0.3308386 0.6054457 1.323355 24313
## [4] {HighSenseOfSchoolBelonging,
## LackOfConfInReading,
## NoBullying,
## NoDisorderlyBehavAtSchool,
## NoResourcesShort} => {HighScore} 0.2041522 0.3295607 0.6194678 1.318243 24780
## [5] {Dig,
## EarlyLiteracy,
## LackOfConfInReading,
## NoBullying,
## NoResourcesShort} => {HighScore} 0.2083127 0.3285601 0.6340171 1.314240 25285
# Order rules by support
rules.score.support <- sort(rules.score, by = "support",
                            decreasing = T)
inspect(rules.score.support[1:5])
## lhs rhs support confidence
## [1] {} => {HighScore} 0.2500000 0.2500000
## [2] {NoResourcesShort} => {HighScore} 0.2491349 0.2556580
## [3] {LackOfConfInReading} => {HighScore} 0.2409705 0.2963785
## [4] {LackOfConfInReading, NoResourcesShort} => {HighScore} 0.2404268 0.3005768
## [5] {EarlyLiteracy} => {HighScore} 0.2400890 0.2567034
## coverage lift count
## [1] 1.0000000 1.000000 30345
## [2] 0.9744851 1.022632 30240
## [3] 0.8130499 1.185514 29249
## [4] 0.7998847 1.202307 29183
## [5] 0.9352776 1.026814 29142
Not surprisingly, the items that imply a high score are simply the most frequent ones in the data: NoResourcesShort, LackOfConfInReading, EarlyLiteracy, and so on.
In addition to the rules involving school characteristics, I study the associations between all factors defined in the data. This time I take a more conservative approach and raise the minimum support to 0.3, keeping the minimum confidence at 0.2.
# Apriori association rules algorithm
rules.score_all <- apriori(dataPIRLS21_ar, parameter = list(support = 0.30, confidence = 0.20))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.3 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 36414
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[44 item(s), 121380 transaction(s)] done [0.22s].
## sorting and recoding items ... [10 item(s)] done [0.03s].
## creating transaction tree ... done [0.21s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.00s].
## writing ... [5066 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
rules.score_all_conf <- sort(rules.score_all, by = "confidence",
decreasing = T)
inspect(rules.score_all_conf[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Dig,
## LackOfConfInReading,
## NoBullying,
## NoDisorderlyBehavAtSchool,
## ReadyForSchool} => {NoResourcesShort} 0.4919756 0.9875637 0.4981710 1.013421 59716
## [2] {Dig,
## EarlyLiteracy,
## LackOfConfInReading,
## NoBullying,
## NoDisorderlyBehavAtSchool,
## ReadyForSchool} => {NoResourcesShort} 0.4755067 0.9875607 0.4814961 1.013418 57717
## [3] {Dig,
## EarlyLiteracy,
## LackOfConfInReading,
## NoBullying,
## NoDisorderlyBehavAtSchool,
## NotSatisfiedTeachers,
## ReadyForSchool} => {NoResourcesShort} 0.4396111 0.9874715 0.4451887 1.013326 53360
## [4] {Dig,
## LackOfConfInReading,
## NoBullying,
## NoDisorderlyBehavAtSchool} => {NoResourcesShort} 0.5870077 0.9874577 0.5944637 1.013312 71251
## [5] {Dig,
## LackOfConfInReading,
## NoBullying,
## NoDisorderlyBehavAtSchool,
## NotSatisfiedTeachers,
## ReadyForSchool} => {NoResourcesShort} 0.4544241 0.9874505 0.4601994 1.013305 55158
# Order rules by support
rules.score_all_supp <- sort(rules.score_all, by = "support",
                             decreasing = T)
inspect(rules.score_all_supp[1:5])
## lhs rhs support confidence
## [1] {} => {NoResourcesShort} 0.9744851 0.9744851
## [2] {} => {EarlyLiteracy} 0.9352776 0.9352776
## [3] {EarlyLiteracy} => {NoResourcesShort} 0.9162959 0.9797047
## [4] {NoResourcesShort} => {EarlyLiteracy} 0.9162959 0.9402873
## [5] {} => {HighSenseOfSchoolBelonging} 0.9146482 0.9146482
## coverage lift count
## [1] 1.0000000 1.000000 118283
## [2] 1.0000000 1.000000 113524
## [3] 0.9352776 1.005356 111220
## [4] 0.9744851 1.005356 111220
## [5] 1.0000000 1.000000 111020
The associations found by sorting on support are particularly interesting, as they show that early literacy activities and an absence of resource shortages co-occur very frequently (although, with a lift of about 1.005, the two items are close to statistically independent). The direction is plausible: parents with a high socioeconomic status, who usually invest more in their children's education, might be more inclined to send them to schools with no resource shortages, or even to invest in the school's resources themselves.
As there are too many rules to visualize at once, I show only 20 of them.
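The plotting call is not preserved in the document; a sketch of how the top rules could be drawn with arulesViz, assuming the "graph" method and a selection of the top 20 rules by lift:
# Plot the 20 rules with the highest lift as an items-and-rules graph
top_rules <- head(sort(rules.score_all, by = "lift"), 20)
plot(top_rules, method = "graph")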
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
The graph shows that the strongest association is found between NoResourcesShort and HighScore; LackOfConfInReading and EarlyLiteracy also carry substantial weight.
This project covered a broad application of unsupervised learning methods to data exploration in educational research. Clustering captured similarities between the educational systems, dimension reduction helped us get a better sense of the structure in the data, and the association rules provided some supporting evidence in favour of investing in human capital.