Do not fixate on means! Assessing similarities between EU education systems using clustering techniques.

Szymczak Wojciech

14-12-2023


Introduction

In a critically acclaimed research paper, Ludger Woessmann and Eric Hanushek (2008) claimed that the quality of education is a strong predictor of economic growth. The authors argue that increasing educational attainment without any guarantee of improving the effectiveness of the education system is fruitless. However, in their analysis Hanushek & Woessmann (2008) rely on achievement scores from Large-Scale Assessments (e.g. PISA, TIMSS), which usually give only a narrow perspective on the condition of an education system. In response to this issue, the goal of this project is to explore the similarities between the education systems of the European Union. I apply several clustering techniques, such as Hierarchical Clustering, K-Means, PAM, CLARA and Latent Class Analysis, based on PIRLS 2021 data. In the following parts, I use Principal Component Analysis to reduce the dimensions used for clustering. Finally, I use Multidimensional Scaling (MDS) to compare the features of the educational systems in a two-dimensional space.


1. Data Cleaning and Variable Selection

setwd("C:/Users/wojci/OneDrive/WSZ/Unsupervised Learning/Project Clustering")

The PIRLS 2021 data were downloaded from the website of the International Association for the Evaluation of Educational Achievement (IEA) and merged using the IEA IDB Analyzer. Out of the participating countries, we focus on 23 educational systems in 22 EU countries.
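The merged file can then be loaded into R. A minimal sketch, assuming the IDB Analyzer output was exported as an SPSS file (the file name below is illustrative, not the actual path used):

library(haven)   # read_sav() and as_factor() for labelled SPSS data
library(dplyr)

# Illustrative file name - replace with the actual merged IDB Analyzer output
dataPIRLS21 <- read_sav("PIRLS2021_merged_students_schools_teachers.sav")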

Aggregate basic data on achievements

dataPIRLS21_agg <- dataPIRLS21 %>% 
                    mutate(IDCNTRY = as_factor(IDCNTRY)) %>%
                    group_by(IDCNTRY) %>% 
                    # average the five plausible values per country (mean() takes a single vector)
                    summarise(achievements_read = round(mean(c(mean(ASRREA01), mean(ASRREA02), mean(ASRREA03),
                                                                mean(ASRREA04), mean(ASRREA05))), 2),
                              desire = mean(ACBG11J, na.rm = TRUE)) %>% 
                    arrange(desc(achievements_read))

pic1 <- ggplot(dataPIRLS21_agg, aes(x = reorder(IDCNTRY, achievements_read), y = achievements_read, fill = achievements_read))+
  geom_bar(stat = "summary", fun = "mean", position = position_stack(reverse = TRUE)) +
  theme_minimal() +
  scale_fill_viridis(option = "D") +
  labs(title = "") +
  xlab("Country") +
  ylab("Fourth Graders' Achievements in Reading") +
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  )+
  coord_flip()



# For comparison: students' self-reported desire to do well in school
pic2 <- ggplot(dataPIRLS21_agg, aes(x = reorder(IDCNTRY, desire), y = desire, fill = desire))+
  geom_bar(stat = "summary", fun = "mean", position = position_stack(reverse = TRUE)) +
  theme_minimal() +
  scale_fill_viridis(option = "D") +
  labs(title = "") +
  xlab("Country") +
  ylab("Students' desire to do well in school") +
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  )+
  coord_flip()


pic1

pic2

The overlap between countries’ positions on reading achievement and on the desire to do well in school is limited. For instance, in Croatia and Ireland the reported desire is relatively low despite high achievement in reading.
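To quantify this limited overlap, one could compute a rank correlation between the two country-level measures. A brief sketch using the aggregated data from above (this check is an addition, not part of the original analysis):

# Spearman rank correlation between country-level achievement and desire to do well
cor(dataPIRLS21_agg$achievements_read, dataPIRLS21_agg$desire,
    method = "spearman", use = "complete.obs")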

2. Empirical Strategy

I will use several indicators related to the school climate and students’ mental health to model the clusters. PIRLS provides continuous scales in several domains; hence, the analysis will be based on continuous variables only. The clustering will be based on:

  • Digital self-efficacy (ASBGSEC)

  • Students’ sense of school belonging (ASBGSSB)

  • Students’ engagement in reading lessons (ASBGSB)

  • Disorderly behaviour during lessons (ASBGERL)

  • Students like reading (ASBGDRL)

  • Students’ confidence in reading (ASBGSCR)

  • Instruction affected by resources shortages (ACBGRRS)

  • School emphasis on academic success (ACBGEAS)

  • School discipline (ACBGDAS)

  • Teachers’ job satisfaction (ATBGTJS)

  • Teaching limited by students not ready for learning (ATBGSLI)

I set up an ID variable in the original data to track individuals later.

dataPIRLS21$ID <- 1:length(dataPIRLS21$IDCNTRY)
data_selected <- dataPIRLS21 %>% 
  dplyr::select(ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI, ID, IDCNTRY, IDSCHOOL, WGTADJ1) %>% 
  filter(
        across(everything(), ~ . != 999999)
    )

# I need to reweight the data to give each country an equal weight. Otherwise, the results would be dominated by Spain and France
# Calculate weights for each country
country_weights <- data_selected %>%
  group_by(IDCNTRY) %>%
  mutate(weight = 1/n() * WGTADJ1) 

# Almost equal weights
country_weights %>%  
  group_by(IDCNTRY) %>% 
  summarise(suma = sum(weight))
## # A tibble: 22 × 2
##    IDCNTRY               suma
##    <dbl+lbl>            <dbl>
##  1  40 [Austria]         1   
##  2 100 [Bulgaria]        1   
##  3 191 [Croatia]         1.04
##  4 196 [Cyprus]          1.01
##  5 203 [Czech Republic]  1.00
##  6 208 [Denmark]         1.09
##  7 246 [Finland]         1   
##  8 250 [France]          1.03
##  9 276 [Germany]         1.03
## 10 380 [Italy]           1.01
## # ℹ 12 more rows
# Merge the calculated weights back to the original dataset
data_selected$weight <- country_weights$weight
data_selected %>%  
  group_by(IDCNTRY) %>% 
  summarise(suma = sum(weight))
## # A tibble: 22 × 2
##    IDCNTRY               suma
##    <dbl+lbl>            <dbl>
##  1  40 [Austria]         1   
##  2 100 [Bulgaria]        1   
##  3 191 [Croatia]         1.04
##  4 196 [Cyprus]          1.01
##  5 203 [Czech Republic]  1.00
##  6 208 [Denmark]         1.09
##  7 246 [Finland]         1   
##  8 250 [France]          1.03
##  9 276 [Germany]         1.03
## 10 380 [Italy]           1.01
## # ℹ 12 more rows

3. Clustering

3.1 Hierarchical clustering

I start my analysis by aggregating the data at the education-system level. Then, I apply Hierarchical Clustering to examine the similarities between educational systems.

# Aggregate data 
data_country <- data_selected %>% 
  dplyr::select(IDCNTRY, ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI, weight) %>% 
  dplyr::group_by(IDCNTRY) %>% 
  summarise(ASBGSEC_m = weighted.mean(x = ASBGSEC, w = weight, na.rm = T),
            ASBGSSB_m = weighted.mean(ASBGSSB, w = weight, na.rm = T),
            ASBGSB_m = weighted.mean(ASBGSB, w = weight, na.rm = T), 
            ASBGERL_m = weighted.mean(ASBGERL, w = weight, na.rm = T),
            ASBGDRL_m = weighted.mean(ASBGDRL, w = weight, na.rm = T),
            ASBGSCR_m = weighted.mean(ASBGSCR, w = weight, na.rm = T), 
            ACBGRRS_m = weighted.mean(ACBGRRS, w = weight, na.rm = T), 
            ACBGEAS_m = weighted.mean(ACBGEAS, w = weight, na.rm = T), 
            ACBGDAS_m = weighted.mean(ACBGDAS, w = weight, na.rm = T), 
            ATBGTJS_m = weighted.mean(ATBGTJS, w = weight, na.rm = T), 
            ATBGSLI_m = weighted.mean(ATBGSLI, w = weight, na.rm = T)) 
# Scale the variables 
data_country2 <- scale(data_country[,-1])
rownames(data_country2) = as_factor(data_country$IDCNTRY)
# Compute distances and the hierarchical clustering (dendrogram);
# data_country2 already excludes IDCNTRY, so all eleven scaled variables enter the distance matrix
hclusters <- hclust(dist(data_country2), method = "average")

# Evaluate the number of clusters
p1 <- fviz_nbclust(data_country2, FUN = hcut, method = "wss", 
                   k.max = 10) +
  ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(data_country2, FUN = hcut, method = "silhouette", 
                   k.max = 10) +
  ggtitle("(B) Silhouette method")

# Display plots side by side
gridExtra::grid.arrange(p1, p2,  nrow = 1)

# The methods point to 9 clusters, but the silhouette value for 6 clusters is quite similar
hclust9_graph <- fviz_dend(hclusters, k = 9,      # cut into nine groups
          cex = 0.5,                 # label size
          k_colors = c("#264653", 
                       "#2A9D8F",
                       "#BFBCCB",
                       "#DCEED1",
                       "#E9C46A",
                       "#F4A261",
                       "#E76F51",
                       "#784F41",
                       "#B80053"),
          color_labels_by_k = TRUE,  # color labels by groups
          ggtheme = theme_classic())  +
  theme(
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12)
  ) + labs(title = "9 clusters (optimal)")
  
# Graph for 6 clusters
hclust6_graph <- fviz_dend(hclusters, k = 6,      # cut into six groups
          cex = 0.5,                 # label size
          k_colors = c("#264653", 
                       "#2A9D8F",
                       "#BFBCCB",
                       "#DCEED1",
                       "#E9C46A",
                       "#F4A261"),
          color_labels_by_k = TRUE,  # color labels by groups
          ggtheme = theme_classic())  +
  theme(
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12)
  ) + labs(title = "6 clusters")


# Display plots side by side
gridExtra::grid.arrange(hclust6_graph, hclust9_graph,  nrow = 1)

The results of the hierarchical clustering are insightful. Even without looking at achievement data, we find similarities between educational systems. Some results, such as the Czech Republic and Slovakia or Denmark and Sweden ending up in the same cluster, are unsurprising given their cultural similarity. The case of Poland is interesting. Polish public opinion commonly holds that the education system is poor and should look up to the Finnish system, yet the results clearly show similarities between the two systems.

3.2 K-means and PAM

There are 22 educational systems in the analysis data, so I set the maximum number of clusters to 22. For computational reasons, I examine school-level data only, as clustering individual students is generally not feasible.

data_school <- data_selected %>% 
  dplyr::select(IDCNTRY, IDSCHOOL, ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI,weight) %>%
  dplyr::group_by(IDCNTRY, IDSCHOOL) %>% 
  summarise(ASBGSEC_m = weighted.mean(x = ASBGSEC, w = weight, na.rm = T),
            ASBGSSB_m = weighted.mean(ASBGSSB, w = weight, na.rm = T),
            ASBGSB_m = weighted.mean(ASBGSB, w = weight, na.rm = T), 
            ASBGERL_m = weighted.mean(ASBGERL, w = weight, na.rm = T),
            ASBGDRL_m = weighted.mean(ASBGDRL, w = weight, na.rm = T),
            ASBGSCR_m = weighted.mean(ASBGSCR, w = weight, na.rm = T), 
            ACBGRRS_m = weighted.mean(ACBGRRS, w = weight, na.rm = T), 
            ACBGEAS_m = weighted.mean(ACBGEAS, w = weight, na.rm = T), 
            ACBGDAS_m = weighted.mean(ACBGDAS, w = weight, na.rm = T), 
            ATBGTJS_m = weighted.mean(ATBGTJS, w = weight, na.rm = T), 
            ATBGSLI_m = weighted.mean(ATBGSLI, w = weight, na.rm = T),
            sum_wgt = sum(weight)) 


# Evaluate the number of clusters
p1_kmeans <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2, 14)])), FUN = stats::kmeans, method ="wss",k.max = 22) + ggtitle("(A) Elbow method")

# Silhouette Evaluation
p2_kmeans <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2, 14)])), FUN = stats::kmeans, method = "silhouette", k.max = 22) +
  ggtitle("(B) Silhouette method")

clusters_kmeans <- flexclust::cclust(as.data.frame(scale(data_school[-c(1,2, 14)])), 2, dist = "euclidean", method = "kmeans", weights = data_school$sum_wgt, save.data = TRUE)


# Evaluate the number of clusters
p1_pam <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2)])), FUN = cluster::pam, method ="wss", k.max = 22) + ggtitle("(A) Elbow method")

# Silhouette Evaluation
p2_pam <- fviz_nbclust(as.data.frame(scale(data_school[-c(1,2)])), FUN = cluster::pam, method = "silhouette", k.max = 22) + ggtitle("(B) Silhouette method")


# Arrange the evaluation graphs  
kmeans_combined_ev  <- gridExtra::grid.arrange(p1_kmeans, p2_kmeans)

pam_combined_ev  <- gridExtra::grid.arrange(p1_pam, p2_pam)

# Silhouette and WSS suggest 2 clusters
kmeans <- eclust(as.data.frame(scale(data_school[-c(1,2)])), FUNcluster = "kmeans" , k = 2)

pam <- eclust(as.data.frame(scale(data_school[-c(1,2)])), FUNcluster = "pam" , k = 2)

data_school$cluster_kmeans <- clusters_kmeans@second
data_school$cluster_pam <- pam$cluster

Based on the graphs, the two partitions look similar. However, it is possible to quantify the agreement between the cluster assignments. In this project, I apply the Rand index, defined as:

\[ R = \frac{a+b}{a+b+c+d} \]

where a is the number of pairs of observations assigned to the same cluster by both clusterings, b is the number of pairs assigned to different clusters by both clusterings, and the remaining terms, c and d, count the pairs on which the two clusterings disagree.
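To make the definition concrete, the index can also be computed by hand from two label vectors by comparing all pairs of observations. A small illustrative implementation (the rand.index() call below relies on a package; this sketch only mirrors the formula):

# Hand-rolled Rand index: share of observation pairs on which two clusterings agree
# (same cluster in both, or different clusters in both)
rand_index_manual <- function(labels1, labels2) {
  same1 <- outer(labels1, labels1, "==")   # pairs in the same cluster, clustering 1
  same2 <- outer(labels2, labels2, "==")   # pairs in the same cluster, clustering 2
  agree <- same1 == same2                  # pairs treated consistently by both
  mean(agree[upper.tri(agree)])            # each unordered pair counted once
}

# Toy example with two partitions of six observations
rand_index_manual(c(1, 1, 2, 2, 3, 3), c(1, 1, 1, 2, 2, 2))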

# Rand Index to assess the similarity between clustering 
rand.index(data_school$cluster_kmeans, data_school$cluster_pam)
## [1] 0.6985897

We find that K-means and PAM produce fairly similar partitions of the schools, with a Rand index of about 0.70.

Below I summarize the cluster profiles obtained with K-means and PAM.

data_school %>% 
  group_by(cluster_kmeans) %>% 
  summarize(Digital_Skills = weighted.mean(ASBGSEC_m, sum_wgt),
            Sense_of_School_Belonging = weighted.mean(ASBGSSB_m, sum_wgt),
            Engagement = weighted.mean(ASBGSB_m, sum_wgt),
            DisorderlyBehavior = weighted.mean(ASBGERL_m, sum_wgt),
            LikeReading = weighted.mean(ASBGDRL_m, sum_wgt),
            ConfidentReading = weighted.mean(ASBGSCR_m, sum_wgt),
            ResourcesShortage = weighted.mean(ACBGRRS_m, sum_wgt),
            EmphasisSuccess = weighted.mean(ACBGEAS_m, sum_wgt),
            Discipline = weighted.mean(ACBGDAS_m, sum_wgt),
            TeacherSatisfaction = weighted.mean(ATBGTJS_m, sum_wgt),
            NotReadyForLearning = weighted.mean(ATBGSLI_m, sum_wgt))
## # A tibble: 2 × 12
##   cluster_kmeans Digital_Skills Sense_of_School_Belonging Engagement
##            <int>          <dbl>                     <dbl>      <dbl>
## 1              1           10.8                     10.7       10.2 
## 2              2           10.1                      9.78       9.72
## # ℹ 8 more variables: DisorderlyBehavior <dbl>, LikeReading <dbl>,
## #   ConfidentReading <dbl>, ResourcesShortage <dbl>, EmphasisSuccess <dbl>,
## #   Discipline <dbl>, TeacherSatisfaction <dbl>, NotReadyForLearning <dbl>
data_school %>% 
  group_by(cluster_pam) %>% 
  summarize(Digital_Skills = mean(ASBGSEC_m),
            Sense_of_School_Belonging = mean(ASBGSSB_m),
            Engagement = mean(ASBGSB_m),
            DisorderlyBehavior = mean(ASBGERL_m),
            LikeReading = mean(ASBGDRL_m),
            ConfidentReading = mean(ASBGSCR_m),
            ResourcesShortage = mean(ACBGRRS_m),
            EmphasisSuccess = mean(ACBGEAS_m),
            Discipline = mean(ACBGDAS_m),
            TeacherSatisfaction = mean(ATBGTJS_m),
            NotReadyForLearning = mean(ATBGSLI_m))
## # A tibble: 2 × 12
##   cluster_pam Digital_Skills Sense_of_School_Belonging Engagement
##         <int>          <dbl>                     <dbl>      <dbl>
## 1           1           10.7                     10.7       10.2 
## 2           2           10.2                      9.95       9.78
## # ℹ 8 more variables: DisorderlyBehavior <dbl>, LikeReading <dbl>,
## #   ConfidentReading <dbl>, ResourcesShortage <dbl>, EmphasisSuccess <dbl>,
## #   Discipline <dbl>, TeacherSatisfaction <dbl>, NotReadyForLearning <dbl>

Although there are some differences in the mean values of the calculated variables, in general all measures are higher in one of the clusters, indicating better conditions for learning. Based on these results, let’s look at the country composition of the clusters.

data_school %>% 
  group_by(cluster_kmeans, as_factor(IDCNTRY)) %>% 
  summarise(n = n(), sum_wgt = sum(sum_wgt)) %>%
  mutate(freq = round(n * sum_wgt / sum(n * sum_wgt), 2) * 100) %>% 
  filter(cluster_kmeans == 1) %>% 
  arrange(-sum_wgt)
## # A tibble: 22 × 5
## # Groups:   cluster_kmeans [1]
##    cluster_kmeans `as_factor(IDCNTRY)`     n sum_wgt  freq
##             <int> <fct>                <int>   <dbl> <dbl>
##  1              1 Bulgaria               120   0.896    16
##  2              1 Finland                150   0.671    15
##  3              1 Portugal               128   0.671    13
##  4              1 Poland                  95   0.624     9
##  5              1 Spain                  246   0.564    21
##  6              1 Sweden                  50   0.492     4
##  7              1 Denmark                 54   0.408     3
##  8              1 Cyprus                  58   0.398     3
##  9              1 Malta                   28   0.390     2
## 10              1 Netherlands             31   0.379     2
## # ℹ 12 more rows
data_school %>% 
  group_by(cluster_kmeans, as_factor(IDCNTRY)) %>% 
  summarise(n = n(), sum_wgt = sum(sum_wgt)) %>%
  mutate(freq = round(n * sum_wgt / sum(n * sum_wgt), 2) * 100) %>% 
  filter(cluster_kmeans == 2) %>% 
  arrange(-sum_wgt)
## # A tibble: 22 × 5
## # Groups:   cluster_kmeans [1]
##    cluster_kmeans `as_factor(IDCNTRY)`     n sum_wgt  freq
##             <int> <fct>                <int>   <dbl> <dbl>
##  1              2 Belgium (Flemish)      101   1.01      6
##  2              2 Belgium (French)       139   0.989     8
##  3              2 France                 164   0.967    10
##  4              2 Latvia                 137   0.933     8
##  5              2 Lithuania               84   0.921     5
##  6              2 Croatia                113   0.877     6
##  7              2 Italy                  141   0.875     7
##  8              2 Netherlands             65   0.849     3
##  9              2 Slovenia                98   0.824     5
## 10              2 Slovak Republic        120   0.802     6
## # ℹ 12 more rows
# Merge with achievements 

dataPIRLS21_agg_schools <- dataPIRLS21 %>% 
                    group_by(IDCNTRY, IDSCHOOL) %>% 
                    # average the five plausible values per school
                    summarise(achievements_read = round(mean(c(mean(ASRREA01), mean(ASRREA02), mean(ASRREA03),
                                                                mean(ASRREA04), mean(ASRREA05))), 2)) %>% 
  dplyr::select(IDCNTRY, IDSCHOOL, achievements_read)

data_school %>% 
  left_join(dataPIRLS21_agg_schools, by = c('IDCNTRY', 'IDSCHOOL')) %>% 
  group_by(cluster_kmeans) %>% 
  summarise(Reading = weighted.mean(x = achievements_read, w = sum_wgt) )
## # A tibble: 2 × 2
##   cluster_kmeans Reading
##            <int>   <dbl>
## 1              1    548.
## 2              2    525.

We find that, in general, schools in the Belgian, Latvian, French, Lithuanian and Italian education systems fall mostly into the cluster with less favourable characteristics. In contrast, Bulgarian and Finnish schools are concentrated in the cluster with more favourable well-being indicators and higher reading achievement.

3.3 Latent Class Analysis

For this part I slightly modify the approach by using the categorical PIRLS indices rather than the continuous scales.

data_selected_lca <- dataPIRLS21 %>% 
  dplyr::select(ASDGSEC, ASDGSSB, ASDGSB, ASDGERL, ASDGDRL, ASDGSCR, ASDGHRL, ATDGTJS, ATDGSLI, ID, IDCNTRY, IDSCHOOL, WGTADJ1) %>% 
  filter(
        across(everything(), ~ . != 9)
    ) %>% 
  group_by(IDCNTRY) %>%
  mutate(weight = 1/n() * WGTADJ1) %>% 
  ungroup()

# Modification
data_selected_lca$no_dig    <-     ifelse(data_selected_lca$ASDGSEC == 3, as.integer(1), as.integer(0)) 
data_selected_lca$no_bel    <-     ifelse(data_selected_lca$ASDGSSB == 3, as.integer(1), as.integer(0))
data_selected_lca$bullying  <-   ifelse(data_selected_lca$ASDGSB  == 3, as.integer(1), as.integer(0))
data_selected_lca$no_eng    <-     ifelse(data_selected_lca$ASDGERL == 3, as.integer(1), as.integer(0))
data_selected_lca$disorder  <-   ifelse(data_selected_lca$ASDGDRL == 3, as.integer(1), as.integer(0))
data_selected_lca$conf_read <-  ifelse(data_selected_lca$ASDGSCR == 3, as.integer(1), as.integer(0))
data_selected_lca$res_short <-  ifelse(data_selected_lca$ASDGHRL == 3, as.integer(1), as.integer(0))
data_selected_lca$jobsas    <-     ifelse(data_selected_lca$ATDGTJS == 3, as.integer(1), as.integer(0))
data_selected_lca$notread   <-    ifelse(data_selected_lca$ATDGSLI == 3, as.integer(1), as.integer(0))

Latent Class Analysis.

data_lca <- data_selected_lca %>% 
  dplyr::select(no_dig, no_bel, bullying, no_eng, disorder, conf_read, res_short, jobsas, notread, ID, IDCNTRY, IDSCHOOL)  


form <- cbind(no_dig, no_bel, bullying, no_eng, disorder, conf_read, res_short, jobsas, notread) ~ 1 
M0 <- poLCA(formula = form, data = data_lca[, 1:9] + 1, nclass = 1)
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $no_dig
##            Pr(1)  Pr(2)
## class 1:  0.8878 0.1122
## 
## $no_bel
##            Pr(1)  Pr(2)
## class 1:  0.9311 0.0689
## 
## $bullying
##            Pr(1)  Pr(2)
## class 1:  0.9044 0.0956
## 
## $no_eng
##            Pr(1)  Pr(2)
## class 1:  0.9515 0.0485
## 
## $disorder
##            Pr(1)  Pr(2)
## class 1:  0.8582 0.1418
## 
## $conf_read
##            Pr(1)  Pr(2)
## class 1:  0.8329 0.1671
## 
## $res_short
##            Pr(1)  Pr(2)
## class 1:  0.9761 0.0239
## 
## $jobsas
##            Pr(1)  Pr(2)
## class 1:  0.9182 0.0818
## 
## $notread
##            Pr(1)  Pr(2)
## class 1:  0.9625 0.0375
## 
## Estimated class population shares 
##  1 
##  
## Predicted class memberships (by modal posterior prob.) 
##  1 
##  
## ========================================================= 
## Fit for 1 latent classes: 
## ========================================================= 
## number of observations: 84583 
## number of estimated parameters: 9 
## residual degrees of freedom: 502 
## maximum log-likelihood: -213703 
##  
## AIC(1): 427424
## BIC(1): 427508.1
## G^2(1): 15713.19 (Likelihood ratio/deviance statistic) 
## X^2(1): 50810.25 (Chi-square goodness of fit) 
## 
M1 <- poLCA(formula = form, data = data_lca[,1:9] + 1, nclass = 2)
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $no_dig
##            Pr(1)  Pr(2)
## class 1:  0.9052 0.0948
## class 2:  0.8292 0.1708
## 
## $no_bel
##            Pr(1)  Pr(2)
## class 1:  0.9951 0.0049
## class 2:  0.7156 0.2844
## 
## $bullying
##            Pr(1)  Pr(2)
## class 1:  0.9610 0.0390
## class 2:  0.7138 0.2862
## 
## $no_eng
##            Pr(1)  Pr(2)
## class 1:  0.9978 0.0022
## class 2:  0.7957 0.2043
## 
## $disorder
##            Pr(1)  Pr(2)
## class 1:  0.9134 0.0866
## class 2:  0.6726 0.3274
## 
## $conf_read
##            Pr(1)  Pr(2)
## class 1:  0.8851 0.1149
## class 2:  0.6573 0.3427
## 
## $res_short
##            Pr(1)  Pr(2)
## class 1:  0.9842 0.0158
## class 2:  0.9489 0.0511
## 
## $jobsas
##            Pr(1)  Pr(2)
## class 1:  0.9238 0.0762
## class 2:  0.8996 0.1004
## 
## $notread
##            Pr(1)  Pr(2)
## class 1:  0.9680 0.0320
## class 2:  0.9441 0.0559
## 
## Estimated class population shares 
##  0.7708 0.2292 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.8388 0.1612 
##  
## ========================================================= 
## Fit for 2 latent classes: 
## ========================================================= 
## number of observations: 84583 
## number of estimated parameters: 19 
## residual degrees of freedom: 492 
## maximum log-likelihood: -208506.8 
##  
## AIC(2): 417051.6
## BIC(2): 417229.2
## G^2(2): 5320.79 (Likelihood ratio/deviance statistic) 
## X^2(2): 7580.892 (Chi-square goodness of fit) 
## 
M2 <- poLCA(formula = form, data = data_lca[,1:9] + 1, nclass = 3)
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $no_dig
##            Pr(1)  Pr(2)
## class 1:  0.9014 0.0986
## class 2:  0.8654 0.1346
## class 3:  0.7903 0.2097
## 
## $no_bel
##            Pr(1)  Pr(2)
## class 1:  0.9846 0.0154
## class 2:  0.9198 0.0802
## class 3:  0.3695 0.6305
## 
## $bullying
##            Pr(1)  Pr(2)
## class 1:  0.9727 0.0273
## class 2:  0.6332 0.3668
## class 3:  0.7738 0.2262
## 
## $no_eng
##            Pr(1)  Pr(2)
## class 1:  0.9886 0.0114
## class 2:  0.9936 0.0064
## class 3:  0.4479 0.5521
## 
## $disorder
##            Pr(1)  Pr(2)
## class 1:  0.9267 0.0733
## class 2:  0.5805 0.4195
## class 3:  0.7405 0.2595
## 
## $conf_read
##            Pr(1)  Pr(2)
## class 1:  0.8922 0.1078
## class 2:  0.6007 0.3993
## class 3:  0.7115 0.2885
## 
## $res_short
##            Pr(1)  Pr(2)
## class 1:  0.9863 0.0137
## class 2:  0.9271 0.0729
## class 3:  0.9756 0.0244
## 
## $jobsas
##            Pr(1)  Pr(2)
## class 1:  0.9239 0.0761
## class 2:  0.9003 0.0997
## class 3:  0.8973 0.1027
## 
## $notread
##            Pr(1)  Pr(2)
## class 1:  0.9690 0.0310
## class 2:  0.9335 0.0665
## class 3:  0.9574 0.0426
## 
## Estimated class population shares 
##  0.7698 0.1601 0.0701 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.8596 0.0831 0.0573 
##  
## ========================================================= 
## Fit for 3 latent classes: 
## ========================================================= 
## number of observations: 84583 
## number of estimated parameters: 29 
## residual degrees of freedom: 482 
## maximum log-likelihood: -207047.2 
##  
## AIC(3): 414152.5
## BIC(3): 414423.5
## G^2(3): 2401.674 (Likelihood ratio/deviance statistic) 
## X^2(3): 4184.523 (Chi-square goodness of fit) 
##  
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
## 
M3 <- poLCA(formula = form, data = data_lca[,1:9] + 1, nclass = 4)
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $no_dig
##            Pr(1)  Pr(2)
## class 1:  0.9481 0.0519
## class 2:  0.9252 0.0748
## class 3:  0.5275 0.4725
## class 4:  0.8125 0.1875
## 
## $no_bel
##            Pr(1)  Pr(2)
## class 1:  0.8989 0.1011
## class 2:  0.9829 0.0171
## class 3:  0.9389 0.0611
## class 4:  0.1612 0.8388
## 
## $bullying
##            Pr(1)  Pr(2)
## class 1:  0.4951 0.5049
## class 2:  0.9704 0.0296
## class 3:  0.9508 0.0492
## class 4:  0.7577 0.2423
## 
## $no_eng
##            Pr(1)  Pr(2)
## class 1:  0.9693 0.0307
## class 2:  0.9862 0.0138
## class 3:  0.9256 0.0744
## class 4:  0.3974 0.6026
## 
## $disorder
##            Pr(1)  Pr(2)
## class 1:  0.5253 0.4747
## class 2:  0.9177 0.0823
## class 3:  0.8563 0.1437
## class 4:  0.7209 0.2791
## 
## $conf_read
##            Pr(1)  Pr(2)
## class 1:  0.6065 0.3935
## class 2:  0.9070 0.0930
## class 3:  0.5401 0.4599
## class 4:  0.7434 0.2566
## 
## $res_short
##            Pr(1)  Pr(2)
## class 1:  0.9504 0.0496
## class 2:  0.9903 0.0097
## class 3:  0.8837 0.1163
## class 4:  0.9839 0.0161
## 
## $jobsas
##            Pr(1)  Pr(2)
## class 1:  0.9128 0.0872
## class 2:  0.9254 0.0746
## class 3:  0.8721 0.1279
## class 4:  0.9022 0.0978
## 
## $notread
##            Pr(1)  Pr(2)
## class 1:  0.9501 0.0499
## class 2:  0.9714 0.0286
## class 3:  0.9029 0.0971
## class 4:  0.9615 0.0385
## 
## Estimated class population shares 
##  0.1144 0.7515 0.0874 0.0467 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.0745 0.842 0.0437 0.0398 
##  
## ========================================================= 
## Fit for 4 latent classes: 
## ========================================================= 
## number of observations: 84583 
## number of estimated parameters: 39 
## residual degrees of freedom: 472 
## maximum log-likelihood: -206370.5 
##  
## AIC(4): 412818.9
## BIC(4): 413183.4
## G^2(4): 1048.111 (Likelihood ratio/deviance statistic) 
## X^2(4): 1922.927 (Chi-square goodness of fit) 
##  
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
## 
M4 <- poLCA(formula = form, data = data_lca[,1:9] + 1, nclass = 5)
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $no_dig
##            Pr(1)  Pr(2)
## class 1:  0.8874 0.1126
## class 2:  0.9487 0.0513
## class 3:  0.4746 0.5254
## class 4:  0.9219 0.0781
## class 5:  0.8112 0.1888
## 
## $no_bel
##            Pr(1)  Pr(2)
## class 1:  0.9706 0.0294
## class 2:  0.9006 0.0994
## class 3:  0.9303 0.0697
## class 4:  0.9846 0.0154
## class 5:  0.1874 0.8126
## 
## $bullying
##            Pr(1)  Pr(2)
## class 1:  0.9838 0.0162
## class 2:  0.5058 0.4942
## class 3:  0.9402 0.0598
## class 4:  0.9675 0.0325
## class 5:  0.7572 0.2428
## 
## $no_eng
##            Pr(1)  Pr(2)
## class 1:  0.9830 0.0170
## class 2:  0.9727 0.0273
## class 3:  0.9184 0.0816
## class 4:  0.9859 0.0141
## class 5:  0.3988 0.6012
## 
## $disorder
##            Pr(1)  Pr(2)
## class 1:  0.8582 0.1418
## class 2:  0.5222 0.4778
## class 3:  0.8621 0.1379
## class 4:  0.9293 0.0707
## class 5:  0.7223 0.2777
## 
## $conf_read
##            Pr(1)  Pr(2)
## class 1:  0.9507 0.0493
## class 2:  0.6044 0.3956
## class 3:  0.4144 0.5856
## class 4:  0.8954 0.1046
## class 5:  0.7447 0.2553
## 
## $res_short
##            Pr(1)  Pr(2)
## class 1:  0.9766 0.0234
## class 2:  0.9509 0.0491
## class 3:  0.8612 0.1388
## class 4:  0.9909 0.0091
## class 5:  0.9840 0.0160
## 
## $jobsas
##            Pr(1)  Pr(2)
## class 1:  0.8184 0.1816
## class 2:  0.9141 0.0859
## class 3:  0.8763 0.1237
## class 4:  0.9456 0.0544
## class 5:  0.9039 0.0961
## 
## $notread
##            Pr(1)  Pr(2)
## class 1:  0.8587 0.1413
## class 2:  0.9505 0.0495
## class 3:  0.9059 0.0941
## class 4:  0.9924 0.0076
## class 5:  0.9639 0.0361
## 
## Estimated class population shares 
##  0.1376 0.116 0.0605 0.6374 0.0484 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.0337 0.0712 0.0385 0.8104 0.0462 
##  
## ========================================================= 
## Fit for 5 latent classes: 
## ========================================================= 
## number of observations: 84583 
## number of estimated parameters: 49 
## residual degrees of freedom: 462 
## maximum log-likelihood: -206271.6 
##  
## AIC(5): 412641.1
## BIC(5): 413099.1
## G^2(5): 850.3277 (Likelihood ratio/deviance statistic) 
## X^2(5): 1691.787 (Chi-square goodness of fit) 
##  
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
## 
M5 <- poLCA(formula = form, data = data_lca[,1:9] + 1, nclass = 6)
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $no_dig
##            Pr(1)  Pr(2)
## class 1:  0.5553 0.4447
## class 2:  0.5288 0.4712
## class 3:  0.8994 0.1006
## class 4:  0.9333 0.0667
## class 5:  0.8124 0.1876
## class 6:  0.9446 0.0554
## 
## $no_bel
##            Pr(1)  Pr(2)
## class 1:  0.9153 0.0847
## class 2:  0.9621 0.0379
## class 3:  0.9698 0.0302
## class 4:  0.9875 0.0125
## class 5:  0.1084 0.8916
## class 6:  0.8939 0.1061
## 
## $bullying
##            Pr(1)  Pr(2)
## class 1:  0.9751 0.0249
## class 2:  0.8965 0.1035
## class 3:  0.9854 0.0146
## class 4:  0.9652 0.0348
## class 5:  0.7506 0.2494
## class 6:  0.5106 0.4894
## 
## $no_eng
##            Pr(1)  Pr(2)
## class 1:  0.8917 0.1083
## class 2:  0.9635 0.0365
## class 3:  0.9758 0.0242
## class 4:  0.9906 0.0094
## class 5:  0.3597 0.6403
## class 6:  0.9637 0.0363
## 
## $disorder
##            Pr(1)  Pr(2)
## class 1:  0.9022 0.0978
## class 2:  0.8090 0.1910
## class 3:  0.8672 0.1328
## class 4:  0.9400 0.0600
## class 5:  0.7180 0.2820
## class 6:  0.5321 0.4679
## 
## $conf_read
##            Pr(1)  Pr(2)
## class 1:  0.5199 0.4801
## class 2:  0.4932 0.5068
## class 3:  0.9276 0.0724
## class 4:  0.9041 0.0959
## class 5:  0.7491 0.2509
## class 6:  0.6108 0.3892
## 
## $res_short
##            Pr(1)  Pr(2)
## class 1:  0.9649 0.0351
## class 2:  0.7062 0.2938
## class 3:  0.9982 0.0018
## class 4:  0.9859 0.0141
## class 5:  0.9812 0.0188
## class 6:  0.9571 0.0429
## 
## $jobsas
##            Pr(1)  Pr(2)
## class 1:  0.9348 0.0652
## class 2:  0.8062 0.1938
## class 3:  0.8272 0.1728
## class 4:  0.9649 0.0351
## class 5:  0.9030 0.0970
## class 6:  0.9179 0.0821
## 
## $notread
##            Pr(1)  Pr(2)
## class 1:  1.0000 0.0000
## class 2:  0.7570 0.2430
## class 3:  0.9211 0.0789
## class 4:  0.9895 0.0105
## class 5:  0.9586 0.0414
## class 6:  0.9573 0.0427
## 
## Estimated class population shares 
##  0.0605 0.0275 0.2339 0.5152 0.0407 0.1222 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.027 0.0123 0.0942 0.7203 0.0344 0.1118 
##  
## ========================================================= 
## Fit for 6 latent classes: 
## ========================================================= 
## number of observations: 84583 
## number of estimated parameters: 59 
## residual degrees of freedom: 452 
## maximum log-likelihood: -206209.5 
##  
## AIC(6): 412536.9
## BIC(6): 413088.3
## G^2(6): 726.0864 (Likelihood ratio/deviance statistic) 
## X^2(6): 1129.648 (Chi-square goodness of fit) 
##  
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
## 
M6 <- poLCA(formula = form, data = data_lca[,1:9] + 1, nclass = 7)
## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $no_dig
##            Pr(1)  Pr(2)
## class 1:  0.9965 0.0035
## class 2:  0.8996 0.1004
## class 3:  0.9151 0.0849
## class 4:  0.9393 0.0607
## class 5:  0.8216 0.1784
## class 6:  0.6179 0.3821
## class 7:  0.4018 0.5982
## 
## $no_bel
##            Pr(1)  Pr(2)
## class 1:  0.9893 0.0107
## class 2:  0.9695 0.0305
## class 3:  0.9840 0.0160
## class 4:  0.8780 0.1220
## class 5:  0.1120 0.8880
## class 6:  0.9453 0.0547
## class 7:  0.9356 0.0644
## 
## $bullying
##            Pr(1)  Pr(2)
## class 1:  0.9211 0.0789
## class 2:  0.9873 0.0127
## class 3:  0.9727 0.0273
## class 4:  0.3638 0.6362
## class 5:  0.7628 0.2372
## class 6:  0.8470 0.1530
## class 7:  0.9518 0.0482
## 
## $no_eng
##            Pr(1)  Pr(2)
## class 1:  0.9740 0.0260
## class 2:  0.9750 0.0250
## class 3:  0.9919 0.0081
## class 4:  0.9633 0.0367
## class 5:  0.4116 0.5884
## class 6:  0.9637 0.0363
## class 7:  0.9221 0.0779
## 
## $disorder
##            Pr(1)  Pr(2)
## class 1:  0.8446 0.1554
## class 2:  0.8436 0.1564
## class 3:  0.9588 0.0412
## class 4:  0.4878 0.5122
## class 5:  0.7211 0.2789
## class 6:  0.7558 0.2442
## class 7:  0.8692 0.1308
## 
## $conf_read
##            Pr(1)  Pr(2)
## class 1:  0.7150 0.2850
## class 2:  0.9346 0.0654
## class 3:  0.9655 0.0345
## class 4:  0.6081 0.3919
## class 5:  0.7402 0.2598
## class 6:  0.5955 0.4045
## class 7:  0.4998 0.5002
## 
## $res_short
##            Pr(1)  Pr(2)
## class 1:  0.9666 0.0334
## class 2:  0.9978 0.0022
## class 3:  0.9924 0.0076
## class 4:  0.9522 0.0478
## class 5:  0.9829 0.0171
## class 6:  0.7289 0.2711
## class 7:  0.9047 0.0953
## 
## $jobsas
##            Pr(1)  Pr(2)
## class 1:  0.9361 0.0639
## class 2:  0.8182 0.1818
## class 3:  0.9608 0.0392
## class 4:  0.9174 0.0826
## class 5:  0.9060 0.0940
## class 6:  0.7220 0.2780
## class 7:  0.9037 0.0963
## 
## $notread
##            Pr(1)  Pr(2)
## class 1:  0.9907 0.0093
## class 2:  0.9216 0.0784
## class 3:  0.9815 0.0185
## class 4:  0.9559 0.0441
## class 5:  0.9612 0.0388
## class 6:  0.0338 0.9662
## class 7:  0.9779 0.0221
## 
## Estimated class population shares 
##  0.2087 0.1799 0.4076 0.0794 0.045 0.0073 0.0722 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.1725 0.0791 0.5795 0.0812 0.0435 0.0078 0.0363 
##  
## ========================================================= 
## Fit for 7 latent classes: 
## ========================================================= 
## number of observations: 84583 
## number of estimated parameters: 69 
## residual degrees of freedom: 442 
## maximum log-likelihood: -206182.8 
##  
## AIC(7): 412503.7
## BIC(7): 413148.5
## G^2(7): 672.8392 (Likelihood ratio/deviance statistic) 
## X^2(7): 835.1604 (Chi-square goodness of fit) 
##  
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
## 
plot(1:7, c(M0$bic, M1$bic, M2$bic, M3$bic, M4$bic, M5$bic, M6$bic))

which.max(c(M0$bic, M1$bic, M2$bic, M3$bic, M4$bic, M5$bic, M6$bic))
## [1] 1
which.max(c(M0$aic, M1$aic, M2$aic, M3$aic, M4$aic, M5$aic, M6$aic))
## [1] 1

Important note: in the poLCA package the Information Criteria are reported as positive values. Researchers usually minimise the IC; here I select the solution with the maximum value, which points to the one-class model (note also that several of the multi-class models report that the maximum likelihood was not found). I therefore fail to find distinct latent classes in this case; EU schools appear generally similar on these indicators.
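For easier comparison, the fit statistics of the estimated models can be collected into a single table. A short sketch using the model objects fitted above:

# Collect the information criteria of the LCA models into one comparison table
lca_models <- list(M0, M1, M2, M3, M4, M5, M6)
data.frame(classes = 1:7,
           AIC  = sapply(lca_models, function(m) m$aic),
           BIC  = sapply(lca_models, function(m) m$bic),
           llik = sapply(lca_models, function(m) m$llik))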

4. Dimension Reduction

In this part, I apply dimension reduction to prepare the data for re-clustering. The aim is twofold. First, I apply dimension-reduction algorithms to combine a larger set of variables; second, I test whether the clustering results differ afterwards. In addition to the prior information on the systems’ characteristics, I add data on students’ home background. The goal is to assess whether schools are assigned to the same clusters after dimension reduction; if so, this indicates good quality of the dimension reduction.

data_school_dr <- dataPIRLS21 %>% 
  dplyr::select(IDCNTRY, IDSCHOOL, ASBGSEC, ASBGSSB, ASBGSB, ASBGERL, ASBGDRL, ASBGSCR, ACBGRRS, ACBGEAS, ACBGDAS, ATBGTJS, ATBGSLI, ASDGHRL, ASDHSES, ASDHELA, ASDHENA, ASDHELT, ASDHPCS, ASDHPLR, WGTADJ1) %>%
  dplyr::group_by(IDCNTRY, IDSCHOOL) %>% 
  mutate(weight = 1/n() * WGTADJ1) %>% 
  summarise(ASBGSEC_m = weighted.mean(x = ASBGSEC, w = weight, na.rm = T),
            ASBGSSB_m = weighted.mean(ASBGSSB, w = weight, na.rm = T),
            ASBGSB_m = weighted.mean(ASBGSB, w = weight, na.rm = T), 
            ASBGERL_m = weighted.mean(ASBGERL, w = weight, na.rm = T),
            ASBGDRL_m = weighted.mean(ASBGDRL, w = weight, na.rm = T),
            ASBGSCR_m = weighted.mean(ASBGSCR, w = weight, na.rm = T), 
            ACBGRRS_m = weighted.mean(ACBGRRS, w = weight, na.rm = T), 
            ACBGEAS_m = weighted.mean(ACBGEAS, w = weight, na.rm = T), 
            ACBGDAS_m = weighted.mean(ACBGDAS, w = weight, na.rm = T), 
            ATBGTJS_m = weighted.mean(ATBGTJS, w = weight, na.rm = T), 
            ATBGSLI_m = weighted.mean(ATBGSLI, w = weight, na.rm = T),
            ASDGHRL_m = weighted.mean(ASDGHRL, w = weight, na.rm = T), 
            ASDHSES_m = weighted.mean(ASDHSES, w = weight, na.rm = T), 
            ASDHELA_m = weighted.mean(ASDHELA, w = weight, na.rm = T), 
            ASDHENA_m = weighted.mean(ASDHENA, w = weight, na.rm = T), 
            ASDHELT_m = weighted.mean(ASDHELT, w = weight, na.rm = T), 
            ASDHPCS_m = weighted.mean(ASDHPCS, w = weight, na.rm = T), 
            ASDHPLR_m = weighted.mean(ASDHPLR, w = weight, na.rm = T),
            sum_wgt = sum(weight)) %>% 
  na.omit()



data_school_dr = apply_labels(data_school_dr,
                      ASBGSEC_m = "Digital Self-Efficacy",
                      ASBGSSB_m = "Students' Sense of School Belonging", 
                      ASBGSB_m = "Students' engagement in reading lessons",
                      ASBGERL_m = "Disorderly behaviour during lessons", 
                      ASBGDRL_m = "Students like reading", 
                      ASBGSCR_m = "Students' confidence in reading", 
                      ACBGRRS_m = "Instruction affected by resources shortages", 
                      ACBGEAS_m = "School's emphasis on academic success", 
                      ACBGDAS_m = "School's Discipline", 
                      ATBGTJS_m = "Teachers' job satisfaction", 
                      ATBGSLI_m = "Teching limited by students not ready for learning", 
                      ASDGHRL_m = "Home resources for learning" , 
                      ASDHSES_m = "Home socioeconomic status", 
                      ASDHELA_m = "Early literacy before school", 
                      ASDHENA_m = "Early numeracy before school",
                      ASDHELT_m = "Early literacy tasks", 
                      ASDHPCS_m = "Parents' perception of school", 
                      ASDHPLR_m = "Parents like reading"
                      )

4.1 Principal Component Analysis

schools_PrePCA <- preProcess(data_school_dr[, c(-1, -2, -21)], method=c("center", "scale"))
schools_PrePCA.s <- predict(schools_PrePCA, data_school_dr[, c(-1, -2, -21)])
summary(schools_PrePCA.s)
##    ASBGSEC_m        ASBGSSB_m         ASBGSB_m        ASBGERL_m     
##  Min.   : 3.681   Min.   : 6.060   Min.   : 6.032   Min.   : 6.843  
##  1st Qu.: 9.875   1st Qu.: 9.536   1st Qu.: 9.431   1st Qu.: 9.275  
##  Median :10.346   Median :10.164   Median : 9.896   Median : 9.836  
##  Mean   :10.317   Mean   :10.166   Mean   : 9.889   Mean   : 9.938  
##  3rd Qu.:10.781   3rd Qu.:10.782   3rd Qu.:10.357   3rd Qu.:10.524  
##  Max.   :13.236   Max.   :12.853   Max.   :12.567   Max.   :13.335  
##    ASBGDRL_m        ASBGSCR_m        ACBGRRS_m        ACBGEAS_m     
##  Min.   : 6.735   Min.   : 6.423   Min.   : 3.469   Min.   : 3.332  
##  1st Qu.: 9.185   1st Qu.: 9.592   1st Qu.: 9.626   1st Qu.: 8.776  
##  Median : 9.702   Median :10.004   Median :10.400   Median : 9.724  
##  Mean   : 9.754   Mean   : 9.997   Mean   :10.675   Mean   : 9.864  
##  3rd Qu.:10.267   3rd Qu.:10.435   3rd Qu.:11.621   3rd Qu.:11.007  
##  Max.   :14.988   Max.   :12.607   Max.   :14.681   Max.   :16.974  
##    ACBGDAS_m        ATBGTJS_m       ATBGSLI_m        ASDGHRL_m    
##  Min.   : 3.786   Min.   : 3.84   Min.   : 2.796   Min.   :1.000  
##  1st Qu.: 9.282   1st Qu.: 8.68   1st Qu.: 9.102   1st Qu.:1.658  
##  Median :10.289   Median :10.39   Median :10.279   Median :1.800  
##  Mean   :10.330   Mean   :10.22   Mean   :10.406   Mean   :1.796  
##  3rd Qu.:11.161   3rd Qu.:11.82   3rd Qu.:11.576   3rd Qu.:1.944  
##  Max.   :12.961   Max.   :13.24   Max.   :15.080   Max.   :3.000  
##    ASDHSES_m       ASDHELA_m       ASDHENA_m       ASDHELT_m    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.487   1st Qu.:1.455   1st Qu.:1.500   1st Qu.:1.889  
##  Median :1.714   Median :1.568   Median :1.615   Median :2.167  
##  Mean   :1.750   Mean   :1.574   Mean   :1.619   Mean   :2.155  
##  3rd Qu.:2.000   3rd Qu.:1.679   3rd Qu.:1.737   3rd Qu.:2.439  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :3.000  
##    ASDHPCS_m       ASDHPLR_m    
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.214   1st Qu.:1.692  
##  Median :1.409   Median :1.846  
##  Mean   :1.432   Mean   :1.855  
##  3rd Qu.:1.615   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :3.000

Correlation between variables

schools.cov <-cor(schools_PrePCA.s)
my_colors <- brewer.pal(5, "Spectral")
my_colors <- colorRampPalette(my_colors)(100)
ord <- order(schools.cov[1, ])
data_ord <- schools.cov[ord, ord]

plotcorr(schools.cov, col=my_colors[data_ord*50+50] , mar=c(1,1,1,1))

Most of the correlations are not large. We run Bartlett’s test and the KMO criterion to verify whether PCA is justified.

KMO(schools_PrePCA.s)
## 
## ── Kaiser-Meyer-Olkin criterion (KMO) ──────────────────────────────────────────
## 
## ✔ The overall KMO value for your data is middling.
##   These data are probably suitable for factor analysis.
## 
##   Overall: 0.728
## 
##   For each variable:
## ASBGSEC_m ASBGSSB_m  ASBGSB_m ASBGERL_m ASBGDRL_m ASBGSCR_m ACBGRRS_m ACBGEAS_m 
##     0.756     0.632     0.676     0.644     0.714     0.803     0.781     0.861 
## ACBGDAS_m ATBGTJS_m ATBGSLI_m ASDGHRL_m ASDHSES_m ASDHELA_m ASDHENA_m ASDHELT_m 
##     0.799     0.768     0.863     0.724     0.718     0.665     0.597     0.669 
## ASDHPCS_m ASDHPLR_m 
##     0.817     0.892
cortest.bartlett(schools_PrePCA.s)
## $chisq
## [1] 21515.11
## 
## $p.value
## [1] 0
## 
## $df
## [1] 153

The results of the tests above indicate that the data are suitable for dimension reduction.

Below I apply Principal Component Analysis to the scaled data. I use a scree plot to show the share of variance explained by each of the components.

# Screeplot + PCA
pca2_schools <- princomp(schools_PrePCA.s) 
pca1_schools<-prcomp(schools_PrePCA.s, center=FALSE, scale.=FALSE)


fviz_eig(pca2_schools, choice = "eigenvalue", barfill = "#B6163A", barcolor = "#999999", linecolor = "#000000",  addlabels = TRUE,   main = "Eigenvalues") +
  theme_minimal() 

Based on Kaiser’s rule, we should keep 6 dimensions in our data. Let’s see how much of the variance is explained by 6 components.

values_pca <- get_eigenvalue(pca2_schools)
values_pca 
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  6.060274929      29.36754519                    29.36755
## Dim.2  3.694100628      17.90127815                    47.26882
## Dim.3  2.759697740      13.37324611                    60.64207
## Dim.4  1.997518923       9.67979637                    70.32187
## Dim.5  1.944683844       9.42376235                    79.74563
## Dim.6  1.500695033       7.27223265                    87.01786
## Dim.7  0.828436179       4.01452693                    91.03239
## Dim.8  0.616834980       2.98912663                    94.02151
## Dim.9  0.350385934       1.69793860                    95.71945
## Dim.10 0.292686015       1.41833001                    97.13778
## Dim.11 0.220763730       1.06980111                    98.20758
## Dim.12 0.122630632       0.59425697                    98.80184
## Dim.13 0.101845197       0.49353263                    99.29537
## Dim.14 0.057625568       0.27924830                    99.57462
## Dim.15 0.041456078       0.20089241                    99.77551
## Dim.16 0.027791448       0.13467485                    99.91019
## Dim.17 0.009378922       0.04544941                    99.95564
## Dim.18 0.009154388       0.04436134                   100.00000

We can reduce the data from 18 to 6 dimensions and still keep 87% of the total variance!
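Both statements can be checked programmatically from the eigenvalue table above; a brief sketch:

# Number of components retained under Kaiser's rule (eigenvalue > 1)
sum(values_pca$eigenvalue > 1)

# Cumulative share of variance explained by the first six components
values_pca$cumulative.variance.percent[6]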

fviz_pca_var(pca2_schools, col.var = "black", parse = T) +
  theme_minimal() 

pca_var <- get_pca_var(pca1_schools)
pca_var$contrib
##               Dim.1        Dim.2       Dim.3        Dim.4        Dim.5
## ASBGSEC_m 9.1885050  0.299339535  2.49317550 7.242114e-01  0.883598795
## ASBGSSB_m 8.9419292  0.002324699  6.78231586 2.078221e+00  0.198372933
## ASBGSB_m  8.4449032  0.867167807  3.27927196 1.142761e+00  0.079653169
## ASBGERL_m 8.5380386  0.086716047  8.93848078 1.546169e+00  0.520789399
## ASBGDRL_m 8.2272115  0.963542184  3.34756639 1.175796e+00  0.060387110
## ASBGSCR_m 8.6234988  0.450712817  3.18853669 3.014597e-01  0.572406185
## ACBGRRS_m 9.8908481  9.473546306  0.01042093 9.532732e+00 29.654797001
## ACBGEAS_m 8.4832385  6.111903145 23.95507213 2.588751e+01  0.251437119
## ACBGDAS_m 9.2801823  3.162921244  3.18178594 1.649107e+00 64.776188007
## ATBGTJS_m 9.0927625 76.583544314  0.03618841 1.248331e+01  0.068076985
## ATBGSLI_m 9.4446617  1.865222617 40.95356930 4.328204e+01  2.769851175
## ASDGHRL_m 0.2765315  0.006682829  0.63827396 4.640316e-02  0.017476665
## ASDHSES_m 0.2617039  0.039555252  1.33165775 7.565485e-02  0.059826225
## ASDHELA_m 0.2126205  0.008378159  0.28035294 5.030649e-03  0.008120919
## ASDHENA_m 0.2253101  0.003776428  0.21061902 2.940177e-03  0.015457999
## ASDHELT_m 0.3981433  0.011790795  0.66533226 4.819521e-02  0.002736848
## ASDHPCS_m 0.1753918  0.055501774  0.10828614 1.925799e-06  0.053705967
## ASDHPLR_m 0.2945194  0.007374047  0.59909404 1.845871e-02  0.007117501
##                  Dim.6        Dim.7        Dim.8        Dim.9       Dim.10
## ASBGSEC_m 6.838004e-01  2.955106753 52.433878546 24.682213120 1.575316e+00
## ASBGSSB_m 2.345477e+00 22.594712928  2.164335882  0.195348496 2.243227e-04
## ASBGSB_m  1.217464e+00 12.463401466  4.790603162  9.442578512 5.521553e+01
## ASBGERL_m 2.035786e+00 31.129109019  0.077091812  0.120353300 1.616446e-01
## ASBGDRL_m 7.903906e-01 13.074581058 31.554860053 28.041944362 1.080575e+01
## ASBGSCR_m 1.834901e+00 10.987238553  6.437088819 33.051063674 2.385109e+01
## ACBGRRS_m 3.889494e+01  1.744879688  0.587038276  0.060295872 3.352916e-02
## ACBGEAS_m 3.338499e+01  0.913358549  0.128018400  0.059376104 4.136407e-02
## ACBGDAS_m 1.704301e+01  0.007617333  0.644914717  0.041600216 1.397126e-01
## ATBGTJS_m 8.752217e-01  0.618632428  0.145721780  0.006474108 1.758253e-02
## ATBGSLI_m 8.365999e-01  0.224403788  0.063441163  0.080055407 1.125790e-02
## ASDGHRL_m 6.890413e-03  0.111204673  0.006169723  0.194442014 7.759106e-01
## ASDHSES_m 1.876075e-02  0.556426925  0.002447572  0.501648115 1.478341e+00
## ASDHELA_m 1.232529e-03  0.059164671  0.041758092  0.038787775 3.162083e-01
## ASDHENA_m 4.635636e-03  0.039100407  0.012633846  0.070203631 1.334955e-01
## ASDHELT_m 2.407853e-02  0.171753780  0.545972195  3.164793681 5.129644e+00
## ASDHPCS_m 7.452997e-04  2.266098455  0.351990764  0.186526703 7.187397e-03
## ASDHPLR_m 1.077812e-03  0.083209525  0.012035197  0.062294911 3.062168e-01
##                Dim.11       Dim.12       Dim.13       Dim.14       Dim.15
## ASBGSEC_m  0.63955986 9.368193e-01  1.465769960 9.847803e-01 2.620806e-02
## ASBGSSB_m 15.88172874 3.363204e+01  3.503591850 2.775410e-01 1.359736e+00
## ASBGSB_m   0.53536641 5.646284e-01  1.310026916 4.466863e-01 1.560232e-01
## ASBGERL_m  3.74731085 3.520393e+01  6.197472080 1.102670e+00 5.567268e-01
## ASBGDRL_m  0.15862643 1.647890e+00  0.086636699 2.818033e-03 4.333015e-02
## ASBGSCR_m  0.38166366 1.440126e+00  8.568467278 2.222810e-01 2.647935e-02
## ACBGRRS_m  0.03529059 5.182866e-02  0.001931021 1.462078e-02 9.110109e-03
## ACBGEAS_m  0.72228661 2.976426e-02  0.009873929 2.379757e-03 1.457811e-02
## ACBGDAS_m  0.04253747 1.976623e-02  0.004035794 1.148688e-03 2.309069e-03
## ATBGTJS_m  0.05778415 3.710480e-04  0.010185433 4.680367e-05 3.411296e-04
## ATBGSLI_m  0.45789220 4.499035e-04  0.003794512 4.342789e-03 2.869416e-05
## ASDGHRL_m 14.02554060 3.164835e-01  5.703295312 8.175416e-01 2.356116e+00
## ASDHSES_m 28.53148893 4.992020e-01 20.295397292 5.161209e+00 5.130762e+00
## ASDHELA_m  4.16188714 3.762426e+00  0.114950169 1.471834e+00 2.877349e+01
## ASDHENA_m  1.83036888 3.702574e+00  0.031653175 3.661895e+00 3.868860e+01
## ASDHELT_m 10.88452201 1.375892e+01 47.098852691 1.450219e+01 3.162258e+00
## ASDHPCS_m  2.46530189 1.292754e+00  3.460080604 6.645081e+01 1.792033e+01
## ASDHPLR_m 15.44084357 3.140032e+00  2.133985287 4.875206e+00 1.773572e+00
##                 Dim.16       Dim.17       Dim.18
## ASBGSEC_m 4.757909e-03 2.130918e-02 1.650060e-03
## ASBGSSB_m 7.061701e-03 2.143028e-03 3.289838e-02
## ASBGSB_m  1.184974e-02 2.871838e-02 3.364711e-03
## ASBGERL_m 1.597608e-02 1.632146e-02 5.418197e-03
## ASBGDRL_m 5.842674e-03 2.617919e-03 1.021092e-02
## ASBGSCR_m 5.628022e-02 6.701704e-03 8.237560e-06
## ACBGRRS_m 3.082123e-03 7.963515e-04 3.081607e-04
## ACBGEAS_m 4.227838e-03 5.814497e-04 4.901645e-05
## ACBGDAS_m 2.513252e-03 5.995264e-04 5.239093e-05
## ATBGTJS_m 2.967592e-03 3.354733e-04 4.517704e-04
## ATBGSLI_m 4.084924e-06 2.215377e-03 1.698563e-04
## ASDGHRL_m 2.931571e+00 6.447222e+01 7.297250e+00
## ASDHSES_m 8.321229e+00 2.410734e+01 3.627347e+00
## ASDHELA_m 3.183599e+00 6.761283e+00 5.079888e+01
## ASDHENA_m 1.009892e+01 4.207374e+00 3.706044e+01
## ASDHELT_m 2.032381e-02 1.338970e-03 4.091554e-01
## ASDHPCS_m 4.832616e+00 3.678893e-01 4.778503e-03
## ASDHPLR_m 7.049718e+01 2.155487e-04 7.475654e-01

Below, I identify the most important variables for the first, second and third components.

sort(abs(pca1_schools$rotation[,1]), decreasing = T)
##  ACBGRRS_m  ATBGSLI_m  ACBGDAS_m  ASBGSEC_m  ATBGTJS_m  ASBGSSB_m  ASBGSCR_m 
## 0.31449719 0.30732168 0.30463392 0.30312547 0.30154208 0.29903059 0.29365794 
##  ASBGERL_m  ACBGEAS_m   ASBGSB_m  ASBGDRL_m  ASDHELT_m  ASDHPLR_m  ASDGHRL_m 
## 0.29219922 0.29126000 0.29060116 0.28683116 0.06309860 0.05426964 0.05258626 
##  ASDHSES_m  ASDHENA_m  ASDHELA_m  ASDHPCS_m 
## 0.05115701 0.04746684 0.04611079 0.04187980
sort(abs(pca1_schools$rotation[,2]), decreasing = T)
##   ATBGTJS_m   ACBGRRS_m   ACBGEAS_m   ACBGDAS_m   ATBGSLI_m   ASBGDRL_m 
## 0.875120245 0.307791265 0.247222635 0.177846036 0.136573153 0.098160185 
##    ASBGSB_m   ASBGSCR_m   ASBGSEC_m   ASBGERL_m   ASDHPCS_m   ASDHSES_m 
## 0.093121845 0.067135149 0.054711931 0.029447589 0.023558814 0.019888502 
##   ASDHELT_m   ASDHELA_m   ASDHPLR_m   ASDGHRL_m   ASDHENA_m   ASBGSSB_m 
## 0.010858543 0.009153228 0.008587227 0.008174857 0.006145265 0.004821513
sort(abs(pca1_schools$rotation[,3]), decreasing = T)
##  ATBGSLI_m  ACBGEAS_m  ASBGERL_m  ASBGSSB_m  ASBGDRL_m   ASBGSB_m  ASBGSCR_m 
## 0.63994976 0.48943919 0.29897292 0.26042880 0.18296356 0.18108760 0.17856474 
##  ACBGDAS_m  ASBGSEC_m  ASDHSES_m  ASDHELT_m  ASDGHRL_m  ASDHPLR_m  ASDHELA_m 
## 0.17837561 0.15789793 0.11539748 0.08156790 0.07989205 0.07740117 0.05294837 
##  ASDHENA_m  ASDHPCS_m  ATBGTJS_m  ACBGRRS_m 
## 0.04589325 0.03290686 0.01902325 0.01020830

I find that:

  • Instruction affected by resource shortages, teaching limited by students not ready for learning, school discipline, digital self-efficacy and teachers’ job satisfaction contribute roughly equally to the first component

  • Teachers’ job satisfaction dominates the second component

  • Teaching limited by students not ready for learning and the school’s emphasis on academic success contribute the most to the third component
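The same contributions can also be inspected visually with factoextra; a short sketch using the pca2_schools object fitted above:

# Bar charts of variable contributions to the first three principal components
fviz_contrib(pca2_schools, choice = "var", axes = 1, top = 10)  # component 1
fviz_contrib(pca2_schools, choice = "var", axes = 2, top = 10)  # component 2
fviz_contrib(pca2_schools, choice = "var", axes = 3, top = 10)  # component 3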

4.2 t-SNE

In addition, I apply t-distributed Stochastic Neighbor Embedding (t-SNE). The algorithm differs from PCA and MDS in that it is a stochastic method (hence it is important to report the random seed used, so that the results are reproducible for other researchers). t-SNE can represent a high-dimensional space in two or three dimensions, which makes it a suitable tool for data visualisation.

library(Rtsne)
set.seed(2023)  # t-SNE is stochastic; fixing a seed (value chosen arbitrarily) makes the embedding reproducible
tsne_data_school <- Rtsne(data_school_dr[, c(-1, -2, -21)],
  pca = FALSE, perplexity = 5,
  theta = 0.0
)

tsne_num <- data.frame(
  TSNE1 = tsne_data_school$Y[, 1],
  TSNE2 = tsne_data_school$Y[, 2],
  label = as_factor(data_school_dr$IDCNTRY)
)

ggplot(tsne_num, aes(
  x = TSNE1, y = TSNE2, col = label
)) +
  geom_point() 

The t-SNE results show nothing particularly interesting for this analysis: there is no visible concentration of points by educational system of origin. In the following part, I use the PCA components to evaluate the quality of the dimension reduction.

4.3 Clustering of the dimensionally reduced data

# Let's assign the data again to the main data frame 

data_school_dr$pr1 <- pca1_schools$x[, 1]
data_school_dr$pr2 <- pca1_schools$x[, 2]
data_school_dr$pr3 <- pca1_schools$x[, 3]
data_school_dr$pr4 <- pca1_schools$x[, 4]
data_school_dr$pr5 <- pca1_schools$x[, 5]
data_school_dr$pr6 <- pca1_schools$x[, 6]

Now I use K-means to verify whether the clusters are assigned similarly to those in Section 3 of this document.

# Evaluate the number of clusters
p1_kmeans <- fviz_nbclust(as.data.frame(scale(data_school_dr[c(22:27)])), FUN = stats::kmeans, method ="wss",k.max = 22) + ggtitle("(A) Elbow method")
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
# Silhouette Evaluation
p2_kmeans <- fviz_nbclust(as.data.frame(scale(data_school_dr[c(22:27)])), FUN = stats::kmeans, method = "silhouette", k.max = 22) +
  ggtitle("(B) Silhouette method")
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
clusters_kmeans <- flexclust::cclust(as.data.frame(scale(data_school_dr[c(22:27)])), 2, dist = "euclidean", method = "kmeans", weights = data_school_dr$sum_wgt, save.data = TRUE)
## Warning in flexclust::cclust(as.data.frame(scale(data_school_dr[c(22:27)])), :
## weights can only be used with hard competitive learning
gridExtra::grid.arrange(p1_kmeans, p2_kmeans)

Now let’s draw the clusters

kmeans <- eclust(as.data.frame(scale(data_school_dr[c(22:27)])), FUNcluster = "kmeans" , k = 2)

data_school_dr$cluster_dim_red <- kmeans$cluster

Let’s verify whether schools were assigned to the same clusters as in the case without dimension reduction.

# Rand Index to assess the similarity between clustering 
data_school_common <- left_join(data_school[c(1, 2, 15)], data_school_dr[c(1,2, 28)], by=c("IDSCHOOL", "IDCNTRY")) %>% na.omit()
rand.index(data_school_common$cluster_kmeans, data_school_common$cluster_dim_red)
## [1] 0.5493772

The Rand index is about 0.55, i.e. slightly more than half of the observation pairs are treated consistently by the two clusterings. This indicates only moderate quality of the dimension reduction.
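For intuition, the Rand index can be reproduced by hand: it is the share of observation pairs on which both clusterings agree. A minimal sketch with toy labels (not the school data; rand_index_manual is a hypothetical helper):

# Share of pairs that are either grouped together in both clusterings
# or separated in both clusterings (toy example)
rand_index_manual <- function(c1, c2) {
  pairs <- combn(length(c1), 2)             # all pairs of observation indices
  same1 <- c1[pairs[1, ]] == c1[pairs[2, ]]
  same2 <- c2[pairs[1, ]] == c2[pairs[2, ]]
  mean(same1 == same2)                      # share of concordant pairs
}

rand_index_manual(c(1, 1, 2, 2), c(1, 2, 2, 2))  # 0.5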

4.4 Multidimensional scaling

As a final step, I will apply Multidimensional Scaling (MDS) to place the observations in two-dimensional space and allow a comparison of the educational systems. Although the MDS dimensions are not easily interpretable, they allow me to draw some conclusions on the similarity between the systems.

# Run non-metric MDS (isoMDS from the MASS package) on the school-level indices
mds <- data_school_dr[3:20] %>%
  dist() %>%          # Euclidean distance matrix between schools
  isoMDS() %>%        # Kruskal's non-metric MDS, two dimensions by default
  .$points %>%        # keep only the coordinates of the new points
  as_tibble()
## initial  value 28.977313 
## iter   5 value 23.641882
## iter   5 value 23.626423
## iter   5 value 23.611226
## final  value 23.611226 
## converged
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
colnames(mds) <- c("Dim.1", "Dim.2")

summary(mds)    # we get coordinates of new points
##      Dim.1             Dim.2         
##  Min.   :-9.2250   Min.   :-8.01623  
##  1st Qu.:-1.7184   1st Qu.:-1.34956  
##  Median :-0.1719   Median : 0.06132  
##  Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 1.6435   3rd Qu.: 1.44987  
##  Max.   :11.4278   Max.   : 7.68101

Now, let’s plot the MDS coordinates

data_school_dr$dim1_mds <-  mds$Dim.1
data_school_dr$dim2_mds <-  mds$Dim.2

ggplot(data_school_dr, aes(x= dim1_mds, y = dim2_mds, col = as_factor(IDCNTRY))) + 
  geom_point()

We can’t see any clustering at this point, so let’s aggregate the data to the country level

mds_data_graph <- data_school_dr %>% 
  group_by(IDCNTRY) %>% 
  summarise(dim_1_mean = weighted.mean(dim1_mds, w = sum_wgt), 
            dim_2_mean = weighted.mean(dim2_mds, w = sum_wgt))


ggplot(mds_data_graph, aes(x= dim_1_mean, y = dim_2_mean, col = as_factor(IDCNTRY))) + 
  geom_point() + 
  geom_text(aes(label = as_factor(IDCNTRY)), size=3)  +
  theme(legend.position = "none") 

The results of the MDS are fairly different from those of the hierarchical clustering. Firstly, more countries stand out as clearly different from the rest: Poland, France, Bulgaria, Spain and Italy. At the same time, the smallest distances are often found between countries that were assigned to different clusters, e.g. the Czech Republic and Finland.
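These statements about distances can be checked directly from the aggregated coordinates (a small sketch using the mds_data_graph object computed above):

# Pairwise Euclidean distances between country-level MDS means;
# small values indicate educational systems that appear similar in the MDS space
mds_country <- as.data.frame(mds_data_graph[, c("dim_1_mean", "dim_2_mean")])
rownames(mds_country) <- as.character(as_factor(mds_data_graph$IDCNTRY))
round(dist(mds_country), 2)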

5. Association Rules

Association rules are usually applied to basket analysis: we are interested in assessing whether buying certain products is associated with buying other products, which can be useful for designing marketing strategies. However, the applications of association rules go far beyond basket analysis. In this part of the project, I will apply association rules to measure the association between high achievement in reading and other factors.

For the analysis of the association rules, we usually apply three different metrics (a small worked example follows the list):

  • Support - the share of transactions in the dataset that contain a given item or itemset (e.g. 'product A')

  • Confidence - the share of transactions containing 'A' that also contain 'B'. It can be applied as a measure of the strength of the rule A => B

  • Lift - the ratio of the confidence to the expected confidence (i.e. the support of 'B'). A lift above one indicates a positive dependence between products 'A' and 'B'
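To make the definitions concrete, here is a minimal sketch on hypothetical transactions (not the PIRLS data) computing the three metrics for the rule {A} => {B}:

# Toy transactions (hypothetical); each element is one 'basket'
transactions <- list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A"),
  c("B", "C"),
  c("C")
)

supp_A  <- mean(sapply(transactions, function(t) "A" %in% t))               # 0.6
supp_B  <- mean(sapply(transactions, function(t) "B" %in% t))               # 0.6
supp_AB <- mean(sapply(transactions, function(t) all(c("A", "B") %in% t)))  # 0.4

confidence <- supp_AB / supp_A    # 0.67: share of A-baskets that also contain B
lift       <- confidence / supp_B # 1.11 > 1: positive dependence between A and B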

The goal of this part of the project is to determine the association between school characteristics and the high achievements in reading in the 4th grade.

5.1 Data cleaning

Since association rules are usually applied to binary (categorical) variables, I start by modifying the data for the analysis.

dataPIRLS21$mean_read = rowMeans(dataPIRLS21[,496:500], na.rm = TRUE) 
dataPIRLS21$q75 <- quantile(dataPIRLS21$mean_read, probs = 0.75)
dataPIRLS21$high_score <- ifelse(dataPIRLS21$mean_read > dataPIRLS21$q75, "HighScore", "LowModScore" )


# Modification
dataPIRLS21$no_dig    <-     as.factor(ifelse(dataPIRLS21$ASDGSEC == 3, "NoDig", "Dig"))
dataPIRLS21$no_bel    <-     as.factor(ifelse(dataPIRLS21$ASDGSSB == 3, "NoSenseOfSchoolBelonging", "HighSenseOfSchoolBelonging"))
dataPIRLS21$bullying  <-   as.factor(ifelse(dataPIRLS21$ASDGSB  == 3, "Bullying", "NoBullying"))
dataPIRLS21$no_eng    <-     as.factor(ifelse(dataPIRLS21$ASDGERL == 3, "NoEarlyLiteracy", "EarlyLiteracy"))
dataPIRLS21$disorder  <-   as.factor(ifelse(dataPIRLS21$ASDGDRL == 3, "DisorderlyBehavAtSchool"," NoDisorderlyBehavAtSchool"))
dataPIRLS21$conf_read <-  as.factor(ifelse(dataPIRLS21$ASDGSCR == 3, "ConfidentInReading", "LackOfConfInReading"))
dataPIRLS21$res_short <-  as.factor(ifelse(dataPIRLS21$ASDGHRL == 3, "ResourcesShortages", "NoResourcesShort"))
dataPIRLS21$jobsas    <-     as.factor(ifelse(dataPIRLS21$ATDGTJS == 3, "SatisfiedTeachers", "NotSatisfiedTeachers"))
dataPIRLS21$notread   <-    as.factor(ifelse(dataPIRLS21$ATDGSLI == 3, "NotReadyForSchool", "ReadyForSchool"))

table(dataPIRLS21$high_score)
## 
##   HighScore LowModScore 
##       30345       91035
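The same table can be expressed as shares (a quick check):

# Share of students in each score category
round(prop.table(table(dataPIRLS21$high_score)), 2)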

Hence, by construction (the threshold is the 75th percentile of the mean reading score), 25% of the population is classified as high_score. I save the dataset to csv and use it in the following part.

# Clean the data before re-reading them as "transactions"
dataPIRLS21_trans <- dataPIRLS21[,c(1, 532:541)]
write.csv(dataPIRLS21_trans, "achievements.csv", row.names = F)
dataPIRLS21_ar <- read.transactions("achievements.csv", sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
dataPIRLS21_ar <- dataPIRLS21_ar[-1,]  # drop the header row, which was read in as a transaction
# Summarize the data
image(sample(dataPIRLS21_ar, 100))

itemFreq <- itemFrequency(dataPIRLS21_ar, type="relative") 
sort(itemFreq, decreasing = T)
##           NoResourcesShort              EarlyLiteracy 
##                 0.97448509                 0.93527764 
## HighSenseOfSchoolBelonging                 NoBullying 
##                 0.91464821                 0.88638985 
##                        Dig  NoDisorderlyBehavAtSchool 
##                 0.87710496                 0.84583127 
##             ReadyForSchool        LackOfConfInReading 
##                 0.83130664                 0.81304993 
##       NotSatisfiedTeachers                LowModScore 
##                 0.79422475                 0.75000000 
##                  HighScore         ConfidentInReading 
##                 0.25000000                 0.17593508 
##    DisorderlyBehavAtSchool                      NoDig 
##                 0.14312902                 0.11186357 
##                   Bullying   NoSenseOfSchoolBelonging 
##                 0.10258692                 0.07432032 
##          SatisfiedTeachers                        724 
##                 0.07413907                 0.07044818 
##                        246                        203 
##                 0.05881529                 0.05467952 
##            NoEarlyLiteracy                        620 
##                 0.05369089                 0.05034602 
##                        380                        250 
##                 0.04481793                 0.04398583 
##                        348                        752 
##                 0.04376339                 0.04263470 
##                        956                        705 
##                 0.04213215                 0.04209919 
##                         40                        276 
##                 0.04054210                 0.03990773 
##                        703                        208 
##                 0.03988301                 0.03971824 
##                        372                        440 
##                 0.03841654                 0.03808700 
##                        196          NotReadyForSchool 
##                 0.03780689                 0.03705718 
##                        428                        528 
##                 0.03599440                 0.03553304 
##                        957                        616 
##                 0.03525292                 0.03442907 
##                        100                        191 
##                 0.03330862                 0.03243533 
##                        470         ResourcesShortages 
##                 0.02496293                 0.02006920 
##                   bullying                  conf_read 
##                 0.00000000                 0.00000000 
##                   disorder                 high_score 
##                 0.00000000                 0.00000000 
##                    IDCNTRY                     jobsas 
##                 0.00000000                 0.00000000 
##                     no_bel                     no_dig 
##                 0.00000000                 0.00000000 
##                     no_eng                    notread 
##                 0.00000000                 0.00000000 
##                  res_short 
##                 0.00000000

The most frequent items in the dataset are:

  • No resource shortages (97.4%)

  • Early literacy activities (93.5%)

  • High sense of school belonging (91.5%)

# itemFrequencyPlot() uses base graphics, so ggplot2 themes do not apply here
itemFrequencyPlot(dataPIRLS21_ar, topN = 5)

It generally shows that schools in the European Union are characterised by a high level of support for pupils.

Below I use a scatter plot showing the relationship between the rules' support (the share of transactions containing both 'A' and 'B' among all transactions) and confidence (the share of transactions containing both 'A' and 'B' among all transactions containing 'A').

library(arulesViz)
achievements_rules <- apriori(dataPIRLS21_ar, parameter = list(support = 0.1, confidence = 0.15, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.15    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 12138 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[44 item(s), 121380 transaction(s)] done [0.11s].
## sorting and recoding items ... [15 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.01s].
## writing ... [8799 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
plot(achievements_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Interestingly, we find three clusters of points. In the lower left part of the graph, there is a cloud of points characterised by low confidence and low support. The second cloud of points, in the upper left part of the graph, shows points with high confidence but low support; these stand for rules whose antecedent appears relatively rarely among all transactions, but when it does appear, 'A' and 'B' are very often 'bought' together. The third cloud of points is the largest and shows high confidence with variable support.
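If one wants to look at a few rules from the high-confidence, low-support cloud, a quick sketch is the following (assuming some rules fall within these cut-offs):

# Rules with low support but very high confidence, sorted by confidence
low_supp_high_conf <- subset(achievements_rules,
                             subset = support < 0.2 & confidence > 0.9)
inspect(head(sort(low_supp_high_conf, by = "confidence"), 3))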

5.2 Apriori Algorithm

In this part, I use the Apriori algorithm to determine the association rules. The Apriori algorithm exploits prior knowledge of the most frequent items observed in the transaction dataset. It relies on the simple property that all subsets of a frequent itemset must also be frequent. Equivalently, if a certain itemset does not pass the minimum support threshold, all of its supersets automatically fail it as well, which reduces the search space.
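A minimal sketch of this pruning idea on toy transactions (hypothetical data, not the PIRLS set): since {C} is infrequent at the chosen support threshold, Apriori never has to evaluate any itemset containing C.

library(arules)

# Toy transactions: C appears only once, so any itemset containing C is pruned early
toy_trans <- as(list(
  c("A", "B"),
  c("A", "B", "D"),
  c("B", "C"),
  c("A", "B"),
  c("B", "D")
), "transactions")

freq_sets <- apriori(toy_trans,
                     parameter = list(support = 0.4, target = "frequent itemsets"))
inspect(sort(freq_sets, by = "support"))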

I look for rules where {HighScore} is implied by other factors, i.e. appears on the right-hand side.

# Apriori association rules algorithm
rules.score <- apriori(dataPIRLS21_ar, parameter = list(support = 0.20, 
                                confidence = 0.20), 
                       appearance = list(rhs = "HighScore"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5     0.2      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 24276 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[44 item(s), 121380 transaction(s)] done [0.13s].
## sorting and recoding items ... [11 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.00s].
## writing ... [106 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

I find 106 rules. I want to see only the 5 rules with the highest confidence.

# Order rules by confidence
rules.score.byconf <- sort(rules.score, by = "confidence", 
                                    decreasing = T)
inspect(rules.score.byconf[1:5])
##     lhs                              rhs           support confidence  coverage     lift count
## [1] {Dig,                                                                                     
##      EarlyLiteracy,                                                                           
##      HighSenseOfSchoolBelonging,                                                              
##      LackOfConfInReading,                                                                     
##      NoBullying,                                                                              
##      NoResourcesShort}            => {HighScore} 0.2028423  0.3318910 0.6111715 1.327564 24621
## [2] {Dig,                                                                                     
##      HighSenseOfSchoolBelonging,                                                              
##      LackOfConfInReading,                                                                     
##      NoBullying,                                                                              
##      NoResourcesShort}            => {HighScore} 0.2068875  0.3311356 0.6247817 1.324542 25112
## [3] {EarlyLiteracy,                                                                           
##      HighSenseOfSchoolBelonging,                                                              
##      LackOfConfInReading,                                                                     
##      NoBullying,                                                                              
##      NoDisorderlyBehavAtSchool,                                                               
##      NoResourcesShort}            => {HighScore} 0.2003048  0.3308386 0.6054457 1.323355 24313
## [4] {HighSenseOfSchoolBelonging,                                                              
##      LackOfConfInReading,                                                                     
##      NoBullying,                                                                              
##      NoDisorderlyBehavAtSchool,                                                               
##      NoResourcesShort}            => {HighScore} 0.2041522  0.3295607 0.6194678 1.318243 24780
## [5] {Dig,                                                                                     
##      EarlyLiteracy,                                                                           
##      LackOfConfInReading,                                                                     
##      NoBullying,                                                                              
##      NoResourcesShort}            => {HighScore} 0.2083127  0.3285601 0.6340171 1.314240 25285
rules.score.suport <- sort(rules.score, by = "support", 
                                    decreasing = T)
inspect(rules.score.suport[1:5])
##     lhs                                        rhs         support   confidence
## [1] {}                                      => {HighScore} 0.2500000 0.2500000 
## [2] {NoResourcesShort}                      => {HighScore} 0.2491349 0.2556580 
## [3] {LackOfConfInReading}                   => {HighScore} 0.2409705 0.2963785 
## [4] {LackOfConfInReading, NoResourcesShort} => {HighScore} 0.2404268 0.3005768 
## [5] {EarlyLiteracy}                         => {HighScore} 0.2400890 0.2567034 
##     coverage  lift     count
## [1] 1.0000000 1.000000 30345
## [2] 0.9744851 1.022632 30240
## [3] 0.8130499 1.185514 29249
## [4] 0.7998847 1.202307 29183
## [5] 0.9352776 1.026814 29142

Not surprisingly, the rules with the highest support have the most frequent items on the left-hand side, which are naturally associated with a high score: NoResourcesShort, LackOfConfInReading, EarlyLiteracy, etc.

5.3 Other Associations

In addition to the associations with school characteristics, I study the associations between all factors defined in the data. This time, I take a more conservative approach and require a minimum support of 0.3, keeping the minimum confidence at 0.2.

# Apriori association rules algorithm
rules.score_all <- apriori(dataPIRLS21_ar, parameter = list(support = 0.30,confidence = 0.20))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5     0.3      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 36414 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[44 item(s), 121380 transaction(s)] done [0.22s].
## sorting and recoding items ... [10 item(s)] done [0.03s].
## creating transaction tree ... done [0.21s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.00s].
## writing ... [5066 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].
rules.score_all_conf <- sort(rules.score_all, by = "confidence", 
                                    decreasing = T)
inspect(rules.score_all_conf[1:5])
##     lhs                             rhs                  support confidence  coverage     lift count
## [1] {Dig,                                                                                           
##      LackOfConfInReading,                                                                           
##      NoBullying,                                                                                    
##      NoDisorderlyBehavAtSchool,                                                                     
##      ReadyForSchool}             => {NoResourcesShort} 0.4919756  0.9875637 0.4981710 1.013421 59716
## [2] {Dig,                                                                                           
##      EarlyLiteracy,                                                                                 
##      LackOfConfInReading,                                                                           
##      NoBullying,                                                                                    
##      NoDisorderlyBehavAtSchool,                                                                     
##      ReadyForSchool}             => {NoResourcesShort} 0.4755067  0.9875607 0.4814961 1.013418 57717
## [3] {Dig,                                                                                           
##      EarlyLiteracy,                                                                                 
##      LackOfConfInReading,                                                                           
##      NoBullying,                                                                                    
##      NoDisorderlyBehavAtSchool,                                                                     
##      NotSatisfiedTeachers,                                                                          
##      ReadyForSchool}             => {NoResourcesShort} 0.4396111  0.9874715 0.4451887 1.013326 53360
## [4] {Dig,                                                                                           
##      LackOfConfInReading,                                                                           
##      NoBullying,                                                                                    
##      NoDisorderlyBehavAtSchool}  => {NoResourcesShort} 0.5870077  0.9874577 0.5944637 1.013312 71251
## [5] {Dig,                                                                                           
##      LackOfConfInReading,                                                                           
##      NoBullying,                                                                                    
##      NoDisorderlyBehavAtSchool,                                                                     
##      NotSatisfiedTeachers,                                                                          
##      ReadyForSchool}             => {NoResourcesShort} 0.4544241  0.9874505 0.4601994 1.013305 55158
# Apriori association rules algorithm
rules.score_all_supp <- sort(rules.score_all, by = "support", 
                                    decreasing = T)
inspect(rules.score_all_supp[1:5])
##     lhs                   rhs                          support   confidence
## [1] {}                 => {NoResourcesShort}           0.9744851 0.9744851 
## [2] {}                 => {EarlyLiteracy}              0.9352776 0.9352776 
## [3] {EarlyLiteracy}    => {NoResourcesShort}           0.9162959 0.9797047 
## [4] {NoResourcesShort} => {EarlyLiteracy}              0.9162959 0.9402873 
## [5] {}                 => {HighSenseOfSchoolBelonging} 0.9146482 0.9146482 
##     coverage  lift     count 
## [1] 1.0000000 1.000000 118283
## [2] 1.0000000 1.000000 113524
## [3] 0.9352776 1.005356 111220
## [4] 0.9744851 1.005356 111220
## [5] 1.0000000 1.000000 111020

The associations found by sorting on support are particularly interesting, as they show a strong co-occurrence of early literacy activities and no resource shortages (although the lift is only slightly above one, so the co-occurrence is largely driven by both items being very frequent). One possible explanation is that parents with high socioeconomic status, who usually invest more in their children's education, might be more prone to send them to schools with no resource shortages, or even to invest in the schools' resources themselves.

5.4 Visualization

As there are too many rules to visualise at once, I show only a subset of them on each graph.

plot(rules.score.suport[1:20], method = "graph")

plot(rules.score.suport[1:5], method="grouped")
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

plot(rules.score, method="paracoord", control=list(reorder=TRUE))

It can be seen from the graphs that the strongest association with a high score is found for NoResourcesShort. LackOfConfInReading and EarlyLiteracy also play an important role.

6 Conclusions

This project covered a broad application of unsupervised learning methods for data exploration in educational research. Based on clustering methods, we captured similarities between the educational systems. Dimension reduction helped us get a better sense of the structure of the data. Finally, the association rules provided some supporting evidence for investing in human capital.