Evaluating the Effect of Pre-Primary Education on Grade Retention in Thailand: A Re-analysis

library(haven)
library(fastDummies)

## Thank you for using fastDummies!

## To acknowledge our work, please cite the package:

## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction

In this project, I will be reanalyzing the impact of exposure to pre-primary schooling on grade retention in Thailand based on Stephen Raudenbush’s 1991 paper “The Effects of Preprimary Access and Quality on Educational Achievement in Thailand.”

Research Question

Does exposure to pre-primary schooling impact grade retention in Thailand?

If we can determine a causal impact of pre-primary schooling on grade retention, we might consider investing more resources into incentivizing students to enroll in pre-primary school.

We should choose a hierarchical modeling strategy for this dataset because there are nested variables. We want to understand the effect of pre-primary school on grade retention and this effect may vary separately on a student-level and a school-level. To do so, I will conduct propensity score matching, as we don’t have perfect randomization into pre-primary schooling here. In essence, this is a balance test to ensure that we are comparing grade retention from students with similar propensity scores (likelihood of enrolling in pre-primary).

Sample and Data

# read in level 1 and level 2 data
level1 <- read_sav("/Users/shanametcalf/Desktop/UChicago/Spring 2024/Hierarchical Linear Models/assignments/Final Project/thai_level1.sav")
level2 <- read_sav("/Users/shanametcalf/Desktop/UChicago/Spring 2024/Hierarchical Linear Models/assignments/Final Project/thai_level2.sav")

# get summary statistics
summary(level1)

##     schoolid           sesc              ln_time              male       
##  Min.   : 10101   Min.   :-1.760000   Min.   :-2.74000   Min.   :0.0000  
##  1st Qu.: 70211   1st Qu.:-0.380000   1st Qu.:-0.34000   1st Qu.:0.0000  
##  Median :120103   Median :-0.230000   Median : 0.03000   Median :1.0000  
##  Mean   :112184   Mean   : 0.003521   Mean   :-0.00374   Mean   :0.5054  
##  3rd Qu.:150543   3rd Qu.: 0.240000   3rd Qu.: 0.69000   3rd Qu.:1.0000  
##  Max.   :180665   Max.   : 3.480000   Max.   : 1.97000   Max.   :1.0000  
##     dialect         breakfast        preprimary        retained     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.4852   Mean   :0.8419   Mean   :0.5054   Mean   :0.1451  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

summary(level2)

##     schoolid          msesc         
##  Min.   : 10101   Min.   :-0.93000  
##  1st Qu.: 80102   1st Qu.:-0.28000  
##  Median :110421   Median :-0.17000  
##  Mean   :109768   Mean   : 0.00285  
##  3rd Qu.:150326   3rd Qu.: 0.21000  
##  Max.   :180666   Max.   : 2.01000  
##                   NA's   :5

The sample: a nationally representative sample of Thai third graders in 1982. Level 1 data has 8582 observations (students), and level 2 data has 405 observations (schools). The variables present in the level-1 data are:
1. SESC, an index of the socioeconomic status
2. Ln_time, the logarithm of the amount of time it takes a student to get to schools
3. Male, 1=male, 0=female
4. Dialect, 1=central Thai, 0=other
5. Breakfast, 1=has breakfast daily, 0=does not
6. Pre-primary=1 if has pre-primary experience, 0 if not
7. Retained=1 if ever retained in grade, 0 if not
8. schoolid for the unique ID of the school in question

The variables present in the level-2 data are: 1. schoolid for the unique ID of the school in question 2. MSESC, an index of the socioeconomic status of that particular school

Methodology

I will use two methods to estimate the effect of pre-primary schooling on grade-retention and determine which produces more accurate results. In the first method, I will control for covariates in HGLM. In the second method, I will stratify on propensity score.

Method One: Simple HLM 2-Level Model with Covariates

Step One: Generate Residuals

Separately, I generated residuals using HLM software. To do this, I regressed all of my covariates on pre-primary (outcome). This results in a residual file that I can use to explore the relationships between various covariates and pre-primary.

# read in residual file
residual <- read_sav("/Users/shanametcalf/Desktop/UChicago/Spring 2024/Hierarchical Linear Models/assignments/Final Project/resfil1.sav")

Step Two: Identify Relevant Confounding Variables

# cor function and convert to DF
correlation_matrix <- cor(residual)

## Warning in cor(residual): the standard deviation is zero

correlation_df <- as.data.frame(correlation_matrix)
correlation_with_retained <- abs(correlation_matrix["RETAINED", ])

# get the variable names associated with the highest absolute correlations
top_variables <- names(correlation_with_retained)

# print the covariates most strongly correlated to grade retention
for (variable in top_variables) {
  correlation_value <- correlation_matrix["RETAINED", variable]
  print(paste("Variable:", variable, "Correlation Value:", correlation_value))
}

## [1] "Variable: L2ID Correlation Value: -0.0612237300708609"
## [1] "Variable: L1RESID Correlation Value: -0.0368819994591392"
## [1] "Variable: FITVAL Correlation Value: -0.091346486197767"
## [1] "Variable: OLS_FV Correlation Value: -0.0894188024054969"
## [1] "Variable: SIGMA Correlation Value: NA"
## [1] "Variable: SESC Correlation Value: -0.13750358083365"
## [1] "Variable: LN_TIME Correlation Value: 0.00235795727983038"
## [1] "Variable: MALE Correlation Value: 0.082624263780864"
## [1] "Variable: DIALECT Correlation Value: -0.0471236047966828"
## [1] "Variable: BREAKFAS Correlation Value: -0.0122983368174534"
## [1] "Variable: PREPRIMA Correlation Value: -0.103386857376787"
## [1] "Variable: RETAINED Correlation Value: 1"

Results: SESC has the highest correlated value and is the biggest confounding variable. Explanation: a student’s SES is likely correlated with whether their parent put them pre-primary school and if they were also retained (maybe they were put in extra tutoring classes or had more involved parents)

Because all of the variables seem to have a non-zero correlation with RETAINED, I will control for all of these variables in my model This is likely because they have some relationship to SES, which is a confounding variable.

Step Three: Build a 2-Level Model with Covariates

Level 1 Model -

\[ \text{Prob}(RETAINED_{ij}=1|\beta_j) = \phi_{ij} \]

\[ \log\left[\frac{\phi_{ij}}{1 - \phi_{ij}}\right] = \eta_{ij} \]

\[ \eta_{ij} = \beta_{0j} + \beta_{1j} \cdot \text{SESC}_{ij} + \beta_{2j} \cdot \text{LN\_TIME}_{ij} + \beta_{3j} \cdot \text{MALE}_{ij} + \beta_{4j} \cdot \text{DIALECT}_{ij} + \beta_{5j} \cdot \text{BREAKFAST}_{ij} + \beta_{6j} \cdot \text{PREPRIMA}_{ij} \]

Level 2 Model -

\[ \beta_{0j} = \gamma_{00} + u_{0j} \\ \beta_{1j} = \gamma_{10} + u_{1j} \\ \beta_{2j} = \gamma_{20} \\ \beta_{3j} = \gamma_{30} \\ \beta_{4j} = \gamma_{40} \\ \beta_{5j} = \gamma_{50} + u_{5j}\\ \beta_{6j} = \gamma_{60} \]

Step Four: Interpret Results

I use HGLM (Adaptive Gauss-Hermite Quadrature, 10 quadrature points) when running my 2-Level model in the HLM software.

My output is as follows:

Coefficient on Pre-Primary: -0.393652 (the average impact of pre-primary schooling on the log odds of grade retention, controlling for covariates)

95% Confidence Interval: [-0.25, -0.537]

Upper bound of interval: -0.393652 + (1.96 * 0.073251) = -0.25

-0.393652 + (1.96 * 0.073251)

## [1] -0.25008

Lower bound of interval: -0.393652 - (1.96 * 0.073251) = -0.537

-0.393652 - (1.96 * 0.073251)

## [1] -0.537224

Odds ratio: 0.674589. This value indicates that getting a pre-primary education reduces your odds of repetition by 32.55% (or multiplies your odds by 0.67).

Step Five: Graph Results

Here, I graph the probability of grade retention as a function of SES (the single most important covariate related to grade retention. Pre-primary is a z-focus.

We can see that there is a weak negative correlation between SESC and RETAINED. As SES gets higher, the chance of being retained goes down. This makes sense as wealth likely has an impact on student performance.

Method Two: Propensity Score Analysis

I predict that using propensity score matching before running the 2-level models in HLM will produce more accurate results. This is because we will compare students who are most alike given their measurable covariates (selection on observables).

Step One: Defining the Model

The following model represents the conditional probability of receiving pre-primary education on all covariates.

Level-1 Model

\[ \text{Prob}(\text{PREPRIMA}_{ij}=1|\beta_{j}) = \phi_{ij} \]

\[ \log\left(\frac{\phi_{ij}}{1 - \phi_{ij}}\right) = \eta_{ij} \]

Level-2 Model

\[ \beta_{0j} = \gamma_{00} + u_{0j} \\\beta_{1j} = \gamma_{10} + u_{1j} \\\beta_{2j} = \gamma_{20} \\\beta_{3j} = \gamma_{30} \\\beta_{4j} = \gamma_{40} \\\beta_{5j} = \gamma_{50} \]

Step Two: Run the Model using HGLM

We will use the residual file from the first method and explore our propensity scores (FITVAL in the model).

summary(residual)

##       L2ID           L1RESID              FITVAL            OLS_FV       
##  Min.   : 10101   Min.   :-41.08515   Min.   :-6.2675   Min.   :-6.2045  
##  1st Qu.: 80206   1st Qu.: -0.67017   1st Qu.:-2.9175   1st Qu.:-2.9362  
##  Median :120209   Median :  0.10057   Median : 0.1081   Median : 0.0612  
##  Mean   :113682   Mean   :  0.02405   Mean   :-0.2572   Mean   :-0.2437  
##  3rd Qu.:160202   3rd Qu.:  0.96303   3rd Qu.: 2.0688   3rd Qu.: 2.0631  
##  Max.   :180665   Max.   : 27.70087   Max.   : 6.8305   Max.   :10.9187  
##      SIGMA        SESC              LN_TIME               MALE       
##  Min.   :1   Min.   :-1.760000   Min.   :-2.74e+00   Min.   :0.0000  
##  1st Qu.:1   1st Qu.:-0.380000   1st Qu.:-3.40e-01   1st Qu.:0.0000  
##  Median :1   Median :-0.240000   Median : 3.00e-02   Median :1.0000  
##  Mean   :1   Mean   : 0.003038   Mean   : 2.40e-06   Mean   :0.5076  
##  3rd Qu.:1   3rd Qu.: 0.240000   3rd Qu.: 6.90e-01   3rd Qu.:1.0000  
##  Max.   :1   Max.   : 3.480000   Max.   : 1.97e+00   Max.   :1.0000  
##     DIALECT          BREAKFAS         PREPRIMA         RETAINED     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.4715   Mean   :0.8405   Mean   :0.4961   Mean   :0.1449  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

I have found that FITVAL (propensity scores) have a range of -6.26 to 6.83. The lower values indicate a lower likelihood of enrolling in pre-primary, while the higher values indicate a higher likelihood of enrolling in pre-primary. The median is 0.10, meaning that the majority of students in this sample have a higher likelihood of enrolling in pre-primary.

Step Three: Stratifying the Sample by Propensity Score

Clean up Propensity Scores

# rank order the students by the logit of their predicted propensity scores
residual <- residual[order(residual$FITVAL), ]

# determine ranges for each
range(residual$FITVAL[residual$PREPRIMA == 1], na.rm = T)

## [1] -3.750158  6.830528

range(residual$FITVAL[residual$PREPRIMA == 0], na.rm = T)

## [1] -6.267483  4.170739

# eliminate data below the highest minimum and above the lowest maximum
# start by picking out the mins and maxs of the propensity scores.
mins <- c(min(residual$FITVAL[residual$PREPRIMA==1], na.rm = T), min(residual$FITVAL[residual$PREPRIMA==0], na.rm = T) )
maxs <- c(max(residual$FITVAL[residual$PREPRIMA==1], na.rm = T), max(residual$FITVAL[residual$PREPRIMA==0], na.rm = T) )

#remove the outliers as we only want to compare students in the same strata
residual$FITVAL[residual$FITVAL < max(mins) | residual$FITVAL > min(maxs) ] <- NA

# change NAs to 0
residual_clear <- residual %>% 
  filter(is.na(residual$FITVAL)==0)

# see how the propensity scores look now
summary(residual$FITVAL)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -3.7502 -0.7024  0.6207  0.5832  2.1288  4.1707    2232

Cut Data into Strata

residual_clear$stratum_range <- cut_number(residual_clear$FITVAL, 10)   #cuts data into 10 strata based on propensity score

residual_clear <- residual_clear %>% 
  mutate(stratum = as.numeric(stratum_range)) %>% 
  mutate(stratum = as.factor(stratum))  

# end result is that we have the range for each stratum, and the stratum is labeled with numbers
# here are the number of students per strata
residual_clear %>% 
  count(stratum_range)

## # A tibble: 10 × 2
##    stratum_range      n
##    <fct>          <int>
##  1 [-3.75,-2.45]    611
##  2 (-2.45,-1.1]     611
##  3 (-1.1,-0.382]    611
##  4 (-0.382,0.108]   611
##  5 (0.108,0.621]    611
##  6 (0.621,1.21]     610
##  7 (1.21,1.82]      611
##  8 (1.82,2.51]      611
##  9 (2.51,3.22]      611
## 10 (3.22,4.17]      611

Cross-classification of the students by receipt of pre-primary schooling (column) and propensity score stratum (row).

residual_table <- table(residual_clear$stratum, residual_clear$PREPRIMA)

print(residual_table)

##     
##        0   1
##   1  586  25
##   2  503 108
##   3  411 200
##   4  336 275
##   5  251 360
##   6  170 440
##   7  107 504
##   8   66 545
##   9   34 577
##   10  12 599

Create dummy variables of the strata to later read back into HLM as controls

# create a dummy variable for each stratum. no dummy for Stratum 1.
residual_filtered <- residual_clear %>% 
  fastDummies::dummy_cols(select_columns = "stratum", remove_first_dummy = TRUE)

residual_filtered <- residual_filtered %>% 
  mutate(St_2 = as.numeric(stratum_2),
         St_3 = as.numeric(stratum_3),
         St_4 = as.numeric(stratum_4),
         St_5 = as.numeric(stratum_5),
         St_6 = as.numeric(stratum_6),
         St_7 = as.numeric(stratum_7),
         St_8 = as.numeric(stratum_8),
         St_9 = as.numeric(stratum_9),
         St_10 = as.numeric(stratum_10))

Check for balance

# compare means of the predicted logit-propensity score in each cell of the table
t.test (residual_filtered$FITVAL[residual_filtered$stratum == 1] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 1])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 1] by residual_filtered$PREPRIMA[residual_filtered$stratum == 1]
## t = -0.74828, df = 26.145, p-value = 0.461
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.21357821  0.09955618
## sample estimates:
## mean in group 0 mean in group 1 
##       -3.067403       -3.010392

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 2] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 2])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 2] by residual_filtered$PREPRIMA[residual_filtered$stratum == 2]
## t = -2.3024, df = 154.12, p-value = 0.02265
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.18693897 -0.01428433
## sample estimates:
## mean in group 0 mean in group 1 
##       -1.801946       -1.701335

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 3] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 3])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 3] by residual_filtered$PREPRIMA[residual_filtered$stratum == 3]
## t = -1.2901, df = 396.03, p-value = 0.1978
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.05863435  0.01217003
## sample estimates:
## mean in group 0 mean in group 1 
##      -0.7192417      -0.6960096

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 4] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 4])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 4] by residual_filtered$PREPRIMA[residual_filtered$stratum == 4]
## t = -0.35233, df = 593.31, p-value = 0.7247
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.02502249  0.01741024
## sample estimates:
## mean in group 0 mean in group 1 
##      -0.1406254      -0.1368192

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 5] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 5])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 5] by residual_filtered$PREPRIMA[residual_filtered$stratum == 5]
## t = -1.7583, df = 542.34, p-value = 0.07926
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.045918311  0.002541897
## sample estimates:
## mean in group 0 mean in group 1 
##       0.3566510       0.3783392

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 6] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 6])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 6] by residual_filtered$PREPRIMA[residual_filtered$stratum == 6]
## t = -1.6994, df = 304.69, p-value = 0.09026
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.057909233  0.004237732
## sample estimates:
## mean in group 0 mean in group 1 
##       0.9103988       0.9372346

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 7] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 7])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 7] by residual_filtered$PREPRIMA[residual_filtered$stratum == 7]
## t = -3.366, df = 165.41, p-value = 0.0009482
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.09585180 -0.02497719
## sample estimates:
## mean in group 0 mean in group 1 
##        1.469352        1.529766

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 8] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 8])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 8] by residual_filtered$PREPRIMA[residual_filtered$stratum == 8]
## t = -1.1456, df = 80.907, p-value = 0.2553
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.08424755  0.02268242
## sample estimates:
## mean in group 0 mean in group 1 
##        2.120795        2.151578

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 9] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 9])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 9] by residual_filtered$PREPRIMA[residual_filtered$stratum == 9]
## t = -1.9207, df = 37.627, p-value = 0.06238
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.136889940  0.003621909
## sample estimates:
## mean in group 0 mean in group 1 
##        2.819134        2.885768

t.test (residual_filtered$FITVAL[residual_filtered$stratum == 10] ~ residual_filtered$PREPRIMA[residual_filtered$stratum == 10])

## 
##  Welch Two Sample t-test
## 
## data:  residual_filtered$FITVAL[residual_filtered$stratum == 10] by residual_filtered$PREPRIMA[residual_filtered$stratum == 10]
## t = -2.0765, df = 11.568, p-value = 0.06084
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.311472172  0.008130787
## sample estimates:
## mean in group 0 mean in group 1 
##        3.535208        3.686879

I do not find plausible balance in stratum 2, 7, and 10. The p-values are statistically significant at the 1% level, indicating that those strata are unbalanced. This would make sense since students were not randomized into these strata and likely have differing characteristics.

Exploring the Strata

# for each cell in the table, I calculate the fraction of students who repeated a grade
results <- data.frame(
  stratum = 1:10,
  PREPRIMA_0 = numeric(10),
  PREPRIMA_1 = numeric(10)
)

# loop through each stratum from 1 to 10
for (i in 1:10) {
  # filter the dataframe for the specified conditions (PREPRIMA == 0)
  data_subset_0 <- residual_filtered %>%
    filter(stratum == i, PREPRIMA == 0)
  
  # calculate the fraction of these observations where RETAINED == 1
  if (nrow(data_subset_0) > 0) {
    fraction_retained_0 <- nrow(data_subset_0 %>% filter(RETAINED == 1)) / nrow(data_subset_0)
  } else {
    fraction_retained_0 <- NA
  }
  
  # store the rounded fraction in the results data frame
  results$PREPRIMA_0[i] <- round(fraction_retained_0, 2)
  
  # filter the dataframe for the specified conditions (PREPRIMA == 1)
  data_subset_1 <- residual_filtered %>%
    filter(stratum == i, PREPRIMA == 1)
  
  # calculate the fraction of these observations where RETAINED == 1
  if (nrow(data_subset_1) > 0) {
    fraction_retained_1 <- nrow(data_subset_1 %>% filter(RETAINED == 1)) / nrow(data_subset_1)
  } else {
    fraction_retained_1 <- NA
  }
  
  # store the rounded fraction in the results data frame
  results$PREPRIMA_1[i] <- round(fraction_retained_1, 2)
}

# print the results data frame
print(results)

##    stratum PREPRIMA_0 PREPRIMA_1
## 1        1       0.16       0.16
## 2        2       0.21       0.13
## 3        3       0.20       0.14
## 4        4       0.23       0.13
## 5        5       0.13       0.09
## 6        6       0.18       0.14
## 7        7       0.17       0.13
## 8        8       0.23       0.11
## 9        9       0.18       0.11
## 10      10       0.00       0.09

# with respect to the proportion of students retained, I next calculate the difference between those with and without pre-primary schooling.

# pre-primary == 0 minus pre-primary == 1 (retention rates)
differences <- data.frame(
  stratum = results$stratum,
  Difference = round(results$PREPRIMA_0 - results$PREPRIMA_1, 2)
)

print(differences)

##    stratum Difference
## 1        1       0.00
## 2        2       0.08
## 3        3       0.06
## 4        4       0.10
## 5        5       0.04
## 6        6       0.04
## 7        7       0.04
## 8        8       0.12
## 9        9       0.07
## 10      10      -0.09

I see that overall, those who did not do pre-primary are being retained at a higher rate than those who did, across all strata. This is consistent with expectations.

Step Four: Upload Finished Dataset to HLM for Analysis

# write to dta for uploading to HLM
write_dta(residual_filtered, "residual_filtered.dta")

# change the level 2 data file to dta
lvl2 <- read_sav("thai_level2.sav")
write_dta(lvl2, "thai_level2.dta")

Step Five: Run Model in HLM with Strata Dummies

The following is an HGLM model in which the logit of the probability of grade retention is the outcome.

Level 1 Model -

$$
    \text{Prob}(RETAINED_{ij}=1|\beta_j) = \phi_{ij}
    $$

$$
\log\left[\frac{\phi_{ij}}{1 - \phi_{ij}}\right] = \eta_{ij}
$$

\[ \eta_{ij} = \beta_{0j} + \beta_{1j} \cdot \text{PREPRIMA}_{ij} + \beta_{2j} \cdot \text{ST\_2}_{ij} + \beta_{3j} \cdot \text{ST\_3}_{ij} + \beta_{4j} \cdot \text{ST\_4}_{ij} + \\ \beta_{5j} \cdot \text{ST\_5}_{ij} + \beta_{6j} \cdot \text{ST\_6}_{ij} + \beta_{7j} \cdot \text{ST\_7}_{ij}\\ + \beta_{8j} \cdot \text{ST\_8}_{ij} + \beta_{9j} \cdot \text{ST\_9}_{ij} + \\\beta_{10j} \cdot \text{ST\_10}_{ij} \]

Level 2 Model -

\[ \beta_{0j} = \gamma_{00} + u_{0j} \\ \beta_{1j} = \gamma_{10} \\ \beta_{2j} = \gamma_{20} \\ \beta_{3j} = \gamma_{30} \\ \beta_{4j} = \gamma_{40} \\ \beta_{5j} = \gamma_{50} \\ \beta_{6j} = \gamma_{60} \\ \beta_{7j} = \gamma_{70} \\ \beta_{8j} = \gamma_{80} \\ \beta_{9j} = \gamma_{90} \\ \beta_{10j} = \gamma_{100} \]

Terms:

\(\beta_{0j}\) represents the coefficient/impact of not having pre-primary (pre-primary = 0) and for each of the strata = 0. (holding all the covariates fixed/conditional on all covariates being 0)

\(\beta_{1j}\) represents the coefficient/impact of pre-primary schooling = 1 on grade retention.

Each of the Betas on ST_i represents the coefficient/impact of each strata dummy on grade retention.

Because I could not add random effects due to HLM overloading, each of the betas = gammas (except for \(\beta_{0j}\) where there is a random effect of \(u_{0j}\).

Overall Results

The coefficient of pre-primary education is -0.453015 is highly statistically significant, indicating that pre-primary education is negatively correlated with grade retention: enrolling in pre-primary is associated with lower grade retention. However, this is a selection on observables and the fundamental assumption here to justify a causal interpretation is that there are no un-observable characteristics (or observable characteristics that we just didn’t include in the model) that could impact grade retention beyond the ones we controlled for. Additionally, we see that stratum 10 dummy has a coefficient of -0.587 and is statistically significant at the 5% level, indicating that being in the 10th stratum is negatively correlated with grade retention, which is aligned with expectations.

Comparison of simple regression results to the propensity score analysis

The coefficient of pre-primary education is -0.453015 compared to -0.393652 without propensity score matching, indicating that when you match students based on observed characteristics, you get a higher treatment effect of pre-primary on grade retention. The coefficient is highly statistically significant, indicating that pre-primary education is correlated with grade retention. The two analyses do not tell the same story. The second analysis (PSM) is likely more reliable. PSM allowed us to get both treatment and control groups with similar observable characteristics, and so we were able to filter out the variation due to differing observable characteristics. After doing so, we get a smaller treatment effect, meaning that pre-primary has less of an effect on grade retention than otherwise thought. The original coefficient of -0.45 was likely higher because we were comparing two groups of students who were fundamentally different.