---
title: "Replication of 'empirical sample 1' by Cheema (2014, Journal of Modern Applied Statistical Methods)"
author: "Radhika Kapoor (rkap786@stanford.edu)"
date: "November 20, 2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: false
---
Cheema (2014) evaluates different treatment methods for missing data in surveys by comparing their effect on the accuracy of estimation. The paper conducts the analysis on: (1) a simulated dataset (n=10,000 cases), (2) Empirical sample 1: the US portion of PISA 2003 (n=456 cases), and (3) Empirical sample 2: the Population and Housing portion of the US Census (2000). The paper examines how the different methods perform when the sample size is small, medium, or large, and when 5% or 10% of the data is missing.
In this project, I propose to reproduce Cheema (2014)'s findings from the US portion of the PISA 2003 dataset (Empirical sample 1) with 5% and 10% missing data using five different imputation methods: listwise deletion, mean imputation, regression imputation, EM imputation, and multiple imputation. These are the results in Table 2 and Table 3 of the paper.
Cheema (2014) found that, other than mean imputation, all imputation methods produced parameter estimates and model statistics that were very similar to their full-data counterparts. This is the result I will be aiming to reproduce. Further, I will consider the reproduction successful if the R-squared estimates produced for the different methods here are within 5% of Cheema (2014).
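To make this criterion concrete, here is a small helper of my own (not from the paper) that checks it:

#Hypothetical helper, not part of the original analysis: TRUE if the reproduced
#R-squared is within 5% (relative difference) of the original estimate
within_5_percent <- function(r2_reproduced, r2_original) {
  abs(r2_reproduced - r2_original) / r2_original <= 0.05
}
within_5_percent(0.848, 0.849) #TRUE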
This paper is highly cited in the context of handling missing data, but it has not been replicated to my knowledge. The issue of missing data is highly relevant in education research, especially when datasets are small and a relatively large proportion of values is missing. This project would also give me an opportunity to work with large-scale assessment datasets like PISA and understand their underlying structure. This paper is relevant to my research interests as I plan to work on large education datasets in the context of international assessments. By replicating the results, I would learn some methods for handling missing data in my future work.
1. Student achievement variables used in Cheema (2014)
Cheema (2014) conducts student-level imputations for PISA (2003). The dependent variable is math achievement, and reading achievement is one of the explanatory variables.
PISA reports 5 plausible values each for math and reading for every student, and advises that student-level plausible values should not be averaged. Instead, any estimation should be conducted separately for each plausible value, and the resulting estimates should be averaged. However, the author confirmed that he had averaged the plausible values in the PISA dataset in this paper. I have followed the same approach, but the scores we report are therefore not the most accurate representation of student ability.
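For reference, a minimal sketch (commented out, not part of this reproduction) of the PISA-recommended approach: fit the model once per plausible value and average the resulting coefficients. The data frame name `dat` and the predictor names are assumptions borrowed from the analysis below.

#Sketch only, not run: fit one regression per math plausible value and average
#the coefficients, as PISA recommends. Assumes a data frame `dat` holding
#PV1MATH..PV5MATH plus the predictors created later in this script
#(student_gender, HEDRES, stdANXMAT, read_achievement).
# pv_cols <- paste0("PV", 1:5, "MATH")
# fits <- lapply(pv_cols, function(pv) {
#   lm(reformulate(c("student_gender", "HEDRES", "stdANXMAT", "read_achievement"),
#                  response = pv), data = dat)
# })
# rowMeans(sapply(fits, coef)) #averaged coefficient estimates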
2. Methods used for data imputation
The original study used SPSS, which offers built-in procedures for the missing data imputation methods used in this paper. I reproduced this study in R, which does not offer identical built-in models. R does have packages for regression and multiple imputation (I used the package "mice"), but it is not possible to know exactly how their implementations compare to SPSS. For the other models, I wrote the code myself, and again, it is difficult to know whether the methods are identical to those in SPSS.
I also could not reproduce the "Expectation Maximization (EM)" method. SPSS does not offer much information about its EM procedure, and the consensus of internet blogs seems to be that it is not a rigorous implementation. An EM optimization method is available in R and I tried to use it, but the results were very different from SPSS. This is more likely due to differences in the packages and code used than to other data- or analysis-related issues in the reproduction. I hence chose not to report the results for this method.
The PISA (2003) dataset for the US was downloaded from the NCES website. The SPSS macros provided by NCES were run to make the data ready for analysis, and the relevant variables were retained (student achievement in math and reading, student gender, home education resources, and math anxiety).
The performance of the different imputation methods is compared by: (1) dropping data completely at random (5% and 10%), (2) estimating the missing values using each imputation method, and (3) running a linear multiple regression in which math achievement is predicted using the other variables; the regression is run on both the full dataset and the dataset with imputed values for the missing data. The imputation methods applied are: listwise deletion, mean imputation, regression imputation, and multiple imputation.
This study differs from the original in its use of R instead of SPSS, which changes the packages used. I am also not sure how the original study generated the missing data. The study describes the method as:
“Each of these sub-samples was then reduced in size by 1%, 2%, 5%, 10%, and 20% in order to simulate datasets containing missing data. The cases were discarded randomly from each complete sample five times separately in order to make sure that there were no dependencies between samples.”
I have taken this to mean that 5% and 10% of the "math achievement" variable should be marked as NA for the analysis, but the author might instead have dropped the full rows. However, I could not find a way to do that in R that fit logically with the rest of my analysis; a sketch of what the row-dropping interpretation would look like follows.
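Purely for illustration, a minimal sketch of the row-dropping interpretation (commented out, not run; it assumes the filtered dataset `pisa_filter` created below, and note that discarding whole rows leaves nothing to impute for those cases):

#Sketch only, not run: the alternative reading of the paper, discarding whole
#rows at random instead of blanking one variable. Assumes pisa_filter from below.
# set.seed(123)
# drop_idx <- sample(nrow(pisa_filter), size = round(0.05 * nrow(pisa_filter)))
# pisa_5per_rows <- pisa_filter[-drop_idx, ]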
Further, the variables "home education resources" and "math anxiety" have additional missing values of their own. The author did not mention dropping them from the analysis, so I have retained them. However, it is possible that the author did drop them without mentioning it, which could explain why these variables behave differently in my estimations.
I load the libraries needed for analysis and the relevant datasets in this section.
## Load Relevant Libraries and Functions
## Default repo
##local({r <- getOption("repos")
## r["CRAN"] <- "https://cloud.r-project.org"
## options(repos=r)
##})
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(readr)
library(norm)
library(knitr)
#install.packages("mice") #package for regression imputation
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
#install.packages("miceadds")
library(miceadds)
## * miceadds 3.10-28 (2020-07-29 21:56:24)
library(sandwich)
library(mitools)
library(texreg)
## Version: 1.37.5
## Date: 2020-06-17
## Author: Philip Leifeld (University of Essex)
##
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
#install.packages("jtools")
library(jtools)
#install.packages("huxtable")
library(huxtable)
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:ggplot2':
##
## theme_grey
## The following object is masked from 'package:dplyr':
##
## add_rownames
#### Import data
url <- '/Users/radhika/Documents/Stanford readings/251 Experimental methods/Project/cheema2014/PISA data/Extracted data/PISA 2003 Student Background.csv'
pisa <-read.csv(url)
In this section, I modify the dataset to retain only the variables needed for the analysis, and I create the mean achievement scores used in the analysis. I also confirm that the variables behave as described in the original paper (e.g. reading and math achievement range between 200 and 800), and clean the data by dropping missing values, again as described in the paper.
# Variables selected are: IDs, gender, math anxiety, household, education resources, plausible values for math and reading
#### Prepare data for analysis - create columns etc.
####Calculate mean reading and math achievement
pisa = pisa %>%
rowwise() %>%
mutate(math_achievement = mean(c(PV1MATH, PV2MATH, PV3MATH, PV4MATH, PV5MATH))) %>%
mutate(read_achievement=mean(c(PV1READ, PV2READ, PV3READ, PV4READ, PV5READ)))
#### Data exclusion / filtering NA values
pisa = pisa %>%
rename(student_gender = ST03Q01, studentID=STIDSTD) %>%
select(c("studentID", "SCHOOLID", "student_gender","ANXMAT", "HEDRES", "math_achievement", "read_achievement"))
#head(pisa)
#nrow(pisa)
#colnames(pisa)
#summary(pisa)
#Paper says math and reading scores are normally distributed - plotting to check
ggplot(pisa, aes(x=math_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(pisa, aes(x=read_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Check for outliers in variables of interest
colnames(pisa)
## [1] "studentID" "SCHOOLID" "student_gender" "ANXMAT"
## [5] "HEDRES" "math_achievement" "read_achievement"
unique(pisa$student_gender)
## [1] 1 2 8
unique(pisa$HEDRES)
## [1] -1.4485 -0.6236 0.6766 999.0000 -2.1925 -3.0064 -4.2987
#Check outliers in math anxiety
ggplot(pisa, aes(x=ANXMAT)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Student gender and math anxiety seem to have invalid codes (8, 999) - recoding these as NA and filtering below
pisa_filter=pisa %>%
mutate(ANXMAT=replace(ANXMAT, ANXMAT>=999, NA)) %>%
mutate(HEDRES=replace(HEDRES, HEDRES>=999, NA)) %>%
filter(student_gender<3,
!is.na(student_gender))
#Paper standardizes math anxiety and home education resources, so the standardized variables are created
mean_anxmat <- mean(pisa_filter$ANXMAT, na.rm=T)
sd_anxmat <- sd(pisa_filter$ANXMAT, na.rm=T)
mean_hedres <- mean(pisa_filter$HEDRES, na.rm=T)
sd_hedres <- sd(pisa_filter$HEDRES, na.rm=T)
pisa_filter=pisa_filter %>%
mutate(stdANXMAT= (ANXMAT-mean_anxmat)/sd_anxmat) %>%
mutate(stdHEDRES= (HEDRES-mean_hedres)/sd_hedres)
Cheema (2014) describes reducing the dataset by 5% and 10% for the analysis. I create those datasets in this section.
When I run the regressions from the model described above, the variable "home education resources" produces very different beta estimates from the original (both the coefficient estimate and its level of significance change). When I wrote to the author, he suggested that I try not standardizing the variable. Hence, a dataset with the original home education resources variable (rather than its standardized version) is also created (with 5% and 10% missing values).
set.seed(123)
pisa_filter <- pisa_filter %>%
mutate(rv = rnorm(n())) %>%
arrange(rv)
cutoff_5per <- 5183 #Cheema (2014) rounded 95% of the 5455 obs to 5182; rows with id below this cutoff keep their value
cutoff_10per <- 4911 #Cheema (2014) rounded 90% of the 5455 obs to 4910
#pisa_5persample <- sample_frac(pisa_filter, 0.95)
######the code above didn't work, so instead I blank values past a fixed cutoff in the randomly ordered data
pisa_5persample <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_5per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, HEDRES,math_achievement, rv))
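#The "_mod" datasets rename the unstandardized HEDRES to stdHEDRES so that the same model functions can be reused on both versions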
pisa_5persample_mod <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_5per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, stdHEDRES,math_achievement, rv)) %>%
mutate(stdHEDRES=HEDRES) %>%
select(-c(HEDRES))
summary(pisa_5persample_mod)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
##
## stdANXMAT math_achievement2 stdHEDRES
## Min. :-2.20444 Min. :212.9 Min. :-4.2987
## 1st Qu.:-0.34362 1st Qu.:417.7 1st Qu.:-0.6236
## Median :-0.07396 Median :482.1 Median : 0.6766
## Mean : 0.00000 Mean :481.5 Mean :-0.1690
## 3rd Qu.: 0.68481 3rd Qu.:545.3 3rd Qu.: 0.6766
## Max. : 2.56340 Max. :739.2 Max. : 0.6766
## NA's :106 NA's :273 NA's :37
pisa_10persample <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_10per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, HEDRES,math_achievement, rv))
pisa_10persample_mod <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_10per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, stdHEDRES,math_achievement, rv)) %>%
mutate(stdHEDRES=HEDRES) %>%
select(-c(HEDRES))
summary(pisa_10persample_mod)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
##
## stdANXMAT math_achievement2 stdHEDRES
## Min. :-2.20444 Min. :224.2 Min. :-4.2987
## 1st Qu.:-0.34362 1st Qu.:418.0 1st Qu.:-0.6236
## Median :-0.07396 Median :482.4 Median : 0.6766
## Mean : 0.00000 Mean :482.0 Mean :-0.1690
## 3rd Qu.: 0.68481 3rd Qu.:546.6 3rd Qu.: 0.6766
## Max. : 2.56340 Max. :739.2 Max. : 0.6766
## NA's :106 NA's :545 NA's :37
I define functions for the imputation methods here. These are based on my understanding of what the corresponding procedures do in SPSS, and they reflect how I implemented them in R:
- Regression for the full model: a linear regression that returns the results of the `lm` function on the complete data.
- Listwise deletion: rows with missing values are dropped and the regression model is run on the remaining data.
- Mean imputation: missing values are replaced with the mean of the observed values in that variable's column.
- Regression imputation: missing values are estimated using a regression function. The model is estimated on the rows for which data are available; this model is then used to predict the missing values of the variable concerned. The `mice` package in R offers two options for regression imputation: (1) `norm.predict`, deterministic regression imputation, which imputes the regression prediction directly, and (2) `norm.nob`, stochastic regression imputation, which adds a small, randomly generated error term to the prediction. Stochastic imputation is preferable since it is less deterministic, and I used it here since that is what SPSS uses (the author of the original study also recommended the stochastic method).
- EM method: not used, as I could not find a good way to run this method in R (see below).
- Multiple imputation: as the name suggests, this method imputes multiple estimated datasets. The dataset with missing values is copied a specified number of times, and the missing values are estimated for each copy; the imputed values differ slightly because small random errors are introduced in the estimation. In the `mice` package, a chain of regression equations is used to estimate the variables: first, one variable is imputed using the other variables, with missing values in those other variables initially filled in by their means; the imputed values of the first variable are then used in the regression estimation for the next variable, and so on. This is similar to iterating stochastic regression, with the differences that (a) it is run on multiple, slightly different datasets and (b) it is run for multiple iterations until the results converge.

The `mice` function allows different estimation methods to be specified for different variables. I used predictive mean matching (pmm), since we are not estimating any categorical variables. The variables with missing values that are imputed are math achievement, home education resources, and math anxiety.
model0_full = function(data){
lm_fulldata=lm(math_achievement ~ student_gender + HEDRES + stdANXMAT + read_achievement, data=data)
return(lm_fulldata)
}
model1_deletion = function(data){
data = data %>%
filter(!is.na(math_achievement2))
lm_ld=lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data=data)
return(lm_ld)
}
model2_mean_imp = function(data){
  #calculate the mean of each variable's observed values, to be used for imputation
  mean_math <- mean(data$math_achievement2, na.rm = T)
  mean_anxmat <- mean(data$stdANXMAT, na.rm = T)
  mean_hedres <- mean(data$stdHEDRES, na.rm = T)
  #replace each variable's missing values with that variable's own mean (ifelse is vectorized, so rowwise() is not needed)
  data <- data %>%
    mutate(math_achievement2 = ifelse(is.na(math_achievement2), mean_math, math_achievement2)) %>%
    mutate(stdHEDRES = ifelse(is.na(stdHEDRES), mean_hedres, stdHEDRES)) %>%
    mutate(stdANXMAT = ifelse(is.na(stdANXMAT), mean_anxmat, stdANXMAT))
  #Run the regression using the mean-imputed data
  lm_mi = lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = data)
  return(lm_mi)
}
model3_reg_imp = function(data){
  imp <- mice(data, method = "norm.nob", m = 1) #Impute data, produce 1 completed dataset
  #norm.predict = deterministic regression imputation (prediction only, no error term)
  #norm.nob = stochastic regression imputation (prediction plus a random error term)
  imp2 <- complete(imp) #mice returns a mids (list) object; complete() converts it to a single data frame with the missing values filled in
  lm_ri = lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = imp2)
  return(lm_ri)
}
I could not find a way to replicate SPSS's Expectation Maximization (EM) imputation in R. The code chunk below shows the two methods I tried, neither of which looks right: the first produces estimates very different from the original model, and the second produces missing or "Inf" values. There do not seem to be any readily available packages in R that correspond to the SPSS procedure, so I decided not to reproduce this method.
### (Failed) Method 1: Use package mvdalab
#### The package runs but I am not quite sure what it does, and the results don't correspond to the paper. My best guess is that this model is not the same as the SPSS one
# install.packages("mvdalab")
# library(mvdalab)
# imp= imputeEM(pisa_5persample, impute.ncomps = 1, pca.ncomps = 1, CV = TRUE, Init = "mean",
# scale = TRUE, iters = 25, tol = .Machine$double.eps^0.25)
# implist = (pool(imp$Imputed.DataFrames))
#
# modelFit2 <- with(implist,lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement))
# summary(modelFit2)
### (Failed) Method 2: Use imp.norm
### The code runs but the missing values are not filled and "Inf" is generated
# library(norm)
# summary(pisa_5persample)
# s <- prelim.norm(as.matrix(pisa_5persample))
# thetahat <- em.norm(s)
# getparam.norm(s,thetahat)
#
# ximp = imp.norm(s, thetahat, as.matrix(pisa_5persample))
# summary(ximp)
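For completeness, one more EM-based avenue I did not try: the Amelia package implements EM with bootstrapping (EMB). Since EMB is a multiple-imputation method rather than SPSS's single EM imputation, it would not match the paper either; the sketch below is untested and only indicates where one might start.

### (Untried) Method 3: Use package Amelia
### Amelia implements EM with bootstrapping (EMB), i.e. multiple imputation, so it is not equivalent to SPSS's single EM imputation - sketch only, not run
# install.packages("Amelia")
# library(Amelia)
# a.out <- amelia(as.data.frame(pisa_5persample), m = 5,
#                 idvars = c("studentID", "SCHOOLID"))
# summary(a.out)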
model5_multiple_imp = function(data){
  imp <- mice(data, m=5, maxit=10, meth=c('pmm'), seed=500, printFlag = F)
  #mice performs the multiple imputation
  #m=5: number of imputations, i.e. 5 completed datasets are generated
  #maxit=10: number of iterations run
  #pmm: predictive mean matching
  implist <- miceadds::mids2datlist(imp) #convert the mids object into a list of completed datasets
  modelFit2 <- with(implist, lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement)) #estimates the lm model for each imputed dataset, i.e. 5 lm models
  return(modelFit2) #results are pooled later, in extract.df
}
####I copy-pasted the entire code chunk for the function "extract.df" from a blog post
extract.df <- function(tt, cl = NULL) {
#create a list from imputation - makes export easier
require(sandwich)
require(mitools)
require(texreg)
m2 <- length(tt) #number of imputations
betas <- lapply(tt, coef)
vars <- lapply(tt, vcov)
# conduct statistical inference and save results into a data.frame
model1 <- summary(pool_mi(betas, vars))
R2 <- mean(sapply(1:m2, function(x) summary(tt[[x]])$r.squared))
#technically, should use R to Z-- but close for now when R is low
ns <- nobs(tt[[1]])
#creates what is used by texreg
tr <- createTexreg(
coef.names = row.names(model1),
coef = model1$results,
se = model1$se,
pvalues = model1$p,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
return(tr)
}
This section runs the imputation methods on the PISA dataset with 5% and 10% missing data. The models are estimated using both the standardized home education resources variable and the original, unstandardized variable.
#These results are similar to the paper results
##Full data analysis
md.pattern(pisa_filter, plot = F) #check the pattern of missing data
## studentID SCHOOLID student_gender math_achievement read_achievement rv
## 5349 1 1 1 1 1 1
## 69 1 1 1 1 1 1
## 37 1 1 1 1 1 1
## 0 0 0 0 0 0
## HEDRES stdHEDRES ANXMAT stdANXMAT
## 5349 1 1 1 1 0
## 69 1 1 0 0 2
## 37 0 0 0 0 4
## 37 37 106 106 286
lm_fulldata=model0_full(pisa_filter)
## For 5% missing data
md.pattern(pisa_5persample, plot = F) #check the pattern of missing data
## studentID SCHOOLID student_gender read_achievement stdHEDRES stdANXMAT
## 5083 1 1 1 1 1 1
## 266 1 1 1 1 1 1
## 66 1 1 1 1 1 0
## 3 1 1 1 1 1 0
## 33 1 1 1 1 0 0
## 4 1 1 1 1 0 0
## 0 0 0 0 37 106
## math_achievement2
## 5083 1 0
## 266 0 1
## 66 1 1
## 3 0 2
## 33 1 2
## 4 0 3
## 273 416
### Using standardized HEDRES as specified in paper
lm_ld_5per = model1_deletion(pisa_5persample) #list wise deletion
lm_mi_5per = model2_mean_imp(pisa_5persample) #mean imputation
lm_ri_5per = model3_reg_imp(pisa_5persample) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT stdHEDRES math_achievement2
## 2 1 stdANXMAT stdHEDRES math_achievement2
## 3 1 stdANXMAT stdHEDRES math_achievement2
## 4 1 stdANXMAT stdHEDRES math_achievement2
## 5 1 stdANXMAT stdHEDRES math_achievement2
## Warning: Number of logged events: 1
multim_est_5per= model5_multiple_imp(pisa_5persample) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_5per <- extract.df(multim_est_5per) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 18.7049692 3.436358157 5.443254 5.270374e-08 11.96957368
## student_gender 30.5474741 0.992240013 30.786376 3.698504e-196 28.60240725
## stdHEDRES 1.0509477 0.515044109 2.040500 4.158885e-02 0.04013469
## stdANXMAT -12.5190415 0.515182112 -24.300225 1.105203e-112 -13.52947640
## read_achievement 0.8439916 0.005625905 150.018829 0.000000e+00 0.83296434
## upper) missInfo
## (Intercept) 25.440365 1.1 %
## student_gender 32.492541 2.3 %
## stdHEDRES 2.061761 6.8 %
## stdANXMAT -11.508607 4.9 %
## read_achievement 0.855019 1.5 %
### Using HEDRES variable with no standardization, as suggested by the author
lm_ld_5per2 = model1_deletion(pisa_5persample_mod) #list wise deletion
lm_mi_5per2 = model2_mean_imp(pisa_5persample_mod) #mean imputation
lm_ri_5per2 = model3_reg_imp(pisa_5persample_mod) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT math_achievement2 stdHEDRES
## 2 1 stdANXMAT math_achievement2 stdHEDRES
## 3 1 stdANXMAT math_achievement2 stdHEDRES
## 4 1 stdANXMAT math_achievement2 stdHEDRES
## 5 1 stdANXMAT math_achievement2 stdHEDRES
## Warning: Number of logged events: 1
multim_est_5per2=model5_multiple_imp(pisa_5persample_mod) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_5per2= extract.df(multim_est_5per2) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 18.9832323 3.617928574 5.246989 2.533734e-07 11.8703234
## student_gender 30.5118330 1.003301070 30.411443 6.060850e-167 28.5441930
## stdHEDRES 0.8664216 0.457160365 1.895225 5.822086e-02 -0.0301921
## stdANXMAT -12.5267505 0.509984745 -24.562991 7.383019e-126 -13.5265452
## read_achievement 0.8437922 0.005803911 145.383378 0.000000e+00 0.8323971
## upper) missInfo
## (Intercept) 26.0961411 10.5 %
## student_gender 32.4794730 4.6 %
## stdHEDRES 1.7630352 4.8 %
## stdANXMAT -11.5269559 2.9 %
## read_achievement 0.8551874 7.8 %
######################################################################
## For 10% missing data
md.pattern(pisa_10persample, plot = F) #check the pattern of missing data
## studentID SCHOOLID student_gender read_achievement stdHEDRES stdANXMAT
## 4813 1 1 1 1 1 1
## 536 1 1 1 1 1 1
## 64 1 1 1 1 1 0
## 5 1 1 1 1 1 0
## 33 1 1 1 1 0 0
## 4 1 1 1 1 0 0
## 0 0 0 0 37 106
## math_achievement2
## 4813 1 0
## 536 0 1
## 64 1 1
## 5 0 2
## 33 1 2
## 4 0 3
## 545 688
### Using standardized HEDRES as specified in paper
lm_ld_10per = model1_deletion(pisa_10persample) #list wise deletion
lm_mi_10per = model2_mean_imp(pisa_10persample) #mean imputation
lm_ri_10per = model3_reg_imp(pisa_10persample) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT stdHEDRES math_achievement2
## 2 1 stdANXMAT stdHEDRES math_achievement2
## 3 1 stdANXMAT stdHEDRES math_achievement2
## 4 1 stdANXMAT stdHEDRES math_achievement2
## 5 1 stdANXMAT stdHEDRES math_achievement2
## Warning: Number of logged events: 1
multim_est_10per= model5_multiple_imp(pisa_10persample) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_10per= extract.df(multim_est_10per) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 17.5490203 3.650936806 4.806717 2.736037e-06 10.35633268
## student_gender 30.8305051 1.065787309 28.927446 7.774665e-65 28.72539242
## stdHEDRES 0.9102864 0.506541173 1.797063 7.245279e-02 -0.08302295
## stdANXMAT -12.6921422 0.579401721 -21.905600 3.705601e-31 -13.84999149
## read_achievement 0.8451888 0.005796448 145.811491 0.000000e+00 0.83380696
## upper) missInfo
## (Intercept) 24.7417080 13.8 %
## student_gender 32.9356178 17 %
## stdHEDRES 1.9035958 4.2 %
## stdANXMAT -11.5342928 27.5 %
## read_achievement 0.8565706 8.1 %
### Using HEDRES variable with no standardization, as suggested by the author
lm_ld_10per2 = model1_deletion(pisa_10persample_mod) #list wise deletion
lm_mi_10per2 = model2_mean_imp(pisa_10persample_mod) #mean imputation
lm_ri_10per2 = model3_reg_imp(pisa_10persample_mod) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT math_achievement2 stdHEDRES
## 2 1 stdANXMAT math_achievement2 stdHEDRES
## 3 1 stdANXMAT math_achievement2 stdHEDRES
## 4 1 stdANXMAT math_achievement2 stdHEDRES
## 5 1 stdANXMAT math_achievement2 stdHEDRES
## Warning: Number of logged events: 1
multim_est_10per2=model5_multiple_imp(pisa_10persample_mod) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_10per2= extract.df(multim_est_10per2) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 17.9253153 3.614054089 4.959891 1.083855e-06 10.8182909
## student_gender 30.7228040 1.031257595 29.791591 8.793595e-102 28.6951707
## stdHEDRES 0.7415120 0.483133284 1.534798 1.267098e-01 -0.2122652
## stdANXMAT -12.8959376 0.528838122 -24.385416 6.542410e-81 -13.9356132
## read_achievement 0.8449354 0.005901284 143.178241 7.754751e-293 0.8333253
## upper) missInfo
## (Intercept) 25.0323397 11 %
## student_gender 32.7504373 10.7 %
## stdHEDRES 1.6952892 16.4 %
## stdANXMAT -11.8562620 10.5 %
## read_achievement 0.8565455 11.7 %
This section compiles the results from the regressions into tables: one table for 5% missing data (corresponding to Table 2 in the paper) and a second for 10% missing data (corresponding to Table 3). This reproduction gave the same result as the original: it is better to use any of the other imputation methods than mean imputation. Even deletion of rows with missing data is preferable to mean imputation, as mean imputation biases the results.
The results for the home education resources variable are not reproduced, either when we standardize it or when we do not. The next section discusses only the results from the standardized home education resources variable.
##5% missing data
### Using standardized HEDRES as in paper
screenreg(list(lm_fulldata, lm_ld_5per, lm_mi_5per, lm_ri_5per, lm_multim_5per),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 16.221 *** 13.467 *** 18.165 *** 18.705 ***
## (3.480) (3.553) (3.670) (3.404) (3.436)
## Gender 30.710 *** 30.538 *** 32.241 *** 30.576 *** 30.547 ***
## (0.986) (1.012) (1.106) (0.977) (0.992)
## Home education resources 0.961 * 1.308 * -0.026 1.024 * 1.051 *
## (0.454) (0.518) (0.017) (0.496) (0.515)
## Math anxiety -12.344 *** -12.450 *** 0.036 *** -12.638 *** -12.519 ***
## (0.506) (0.518) (0.010) (0.501) (0.515)
## Reading achievement 0.851 *** 0.849 *** 0.849 *** 0.845 *** 0.844 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.848 0.792 0.850
## Adj. R^2 0.848 0.848 0.792 0.850
## Num. obs. 5349 5083 5455 5455
## F statistic 7471.529 7082.083 5199.940 7732.297
## R2 0.849
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
### Using normal HEDRES as suggested by author
screenreg(list(lm_fulldata, lm_ld_5per2, lm_mi_5per2, lm_ri_5per2, lm_multim_5per2),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 16.420 *** 13.459 *** 19.034 *** 18.983 ***
## (3.480) (3.569) (3.670) (3.431) (3.618)
## Gender 30.710 *** 30.538 *** 32.241 *** 30.222 *** 30.512 ***
## (0.986) (1.012) (1.106) (0.980) (1.003)
## Home education resources 0.961 * 1.172 * -0.026 0.771 0.866
## (0.454) (0.464) (0.017) (0.446) (0.457)
## Math anxiety -12.344 *** -12.450 *** 0.036 *** -12.526 *** -12.527 ***
## (0.506) (0.518) (0.010) (0.503) (0.510)
## Reading achievement 0.851 *** 0.849 *** 0.849 *** 0.844 *** 0.844 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.848 0.792 0.849
## Adj. R^2 0.848 0.848 0.792 0.849
## Num. obs. 5349 5083 5455 5455
## F statistic 7471.529 7082.083 5199.910 7650.895
## R2 0.849
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
##10% missing data
screenreg(list(lm_fulldata, lm_ld_10per, lm_mi_10per, lm_ri_10per, lm_multim_10per),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 15.497 *** 37.617 *** 17.197 *** 17.549 ***
## (3.480) (3.643) (3.912) (3.422) (3.651)
## Gender 30.710 *** 30.644 *** 31.083 *** 30.532 *** 30.831 ***
## (0.986) (1.037) (1.179) (0.982) (1.066)
## Home education resources 0.961 * 1.265 * -0.028 0.858 0.910
## (0.454) (0.532) (0.018) (0.499) (0.507)
## Math anxiety -12.344 *** -12.534 *** 0.031 ** -12.404 *** -12.692 ***
## (0.506) (0.532) (0.011) (0.505) (0.579)
## Reading achievement 0.851 *** 0.850 *** 0.804 *** 0.847 *** 0.845 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.849 0.751 0.849
## Adj. R^2 0.848 0.849 0.751 0.849
## Num. obs. 5349 4813 5455 5455
## F statistic 7471.529 6771.698 4116.657 7660.441
## R2 0.850
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
### Using normal HEDRES as suggested by author
screenreg(list(lm_fulldata, lm_ld_10per2, lm_mi_10per2, lm_ri_10per2, lm_multim_10per2),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 15.689 *** 37.609 *** 17.031 *** 17.925 ***
## (3.480) (3.660) (3.912) (3.429) (3.614)
## Gender 30.710 *** 30.644 *** 31.083 *** 31.139 *** 30.723 ***
## (0.986) (1.037) (1.179) (0.978) (1.031)
## Home education resources 0.961 * 1.134 * -0.028 0.784 0.742
## (0.454) (0.477) (0.018) (0.445) (0.483)
## Math anxiety -12.344 *** -12.534 *** 0.031 ** -12.590 *** -12.896 ***
## (0.506) (0.532) (0.011) (0.502) (0.529)
## Reading achievement 0.851 *** 0.850 *** 0.804 *** 0.846 *** 0.845 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.849 0.751 0.850
## Adj. R^2 0.848 0.849 0.751 0.850
## Num. obs. 5349 4813 5455 5455
## F statistic 7471.529 6771.698 4116.640 7725.917
## R2 0.850
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
This section compares the R-squared estimates from the reproduction with the results from Cheema (2014).
#compile results for 5% missing data
##Compile R-squared
x= pool(multim_est_5per) #pooling estimates from the 5 imputed models
#pool.r.squared(x, adjusted = FALSE)[1]
imputed_r = data.frame(fullData = summary(lm_fulldata)$r.squared,
listDeletion = summary(lm_ld_5per)$r.squared,
meanImp = summary(lm_mi_5per)$r.squared,
regressionImp = summary(lm_ri_5per)$r.squared,
multipleImp =pool.r.squared(x, adjusted = FALSE)[1])
original_r = data.frame(fullData = 0.849,
listDeletion = 0.848,
meanImp = 0.801,
regressionImp = 0.849,
multipleImp =0.849)
df=as.data.frame(t(rbind(imputed_r=imputed_r, original_r=original_r))) %>%
tibble::rownames_to_column(var="method") %>%
mutate(diff_r=(imputed_r-original_r)/original_r*100,
diff_r=round(diff_r,2))
df
| method | imputed_r | original_r | diff_r |
|---|---|---|---|
| fullData | 0.848 | 0.849 | -0.08 |
| listDeletion | 0.848 | 0.848 | 0 |
| meanImp | 0.792 | 0.801 | -1.08 |
| regressionImp | 0.85 | 0.849 | 0.14 |
| multipleImp | 0.849 | 0.849 | -0.02 |
x= pool(multim_est_10per) #pooling estimates from the 5 imputed models
#pool.r.squared(x, adjusted = FALSE)[1]
imputed2 = data.frame(fullData = summary(lm_fulldata)$r.squared,
listDeletion = summary(lm_ld_10per)$r.squared,
meanImp = summary(lm_mi_10per)$r.squared,
regressionImp = summary(lm_ri_10per)$r.squared,
multipleImp =pool.r.squared(x, adjusted = FALSE)[1])
original2 = data.frame(fullData = 0.849,
listDeletion = 0.850,
meanImp = 0.773,
regressionImp = 0.849,
multipleImp =0.849)
df2=as.data.frame(t(rbind(imputed_r=imputed2, original_r=original2))) %>%
tibble::rownames_to_column(var="method") %>%
mutate(diff_r=(imputed_r-original_r)/original_r*100,
diff_r=round(diff_r,2),
imputed_r=round(imputed_r,3))
df2
| method | imputed_r | original_r | diff_r |
|---|---|---|---|
| fullData | 0.848 | 0.849 | -0.08 |
| listDeletion | 0.849 | 0.85 | -0.09 |
| meanImp | 0.751 | 0.773 | -2.8 |
| regressionImp | 0.849 | 0.849 | 0 |
| multipleImp | 0.85 | 0.849 | 0.14 |
### Exploratory analyses
I didn’t conduct any exploratory analyses.
The criteria for success of the reproduction attempt were defined as:
Criterion 1: Other than mean imputation, all imputation methods should produce parameter estimates and model statistics very similar to their full-data counterparts. This holds for R-squared in the 5% missing data reproduction. The beta coefficients are also in the same direction and retain their significance, except for home education resources.
Criterion 2: The R-squared estimates produced for the different methods should be within 5% of Cheema (2014). This criterion was met for all methods: as the tables above show, the difference between the reproduction and the original results is less than 5% for all imputation methods, for both 5% and 10% missing data.
It would have been ideal to reproduce the exact results of the original paper, since I had access to the same dataset. I was able to reproduce the study fairly closely; I believe the remaining differences in the estimates were due to:
(1) Differences in R and SPSS packages
(2) Differences between the original study's approach and mine to handling and dropping data: this was due both to the black-box nature of pre-written packages (in both software environments) and to small assumptions made in running the regressions. Even though the author responded to me, we did not have sufficient information to reproduce the results exactly. This could be because the author wrote the paper 8 years ago, or because I missed asking him about some detail. For example, as I was writing this report, I realized I could have dropped 5% and 10% of the PISA data in a different way: I dropped values from only one variable (math achievement), but it was also possible to drop entire rows. Both methods are used in data validation papers in education research, and the original paper does not mention which approach was used.
The difference made by using different software, and by the assumptions involved in running the packages, was bigger than I had anticipated. This was something we had discussed in class as well, and it underscores for me the importance of replication and open research frameworks.