---
title: "Replication of 'empirical sample 1' by Cheema (2014, Journal of Modern Applied Statistical Methods)"
author: "Radhika Kapoor (rkap786@stanford.edu)"
date: "November 20, 2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: false
---
Cheema (2014) evaluates different treatment methods for missing data in surveys by comparing their effect on the accuracy of estimation. The paper conducts the analysis on: (1) a simulated dataset (n=10,000 cases), (2) Empirical sample 1: the US portion of PISA 2003 (n=456 cases), and (3) Empirical sample 2: the Population and Housing portion of the US Census (2000). The paper examines how the different methods perform when the sample size is small, medium, or large, and when 5% or 10% of the data is missing.
In this project, I propose to reproduce Cheema (2014)'s findings from the US portion of the PISA 2003 dataset (Empirical sample 1) with 5% and 10% missing data using five different imputation methods: listwise deletion, mean imputation, regression imputation, EM imputation, and multiple imputation. These are the results in Table 2 and Table 3 of the paper.
Cheema (2014) found that, other than mean imputation, all imputation methods produced parameter estimates and model statistics that were very similar to their full-data counterparts. This is the result I will be aiming to reproduce. Further, I will consider the reproduction successful if the R-squared estimates produced for the different methods here are within 5% of Cheema (2014).
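To make this criterion concrete, here is a small helper of my own (not from the paper) that checks it:

#Hypothetical helper, not part of the original analysis: TRUE if the reproduced
#R-squared is within 5% (relative difference) of the original estimate
within_5_percent <- function(r2_reproduced, r2_original) {
  abs(r2_reproduced - r2_original) / r2_original <= 0.05
}
within_5_percent(0.848, 0.849) #TRUE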
This paper is highly cited in the context of handling missing data, but it has not been replicated to my knowledge. The issue of missing data is highly relevant in education research, especially when datasets are small and a relatively large proportion of values is missing. This project would also give me an opportunity to work with large-scale assessment datasets like PISA and understand their underlying structure. This paper is relevant to my research interests as I plan to work on large education datasets in the context of international assessments. By replicating the results, I would learn some methods for handling missing data in my future work.
1. Student achievement variables used in Cheema (2014)
Cheema (2014) conducts student-level imputations for PISA (2003). The dependent variable is math achievement, and reading achievement is one of the explanatory variables.
PISA reports 5 plausible values each for math and reading for every student, and advises that student-level plausible values should not be averaged. Instead, any estimation should be conducted separately for each plausible value, and the resulting estimates should be averaged. However, the author confirmed that he had averaged the plausible values in the PISA dataset in this paper. I have followed the same approach, but the scores we report are therefore not the most accurate representation of student ability.
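For reference, a minimal sketch (commented out, not part of this reproduction) of the PISA-recommended approach: fit the model once per plausible value and average the resulting coefficients. The data frame name `dat` and the predictor names are assumptions borrowed from the analysis below.

#Sketch only, not run: fit one regression per math plausible value and average
#the coefficients, as PISA recommends. Assumes a data frame `dat` holding
#PV1MATH..PV5MATH plus the predictors created later in this script
#(student_gender, HEDRES, stdANXMAT, read_achievement).
# pv_cols <- paste0("PV", 1:5, "MATH")
# fits <- lapply(pv_cols, function(pv) {
#   lm(reformulate(c("student_gender", "HEDRES", "stdANXMAT", "read_achievement"),
#                  response = pv), data = dat)
# })
# rowMeans(sapply(fits, coef)) #averaged coefficient estimates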
2. Methods used for data imputation
The original study used SPSS, which offers built-in procedures for the missing data imputation methods used in this paper. I reproduced this study in R, which does not offer identical built-in models. R does have packages for regression and multiple imputation (I used the package "mice"), but it is not possible to know exactly how their implementations compare to SPSS. For the other models, I wrote the code myself, and again, it is difficult to know whether the methods are identical to those in SPSS.
I also could not reproduce the "Expectation Maximization (EM)" method. SPSS does not offer much information about its EM procedure, and the consensus of internet blogs seems to be that it is not a rigorous implementation. An EM optimization method is available in R and I tried to use it, but the results were very different from SPSS. This is more likely due to differences in the packages and code used than to other data- or analysis-related issues in the reproduction. I hence chose not to report the results for this method.
The PISA (2003) dataset for the US was downloaded from the NCES website. The SPSS macros provided by NCES were run to make the data ready for analysis, and the relevant variables were retained (student achievement in math and reading, student gender, home education resources, and math anxiety).
The performance of the different imputation methods is compared by: (1) dropping data completely at random (5% and 10%), (2) estimating the missing values using each imputation method, and (3) running a linear multiple regression in which math achievement is predicted using the other variables; the regression is run on both the full dataset and the dataset with imputed values for the missing data. The imputation methods applied are: listwise deletion, mean imputation, regression imputation, and multiple imputation.
This study differs from the original in its use of R instead of SPSS, which changes the packages used. I am also not sure how the original study generated the missing data. The study describes the method as:
“Each of these sub-samples was then reduced in size by 1%, 2%, 5%, 10%, and 20% in order to simulate datasets containing missing data. The cases were discarded randomly from each complete sample five times separately in order to make sure that there were no dependencies between samples.”
I have taken this to mean that 5% and 10% of the "math achievement" variable should be marked as NA for the analysis, but the author might instead have dropped the full rows. However, I could not find a way to do that in R that fit logically with the rest of my analysis; a sketch of what the row-dropping interpretation would look like follows.
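Purely for illustration, a minimal sketch of the row-dropping interpretation (commented out, not run; it assumes the filtered dataset `pisa_filter` created below, and note that discarding whole rows leaves nothing to impute for those cases):

#Sketch only, not run: the alternative reading of the paper, discarding whole
#rows at random instead of blanking one variable. Assumes pisa_filter from below.
# set.seed(123)
# drop_idx <- sample(nrow(pisa_filter), size = round(0.05 * nrow(pisa_filter)))
# pisa_5per_rows <- pisa_filter[-drop_idx, ]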
Further, the variables "home education resources" and "math anxiety" have additional missing values of their own. The author did not mention dropping them from the analysis, so I have retained them. However, it is possible that the author did drop them without mentioning it, which could explain why these variables behave differently in my estimations.
I load the libraries needed for analysis and the relevant datasets in this section.
## Load Relevant Libraries and Functions
## Default repo
##local({r <- getOption("repos")
## r["CRAN"] <- "https://cloud.r-project.org"
## options(repos=r)
##})
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(readr)
library(norm)
library(knitr)
#install.packages("mice") #package for regression imputation
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
#install.packages("miceadds")
library(miceadds)
## * miceadds 3.10-28 (2020-07-29 21:56:24)
library(sandwich)
library(mitools)
library(texreg)
## Version: 1.37.5
## Date: 2020-06-17
## Author: Philip Leifeld (University of Essex)
##
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
#install.packages("jtools")
library(jtools)
#install.packages("huxtable")
library(huxtable)
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:ggplot2':
##
## theme_grey
## The following object is masked from 'package:dplyr':
##
## add_rownames
#### Import data
url <- '/Users/radhika/Documents/Stanford readings/251 Experimental methods/Project/cheema2014/PISA data/Extracted data/PISA 2003 Student Background.csv'
pisa <-read.csv(url)
In this section, I modify the dataset to retain only the variables needed for the analysis, and I create the mean achievement scores used in the analysis. I also confirm that the variables behave as described in the original paper (e.g. reading and math achievement range between 200 and 800), and clean the data by dropping missing values, again as described in the paper.
# Variables selected are: IDs, gender, math anxiety, household, education resources, plausible values for math and reading
#### Prepare data for analysis - create columns etc.
####Calculate mean reading and math achievement
pisa = pisa %>%
rowwise() %>%
mutate(math_achievement = mean(c(PV1MATH, PV2MATH, PV3MATH, PV4MATH, PV5MATH))) %>%
mutate(read_achievement=mean(c(PV1READ, PV2READ, PV3READ, PV4READ, PV5READ)))
#### Data exclusion / filtering NA values
pisa = pisa %>%
rename(student_gender = ST03Q01, studentID=STIDSTD) %>%
select(c("studentID", "SCHOOLID", "student_gender","ANXMAT", "HEDRES", "math_achievement", "read_achievement"))
#head(pisa)
#nrow(pisa)
#colnames(pisa)
#summary(pisa)
#Paper says math and reading scores are normally distributed - plotting to check
ggplot(pisa, aes(x=math_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(pisa, aes(x=read_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Check for outliers in variables of interest
colnames(pisa)
## [1] "studentID" "SCHOOLID" "student_gender" "ANXMAT"
## [5] "HEDRES" "math_achievement" "read_achievement"
unique(pisa$student_gender)
## [1] 1 2 8
unique(pisa$HEDRES)
## [1] -1.4485 -0.6236 0.6766 999.0000 -2.1925 -3.0064 -4.2987
#Check outliers in math anxiety
ggplot(pisa, aes(x=ANXMAT)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Student gender and math anxiety seem to have invalid codes (8, 999) - recoding these as NA and filtering below
pisa_filter=pisa %>%
mutate(ANXMAT=replace(ANXMAT, ANXMAT>=999, NA)) %>%
mutate(HEDRES=replace(HEDRES, HEDRES>=999, NA)) %>%
filter(student_gender<3,
!is.na(student_gender))
#Paper standardizes math anxiety and home education resources, so the standardized variables are created
mean_anxmat <- mean(pisa_filter$ANXMAT, na.rm=T)
sd_anxmat <- sd(pisa_filter$ANXMAT, na.rm=T)
mean_hedres <- mean(pisa_filter$HEDRES, na.rm=T)
sd_hedres <- sd(pisa_filter$HEDRES, na.rm=T)
pisa_filter=pisa_filter %>%
mutate(stdANXMAT= (ANXMAT-mean_anxmat)/sd_anxmat) %>%
mutate(stdHEDRES= (HEDRES-mean_hedres)/sd_hedres)
Cheema (2014) describes reducing the dataset by 5% and 10% for the analysis. I create those datasets in this section.
When I run the regressions from the model described above, the variable "home education resources" produces very different beta estimates from the original (both the coefficient estimate and its level of significance change). When I wrote to the author, he suggested that I try not standardizing the variable. Hence, a dataset with the original home education resources variable (rather than its standardized version) is also created (with 5% and 10% missing values).
set.seed(123)
pisa_filter <- pisa_filter %>%
mutate(rv = rnorm(n())) %>%
arrange(rv)
cutoff_5per <- 5183 #Cheema (2014) rounded 95% of the 5455 obs to 5182; rows with id below this cutoff keep their value
cutoff_10per <- 4911 #Cheema (2014) rounded 90% of the 5455 obs to 4910
#pisa_5persample <- sample_frac(pisa_filter, 0.95)
######the code above didn't work, so instead I blank values past a fixed cutoff in the randomly ordered data
pisa_5persample <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_5per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, HEDRES,math_achievement, rv))
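#The "_mod" datasets rename the unstandardized HEDRES to stdHEDRES so that the same model functions can be reused on both versions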
pisa_5persample_mod <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_5per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, stdHEDRES,math_achievement, rv)) %>%
mutate(stdHEDRES=HEDRES) %>%
select(-c(HEDRES))
summary(pisa_5persample_mod)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
##
## stdANXMAT math_achievement2 stdHEDRES
## Min. :-2.20444 Min. :212.9 Min. :-4.2987
## 1st Qu.:-0.34362 1st Qu.:417.7 1st Qu.:-0.6236
## Median :-0.07396 Median :482.1 Median : 0.6766
## Mean : 0.00000 Mean :481.5 Mean :-0.1690
## 3rd Qu.: 0.68481 3rd Qu.:545.3 3rd Qu.: 0.6766
## Max. : 2.56340 Max. :739.2 Max. : 0.6766
## NA's :106 NA's :273 NA's :37
pisa_10persample <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_10per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, HEDRES,math_achievement, rv))
pisa_10persample_mod <- pisa_filter %>%
tibble::rowid_to_column("new_rowid") %>%
mutate(math_achievement2= ifelse(new_rowid<(cutoff_10per),math_achievement,NA)) %>%
select(-c(new_rowid)) %>%
select(-c(ANXMAT, stdHEDRES,math_achievement, rv)) %>%
mutate(stdHEDRES=HEDRES) %>%
select(-c(HEDRES))
summary(pisa_10persample_mod)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
##
## stdANXMAT math_achievement2 stdHEDRES
## Min. :-2.20444 Min. :224.2 Min. :-4.2987
## 1st Qu.:-0.34362 1st Qu.:418.0 1st Qu.:-0.6236
## Median :-0.07396 Median :482.4 Median : 0.6766
## Mean : 0.00000 Mean :482.0 Mean :-0.1690
## 3rd Qu.: 0.68481 3rd Qu.:546.6 3rd Qu.: 0.6766
## Max. : 2.56340 Max. :739.2 Max. : 0.6766
## NA's :106 NA's :545 NA's :37
I define functions for the imputation methods here. These are based on my understanding of what the corresponding procedures do in SPSS, and they reflect how I implemented them in R:
- Regression for the full model: a linear regression that returns the results of the `lm` function on the complete data.
- Listwise deletion: rows with missing values are dropped and the regression model is run on the remaining data.
- Mean imputation: missing values are replaced with the mean of the observed values in that variable's column.
- Regression imputation: missing values are estimated using a regression function. The model is estimated on the rows for which data are available; this model is then used to predict the missing values of the variable concerned. The `mice` package in R offers two options for regression imputation: (1) `norm.predict`, deterministic regression imputation, which imputes the regression prediction directly, and (2) `norm.nob`, stochastic regression imputation, which adds a small, randomly generated error term to the prediction. Stochastic imputation is preferable since it is less deterministic, and I used it here since that is what SPSS uses (the author of the original study also recommended the stochastic method).
- EM method: not used, as I could not find a good way to run this method in R (see below).
- Multiple imputation: as the name suggests, this method imputes multiple estimated datasets. The dataset with missing values is copied a specified number of times, and the missing values are estimated for each copy; the imputed values differ slightly because small random errors are introduced in the estimation. In the `mice` package, a chain of regression equations is used to estimate the variables: first, one variable is imputed using the other variables, with missing values in those other variables initially filled in by their means; the imputed values of the first variable are then used in the regression estimation for the next variable, and so on. This is similar to iterating stochastic regression, with the differences that (a) it is run on multiple, slightly different datasets and (b) it is run for multiple iterations until the results converge.

The `mice` function allows different estimation methods to be specified for different variables. I used predictive mean matching (pmm), since we are not estimating any categorical variables. The variables with missing values that are imputed are math achievement, home education resources, and math anxiety.
model0_full = function(data){
lm_fulldata=lm(math_achievement ~ student_gender + HEDRES + stdANXMAT + read_achievement, data=data)
return(lm_fulldata)
}
model1_deletion = function(data){
data = data %>%
filter(!is.na(math_achievement2))
lm_ld=lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data=data)
return(lm_ld)
}
model2_mean_imp = function(data){
  #calculate the mean of each variable's observed values, to be used for imputation
  mean_math <- mean(data$math_achievement2, na.rm = T)
  mean_anxmat <- mean(data$stdANXMAT, na.rm = T)
  mean_hedres <- mean(data$stdHEDRES, na.rm = T)
  #replace each variable's missing values with that variable's own mean (ifelse is vectorized, so rowwise() is not needed)
  data <- data %>%
    mutate(math_achievement2 = ifelse(is.na(math_achievement2), mean_math, math_achievement2)) %>%
    mutate(stdHEDRES = ifelse(is.na(stdHEDRES), mean_hedres, stdHEDRES)) %>%
    mutate(stdANXMAT = ifelse(is.na(stdANXMAT), mean_anxmat, stdANXMAT))
  #Run the regression using the mean-imputed data
  lm_mi = lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = data)
  return(lm_mi)
}
model3_reg_imp = function(data){
  imp <- mice(data, method = "norm.nob", m = 1) #Impute data, produce 1 completed dataset
  #norm.predict = deterministic regression imputation (prediction only, no error term)
  #norm.nob = stochastic regression imputation (prediction plus a random error term)
  imp2 <- complete(imp) #mice returns a mids (list) object; complete() converts it to a single data frame with the missing values filled in
  lm_ri = lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = imp2)
  return(lm_ri)
}
I could not find a way to replicate SPSS's Expectation Maximization (EM) imputation in R. The code chunk below shows the two methods I tried, neither of which looks right: the first produces estimates very different from the original model, and the second produces missing or "Inf" values. There do not seem to be any readily available packages in R that correspond to the SPSS procedure, so I decided not to reproduce this method.
### (Failed) Method 1: Use package mvdalab
#### The package runs but I am not quite sure what it does, and the results don't correspond to the paper. My best guess is that this model is not the same as the SPSS one
# install.packages("mvdalab")
# library(mvdalab)
# imp= imputeEM(pisa_5persample, impute.ncomps = 1, pca.ncomps = 1, CV = TRUE, Init = "mean",
# scale = TRUE, iters = 25, tol = .Machine$double.eps^0.25)
# implist = (pool(imp$Imputed.DataFrames))
#
# modelFit2 <- with(implist,lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement))
# summary(modelFit2)
### (Failed) Method 2: Use imp.norm
### The code runs but the missing values are not filled and "Inf" is generated
# library(norm)
# summary(pisa_5persample)
# s <- prelim.norm(as.matrix(pisa_5persample))
# thetahat <- em.norm(s)
# getparam.norm(s,thetahat)
#
# ximp = imp.norm(s, thetahat, as.matrix(pisa_5persample))
# summary(ximp)
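For completeness, one more EM-based avenue I did not try: the Amelia package implements EM with bootstrapping (EMB). Since EMB is a multiple-imputation method rather than SPSS's single EM imputation, it would not match the paper either; the sketch below is untested and only indicates where one might start.

### (Untried) Method 3: Use package Amelia
### Amelia implements EM with bootstrapping (EMB), i.e. multiple imputation, so it is not equivalent to SPSS's single EM imputation - sketch only, not run
# install.packages("Amelia")
# library(Amelia)
# a.out <- amelia(as.data.frame(pisa_5persample), m = 5,
#                 idvars = c("studentID", "SCHOOLID"))
# summary(a.out)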
model5_multiple_imp = function(data){
  imp <- mice(data, m=5, maxit=10, meth=c('pmm'), seed=500, printFlag = F)
  #mice performs the multiple imputation
  #m=5: number of imputations, i.e. 5 completed datasets are generated
  #maxit=10: number of iterations run
  #pmm: predictive mean matching
  implist <- miceadds::mids2datlist(imp) #convert the mids object into a list of completed datasets
  modelFit2 <- with(implist, lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement)) #estimates the lm model for each imputed dataset, i.e. 5 lm models
  return(modelFit2) #results are pooled later, in extract.df
}
####I copy-pasted the entire code chunk for the function "extract.df" from a blog post
extract.df <- function(tt, cl = NULL) {
#create a list from imputation - makes export easier
require(sandwich)
require(mitools)
require(texreg)
m2 <- length(tt) #number of imputations
betas <- lapply(tt, coef)
vars <- lapply(tt, vcov)
# conduct statistical inference and save results into a data.frame
model1 <- summary(pool_mi(betas, vars))
R2 <- mean(sapply(1:m2, function(x) summary(tt[[x]])$r.squared))
#technically, should use R to Z-- but close for now when R is low
ns <- nobs(tt[[1]])
#creates what is used by texreg
tr <- createTexreg(
coef.names = row.names(model1),
coef = model1$results,
se = model1$se,
pvalues = model1$p,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
return(tr)
}
This section runs the imputation methods on the PISA dataset with 5% and 10% missing data. The models are estimated using both the standardized home education resources variable and the original, unstandardized variable.
#These results are similar to the paper results
##Full data analysis
md.pattern(pisa_filter, plot = F) #check the pattern of missing data
## studentID SCHOOLID student_gender math_achievement read_achievement rv
## 5349 1 1 1 1 1 1
## 69 1 1 1 1 1 1
## 37 1 1 1 1 1 1
## 0 0 0 0 0 0
## HEDRES stdHEDRES ANXMAT stdANXMAT
## 5349 1 1 1 1 0
## 69 1 1 0 0 2
## 37 0 0 0 0 4
## 37 37 106 106 286
lm_fulldata=model0_full(pisa_filter)
## For 5% missing data
md.pattern(pisa_5persample, plot = F) #check the pattern of missing data
## studentID SCHOOLID student_gender read_achievement stdHEDRES stdANXMAT
## 5083 1 1 1 1 1 1
## 266 1 1 1 1 1 1
## 66 1 1 1 1 1 0
## 3 1 1 1 1 1 0
## 33 1 1 1 1 0 0
## 4 1 1 1 1 0 0
## 0 0 0 0 37 106
## math_achievement2
## 5083 1 0
## 266 0 1
## 66 1 1
## 3 0 2
## 33 1 2
## 4 0 3
## 273 416
### Using standardized HEDRES as specified in paper
lm_ld_5per = model1_deletion(pisa_5persample) #list wise deletion
lm_mi_5per = model2_mean_imp(pisa_5persample) #mean imputation
lm_ri_5per = model3_reg_imp(pisa_5persample) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT stdHEDRES math_achievement2
## 2 1 stdANXMAT stdHEDRES math_achievement2
## 3 1 stdANXMAT stdHEDRES math_achievement2
## 4 1 stdANXMAT stdHEDRES math_achievement2
## 5 1 stdANXMAT stdHEDRES math_achievement2
## Warning: Number of logged events: 1
multim_est_5per= model5_multiple_imp(pisa_5persample) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_5per <- extract.df(multim_est_5per) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 18.7049692 3.436358157 5.443254 5.270374e-08 11.96957368
## student_gender 30.5474741 0.992240013 30.786376 3.698504e-196 28.60240725
## stdHEDRES 1.0509477 0.515044109 2.040500 4.158885e-02 0.04013469
## stdANXMAT -12.5190415 0.515182112 -24.300225 1.105203e-112 -13.52947640
## read_achievement 0.8439916 0.005625905 150.018829 0.000000e+00 0.83296434
## upper) missInfo
## (Intercept) 25.440365 1.1 %
## student_gender 32.492541 2.3 %
## stdHEDRES 2.061761 6.8 %
## stdANXMAT -11.508607 4.9 %
## read_achievement 0.855019 1.5 %
### Using HEDRES variable with no standardization, as suggested by the author
lm_ld_5per2 = model1_deletion(pisa_5persample_mod) #list wise deletion
lm_mi_5per2 = model2_mean_imp(pisa_5persample_mod) #mean imputation
lm_ri_5per2 = model3_reg_imp(pisa_5persample_mod) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT math_achievement2 stdHEDRES
## 2 1 stdANXMAT math_achievement2 stdHEDRES
## 3 1 stdANXMAT math_achievement2 stdHEDRES
## 4 1 stdANXMAT math_achievement2 stdHEDRES
## 5 1 stdANXMAT math_achievement2 stdHEDRES
## Warning: Number of logged events: 1
multim_est_5per2=model5_multiple_imp(pisa_5persample_mod) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_5per2= extract.df(multim_est_5per2) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 18.9832323 3.617928574 5.246989 2.533734e-07 11.8703234
## student_gender 30.5118330 1.003301070 30.411443 6.060850e-167 28.5441930
## stdHEDRES 0.8664216 0.457160365 1.895225 5.822086e-02 -0.0301921
## stdANXMAT -12.5267505 0.509984745 -24.562991 7.383019e-126 -13.5265452
## read_achievement 0.8437922 0.005803911 145.383378 0.000000e+00 0.8323971
## upper) missInfo
## (Intercept) 26.0961411 10.5 %
## student_gender 32.4794730 4.6 %
## stdHEDRES 1.7630352 4.8 %
## stdANXMAT -11.5269559 2.9 %
## read_achievement 0.8551874 7.8 %
######################################################################
## For 10% missing data
md.pattern(pisa_10persample, plot = F) #check the pattern of missing data
## studentID SCHOOLID student_gender read_achievement stdHEDRES stdANXMAT
## 4813 1 1 1 1 1 1
## 536 1 1 1 1 1 1
## 64 1 1 1 1 1 0
## 5 1 1 1 1 1 0
## 33 1 1 1 1 0 0
## 4 1 1 1 1 0 0
## 0 0 0 0 37 106
## math_achievement2
## 4813 1 0
## 536 0 1
## 64 1 1
## 5 0 2
## 33 1 2
## 4 0 3
## 545 688
### Using standardized HEDRES as specified in paper
lm_ld_10per = model1_deletion(pisa_10persample) #list wise deletion
lm_mi_10per = model2_mean_imp(pisa_10persample) #mean imputation
lm_ri_10per = model3_reg_imp(pisa_10persample) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT stdHEDRES math_achievement2
## 2 1 stdANXMAT stdHEDRES math_achievement2
## 3 1 stdANXMAT stdHEDRES math_achievement2
## 4 1 stdANXMAT stdHEDRES math_achievement2
## 5 1 stdANXMAT stdHEDRES math_achievement2
## Warning: Number of logged events: 1
multim_est_10per= model5_multiple_imp(pisa_10persample) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_10per= extract.df(multim_est_10per) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 17.5490203 3.650936806 4.806717 2.736037e-06 10.35633268
## student_gender 30.8305051 1.065787309 28.927446 7.774665e-65 28.72539242
## stdHEDRES 0.9102864 0.506541173 1.797063 7.245279e-02 -0.08302295
## stdANXMAT -12.6921422 0.579401721 -21.905600 3.705601e-31 -13.84999149
## read_achievement 0.8451888 0.005796448 145.811491 0.000000e+00 0.83380696
## upper) missInfo
## (Intercept) 24.7417080 13.8 %
## student_gender 32.9356178 17 %
## stdHEDRES 1.9035958 4.2 %
## stdANXMAT -11.5342928 27.5 %
## read_achievement 0.8565706 8.1 %
### Using HEDRES variable with no standardization, as suggested by the author
lm_ld_10per2 = model1_deletion(pisa_10persample_mod) #list wise deletion
lm_mi_10per2 = model2_mean_imp(pisa_10persample_mod) #mean imputation
lm_ri_10per2 = model3_reg_imp(pisa_10persample_mod) #regression imputation
##
## iter imp variable
## 1 1 stdANXMAT math_achievement2 stdHEDRES
## 2 1 stdANXMAT math_achievement2 stdHEDRES
## 3 1 stdANXMAT math_achievement2 stdHEDRES
## 4 1 stdANXMAT math_achievement2 stdHEDRES
## 5 1 stdANXMAT math_achievement2 stdHEDRES
## Warning: Number of logged events: 1
multim_est_10per2=model5_multiple_imp(pisa_10persample_mod) #multiple imputation
## Warning: Number of logged events: 1
lm_multim_10per2= extract.df(multim_est_10per2) #extract results
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 17.9253153 3.614054089 4.959891 1.083855e-06 10.8182909
## student_gender 30.7228040 1.031257595 29.791591 8.793595e-102 28.6951707
## stdHEDRES 0.7415120 0.483133284 1.534798 1.267098e-01 -0.2122652
## stdANXMAT -12.8959376 0.528838122 -24.385416 6.542410e-81 -13.9356132
## read_achievement 0.8449354 0.005901284 143.178241 7.754751e-293 0.8333253
## upper) missInfo
## (Intercept) 25.0323397 11 %
## student_gender 32.7504373 10.7 %
## stdHEDRES 1.6952892 16.4 %
## stdANXMAT -11.8562620 10.5 %
## read_achievement 0.8565455 11.7 %
This section compiles the results from the regressions into tables: one table for 5% missing data (corresponding to Table 2 in the paper) and a second for 10% missing data (corresponding to Table 3). This reproduction gave the same result as the original: it is better to use any of the other imputation methods than mean imputation. Even deletion of rows with missing data is preferable to mean imputation, as mean imputation biases the results.
The results for the home education resources variable are not reproduced, either when we standardize it or when we do not. The next section discusses only the results from the standardized home education resources variable.
##5% missing data
### Using standardized HEDRES as in paper
screenreg(list(lm_fulldata, lm_ld_5per, lm_mi_5per, lm_ri_5per, lm_multim_5per),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 16.221 *** 13.467 *** 18.165 *** 18.705 ***
## (3.480) (3.553) (3.670) (3.404) (3.436)
## Gender 30.710 *** 30.538 *** 32.241 *** 30.576 *** 30.547 ***
## (0.986) (1.012) (1.106) (0.977) (0.992)
## Home education resources 0.961 * 1.308 * -0.026 1.024 * 1.051 *
## (0.454) (0.518) (0.017) (0.496) (0.515)
## Math anxiety -12.344 *** -12.450 *** 0.036 *** -12.638 *** -12.519 ***
## (0.506) (0.518) (0.010) (0.501) (0.515)
## Reading achievement 0.851 *** 0.849 *** 0.849 *** 0.845 *** 0.844 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.848 0.792 0.850
## Adj. R^2 0.848 0.848 0.792 0.850
## Num. obs. 5349 5083 5455 5455
## F statistic 7471.529 7082.083 5199.940 7732.297
## R2 0.849
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
### Using normal HEDRES as suggested by author
screenreg(list(lm_fulldata, lm_ld_5per2, lm_mi_5per2, lm_ri_5per2, lm_multim_5per2),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 16.420 *** 13.459 *** 19.034 *** 18.983 ***
## (3.480) (3.569) (3.670) (3.431) (3.618)
## Gender 30.710 *** 30.538 *** 32.241 *** 30.222 *** 30.512 ***
## (0.986) (1.012) (1.106) (0.980) (1.003)
## Home education resources 0.961 * 1.172 * -0.026 0.771 0.866
## (0.454) (0.464) (0.017) (0.446) (0.457)
## Math anxiety -12.344 *** -12.450 *** 0.036 *** -12.526 *** -12.527 ***
## (0.506) (0.518) (0.010) (0.503) (0.510)
## Reading achievement 0.851 *** 0.849 *** 0.849 *** 0.844 *** 0.844 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.848 0.792 0.849
## Adj. R^2 0.848 0.848 0.792 0.849
## Num. obs. 5349 5083 5455 5455
## F statistic 7471.529 7082.083 5199.910 7650.895
## R2 0.849
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
##10% missing data
screenreg(list(lm_fulldata, lm_ld_10per, lm_mi_10per, lm_ri_10per, lm_multim_10per),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 15.497 *** 37.617 *** 17.197 *** 17.549 ***
## (3.480) (3.643) (3.912) (3.422) (3.651)
## Gender 30.710 *** 30.644 *** 31.083 *** 30.532 *** 30.831 ***
## (0.986) (1.037) (1.179) (0.982) (1.066)
## Home education resources 0.961 * 1.265 * -0.028 0.858 0.910
## (0.454) (0.532) (0.018) (0.499) (0.507)
## Math anxiety -12.344 *** -12.534 *** 0.031 ** -12.404 *** -12.692 ***
## (0.506) (0.532) (0.011) (0.505) (0.579)
## Reading achievement 0.851 *** 0.850 *** 0.804 *** 0.847 *** 0.845 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.849 0.751 0.849
## Adj. R^2 0.848 0.849 0.751 0.849
## Num. obs. 5349 4813 5455 5455
## F statistic 7471.529 6771.698 4116.657 7660.441
## R2 0.850
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
### Using normal HEDRES as suggested by author
screenreg(list(lm_fulldata, lm_ld_10per2, lm_mi_10per2, lm_ri_10per2, lm_multim_10per2),
custom.header= list("Full model"=1,"Deletion"=2, "Mean"=3, "Regression"=4, "Multiple"=5),
custom.model.names = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 5"),
custom.coef.names = c("Intercept", "Gender", "Home education resources", "Math anxiety", "Reading achievement", "Home education resources"),
digits=3,
include.fstatistic = TRUE)
##
## ==============================================================================================
## Full model Deletion Mean Regression Multiple
## ------------ ------------ ------------ ------------ ------------
## Model 0 Model 1 Model 2 Model 3 Model 5
## ----------------------------------------------------------------------------------------------
## Intercept 14.896 *** 15.689 *** 37.609 *** 17.031 *** 17.925 ***
## (3.480) (3.660) (3.912) (3.429) (3.614)
## Gender 30.710 *** 30.644 *** 31.083 *** 31.139 *** 30.723 ***
## (0.986) (1.037) (1.179) (0.978) (1.031)
## Home education resources 0.961 * 1.134 * -0.028 0.784 0.742
## (0.454) (0.477) (0.018) (0.445) (0.483)
## Math anxiety -12.344 *** -12.534 *** 0.031 ** -12.590 *** -12.896 ***
## (0.506) (0.532) (0.011) (0.502) (0.529)
## Reading achievement 0.851 *** 0.850 *** 0.804 *** 0.846 *** 0.845 ***
## (0.006) (0.006) (0.006) (0.006) (0.006)
## ----------------------------------------------------------------------------------------------
## R^2 0.848 0.849 0.751 0.850
## Adj. R^2 0.848 0.849 0.751 0.850
## Num. obs. 5349 4813 5455 5455
## F statistic 7471.529 6771.698 4116.640 7725.917
## R2 0.850
## Nobs 5455
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
This section compares the R-squared estimates from the reproduction with the results from Cheema (2014).
#compile results for 5% missing data
##Compile R-squared
x= pool(multim_est_5per) #pooling estimates from the 5 imputed models
#pool.r.squared(x, adjusted = FALSE)[1]
imputed_r = data.frame(fullData = summary(lm_fulldata)$r.squared,
listDeletion = summary(lm_ld_5per)$r.squared,
meanImp = summary(lm_mi_5per)$r.squared,
regressionImp = summary(lm_ri_5per)$r.squared,
multipleImp =pool.r.squared(x, adjusted = FALSE)[1])
original_r = data.frame(fullData = 0.849,
listDeletion = 0.848,
meanImp = 0.801,
regressionImp = 0.849,
multipleImp =0.849)
df=as.data.frame(t(rbind(imputed_r=imputed_r, original_r=original_r))) %>%
tibble::rownames_to_column(var="method") %>%
mutate(diff_r=(imputed_r-original_r)/original_r*100,
diff_r=round(diff_r,2))
df
| method | imputed_r | original_r | diff_r |
|---|---|---|---|
| fullData | 0.848 | 0.849 | -0.08 |
| listDeletion | 0.848 | 0.848 | 0 |
| meanImp | 0.792 | 0.801 | -1.08 |
| regressionImp | 0.85 | 0.849 | 0.14 |
| multipleImp | 0.849 | 0.849 | -0.02 |
x= pool(multim_est_10per) #pooling estimates from the 5 imputed models
#pool.r.squared(x, adjusted = FALSE)[1]
imputed2 = data.frame(fullData = summary(lm_fulldata)$r.squared,
listDeletion = summary(lm_ld_10per)$r.squared,
meanImp = summary(lm_mi_10per)$r.squared,
regressionImp = summary(lm_ri_10per)$r.squared,
multipleImp =pool.r.squared(x, adjusted = FALSE)[1])
original2 = data.frame(fullData = 0.849,
listDeletion = 0.850,
meanImp = 0.773,
regressionImp = 0.849,
multipleImp =0.849)
df2=as.data.frame(t(rbind(imputed_r=imputed2, original_r=original2))) %>%
tibble::rownames_to_column(var="method") %>%
mutate(diff_r=(imputed_r-original_r)/original_r*100,
diff_r=round(diff_r,2),
imputed_r=round(imputed_r,3))
df2
| method | imputed_r | original_r | diff_r |
|---|---|---|---|
| fullData | 0.848 | 0.849 | -0.08 |
| listDeletion | 0.849 | 0.85 | -0.09 |
| meanImp | 0.751 | 0.773 | -2.8 |
| regressionImp | 0.849 | 0.849 | 0 |
| multipleImp | 0.85 | 0.849 | 0.14 |
### Exploratory analyses
I didn’t conduct any exploratory analyses.
The criteria for success of the reproduction attempt were defined as:
Criterion 1: Other than mean imputation, all imputation methods should produce parameter estimates and model statistics very similar to their full-data counterparts. This holds for R-squared in the 5% missing data reproduction. The beta coefficients are also in the same direction and retain their significance, except for home education resources.
Criterion 2: The R-squared estimates produced for the different methods should be within 5% of Cheema (2014). This criterion was met for all methods: as the tables above show, the difference between the reproduction and the original results is less than 5% for all imputation methods, for both 5% and 10% missing data.
It would have been ideal to reproduce the exact results of the original paper, since I had access to the same dataset. I was able to reproduce the study fairly closely; I believe the remaining differences in the estimates were due to:
(1) Differences in R and SPSS packages
(2) Differences between the original study's approach and mine to handling and dropping data: this was due both to the black-box nature of pre-written packages (in both software environments) and to small assumptions made in running the regressions. Even though the author responded to me, we did not have sufficient information to reproduce the results exactly. This could be because the author wrote the paper 8 years ago, or because I missed asking him about some detail. For example, as I was writing this report, I realized I could have dropped 5% and 10% of the PISA data in a different way: I dropped values from only one variable (math achievement), but it was also possible to drop entire rows. Both methods are used in data validation papers in education research, and the original paper does not mention which approach was used.
The difference made by using different software, and by the assumptions involved in running the packages, was bigger than I had anticipated. This was something we had discussed in class as well, and it underscores for me the importance of replication and open research frameworks.