I have completed the following parts of the project:
Imported the data into R
Prepared data for analysis: this included keeping the required variables, dropping outliers/missing values, and standardizing variables, so that the data correspond to the analysis described in the paper
Randomly dropped 5% and 10% of the dependent variable's values for the analysis
Tried 2 of the 5 imputation methods for the 5% sample
I have the following left to do:
The feedback I got for my midterm presentation was to decide what would constitute replication. Replication in this case could be:
The regression results don’t match the results in the paper fully. Even though I have the exact same number of observations, the following variables don’t fully match:
As anticipated, the mean math and reading achievement reported in the paper don't fully match the mean achievement variables I have calculated, though the differences appear small judging by the results
The regression results for one independent variable (home education resources) don't match the results in the paper. I've double-checked that I picked the right variable, dropped the outlier code (999), and normalized it as done in the paper. At this point, I am not sure why this variable behaves differently than it does in the paper
I would appreciate feedback on:
Whether my coding so far follows Tidyverse conventions, and if not, where to improve it
Whether my code could have been simpler, e.g., whether I wrote unnecessarily long code for something that could have been done more simply
Whether my coding is easy to read and follow
Cheema (2014) evaluates different treatment methods for missing data in surveys by comparing their effect on the accuracy of estimation. The paper conducts the analysis on: (1) a simulated dataset (n=10,000 cases); (2) Empirical sample 1: the US portion of PISA 2003 (n=5,456 cases); (3) Empirical sample 2: the Population and Housing portion of the US Census (2000). The paper looks at the effect of the different methods when the sample size is small, medium, or large, and when 5% or 10% of the data are missing.
In this project, I propose to reproduce Cheema's (2014) findings from the US portion of the PISA 2003 dataset (Empirical sample 1) with 5% missing data, using five different imputation methods: Listwise deletion, Mean imputation, Regression imputation, EM imputation, and Multiple imputation. These are the results in Table 2.
This paper is highly cited in the context of handling missing data, but to my knowledge it has not been replicated. This project would also give me an opportunity to work with large-scale assessment datasets like PISA and understand their underlying structure. The paper is relevant to my research interests, as I plan to work on large education datasets in the context of international assessments. By replicating the results, I would learn methods for handling missing data that I can use in my future work.
Do you anticipate running into any challenges when attempting to reproduce these result(s)? If so, please list them here.
1. Student achievement variables used in Cheema (2014)
Cheema (2014) conducts student-level imputations for PISA (2003). The dependent variable is math achievement, and reading achievement is one of the explanatory variables. The paper mentions that both achievement variables vary between 200 and 800.
However, PISA reports 5 plausible values for math and reading for each student, and advises that student-level plausible values should not be averaged. It suggests that any estimation should be conducted for each of the plausible values, and the resulting estimates should be averaged. Each of these plausible values has a minimum in the range of 141-189, lower than the minimum reported by Cheema. Given that the PISA dataset I have downloaded has the same number of observations as reported by Cheema, it is possible that the paper averaged the 5 plausible values. I will be able to determine this once I do the imputation.
To address this, I plan to: (1) first run the imputation using the averaged plausible values to see if Cheema (2014) can be replicated; (2) if the results are not replicated, conduct the estimation using each plausible value separately and then average the estimates (a rough sketch of approach (2) is given below).
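A rough sketch of approach (2), assuming the five math plausible-value columns (PV1MATH-PV5MATH) are kept alongside the predictors in the filtered dataset (the pipeline below currently drops them):
# Sketch only (not run): fit the regression once per plausible value and
# average the coefficient estimates, following the PISA guidance.
# pv_fits <- lapply(1:5, function(i) {
#   lm(reformulate(c("student_gender", "stdHEDRES", "stdANXMAT", "read_achievement"),
#                  response = paste0("PV", i, "MATH")),
#      data = pisa_filter)
# })
# pv_coefs <- rowMeans(sapply(pv_fits, coef))  # averaged estimates across the 5 PVs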
2. Figuring out how to code the methods used in the paper
Cheema (2014) first performs imputation for a simulated dataset, and then for PISA. In the simulated dataset, 5% of the data is randomly dropped 5 times, so that there are no dependencies between samples. However, the PISA section does not describe the values being dropped 5 times. To address this, I plan to drop 5% of the data only once; if this is not sufficient to reproduce the results, I can drop values 5 times, perform the imputation and run the regression on each version, and then average the results (see the sketch below).
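If the repeated-drop version turns out to be needed, one way to implement it (my own sketch, using the variable names defined later in this report; listwise deletion is shown as the example method):
# Sketch only (not run): create five 5%-missing versions of the data,
# run the regression on each, and average the coefficients.
# set.seed(123)
# runs <- lapply(1:5, function(i) {
#   dropped <- pisa_filter %>%
#     mutate(math_missing = ifelse(runif(n()) < 0.05, NA, math_achievement))
#   lm(math_missing ~ student_gender + stdHEDRES + stdANXMAT + read_achievement,
#      data = dropped)   # listwise deletion (lm drops incomplete rows by default)
# })
# avg_coefs <- rowMeans(sapply(runs, coef))  # average estimates over the 5 runs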
The PISA (2003) dataset for the US was downloaded from the NCES website. SPSS macros provided on the NCES website were run to make the data ready for analysis, and the relevant variables were retained (student achievement in math and reading, student gender, home education resources, and math anxiety).
For the analysis, math achievement was predicted from the other variables using a multiple linear regression. 5% of the data was then randomly discarded, each of the five imputation methods (Listwise deletion, Mean imputation, Regression imputation, EM imputation, and Multiple imputation) was applied to the reduced dataset, and the regression analysis was conducted on each imputed version.
Explicitly describe known differences in the analysis pipeline between the original paper and yours (e.g., computing environment). The goal, of course, is to minimize those differences, but differences may occur. Also, note whether such differences are anticipated to influence your ability to reproduce the original results.
## Load Relevant Libraries and Functions
## Default repo
##local({r <- getOption("repos")
## r["CRAN"] <- "https://cloud.r-project.org"
## options(repos=r)
##})
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(readr)
#install.packages("mice") #package for regression imputation
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
#install.packages("miceadds")
library(miceadds)
## * miceadds 3.10-28 (2020-07-29 21:56:24)
library(sandwich)
library(mitools)
library(texreg)
## Version: 1.37.5
## Date: 2020-06-17
## Author: Philip Leifeld (University of Essex)
##
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
#?sandwich
#install.packages("sandwich")
#install.packages("mitools")
#install.packages("texreg")
#### Import data
data_path <- '/Users/radhika/Documents/Stanford readings/251 Experimental methods/Project/cheema2014/PISA data/Extracted data/PISA 2003 Student Background.csv'
pisa <- read.csv(data_path)
# Variables selected are: IDs, gender, math anxiety, home education resources, and plausible values for math and reading
#### Prepare data for analysis - create columns etc.
#### Calculate mean achievement across the 5 plausible values
pisa <- pisa %>%
  rowwise() %>%
  mutate(math_achievement = mean(c(PV1MATH, PV2MATH, PV3MATH, PV4MATH, PV5MATH))) %>%
  mutate(read_achievement = mean(c(PV1READ, PV2READ, PV3READ, PV4READ, PV5READ)))
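# A possibly simpler alternative to the rowwise() version above (not the code
# used for the output below): averaging the columns directly avoids leaving
# the tibble grouped by row.
# pisa <- pisa %>%
#   mutate(math_achievement = (PV1MATH + PV2MATH + PV3MATH + PV4MATH + PV5MATH) / 5,
#          read_achievement = (PV1READ + PV2READ + PV3READ + PV4READ + PV5READ) / 5)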
#### Data exclusion / filtering
pisa <- pisa %>%
  rename(student_gender = ST03Q01, studentID = STIDSTD) %>%
  select(c("studentID", "SCHOOLID", "student_gender", "ANXMAT", "HEDRES", "math_achievement", "read_achievement"))
head(pisa)
## # A tibble: 6 x 7
## # Rowwise:
## studentID SCHOOLID student_gender ANXMAT HEDRES math_achievement
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 1 0.404 -1.45 458.
## 2 2 1 2 -0.746 -0.624 492.
## 3 3 1 1 -0.851 -0.624 512.
## 4 4 1 1 -2.48 0.677 614.
## 5 5 1 1 0.127 0.677 374.
## 6 6 1 1 -0.458 0.677 571.
## # … with 1 more variable: read_achievement <dbl>
nrow(pisa)
## [1] 5456
colnames(pisa)
## [1] "studentID" "SCHOOLID" "student_gender" "ANXMAT"
## [5] "HEDRES" "math_achievement" "read_achievement"
summary(pisa)
## studentID SCHOOLID student_gender ANXMAT
## Min. : 1 Min. : 1.0 Min. :1.000 Min. : -2.4781
## 1st Qu.:1365 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.: -0.4583
## Median :2728 Median :138.0 Median :2.000 Median : -0.1656
## Mean :2728 Mean :136.4 Mean :1.503 Mean : 19.3251
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.: 0.6580
## Max. :5456 Max. :274.0 Max. :8.000 Max. :999.0000
## HEDRES math_achievement read_achievement
## Min. : -4.2987 Min. :212.9 Min. :149.4
## 1st Qu.: -0.6236 1st Qu.:418.2 1st Qu.:432.6
## Median : 0.6766 Median :482.1 Median :500.6
## Mean : 6.6070 Mean :481.5 Mean :494.1
## 3rd Qu.: 0.6766 3rd Qu.:544.8 3rd Qu.:562.5
## Max. :999.0000 Max. :739.2 Max. :733.3
#Paper says math and reading scores are normally distributed - plotting to check
ggplot(pisa, aes(x=math_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(pisa, aes(x=read_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Check for outliers in variables of interest
colnames(pisa)
## [1] "studentID" "SCHOOLID" "student_gender" "ANXMAT"
## [5] "HEDRES" "math_achievement" "read_achievement"
unique(pisa$student_gender)
## [1] 1 2 8
unique(pisa$HEDRES)
## [1] -1.4485 -0.6236 0.6766 999.0000 -2.1925 -3.0064 -4.2987
#Check outliers in math anxiety
ggplot(pisa, aes(x=ANXMAT)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Student gender, math anxiety and HEDRES seem to have outliers / missing codes (8, 999) - recoding them as NA and filtering below
pisa_filter <- pisa %>%
  mutate(ANXMAT = replace(ANXMAT, ANXMAT >= 999, NA)) %>%
  mutate(HEDRES = replace(HEDRES, HEDRES >= 999, NA)) %>%
  filter(student_gender < 3,
         !is.na(student_gender))
mean_anxmat <- mean(pisa_filter$ANXMAT, na.rm = T)
sd_anxmat <- sd(pisa_filter$ANXMAT, na.rm = T)
mean_hedres <- mean(pisa_filter$HEDRES, na.rm = T)
sd_hedres <- sd(pisa_filter$HEDRES, na.rm = T)
pisa_filter <- pisa_filter %>%
  mutate(stdANXMAT = (ANXMAT - mean_anxmat) / sd_anxmat) %>%
  mutate(stdHEDRES = (HEDRES - mean_hedres) / sd_hedres)
##### I tried to use the code below, but it didn't work (possibly because pisa
##### is still a rowwise tibble at this point, so scale() is applied one row at
##### a time; the filter also keeps only the rows where ANXMAT is NA)
# pisa_filter = pisa %>%
#   mutate(ANXMAT = replace(ANXMAT, ANXMAT >= 999, NA)) %>%
#   filter(is.na(ANXMAT) == 1) %>%
#   mutate(sdANXMAT = scale(ANXMAT))
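# For reference, a one-pipe version that should produce the same standardized
# columns as the two-step approach above (a sketch, not the code used for the
# results in this report): ungroup() drops the rowwise grouping, scale() skips
# NAs when computing the mean and sd, and as.numeric() drops the matrix
# attributes that scale() returns.
# pisa_filter <- pisa %>%
#   ungroup() %>%
#   mutate(ANXMAT = replace(ANXMAT, ANXMAT >= 999, NA),
#          HEDRES = replace(HEDRES, HEDRES >= 999, NA)) %>%
#   filter(student_gender < 3, !is.na(student_gender)) %>%
#   mutate(stdANXMAT = as.numeric(scale(ANXMAT)),
#          stdHEDRES = as.numeric(scale(HEDRES)))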
The analyses as specified in the analysis plan.
## Reduce sample size for analysis
Cheema (2014) describes the process as follows:
Each of these sub-samples was then reduced in size by 1%, 2%, 5%, 10%, and 20% in order to simulate datasets containing missing data. The cases were discarded randomly from each complete sample five times separately in order to make sure that there were no dependencies between samples.
# Randomly reorder the rows, then set 5% / 10% of the math achievement values to missing
set.seed(123)
pisa_filter <- pisa_filter %>%
  mutate(rv = rnorm(n())) %>%
  arrange(rv)
cutoff_5per <- 5183   # Cheema (2014) rounds 95% of the 5455 obs to 5182; rows with id < 5183 keep their value
cutoff_10per <- 4911  # Cheema (2014) rounds 90% of the 5455 obs to 4910
# pisa_5persample <- sample_frac(pisa_filter, 0.95)  # not used; replaced by the row-count approach below
pisa_5persample <- pisa_filter %>%
  tibble::rowid_to_column("new_rowid") %>%
  mutate(math_achievement2 = ifelse(new_rowid < cutoff_5per, math_achievement, NA)) %>%
  select(-c(new_rowid, ANXMAT, HEDRES, math_achievement, rv))
pisa_10persample <- pisa_filter %>%
  tibble::rowid_to_column("new_rowid") %>%
  mutate(math_achievement2 = ifelse(new_rowid < cutoff_10per, math_achievement, NA)) %>%
  select(-c(new_rowid, ANXMAT, HEDRES, math_achievement, rv))
#Conduct analysis with full data
summary(pisa_filter)
## studentID SCHOOLID student_gender ANXMAT
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :-2.47810
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:-0.45830
## Median :2728 Median :138.0 Median :2.000 Median :-0.16560
## Mean :2728 Mean :136.4 Mean :1.502 Mean :-0.08532
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.: 0.65800
## Max. :5456 Max. :274.0 Max. :2.000 Max. : 2.69710
## NA's :106
## HEDRES math_achievement read_achievement stdANXMAT
## Min. :-4.2987 Min. :212.9 Min. :149.4 Min. :-2.20444
## 1st Qu.:-0.6236 1st Qu.:418.2 1st Qu.:432.6 1st Qu.:-0.34362
## Median : 0.6766 Median :482.0 Median :500.5 Median :-0.07396
## Mean :-0.1690 Mean :481.4 Mean :494.1 Mean : 0.00000
## 3rd Qu.: 0.6766 3rd Qu.:544.8 3rd Qu.:562.5 3rd Qu.: 0.68481
## Max. : 0.6766 Max. :739.2 Max. :733.3 Max. : 2.56340
## NA's :37 NA's :106
## stdHEDRES rv
## Min. :-3.7017 Min. :-3.189186
## 1st Qu.:-0.4075 1st Qu.:-0.648679
## Median : 0.7580 Median :-0.007732
## Mean : 0.0000 Mean : 0.001169
## 3rd Qu.: 0.7580 3rd Qu.: 0.655904
## Max. : 0.7580 Max. : 3.445992
## NA's :37
lm_fulldata <- lm(math_achievement ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = pisa_filter)
summary(lm_fulldata)
##
## Call:
## lm(formula = math_achievement ~ student_gender + stdHEDRES +
## stdANXMAT + read_achievement, data = pisa_filter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.323 -22.854 -0.006 23.008 141.897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.733911 3.464456 4.253 2.15e-05 ***
## student_gender 30.710157 0.985639 31.158 < 2e-16 ***
## stdHEDRES 1.072422 0.506049 2.119 0.0341 *
## stdANXMAT -12.343564 0.505920 -24.398 < 2e-16 ***
## read_achievement 0.850814 0.005671 150.018 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35 on 5344 degrees of freedom
## (106 observations deleted due to missingness)
## Multiple R-squared: 0.8483, Adjusted R-squared: 0.8482
## F-statistic: 7472 on 4 and 5344 DF, p-value: < 2.2e-16
# Imputation methods for the 5% sample
# 1. Listwise deletion
# Listwise deletion is the same as not making any adjustments to the missing data
lm_ld_5per <- lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = pisa_5persample)
summary(lm_ld_5per)
##
## Call:
## lm(formula = math_achievement2 ~ student_gender + stdHEDRES +
## stdANXMAT + read_achievement, data = pisa_5persample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.27 -22.80 0.00 22.92 142.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.221464 3.552818 4.566 5.09e-06 ***
## student_gender 30.537953 1.012254 30.168 < 2e-16 ***
## stdHEDRES 1.307573 0.517604 2.526 0.0116 *
## stdANXMAT -12.450136 0.518428 -24.015 < 2e-16 ***
## read_achievement 0.848598 0.005811 146.035 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.07 on 5078 degrees of freedom
## (372 observations deleted due to missingness)
## Multiple R-squared: 0.848, Adjusted R-squared: 0.8479
## F-statistic: 7082 on 4 and 5078 DF, p-value: < 2.2e-16
# 2. Mean imputation
# Missing values of math_achievement2 will be replaced by the mean of the column
mean_math <- mean(pisa_5persample$math_achievement2, na.rm = T) #calculate mean score to be used for imputation
mean_math
## [1] 481.4513
pisa_5persample_mi <- pisa_5persample %>%
  rowwise() %>%  # note: rowwise() is not strictly needed here, since ifelse() is vectorized
  mutate(math_mi = ifelse(is.na(math_achievement2), mean_math, math_achievement2))
#Run the regression using the imputed mean
lm_mi_5per <- lm(math_mi ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = pisa_5persample_mi)
summary(lm_mi_5per)
##
## Call:
## lm(formula = math_mi ~ student_gender + stdHEDRES + stdANXMAT +
## read_achievement, data = pisa_5persample_mi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.118 -23.801 0.126 24.006 214.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.057000 3.798468 10.019 <2e-16 ***
## student_gender 28.589976 1.080666 26.456 <2e-16 ***
## stdHEDRES 1.179526 0.554838 2.126 0.0336 *
## stdANXMAT -11.707707 0.554696 -21.107 <2e-16 ***
## read_achievement 0.810076 0.006218 130.275 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.38 on 5344 degrees of freedom
## (106 observations deleted due to missingness)
## Multiple R-squared: 0.8084, Adjusted R-squared: 0.8082
## F-statistic: 5636 on 4 and 5344 DF, p-value: < 2.2e-16
colnames(pisa_5persample)
## [1] "studentID" "SCHOOLID" "student_gender"
## [4] "read_achievement" "stdANXMAT" "stdHEDRES"
## [7] "math_achievement2"
# 3. Regression imputation
imp <- mice(pisa_5persample, method = "norm.nob", m = 1) # impute via stochastic regression ("norm.nob" = linear regression ignoring model error)
##
## iter imp variable
## 1 1 stdANXMAT stdHEDRES math_achievement2
## 2 1 stdANXMAT stdHEDRES math_achievement2
## 3 1 stdANXMAT stdHEDRES math_achievement2
## 4 1 stdANXMAT stdHEDRES math_achievement2
## 5 1 stdANXMAT stdHEDRES math_achievement2
## Warning: Number of logged events: 1
imp2 <- complete(imp) # store the completed dataset
summary(imp2)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
## stdANXMAT stdHEDRES math_achievement2
## Min. :-2.2044358 Min. :-3.70172 Min. :212.9
## 1st Qu.:-0.3436242 1st Qu.:-0.40747 1st Qu.:417.8
## Median :-0.0739640 Median : 0.75799 Median :482.3
## Mean : 0.0000879 Mean :-0.00017 Mean :481.5
## 3rd Qu.: 0.6848064 3rd Qu.: 0.75799 3rd Qu.:545.6
## Max. : 2.5633988 Max. : 1.85948 Max. :739.2
md.pattern(imp2)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## studentID SCHOOLID student_gender read_achievement stdANXMAT stdHEDRES
## 5455 1 1 1 1 1 1
## 0 0 0 0 0 0
## math_achievement2
## 5455 1 0
## 0 0
#Run the regression using the imputed math achievement scores
lm_ri_5per <- lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = imp2)
summary(lm_ri_5per)
##
## Call:
## lm(formula = math_achievement2 ~ student_gender + stdHEDRES +
## stdANXMAT + read_achievement, data = imp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.90 -23.11 0.08 23.19 141.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.799523 3.416037 5.211 1.95e-07 ***
## student_gender 30.689828 0.981487 31.269 < 2e-16 ***
## stdHEDRES 1.194264 0.499097 2.393 0.0168 *
## stdANXMAT -12.589893 0.503480 -25.006 < 2e-16 ***
## read_achievement 0.845319 0.005582 151.430 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.16 on 5450 degrees of freedom
## Multiple R-squared: 0.8492, Adjusted R-squared: 0.8491
## F-statistic: 7675 on 4 and 5450 DF, p-value: < 2.2e-16
# 4. EM imputation (not yet implemented; a possible starting point is sketched below)
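# A possible starting point for the EM-based imputation (sketch only, not run):
# the Amelia package imputes via an EM algorithm with bootstrapping (EMB).
# Whether this matches the EM implementation used by Cheema (2014) still needs
# checking; studentID/SCHOOLID are excluded as ID variables, and student_gender
# could additionally be declared nominal via the noms argument.
# install.packages("Amelia")
# library(Amelia)
# amelia_out <- amelia(as.data.frame(pisa_5persample), m = 1,
#                      idvars = c("studentID", "SCHOOLID"))
# pisa_5persample_em <- amelia_out$imputations[[1]]
# lm_em_5per <- lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT +
#                    read_achievement, data = pisa_5persample_em)
# summary(lm_em_5per)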
# 5. Multiple imputation (applied to the 10% sample below)
md.pattern(pisa_10persample, plot = F)
## studentID SCHOOLID student_gender read_achievement stdHEDRES stdANXMAT
## 4813 1 1 1 1 1 1
## 536 1 1 1 1 1 1
## 64 1 1 1 1 1 0
## 5 1 1 1 1 1 0
## 33 1 1 1 1 0 0
## 4 1 1 1 1 0 0
## 0 0 0 0 37 106
## math_achievement2
## 4813 1 0
## 536 0 1
## 64 1 1
## 5 0 2
## 33 1 2
## 4 0 3
## 545 688
imp <- mice(pisa_10persample, m=5, maxit=10, meth='pmm',seed=500, printFlag = F)
## Warning: Number of logged events: 1
#mice is the package for performing multiple imputations
#m=5: number of multiple imputations i.e. 5 completed datasets will be generated
#maxit=10: number of iterations run
#pmm: predictive mean matching
summary(pisa_10persample)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
##
## stdANXMAT stdHEDRES math_achievement2
## Min. :-2.20444 Min. :-3.7017 Min. :224.2
## 1st Qu.:-0.34362 1st Qu.:-0.4075 1st Qu.:418.0
## Median :-0.07396 Median : 0.7580 Median :482.4
## Mean : 0.00000 Mean : 0.0000 Mean :482.0
## 3rd Qu.: 0.68481 3rd Qu.: 0.7580 3rd Qu.:546.6
## Max. : 2.56340 Max. : 0.7580 Max. :739.2
## NA's :106 NA's :37 NA's :545
summary(imp) # we see that the variables with missing values were imputed using pmm
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## studentID SCHOOLID student_gender read_achievement
## "" "" "" ""
## stdANXMAT stdHEDRES math_achievement2
## "pmm" "pmm" "pmm"
## PredictorMatrix:
## studentID SCHOOLID student_gender read_achievement stdANXMAT
## studentID 0 0 1 1 1
## SCHOOLID 1 0 1 1 1
## student_gender 1 0 0 1 1
## read_achievement 1 0 1 0 1
## stdANXMAT 1 0 1 1 0
## stdHEDRES 1 0 1 1 1
## stdHEDRES math_achievement2
## studentID 1 1
## SCHOOLID 1 1
## student_gender 1 1
## read_achievement 1 1
## stdANXMAT 1 1
## stdHEDRES 0 1
## Number of logged events: 1
## it im dep meth out
## 1 0 0 collinear SCHOOLID
#modelFit1 <- with(imp,lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement))
#modelFit1 #the regression has been run for each of the imputed datasets
#summary(pool(modelFit1)) #this pools the results from the 5 regressions
#write function to output result
#create a list from imputation - makes export easier
implist <- miceadds::mids2datlist(imp)
modelFit2 <- with(implist,lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement))
summary(pool(modelFit2))
## term estimate std.error statistic df p.value
## 1 (Intercept) 17.5490203 3.650936806 4.806717 224.23752 2.810779e-06
## 2 student_gender 30.8305051 1.065787309 28.927446 151.96385 0.000000e+00
## 3 stdHEDRES 0.9102864 0.506541173 1.797063 1631.45369 7.251050e-02
## 4 stdANXMAT -12.6921422 0.579401721 -21.905600 62.01993 0.000000e+00
## 5 read_achievement 0.8451888 0.005796448 145.811491 580.11008 0.000000e+00
# extract pooled results into a texreg object
extract.df <- function(tt, cl = NULL) {
require(sandwich)
require(mitools)
require(texreg)
m2 <- length(tt) #number of imputations
betas <- lapply(tt, coef)
vars <- lapply(tt, vcov)
# conduct statistical inference and save results into a data.frame
model1 <- summary(pool_mi(betas, vars))
R2 <- mean(sapply(1:m2, function(x) summary(tt[[x]])$r.squared))
#technically, should use R to Z-- but close for now when R is low
ns <- nobs(tt[[1]])
#creates what is used by texreg
tr <- createTexreg(
coef.names = row.names(model1),
coef = model1$results,
se = model1$se,
pvalues = model1$p,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
}
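# If the R-to-Z correction mentioned in extract.df() is wanted, one possible
# helper (my own sketch, not from the paper): convert each imputation's multiple
# R to Fisher's z, average on the z scale, then back-transform and square.
# pool_r2_fisher <- function(tt) {
#   r <- sapply(seq_along(tt), function(x) sqrt(summary(tt[[x]])$r.squared))
#   tanh(mean(atanh(r)))^2
# }
# pool_r2_fisher(modelFit2)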
out1 <- extract.df(modelFit2)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 17.5490203 3.650936806 4.806717 2.736037e-06 10.35633268
## student_gender 30.8305051 1.065787309 28.927446 7.774665e-65 28.72539242
## stdHEDRES 0.9102864 0.506541173 1.797063 7.245279e-02 -0.08302295
## stdANXMAT -12.6921422 0.579401721 -21.905600 3.705601e-31 -13.84999149
## read_achievement 0.8451888 0.005796448 145.811491 0.000000e+00 0.83380696
## upper) missInfo
## (Intercept) 24.7417080 13.8 %
## student_gender 32.9356178 17 %
## stdHEDRES 1.9035958 4.2 %
## stdANXMAT -11.5342928 27.5 %
## read_achievement 0.8565706 8.1 %
screenreg(out1, digits = 3)
##
## ==============================
## Model 1
## ------------------------------
## (Intercept) 17.549 ***
## (3.651)
## student_gender 30.831 ***
## (1.066)
## stdHEDRES 0.910
## (0.507)
## stdANXMAT -12.692 ***
## (0.579)
## read_achievement 0.845 ***
## (0.006)
## ------------------------------
## R2 0.850
## Nobs 5455
## ==============================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Side-by-side graph with original graph is ideal here
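One possible way to build that comparison (a sketch, not yet run): collect the coefficient estimates from each fitted model into one data frame, add the corresponding values from Table 2 of Cheema (2014) by hand, and plot the two sources side by side.
# coef_df <- bind_rows(
#   tibble(method = "Listwise deletion", source = "Replication",
#          term = names(coef(lm_ld_5per)), estimate = unname(coef(lm_ld_5per))),
#   tibble(method = "Mean imputation", source = "Replication",
#          term = names(coef(lm_mi_5per)), estimate = unname(coef(lm_mi_5per)))
#   # add rows with source = "Cheema (2014), Table 2" and the published estimates here
# )
# ggplot(coef_df, aes(x = term, y = estimate, fill = source)) +
#   geom_col(position = "dodge") +
#   facet_wrap(~ method) +
#   coord_flip()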
### Exploratory analyses
Any follow-up analyses desired (not required).
Open the discussion section with a paragraph summarizing the primary result from the key analysis and assess whether you successfully reproduced it, partially reproduced it, or failed to reproduce it.
Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis of the dataset, (b) assessment of the meaning of the successful or unsuccessful reproducibility attempt - e.g., for a failure to reproduce the original findings, are the differences between original and present analyses ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the reproducibility attempt (if you contacted them). None of these need to be long.