I have completed the following parts of the project:
Imported the data into R
Prepared data for analysis: this included keeping the required variables, dropping outliers/missing values, and standardizing variables, so that the data correspond to the analysis described in the paper
Randomly dropped 5% and 10% of the dependent variable's values for the analysis
Tried 2 of the 5 imputation methods for the 5% sample
I have the following left to do:
The feedback I got for my midterm presentation was to decide what would constitute replication. Replication in this case could be:
The regression results don’t match the results in the paper fully. Even though I have the exact same number of observations, the following variables don’t fully match:
As anticipated, the mean math and reading achievement reported in the paper don't fully match the mean achievement variables I have calculated, though the differences appear small judging by the results
The regression results for one independent variable (home education resources) don't match the results in the paper. I've double-checked that I picked the right variable, dropped the outlier code (999), and normalized it as done in the paper. At this point, I am not sure why this variable behaves differently than it does in the paper
I would appreciate feedback on:
Whether my coding so far follows Tidyverse conventions, and if not, where to improve it
Whether my code could have been simpler, e.g., whether I wrote unnecessarily long code for something that could have been done more simply
Whether my coding is easy to read and follow
Cheema (2014) evaluates different treatment methods for missing data in surveys by comparing their effect on the accuracy of estimation. The paper conducts the analysis on: (1) a simulated dataset (n=10,000 cases); (2) Empirical sample 1: the US portion of PISA 2003 (n=5,456 cases); (3) Empirical sample 2: the Population and Housing portion of the US Census (2000). The paper looks at the effect of the different methods when the sample size is small, medium, or large, and when 5% or 10% of the data are missing.
In this project, I propose to reproduce Cheema's (2014) findings from the US portion of the PISA 2003 dataset (Empirical sample 1) with 5% missing data, using five different imputation methods: Listwise deletion, Mean imputation, Regression imputation, EM imputation, and Multiple imputation. These are the results in Table 2.
This paper is highly cited in the context of handling missing data, but to my knowledge it has not been replicated. This project would also give me an opportunity to work with large-scale assessment datasets like PISA and understand their underlying structure. The paper is relevant to my research interests, as I plan to work on large education datasets in the context of international assessments. By replicating the results, I would learn methods for handling missing data that I can use in my future work.
Do you anticipate running into any challenges when attempting to reproduce these result(s)? If so, please list them here.
1. Student achievement variables used in Cheema (2014)
Cheema (2014) conducts student-level imputations for PISA (2003). The dependent variable is math achievement, and reading achievement is one of the explanatory variables. The paper mentions that both achievement variables vary between 200 and 800.
However, PISA reports 5 plausible values for math and reading for each student, and advises that student-level plausible values should not be averaged. It suggests that any estimation should be conducted for each of the plausible values, and the resulting estimates should be averaged. Each of these plausible values has a minimum in the range of 141-189, lower than the minimum reported by Cheema. Given that the PISA dataset I have downloaded has the same number of observations as reported by Cheema, it is possible that the paper averaged the 5 plausible values. I will be able to determine this once I do the imputation.
To address this, I plan to: (1) first run the imputation using the averaged plausible values to see if Cheema (2014) can be replicated; (2) if the results are not replicated, conduct the estimation using each plausible value separately and then average the estimates (a rough sketch of approach (2) is given below).
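A rough sketch of approach (2), assuming the five math plausible-value columns (PV1MATH-PV5MATH) are kept alongside the predictors in the filtered dataset (the pipeline below currently drops them):
# Sketch only (not run): fit the regression once per plausible value and
# average the coefficient estimates, following the PISA guidance.
# pv_fits <- lapply(1:5, function(i) {
#   lm(reformulate(c("student_gender", "stdHEDRES", "stdANXMAT", "read_achievement"),
#                  response = paste0("PV", i, "MATH")),
#      data = pisa_filter)
# })
# pv_coefs <- rowMeans(sapply(pv_fits, coef))  # averaged estimates across the 5 PVs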
2. Figuring out how to code the methods used in the paper
Cheema (2014) first performs imputation for a simulated dataset, and then for PISA. In the simulated dataset, 5% of the data is randomly dropped 5 times, so that there are no dependencies between samples. However, the PISA section does not describe the values being dropped 5 times. To address this, I plan to drop 5% of the data only once; if this is not sufficient to reproduce the results, I can drop values 5 times, perform the imputation and run the regression on each version, and then average the results (see the sketch below).
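If the repeated-drop version turns out to be needed, one way to implement it (my own sketch, using the variable names defined later in this report; listwise deletion is shown as the example method):
# Sketch only (not run): create five 5%-missing versions of the data,
# run the regression on each, and average the coefficients.
# set.seed(123)
# runs <- lapply(1:5, function(i) {
#   dropped <- pisa_filter %>%
#     mutate(math_missing = ifelse(runif(n()) < 0.05, NA, math_achievement))
#   lm(math_missing ~ student_gender + stdHEDRES + stdANXMAT + read_achievement,
#      data = dropped)   # listwise deletion (lm drops incomplete rows by default)
# })
# avg_coefs <- rowMeans(sapply(runs, coef))  # average estimates over the 5 runs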
The PISA (2003) dataset for the US was downloaded from the NCES website. SPSS macros provided on the NCES website were run to make the data ready for analysis, and the relevant variables were retained (student achievement in math and reading, student gender, home education resources, and math anxiety).
For the analysis, math achievement was predicted from the other variables using a multiple linear regression. 5% of the data was then randomly discarded, each of the five imputation methods (Listwise deletion, Mean imputation, Regression imputation, EM imputation, and Multiple imputation) was applied to the reduced dataset, and the regression analysis was conducted on each imputed version.
Explicitly describe known differences in the analysis pipeline between the original paper and yours (e.g., computing environment). The goal, of course, is to minimize those differences, but differences may occur. Also, note whether such differences are anticipated to influence your ability to reproduce the original results.
## Load Relevant Libraries and Functions
## Default repo
##local({r <- getOption("repos")
## r["CRAN"] <- "https://cloud.r-project.org"
## options(repos=r)
##})
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(readr)
#install.packages("mice") #package for regression imputation
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
#install.packages("miceadds")
library(miceadds)
## * miceadds 3.10-28 (2020-07-29 21:56:24)
library(sandwich)
library(mitools)
library(texreg)
## Version: 1.37.5
## Date: 2020-06-17
## Author: Philip Leifeld (University of Essex)
##
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
#?sandwich
#install.packages("sandwich")
#install.packages("mitools")
#install.packages("texreg")
#### Import data
data_path <- '/Users/radhika/Documents/Stanford readings/251 Experimental methods/Project/cheema2014/PISA data/Extracted data/PISA 2003 Student Background.csv'
pisa <- read.csv(data_path)
# Variables selected are: IDs, gender, math anxiety, home education resources, and plausible values for math and reading
#### Prepare data for analysis - create columns etc.
#### Calculate mean achievement across the 5 plausible values
pisa <- pisa %>%
  rowwise() %>%
  mutate(math_achievement = mean(c(PV1MATH, PV2MATH, PV3MATH, PV4MATH, PV5MATH))) %>%
  mutate(read_achievement = mean(c(PV1READ, PV2READ, PV3READ, PV4READ, PV5READ)))
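# A possibly simpler alternative to the rowwise() version above (not the code
# used for the output below): averaging the columns directly avoids leaving
# the tibble grouped by row.
# pisa <- pisa %>%
#   mutate(math_achievement = (PV1MATH + PV2MATH + PV3MATH + PV4MATH + PV5MATH) / 5,
#          read_achievement = (PV1READ + PV2READ + PV3READ + PV4READ + PV5READ) / 5)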
#### Data exclusion / filtering
pisa <- pisa %>%
  rename(student_gender = ST03Q01, studentID = STIDSTD) %>%
  select(c("studentID", "SCHOOLID", "student_gender", "ANXMAT", "HEDRES", "math_achievement", "read_achievement"))
head(pisa)
## # A tibble: 6 x 7
## # Rowwise:
## studentID SCHOOLID student_gender ANXMAT HEDRES math_achievement
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 1 0.404 -1.45 458.
## 2 2 1 2 -0.746 -0.624 492.
## 3 3 1 1 -0.851 -0.624 512.
## 4 4 1 1 -2.48 0.677 614.
## 5 5 1 1 0.127 0.677 374.
## 6 6 1 1 -0.458 0.677 571.
## # … with 1 more variable: read_achievement <dbl>
nrow(pisa)
## [1] 5456
colnames(pisa)
## [1] "studentID" "SCHOOLID" "student_gender" "ANXMAT"
## [5] "HEDRES" "math_achievement" "read_achievement"
summary(pisa)
## studentID SCHOOLID student_gender ANXMAT
## Min. : 1 Min. : 1.0 Min. :1.000 Min. : -2.4781
## 1st Qu.:1365 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.: -0.4583
## Median :2728 Median :138.0 Median :2.000 Median : -0.1656
## Mean :2728 Mean :136.4 Mean :1.503 Mean : 19.3251
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.: 0.6580
## Max. :5456 Max. :274.0 Max. :8.000 Max. :999.0000
## HEDRES math_achievement read_achievement
## Min. : -4.2987 Min. :212.9 Min. :149.4
## 1st Qu.: -0.6236 1st Qu.:418.2 1st Qu.:432.6
## Median : 0.6766 Median :482.1 Median :500.6
## Mean : 6.6070 Mean :481.5 Mean :494.1
## 3rd Qu.: 0.6766 3rd Qu.:544.8 3rd Qu.:562.5
## Max. :999.0000 Max. :739.2 Max. :733.3
#Paper says math and reading scores are normally distributed - plotting to check
ggplot(pisa, aes(x=math_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(pisa, aes(x=read_achievement)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Check for outliers in variables of interest
colnames(pisa)
## [1] "studentID" "SCHOOLID" "student_gender" "ANXMAT"
## [5] "HEDRES" "math_achievement" "read_achievement"
unique(pisa$student_gender)
## [1] 1 2 8
unique(pisa$HEDRES)
## [1] -1.4485 -0.6236 0.6766 999.0000 -2.1925 -3.0064 -4.2987
#Check outliers in math anxiety
ggplot(pisa, aes(x=ANXMAT)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Student gender, math anxiety and HEDRES seem to have outliers / missing codes (8, 999) - recoding them as NA and filtering below
pisa_filter <- pisa %>%
  mutate(ANXMAT = replace(ANXMAT, ANXMAT >= 999, NA)) %>%
  mutate(HEDRES = replace(HEDRES, HEDRES >= 999, NA)) %>%
  filter(student_gender < 3,
         !is.na(student_gender))
mean_anxmat <- mean(pisa_filter$ANXMAT, na.rm = T)
sd_anxmat <- sd(pisa_filter$ANXMAT, na.rm = T)
mean_hedres <- mean(pisa_filter$HEDRES, na.rm = T)
sd_hedres <- sd(pisa_filter$HEDRES, na.rm = T)
pisa_filter <- pisa_filter %>%
  mutate(stdANXMAT = (ANXMAT - mean_anxmat) / sd_anxmat) %>%
  mutate(stdHEDRES = (HEDRES - mean_hedres) / sd_hedres)
##### I tried to use the code below, but it didn't work (possibly because pisa
##### is still a rowwise tibble at this point, so scale() is applied one row at
##### a time; the filter also keeps only the rows where ANXMAT is NA)
# pisa_filter = pisa %>%
#   mutate(ANXMAT = replace(ANXMAT, ANXMAT >= 999, NA)) %>%
#   filter(is.na(ANXMAT) == 1) %>%
#   mutate(sdANXMAT = scale(ANXMAT))
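# For reference, a one-pipe version that should produce the same standardized
# columns as the two-step approach above (a sketch, not the code used for the
# results in this report): ungroup() drops the rowwise grouping, scale() skips
# NAs when computing the mean and sd, and as.numeric() drops the matrix
# attributes that scale() returns.
# pisa_filter <- pisa %>%
#   ungroup() %>%
#   mutate(ANXMAT = replace(ANXMAT, ANXMAT >= 999, NA),
#          HEDRES = replace(HEDRES, HEDRES >= 999, NA)) %>%
#   filter(student_gender < 3, !is.na(student_gender)) %>%
#   mutate(stdANXMAT = as.numeric(scale(ANXMAT)),
#          stdHEDRES = as.numeric(scale(HEDRES)))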
The analyses as specified in the analysis plan.
## Reduce sample size for analysis
Cheema (2014) describes the process as follows:
Each of these sub-samples was then reduced in size by 1%, 2%, 5%, 10%, and 20% in order to simulate datasets containing missing data. The cases were discarded randomly from each complete sample five times separately in order to make sure that there were no dependencies between samples.
# Randomly reorder the rows, then set 5% / 10% of the math achievement values to missing
set.seed(123)
pisa_filter <- pisa_filter %>%
  mutate(rv = rnorm(n())) %>%
  arrange(rv)
cutoff_5per <- 5183   # Cheema (2014) rounds 95% of the 5455 obs to 5182; rows with id < 5183 keep their value
cutoff_10per <- 4911  # Cheema (2014) rounds 90% of the 5455 obs to 4910
# pisa_5persample <- sample_frac(pisa_filter, 0.95)  # not used; replaced by the row-count approach below
pisa_5persample <- pisa_filter %>%
  tibble::rowid_to_column("new_rowid") %>%
  mutate(math_achievement2 = ifelse(new_rowid < cutoff_5per, math_achievement, NA)) %>%
  select(-c(new_rowid, ANXMAT, HEDRES, math_achievement, rv))
pisa_10persample <- pisa_filter %>%
  tibble::rowid_to_column("new_rowid") %>%
  mutate(math_achievement2 = ifelse(new_rowid < cutoff_10per, math_achievement, NA)) %>%
  select(-c(new_rowid, ANXMAT, HEDRES, math_achievement, rv))
#Conduct analysis with full data
summary(pisa_filter)
## studentID SCHOOLID student_gender ANXMAT
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :-2.47810
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:-0.45830
## Median :2728 Median :138.0 Median :2.000 Median :-0.16560
## Mean :2728 Mean :136.4 Mean :1.502 Mean :-0.08532
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.: 0.65800
## Max. :5456 Max. :274.0 Max. :2.000 Max. : 2.69710
## NA's :106
## HEDRES math_achievement read_achievement stdANXMAT
## Min. :-4.2987 Min. :212.9 Min. :149.4 Min. :-2.20444
## 1st Qu.:-0.6236 1st Qu.:418.2 1st Qu.:432.6 1st Qu.:-0.34362
## Median : 0.6766 Median :482.0 Median :500.5 Median :-0.07396
## Mean :-0.1690 Mean :481.4 Mean :494.1 Mean : 0.00000
## 3rd Qu.: 0.6766 3rd Qu.:544.8 3rd Qu.:562.5 3rd Qu.: 0.68481
## Max. : 0.6766 Max. :739.2 Max. :733.3 Max. : 2.56340
## NA's :37 NA's :106
## stdHEDRES rv
## Min. :-3.7017 Min. :-3.189186
## 1st Qu.:-0.4075 1st Qu.:-0.648679
## Median : 0.7580 Median :-0.007732
## Mean : 0.0000 Mean : 0.001169
## 3rd Qu.: 0.7580 3rd Qu.: 0.655904
## Max. : 0.7580 Max. : 3.445992
## NA's :37
lm_fulldata <- lm(math_achievement ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = pisa_filter)
summary(lm_fulldata)
##
## Call:
## lm(formula = math_achievement ~ student_gender + stdHEDRES +
## stdANXMAT + read_achievement, data = pisa_filter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.323 -22.854 -0.006 23.008 141.897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.733911 3.464456 4.253 2.15e-05 ***
## student_gender 30.710157 0.985639 31.158 < 2e-16 ***
## stdHEDRES 1.072422 0.506049 2.119 0.0341 *
## stdANXMAT -12.343564 0.505920 -24.398 < 2e-16 ***
## read_achievement 0.850814 0.005671 150.018 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35 on 5344 degrees of freedom
## (106 observations deleted due to missingness)
## Multiple R-squared: 0.8483, Adjusted R-squared: 0.8482
## F-statistic: 7472 on 4 and 5344 DF, p-value: < 2.2e-16
# Imputation methods for the 5% sample
# 1. Listwise deletion
# Listwise deletion is the same as not making any adjustments to the missing data
lm_ld_5per <- lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = pisa_5persample)
summary(lm_ld_5per)
##
## Call:
## lm(formula = math_achievement2 ~ student_gender + stdHEDRES +
## stdANXMAT + read_achievement, data = pisa_5persample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.27 -22.80 0.00 22.92 142.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.221464 3.552818 4.566 5.09e-06 ***
## student_gender 30.537953 1.012254 30.168 < 2e-16 ***
## stdHEDRES 1.307573 0.517604 2.526 0.0116 *
## stdANXMAT -12.450136 0.518428 -24.015 < 2e-16 ***
## read_achievement 0.848598 0.005811 146.035 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.07 on 5078 degrees of freedom
## (372 observations deleted due to missingness)
## Multiple R-squared: 0.848, Adjusted R-squared: 0.8479
## F-statistic: 7082 on 4 and 5078 DF, p-value: < 2.2e-16
# 2. Mean imputation
# Missing values of math_achievement2 will be replaced by the mean of the column
mean_math <- mean(pisa_5persample$math_achievement2, na.rm = T) #calculate mean score to be used for imputation
mean_math
## [1] 481.4513
pisa_5persample_mi <- pisa_5persample %>%
  rowwise() %>%  # note: rowwise() is not strictly needed here, since ifelse() is vectorized
  mutate(math_mi = ifelse(is.na(math_achievement2), mean_math, math_achievement2))
#Run the regression using the imputed mean
lm_mi_5per <- lm(math_mi ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = pisa_5persample_mi)
summary(lm_mi_5per)
##
## Call:
## lm(formula = math_mi ~ student_gender + stdHEDRES + stdANXMAT +
## read_achievement, data = pisa_5persample_mi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.118 -23.801 0.126 24.006 214.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.057000 3.798468 10.019 <2e-16 ***
## student_gender 28.589976 1.080666 26.456 <2e-16 ***
## stdHEDRES 1.179526 0.554838 2.126 0.0336 *
## stdANXMAT -11.707707 0.554696 -21.107 <2e-16 ***
## read_achievement 0.810076 0.006218 130.275 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.38 on 5344 degrees of freedom
## (106 observations deleted due to missingness)
## Multiple R-squared: 0.8084, Adjusted R-squared: 0.8082
## F-statistic: 5636 on 4 and 5344 DF, p-value: < 2.2e-16
colnames(pisa_5persample)
## [1] "studentID" "SCHOOLID" "student_gender"
## [4] "read_achievement" "stdANXMAT" "stdHEDRES"
## [7] "math_achievement2"
# 3. Regression imputation
imp <- mice(pisa_5persample, method = "norm.nob", m = 1) # impute via stochastic regression ("norm.nob" = linear regression ignoring model error)
##
## iter imp variable
## 1 1 stdANXMAT stdHEDRES math_achievement2
## 2 1 stdANXMAT stdHEDRES math_achievement2
## 3 1 stdANXMAT stdHEDRES math_achievement2
## 4 1 stdANXMAT stdHEDRES math_achievement2
## 5 1 stdANXMAT stdHEDRES math_achievement2
## Warning: Number of logged events: 1
imp2 <- complete(imp) # store the completed dataset
summary(imp2)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
## stdANXMAT stdHEDRES math_achievement2
## Min. :-2.2044358 Min. :-3.70172 Min. :212.9
## 1st Qu.:-0.3436242 1st Qu.:-0.40747 1st Qu.:417.8
## Median :-0.0739640 Median : 0.75799 Median :482.3
## Mean : 0.0000879 Mean :-0.00017 Mean :481.5
## 3rd Qu.: 0.6848064 3rd Qu.: 0.75799 3rd Qu.:545.6
## Max. : 2.5633988 Max. : 1.85948 Max. :739.2
md.pattern(imp2)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## studentID SCHOOLID student_gender read_achievement stdANXMAT stdHEDRES
## 5455 1 1 1 1 1 1
## 0 0 0 0 0 0
## math_achievement2
## 5455 1 0
## 0 0
#Run the regression using the imputed math achievement scores
lm_ri_5per <- lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement, data = imp2)
summary(lm_ri_5per)
##
## Call:
## lm(formula = math_achievement2 ~ student_gender + stdHEDRES +
## stdANXMAT + read_achievement, data = imp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.90 -23.11 0.08 23.19 141.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.799523 3.416037 5.211 1.95e-07 ***
## student_gender 30.689828 0.981487 31.269 < 2e-16 ***
## stdHEDRES 1.194264 0.499097 2.393 0.0168 *
## stdANXMAT -12.589893 0.503480 -25.006 < 2e-16 ***
## read_achievement 0.845319 0.005582 151.430 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.16 on 5450 degrees of freedom
## Multiple R-squared: 0.8492, Adjusted R-squared: 0.8491
## F-statistic: 7675 on 4 and 5450 DF, p-value: < 2.2e-16
# 4. EM imputation (not yet implemented; a possible starting point is sketched below)
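# A possible starting point for the EM-based imputation (sketch only, not run):
# the Amelia package imputes via an EM algorithm with bootstrapping (EMB).
# Whether this matches the EM implementation used by Cheema (2014) still needs
# checking; studentID/SCHOOLID are excluded as ID variables, and student_gender
# could additionally be declared nominal via the noms argument.
# install.packages("Amelia")
# library(Amelia)
# amelia_out <- amelia(as.data.frame(pisa_5persample), m = 1,
#                      idvars = c("studentID", "SCHOOLID"))
# pisa_5persample_em <- amelia_out$imputations[[1]]
# lm_em_5per <- lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT +
#                    read_achievement, data = pisa_5persample_em)
# summary(lm_em_5per)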
# 5. Multiple imputation (applied to the 10% sample below)
md.pattern(pisa_10persample, plot = F)
## studentID SCHOOLID student_gender read_achievement stdHEDRES stdANXMAT
## 4813 1 1 1 1 1 1
## 536 1 1 1 1 1 1
## 64 1 1 1 1 1 0
## 5 1 1 1 1 1 0
## 33 1 1 1 1 0 0
## 4 1 1 1 1 0 0
## 0 0 0 0 37 106
## math_achievement2
## 4813 1 0
## 536 0 1
## 64 1 1
## 5 0 2
## 33 1 2
## 4 0 3
## 545 688
imp <- mice(pisa_10persample, m=5, maxit=10, meth='pmm',seed=500, printFlag = F)
## Warning: Number of logged events: 1
#mice is the package for performing multiple imputations
#m=5: number of multiple imputations i.e. 5 completed datasets will be generated
#maxit=10: number of iterations run
#pmm: predictive mean matching
summary(pisa_10persample)
## studentID SCHOOLID student_gender read_achievement
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :149.4
## 1st Qu.:1364 1st Qu.: 67.0 1st Qu.:1.000 1st Qu.:432.6
## Median :2728 Median :138.0 Median :2.000 Median :500.5
## Mean :2728 Mean :136.4 Mean :1.502 Mean :494.1
## 3rd Qu.:4092 3rd Qu.:205.0 3rd Qu.:2.000 3rd Qu.:562.5
## Max. :5456 Max. :274.0 Max. :2.000 Max. :733.3
##
## stdANXMAT stdHEDRES math_achievement2
## Min. :-2.20444 Min. :-3.7017 Min. :224.2
## 1st Qu.:-0.34362 1st Qu.:-0.4075 1st Qu.:418.0
## Median :-0.07396 Median : 0.7580 Median :482.4
## Mean : 0.00000 Mean : 0.0000 Mean :482.0
## 3rd Qu.: 0.68481 3rd Qu.: 0.7580 3rd Qu.:546.6
## Max. : 2.56340 Max. : 0.7580 Max. :739.2
## NA's :106 NA's :37 NA's :545
summary(imp) # we see that the variables with missing values were imputed using pmm
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## studentID SCHOOLID student_gender read_achievement
## "" "" "" ""
## stdANXMAT stdHEDRES math_achievement2
## "pmm" "pmm" "pmm"
## PredictorMatrix:
## studentID SCHOOLID student_gender read_achievement stdANXMAT
## studentID 0 0 1 1 1
## SCHOOLID 1 0 1 1 1
## student_gender 1 0 0 1 1
## read_achievement 1 0 1 0 1
## stdANXMAT 1 0 1 1 0
## stdHEDRES 1 0 1 1 1
## stdHEDRES math_achievement2
## studentID 1 1
## SCHOOLID 1 1
## student_gender 1 1
## read_achievement 1 1
## stdANXMAT 1 1
## stdHEDRES 0 1
## Number of logged events: 1
## it im dep meth out
## 1 0 0 collinear SCHOOLID
#modelFit1 <- with(imp,lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement))
#modelFit1 #the regression has been run for each of the imputed datasets
#summary(pool(modelFit1)) #this pools the results from the 5 regressions
#write function to output result
#create a list from imputation - makes export easier
implist <- miceadds::mids2datlist(imp)
modelFit2 <- with(implist,lm(math_achievement2 ~ student_gender + stdHEDRES + stdANXMAT + read_achievement))
summary(pool(modelFit2))
## term estimate std.error statistic df p.value
## 1 (Intercept) 17.5490203 3.650936806 4.806717 224.23752 2.810779e-06
## 2 student_gender 30.8305051 1.065787309 28.927446 151.96385 0.000000e+00
## 3 stdHEDRES 0.9102864 0.506541173 1.797063 1631.45369 7.251050e-02
## 4 stdANXMAT -12.6921422 0.579401721 -21.905600 62.01993 0.000000e+00
## 5 read_achievement 0.8451888 0.005796448 145.811491 580.11008 0.000000e+00
# extract pooled results into a texreg object
extract.df <- function(tt, cl = NULL) {
require(sandwich)
require(mitools)
require(texreg)
m2 <- length(tt) #number of imputations
betas <- lapply(tt, coef)
vars <- lapply(tt, vcov)
# conduct statistical inference and save results into a data.frame
model1 <- summary(pool_mi(betas, vars))
R2 <- mean(sapply(1:m2, function(x) summary(tt[[x]])$r.squared))
#technically, should use R to Z-- but close for now when R is low
ns <- nobs(tt[[1]])
#creates what is used by texreg
tr <- createTexreg(
coef.names = row.names(model1),
coef = model1$results,
se = model1$se,
pvalues = model1$p,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
}
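# If the R-to-Z correction mentioned in extract.df() is wanted, one possible
# helper (my own sketch, not from the paper): convert each imputation's multiple
# R to Fisher's z, average on the z scale, then back-transform and square.
# pool_r2_fisher <- function(tt) {
#   r <- sapply(seq_along(tt), function(x) sqrt(summary(tt[[x]])$r.squared))
#   tanh(mean(atanh(r)))^2
# }
# pool_r2_fisher(modelFit2)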
out1 <- extract.df(modelFit2)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 17.5490203 3.650936806 4.806717 2.736037e-06 10.35633268
## student_gender 30.8305051 1.065787309 28.927446 7.774665e-65 28.72539242
## stdHEDRES 0.9102864 0.506541173 1.797063 7.245279e-02 -0.08302295
## stdANXMAT -12.6921422 0.579401721 -21.905600 3.705601e-31 -13.84999149
## read_achievement 0.8451888 0.005796448 145.811491 0.000000e+00 0.83380696
## upper) missInfo
## (Intercept) 24.7417080 13.8 %
## student_gender 32.9356178 17 %
## stdHEDRES 1.9035958 4.2 %
## stdANXMAT -11.5342928 27.5 %
## read_achievement 0.8565706 8.1 %
screenreg(out1, digits = 3)
##
## ==============================
## Model 1
## ------------------------------
## (Intercept) 17.549 ***
## (3.651)
## student_gender 30.831 ***
## (1.066)
## stdHEDRES 0.910
## (0.507)
## stdANXMAT -12.692 ***
## (0.579)
## read_achievement 0.845 ***
## (0.006)
## ------------------------------
## R2 0.850
## Nobs 5455
## ==============================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Side-by-side graph with original graph is ideal here
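One possible way to build that comparison (a sketch, not yet run): collect the coefficient estimates from each fitted model into one data frame, add the corresponding values from Table 2 of Cheema (2014) by hand, and plot the two sources side by side.
# coef_df <- bind_rows(
#   tibble(method = "Listwise deletion", source = "Replication",
#          term = names(coef(lm_ld_5per)), estimate = unname(coef(lm_ld_5per))),
#   tibble(method = "Mean imputation", source = "Replication",
#          term = names(coef(lm_mi_5per)), estimate = unname(coef(lm_mi_5per)))
#   # add rows with source = "Cheema (2014), Table 2" and the published estimates here
# )
# ggplot(coef_df, aes(x = term, y = estimate, fill = source)) +
#   geom_col(position = "dodge") +
#   facet_wrap(~ method) +
#   coord_flip()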
### Exploratory analyses
Any follow-up analyses desired (not required).
Open the discussion section with a paragraph summarizing the primary result from the key analysis and assess whether you successfully reproduced it, partially reproduced it, or failed to reproduce it.
Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis of the dataset, (b) assessment of the meaning of the successful or unsuccessful reproducibility attempt - e.g., for a failure to reproduce the original findings, are the differences between original and present analyses ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the reproducibility attempt (if you contacted them). None of these need to be long.