This problem set was worked on in collaboration with Hansol Lee and Iris Zhong

Question #1

Using code provided by Ben D.

read.table(file="https://raw.githubusercontent.com/ben-domingue/252L/master/data/emp-3pl-math-reading.txt",header=TRUE)->x
dim(x)
x[rowSums(is.na(x))==0,]->x # keep only complete cases (listwise deletion)
dim(x)

grep("r.mc",names(x))->r_index # tweaked this to know which is reading and math 
rowSums(x[,r_index])->rs
as.numeric(rs<mean(rs))-> gr #this can be treated as the confounding variable in the DIF analysis (1 = low score; 0 = high score)
grep("m.mc",names(x))->m_index
rowSums(x[,m_index])->th #this can be treated as ability in DIF analysis
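
As a quick sanity check (my addition, not part of the provided code), we can look at the group split and at how strongly the reading and math sum scores are related, which is what makes the reading group a plausible confound for the math items:

table(gr) # group sizes: 1 = below-mean reading score, 0 = at/above the mean
cor(rs, th) # correlation between reading and math sum scores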

Creating a dataframe with the math items, theta, and grouping variable:

df <- data.frame(x[,m_index], th, gr)

The following code fits a generalized linear model (a logistic regression) for each item using a for loop. First, empty lists are created to store the coefficients, fitted models, and p-values for each item.

betas <- list()
models <- list()
p_values <- list()
for(i in names(x)[m_index]){ # loop over the math items
  mod <- glm(reformulate(c("th", "gr"), response = i), data=df, family="binomial") # builds "item ~ th + gr"
  models[[i]] <- mod
  betas[[i]] <- mod$coefficients
  p_values[[i]] <- coef(summary(mod))[,4]
}
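
As a quick check on the loop, the fitted model for a single item (here m.mc1_1, the first math item) can be inspected directly:

summary(models[["m.mc1_1"]]) # shows the (Intercept), th, and gr estimates and p-values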

Next, I created separate data frames to store the beta_1 (ability) and beta_2 (group) coefficients, along with a data frame of the p-values for each predictor from the glm. The purpose was to find discrepancies between the two beta values and to see which predictors were significant in each glm. The code for this is below:

beta1 <- as.data.frame(betas)[2,] # row 2 = coefficients on th (ability)
beta2 <- as.data.frame(betas)[3,] # row 3 = coefficients on gr (group)
p_values <- as.data.frame(p_values)

Next, I transposed the data frames to then combine them into a single variable:

library(tidyverse)
beta1_long <- beta1 %>%
  pivot_longer(
    cols = starts_with("m.mc"),
    names_to = "item",
    names_prefix = "m.mc",
    values_to = "beta1"
  )

beta2_long <- beta2 %>%
  pivot_longer(
    cols = starts_with("m.mc"),
    names_to = "item",
    names_prefix = "m.mc",
    values_to = "beta2"
  )

pvalues_long <- as.data.frame(t(p_values))

df_final <- cbind(beta1_long, beta2_long, pvalues_long) %>%
  select(-item) %>% # drop the duplicated item columns from cbind
  round(digits = 3)
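
Because cbind() silently assumes the rows of the three pieces line up, a defensive check (my addition) is to confirm the item orderings match before the item columns are dropped:

stopifnot(all(beta1_long$item == beta2_long$item)) # same item order in both beta frames
stopifnot(all(paste0("m.mc", beta1_long$item) == rownames(pvalues_long))) # p-values aligned too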

df_final is a data frame with both beta coefficients and the corresponding p-values for each item. First, I wanted to see how many items had beta_2 values greater than their beta_1 values, using the following code:

head(df_final)
##         beta1  beta2 (Intercept) th    gr
## m.mc1_1 0.131 -0.207       0.000  0 0.134
## m.mc1_2 0.120 -0.175       0.000  0 0.199
## m.mc1_3 0.098  0.287       0.000  0 0.084
## m.mc1_4 0.147 -0.077       0.000  0 0.583
## m.mc1_5 0.084  0.224       0.000  0 0.099
## m.mc1_6 0.090 -0.262       0.047  0 0.068
sum(df_final$beta2 > df_final$beta1) # 21 
## [1] 21

The code above indicates that 21 math items had beta_2 coefficients larger than their beta_1 coefficients. This is fairly alarming, since it suggests that almost half of the items may function differently for examinees with high versus low reading ability. The items with beta_2 greater than beta_1 are reported below:

df_final %>%
  filter(beta2 > beta1)
##          beta1 beta2 (Intercept) th    gr
## m.mc1_3  0.098 0.287           0  0 0.084
## m.mc1_5  0.084 0.224           0  0 0.099
## m.mc1_11 0.093 0.763           0  0 0.000
## m.mc1_12 0.109 0.262           0  0 0.122
## m.mc1_13 0.079 0.542           0  0 0.002
## m.mc2_1  0.125 0.562           0  0 0.002
## m.mc2_2  0.034 0.126           0  0 0.355
## m.mc2_5  0.162 0.211           0  0 0.179
## m.mc2_8  0.153 1.032           0  0 0.000
## m.mc2_9  0.065 0.121           0  0 0.362
## m.mc2_10 0.085 0.177           0  0 0.247
## m.mc2_11 0.137 0.169           0  0 0.227
## m.mc2_12 0.073 0.150           0  0 0.283
## m.mc2_14 0.103 0.552           0  0 0.000
## m.mc3_2  0.117 0.553           0  0 0.000
## m.mc3_3  0.131 0.174           0  0 0.270
## m.mc3_6  0.055 0.114           0  0 0.393
## m.mc3_7  0.147 0.180           0  0 0.415
## m.mc3_8  0.148 0.629           0  0 0.000
## m.mc3_12 0.134 0.424           0  0 0.010
## m.mc3_13 0.149 0.167           0  0 0.360
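
Of these 21 items, only some have a statistically significant group effect (the gr column above is the p-value for the group coefficient, which flags uniform DIF). A follow-up filter, assuming the conventional .05 cutoff, would be:

df_final %>%
  filter(beta2 > beta1, gr < 0.05)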

Question #2

1. What is the effect of dichotomizing low (e.g., responses of 1 and 2 become 1) versus high (e.g., responses of 0 and 1 become 0) on the central tendency and spread of the information curves?
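
To make the two schemes concrete, here is a tiny illustration (hypothetical 0/1/2 responses, my own example):

resp_example <- c(0, 1, 2, 1, 0, 2) # hypothetical polytomous responses
as.numeric(resp_example >= 1) # low: 1 and 2 become 1 -> 0 1 1 1 0 1
as.numeric(resp_example >= 2) # high: 0 and 1 become 0 -> 0 0 1 0 0 1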

To answer this question, I first ran the code that you provided HERE, which produced the following plot for Item 2:

I then generated a few more plots to see what would happen if we selected different items. Here are the plots for the other items:

Item #7:

Item #10:

Item #14:

From these plots, it appears that where the information is concentrated depends on how the items were dichotomized: the low dichotomizations are skewed toward lower-ability thetas, whereas the high dichotomizations are shifted toward higher abilities. What I notice across items, however, is that where the full polytomous (GPCM) item information function falls varies in terms of which dichotomization it favors. For example, item #7's overall information function sits closer to higher abilities, which more closely matches the high dichotomization scheme; item #14's, on the other hand, sits at lower abilities, which aligns more closely with the low dichotomization. What this means for central tendency, however, is difficult to discern from the dichotomizations alone.
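
Since the plotting code itself isn't reproduced here, below is a minimal sketch of how curves like these could be generated with the mirt package. It assumes the polytomous (0/1/2) responses sit in a data frame called resp; this is my reconstruction, not the code that was provided:

library(mirt)

low_d  <- as.data.frame(1 * (resp >= 1)) # low dichotomization: 1,2 -> 1
high_d <- as.data.frame(1 * (resp >= 2)) # high dichotomization: 0,1 -> 0

fit_gpcm <- mirt(resp,   1, itemtype = "gpcm", verbose = FALSE) # full polytomous model
fit_low  <- mirt(low_d,  1, itemtype = "2PL",  verbose = FALSE)
fit_high <- mirt(high_d, 1, itemtype = "2PL",  verbose = FALSE)

theta <- matrix(seq(-4, 4, length.out = 200))
item  <- 2 # e.g., Item 2
plot(theta, iteminfo(extract.item(fit_gpcm, item), theta), type = "l",
     xlab = "theta", ylab = "information", main = paste("Item", item))
lines(theta, iteminfo(extract.item(fit_low,  item), theta), lty = 2)
lines(theta, iteminfo(extract.item(fit_high, item), theta), lty = 3)
legend("topright", c("GPCM", "low", "high"), lty = 1:3)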

2. How do standard errors for theta compare when we estimate the GPCM on the data versus the low and high dichotomizations?

Standard errors and information have a reciprocal relationship, SE(theta) = 1/sqrt(I(theta)), so SEs are minimized where information is maximized. Given the plots above, the SEs depend on where each curve concentrates its information: the low dichotomization should yield its smallest SEs at lower abilities and the high dichotomization at higher abilities, while the GPCM, which retains all of the response categories rather than collapsing them, should yield SEs at least as small as either dichotomization across the theta range.
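
Continuing the sketch above (same assumed fits), the SE curves follow directly from SE(theta) = 1/sqrt(I(theta)) applied to the test information:

se_gpcm <- 1 / sqrt(testinfo(fit_gpcm, theta)) # SE is smallest where information peaks
se_low  <- 1 / sqrt(testinfo(fit_low,  theta))
se_high <- 1 / sqrt(testinfo(fit_high, theta))
plot(theta, se_gpcm, type = "l", xlab = "theta", ylab = "SE(theta)")
lines(theta, se_low,  lty = 2)
lines(theta, se_high, lty = 3)
legend("top", c("GPCM", "low", "high"), lty = 1:3)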