Question #1: Exploration of collections of Bernoulli random variables

#################################################
##Exploration of collections of bernoulli variables
#################################################

set.seed(12311)
x1<-matrix(rbinom(1000,1,.5),100,10)

Let’s pretend that x1 is item response data from a test. So 1s and 0s are correct/incorrect responses (rows are people and columns are items). For fun we can look at the correlations across items and the variation in row sums (ie, total scores)

cor(x1)
##               [,1]         [,2]         [,3]          [,4]         [,5]
##  [1,]  1.000000000  0.008410789 -0.009896948 -0.3032770731  0.094629567
##  [2,]  0.008410789  1.000000000  0.089886562 -0.0669602615 -0.057570773
##  [3,] -0.009896948  0.089886562  1.000000000  0.0295469723 -0.125809070
##  [4,] -0.303277073 -0.066960261  0.029546972  1.0000000000  0.089214650
##  [5,]  0.094629567 -0.057570773 -0.125809070  0.0892146505  1.000000000
##  [6,]  0.237526763  0.130744090 -0.001602564 -0.0525279507  0.035484609
##  [7,]  0.081814407 -0.074443750  0.047043222 -0.0988646639  0.060409150
##  [8,]  0.098085811 -0.016333199  0.059259270  0.0455242322 -0.103165975
##  [9,] -0.149960697 -0.107907043  0.053719716 -0.0004168548 -0.001638407
## [10,]  0.025551766  0.179665184  0.060860872  0.0365014114 -0.058030861
##               [,6]         [,7]        [,8]          [,9]       [,10]
##  [1,]  0.237526763  0.081814407  0.09808581 -0.1499606967  0.02555177
##  [2,]  0.130744090 -0.074443750 -0.01633320 -0.1079070433  0.17966518
##  [3,] -0.001602564  0.047043222  0.05925927  0.0537197158  0.06086087
##  [4,] -0.052527951 -0.098864664  0.04552423 -0.0004168548  0.03650141
##  [5,]  0.035484609  0.060409150 -0.10316597 -0.0016384067 -0.05803086
##  [6,]  1.000000000  0.087597723  0.09929932 -0.1904608106 -0.05925927
##  [7,]  0.087597723  1.000000000  0.02350749  0.0090628774  0.05755282
##  [8,]  0.099299317  0.023507488  1.00000000  0.0370118105  0.08043217
##  [9,] -0.190460811  0.009062877  0.03701181  1.0000000000  0.08500515
## [10,] -0.059259270  0.057552816  0.08043217  0.0850051472  1.00000000
var(rowSums(x1))
## [1] 2.706667
mean(rowSums(x1))
## [1] 5.02

Q. If you considered the 1s/0s correct and incorrect responses to test items (where the rows are people and the columns are items), does this seem like it could have come from a realistic scenario? How might we know?

From the variance output, the answer seems pretty realistic; there does not seem to be anything alarming about a sample variance of around 2.7 out of 10 items. I would probably be skeptical if the variance was this large but the average of the matrix was higher (e.g., Mean > 7). And while I am not sure how to intepret the correlation output, it does not appear that there is high correlation amongst the items (but I am not quite sure if this is what we are looking for; what would it mean for item 7 to have a high correlation with item 5?)

Feel free to ignore this chunk of code (skip ahead to below question). I’m going to generate a new set of data.

#### GENERATION NEW DATA
set.seed(12311)
th<-matrix(rnorm(100),100,10,byrow=FALSE)
diff<-matrix<-matrix(rnorm(10),100,10,byrow=TRUE)
kern<- exp(th - diff)
pr<-kern/(1+kern)
test<-matrix(runif(1000),100,10)
x2<-ifelse(pr>test,1,0)

Q. Now, let’s ask the same question of the new matrix x2. Does it seem like realistic item response data? Specifically, how does it compare to the first matrix x1 in terms of whether it seems like a realistic set of item responses? What characteristics influence your opinion on this point?

This one, not so much, for similar reasons as the first one: the variance is quite high

cor(x2)
##             [,1]       [,2]       [,3]        [,4]       [,5]        [,6]
##  [1,] 1.00000000 0.05225302 0.31098566  0.08849268 0.17610919  0.11892066
##  [2,] 0.05225302 1.00000000 0.26505302  0.03714521 0.12156613  0.19791667
##  [3,] 0.31098566 0.26505302 1.00000000  0.17914327 0.17914327  0.06001200
##  [4,] 0.08849268 0.03714521 0.17914327  1.00000000 0.05582923  0.13169665
##  [5,] 0.17610919 0.12156613 0.17914327  0.05582923 1.00000000  0.08104409
##  [6,] 0.11892066 0.19791667 0.06001200  0.13169665 0.08104409  1.00000000
##  [7,] 0.17435686 0.25832804 0.22615825  0.10837438 0.06732348  0.17221869
##  [8,] 0.07143616 0.13262503 0.05803086 -0.01959216 0.06204183 -0.04029115
##  [9,] 0.11133735 0.42075805 0.11903823  0.21353066 0.05114993  0.08014439
## [10,] 0.12808759 0.08071308 0.14553789  0.19806502 0.10679082  0.13514748
##             [,7]        [,8]        [,9]       [,10]
##  [1,] 0.17435686  0.07143616  0.11133735  0.12808759
##  [2,] 0.25832804  0.13262503  0.42075805  0.08071308
##  [3,] 0.22615825  0.05803086  0.11903823  0.14553789
##  [4,] 0.10837438 -0.01959216  0.21353066  0.19806502
##  [5,] 0.06732348  0.06204183  0.05114993  0.10679082
##  [6,] 0.17221869 -0.04029115  0.08014439  0.13514748
##  [7,] 1.00000000  0.14204314  0.15182598  0.21266889
##  [8,] 0.14204314  1.00000000  0.33582739  0.13068593
##  [9,] 0.15182598  0.33582739  1.00000000 -0.01399044
## [10,] 0.21266889  0.13068593 -0.01399044  1.00000000
var(rowSums(x2))
## [1] 5.111111

Q. How would you characterize the key difference between x1 and x2 in terms of what we can observe if we blind ourselves to the data generating process?

Question 2: Regression

load("ps1-logreg.Rdata")
m1<-glm(y1~x,df,family="binomial")
m2<-glm(y2~x,df,family="binomial")

Questions:

  1. How would you compare the association between y1 or y2 & x?
summary(m1)
## 
## Call:
## glm(formula = y1 ~ x, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0732  -1.0230   0.3943   1.0042   2.3292  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.06109    0.06940    0.88    0.379    
## x            0.99636    0.08424   11.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.1  on 999  degrees of freedom
## Residual deviance: 1204.9  on 998  degrees of freedom
## AIC: 1208.9
## 
## Number of Fisher Scoring iterations: 3
summary(m2)
## 
## Call:
## glm(formula = y2 ~ x, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8981  -0.6553   0.4267   0.6887   2.2202  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.44146    0.09658   14.93   <2e-16 ***
## x            1.43951    0.10921   13.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1152.34  on 999  degrees of freedom
## Residual deviance:  896.89  on 998  degrees of freedom
## AIC: 900.89
## 
## Number of Fisher Scoring iterations: 5

First, I would want to take a look at the regression coefficients produced from a summary table of both models and compare those values. The second would be to plot figures of the regression lines and compare visiually how y1 and y2 differ as x changes.

  1. How would you interpret the regression coefficients from (say) m1?
exp(.99636) # regression coefficient of x from m1
## [1] 2.708405
exp(1.43951) # regression coefficient of x from m2
## [1] 4.218628

This tells us how the odds change as p changes. To interpret x, we need to understand how an increase of one unit of x affects the odds. This can be calculated by calculating the following for each model: exp(β1). When we calculate this for each model, we see that the odds are almost twice as high for m2 than it is for m1. Both are positive.

Thanks to THIS website for providing some help on how to answer this question.

  1. Do m1 and m2 show equivalent model fit? Can you notice anything peculiar about either y1 or y2 (in terms of their association with x)? [Note: This one is sneaky. I’d encourage you to avoid fit statistics and look at techniques for model diagnostics (e.g., residuals).]

Question 3: Likelihood Problem

Suppose we just observed x, a bunch of random numbers.

x<-rnorm(100)

We first want to see what the distribution looks like. We can do this:

hist(x)

Looks vaguely normalish, no? [Of course, you can see that I’m drawing variates from the normal distribution, so this isn’t surprising. Pretend you didn’t see that part!] So what if we wanted to estimate the mean and var of the normal distribution that may have generated this set of draws.How do we do this? The key statistical technique is to consider the likelihood.

Let’s start by writing a function that computes the likelihood for “x” in a normal with unknown mean and var (collectively, “pars”).

likelihood<-function(pars,x) { #see the first eqn here, http://mathworld.wolfram.com/NormalDistribution.html
    tmp<-exp(-(x-pars[1])^2/(2*pars[2]))
    tmp/sqrt(2*pars[2]*pi)
}

To completely evaluate this function, we would need to know x and pars. We only know x (this is the problem of estimation in general: the values in pars are not known to us!). With known x, we can think of the likelihood as the “probability” of observing the draws x from a normal distribution with parameters pars.That is, we are thinking of the likelihood as a function of pars (x is known).

Let’s think about what we get if the mean is unknown and the SD=1

out<-list()
for (m in seq(-1,1,length.out=100)) {
    like<-rep(NA,length(x))
    for (i in 1:length(x)) {
        like[i]<-likelihood(c(m,1),x[i])
    }
    c(c(m,prod(like)))->out[[as.character(m) ]]
}
plot(do.call("rbind",out),type="b") # this is a likelihood surface where we're seeing the likelihood as a function of the unknown mean

Q. what do you notice?

The liklihood is positive?

From a computational perspective, working with the products of small numbers is very unstable. So we instead work with the sum of the logs.

Why is this ok? First of all, log(xy)=log(x) + log(y) Second, log(f(x)) is a monotic transformation of f(x). So if we maximize log(f(x)) funtion, then we’ve also maximized f(x)

Below is a function that will do this.

ll<-function(pars,x) {
    likelihood<-function(pars,x) {
        tmp<-exp(-(x-pars[1])^2/(2*pars[2]))
        tmp/sqrt(2*pars[2]*pi)
    }
    like<-rep(NA,length(x))
    for (i in 1:length(x)) {
        like[i]<-likelihood(pars,x[i])
    }
    -1*sum(log(like))
}
optim(par=c(-2,2),ll,x=x)$par #these are the mean and variance estimates produced by maximum likelihood.
## [1] 0.2058405 0.9189130

Q. How do our estimates vary in accuracy as a function of the sample size (change 100 to something much bigger and much smaller in the call to “rnorm” at the top)? What does this connect to in your understanding of estimation theory (think standard error)?

When I ran with rnorm(10000), I got a mean and variance output of 0.00125 and 0.99696, respectively. Conversely, when I did a value of 10, I got a -0.07818 and 0.411641 for the mean and variance.

Question 4: Item Quality

setwd("~/Desktop/WinterQuarter2022/EDUC-252L/252L-master/data")

d1 <- read.delim("emp-rasch.txt", sep = " ", header = FALSE)
d2 <- read.delim("rasch.txt", sep = " ", header = FALSE)

Consider the item statistics (p-values & item-total correlations) discussed in the Crocker & Algina text. What do you think?

get.coors <- function(d1) {
  r.xt <- numeric() # creating a numeric vector
  ss <- rowSums(d1) # sum scores of d1
  for (i in 1:ncol(d1)) {
    r.xt[i] <- cor(ss, d1[,i])
  }
  r.xt
}

plot.fun <- function(d1) {
  layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE))
  par(mgp = c(2,1,0), mar = c(3,3,1,1), oma = rep(.5, 4))
  pv <- colMeans(d1, na.rm = TRUE)
  # r.xt <- get.coors(d1)
  hist(pv, xlim = 0:1, xlab = "p-values", sub = '', main = '')
  # hist(r.xt,xlim=0:1,xlab="item-total cor",sub='',main='')  
  plot(pv, pch = 19, cex = 2)
  
  # details for the red line 
}
plot.fun(d1)

Based on the p-values for the first dataset, there seems to be pretty good spread in terms of items. I was having some trouble loading the plot for the item correlations but if this was able to be shown, I would want to see if there are a decent number of items with acorrelations > .2. I continued to get an error that there were an invalid number of ‘breaks’ when I attempted to run my code and was unable to figure out how to resolve this in a timely fasion.

get.coors <- function(d2) {
  r.xt <- numeric() # creating a numeric vector
  ss <- rowSums(d2) # sum scores of d2
  for (i in 1:ncol(d2)) {
    r.xt[i] <- cor(ss, d2[,i])
  }
  r.xt
}

plot.fun <- function(d2) {
  layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE))
  par(mgp = c(2,1,0), mar = c(3,3,1,1), oma = rep(.5, 4))
  pv <- colMeans(d2, na.rm = TRUE)
  r.xt <- get.coors(d2)
  hist(pv, xlim = 0:1, xlab = "p-values", sub = '', main = '')
  hist(r.xt,xlim=0:1,xlab="item-total cor",sub='',main='')  
  plot(pv, r.xt, pch = 19, cex = 2)
  
  # details for the red line 
  loess(r.xt~pv) -> m
  cbind(pv, fitted(m)) -> tmp
  lines(tmp[order(tmp[,1]),], col = "red", lwd = 3)
  NULL
}
plot.fun(d2)

## NULL

for this dataset, there are not as much spread in p-values compared to the prior dataset; it would have been ideal to have more evenness among the values - especially for p-values > .5. However, item correlation appears to be pretty high!