Class Example: Using CDM models on the Probability Dataset

Section 1: Introduction of the dataset

Title: Probability dataset

Problems in Elementary Probability Theory

The data we are using for this example came from the “pks” R package. Please install the package and load (library) it before you run the model.

Data Description

This data set contains responses to problems in elementary probability theory observed before and after some instructions (the so-called learning object) were given. Data were collected both in the lab and via an online questionnaire. Of the 1127 participants eligible in the online study, 649 were excluded because they did not complete the first set of problems (p101, …, p112) or they responded too quickly or too slowly. Based on similar criteria, further participants were excluded for the second set of problems, indicated by missing values in the variables b201, …, b212. Problems were presented in random order.

Participants were randomized to two conditions: an enhanced learning object including instructions with examples and a basic learning object without examples. Instructions were given on four concepts: how to calculate the classic probability of an event (pb), the probability of the complement of an event (cp), of the union of two disjoint events (un), and of two independent events (id).

Data format

A data frame with 504 cases and 68 variables. Link: https://cran.r-project.org/web/packages/pks/pks.pdf

  • case a factor giving the case id, a five-digits code the fist digit denoting lab or online branch of the study, the last four digits being the case number.

  • lastpage Which page of the questionnaire was reached before quitting? The questionnaire consisted of ten pages.

  • mode a factor; lab or online branch of study.

  • started a timestamp of class POSIXlt. When did participant start working on the questionnaire?

  • sex a factor coding sex of participant.

  • age age of participant.

  • educat education as a factor with three levels: 1 secondary school or below; 2 higher education entrance qualification; 3 university degree.

  • fos field of study. Factor with eight levels: ecla economics, business, law; else miscellaneous; hipo history, politics; lang languages; mabi mathematics, physics, biology; medi medical science; phth philosophy, theology; psco psychology, computer science, cognitive science.

  • semester ordered factor. What semester are you in?

  • learnobj a factor with two levels: enhan learning object enhanced with examples; basic learning object without examples.

  • p101 to p112 The twelve problems of the first part (before the learning object). Coded as correct(1) or error(0).

  • p201 to p212 The twelve problems of the second part (after the learning object). Coded as correct(1) or error(0).

  • time01 to time10 the time (in s) spent on each page of the questionnaire. In the lab branch of the study, participants started directly on Page 2.

Important Variables

For this example, I picked the first 12 items as my target variables.

# First, we need to install the package that contains the target dataset
# install.packages("pks")
library(pks)
## Loading required package: sets
# Then, we load the dataset
data("probability")
# As we learned from the previous class, for running a CDM model, we need to prepare two data files: One is the Students' responses data, the other is the Q-matrix (the attribute map).
# In this Example, I use the first 12 items as our target variables.
probability <- as.data.frame(probability)
pro_data <- probability[,45:56]
# Now, we need to prepare the Q-matrix
## Conjunctive skill function, one-to-one problem function
pro_Q <- read.table(header = TRUE, text = "
item cp id pb un
1 0 0 1 0
2 1 0 0 0
3 0 0 0 1
4 0 1 0 0
5 1 0 1 0
6 1 0 1 0
7 0 0 1 1
8 0 0 1 1
9 0 1 1 0
10 1 1 0 0
11 1 1 1 0
12 0 1 1 1
")
# Drop the Item column
pro_Q <- pro_Q[,2:5]

Section 2: Running the GDINA model

Before we use other CDM models on our data, I am always use the GDINA model to run the first round. Because it is a more generlized model.

Running the GDINA model

# Before we run the model, please make sure you have load the package.
library("GDINA")
#Fit the data using G-DINA model
fit1 <- GDINA(dat = pro_data, Q = pro_Q,verbose = 0)
# Check the model result
summary(fit1)
## 
## Test Fit Statistics
## 
## Loglik = -2426.63 
## 
## AIC    = 4979.26  | penalty [2 * p]  = 126.00 
## BIC    = 5245.28  | penalty [log(n) * p]  = 392.02 
## CAIC   = 5308.28  | penalty [(log(n) + 1) * p]  = 455.02 
## SABIC  = 5045.31  | penalty [log((n + 2)/24) * p]  = 192.05 
## 
## No. of parameters (p)  = 63 
##   No. of estimated item parameters =  48 
##   No. of fixed item parameters =  0 
##   No. of distribution parameters =  15 
## 
## Attribute Prevalence
## 
##    Level0 Level1
## A1 0.1476 0.8524
## A2 0.2900 0.7100
## A3 0.1773 0.8227
## A4 0.1284 0.8716

As show above, the summary function prints some test fit statistics, like AIC and BIC, the number of parameters and attribute prevalence. The attribute prevalence gives the proportion of individuals who master (or do not master) each attribute, labelled as “Level1” (or “Level0”).

According to the Attribute Prevalence table, we can have an general understanding on how well the students mastered each attribute. Overall, students mastered all four attributes because the probabilities showed in level1 are above 50%. A4 (calculating the probability of two indepdent events) shows the highest probability 87.16%, whereas A2 (calculating the probability of the complement of an event) reveals the lowest probability as 71%.

Check Item Parameters

## to extract item parameters, we can use "coef" function, as in
coef(fit1)
## $`Item 1`
##   P(0)   P(1) 
## 0.2419 0.9343 
## 
## $`Item 2`
##   P(0)   P(1) 
## 0.2615 0.9999 
## 
## $`Item 3`
##   P(0)   P(1) 
## 0.0719 0.9728 
## 
## $`Item 4`
##   P(0)   P(1) 
## 0.1245 0.9636 
## 
## $`Item 5`
##  P(00)  P(10)  P(01)  P(11) 
## 0.1105 0.6112 0.8280 0.8526 
## 
## $`Item 6`
##  P(00)  P(10)  P(01)  P(11) 
## 0.1371 0.5919 0.9482 0.9564 
## 
## $`Item 7`
##  P(00)  P(10)  P(01)  P(11) 
## 0.2717 0.7244 0.6458 0.9438 
## 
## $`Item 8`
##  P(00)  P(10)  P(01)  P(11) 
## 0.3557 0.9999 0.9647 0.9476 
## 
## $`Item 9`
##  P(00)  P(10)  P(01)  P(11) 
## 0.0985 0.4929 0.6513 0.7746 
## 
## $`Item 10`
##  P(00)  P(10)  P(01)  P(11) 
## 0.0494 0.0534 0.8279 0.8078 
## 
## $`Item 11`
## P(000) P(100) P(010) P(001) P(110) P(101) P(011) P(111) 
## 0.0754 0.0001 0.9999 0.0757 0.3618 0.0183 0.6426 0.7051 
## 
## $`Item 12`
## P(000) P(100) P(010) P(001) P(110) P(101) P(011) P(111) 
## 0.0414 0.0001 0.0001 0.0001 0.7865 0.0001 0.0537 0.8220

The returned output by coef (when we only specify the first argument) is item success probabilities for each latent group (reduced latent classes).

coef has a few arguments including what, withSE and SE.type what specifies what to show; It can be “itemprob” for item success probabilities, “gs” for guessing and slip parameters, “delta” for delta parameters, “rrum” for RRUM parameters when items are estimated using RRUM, “LCprob” for item success probabilities for all latent classes, and “lambda” for structural parameters of joint attribute distribution (such as higher-order structural parameters).

# Let's try some arguments with coef function
# item success probabilities for item 10
coef(fit1, withSE = TRUE) [[10]]
##       P(00)  P(10)  P(01)  P(11)
## Est. 0.0494 0.0534 0.8279 0.8078
## S.E. 0.0359 0.0351 0.1659 0.0234

For item 10, the students are most likely to mastered the id (calculating the probability of two independent events) atrribute as 82.79%, seconded by the probability of mastered both id and cp (calculating the probability of complement of an event) attriute as 80.78%.

# delta parameters for item 7, which reflecting the difficulty of the item
coef(fit1, what = "delta", withSE = TRUE)[[7]]
##          d0     d1     d2     d12
## Est. 0.2717 0.4527 0.3740 -0.1547
## S.E. 0.0770 0.1797 0.1347  0.2136

Delta Parameters indicate the cumulative probability. If a person masters none attribute, then the probability to answer the item 7 correct is 27.17%.

# population proportions when saturated model is used for joint attribute distribution
coef(fit1, what = "lambda")
## p(0000) p(1000) p(0100) p(0010) p(0001) p(1100) p(1010) p(1001) p(0110) 
##  0.0783  0.0174  0.0047  0.0160  0.0121  0.0000  0.0004  0.0474  0.0000 
## p(0101) p(0011) p(1110) p(1101) p(1011) p(0111) p(1111) 
##  0.0019  0.0051  0.0115  0.0154  0.1132  0.0294  0.6470
# Sort the vector
sort(coef(fit1, what = "lambda"))
## p(1100) p(0110) p(1010) p(0101) p(0100) p(0011) p(1110) p(0001) p(1101) 
##  0.0000  0.0000  0.0004  0.0019  0.0047  0.0051  0.0115  0.0121  0.0154 
## p(0010) p(1000) p(0111) p(1001) p(0000) p(1011) p(1111) 
##  0.0160  0.0174  0.0294  0.0474  0.0783  0.1132  0.6470

Check the population porpotions with the lambda argument, we can see most of the students (64.70%) mastered all 4 attributes, 11.32% students only lack the knowledge on id (calculating the probability of two independent events) atrribute. An interesting factor could be found here is that about 7.83% students mastered none of these attributes.

# Use the gs argument to check the guessing and slip parameter
coef(fit1,what="gs",withSE = TRUE)
##         guessing   slip SE[guessing] SE[slip]
## Item 1    0.2419 0.0657       0.0690   0.0140
## Item 2    0.2615 0.0001       0.0860   0.0107
## Item 3    0.0719 0.0272       0.0527   0.0111
## Item 4    0.1245 0.0364       0.0377   0.0109
## Item 5    0.1105 0.1474       0.0743   0.0197
## Item 6    0.1371 0.0436       0.0685   0.0122
## Item 7    0.2717 0.0562       0.0770   0.0127
## Item 8    0.3557 0.0524       0.0968   0.0119
## Item 9    0.0985 0.2254       0.0552   0.0246
## Item 10   0.0494 0.1922       0.0359   0.0234
## Item 11   0.0754 0.2949       0.0534   0.0270
## Item 12   0.0414 0.1780       0.0394   0.0233

According to the table above, we can tell that Item 8 is the easiest item to guess (35.57%), whereas item 12 is the hardest one to guess (4.14%). For slip parameters, the item 11 contains the highest slip probability (29.49%), whereas the item 2 contains the lowest slip probability (0.001%).

# To plot the probabilities of success for each item, we can use the plot function
plot(fit1,what="IRF",item=c(3,9))

# We can also add an error bar to the IRF plot by using the withSE argument
plot(fit1,what="IRF",item=c(3,9),withSE = TRUE)

Check Person Parameters

Individual attribute patterns can be estimated using EAP, MLE and MAP. To obtain attribute estimates, use the function personparm.

# Let's print out the frist 6 students attribute map
head(personparm(fit1))
##      A1 A2 A3 A4
## [1,]  1  1  1  1
## [2,]  1  1  1  1
## [3,]  1  0  1  1
## [4,]  1  1  1  1
## [5,]  1  0  1  1
## [6,]  1  1  1  1
# We can check any student by specify their id
personparm(fit1)[c(133,234,322),]
##      A1 A2 A3 A4
## [1,]  1  1  1  1
## [2,]  1  1  1  1
## [3,]  1  1  1  1

As shown in above, personparm(fit1) will return a matrix. We use head function to display the first six individuals’ estimates. Like coef function, personparm also has an argument called “what” so that we can specify the type of person parameters that we want to extract. If what is not specified as above, EAP estimates of attribute patterns will be returned.(EAP stands for Expected A Priori)

#probability of mastering each attribute
head(personparm(fit1, what = "mp")) # mp stands for mastering probability
##          A1     A2     A3     A4
## [1,] 0.9878 0.9801 1.0000 0.9990
## [2,] 0.9889 0.9998 1.0000 0.9990
## [3,] 0.9881 0.2399 0.9238 0.9978
## [4,] 0.9860 0.9997 1.0000 0.9990
## [5,] 0.9873 0.0033 0.9946 0.9977
## [6,] 0.9907 0.9997 1.0000 0.9990
# You can also plot probabilities of mastering each attribute for individuals you speficied.
plot(fit1, what = "mp", person = c(5,15,234,289))

Check Ohter Components of G-DINA estimates

We use extract function to extract other elements. Its first input should be an object returned from GDINA function (or other functions), and its second input (i.e., what) specifies what to extract. Use ?extract for a complete list of elements that can be extracted.

# to calculate the discrimination indices,use the argument "discrim"
extract(fit1,what="discrim")
##         P(1)-P(0)        GDI
## Item 1  0.6923669 0.06992676
## Item 2  0.7383328 0.06860122
## Item 3  0.9008952 0.09083679
## Item 4  0.8390675 0.14496829
## Item 5  0.7420568 0.04958274
## Item 6  0.8192867 0.06390767
## Item 7  0.6720341 0.04421999
## Item 8  0.5918907 0.03208092
## Item 9  0.6761210 0.05844260
## Item 10 0.7584619 0.11801611
## Item 11 0.6296787 0.09289076
## Item 12 0.7805630 0.13358299
alist <- extract(fit1,what="discrim")

According to the table above, it reveals that the item 3 has the highest discrimination power (90.09%), whereas the item 8 has the lowest discrimination power (59.19%)

# to calculate the proportion of individuals in each latent class, we can use
extract(fit1,"posterior.prob")
##            0000       1000       0100       0010       0001         1100
## [1,] 0.07834271 0.01737127 0.00474728 0.01599271 0.01210388 1.716614e-15
##              1010       1001         0110       0101        0011
## [1,] 0.0004215915 0.04744766 1.000031e-16 0.00189403 0.005149523
##            1110       1101      1011       0111      1111
## [1,] 0.01153519 0.01540388 0.1131965 0.02941002 0.6469838
# This is the same with the lambda parameter when using the coef function 
coef(fit1, what = "lambda")
## p(0000) p(1000) p(0100) p(0010) p(0001) p(1100) p(1010) p(1001) p(0110) 
##  0.0783  0.0174  0.0047  0.0160  0.0121  0.0000  0.0004  0.0474  0.0000 
## p(0101) p(0011) p(1110) p(1101) p(1011) p(0111) p(1111) 
##  0.0019  0.0051  0.0115  0.0154  0.1132  0.0294  0.6470

Section 3: Running the other CDM model

We can running other models by specify the model type in GDINA function

Fitting the DINA model

# Fit DINA model
fit_DINA <- GDINA(dat = pro_data, Q = pro_Q, model="DINA", verbose = 0)
# Check the model
summary(fit_DINA)
## 
## Test Fit Statistics
## 
## Loglik = -2478.90 
## 
## AIC    = 5035.79  | penalty [2 * p]  = 78.00 
## BIC    = 5200.47  | penalty [log(n) * p]  = 242.68 
## CAIC   = 5239.47  | penalty [(log(n) + 1) * p]  = 281.68 
## SABIC  = 5076.68  | penalty [log((n + 2)/24) * p]  = 118.89 
## 
## No. of parameters (p)  = 39 
##   No. of estimated item parameters =  24 
##   No. of fixed item parameters =  0 
##   No. of distribution parameters =  15 
## 
## Attribute Prevalence
## 
##    Level0 Level1
## A1 0.1159 0.8841
## A2 0.2898 0.7102
## A3 0.1398 0.8602
## A4 0.1203 0.8797

When you’re considering use a DINA model: learners are assumed to be not expected to answer all the items correctly unless they have already mastered all the attributes measured by this item.

Fitting the DINA model

# Fit DINO model
fit_DINO <- GDINA(dat = pro_data, Q = pro_Q, model="DINO", verbose = 0)
# Check the model
summary(fit_DINO)
## 
## Test Fit Statistics
## 
## Loglik = -2563.23 
## 
## AIC    = 5204.47  | penalty [2 * p]  = 78.00 
## BIC    = 5369.15  | penalty [log(n) * p]  = 242.68 
## CAIC   = 5408.15  | penalty [(log(n) + 1) * p]  = 281.68 
## SABIC  = 5245.36  | penalty [log((n + 2)/24) * p]  = 118.89 
## 
## No. of parameters (p)  = 39 
##   No. of estimated item parameters =  24 
##   No. of fixed item parameters =  0 
##   No. of distribution parameters =  15 
## 
## Attribute Prevalence
## 
##    Level0 Level1
## A1 0.2873 0.7127
## A2 0.2824 0.7176
## A3 0.3671 0.6329
## A4 0.3436 0.6564

When you’re considering use a DINO model: Leaners are expected to answer an item correctly as long as they have already mastered at least one attribute measured by the item.

Fitting the ACDM model

# Fit ACDM model
fit_ACDM <- GDINA(dat = pro_data, Q = pro_Q, model="ACDM", verbose = 0)
# Check the model
summary(fit_ACDM)
## 
## Test Fit Statistics
## 
## Loglik = -2441.40 
## 
## AIC    = 4980.80  | penalty [2 * p]  = 98.00 
## BIC    = 5187.71  | penalty [log(n) * p]  = 304.91 
## CAIC   = 5236.71  | penalty [(log(n) + 1) * p]  = 353.91 
## SABIC  = 5032.18  | penalty [log((n + 2)/24) * p]  = 149.38 
## 
## No. of parameters (p)  = 49 
##   No. of estimated item parameters =  34 
##   No. of fixed item parameters =  0 
##   No. of distribution parameters =  15 
## 
## Attribute Prevalence
## 
##    Level0 Level1
## A1 0.1120 0.8880
## A2 0.2933 0.7067
## A3 0.1444 0.8556
## A4 0.1205 0.8795

When you’re considering use a A-CDM model: It indicates that a student who mastering one attribute increase the probability of success on item j by delta_jk independent of the contributions of the other attributes.

Fitting the LLM model

# Fit LLM model
fit_LLM <- GDINA(dat = pro_data, Q = pro_Q, model="LLM", verbose = 0)
# Check the model
summary(fit_LLM)
## 
## Test Fit Statistics
## 
## Loglik = -2434.27 
## 
## AIC    = 4966.53  | penalty [2 * p]  = 98.00 
## BIC    = 5173.44  | penalty [log(n) * p]  = 304.91 
## CAIC   = 5222.44  | penalty [(log(n) + 1) * p]  = 353.91 
## SABIC  = 5017.91  | penalty [log((n + 2)/24) * p]  = 149.38 
## 
## No. of parameters (p)  = 49 
##   No. of estimated item parameters =  34 
##   No. of fixed item parameters =  0 
##   No. of distribution parameters =  15 
## 
## Attribute Prevalence
## 
##    Level0 Level1
## A1 0.1118 0.8882
## A2 0.2912 0.7088
## A3 0.1597 0.8403
## A4 0.1143 0.8857

When you’re considering use a LLM model: You’re almost doing the ACDM model but on a logit scale.

Fitting the R-RUM model

# Fit R-RUM model
fit_RRUM <- GDINA(dat = pro_data, Q = pro_Q, model="RRUM", verbose = 0)
# Check the model
summary(fit_RRUM)
## 
## Test Fit Statistics
## 
## Loglik = -2440.45 
## 
## AIC    = 4978.91  | penalty [2 * p]  = 98.00 
## BIC    = 5185.82  | penalty [log(n) * p]  = 304.91 
## CAIC   = 5234.82  | penalty [(log(n) + 1) * p]  = 353.91 
## SABIC  = 5030.28  | penalty [log((n + 2)/24) * p]  = 149.38 
## 
## No. of parameters (p)  = 49 
##   No. of estimated item parameters =  34 
##   No. of fixed item parameters =  0 
##   No. of distribution parameters =  15 
## 
## Attribute Prevalence
## 
##    Level0 Level1
## A1 0.1127 0.8873
## A2 0.2915 0.7085
## A3 0.1584 0.8416
## A4 0.1178 0.8822

When you’re considering use a R-RUM model: You’re almost doing the ACDM model but on a log scale.

For check the item parameters or person parameters, you can use any codes that provided in the second section (GDINA Example) on your model outcomes.

This is the end of this example. Thank you.