The data we use in this example come from the "pks" R package. Please install the package and load it (with library()) before running the model.
This data set contains responses to problems in elementary probability theory observed before and after some instructions (the so-called learning object) were given. Data were collected both in the lab and via an online questionnaire. Of the 1127 participants eligible in the online study, 649 were excluded because they did not complete the first set of problems (p101, …, p112) or they responded too quickly or too slowly. Based on similar criteria, further participants were excluded for the second set of problems, indicated by missing values in the variables p201, …, p212. Problems were presented in random order.
Participants were randomized to two conditions: an enhanced learning object including instructions with examples and a basic learning object without examples. Instructions were given on four concepts: how to calculate the classic probability of an event (pb), the probability of the complement of an event (cp), of the union of two disjoint events (un), and of two independent events (id).
A data frame with 504 cases and 68 variables. Link: https://cran.r-project.org/web/packages/pks/pks.pdf
case a factor giving the case id: a five-digit code, the first digit denoting the lab or online branch of the study, the last four digits being the case number.
lastpage Which page of the questionnaire was reached before quitting? The questionnaire consisted of ten pages.
mode a factor; lab or online branch of study.
started a timestamp of class POSIXlt. When did the participant start working on the questionnaire?
sex a factor coding sex of participant.
age age of participant.
educat education as a factor with three levels: 1 secondary school or below; 2 higher education entrance qualification; 3 university degree.
fos field of study. Factor with eight levels: ecla economics, business, law; else miscellaneous; hipo history, politics; lang languages; mabi mathematics, physics, biology; medi medical science; phth philosophy, theology; psco psychology, computer science, cognitive science.
semester ordered factor. What semester are you in?
learnobj a factor with two levels: enhan learning object enhanced with examples; basic learning object without examples.
p101 to p112 The twelve problems of the first part (before the learning object), coded as correct (1) or error (0).
p201 to p212 The twelve problems of the second part (after the learning object), coded as correct (1) or error (0).
time01 to time10 The time (in s) spent on each page of the questionnaire. In the lab branch of the study, participants started directly on Page 2.
For this example, I picked the first 12 items as my target variables.
# First, we need to install the package that contains the target dataset
# install.packages("pks")
library(pks)
## Loading required package: sets
# Then, we load the dataset
data("probability")
# As we learned in the previous class, running a CDM requires two data objects:
# the students' response data and the Q-matrix (the attribute map).
# In this example, I use the first 12 items as the target variables.
probability <- as.data.frame(probability)
pro_data <- probability[,45:56]
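Selecting columns 45 to 56 assumes p101 is the 45th variable. If you would rather not depend on column positions, the same items can be selected by name (a sketch based on the codebook's naming):
# Equivalent selection by column name, safer if the column order ever changes
item_names <- sprintf("p1%02d", 1:12)   # "p101", "p102", ..., "p112"
pro_data <- probability[, item_names]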
# Now, we need to prepare the Q-matrix
## Conjunctive skill function, one-to-one problem function
pro_Q <- read.table(header = TRUE, text = "
item cp id pb un
1 0 0 1 0
2 1 0 0 0
3 0 0 0 1
4 0 1 0 0
5 1 0 1 0
6 1 0 1 0
7 0 0 1 1
8 0 0 1 1
9 0 1 1 0
10 1 1 0 0
11 1 1 1 0
12 0 1 1 1
")
# Drop the item column, keeping the four attribute columns (cp, id, pb, un)
pro_Q <- pro_Q[,2:5]
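A quick consistency check helps before fitting: the Q-matrix should have one row per item in pro_data and contain only 0/1 entries.
# Verify that the Q-matrix matches the response data
stopifnot(nrow(pro_Q) == ncol(pro_data),
          all(unlist(pro_Q) %in% c(0, 1)))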
Before applying other CDM models to our data, I always use the G-DINA model for the first round, because it is a more generalized model.
# Before we run the model, please make sure you have loaded the package.
library("GDINA")
# Fit the data using the G-DINA model
fit1 <- GDINA(dat = pro_data, Q = pro_Q, verbose = 0)
# Check the model result
summary(fit1)
##
## Test Fit Statistics
##
## Loglik = -2426.63
##
## AIC = 4979.26 | penalty [2 * p] = 126.00
## BIC = 5245.28 | penalty [log(n) * p] = 392.02
## CAIC = 5308.28 | penalty [(log(n) + 1) * p] = 455.02
## SABIC = 5045.31 | penalty [log((n + 2)/24) * p] = 192.05
##
## No. of parameters (p) = 63
## No. of estimated item parameters = 48
## No. of fixed item parameters = 0
## No. of distribution parameters = 15
##
## Attribute Prevalence
##
## Level0 Level1
## A1 0.1476 0.8524
## A2 0.2900 0.7100
## A3 0.1773 0.8227
## A4 0.1284 0.8716
As shown above, the summary function prints test fit statistics such as AIC and BIC, the number of parameters, and the attribute prevalence. The attribute prevalence gives the proportion of individuals who master (or do not master) each attribute, labelled "Level1" (or "Level0").
The attribute prevalence table gives a general sense of how well the students mastered each attribute; the attributes follow the Q-matrix column order (A1 = cp, A2 = id, A3 = pb, A4 = un). Overall, students mastered all four attributes, since every Level1 proportion is above 50%. A4 (calculating the probability of the union of two disjoint events) shows the highest prevalence at 87.16%, whereas A2 (calculating the probability of two independent events) has the lowest at 71%.
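As a quick cross-check (a sketch using the personparm function, which we cover in more detail below), the Level1 column can be recovered by averaging each individual's posterior probability of mastering each attribute:
# Average mastery probability per attribute; should match the Level1 column
round(colMeans(personparm(fit1, what = "mp")), 4)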
## to extract item parameters, we can use "coef" function, as in
coef(fit1)
## $`Item 1`
## P(0) P(1)
## 0.2419 0.9343
##
## $`Item 2`
## P(0) P(1)
## 0.2615 0.9999
##
## $`Item 3`
## P(0) P(1)
## 0.0719 0.9728
##
## $`Item 4`
## P(0) P(1)
## 0.1245 0.9636
##
## $`Item 5`
## P(00) P(10) P(01) P(11)
## 0.1105 0.6112 0.8280 0.8526
##
## $`Item 6`
## P(00) P(10) P(01) P(11)
## 0.1371 0.5919 0.9482 0.9564
##
## $`Item 7`
## P(00) P(10) P(01) P(11)
## 0.2717 0.7244 0.6458 0.9438
##
## $`Item 8`
## P(00) P(10) P(01) P(11)
## 0.3557 0.9999 0.9647 0.9476
##
## $`Item 9`
## P(00) P(10) P(01) P(11)
## 0.0985 0.4929 0.6513 0.7746
##
## $`Item 10`
## P(00) P(10) P(01) P(11)
## 0.0494 0.0534 0.8279 0.8078
##
## $`Item 11`
## P(000) P(100) P(010) P(001) P(110) P(101) P(011) P(111)
## 0.0754 0.0001 0.9999 0.0757 0.3618 0.0183 0.6426 0.7051
##
## $`Item 12`
## P(000) P(100) P(010) P(001) P(110) P(101) P(011) P(111)
## 0.0414 0.0001 0.0001 0.0001 0.7865 0.0001 0.0537 0.8220
When only its first argument is specified, coef returns the item success probabilities for each latent group (reduced latent class).
coef has a few arguments, including what, withSE, and SE.type. The what argument specifies what to show: "itemprob" for item success probabilities, "gs" for guessing and slip parameters, "delta" for delta parameters, "rrum" for RRUM parameters when items are estimated using the RRUM, "LCprob" for item success probabilities for all latent classes, and "lambda" for the structural parameters of the joint attribute distribution (such as higher-order structural parameters).
# Let's try some arguments with coef function
# item success probabilities, with standard errors, for item 10
coef(fit1, withSE = TRUE)[[10]]
## P(00) P(10) P(01) P(11)
## Est. 0.0494 0.0534 0.8279 0.8078
## S.E. 0.0359 0.0351 0.1659 0.0234
Item 10 measures cp and id (see the Q-matrix). Students who have mastered only id (calculating the probability of two independent events) answer item 10 correctly with probability 82.79%, and students who have mastered both cp (calculating the probability of the complement of an event) and id do so with probability 80.78%.
# delta parameters for item 7: the intercept, the attribute main effects, and their interaction
coef(fit1, what = "delta", withSE = TRUE)[[7]]
## d0 d1 d2 d12
## Est. 0.2717 0.4527 0.3740 -0.1547
## S.E. 0.0770 0.1797 0.1347 0.2136
The delta parameters are the intercept (d0), the attribute main effects (d1, d2), and their interaction (d12); a latent group's success probability is the sum of the deltas for the attributes it has mastered. For item 7, d0 is the baseline: a person who masters neither attribute answers correctly with probability 27.17%. Mastering both attributes gives 0.2717 + 0.4527 + 0.3740 − 0.1547 ≈ 0.9438, the P(11) reported earlier.
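We can verify that sum directly (a quick check, assuming coef returns the plain delta vector when withSE is left at its default):
# d0 + d1 + d2 + d12 should reproduce P(11) = 0.9438 for item 7
sum(coef(fit1, what = "delta")[[7]])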
# population proportions when a saturated model is used for the joint attribute distribution
coef(fit1, what = "lambda")
## p(0000) p(1000) p(0100) p(0010) p(0001) p(1100) p(1010) p(1001) p(0110)
## 0.0783 0.0174 0.0047 0.0160 0.0121 0.0000 0.0004 0.0474 0.0000
## p(0101) p(0011) p(1110) p(1101) p(1011) p(0111) p(1111)
## 0.0019 0.0051 0.0115 0.0154 0.1132 0.0294 0.6470
# Sort the vector
sort(coef(fit1, what = "lambda"))
## p(1100) p(0110) p(1010) p(0101) p(0100) p(0011) p(1110) p(0001) p(1101)
## 0.0000 0.0000 0.0004 0.0019 0.0047 0.0051 0.0115 0.0121 0.0154
## p(0010) p(1000) p(0111) p(1001) p(0000) p(1011) p(1111)
## 0.0160 0.0174 0.0294 0.0474 0.0783 0.1132 0.6470
Checking the population proportions with the lambda argument, we see that most students (64.70%) mastered all four attributes, while 11.32% lack only the id attribute (calculating the probability of two independent events). An interesting finding is that about 7.83% of students mastered none of the attributes.
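Since these are proportions over all 2^4 = 16 latent classes, they should sum to one:
# The 16 class proportions form a probability distribution
sum(coef(fit1, what = "lambda"))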
# Use the gs argument to check the guessing and slip parameter
coef(fit1, what = "gs", withSE = TRUE)
## guessing slip SE[guessing] SE[slip]
## Item 1 0.2419 0.0657 0.0690 0.0140
## Item 2 0.2615 0.0001 0.0860 0.0107
## Item 3 0.0719 0.0272 0.0527 0.0111
## Item 4 0.1245 0.0364 0.0377 0.0109
## Item 5 0.1105 0.1474 0.0743 0.0197
## Item 6 0.1371 0.0436 0.0685 0.0122
## Item 7 0.2717 0.0562 0.0770 0.0127
## Item 8 0.3557 0.0524 0.0968 0.0119
## Item 9 0.0985 0.2254 0.0552 0.0246
## Item 10 0.0494 0.1922 0.0359 0.0234
## Item 11 0.0754 0.2949 0.0534 0.0270
## Item 12 0.0414 0.1780 0.0394 0.0233
According to the table above, item 8 is the easiest item to guess (35.57%), whereas item 12 is the hardest (4.14%). For the slip parameters, item 11 has the highest slip probability (29.49%), whereas item 2 has the lowest (0.01%).
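In this output, guessing is simply P(0…0), the success probability when none of the required attributes is mastered, and slip is 1 − P(1…1). For a single-attribute item such as item 1 we can verify this directly (a sketch relying on the named output shown earlier):
# Recover guessing and slip for item 1 from its success probabilities
ip1 <- coef(fit1)[["Item 1"]]
c(guessing = unname(ip1["P(0)"]),      # should match 0.2419
  slip     = 1 - unname(ip1["P(1)"]))  # should match 0.0657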
# To plot the item success probabilities (item response functions), we can use the plot function
plot(fit1, what = "IRF", item = c(3, 9))
# We can also add error bars to the IRF plot by using the withSE argument
plot(fit1, what = "IRF", item = c(3, 9), withSE = TRUE)
Individual attribute patterns can be estimated using EAP, MLE and MAP. To obtain attribute estimates, use the function personparm.
# Let's print the first six students' attribute patterns
head(personparm(fit1))
## A1 A2 A3 A4
## [1,] 1 1 1 1
## [2,] 1 1 1 1
## [3,] 1 0 1 1
## [4,] 1 1 1 1
## [5,] 1 0 1 1
## [6,] 1 1 1 1
# We can check any student by specifying their row indices
personparm(fit1)[c(133, 234, 322), ]
## A1 A2 A3 A4
## [1,] 1 1 1 1
## [2,] 1 1 1 1
## [3,] 1 1 1 1
As shown above, personparm(fit1) returns a matrix. We use the head function to display the first six individuals' estimates. Like coef, personparm also has an argument called "what", so we can specify the type of person parameters to extract. If what is not specified, as above, EAP estimates of the attribute patterns are returned. (EAP stands for expected a posteriori.)
#probability of mastering each attribute
head(personparm(fit1, what = "mp")) # mp stands for mastering probability
## A1 A2 A3 A4
## [1,] 0.9878 0.9801 1.0000 0.9990
## [2,] 0.9889 0.9998 1.0000 0.9990
## [3,] 0.9881 0.2399 0.9238 0.9978
## [4,] 0.9860 0.9997 1.0000 0.9990
## [5,] 0.9873 0.0033 0.9946 0.9977
## [6,] 0.9907 0.9997 1.0000 0.9990
# You can also plot the probabilities of mastering each attribute for the individuals you specify.
plot(fit1, what = "mp", person = c(5, 15, 234, 289))
We use the extract function to extract other elements. Its first input should be an object returned from the GDINA function (or other functions), and its second input (i.e., what) specifies what to extract. Use ?extract for a complete list of elements that can be extracted.
# To calculate the discrimination indices, use the argument "discrim"
extract(fit1, what = "discrim")
## P(1)-P(0) GDI
## Item 1 0.6923669 0.06992676
## Item 2 0.7383328 0.06860122
## Item 3 0.9008952 0.09083679
## Item 4 0.8390675 0.14496829
## Item 5 0.7420568 0.04958274
## Item 6 0.8192867 0.06390767
## Item 7 0.6720341 0.04421999
## Item 8 0.5918907 0.03208092
## Item 9 0.6761210 0.05844260
## Item 10 0.7584619 0.11801611
## Item 11 0.6296787 0.09289076
## Item 12 0.7805630 0.13358299
# Store the discrimination table so we can reuse it below
alist <- extract(fit1, what = "discrim")
According to the table above, item 3 has the highest discrimination power (90.09%), whereas item 8 has the lowest (59.19%).
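Since we stored the table in alist, we can also sort the items by the GDI to see the full ranking at a glance:
# Rank items from most to least discriminating according to the GDI
alist[order(alist[, "GDI"], decreasing = TRUE), ]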
# to calculate the proportion of individuals in each latent class, we can use
extract(fit1,"posterior.prob")
## 0000 1000 0100 0010 0001 1100
## [1,] 0.07834271 0.01737127 0.00474728 0.01599271 0.01210388 1.716614e-15
## 1010 1001 0110 0101 0011
## [1,] 0.0004215915 0.04744766 1.000031e-16 0.00189403 0.005149523
## 1110 1101 1011 0111 1111
## [1,] 0.01153519 0.01540388 0.1131965 0.02941002 0.6469838
# This is the same as the lambda output from the coef function
coef(fit1, what = "lambda")
## p(0000) p(1000) p(0100) p(0010) p(0001) p(1100) p(1010) p(1001) p(0110)
## 0.0783 0.0174 0.0047 0.0160 0.0121 0.0000 0.0004 0.0474 0.0000
## p(0101) p(0011) p(1110) p(1101) p(1011) p(0111) p(1111)
## 0.0019 0.0051 0.0115 0.0154 0.1132 0.0294 0.6470
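We can confirm that the two calls return the same proportions (a quick hedged check; both vectors follow the same class ordering, and it should return TRUE if both are stored at full precision):
# posterior.prob and lambda should agree numerically
all.equal(as.numeric(extract(fit1, "posterior.prob")),
          as.numeric(coef(fit1, what = "lambda")))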
We can run other models by specifying the model type in the GDINA function.
# Fit DINA model
fit_DINA <- GDINA(dat = pro_data, Q = pro_Q, model = "DINA", verbose = 0)
# Check the model
summary(fit_DINA)
##
## Test Fit Statistics
##
## Loglik = -2478.90
##
## AIC = 5035.79 | penalty [2 * p] = 78.00
## BIC = 5200.47 | penalty [log(n) * p] = 242.68
## CAIC = 5239.47 | penalty [(log(n) + 1) * p] = 281.68
## SABIC = 5076.68 | penalty [log((n + 2)/24) * p] = 118.89
##
## No. of parameters (p) = 39
## No. of estimated item parameters = 24
## No. of fixed item parameters = 0
## No. of distribution parameters = 15
##
## Attribute Prevalence
##
## Level0 Level1
## A1 0.1159 0.8841
## A2 0.2898 0.7102
## A3 0.1398 0.8602
## A4 0.1203 0.8797
When considering a DINA model: learners are not expected to answer an item correctly unless they have mastered all the attributes measured by that item.
# Fit DINO model
fit_DINO <- GDINA(dat = pro_data, Q = pro_Q, model = "DINO", verbose = 0)
# Check the model
summary(fit_DINO)
##
## Test Fit Statistics
##
## Loglik = -2563.23
##
## AIC = 5204.47 | penalty [2 * p] = 78.00
## BIC = 5369.15 | penalty [log(n) * p] = 242.68
## CAIC = 5408.15 | penalty [(log(n) + 1) * p] = 281.68
## SABIC = 5245.36 | penalty [log((n + 2)/24) * p] = 118.89
##
## No. of parameters (p) = 39
## No. of estimated item parameters = 24
## No. of fixed item parameters = 0
## No. of distribution parameters = 15
##
## Attribute Prevalence
##
## Level0 Level1
## A1 0.2873 0.7127
## A2 0.2824 0.7176
## A3 0.3671 0.6329
## A4 0.3436 0.6564
When considering a DINO model: learners are expected to answer an item correctly as long as they have mastered at least one attribute measured by the item.
# Fit ACDM model
fit_ACDM <- GDINA(dat = pro_data, Q = pro_Q, model = "ACDM", verbose = 0)
# Check the model
summary(fit_ACDM)
##
## Test Fit Statistics
##
## Loglik = -2441.40
##
## AIC = 4980.80 | penalty [2 * p] = 98.00
## BIC = 5187.71 | penalty [log(n) * p] = 304.91
## CAIC = 5236.71 | penalty [(log(n) + 1) * p] = 353.91
## SABIC = 5032.18 | penalty [log((n + 2)/24) * p] = 149.38
##
## No. of parameters (p) = 49
## No. of estimated item parameters = 34
## No. of fixed item parameters = 0
## No. of distribution parameters = 15
##
## Attribute Prevalence
##
## Level0 Level1
## A1 0.1120 0.8880
## A2 0.2933 0.7067
## A3 0.1444 0.8556
## A4 0.1205 0.8795
When considering an A-CDM: mastering attribute k increases the probability of success on item j by delta_jk, independently of the contributions of the other attributes.
# Fit LLM model
fit_LLM <- GDINA(dat = pro_data, Q = pro_Q, model = "LLM", verbose = 0)
# Check the model
summary(fit_LLM)
##
## Test Fit Statistics
##
## Loglik = -2434.27
##
## AIC = 4966.53 | penalty [2 * p] = 98.00
## BIC = 5173.44 | penalty [log(n) * p] = 304.91
## CAIC = 5222.44 | penalty [(log(n) + 1) * p] = 353.91
## SABIC = 5017.91 | penalty [log((n + 2)/24) * p] = 149.38
##
## No. of parameters (p) = 49
## No. of estimated item parameters = 34
## No. of fixed item parameters = 0
## No. of distribution parameters = 15
##
## Attribute Prevalence
##
## Level0 Level1
## A1 0.1118 0.8882
## A2 0.2912 0.7088
## A3 0.1597 0.8403
## A4 0.1143 0.8857
When considering an LLM: it is essentially the A-CDM, but the attribute effects are additive on the logit scale.
# Fit R-RUM model
fit_RRUM <- GDINA(dat = pro_data, Q = pro_Q, model = "RRUM", verbose = 0)
# Check the model
summary(fit_RRUM)
##
## Test Fit Statistics
##
## Loglik = -2440.45
##
## AIC = 4978.91 | penalty [2 * p] = 98.00
## BIC = 5185.82 | penalty [log(n) * p] = 304.91
## CAIC = 5234.82 | penalty [(log(n) + 1) * p] = 353.91
## SABIC = 5030.28 | penalty [log((n + 2)/24) * p] = 149.38
##
## No. of parameters (p) = 49
## No. of estimated item parameters = 34
## No. of fixed item parameters = 0
## No. of distribution parameters = 15
##
## Attribute Prevalence
##
## Level0 Level1
## A1 0.1127 0.8873
## A2 0.2915 0.7085
## A3 0.1584 0.8416
## A4 0.1178 0.8822
When considering an R-RUM: it is essentially the A-CDM, but the attribute effects are additive on the log scale.
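Having fitted all six models, it is convenient to put the information criteria side by side. The sketch below assumes the usual AIC and BIC methods work on GDINA objects, which the summaries above suggest; by both criteria the LLM fits best here (AIC = 4966.53, BIC = 5173.44).
# Compare the six fitted models on AIC and BIC
fits <- list(GDINA = fit1, DINA = fit_DINA, DINO = fit_DINO,
             ACDM = fit_ACDM, LLM = fit_LLM, RRUM = fit_RRUM)
data.frame(AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))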
This is the end of this example. Thank you.