Exploratory analysis:

## Number of unique handles: 176
## Number of unique uid: 180
## Number of unique qid: 14
## head(data.df)
##                 handle                                  uid
## 1 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 2 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 3 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 4 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 5 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 6 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
##                          qid score length Count
## 1 do_31221414791237632023479     1    360     1
## 2 do_31221200484895129621378     0    179     1
## 3 do_31221200160445235221376     1    118     1
## 4 do_31221199846513868821372     1     51     1
## 5 do_31221196333227212811383     1     49     1
## 6 do_31221156833106329611243     1     33     1

User-Question interaction matrix

The x-axis represents the user IDs and the y-axis the question IDs. The value of each cell is the number of times the user has attempted the question.
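The matrix described above is a cross-tabulation of the attempt log. A minimal sketch, assuming a data frame with the same `uid` and `qid` columns as `data.df`; the toy rows and the names `toy.df` and `attempts` are made up for illustration.

```r
# Toy attempt log with the same columns as data.df (values are illustrative only)
toy.df <- data.frame(
  uid   = c("u1", "u1", "u1", "u2", "u2", "u3"),
  qid   = c("q1", "q1", "q2", "q1", "q3", "q2"),
  score = c(1, 0, 1, 1, 0, 1)
)

# Rows are questions (y-axis), columns are users (x-axis);
# each cell counts how many times the user attempted the question
attempts <- table(toy.df$qid, toy.df$uid)
print(attempts)
```

Putting questions in rows and users in columns matches the axes described above.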

User-Question response mean score

The x-axis represents the user IDs and the y-axis the question IDs. The value of each cell is the average of all scores the user has obtained on the question.

Note: A question the user has not attempted is also represented as 0, so a 0 cell does not distinguish "never attempted" from "always answered incorrectly".
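The mean-score matrix and the zero-fill convention can be sketched as follows. This is a sketch on made-up rows, not the report's data. `toy.df`, `mean.score` and `unattempted` are hypothetical names. Keeping the `unattempted` mask is a suggestion, so that the two meanings of 0 remain separable.

```r
# Toy attempt log (illustrative values only)
toy.df <- data.frame(
  uid   = c("u1", "u1", "u1", "u2", "u2", "u3"),
  qid   = c("q1", "q1", "q2", "q1", "q3", "q2"),
  score = c(1, 0, 1, 1, 0, 1)
)

# Mean score per (question, user); tapply leaves NA where there was no attempt
mean.score <- tapply(toy.df$score, list(toy.df$qid, toy.df$uid), mean)

# Remember which cells were never attempted before filling them with 0
unattempted <- is.na(mean.score)
mean.score[unattempted] <- 0
print(mean.score)
```

Here the cells (`q3`, `u1`) and (`q3`, `u2`) both end up as 0, but only the first is an unattempted question.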

Tetrachoric correlation and item correlations

Correlation plot for the 14 questions. The axes represent the question id.
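For intuition about what a tetrachoric correlation measures between two dichotomous items, the classical cosine-pi approximation can be computed from a 2x2 table. This is a rough sketch with made-up counts, not the estimator used for the plot. A maximum-likelihood routine such as `tetrachoric()` in the psych package would normally be preferred. `tetra_approx` is a hypothetical helper name.

```r
# Cosine-pi approximation to the tetrachoric correlation of a 2x2 table.
# Cells [1,1] and [2,2] are the agreeing pairs (both wrong, both right).
# Assumes no zero cells; this is an approximation, not an ML estimate.
tetra_approx <- function(tab) {
  agree    <- tab[1, 1] * tab[2, 2]
  disagree <- tab[1, 2] * tab[2, 1]
  cos(pi / (1 + sqrt(agree / disagree)))
}

# Illustrative counts: rows = item A (0/1), columns = item B (0/1)
tab.assoc <- matrix(c(40, 15, 10, 35), nrow = 2)
tab.indep <- matrix(c(20, 20, 20, 20), nrow = 2)

r.assoc <- tetra_approx(tab.assoc)  # items tend to be answered alike
r.indep <- tetra_approx(tab.indep)  # no association
print(c(r.assoc, r.indep))
```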

Distribution of score by question

Factor analysis:

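Parallel analysis compares the ordered eigenvalues of the real data with those of simulated random data: factors are retained up to, but not including, the smallest index at which the simulated eigenvalue exceeds the corresponding eigenvalue of the real data. Alternatively, Kaiser’s rule of thumb treats components with an eigenvalue below 1 as not significant.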
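The two retention rules mentioned above can be compared side by side in base R. This is a minimal sketch under stated assumptions. The "real" data here are simulated stand-ins with one common factor. Pearson correlations are used instead of tetrachoric ones. All variable names are hypothetical.

```r
set.seed(1)
n.obs   <- 180  # number of users, taken from the counts reported earlier
n.items <- 14   # number of questions

# Stand-in for the real 0/1 response matrix: one common ability plus noise
ability <- rnorm(n.obs)
real <- sapply(seq_len(n.items), function(j) {
  as.integer(ability + rnorm(n.obs) > -0.8)
})
real.eig <- eigen(cor(real), only.values = TRUE)$values

# Mean ordered eigenvalues over random data sets of the same shape
sim.eig <- rowMeans(replicate(50, {
  noise <- matrix(rnorm(n.obs * n.items), n.obs, n.items)
  eigen(cor(noise), only.values = TRUE)$values
}))

# Parallel analysis: number of leading real eigenvalues above the simulated ones
exceeds    <- real.eig > sim.eig
n.parallel <- if (all(exceeds)) n.items else which(!exceeds)[1] - 1

# Kaiser's rule: number of eigenvalues above 1
n.kaiser <- sum(real.eig > 1)
print(c(parallel = n.parallel, kaiser = n.kaiser))
```

The two rules need not agree, which is one reason the scree plot is usually inspected as well.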

## Parallel analysis suggests that the number of factors =  1  and the number of components =  NA
## [1] "From the figure, we choose the number of factors as 2"
## 
## Loadings:
##       MR1    MR2   
##  [1,]  0.617  0.286
##  [2,]  0.561 -0.405
##  [3,]  0.485       
##  [4,]         0.812
##  [5,]  0.214  0.657
##  [6,]  0.763 -0.194
##  [7,] -0.122  0.575
##  [8,]  0.424       
##  [9,]  0.373  0.859
## [10,]  0.879  0.186
## [11,]  0.657  0.174
## [12,]  0.969       
## [13,]  0.886       
## [14,]  0.803  0.431
## 
##                  MR1   MR2
## SS loadings    5.467 2.713
## Proportion Var 0.390 0.194
## Cumulative Var 0.390 0.584

Visualize the factor loadings as a graph

Force-directed network plot of the factor loadings

IRT

Analysis of multivariate dichotomous data using latent trait models under the Item Response Theory approach.

2PL model (two-parameter logistic model)

Modeling the response matrix using a 2 parameter model, we get:

##             Dffclt    Dscrmn
## Item 1  -0.8430558  2.622835
## Item 2  -0.3055962  1.823897
## Item 3  -0.7490030  2.733768
## Item 4  -0.7583207  1.957374
## Item 5  -0.6080094  1.997835
## Item 6  -1.0267144  2.480781
## Item 7  -0.7050109  2.019568
## Item 8  -0.9326956  1.984742
## Item 9  -0.8614164  4.476312
## Item 10 -0.7803617 22.250036
## Item 11 -0.7935644  3.323147
## Item 12 -1.0229813  6.133005
## Item 13 -0.9232772  3.513819
## Item 14 -1.1000758  5.251052

As seen from the response matrix, most of the students have answered the questions correctly, which implies the questions are easy. The negative values of the difficulty parameter reflect this. A difficulty of -0.84 means that a student whose ability is 0.84 standard deviations below the mean has a 50% chance of answering the question correctly; students with higher ability are more likely than not to get it right.
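The difficulty value can be checked numerically. This is a minimal sketch, assuming the usual 2PL form P(score = 1 | theta) = 1 / (1 + exp(-a (theta - b))), with b read from `Dffclt` and a from `Dscrmn`. The helper `icc_2pl` is a hypothetical name, not part of ltm.

```r
# 2PL item characteristic curve: probability of a correct answer at ability theta
icc_2pl <- function(theta, difficulty, discrimination) {
  plogis(discrimination * (theta - difficulty))
}

# Item 1 estimates from the table above
b1 <- -0.8430558
a1 <- 2.622835

p.at.b    <- icc_2pl(b1, b1, a1)  # ability equal to the item difficulty
p.at.mean <- icc_2pl(0,  b1, a1)  # average student (theta = 0)
p.low     <- icc_2pl(-2, b1, a1)  # student two standard deviations below average
print(round(c(p.at.b, p.at.mean, p.low), 3))
```

Under this form an average student is very likely to answer Item 1 correctly, which is consistent with the item being easy.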

diff_Disc.df <- as.data.frame(coef(model))
diff_Disc.df$qid <- questions
cat("Question in decreasing order of difficulty:")
## Question in decreasing order of difficulty:
# order(-x) sorts from the largest (hardest) difficulty to the smallest (easiest)
diff_Disc.df$qid[order(-diff_Disc.df$Dffclt)]
##  [1] "do_31221200484895129621378" "do_31221196333227212811383"
##  [3] "do_31221197647432089621363" "do_31221200160445235221376"
##  [5] "do_31221199846513868821372" "do_31221197474628403211385"
##  [7] "do_31221195964240691221357" "do_31221414791237632023479"
##  [9] "do_31221156454046105611241" "do_31221156477276979221230"
## [11] "do_31221197823005491221366" "do_31221195817094348821353"
## [13] "do_31221156833106329611243" "do_31221196141780992021359"

Similarly, ‘Dscrmn’ gives the discriminability of each question. It appears as the slope of the ICC curve below.

To understand what the latent parameters are, we look at the item characteristic curves. These are logistic functions, one for each of the 14 questions. Each one shows the likelihood of scoring 1 as a function of the respondent’s score on the underlying latent variable, called ability.

The slope indicates how well the question is able to discriminate among the respondents. A flat slope tells us that the probability of scoring 1 changes slowly as the latent variable (ability) increases or decreases.
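The link between discrimination and slope can be made concrete: for a logistic curve, the slope at the point where it crosses 50% equals the discrimination divided by 4. This is a sketch under the same 2PL form assumed for the table above. `icc_2pl` and `slope_at` are hypothetical helper names.

```r
# 2PL item characteristic curve
icc_2pl <- function(theta, difficulty, discrimination) {
  plogis(discrimination * (theta - difficulty))
}

# Central-difference estimate of the ICC slope at a given ability
slope_at <- function(theta, difficulty, discrimination, h = 1e-5) {
  (icc_2pl(theta + h, difficulty, discrimination) -
     icc_2pl(theta - h, difficulty, discrimination)) / (2 * h)
}

# Item 2 has the lowest discrimination in the 2PL table, Item 12 a much higher one
slope.item2  <- slope_at(-0.3055962, -0.3055962, 1.823897)
slope.item12 <- slope_at(-1.0229813, -1.0229813, 6.133005)
print(c(item2 = slope.item2, item12 = slope.item12))
```

Item 12's curve is more than three times steeper at its midpoint, so a small change in ability around that point changes the chance of a correct answer far more than for Item 2.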

## The point at which the curve touches the 50% line denotes the 'ability' at which a student has an even chance of scoring 1

Item information

## 
## Item-Fit Statistics and P-values
## 
## Call:
## ltm(formula = score_matrix_2PL ~ z1)
## 
## Alternative: Items do not fit the model
## Ability Categories: 10
## 
##            X^2 Pr(>X^2)
## It 1   15.4002   0.0518
## It 2    5.2520   0.7303
## It 3   13.4060   0.0986
## It 4    8.5278   0.3837
## It 5    6.2458   0.6197
## It 6   16.8677   0.0315
## It 7    6.1478   0.6307
## It 8   10.3436   0.2417
## It 9   61.8880  <0.0001
## It 10   1.8275   0.9859
## It 11  27.5651   0.0006
## It 12 179.3043  <0.0001
## It 13  20.3487   0.0091
## It 14  59.4562  <0.0001
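Since the alternative hypothesis is that an item does not fit the model, small p-values flag misfit. The sketch below re-enters the p-values from the table and applies a 0.05 cut-off. The cut-off is an assumption for illustration, not one stated in the report. Entries printed as `<0.0001` are entered as the upper bound 0.0001.

```r
# P-values copied from the item-fit table; "<0.0001" entered as 0.0001
fit <- data.frame(
  item = paste("It", 1:14),
  p    = c(0.0518, 0.7303, 0.0986, 0.3837, 0.6197, 0.0315, 0.6307,
           0.2417, 0.0001, 0.9859, 0.0006, 0.0001, 0.0091, 0.0001)
)

# Items whose fit is rejected at the (assumed) 5% level
alpha  <- 0.05
misfit <- fit$item[fit$p < alpha]
print(misfit)
```

With 14 tests run at once, a multiple-comparison correction such as `p.adjust()` might also be worth considering before dropping items.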

The height at which the ICC levels off above probability 0 at low ability (its lower asymptote) is the guessing parameter. We capture this using a 3PL model.

3PL model

##                Gussng       Dffclt     Dscrmn
## Item 1  2.136226e-102 -0.830578781   2.716531
## Item 2   2.548204e-01  0.004531795  66.335969
## Item 3  4.745745e-112 -0.748004428   2.769426
## Item 4  3.006904e-119 -0.761245287   1.974536
## Item 5  1.766010e-153 -0.619369749   1.997891
## Item 6   6.678032e-80 -0.970312754   2.788291
## Item 7  2.920604e-120 -0.709620929   2.038679
## Item 8   7.842681e-99 -0.912729156   2.071476
## Item 9  2.379656e-101 -0.850077129   4.573664
## Item 10  4.150836e-96 -0.683747982 369.865463
## Item 11 2.453560e-108 -0.760677914   3.835605
## Item 12  4.899897e-85 -0.720655204  49.500613
## Item 13  4.637113e-62 -0.898100511   3.730905
## Item 14  6.832884e-90 -0.780643780  22.962596
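To see how the guessing parameter acts as a floor on the curve, the sketch below evaluates the standard 3PL form P = c + (1 - c) / (1 + exp(-a (theta - b))) for Item 2, the only item with a non-negligible `Gussng` estimate. `icc_3pl` is a hypothetical helper name. The mapping of `Gussng`, `Dffclt` and `Dscrmn` to c, b and a is assumed.

```r
# 3PL item characteristic curve: a guessing floor plus a scaled logistic
icc_3pl <- function(theta, guessing, difficulty, discrimination) {
  guessing + (1 - guessing) * plogis(discrimination * (theta - difficulty))
}

# Item 2 estimates from the 3PL table above
g2 <- 2.548204e-01
b2 <- 0.004531795
a2 <- 66.335969

p.floor <- icc_3pl(-4, g2, b2, a2)  # very low ability: probability sits at the floor
p.mid   <- icc_3pl(b2, g2, b2, a2)  # at the difficulty: halfway between floor and 1
print(c(floor = p.floor, mid = p.mid))
```

In the 3PL model the difficulty no longer marks the 50% point but the midpoint between the guessing floor and 1.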

## 
##  Likelihood Ratio Table
##            AIC     BIC log.Lik    LRT df p.value
## model  1860.74 1950.14 -902.37                  
## model2 1906.10 2040.20 -911.05 -17.36 14       1
## [1] "The 2PL model has the lower AIC, so it is chosen"
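The table entries can be reproduced from the log-likelihoods using the standard definitions AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik. The sketch assumes k = 2 x 14 parameters for the 2PL model and 3 x 14 for the 3PL model, and n = 180 respondents (the number of unique uid values). `ic` is a hypothetical helper name.

```r
# Information criteria from a log-likelihood, parameter count and sample size
ic <- function(loglik, k, n) {
  c(AIC = 2 * k - 2 * loglik, BIC = k * log(n) - 2 * loglik)
}

n.obs  <- 180                        # assumed: one response pattern per unique uid
ic.2pl <- ic(-902.37, 2 * 14, n.obs) # difficulty + discrimination per item
ic.3pl <- ic(-911.05, 3 * 14, n.obs) # plus a guessing parameter per item

# Likelihood-ratio statistic of the 3PL model against the 2PL model
lrt <- 2 * (-911.05 - (-902.37))

print(round(rbind(model = ic.2pl, model2 = ic.3pl), 2))
print(lrt)
```

The statistic is negative, meaning the fitted 3PL model has a lower log-likelihood than the 2PL model nested inside it. This may indicate that the 3PL optimisation did not reach its maximum, so the comparison is best read with that caveat.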

As a classification problem: libFM

The response matrix can be used to model the student response. The FPR-TPR curve for the model is shown below.
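An FPR-TPR (ROC) curve is built by sweeping a threshold over the predicted probabilities. The sketch below does this in base R on made-up labels and predictions. It does not call libFM. `roc_points`, `label` and `prob` are hypothetical names.

```r
# FPR/TPR pairs obtained by lowering the decision threshold step by step
roc_points <- function(label, prob) {
  # Start above every prediction so the curve begins at (0, 0)
  cuts <- c(Inf, sort(unique(prob), decreasing = TRUE))
  t(sapply(cuts, function(cut) {
    pred <- prob >= cut
    c(fpr = sum(pred & label == 0) / sum(label == 0),
      tpr = sum(pred & label == 1) / sum(label == 1))
  }))
}

# Illustrative observed scores and predicted probabilities of scoring 1
label <- c(1, 1, 0, 1, 0, 1, 0, 0)
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1)

roc <- roc_points(label, prob)

# Area under the curve by the trapezoid rule
auc <- sum(diff(roc[, "fpr"]) *
             (head(roc[, "tpr"], -1) + tail(roc[, "tpr"], -1)) / 2)
print(auc)
```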

Treating the problem as a classification problem also lets us build a content recommendation for each user. Suppose we want to implement a recommendation engine (RE) that maximises the performance of a user on each question; we train it using the observed user-question interaction scores.

For each user we obtain a prescribed order in which to render the questions such that the score obtained is maximum. This can be extended to questions not yet played by any user.

## For user with uid:'c64c28cf-60e7-4b5d-b435-cf371f0b7f44' the content is recommended in the following order:
##  [1] "do_31221199846513868821372" "do_31221197823005491221366"
##  [3] "do_31221200160445235221376" "do_31221414791237632023479"
##  [5] "do_31221196141780992021359" "do_31221196333227212811383"
##  [7] "do_31221195817094348821353" "do_31221195964240691221357"
##  [9] "do_31221156454046105611241" "do_31221156833106329611243"
## [11] "do_31221197474628403211385" "do_31221156477276979221230"
## [13] "do_31221200484895129621378" "do_31221197647432089621363"
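One simple way to turn per-question predictions into an order like the one above is to sort the questions by predicted probability of scoring 1. This is a sketch of that idea only. The ranking rule is an assumption rather than the report's confirmed method, and the question names and probabilities are made up.

```r
# Hypothetical predicted P(score = 1) for one user, one entry per question
pred <- c(q1 = 0.62, q2 = 0.91, q3 = 0.40, q4 = 0.78)

# Recommend questions from the highest predicted score to the lowest
recommended <- names(sort(pred, decreasing = TRUE))
print(recommended)
```

Because a factorization model can score any user-question pair, the same ranking could in principle include questions the user has not yet seen.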