## Number of unique handles: 176
## Number of unique uid: 180
## Number of unique qid: 14
## head(data.df)
## handle uid
## 1 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 2 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 3 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 4 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 5 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 6 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## qid score length Count
## 1 do_31221414791237632023479 1 360 1
## 2 do_31221200484895129621378 0 179 1
## 3 do_31221200160445235221376 1 118 1
## 4 do_31221199846513868821372 1 51 1
## 5 do_31221196333227212811383 1 49 1
## 6 do_31221156833106329611243 1 33 1
The x-axis represents the user IDs and the y-axis the question IDs. The value of each cell indicates the number of times the user has attempted the question.
The x-axis represents the user IDs and the y-axis the question IDs. The value of each cell indicates the average of all scores that the user has obtained for the question.
Note: A user who has not attempted a question is also represented as 0.
Correlation plot for the 14 questions. Both axes represent the question IDs.
Parallel analysis retains the factors whose (ordered) eigenvalues in the real data exceed the corresponding eigenvalues of simulated data; the cut-off is the smallest index at which the simulated eigenvalue exceeds the real one. Kaiser's rule of thumb, by contrast, treats components with an eigenvalue below 1 as not significant.
## Parallel analysis suggests that the number of factors = 1 and the number of components = NA
## [1] "From the figure, we choose the number of factors as 2"
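The two retention rules can give different answers, which a small sketch makes concrete. The following is a minimal Python illustration (the report itself was produced in R); the function names and the eigenvalues are hypothetical and are not taken from this analysis.

```python
def factors_by_parallel_analysis(real_eigs, simulated_eigs):
    """Count leading factors whose real eigenvalue exceeds the simulated one.

    Both lists are assumed to be sorted in descending order.
    """
    count = 0
    for real, simulated in zip(real_eigs, simulated_eigs):
        if real <= simulated:
            break  # first index where the simulated data catches up
        count += 1
    return count


def factors_by_kaiser(real_eigs, threshold=1.0):
    """Count eigenvalues above the Kaiser threshold (1 by convention)."""
    return sum(1 for eig in real_eigs if eig > threshold)


# Hypothetical eigenvalues, for illustration only.
real = [5.2, 1.2, 0.9, 0.6]
simulated = [1.5, 1.3, 1.1, 1.0]

parallel_n = factors_by_parallel_analysis(real, simulated)
kaiser_n = factors_by_kaiser(real)
print(parallel_n, kaiser_n)
```

With these made-up numbers the second real eigenvalue is above 1 but below its simulated counterpart, so Kaiser's rule keeps two factors while parallel analysis keeps one.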
##
## Loadings:
## MR1 MR2
## [1,] 0.617 0.286
## [2,] 0.561 -0.405
## [3,] 0.485
## [4,] 0.812
## [5,] 0.214 0.657
## [6,] 0.763 -0.194
## [7,] -0.122 0.575
## [8,] 0.424
## [9,] 0.373 0.859
## [10,] 0.879 0.186
## [11,] 0.657 0.174
## [12,] 0.969
## [13,] 0.886
## [14,] 0.803 0.431
##
## MR1 MR2
## SS loadings 5.467 2.713
## Proportion Var 0.390 0.194
## Cumulative Var 0.390 0.584
Force-directed network plot of the factor loadings.
In network analysis, we use the correlation between questions (computed from the user-question responses) to view them as a connected graph. From the graph, network parameters such as connectedness, PageRank, and degree centrality can be computed.
## [1] "1 - do_31221414791237632023479" "2 - do_31221200484895129621378"
## [3] "3 - do_31221200160445235221376" "4 - do_31221199846513868821372"
## [5] "5 - do_31221196333227212811383" "6 - do_31221156833106329611243"
## [7] "7 - do_31221197647432089621363" "8 - do_31221197823005491221366"
## [9] "9 - do_31221156454046105611241" "10 - do_31221197474628403211385"
## [11] "11 - do_31221195964240691221357" "12 - do_31221195817094348821353"
## [13] "13 - do_31221156477276979221230" "14 - do_31221196141780992021359"
## [1] "Page rank of each question:"
## [1] 0.07076176 0.06180325 0.06952259 0.06495109 0.06499689 0.06849859
## [7] 0.06407950 0.06665999 0.07727750 0.08185294 0.07230655 0.08124272
## [13] 0.07464336 0.08140327
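For readers unfamiliar with PageRank on a weighted question graph, here is a minimal power-iteration sketch in Python (the report itself was produced in R). The 3-node weight matrix, the damping factor of 0.85, and the function name are assumptions for illustration; the sketch also assumes every node has at least one edge.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted adjacency matrix.

    weights[i][j] is the (non-negative) edge weight from node i to node j.
    Assumes every node has at least one outgoing edge.
    """
    n = len(weights)
    out_strength = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for j in range(n):
            # Each neighbour passes on its rank in proportion to edge weight.
            incoming = sum(
                rank[i] * weights[i][j] / out_strength[i]
                for i in range(n)
                if out_strength[i] > 0
            )
            new_rank.append((1.0 - damping) / n + damping * incoming)
        rank = new_rank
    return rank


# Hypothetical symmetric weights (e.g. absolute correlations) for 3 questions.
example_weights = [
    [0.0, 0.8, 0.2],
    [0.8, 0.0, 0.5],
    [0.2, 0.5, 0.0],
]
ranks = pagerank(example_weights)
print(ranks)
```

The ranks sum to 1, and the node with the largest total edge weight (the middle one here) receives the highest rank.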
This network is a fully connected graph. To calculate other network parameters we reduce the network density to 40%.
## [1] "0.395604395604396"
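With 14 questions there are 91 possible undirected edges, and keeping 36 of them gives a density of 36/91 ≈ 0.3956, which matches the value printed above. A minimal Python sketch of one way to thin a correlation graph to a target density follows; selecting edges by absolute correlation is an assumption, as are the function names and the small example matrix.

```python
from itertools import combinations


def graph_density(n_nodes, n_edges):
    """Density of an undirected simple graph: edges / possible edges."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)


def keep_strongest_edges(corr, target_density):
    """Return the node pairs with the largest |correlation|.

    The number of pairs kept is the target density times the number of
    possible pairs, rounded down.
    """
    n = len(corr)
    pairs = sorted(
        combinations(range(n), 2),
        key=lambda pair: abs(corr[pair[0]][pair[1]]),
        reverse=True,
    )
    n_keep = int(target_density * len(pairs))
    return pairs[:n_keep]


# Hypothetical 4-question correlation matrix, for illustration only.
example_corr = [
    [1.0, 0.9, 0.1, 0.3],
    [0.9, 1.0, 0.5, 0.2],
    [0.1, 0.5, 1.0, -0.7],
    [0.3, 0.2, -0.7, 1.0],
]
kept = keep_strongest_edges(example_corr, 0.5)
print(kept)
print(round(graph_density(14, 36), 6))
```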
The edge connectivity of two vertices is the minimum number of edges that must be removed to eliminate all (directed) paths between them. The minimum edge connectivity over all pairs of vertices is calculated here:
## [1] 1
In our problem, if we assume a one-to-one correspondence between content and concept, an edge connectivity of 0 implies that the concept is covered by only a single question.
Similarly, an edge connectivity of 1 would give a logical order as to which question precedes another conceptually.
Degree centrality, measured as the absolute deviation of the node degrees from that of the most connected node in the graph:
## [1] "degree centrality : 0.450549450549451"
To interpret this value: a centrality of 1 in this question graph would mean that all the concepts covered are one hop away from one another, so no specific order would be required when asking these questions.
Connectedness is the fraction of all node pairs such that there exists an undirected path between the two nodes:
## [1] "Connectedness: 1"
A connectedness of 1 implies that every node can be reached from every other node through some undirected path, so the concepts covered in these questions provide a learning path to traverse the concept map.
Closeness is an index of the extent to which a given question has short paths to all other vertices in the graph:
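As a minimal sketch of how connectedness can be computed, the Python snippet below counts the node pairs joined by an undirected path using breadth-first search. The adjacency-dictionary representation, the function names, and both example graphs are assumptions for illustration and are not taken from the report.

```python
from collections import deque


def reachable_from(adjacency, start):
    """Set of nodes reachable from `start` (including itself) via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def connectedness(adjacency):
    """Fraction of ordered node pairs joined by an undirected path."""
    n = len(adjacency)
    if n < 2:
        return 1.0
    connected_pairs = sum(
        len(reachable_from(adjacency, node)) - 1 for node in adjacency
    )
    return connected_pairs / (n * (n - 1))


# Hypothetical graphs: a single path, and two separate components.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
split_graph = {0: [1], 1: [0], 2: [3], 3: [2]}
path_value = connectedness(path_graph)
split_value = connectedness(split_graph)
print(path_value, split_value)
```

The path graph scores 1 because every node reaches every other; the split graph scores 1/3 because only 4 of its 12 ordered pairs are connected.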
## [1] 0.5909091 0.4814815 0.5416667 0.5000000 0.5416667 0.5652174 0.4482759
## [8] 0.5652174 0.7647059 0.8666667 0.6190476 0.7647059 0.6842105 0.7647059
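The printed values are consistent with the usual definition of closeness, (n - 1) divided by the sum of shortest-path distances to all other nodes: with n = 14, for example, 13/15 ≈ 0.8667 and 13/22 ≈ 0.5909. A minimal Python sketch of that definition on an unweighted graph follows; the function names and the star-shaped example graph are hypothetical, and the sketch assumes the graph is connected.

```python
from collections import deque


def shortest_distances(adjacency, start):
    """Hop counts from `start` to every reachable node, via BFS."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def closeness(adjacency, node):
    """(n - 1) / sum of distances; assumes a connected graph with n >= 2."""
    distance = shortest_distances(adjacency, node)
    total = sum(d for other, d in distance.items() if other != node)
    return (len(adjacency) - 1) / total


# Hypothetical star graph: node 0 is linked to nodes 1, 2 and 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
centre_closeness = closeness(star, 0)
leaf_closeness = closeness(star, 1)
print(centre_closeness, leaf_closeness)
```

The centre is one hop from everything and scores 1; each leaf needs two hops to reach the other leaves and scores 3/5.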
Many other network parameters can be computed using these techniques. Unlike the one-to-one correspondence between content and concept assumed here, in reality a piece of content is associated with multiple concepts, and these relations would allow many more inferences from network analysis.
Analysis of multivariate dichotomous data using latent trait models under the Item Response Theory approach.
#### 2PL model: two-parameter logistic model
Modeling the response matrix using a 2 parameter model, we get:
## Dffclt Dscrmn
## Item 1 -0.8430558 2.622835
## Item 2 -0.3055962 1.823897
## Item 3 -0.7490030 2.733768
## Item 4 -0.7583207 1.957374
## Item 5 -0.6080094 1.997835
## Item 6 -1.0267144 2.480781
## Item 7 -0.7050109 2.019568
## Item 8 -0.9326956 1.984742
## Item 9 -0.8614164 4.476312
## Item 10 -0.7803617 22.250036
## Item 11 -0.7935644 3.323147
## Item 12 -1.0229813 6.133005
## Item 13 -0.9232772 3.513819
## Item 14 -1.1000758 5.251052
As seen from the response matrix, most students answered the questions correctly, implying that the questions are easy. The negative values of the difficulty parameter indicate this. A difficulty of -0.84 means that a student whose ability is 0.84 standard deviations below the mean has a 50% chance of answering the question correctly; only students below that level are more likely than not to get it wrong.
## Questions in decreasing order of difficulty:
## [1] "do_31221200484895129621378" "do_31221196333227212811383"
## [3] "do_31221197647432089621363" "do_31221200160445235221376"
## [5] "do_31221199846513868821372" "do_31221197474628403211385"
## [7] "do_31221195964240691221357" "do_31221414791237632023479"
## [9] "do_31221156454046105611241" "do_31221156477276979221230"
## [11] "do_31221197823005491221366" "do_31221195817094348821353"
## [13] "do_31221156833106329611243" "do_31221196141780992021359"
Similarly, ‘Dscrmn’ gives the discrimination of each question. This appears as the slope of the item characteristic curve (ICC) below.
To understand what the latent parameters are, we look at the item characteristic curves. These are logistic functions, one for each of the 14 questions. Each one shows the likelihood of scoring 1 as a function of the respondent’s score on the underlying latent variable, called ability.
The slope indicates how well the question is able to discriminate among the respondents. A flat slope tells us that the probability of scoring 1 changes slowly with increases or decreases in the latent variable, ability.
## The point at which the curve touches the 50% line denotes the 'ability' of a student to score 1
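To make the roles of difficulty and discrimination concrete, here is a minimal Python sketch of the standard 2PL item characteristic curve, P(score = 1) = 1 / (1 + exp(-a(θ - b))), evaluated with the Item 1 and Item 2 estimates from the table above. The function name is hypothetical, and the sketch assumes the table uses this difficulty/discrimination parameterisation.

```python
import math


def icc_2pl(theta, difficulty, discrimination):
    """Probability of scoring 1 under the 2PL model at ability `theta`."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Estimates copied from the 2PL table (Dffclt, Dscrmn).
item1 = {"difficulty": -0.8430558, "discrimination": 2.622835}
item2 = {"difficulty": -0.3055962, "discrimination": 1.823897}

# At theta equal to the difficulty, the probability is exactly 0.5.
p_at_difficulty = icc_2pl(item1["difficulty"], **item1)

# For a student of average ability (theta = 0), the easier item (Item 1)
# gives a higher probability of a correct answer.
p_item1_average = icc_2pl(0.0, **item1)
p_item2_average = icc_2pl(0.0, **item2)
print(p_at_difficulty, p_item1_average, p_item2_average)
```

The slope of the curve at θ = b equals a/4, which is why a larger discrimination produces a steeper curve around the 50% point.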
##
## Item-Fit Statistics and P-values
##
## Call:
## ltm(formula = score_matrix_2PL ~ z1)
##
## Alternative: Items do not fit the model
## Ability Categories: 10
##
## X^2 Pr(>X^2)
## It 1 15.4002 0.0518
## It 2 5.2520 0.7303
## It 3 13.4060 0.0986
## It 4 8.5278 0.3837
## It 5 6.2458 0.6197
## It 6 16.8677 0.0315
## It 7 6.1478 0.6307
## It 8 10.3436 0.2417
## It 9 61.8880 <0.0001
## It 10 1.8275 0.9859
## It 11 27.5651 0.0006
## It 12 179.3043 <0.0001
## It 13 20.3487 0.0091
## It 14 59.4562 <0.0001
The amount by which the ICC sits above zero probability at low ability is the guessing parameter. We capture this using a 3PL model.
## Gussng Dffclt Dscrmn
## Item 1 2.136226e-102 -0.830578781 2.716531
## Item 2 2.548204e-01 0.004531795 66.335969
## Item 3 4.745745e-112 -0.748004428 2.769426
## Item 4 3.006904e-119 -0.761245287 1.974536
## Item 5 1.766010e-153 -0.619369749 1.997891
## Item 6 6.678032e-80 -0.970312754 2.788291
## Item 7 2.920604e-120 -0.709620929 2.038679
## Item 8 7.842681e-99 -0.912729156 2.071476
## Item 9 2.379656e-101 -0.850077129 4.573664
## Item 10 4.150836e-96 -0.683747982 369.865463
## Item 11 2.453560e-108 -0.760677914 3.835605
## Item 12 4.899897e-85 -0.720655204 49.500613
## Item 13 4.637113e-62 -0.898100511 3.730905
## Item 14 6.832884e-90 -0.780643780 22.962596
##
## Likelihood Ratio Table
## AIC BIC log.Lik LRT df p.value
## model 1860.74 1950.14 -902.37
## model2 1906.10 2040.20 -911.05 -17.36 14 1
## [1] "Lower AIC: 2PL model is chosen"
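The comparison above can be reproduced from the log-likelihoods alone. The Python sketch below shows the standard 3PL curve, P = c + (1 - c) / (1 + exp(-a(θ - b))), and recomputes AIC and BIC; the function names are hypothetical, the parameter counts assume two parameters per item for 2PL and three for 3PL, and the sample size of 180 assumes one response row per unique uid.

```python
import math


def icc_3pl(theta, guessing, difficulty, discrimination):
    """3PL curve: the guessing parameter is the lower asymptote."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


def aic(log_lik, n_params):
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_lik


n_items = 14
n_students = 180  # assumption: one row per unique uid
ll_2pl, ll_3pl = -902.37, -911.05  # log.Lik column of the table

aic_2pl, aic_3pl = aic(ll_2pl, 2 * n_items), aic(ll_3pl, 3 * n_items)
bic_2pl = bic(ll_2pl, 2 * n_items, n_students)
bic_3pl = bic(ll_3pl, 3 * n_items, n_students)
lrt = 2 * (ll_3pl - ll_2pl)

# Item 2 estimates from the 3PL table: a very low-ability student still
# succeeds with probability close to the guessing parameter.
p_low_ability = icc_3pl(-3.0, 0.2548204, 0.004531795, 66.335969)
print(aic_2pl, aic_3pl, bic_2pl, bic_3pl, lrt, p_low_ability)
```

Under these assumptions the recomputed values agree with the table to two decimals. Note that the likelihood-ratio statistic is negative: the 3PL fit reached a lower log-likelihood than the 2PL model nested inside it, which suggests the 3PL estimation did not converge to a good optimum (the extreme discrimination and near-zero guessing estimates point the same way), so the comparison should be read with caution.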
The response matrix can be used to model student responses. The ROC (FPR-TPR) curve for the model is shown below.
Treating the problem as a classification problem also lets us build a content recommendation for each user. Suppose we want to implement a recommendation engine (RE) that maximises the performance of a user on each question; we train it using the observed user-question interaction scores.
For each user, we obtain a prescribed order in which to render the questions such that the score obtained is maximised. This can also be extended to questions that no user has played yet.
## For user with uid:'c64c28cf-60e7-4b5d-b435-cf371f0b7f44' the content is recommended in the following order:
## [1] "do_31221200160445235221376" "do_31221197474628403211385"
## [3] "do_31221156477276979221230" "do_31221195817094348821353"
## [5] "do_31221200484895129621378" "do_31221197823005491221366"
## [7] "do_31221196141780992021359" "do_31221197647432089621363"
## [9] "do_31221156454046105611241" "do_31221199846513868821372"
## [11] "do_31221195964240691221357" "do_31221196333227212811383"
## [13] "do_31221414791237632023479" "do_31221156833106329611243"
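One simple way to turn per-question predictions into an ordering like the one above is to sort the questions by the predicted score. The Python sketch below assumes that the recommendation ranks questions from highest to lowest predicted probability of a correct answer; that ranking rule, the function name, the question IDs, and the scores are all hypothetical and are not taken from the report's model.

```python
def recommend_order(predicted_scores):
    """Return question IDs sorted by predicted score, highest first.

    `predicted_scores` maps a question ID to the model's predicted
    probability that this user answers it correctly.
    """
    return sorted(
        predicted_scores,
        key=lambda qid: predicted_scores[qid],
        reverse=True,
    )


# Hypothetical predictions for a single user.
example_predictions = {
    "question_a": 0.62,
    "question_b": 0.91,
    "question_c": 0.35,
}
order = recommend_order(example_predictions)
print(order)  # ['question_b', 'question_a', 'question_c']
```

A question that no user has played could be ranked the same way, provided the model can produce a prediction for it from its content attributes.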
Further, from the content IDs provided, we could obtain content attributes, compute the corresponding vectors, other ML tags, and usage parameters, and use them to enrich the model.