El_assesment

Exploratory analysis:

## Number of unique handles: 176

## Number of unique uid: 180

## Number of unique qid: 14

## head(data.df)

##                 handle                                  uid
## 1 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 2 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 3 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 4 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 5 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
## 6 Otteri sandhoshkumar a78ed535-eadf-4b58-a991-afabe103e229
##                          qid score length Count
## 1 do_31221414791237632023479     1    360     1
## 2 do_31221200484895129621378     0    179     1
## 3 do_31221200160445235221376     1    118     1
## 4 do_31221199846513868821372     1     51     1
## 5 do_31221196333227212811383     1     49     1
## 6 do_31221156833106329611243     1     33     1

User-Question interaction matrix

The x-axis represents the user IDs and the y-axis the question IDs. The value of each cell is the number of times the user has attempted the question.

User-Question response mean score

The x-axis represents the user IDs and the y-axis the question IDs. The value of each cell is the average of all scores that the user has obtained for the question.

Note: A question that the user has not attempted is also represented as 0, so a 0 can mean either no attempt or an average score of 0.
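As a rough illustration of how the two matrices above can be derived from rows like those in `head(data.df)`, here is a minimal Python sketch (it is not the R code behind this report). The function name `build_matrices` and the sample records are assumptions made for the example; unattempted cells are filled with 0, as in the note above.

```python
def build_matrices(records):
    """Build attempt-count and mean-score matrices from (uid, qid, score) rows.

    Returns two nested dicts indexed as matrix[uid][qid].
    Cells with no attempt are filled with 0, mirroring the report's convention.
    """
    records = list(records)
    uids = sorted({uid for uid, _, _ in records})
    qids = sorted({qid for _, qid, _ in records})

    counts = {u: {q: 0 for q in qids} for u in uids}
    totals = {u: {q: 0.0 for q in qids} for u in uids}
    for uid, qid, score in records:
        counts[uid][qid] += 1
        totals[uid][qid] += score

    means = {
        u: {q: (totals[u][q] / counts[u][q] if counts[u][q] else 0.0) for q in qids}
        for u in uids
    }
    return counts, means


# Hypothetical sample: user u1 attempts q1 twice, u2 never attempts q2.
sample = [("u1", "q1", 1), ("u1", "q1", 0), ("u1", "q2", 1), ("u2", "q1", 1)]
counts, means = build_matrices(sample)
```

Here `counts["u1"]["q1"]` is the number of attempts and `means["u1"]["q1"]` the average score over those attempts.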

Tetrachoric correlation and item correlations

Correlation plot for the 14 questions. The axes represent the question id.

Distribution of score by question

Factor analysis:

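Parallel analysis compares the ordered eigenvalues of the real data with those of simulated data. The number of factors is read from the smallest index at which the simulated eigenvalue exceeds the corresponding eigenvalue of the real data: the factors before that index are retained. Alternatively, Kaiser's rule of thumb treats components with an eigenvalue below 1 as not significant.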
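The two retention rules can be sketched as follows, assuming the eigenvalues are already computed and sorted in decreasing order. The eigenvalues below are hypothetical and are not taken from this report; the function names are illustrative only.

```python
def parallel_analysis_count(real, simulated):
    """Count leading factors whose real eigenvalue exceeds the simulated one.

    Counting stops at the first index where the simulated eigenvalue
    is at least as large as the real one.
    """
    count = 0
    for real_value, simulated_value in zip(real, simulated):
        if real_value <= simulated_value:
            break
        count += 1
    return count


def kaiser_count(real):
    """Count components with an eigenvalue above 1 (Kaiser's rule of thumb)."""
    return sum(1 for value in real if value > 1.0)


# Hypothetical eigenvalues, sorted in decreasing order.
real_eigs = [5.2, 1.4, 0.9, 0.6]
sim_eigs = [1.6, 1.5, 1.2, 1.0]

n_parallel = parallel_analysis_count(real_eigs, sim_eigs)
n_kaiser = kaiser_count(real_eigs)
```

With these made-up values the two rules disagree, which shows why the choice of rule needs to be stated alongside the chosen number of factors.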

## Parallel analysis suggests that the number of factors =  1  and the number of components =  NA

## [1] "From the figure, we choose the number of factors as 2"

## 
## Loadings:
##       MR1    MR2   
##  [1,]  0.617  0.286
##  [2,]  0.561 -0.405
##  [3,]  0.485       
##  [4,]         0.812
##  [5,]  0.214  0.657
##  [6,]  0.763 -0.194
##  [7,] -0.122  0.575
##  [8,]  0.424       
##  [9,]  0.373  0.859
## [10,]  0.879  0.186
## [11,]  0.657  0.174
## [12,]  0.969       
## [13,]  0.886       
## [14,]  0.803  0.431
## 
##                  MR1   MR2
## SS loadings    5.467 2.713
## Proportion Var 0.390 0.194
## Cumulative Var 0.390 0.584

Visualize the factor loadings as a graph

Plot of the force-directed network of factor loadings

Item parameters based on network analysis

In network analysis, we use the correlations between questions (computed from the user-question responses) to view them as a connected graph. From the graph, different network parameters such as connectedness, PageRank and degree centrality can be computed.

##  [1] "1 - do_31221414791237632023479"  "2 - do_31221200484895129621378" 
##  [3] "3 - do_31221200160445235221376"  "4 - do_31221199846513868821372" 
##  [5] "5 - do_31221196333227212811383"  "6 - do_31221156833106329611243" 
##  [7] "7 - do_31221197647432089621363"  "8 - do_31221197823005491221366" 
##  [9] "9 - do_31221156454046105611241"  "10 - do_31221197474628403211385"
## [11] "11 - do_31221195964240691221357" "12 - do_31221195817094348821353"
## [13] "13 - do_31221156477276979221230" "14 - do_31221196141780992021359"

Google PageRank for the specified vertices (questions)

## [1] "Page rank of each question:"

##  [1] 0.07076176 0.06180325 0.06952259 0.06495109 0.06499689 0.06849859
##  [7] 0.06407950 0.06665999 0.07727750 0.08185294 0.07230655 0.08124272
## [13] 0.07464336 0.08140327
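For intuition about how such ranks arise, here is a minimal power-iteration sketch of PageRank on a weighted undirected graph. It is not the library routine that produced the values above. It assumes edge weights are non-negative (for example absolute correlations) and that every node has at least one edge; the 3-node weight matrix is hypothetical.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """PageRank by power iteration on a symmetric weight matrix.

    weights: list of lists with non-negative entries and a zero diagonal.
    Assumes every node has at least one edge (no dangling nodes).
    """
    n = len(weights)
    strength = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for j in range(n):
            # Rank flowing into j, split by each neighbour's share of weight.
            incoming = sum(
                rank[i] * weights[i][j] / strength[i]
                for i in range(n)
                if strength[i] > 0
            )
            new_rank.append((1.0 - damping) / n + damping * incoming)
        rank = new_rank
    return rank


# Hypothetical absolute correlations between three questions.
example_weights = [
    [0.0, 0.8, 0.2],
    [0.8, 0.0, 0.5],
    [0.2, 0.5, 0.0],
]
ranks = pagerank(example_weights)
```

The ranks sum to 1, and the question with the strongest total connection to the others (index 1 here) receives the largest share.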

This network is a fully connected graph. To calculate other network parameters we reduce the network density to 40%.

## [1] "0.395604395604396"
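With 14 questions there are 14 × 13 / 2 = 91 possible edges, so the printed density is consistent with 36 retained edges (36/91). The sketch below shows that arithmetic and one possible way to thin a graph, assuming the weakest correlations are dropped first; the report does not state which edges were removed, and the function names and small edge list are hypothetical.

```python
def possible_edges(n_nodes):
    """Number of distinct undirected edges among n_nodes vertices."""
    return n_nodes * (n_nodes - 1) // 2


def density(n_nodes, n_edges):
    """Fraction of possible edges that are present."""
    return n_edges / possible_edges(n_nodes)


def strongest_edges(weighted_edges, n_nodes, target_density):
    """Keep the highest-|weight| edges that fit under target_density.

    weighted_edges: list of (u, v, weight) tuples.
    """
    keep = int(target_density * possible_edges(n_nodes))
    ranked = sorted(weighted_edges, key=lambda edge: abs(edge[2]), reverse=True)
    return ranked[:keep]


report_density = density(14, 36)

# Hypothetical 4-node graph with all 6 edges present.
toy_edges = [
    (0, 1, 0.9), (0, 2, 0.1), (0, 3, -0.7),
    (1, 2, 0.4), (1, 3, 0.2), (2, 3, 0.6),
]
kept = strongest_edges(toy_edges, n_nodes=4, target_density=0.5)
```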

Edge connectivity of a pair of vertices:

The minimum number of edges that must be removed to eliminate all (directed) paths between them. The minimum edge connectivity over all pairs of vertices is calculated here:

## [1] 1

In our problem, if we suppose a one-to-one correspondence between Content and Concept, an edge connectivity of 0 implies that the concept is covered only by a single question.

Similarly, an edge connectivity of 1 would give a logical order as to which question precedes another conceptually.

Normalised Centrality

Sum of the absolute deviations of each node's degree from that of the most connected node in the graph, normalised by its theoretical maximum

## [1] "degree centrality : 0.450549450549451"

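To interpret this value, a centrality of 1 in this question graph would mean that a single question is directly linked to every other question while the others are linked only to it, so that question acts as a hub. A value of 0 would mean that all questions are equally connected, in which case no specific order is required to ask them.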
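The sum-of-deviations idea can be sketched as below, using Freeman's normaliser (n − 1)(n − 2) for simple undirected graphs. This is an assumption: the normaliser behind the value reported above may differ. The two toy graphs are hypothetical.

```python
def degree_centralization(edges, n_nodes):
    """Freeman degree centralization for a simple undirected graph.

    edges: list of (u, v) pairs with 0-based node indices.
    Returns the sum of (max degree - degree) over nodes,
    divided by (n - 1) * (n - 2).
    """
    degree = [0] * n_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    max_degree = max(degree)
    total_deviation = sum(max_degree - d for d in degree)
    return total_deviation / ((n_nodes - 1) * (n_nodes - 2))


# Star: node 0 is linked to every other node.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
# Ring: every node has exactly two neighbours.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

star_value = degree_centralization(star, 5)
ring_value = degree_centralization(ring, 5)
```

Under this normaliser the star reaches the maximum and the ring the minimum, which brackets the range in which intermediate values fall.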

Connectedness

Fraction of all pairs of nodes such that there exists an undirected path between the two nodes

## [1] "Connectedness: 1"

A connectedness of 1 implies that every question can be reached from every other question through some path. So all the concepts covered in these questions provide a learning path to traverse the concept map.
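A minimal sketch of this measure, taken as the share of node pairs joined by at least one undirected path, is given below. It uses a breadth-first search per component; the function name and toy graphs are hypothetical.

```python
from collections import deque


def connectedness(edges, n_nodes):
    """Fraction of node pairs that are joined by some undirected path."""
    adjacency = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    seen = [False] * n_nodes
    connected_pairs = 0
    for start in range(n_nodes):
        if seen[start]:
            continue
        # Breadth-first search to measure the size of this component.
        size = 0
        seen[start] = True
        queue = deque([start])
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if not seen[neighbour]:
                    seen[neighbour] = True
                    queue.append(neighbour)
        connected_pairs += size * (size - 1) // 2

    return connected_pairs / (n_nodes * (n_nodes - 1) // 2)


path_graph = [(0, 1), (1, 2), (2, 3)]   # one component
split_graph = [(0, 1), (2, 3)]          # two components of two nodes each

path_value = connectedness(path_graph, 4)
split_value = connectedness(split_graph, 4)
```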

Closeness

An index of the extent to which a given question has short paths to all other vertices in the graph

##  [1] 0.5909091 0.4814815 0.5416667 0.5000000 0.5416667 0.5652174 0.4482759
##  [8] 0.5652174 0.7647059 0.8666667 0.6190476 0.7647059 0.6842105 0.7647059
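One common definition, normalised closeness, is (n − 1) divided by the sum of shortest-path lengths from a node to all others. With n = 14, a first value of 0.5909091 would correspond to 13/22 under that definition, though the report does not state which variant was used. The sketch below assumes an unweighted, connected graph; the toy path graph is hypothetical.

```python
from collections import deque


def closeness(edges, n_nodes):
    """Normalised closeness for each node of an unweighted, connected graph."""
    adjacency = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    values = []
    for source in range(n_nodes):
        # Breadth-first search gives hop distances from the source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total = sum(distance.values())
        values.append((len(distance) - 1) / total if total else 0.0)
    return values


# Path 0 - 1 - 2: the middle node is closest to everything.
path_closeness = closeness([(0, 1), (1, 2)], 3)
```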

A number of other network parameters can be found using these techniques. Unlike the one-to-one correspondence of Content and Concept that we assumed here, in reality a Content is associated with multiple Concepts, and these relations would provide many more inferences from network analysis.

IRT

Analysis of multivariate dichotomous data using latent trait models under the Item Response Theory approach.

2PL model: two-parameter logistic model

Modeling the response matrix using a 2 parameter model, we get:

##             Dffclt    Dscrmn
## Item 1  -0.8430558  2.622835
## Item 2  -0.3055962  1.823897
## Item 3  -0.7490030  2.733768
## Item 4  -0.7583207  1.957374
## Item 5  -0.6080094  1.997835
## Item 6  -1.0267144  2.480781
## Item 7  -0.7050109  2.019568
## Item 8  -0.9326956  1.984742
## Item 9  -0.8614164  4.476312
## Item 10 -0.7803617 22.250036
## Item 11 -0.7935644  3.323147
## Item 12 -1.0229813  6.133005
## Item 13 -0.9232772  3.513819
## Item 14 -1.1000758  5.251052

As seen from the response matrix, most of the students have answered the questions correctly, which suggests the questions are easy. The negative values of the difficulty parameter indicate this. A difficulty of -0.84 means that a student whose ability is 0.84 standard deviations below the mean has a 50% chance of answering the question correctly; students above that ability are more likely than not to get it right.

## Questions in decreasing order of difficulty (hardest first):

##  [1] "do_31221200484895129621378" "do_31221196333227212811383"
##  [3] "do_31221197647432089621363" "do_31221200160445235221376"
##  [5] "do_31221199846513868821372" "do_31221197474628403211385"
##  [7] "do_31221195964240691221357" "do_31221414791237632023479"
##  [9] "do_31221156454046105611241" "do_31221156477276979221230"
## [11] "do_31221197823005491221366" "do_31221195817094348821353"
## [13] "do_31221156833106329611243" "do_31221196141780992021359"

Similarly, ‘Dscrmn’ gives the discriminability of each question. This is seen as the slope of the ICC curves below.

To understand what the latent parameters are, we look at the item characteristic curves. These are logistic functions, one for each of the 14 questions. Each one shows the likelihood of scoring 1 as a function of the respondent’s score on the underlying latent variable, called ability.

The slope indicates how well the question is able to discriminate among the respondents. A flat slope tells us that the probability of scoring 1 changes slowly with increases or decreases in the latent variable, ability.
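The curve described above can be written out directly. The sketch below assumes the usual 2PL form P(θ) = 1 / (1 + exp(−a(θ − b))), with b the difficulty (`Dffclt`) and a the discrimination (`Dscrmn`), and plugs in the Item 1 and Item 2 estimates from the table above. The function name is illustrative.

```python
import math


def icc_2pl(theta, difficulty, discrimination):
    """Probability of scoring 1 under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Estimates for Item 1 and Item 2 from the 2PL table.
item1 = {"difficulty": -0.8430558, "discrimination": 2.622835}
item2 = {"difficulty": -0.3055962, "discrimination": 1.823897}

# At ability equal to the item's difficulty the probability is exactly 0.5.
p_at_difficulty = icc_2pl(item1["difficulty"], **item1)

# An average student (theta = 0) on each item.
p_item1_average = icc_2pl(0.0, **item1)
p_item2_average = icc_2pl(0.0, **item2)
```

Because both difficulties are negative, an average student has a better-than-even chance on either item, and the chance is higher on Item 1, which is both easier and more discriminating.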

## The point at which the curve touches the 50% line denotes the 'ability' at which a student has an even chance of scoring 1

Item information

## 
## Item-Fit Statistics and P-values
## 
## Call:
## ltm(formula = score_matrix_2PL ~ z1)
## 
## Alternative: Items do not fit the model
## Ability Categories: 10
## 
##            X^2 Pr(>X^2)
## It 1   15.4002   0.0518
## It 2    5.2520   0.7303
## It 3   13.4060   0.0986
## It 4    8.5278   0.3837
## It 5    6.2458   0.6197
## It 6   16.8677   0.0315
## It 7    6.1478   0.6307
## It 8   10.3436   0.2417
## It 9   61.8880  <0.0001
## It 10   1.8275   0.9859
## It 11  27.5651   0.0006
## It 12 179.3043  <0.0001
## It 13  20.3487   0.0091
## It 14  59.4562  <0.0001

The height at which the ICC levels off above probability 0 at low ability is the guessing parameter. We capture this using a 3PL model.
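In the usual 3PL form the guessing parameter c lifts the lower end of the curve: P(θ) = c + (1 − c) / (1 + exp(−a(θ − b))). A minimal sketch follows; the parameter values are hypothetical, chosen only to make the lower asymptote visible.

```python
import math


def icc_3pl(theta, guessing, difficulty, discrimination):
    """Probability of scoring 1 under a three-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# Hypothetical item: 25% chance of guessing correctly.
c, b, a = 0.25, 0.0, 2.0

p_low_ability = icc_3pl(-10.0, c, b, a)   # approaches the guessing level
p_at_difficulty = icc_3pl(b, c, b, a)     # halfway between c and 1
```

At θ = b the probability is c + (1 − c)/2 rather than 0.5, so with a non-zero guessing parameter the 50% line no longer marks the item's difficulty.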

3PL model

##                Gussng       Dffclt     Dscrmn
## Item 1  2.136226e-102 -0.830578781   2.716531
## Item 2   2.548204e-01  0.004531795  66.335969
## Item 3  4.745745e-112 -0.748004428   2.769426
## Item 4  3.006904e-119 -0.761245287   1.974536
## Item 5  1.766010e-153 -0.619369749   1.997891
## Item 6   6.678032e-80 -0.970312754   2.788291
## Item 7  2.920604e-120 -0.709620929   2.038679
## Item 8   7.842681e-99 -0.912729156   2.071476
## Item 9  2.379656e-101 -0.850077129   4.573664
## Item 10  4.150836e-96 -0.683747982 369.865463
## Item 11 2.453560e-108 -0.760677914   3.835605
## Item 12  4.899897e-85 -0.720655204  49.500613
## Item 13  4.637113e-62 -0.898100511   3.730905
## Item 14  6.832884e-90 -0.780643780  22.962596

## 
##  Likelihood Ratio Table
##            AIC     BIC log.Lik    LRT df p.value
## model  1860.74 1950.14 -902.37                  
## model2 1906.10 2040.20 -911.05 -17.36 14       1

## [1] "The 2PL model has the lower AIC, so it is chosen"
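The figures in the likelihood ratio table can be checked by hand from AIC = 2k − 2·logLik, BIC = k·ln(n) − 2·logLik and LRT = −2·(logLik of the smaller model − logLik of the larger model). The sketch below assumes k = 28 parameters for the 2PL model (14 items × 2), k = 42 for the 3PL model (14 items × 3) and n = 180 respondents (the number of unique uid); these counts are inferred, not stated in the output.

```python
import math


def aic(n_params, log_lik):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_lik


def bic(n_params, log_lik, n_obs):
    """Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * log_lik


def likelihood_ratio(log_lik_small, log_lik_large):
    """Likelihood ratio statistic of the larger model against the smaller."""
    return -2 * (log_lik_small - log_lik_large)


N_RESPONDENTS = 180                        # assumed: unique uid count
LOGLIK_2PL, LOGLIK_3PL = -902.37, -911.05  # from the table
K_2PL, K_3PL = 28, 42                      # assumed: 14 items x 2 and x 3

aic_2pl = aic(K_2PL, LOGLIK_2PL)
aic_3pl = aic(K_3PL, LOGLIK_3PL)
bic_2pl = bic(K_2PL, LOGLIK_2PL, N_RESPONDENTS)
bic_3pl = bic(K_3PL, LOGLIK_3PL, N_RESPONDENTS)
lrt = likelihood_ratio(LOGLIK_2PL, LOGLIK_3PL)
```

Under these assumptions the values agree with the table to two decimals. The negative LRT reflects that the fitted 3PL log-likelihood is lower than the 2PL one, so the extra guessing parameters did not improve the fit here.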

As a classification problem: LibFM

The response matrix can be used to model the student response. The FPR-TPR curve for the model is shown below.
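For reference, an FPR-TPR (ROC) curve is built by sweeping a threshold down through the predicted scores and recording the false and true positive rates at each step. The sketch below is a generic illustration, not the LibFM evaluation itself; the labels and scores are hypothetical, and tied scores are not given special treatment.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, lowering the threshold one example at a time.

    labels: 0/1 observed scores; scores: predicted probabilities.
    Assumes both classes are present and ignores ties between scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical observed scores and model predictions.
observed = [1, 1, 0, 1, 0]
predicted = [0.9, 0.8, 0.7, 0.6, 0.2]

curve = roc_points(observed, predicted)
area = auc(curve)
```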

Treating the problem as a classification problem also lets us build a content recommendation for the user. Suppose we want to implement a recommendation engine (RE) that maximises the performance of a user on each question; we train it using the observed user-question interaction scores.

For each user we obtain a prescribed order in which to render the questions such that the score obtained is maximum. This can also be extended to questions not played by any user.

## For user with uid:'c64c28cf-60e7-4b5d-b435-cf371f0b7f44' the content is recommended in the following order:

##  [1] "do_31221200160445235221376" "do_31221197474628403211385"
##  [3] "do_31221156477276979221230" "do_31221195817094348821353"
##  [5] "do_31221200484895129621378" "do_31221197823005491221366"
##  [7] "do_31221196141780992021359" "do_31221197647432089621363"
##  [9] "do_31221156454046105611241" "do_31221199846513868821372"
## [11] "do_31221195964240691221357" "do_31221196333227212811383"
## [13] "do_31221414791237632023479" "do_31221156833106329611243"
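One simple way to turn per-question predictions into such an order is to sort the questions by predicted probability of scoring 1, highest first. This is an assumption about the ranking step, which the report does not spell out; the predicted values below are hypothetical, attached to three of the question ids listed above.

```python
def recommend_order(predicted):
    """Order question ids by predicted probability of scoring 1, highest first.

    predicted: dict mapping qid -> predicted probability for one user.
    """
    return sorted(predicted, key=lambda qid: predicted[qid], reverse=True)


# Hypothetical predictions for a single user.
predicted_scores = {
    "do_31221414791237632023479": 0.41,
    "do_31221200160445235221376": 0.93,
    "do_31221197474628403211385": 0.88,
}
order = recommend_order(predicted_scores)
```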

Further, from the content IDs provided we could obtain content attributes, calculate the corresponding vectors, other ML tags and usage parameters, and use them to enrich the model.