Wavumbuzi Peer Review Algorithm Audit
K.R.U
1 Background
The purpose of this research report is to present the results of an audit of the peer-review algorithm of the Wavumbuzi Entrepreneurship Challenge (hereafter WEC) game web app. WEC is a free, annual, six-week online challenge offered to learners in all secondary/high schools across Kenya and Rwanda. It is a gamified experiential learning process designed to equip learners with competencies to be the next generation of global leaders, change-makers and innovative thinkers. Focusing on entrepreneurship (as a widely recognised countermeasure to unemployment), the Entrepreneurship Challenge is designed to stimulate and develop the entrepreneurial mindset and 21st Century Skills of learners.
Learners are not taught. Instead, every week over a six-week period, learners receive a set of challenges, via mobile devices and computers, that stimulate them to think like entrepreneurs. Each task requires learners to apply new concepts and use their knowledge and skills to solve real-world challenges. Teachers are trained on how to guide and encourage learners to engage in and complete the Challenges.
2 Objective
The main aim of this mini-study is to answer the question “How well is Peer Review working?” using Wavumbuzi Rwanda 2021 game-play data.
Why Rwanda 2021 data?
- Most recent iteration (Kenya is one iteration behind)
- A comprehensive audit exists for the previous iteration, so we can assess improvements by comparing the two reports
3 Peer Review Overview
The Peer Review system determines the final score for a submission based on the suggested scores that reviewers award it. This means that the method we use to decide on the credibility of reviewers’ suggested scores deeply influences the efficacy of the Peer Review system.
Credibility Weighting (“CW” from here on) is a mechanism that keeps a running score of how accurate each reviewer is. The score ranges between 0 and 1.
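To make the role of CW concrete, the following is a minimal sketch of a credibility-weighted average. It assumes the final score is a CW-weighted mean of the suggested scores; this report does not state the actual aggregation rule, and the names `Review` and `weighted_final_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Review:
    suggested_score: float  # score proposed by the reviewer
    cw: float               # reviewer's credibility weight, assumed in [0, 1]


def weighted_final_score(reviews):
    """Return the CW-weighted mean of suggested scores (assumed rule)."""
    if not reviews:
        raise ValueError("at least one review is required")
    total_weight = sum(r.cw for r in reviews)
    if total_weight == 0:
        # No credible reviewer: fall back to a flat average (assumption).
        return sum(r.suggested_score for r in reviews) / len(reviews)
    return sum(r.suggested_score * r.cw for r in reviews) / total_weight


# A highly credible reviewer dominates a low-credibility one.
print(weighted_final_score([Review(100, 0.9), Review(40, 0.1)]))
```

Under this assumed rule, any instability in a reviewer's CW feeds directly into final scores, which is why the behaviour of CW over time matters for the audit.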
3.1 Peer Review Algorithm
The peer-review process is divided into mandatory and discretionary reviews, which are queued differently in the system.
This process aims to encourage automatic scoring based on submission quality. The following triggers are built into the system to flag discrepancies and/or suspicious behaviour and redirect a submission for Moderation:
- Inconsistent scoring after ten reviews
- Submission not reviewed in the peer review process for an extended period.
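The two triggers above could be expressed roughly as follows. This is only a sketch: the report does not define "inconsistent" or "extended period", so the standard-deviation cut-off and the waiting time below are hypothetical values, as is the function name `needs_moderation`.

```python
from statistics import pstdev

# Hypothetical thresholds; the report does not give the real values.
MIN_REVIEWS_FOR_CHECK = 10          # "after ten reviews", per the report
MAX_SCORE_STDEV = 50.0              # assumed cut-off for "inconsistent"
MAX_WAIT_SECONDS = 7 * 24 * 3600    # assumed meaning of "extended period"


def needs_moderation(suggested_scores, seconds_waiting):
    """Return True when either of the two triggers fires."""
    inconsistent = (
        len(suggested_scores) >= MIN_REVIEWS_FOR_CHECK
        and pstdev(suggested_scores) > MAX_SCORE_STDEV
    )
    # A submission with no reviews that has waited too long is also escalated.
    stale = len(suggested_scores) == 0 and seconds_waiting > MAX_WAIT_SECONDS
    return inconsistent or stale


print(needs_moderation([0, 200] * 5, seconds_waiting=0))   # widely split scores
print(needs_moderation([100] * 10, seconds_waiting=0))     # unanimous scores
```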
The following improvements were made to encourage good and fair behaviour from teachers and learners.
- Notifications warning users of potential bad behaviour, e.g., submitting answers too quickly or taking less than 60 seconds to review submissions.
- Introduction of remark requests, credibility weighting, and patterned behaviour.
Unlike previous challenges, the teacher’s score held equal weight to the learner’s score and thus did not override previous learner scores within the system.
- Types of inappropriate submissions:
- Plagiarism
- Irrelevant & repeated answers
- Inappropriate and explicit links included in submissions.
Regular system checks were implemented at the end of each new challenge week, with additional random checks and adjustments of scores by the systems managers.
A detailed exposition of the algorithm can be found here:
4 Problem Statement
The following CW issues were reported in previous iterations, for both Kenya and Rwanda.
4.1 Wild fluctuation in credibility weights
Previous analyses have shown that users’ CWs were incredibly volatile, swinging aggressively up and down over time. Red flag: CW cannot possibly be a good indicator of the degree to which we trust in someone’s reviews if it fluctuates so wildly for most users.
So, it was concluded that volatility in CW is undesirable, and we ought to figure out how to configure CW such that:
- It converges to some semblance of true credibility for each user;
- It provides a stable sense of the trust to assign to a reviewer, even if that trust evolves.
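One simple way to quantify the volatility discussed above is the mean absolute change between consecutive CW values for a user. This is an illustrative metric chosen for the sketch, not one the previous analyses are known to have used; `cw_volatility` and the sample histories are hypothetical.

```python
def cw_volatility(cw_history):
    """Mean absolute change between consecutive CW values for one user."""
    if len(cw_history) < 2:
        return 0.0
    steps = [abs(b - a) for a, b in zip(cw_history, cw_history[1:])]
    return sum(steps) / len(steps)


volatile = [0.2, 0.9, 0.1, 0.8]        # swings aggressively up and down
stable = [0.40, 0.45, 0.50, 0.52]      # evolves, but smoothly
print(cw_volatility(volatile) > cw_volatility(stable))  # True
```

A value near 0 indicates a CW that is converging, while a value close to the full CW range indicates the kind of swinging described in the red flag.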
4.2 Credibility weight only going down and not up
During the Kenya 2021 Challenge (August), it was reported that the CW was mainly going down and not up for most users.
It is, however, essential to note that there are cases where the CW does not change because the student submitted an acceptable ("OK") review, for which they receive half of the points increase. Nonetheless, the issue warrants investigation.
4.3 Consistency in how CWs are updated relative to the number of reviews performed
In a previous iteration, it was reported that 16% of users (~280) who made reviews had more credibility update entries than the number of reviews performed.
4.4 Moderation is pulling a lot of scores to 0
In a previous iteration, it was reported that Moderation seemed to pull a lot of scores to 0, as shown in the figure below.
5 Methodology
In light of the above, we sought to review CW in terms of:
- the extent to which reviewers agree with one another [Consistency],
- the extent to which we trust each reviewer’s suggestions [Credibility], and
- the extent to which credibility (accuracy) is maintained over iterations.
5.1 Univariate analysis
Descriptive measures under the univariate setting will be applied to summarize central tendencies and distribution for metric variables.
- Categorical variables such as gender and nationality will be summarized using frequencies and percentages.
- Categorical variables derived from continuous variables will be categorized using acceptable limits derived from literature
- Continuous variables such as age, CW, and scores will be summarized using mean and corresponding SD - assuming Gaussian distribution.
- Continuous variables that are skewed will be summarized as median and the corresponding interquartile range (IQR).
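The choice between mean/SD and median/IQR described above can be sketched as follows. The helper name `summarise` and the `skewed` flag are hypothetical; how skewness is judged (visually or by a test) is left to the analyst.

```python
from statistics import mean, median, quantiles, stdev


def summarise(values, skewed):
    """Return (centre, spread): median and IQR if skewed, else mean and SD."""
    if skewed:
        q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
        return median(values), q3 - q1
    return mean(values), stdev(values)


# Hypothetical scores with a long right tail, similar in shape to final_score.
scores = [0, 40, 40, 60, 60, 146, 1000]
print(summarise(scores, skewed=True))
print(summarise(scores, skewed=False))
```

For a right-skewed variable such as the scores in the Submission dataset, the mean is pulled towards the tail, so the median and IQR give a more representative summary.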
5.2 Bivariate setting
- Association between categorical variables will be assessed using Pearson’s Chi-Square test
- Association between categorical and continuous variables will be done using a two-sample t-test, assuming Gaussian distribution for continuous variables.
- Association between categorical and continuous variables which are not normally distributed will be conducted using a two-sample Wilcoxon rank-sum test.
- Kruskal Wallis test (one-way ANOVA on ranks) will be used if the levels of the categorical variable are greater than 2.
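As an illustration of the first test in this list, the sketch below computes the Pearson chi-square statistic by hand for gender by user type, using the student and teacher counts reported in Table 1. The function name is hypothetical, and no p-value is computed; the statistic would be compared against a chi-square distribution with the printed degrees of freedom. With only four female teachers, an exact test may be more appropriate for this particular table.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: female, male. Columns: student, teacher (counts from Table 1).
gender_by_role = [[2229, 4], [3134, 87]]
dof = (len(gender_by_role) - 1) * (len(gender_by_role[0]) - 1)
print(chi_square_statistic(gender_by_role), dof)
```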
5.3 Regression Modeling
Where applicable:
- For a continuous response, a linear regression model will be fitted. Effect modifiers and potential confounders will be adjusted for. Regression coefficients and the corresponding 95% confidence intervals will be reported.
- For a binary response, the association between multiple risk factors and the response variable will be assessed by fitting multiple logistic regression models. Effect modifiers and potential confounders will be adjusted for. Odds ratios and the corresponding 95% confidence intervals will be reported.
6 Data sources
Game data was collected over a six-week period from 16 August 2021 to 20 October 2021 (excluding school holidays and weeks in which learners wrote exams) using an online web app hosted on the AWS platform.
We are leveraging four main datasets from Wavumbuzi game-play logs:
6.1 Submission
| Name | submission.df |
| Number of rows | 175976 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| logical | 4 |
| numeric | 15 |
| POSIXct | 7 |
| ________________________ | |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| test | 0 | 1 | 0 | FAL: 175976 |
| demo | 0 | 1 | 0 | FAL: 175976 |
| admin_score | 175976 | 0 | NaN | : |
| appeal_score | 175976 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 88342.80 | 50881.12 | 105 | 44304.75 | 88335.5 | 132419.2 | 176413 | ▇▇▇▇▇ |
| student_id | 0 | 1.00 | 5127.63 | 2844.56 | 20 | 2746.00 | 4796.0 | 7366.0 | 10814 | ▆▇▆▅▅ |
| challenge_level_id | 0 | 1.00 | 423.69 | 329.99 | 8 | 124.00 | 356.0 | 690.0 | 1127 | ▇▅▃▃▂ |
| teacher_id | 172719 | 0.02 | 5511.96 | 3502.07 | 305 | 777.00 | 8083.0 | 8150.0 | 8156 | ▅▁▁▁▇ |
| final_score | 4400 | 0.97 | 108.36 | 125.16 | 0 | 40.00 | 60.0 | 146.0 | 1000 | ▇▁▁▁▁ |
| teacherReview_id | 172719 | 0.02 | 189516.22 | 98677.71 | 49554 | 105456.00 | 166570.0 | 279875.0 | 367988 | ▇▆▃▅▅ |
| reported_count | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0.0 | 0.0 | 0 | ▁▁▇▁▁ |
| number_of_reviews_for_finalisation | 111434 | 0.37 | 3.13 | 0.37 | 3 | 3.00 | 3.0 | 3.0 | 5 | ▇▁▁▁▁ |
| number_of_seconds_until_finalisation | 8126 | 0.95 | 32350.63 | 97845.30 | 0 | 1.00 | 2.0 | 13659.0 | 1636998 | ▇▁▁▁▁ |
| moderator_score | 173358 | 0.01 | 102.36 | 155.98 | 0 | 0.00 | 50.0 | 131.0 | 1000 | ▇▁▁▁▁ |
| inappropriate_score | 174334 | 0.01 | 0.00 | 0.00 | 0 | 0.00 | 0.0 | 0.0 | 0 | ▁▁▇▁▁ |
| remarked_score | 175189 | 0.00 | 302.63 | 256.77 | 0 | 128.50 | 213.0 | 459.5 | 1000 | ▇▃▃▁▁ |
| system_score | 7604 | 0.96 | 109.97 | 124.72 | 0 | 40.00 | 60.0 | 149.0 | 1000 | ▇▁▁▁▁ |
| finalization_reviews_stdev | 0 | 1.00 | 8.53 | 17.60 | 0 | 0.00 | 0.0 | 11.8 | 150 | ▇▁▁▁▁ |
| finalization_reviews_flat_average | 0 | 1.00 | 74.55 | 125.53 | 0 | 0.00 | 0.0 | 126.0 | 1000 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| submitted_on | 4400 | 0.97 | 2021-10-25 05:21:11 | 2021-12-07 13:23:15 | 2021-11-11 14:36:10 | 163021 |
| updated_on | 3195 | 0.98 | 2021-10-25 05:21:11 | 2021-12-07 13:23:15 | 2021-11-11 14:59:32 | 164083 |
| started_on | 0 | 1.00 | 2021-10-25 05:20:33 | 2021-12-07 13:44:54 | 2021-11-11 11:54:41 | 166579 |
| finalised_on | 4400 | 0.97 | 2021-10-25 05:21:12 | 2021-12-09 07:41:43 | 2021-11-12 05:19:05 | 165335 |
| deletedAt | 175976 | 0.00 | NA | NA | NA | 0 |
| moderation_started_on | 172719 | 0.02 | 2021-11-01 10:42:37 | 2021-12-08 22:49:43 | 2021-11-16 09:30:21 | 3256 |
| moderation_completed_on | 172726 | 0.02 | 2021-11-01 10:46:34 | 2021-12-09 07:41:43 | 2021-11-16 17:56:07 | 3247 |
6.2 Credibility Weighting History
| Name | submission_reviews.df |
| Number of rows | 357492 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 3 |
| numeric | 7 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| status | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| test | 0 | 1 | 0.00 | FAL: 357492 |
| discretionary | 0 | 1 | 0.03 | FAL: 346287, TRU: 11205 |
| demo | 0 | 1 | 0.00 | FAL: 357492 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 1.863538e+05 | 105727.05 | 43 | 9.546975e+04 | 187373.5 | 277830.2 | 367988 | ▇▇▇▇▇ |
| submission_id | 0 | 1.00 | 1.027353e+05 | 48199.84 | 168 | 6.396375e+04 | 109838.0 | 145208.0 | 176411 | ▃▅▆▇▇ |
| initiated_submission_id | 190117 | 0.47 | 1.063469e+05 | 47591.87 | 171 | 6.948150e+04 | 115612.0 | 147284.0 | 176410 | ▃▅▅▇▇ |
| reviewer_id | 0 | 1.00 | 3.684960e+03 | 3172.83 | 17 | 1.033000e+03 | 2645.0 | 5817.0 | 10799 | ▇▃▂▂▂ |
| points_awarded | 0 | 1.00 | 1.840700e+02 | 138.77 | 0 | 1.000000e+02 | 150.0 | 219.0 | 1000 | ▇▂▁▁▁ |
| points_available_to_be_awarded | 5140 | 0.99 | 2.865800e+02 | 172.75 | 0 | 1.500000e+02 | 249.0 | 500.0 | 1000 | ▇▅▅▁▁ |
| review_last_fetch_time | 6411 | 0.98 | 1.637115e+09 | 1000615.97 | 1635149204 | 1.636271e+09 | 1637222344.0 | 1637957019.0 | 1639580509 | ▆▆▇▇▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| created_on | 0 | 1 | 2021-10-25 08:06:44 | 2021-12-08 22:49:43 | 2021-11-18 06:27:03 | 331489 |
| deletedAt | 357492 | 0 | NA | NA | NA | 0 |
6.3 Submission Score History
| Name | sub_score_hist.df |
| Number of rows | 179451 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| logical | 2 |
| numeric | 5 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| scoreType | 0 | 1.00 | 12 | 19 | 0 | 4 | 0 |
| comment | 176020 | 0.02 | 3 | 240 | 0 | 1684 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| deletedAt | 179451 | 0 | NaN | : |
| deletedBy_id | 179451 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 102583.36 | 56748.20 | 1 | 54293.5 | 105621 | 151442.5 | 196453 | ▇▆▇▇▇ |
| submission_id | 0 | 1.00 | 88583.78 | 50676.96 | 1 | 44972.5 | 88874 | 132237.5 | 176412 | ▇▇▇▇▇ |
| score | 0 | 1.00 | 112.49 | 127.85 | 0 | 40.0 | 60 | 150.0 | 1000 | ▇▁▁▁▁ |
| createdBy_id | 5310 | 0.97 | 4362.07 | 3214.22 | 17 | 1561.0 | 3918 | 7190.0 | 10814 | ▇▅▅▃▃ |
| updatedBy_id | 5310 | 0.97 | 4362.07 | 3214.22 | 17 | 1561.0 | 3918 | 7190.0 | 10814 | ▇▅▅▃▃ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| createdAt | 0 | 1 | 2021-10-21 17:20:05 | 2021-12-09 07:41:44 | 2021-11-12 11:46:29 | 172717 |
| updatedAt | 0 | 1 | 2021-10-21 17:20:05 | 2021-12-09 07:41:44 | 2021-11-12 11:46:29 | 172717 |
6.4 Users
Truncated because the table is large
| Name | users.df[25:50] |
| Number of rows | 10732 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| logical | 8 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 5270 | 0.51 | 4 | 6 | 0 | 3 | 0 |
| id_number | 10066 | 0.06 | 1 | 44 | 0 | 605 | 0 |
| title | 9314 | 0.13 | 2 | 5 | 0 | 12 | 0 |
| role | 10545 | 0.02 | 12 | 18 | 0 | 2 | 0 |
| token | 1477 | 0.86 | 20 | 20 | 0 | 9255 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| allan_gray_scholar | 1477 | 0.86 | 0.00 | FAL: 9255 |
| subscribed_to_whatsapp | 1 | 1.00 | 0.21 | FAL: 8440, TRU: 2291 |
| new_registration | 9263 | 0.14 | 0.31 | FAL: 1017, TRU: 452 |
| survey_completed_at | 10732 | 0.00 | NaN | : |
| test | 1477 | 0.86 | 0.00 | FAL: 9255 |
| nationality | 10732 | 0.00 | NaN | : |
| hubspot_association | 0 | 1.00 | 0.11 | FAL: 9580, TRU: 1152 |
| late_sign_up | 1477 | 0.86 | 0.00 | FAL: 9255 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| school_class_id | 10441 | 0.03 | 193.15 | 107.34 | 13 | 49 | 239 | 287.00 | 313 | ▅▁▂▁▇ |
| verified | 2938 | 0.73 | 1.00 | 0.00 | 1 | 1 | 1 | 1.00 | 1 | ▁▁▇▁▁ |
| leaderboard_all_rank | 0 | 1.00 | 3774.31 | 1554.62 | 1 | 2683 | 4922 | 4922.00 | 4922 | ▁▁▁▂▇ |
| leaderboard_week_rank | 0 | 1.00 | 0.00 | 0.00 | 0 | 0 | 0 | 0.00 | 0 | ▁▁▇▁▁ |
| number_of_submissions | 0 | 1.00 | 15.99 | 36.71 | 0 | 0 | 0 | 11.00 | 158 | ▇▁▁▁▁ |
| points | 0 | 1.00 | 2110.08 | 5664.28 | 0 | 0 | 0 | 1170.00 | 32916 | ▇▁▁▁▁ |
| week_points | 0 | 1.00 | 0.00 | 0.00 | 0 | 0 | 0 | 0.00 | 0 | ▁▁▇▁▁ |
| challenge_points | 0 | 1.00 | 1839.14 | 5152.45 | 0 | 0 | 0 | 896.25 | 30600 | ▇▁▁▁▁ |
| other_points | 0 | 1.00 | 242.06 | 555.18 | 0 | 0 | 0 | 220.00 | 4438 | ▇▁▁▁▁ |
| average_points_per_challenge | 0 | 1.00 | 31.05 | 48.26 | 0 | 0 | 0 | 60.00 | 227 | ▇▂▁▁▁ |
| average_points_per_week | 0 | 1.00 | 603.86 | 1748.06 | 0 | 0 | 0 | 575.00 | 30450 | ▇▁▁▁▁ |
| best_week_points | 0 | 1.00 | 596.66 | 1942.49 | 0 | 0 | 0 | 550.00 | 30450 | ▇▁▁▁▁ |
| best_challenge_points | 0 | 1.00 | 155.04 | 254.94 | 0 | 0 | 0 | 200.00 | 1000 | ▇▁▂▁▁ |
7 Results
7.1 Study Cohort
7.1.1 Engagement & Peer-review Summary
Table 1: Demographics, Engagement and Moderation Summary (column spanner: discr)
| Characteristic | Overall, N = 10,732¹ | student, N = 9,255¹ | teacher, N = 1,469¹ | user, N = 8¹ | p-value² |
|---|---|---|---|---|---|
| age | 17.58 (4.01) | 17.58 (4.01) | NA (NA) | NA (NA) | |
| N/A | 6,807 | 5,330 | 1,469 | 8 | |
| gender | | | | | <0.001 |
| female | 41% (2,233) | 42% (2,229) | 4.4% (4) | NA% (0) | |
| male | 59% (3,221) | 58% (3,134) | 96% (87) | NA% (0) | |
| other | 0.1% (8) | 0.1% (8) | 0% (0) | NA% (0) | |
| N/A | 5,270 | 3,884 | 1,378 | 8 | |
| average_review_score | 34 (104) | 39 (111) | 0 (0) | 0 (0) | <0.001 |
| average_points_per_challenge | 31 (48) | 36 (50) | 0 (0) | 0 (0) | <0.001 |
| credibilityWeighting | 0.48 (0.20) | 0.44 (0.16) | 0.71 (0.27) | 0.00 (NA) | <0.001 |
| N/A | 7 | 0 | 0 | 7 | |
| number_of_submissions | 16 (37) | 19 (39) | 0 (0) | 0 (0) | <0.001 |
| challenge_points | 1,839 (5,152) | 2,133 (5,492) | 0 (0) | 0 (0) | <0.001 |
| best_challenge_points | 155 (255) | 180 (266) | 0 (0) | 0 (0) | <0.001 |
| average_points_per_week | 604 (1,748) | 700 (1,864) | 0 (0) | 0 (0) | <0.001 |
| login_count | 479 (1,293) | 550 (1,378) | 38 (140) | 152 (206) | <0.001 |
| school_grade_id | | | | | >0.9 |
| 1 | 1.3% (70) | 1.3% (70) | NA% (0) | NA% (0) | |
| 2 | 2.4% (127) | 2.4% (127) | NA% (0) | NA% (0) | |
| 3 | 7.4% (401) | 7.4% (401) | NA% (0) | NA% (0) | |
| 4 | 10% (566) | 10% (566) | NA% (0) | NA% (0) | |
| 5 | 28% (1,493) | 28% (1,493) | NA% (0) | NA% (0) | |
| 6 | 24% (1,294) | 24% (1,294) | NA% (0) | NA% (0) | |
| 7 | 27% (1,444) | 27% (1,444) | NA% (0) | NA% (0) | |
| N/A | 5,337 | 3,860 | 1,469 | 8 | |
| home_schooled | 1.0% (109) | 1.2% (109) | 0% (0) | 0% (0) | <0.001 |
| profile_percentage_complete | | | | | |
| 0 | 88% (9,407) | 100% (9,255) | 9.8% (144) | 100% (8) | |
| 22 | <0.1% (10) | 0% (0) | 0.7% (10) | 0% (0) | |
| 33 | 2.0% (217) | 0% (0) | 15% (217) | 0% (0) | |
| 44 | 6.6% (706) | 0% (0) | 48% (706) | 0% (0) | |
| 56 | 3.5% (372) | 0% (0) | 25% (372) | 0% (0) | |
| 67 | 0.2% (20) | 0% (0) | 1.4% (20) | 0% (0) | |

¹ Mean (SD); % (n)
² Fisher's exact test; Kruskal-Wallis rank sum test
7.1.2 Peer-review submission distribution by User-type
- Teachers perform more reviews than students on average. Note that there are fewer teachers than learners.
7.1.3 Incomplete peer reviews by User-type
- Students: 4,790 incomplete reviews (93%)
- Teachers: 346 incomplete reviews (7%)
- Point of concern: Students have higher incompletion rates than teachers
7.1.4 Peer-review submission distribution by Points Awarded
- Points awarded through peer review are higher for teachers than for students
7.1.5 Peer-review submission distribution by CW scores
- 8,215/187,099 (4.4%) reviews had CW > 1
- Point of concern: The teacher’s CW maximum is 2, whereas the student’s CW maximum is 1. Was this intentional?
7.2 Credibility Analysis
Remember: credibility is the extent to which we trust each reviewer’s suggestions.
7.2.1 CW directionality for the entire cohort
By fitting a regression line across the entire cohort, we can determine the overall direction of change in credibility weights.
- Overall, change in CW has a positive slope and a positive correlation coefficient, meaning that, on average, updates in CW are positively correlated with time/day.
- The change in CW (slope) is more pronounced from 20 November onward
This rules out the problem of CW updates mainly going down and not up, which was reported during the Kenya 2021 Challenge (August).
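The direction of change can be read from the sign of an ordinary least-squares slope, as in the sketch below. The data points are hypothetical (day index, change in CW) pairs pooled across users, not values from the game-play logs, and `least_squares_slope` is an illustrative helper rather than the code used for this report.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical pooled observations: day of challenge vs. change in CW.
days = [1, 2, 3, 4, 5]
cw_changes = [-0.02, 0.00, 0.01, 0.03, 0.05]
slope = least_squares_slope(days, cw_changes)
print("upward trend" if slope > 0 else "flat or downward trend")
```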
7.2.2 CW Volatility assessment
Comparing this with changes in CW from the previous iteration (depicted in Fig.4 below):
- There is more trust in how the credibility scores are updated
- The updates do not fluctuate so wildly for most users, as seen in the previous report
- For most users, CW updates are smoother and more predictable than in previous iterations
- Volatility in CW (which is an undesirable characteristic) has significantly reduced
Questions to investigate
- Why do we see a significant change in CW only during the first 15 days and the last 20 days of the challenge?
7.2.3 Correlation analysis
The correlation coefficient can also be a good indicator to assess credibility in terms of the directionality of CW updates.
A positive or negative correlation for change in CW over time/day is desirable. Why? If 90% of users have a 0 coefficient in the CW change over time, then it means our peer review algorithm is not working.
- A correlation above +0.1 or below -0.1 is even more desirable, as it lends more credibility to the peer review algorithm
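The per-user check described above can be sketched as follows: compute the correlation between day and CW for each user, then report the share of users whose coefficient lies outside the ±0.1 band. The helper names and the two sample histories are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def share_with_trend(histories, threshold=0.1):
    """Fraction of users whose CW-vs-day correlation exceeds the threshold in magnitude."""
    flagged = sum(
        1 for days, cws in histories.values()
        if abs(pearson_r(days, cws)) > threshold
    )
    return flagged / len(histories)


histories = {
    "user_a": ([1, 2, 3, 4], [0.40, 0.45, 0.50, 0.55]),  # rising CW
    "user_b": ([1, 2, 3, 4], [0.50, 0.50, 0.50, 0.50]),  # unchanged CW
}
print(share_with_trend(histories))  # 0.5
```

Treating a constant CW series as a zero correlation is a choice made for this sketch; it matches the concern that an unchanging CW signals an algorithm that is not responding to reviews.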
7.3 Consistency Analysis
Remember: consistency is the extent to which reviewers agree with one another.
7.3.1 Consistency in how CWs are updated
As a rule of thumb, the number of CW updates should equal the number of reviews submitted. However, in a previous iteration, it was reported that 16% of users (~280) who made reviews had more credibility update entries than the number of reviews performed.
Table 4: Trust in how CW are updated (column spanner: user)
| Characteristic | Overall, N = 2,503 (95% CI)¹ ² | student, N = 2,031 (95% CI)¹ ² | teacher, N = 472 (95% CI)¹ ² | p-value³ |
|---|---|---|---|---|
| complete reviews | 369,144 (127, 168) | 182,391 (85, 95) | 186,753 (292, 499) | 0.5 |
| incomplete reviews | 5,136 (1.9, 2.2) | 4,790 (2.2, 2.5) | 346 (0.64, 0.82) | <0.001 |
| total reviews | 374,280 (129, 170) | 187,181 (87, 97) | 187,099 (293, 500) | >0.9 |
| CW updates | 213,962 (85, 108) | 125,779 (66, 73) | 88,183 (160, 284) | <0.001 |
| N/A | 290 | 215 | 75 | |
| CW updates / reviews submitted | 1,484.51 (0.66, 0.68) | 1,312.17 (0.72, 0.73) | 172.34 (0.42, 0.45) | <0.001 |
| N/A | 290 | 215 | 75 | |
| Reviews submitted - CW updates | 155,046 (58, 82) | 56,582 (29, 33) | 98,464 (186, 310) | <0.001 |
| N/A | 290 | 215 | 75 | |
| CW updates > reviews performed | 0% (0) (0.00%, 0.22%) | 0% (0) (0.00%, 0.26%) | 0% (0) (0.00%, 1.2%) | |
| N/A | 290 | 215 | 75 | |
| CW updates < reviews performed | 91% (2,011) (90%, 92%) | 90% (1,626) (88%, 91%) | 97% (385) (95%, 98%) | <0.001 |
| N/A | 290 | 215 | 75 | |

¹ Sum; % (n)
² CI = Confidence Interval
³ Wilcoxon rank sum test; Pearson's Chi-squared test
It seems that the bug reported in the previous iteration was fixed, since our analysis found no users whose number of CW updates exceeded the number of reviews performed.
Point of concern: Why do 91% (2,011) of users have fewer CW updates than reviews performed?
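The per-user comparison behind this check can be sketched as follows: count reviews and CW update entries per reviewer, then classify each reviewer. The function name and the input shape (one user id per review, one per CW update entry) are assumptions for illustration; the actual column names in the CW history data are not shown in this report.

```python
from collections import Counter


def compare_updates_to_reviews(review_user_ids, cw_update_user_ids):
    """Classify each reviewer by whether CW updates are fewer, equal, or more than reviews."""
    reviews = Counter(review_user_ids)
    updates = Counter(cw_update_user_ids)
    result = Counter()
    for user_id, n_reviews in reviews.items():
        n_updates = updates.get(user_id, 0)
        if n_updates > n_reviews:
            result["more"] += 1
        elif n_updates < n_reviews:
            result["fewer"] += 1
        else:
            result["equal"] += 1
    return result


# Hypothetical ids: user 1 has fewer updates, user 2 has more, user 3 matches.
summary = compare_updates_to_reviews([1, 1, 2, 3], [1, 2, 2, 3])
print(dict(summary))
```

Splitting the "fewer" group further by complete versus incomplete reviews would be a natural next step for investigating the point of concern.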
7.3.2 Consistency between Moderated Score and Weighted Score across the entire cohort
In a previous iteration, it was reported that Moderation seemed to pull a lot of scores to 0, as shown in Figure 5 under the problem statement.
- Overall, the scores are highly correlated with the final score (moderated score), with a correlation coefficient of 0.99
- However, when looking only at scores that were moderated, the correlation coefficient dropped by 27%, a good indication that moderation is working as expected
- Only 18% of the moderated scores were pushed to 0
- Additionally, compared to the previous iteration, moderation does not push a considerable number of the scores to zero, which suggests that this issue was fixed
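The share of moderated scores pulled to zero can be computed along these lines. The sketch uses the `system_score` and `moderator_score` column names from the Submission dataset and treats a missing `moderator_score` as "not moderated", which is an assumption; the records themselves are hypothetical.

```python
def moderation_zero_share(submissions):
    """Share of moderated submissions whose moderator_score is 0."""
    moderated = [s for s in submissions if s["moderator_score"] is not None]
    if not moderated:
        return 0.0
    zeros = sum(1 for s in moderated if s["moderator_score"] == 0)
    return zeros / len(moderated)


# Hypothetical records: two of four submissions were moderated, one to zero.
submissions = [
    {"system_score": 60, "moderator_score": None},
    {"system_score": 150, "moderator_score": 0},
    {"system_score": 40, "moderator_score": 50},
    {"system_score": 100, "moderator_score": None},
]
print(moderation_zero_share(submissions))  # 0.5
```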