Wavumbuzi Peer Review Algorithm Audit

K.R.U

1 Background

The purpose of this research report is to present the audit results for the peer-review algorithm of the Wavumbuzi Entrepreneurship Challenge (hereafter WEC) game web app. WEC is a free, annual, six-week online challenge offered to learners in all secondary/high schools across Kenya and Rwanda. It is a gamified experiential learning process designed to equip learners with the competencies to be the next generation of global leaders, change-makers and innovative thinkers. Focusing on entrepreneurship (a widely recognised countermeasure to unemployment), the Entrepreneurship Challenge is designed to stimulate and develop learners’ entrepreneurial mindset and 21st Century Skills.

Learners are not taught. Instead, every week over a six-week period, learners get a set of challenges, via mobile devices and computers, that stimulate them to think like entrepreneurs. Each task requires learners to apply new concepts and use their knowledge and skills to solve real-world challenges. Teachers are trained to guide and encourage learners to engage in and complete the Challenges.

2 Objective

The main aim of this mini-study is to answer the question “How well is Peer Review working?” using Wavumbuzi Rwanda 2021 game-play data.

Why Rwanda 2021 data?

  • It is the most recent iteration (Kenya is one iteration behind)
  • A comprehensive audit exists for the previous iteration, so we can assess improvements by comparing the two reports

3 Peer Review Overview

The Peer Review system determines the final score for a submission based on the suggested scores that reviewers award it. This means that the method we use to decide on the credibility of reviewers’ suggested scores deeply influences the efficacy of the Peer Review system.

Credibility Weighting (“CW” from here on) is a mechanism that keeps a running score of how accurate each reviewer is. The score ranges between 0 and 1.
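To make the role of CW concrete, the sketch below shows one way a final score could be derived from suggested scores. It assumes a plain credibility-weighted average; the report does not state the exact formula, and the function name and the handling of an all-zero weight total are hypothetical.

```python
def weighted_final_score(reviews):
    """Credibility-weighted average of suggested scores.

    reviews: list of (suggested_score, reviewer_cw) pairs.
    Assumption: a weighted mean; the real algorithm may differ.
    """
    total_weight = sum(cw for _, cw in reviews)
    if total_weight == 0:
        # Hypothetical handling: no credible reviewer, so no score.
        return None
    return sum(score * cw for score, cw in reviews) / total_weight


# Illustrative values only: a fully trusted, a half-trusted
# and an untrusted reviewer.
example = [(100, 1.0), (60, 0.5), (0, 0.0)]
final = weighted_final_score(example)
```

Under this assumption, a reviewer with a CW of 0 has no influence on the final score, which is why the way CW is updated matters so much.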

3.1 Peer Review Algorithm

The peer-review process is divided into mandatory and discretionary reviews, which are queued differently in the system.


This process aims to encourage automatic scoring based on submission quality. The following triggers are built into the system to flag discrepancies and/or suspicious behaviour and redirect a submission for Moderation:

  • Inconsistent scoring after ten reviews
  • Submission not reviewed in the peer review process for an extended period.
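The two triggers could be expressed as a simple check like the one below. This is only a sketch: the report gives the ten-review rule but not how “inconsistent” or “extended period” are defined, so the standard-deviation cut-off and the waiting time are hypothetical placeholders.

```python
from statistics import pstdev

MIN_REVIEWS = 10               # from the report: checked after ten reviews
STDEV_THRESHOLD = 50.0         # hypothetical cut-off for "inconsistent" scoring
MAX_WAIT_SECONDS = 7 * 86400   # hypothetical meaning of "an extended period"


def needs_moderation(suggested_scores, seconds_unreviewed):
    """Return True if either moderation trigger fires."""
    inconsistent = (
        len(suggested_scores) >= MIN_REVIEWS
        and pstdev(suggested_scores) > STDEV_THRESHOLD
    )
    stale = (
        len(suggested_scores) == 0
        and seconds_unreviewed > MAX_WAIT_SECONDS
    )
    return inconsistent or stale
```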

The following improvements were made to encourage good and fair behaviour from teachers and learners.

  • Notifications warning users of potential bad behaviour, e.g. submitting answers too quickly or taking less than 60 seconds to review a submission.
  • Introduction of remark requests, credibility weighting, and patterned behaviour.

Unlike previous challenges, the teacher’s score held equal weight to the learner’s score and thus did not override previous learner scores within the system.

Types of inappropriate submissions:

  • Plagiarism
  • Irrelevant and repeated answers
  • Inappropriate and explicit links included in submissions

Regular system checks were implemented at the end of each new challenge week, with additional random checks and adjustments of scores by the systems managers.

A detailed exposition of the algorithm can be found here:

4 Problem Statement

The following CW issues were reported in previous iterations, for both Kenya and Rwanda.

4.1 Wild fluctuation in credibility weights

Previous analyses showed that users’ CWs were highly volatile, swinging sharply up and down over time. Red flag: CW cannot be a good indicator of how much we trust someone’s reviews if it fluctuates so wildly for most users.

Fig.4 - Change in CW from the previous iteration. Note the volatility and wild fluctuation in credibility weights.

It was therefore concluded that volatility in CW is undesirable, and that CW should be configured such that it:

  • converges to some semblance of true credibility for each user;
  • provides a stable, even if evolving, sense of the trust to assign to a reviewer.
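One simple way to quantify volatility is the mean absolute change between consecutive CW values for a user. The sketch below is an illustration only; the metric and the example histories are assumptions and are not taken from the audit data.

```python
def cw_volatility(cw_history):
    """Mean absolute change between consecutive CW values for one user."""
    steps = [abs(b - a) for a, b in zip(cw_history, cw_history[1:])]
    return sum(steps) / len(steps) if steps else 0.0


# Made-up histories: one swinging wildly, one converging smoothly.
volatile_user = [0.5, 0.9, 0.2, 0.8, 0.1]
stable_user = [0.5, 0.55, 0.6, 0.62, 0.65]

volatile_score = cw_volatility(volatile_user)
stable_score = cw_volatility(stable_user)
```

A lower value means the CW moves in small steps, which is the behaviour the two criteria above ask for.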

4.2 Credibility weight only going down and not up

During the Kenya 2021 Challenge (August), it was raised that the CW was mainly going down and not up for most users.

It is, however, essential to note that in some cases the CW does not change because the student did an OK review, for which they receive a half-point increase. Nonetheless, this should always be investigated.

4.3 Consistency in how CWs are updated relative to the number of reviews performed

In a previous iteration, it was reported that 16% of users (~280) who made reviews had more credibility update entries than the number of reviews performed.

4.4 Moderation is pulling a lot of scores to 0

In a previous iteration, it was reported that Moderation seemed to pull a lot of scores to 0, as shown in the figure below.

Fig.5 - Moderation pulling a large number of scores to 0, as reported in a previous iteration.

5 Methodology

In light of the above, we sought to review CW in terms of:

  • the extent to which reviewers’ suggested scores agree with one another [Consistency],
  • the extent to which we trust each reviewer’s suggestions [Credibility], and
  • the extent to which credibility (accuracy) is maintained over iterations.

5.1 Univariate analysis

Descriptive measures under the univariate setting will be applied to summarize central tendencies and distribution for metric variables.

  • Categorical variables such as gender and nationality will be summarized using frequencies and percentages.
  • Categorical variables derived from continuous variables will be defined using accepted cut-offs from the literature.
  • Continuous variables such as age, CW, and scores will be summarized using mean and corresponding SD - assuming Gaussian distribution.
  • Continuous variables that are skewed will be summarized as median and the corresponding interquartile range (IQR).
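The choice between the two summaries for continuous variables can be sketched as follows. The `skewed` flag is a hypothetical argument that the analyst sets after inspecting the distribution; the report does not specify how skewness is decided.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values, skewed=False):
    """Summarize a continuous variable.

    skewed=False: mean and sample SD (assumes a roughly Gaussian shape).
    skewed=True:  median and interquartile range (IQR).
    """
    if skewed:
        q1, _, q3 = quantiles(values, n=4)
        return {"median": median(values), "iqr": q3 - q1}
    return {"mean": mean(values), "sd": stdev(values)}


# Made-up values for illustration only.
symmetric_summary = summarize([2, 4, 6])
skewed_summary = summarize([1, 2, 3, 4, 5, 6, 7], skewed=True)
```

For heavily right-skewed variables such as the scores summarized in the data tables, the median and IQR are less sensitive to the few very large values.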

5.2 Bivariate setting

  • Association between categorical variables will be assessed using Pearson’s Chi-Square test.
  • Association between categorical and continuous variables will be assessed using a two-sample t-test, assuming Gaussian distribution for continuous variables.
  • Association between categorical and continuous variables which are not normally distributed will be assessed using a two-sample Wilcoxon rank-sum test.
  • The Kruskal-Wallis test (one-way ANOVA on ranks) will be used if the categorical variable has more than two levels.
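As an illustration of the first test in the list, the sketch below computes Pearson’s Chi-Square statistic and its degrees of freedom for a contingency table. The counts are invented for the example, and the p-value lookup is left out; in practice a statistics package would be used for the full test.

```python
def chi_square_statistic(table):
    """Pearson's Chi-Square statistic for a contingency table.

    table: list of rows of observed counts.
    Assumes every expected count is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Invented 2x2 table, e.g. user type (rows) by review completed (columns).
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
```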

5.3 Regression Modeling

Where applicable:

  • For a continuous response, a linear regression model will be fitted. Effect modifiers and potential confounders will be adjusted for. Regression coefficients and the corresponding 95% confidence intervals will be reported.
  • For a binary response, the association between multiple risk factors and the response variable will be assessed by fitting multiple logistic regression models. Effect modifiers and potential confounders will be adjusted for. Odds ratios and the corresponding 95% confidence intervals will be reported.
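For intuition on how an odds ratio and its 95% confidence interval are reported, the sketch below handles the simplest case: one binary risk factor and a binary response, where the logistic-regression odds ratio equals the cross-product ratio of the 2x2 table. The counts are invented, and the adjusted multi-factor models described above would require a statistics package.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: exposed with outcome,   b: exposed without outcome
    c: unexposed with outcome, d: unexposed without outcome
    Assumes all four counts are non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts for illustration only.
or_value, ci_lower, ci_upper = odds_ratio_ci(20, 10, 10, 20)
```

An interval that excludes 1 indicates an association at the 5% level under these assumptions.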

6 Data sources

Game data was collected over a six-week period from 16 August 2021 to 20 October 2021 (excluding school holidays and weeks in which the learners wrote exams) using an online web app hosted on the AWS platform.

We use four main datasets from the Wavumbuzi game-play logs:

6.1 Submission

Data summary
Name submission.df
Number of rows 175976
Number of columns 26
_______________________
Column type frequency:
logical 4
numeric 15
POSIXct 7
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
test 0 1 0 FAL: 175976
demo 0 1 0 FAL: 175976
admin_score 175976 0 NaN :
appeal_score 175976 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 88342.80 50881.12 105 44304.75 88335.5 132419.2 176413 ▇▇▇▇▇
student_id 0 1.00 5127.63 2844.56 20 2746.00 4796.0 7366.0 10814 ▆▇▆▅▅
challenge_level_id 0 1.00 423.69 329.99 8 124.00 356.0 690.0 1127 ▇▅▃▃▂
teacher_id 172719 0.02 5511.96 3502.07 305 777.00 8083.0 8150.0 8156 ▅▁▁▁▇
final_score 4400 0.97 108.36 125.16 0 40.00 60.0 146.0 1000 ▇▁▁▁▁
teacherReview_id 172719 0.02 189516.22 98677.71 49554 105456.00 166570.0 279875.0 367988 ▇▆▃▅▅
reported_count 0 1.00 0.00 0.00 0 0.00 0.0 0.0 0 ▁▁▇▁▁
number_of_reviews_for_finalisation 111434 0.37 3.13 0.37 3 3.00 3.0 3.0 5 ▇▁▁▁▁
number_of_seconds_until_finalisation 8126 0.95 32350.63 97845.30 0 1.00 2.0 13659.0 1636998 ▇▁▁▁▁
moderator_score 173358 0.01 102.36 155.98 0 0.00 50.0 131.0 1000 ▇▁▁▁▁
inappropriate_score 174334 0.01 0.00 0.00 0 0.00 0.0 0.0 0 ▁▁▇▁▁
remarked_score 175189 0.00 302.63 256.77 0 128.50 213.0 459.5 1000 ▇▃▃▁▁
system_score 7604 0.96 109.97 124.72 0 40.00 60.0 149.0 1000 ▇▁▁▁▁
finalization_reviews_stdev 0 1.00 8.53 17.60 0 0.00 0.0 11.8 150 ▇▁▁▁▁
finalization_reviews_flat_average 0 1.00 74.55 125.53 0 0.00 0.0 126.0 1000 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
submitted_on 4400 0.97 2021-10-25 05:21:11 2021-12-07 13:23:15 2021-11-11 14:36:10 163021
updated_on 3195 0.98 2021-10-25 05:21:11 2021-12-07 13:23:15 2021-11-11 14:59:32 164083
started_on 0 1.00 2021-10-25 05:20:33 2021-12-07 13:44:54 2021-11-11 11:54:41 166579
finalised_on 4400 0.97 2021-10-25 05:21:12 2021-12-09 07:41:43 2021-11-12 05:19:05 165335
deletedAt 175976 0.00 NA NA NA 0
moderation_started_on 172719 0.02 2021-11-01 10:42:37 2021-12-08 22:49:43 2021-11-16 09:30:21 3256
moderation_completed_on 172726 0.02 2021-11-01 10:46:34 2021-12-09 07:41:43 2021-11-16 17:56:07 3247

6.2 Credibility Weighting History

Data summary
Name submission_reviews.df
Number of rows 357492
Number of columns 13
_______________________
Column type frequency:
character 1
logical 3
numeric 7
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
status 0 1 8 10 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
test 0 1 0.00 FAL: 357492
discretionary 0 1 0.03 FAL: 346287, TRU: 11205
demo 0 1 0.00 FAL: 357492

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 1.863538e+05 105727.05 43 9.546975e+04 187373.5 277830.2 367988 ▇▇▇▇▇
submission_id 0 1.00 1.027353e+05 48199.84 168 6.396375e+04 109838.0 145208.0 176411 ▃▅▆▇▇
initiated_submission_id 190117 0.47 1.063469e+05 47591.87 171 6.948150e+04 115612.0 147284.0 176410 ▃▅▅▇▇
reviewer_id 0 1.00 3.684960e+03 3172.83 17 1.033000e+03 2645.0 5817.0 10799 ▇▃▂▂▂
points_awarded 0 1.00 1.840700e+02 138.77 0 1.000000e+02 150.0 219.0 1000 ▇▂▁▁▁
points_available_to_be_awarded 5140 0.99 2.865800e+02 172.75 0 1.500000e+02 249.0 500.0 1000 ▇▅▅▁▁
review_last_fetch_time 6411 0.98 1.637115e+09 1000615.97 1635149204 1.636271e+09 1637222344.0 1637957019.0 1639580509 ▆▆▇▇▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
created_on 0 1 2021-10-25 08:06:44 2021-12-08 22:49:43 2021-11-18 06:27:03 331489
deletedAt 357492 0 NA NA NA 0

6.3 Submission Score History

Data summary
Name sub_score_hist.df
Number of rows 179451
Number of columns 11
_______________________
Column type frequency:
character 2
logical 2
numeric 5
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
scoreType 0 1.00 12 19 0 4 0
comment 176020 0.02 3 240 0 1684 0

Variable type: logical

skim_variable n_missing complete_rate mean count
deletedAt 179451 0 NaN :
deletedBy_id 179451 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 102583.36 56748.20 1 54293.5 105621 151442.5 196453 ▇▆▇▇▇
submission_id 0 1.00 88583.78 50676.96 1 44972.5 88874 132237.5 176412 ▇▇▇▇▇
score 0 1.00 112.49 127.85 0 40.0 60 150.0 1000 ▇▁▁▁▁
createdBy_id 5310 0.97 4362.07 3214.22 17 1561.0 3918 7190.0 10814 ▇▅▅▃▃
updatedBy_id 5310 0.97 4362.07 3214.22 17 1561.0 3918 7190.0 10814 ▇▅▅▃▃

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
createdAt 0 1 2021-10-21 17:20:05 2021-12-09 07:41:44 2021-11-12 11:46:29 172717
updatedAt 0 1 2021-10-21 17:20:05 2021-12-09 07:41:44 2021-11-12 11:46:29 172717

6.4 Users

Truncated due to the large table size

Data summary
Name users.df[25:50]
Number of rows 10732
Number of columns 26
_______________________
Column type frequency:
character 5
logical 8
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 5270 0.51 4 6 0 3 0
id_number 10066 0.06 1 44 0 605 0
title 9314 0.13 2 5 0 12 0
role 10545 0.02 12 18 0 2 0
token 1477 0.86 20 20 0 9255 0

Variable type: logical

skim_variable n_missing complete_rate mean count
allan_gray_scholar 1477 0.86 0.00 FAL: 9255
subscribed_to_whatsapp 1 1.00 0.21 FAL: 8440, TRU: 2291
new_registration 9263 0.14 0.31 FAL: 1017, TRU: 452
survey_completed_at 10732 0.00 NaN :
test 1477 0.86 0.00 FAL: 9255
nationality 10732 0.00 NaN :
hubspot_association 0 1.00 0.11 FAL: 9580, TRU: 1152
late_sign_up 1477 0.86 0.00 FAL: 9255

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
school_class_id 10441 0.03 193.15 107.34 13 49 239 287.00 313 ▅▁▂▁▇
verified 2938 0.73 1.00 0.00 1 1 1 1.00 1 ▁▁▇▁▁
leaderboard_all_rank 0 1.00 3774.31 1554.62 1 2683 4922 4922.00 4922 ▁▁▁▂▇
leaderboard_week_rank 0 1.00 0.00 0.00 0 0 0 0.00 0 ▁▁▇▁▁
number_of_submissions 0 1.00 15.99 36.71 0 0 0 11.00 158 ▇▁▁▁▁
points 0 1.00 2110.08 5664.28 0 0 0 1170.00 32916 ▇▁▁▁▁
week_points 0 1.00 0.00 0.00 0 0 0 0.00 0 ▁▁▇▁▁
challenge_points 0 1.00 1839.14 5152.45 0 0 0 896.25 30600 ▇▁▁▁▁
other_points 0 1.00 242.06 555.18 0 0 0 220.00 4438 ▇▁▁▁▁
average_points_per_challenge 0 1.00 31.05 48.26 0 0 0 60.00 227 ▇▂▁▁▁
average_points_per_week 0 1.00 603.86 1748.06 0 0 0 575.00 30450 ▇▁▁▁▁
best_week_points 0 1.00 596.66 1942.49 0 0 0 550.00 30450 ▇▁▁▁▁
best_challenge_points 0 1.00 155.04 254.94 0 0 0 200.00 1000 ▇▁▂▁▁

7 Results

7.1 Study Cohort

7.1.1 Engagement & Peer-review Summary

7.1.2 Peer-review submission distribution by User-type

  • Teachers perform more reviews than students on average. Note that there are fewer teachers than learners

7.1.3 Incomplete peer reviews by User-type

  • Students: 4,790 incomplete reviews (93%)
  • Teachers: 346 incomplete reviews (7%)
  • Point of concern: Students account for far more incomplete reviews than teachers

7.1.4 Peer-review submission distribution by Points Awarded

  • Points awarded through peer review are higher for teachers than for students

7.1.5 Peer-review submission distribution by CW scores

  • 8,215 of 187,099 reviews (4.4%) had a CW > 1
  • Point of concern: The teacher’s CW maximum is 2, whereas the student’s CW maximum is 1. Was this intentional?

7.2 Credibility Analysis

Remember: credibility is the extent to which we trust each reviewer’s suggestions

7.2.1 CW directionality for the entire cohort

By fitting a regression line across the entire cohort, we can determine the overall direction of change in credibility weights.

  • Overall, change in CW has a positive slope and a positive correlation coefficient, meaning that, on average, updates in CW are positively correlated with time/day.
  • The change in CW (slope) is more pronounced from November 20th onwards

This rules out the problem of CW updates mainly going down and not up, which was reported during the Kenya 2021 Challenge (August).
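The slope and correlation coefficient referred to here can be computed with ordinary least squares, as in the sketch below. The day numbers and CW changes are made-up values, not the audit data, and the function name is hypothetical.

```python
def slope_and_correlation(xs, ys):
    """Least-squares slope of ys on xs and Pearson's correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5
    return slope, r


# Made-up example: day of the challenge vs. mean change in CW on that day.
days = [1, 2, 3, 4, 5]
cw_change = [0.00, 0.01, 0.01, 0.03, 0.04]
slope, r = slope_and_correlation(days, cw_change)
```

A positive slope together with a positive coefficient is the pattern described in the bullets above.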

7.2.2 CW Volatility assessment

Comparing this with changes in CW from the previous iteration (depicted in Fig.4 below):

  • There is more trust in how the credibility scores are updated
  • The updates do not fluctuate so wildly for most users, as seen in the previous report
  • For most users, CW updates are smoother and more predictable compared to previous iterations
  • Volatility in CW (which is an undesirable characteristic) has significantly reduced
Fig.4 - Change in CW from the previous iteration. Note the volatility and wild fluctuation in credibility weights.

Questions to investigate

  • Why do we see a significant change in CW only during the first 15 days and the last 20 days of the challenge?

7.2.3 Correlation analysis

The correlation coefficient can also be a good indicator to assess credibility in terms of the directionality of CW updates.


A positive or negative correlation for change in CW over time/day is desirable. Why? If 90% of users have a coefficient of 0 for CW change over time, then our peer review algorithm is not working.

  • A correlation of > +0.1 or < -0.1 is even more desirable, as it indicates more credibility in the peer review algorithm
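Given one correlation coefficient per user, the share of users falling in each band can be tabulated as sketched below. The 0.1 threshold comes from the bullet above; the band names and the example coefficients are hypothetical.

```python
def correlation_bands(coefficients, threshold=0.1):
    """Share of users whose CW-vs-time correlation is negative, near zero or positive."""
    counts = {"negative": 0, "near_zero": 0, "positive": 0}
    for r in coefficients:
        if r > threshold:
            counts["positive"] += 1
        elif r < -threshold:
            counts["negative"] += 1
        else:
            counts["near_zero"] += 1
    total = len(coefficients)
    return {band: count / total for band, count in counts.items()}


# Made-up per-user coefficients for illustration only.
shares = correlation_bands([0.5, -0.3, 0.0, 0.05])
```

A large `near_zero` share would be the warning sign described above.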

7.3 Consistency Analysis

Remember: consistency is the extent to which reviewers’ suggested scores agree with one another.

7.3.1 Consistency in how CWs are updated

As a rule of thumb, the number of CW updates should equal the number of reviews submitted. However, in a previous iteration, it was reported that 16% of users (~280) who made reviews had more credibility update entries than the number of reviews performed.

  • It seems that the bug reported in the previous iteration was fixed, since our analysis found no users whose number of CW updates exceeded the number of reviews performed

  • Point of concern: Why do 91% (2,011) of users have fewer CW updates than reviews performed?
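The comparison behind these two findings can be sketched as a per-user count check. The inputs are assumed to be flat lists of user ids, one entry per review and one per CW update; these shapes and the function name are hypothetical and do not reflect the actual table layout.

```python
from collections import Counter


def compare_update_counts(review_user_ids, cw_update_user_ids):
    """Classify each reviewing user by CW updates relative to reviews performed."""
    reviews = Counter(review_user_ids)
    updates = Counter(cw_update_user_ids)
    result = {"fewer": 0, "equal": 0, "more": 0}
    for user, n_reviews in reviews.items():
        n_updates = updates.get(user, 0)
        if n_updates < n_reviews:
            result["fewer"] += 1
        elif n_updates > n_reviews:
            result["more"] += 1
        else:
            result["equal"] += 1
    return result


# Made-up ids: user 1 has fewer updates, user 2 equal, user 3 more.
summary = compare_update_counts([1, 1, 2, 3, 3], [1, 2, 3, 3, 3])
```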

7.3.2 Consistency between Moderated Score and Weighted Score across the entire cohort

In a previous iteration, it was reported that Moderation seemed to pull a lot of scores to 0, as shown in Fig.5 under the Problem Statement.

  • Overall, the weighted scores are highly correlated with the final (moderated) scores, with a correlation coefficient of 0.99

  • However, when looking only at scores that were moderated, the correlation coefficient drops by 27% - a good indication that moderation is working as expected

  • Only 18% of the moderated scores were pushed to 0

  • Additionally, compared to the previous iteration, moderation does not push a considerable number of the scores to zero - it seems this was fixed?
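The share of moderated scores pushed to 0 can be computed as sketched below. It assumes that a missing moderator score (represented here as `None`) means the submission was never moderated; the function name and the example values are hypothetical.

```python
def share_pushed_to_zero(moderator_scores):
    """Fraction of moderated submissions whose moderator score is 0.

    moderator_scores: one entry per submission, None if not moderated.
    """
    moderated = [s for s in moderator_scores if s is not None]
    if not moderated:
        return 0.0
    return sum(1 for s in moderated if s == 0) / len(moderated)


# Made-up scores: four moderated submissions, two of them set to 0.
share = share_pushed_to_zero([None, 0, 50, 0, 100, None])
```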

9 Appendix