1 Background

The purpose of this research report is to present the results of an audit of the peer-review algorithm of the Wavumbuzi Entrepreneurship Challenge (hereafter WEC) game web app. WEC is a free, annual, six-week online challenge offered to learners in all secondary/high schools across Kenya and Rwanda. It is a gamified experiential learning process designed to equip learners with the competencies to be the next generation of global leaders, change-makers and innovative thinkers. Focusing on entrepreneurship (a widely recognised countermeasure to unemployment), the Challenge is designed to stimulate and develop learners’ entrepreneurial mindset and 21st-century skills.

Learners are not taught. Instead, every week over the six-week period, learners receive a set of challenges, via mobile devices and computers, that stimulate them to think like entrepreneurs. Each task requires learners to apply new concepts and use their knowledge and skills to solve real-world challenges. Teachers are trained to guide and encourage learners to engage in and complete the Challenges.

2 Objective

The main aim of this mini-study is to answer the question “How well is Peer Review working?” using Wavumbuzi Rwanda 2021 game-play data.

Why Rwanda 2021 data?

It is the most recent iteration (Kenya is one iteration behind).
A comprehensive audit exists for the previous iteration, so we can assess improvements by comparing the two reports.

3 Peer Review Overview

The Peer Review system determines the final score for a submission based on the suggested scores that reviewers award it. This means that the method we use to decide on the credibility of reviewers’ suggested scores deeply influences the efficacy of the Peer Review system.

Credibility is tracked through Credibility Weighting (“CW” from here on), a mechanism that keeps a running score of how accurate each reviewer is. The score ranges between 0 and 1.
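To make the role of CW concrete, the sketch below combines reviewers' suggested scores into a single score using a credibility-weighted mean. This is an illustrative assumption, not the production algorithm: the function name `weighted_score` and the fallback to a plain mean when every weight is zero are hypothetical.

```python
from typing import Sequence


def weighted_score(suggested: Sequence[float], cw: Sequence[float]) -> float:
    """Credibility-weighted mean of suggested scores (illustrative only)."""
    if len(suggested) != len(cw) or not suggested:
        raise ValueError("need one CW per suggested score, and at least one score")
    total_weight = sum(cw)
    if total_weight == 0:
        # Assumption: with no credible reviewer, fall back to a plain mean.
        return sum(suggested) / len(suggested)
    return sum(s * w for s, w in zip(suggested, cw)) / total_weight


# Three reviewers suggest different scores; the low-CW outlier counts for less.
example = weighted_score([100, 110, 300], [0.8, 0.7, 0.1])
plain_mean = sum([100, 110, 300]) / 3
```

In this toy case the weighted result sits much closer to the two credible reviewers than the plain mean does, which is the behaviour a credibility mechanism is meant to produce.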

3.1 Peer Review Algorithm

The peer-review process is divided into mandatory and discretionary reviews, which are queued differently in the system.


This process aims to encourage automatic scoring based on submission quality. The following triggers are built into the system to flag discrepancies and/or suspicious behaviour and redirect a submission for Moderation:

Inconsistent scoring after ten reviews
Submission not reviewed in the peer review process for an extended period.
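The two triggers above could be checked along the following lines. This is a minimal sketch: the report names the triggers but not their thresholds, so `MAX_STDEV`, `MAX_WAIT_HOURS`, the function name and the reading of "not reviewed" as "zero reviews so far" are all assumptions.

```python
from statistics import pstdev

MIN_REVIEWS = 10        # "after ten reviews" (from the report)
MAX_STDEV = 50.0        # assumed cut-off for "inconsistent scoring"
MAX_WAIT_HOURS = 72.0   # assumed meaning of "an extended period"


def needs_moderation(review_scores, hours_unreviewed):
    """Return the names of the moderation triggers that fire for one submission."""
    triggers = []
    # Trigger 1: reviewers still disagree strongly once ten reviews are in.
    if len(review_scores) >= MIN_REVIEWS and pstdev(review_scores) > MAX_STDEV:
        triggers.append("inconsistent_scoring")
    # Trigger 2: nobody has reviewed the submission for a long time.
    if not review_scores and hours_unreviewed > MAX_WAIT_HOURS:
        triggers.append("not_reviewed")
    return triggers
```

A submission with ten reviews alternating between 0 and 200 would fire the first trigger, while one with ten identical scores would fire none.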

The following improvements were made to encourage good and fair behaviour from teachers and learners.

Notifications warning users of potential bad behaviour, e.g. submitting answers too quickly or taking less than 60 seconds to review a submission.
Introduction of remark requests, credibility weighting, and patterned behaviour.
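The 60-second review warning can be illustrated with a simple timestamp comparison. This is a sketch only: the parameter names `fetched_at` and `submitted_at` are hypothetical and do not refer to specific columns in the game database.

```python
from datetime import datetime, timedelta

MIN_REVIEW_TIME = timedelta(seconds=60)  # threshold stated in the report


def reviewed_too_quickly(fetched_at: datetime, submitted_at: datetime) -> bool:
    """True when a review was completed in under 60 seconds."""
    return submitted_at - fetched_at < MIN_REVIEW_TIME


quick = reviewed_too_quickly(datetime(2021, 11, 1, 10, 0, 0), datetime(2021, 11, 1, 10, 0, 45))
careful = reviewed_too_quickly(datetime(2021, 11, 1, 10, 0, 0), datetime(2021, 11, 1, 10, 2, 0))
```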

Unlike previous challenges, the teacher’s score held equal weight to the learner’s score and thus did not override previous learner scores within the system.

Types of inappropriate submissions:
Plagiarism
Irrelevant & repeated answers
Inappropriate and explicit links included in submissions.

Regular system checks were implemented at the end of each new challenge week, with additional random checks and adjustments of scores by the systems managers.

A detailed exposition of the algorithm can be found here:

4 Problem Statement

The following CW issues were reported in previous iterations, for both Kenya and Rwanda.

4.1 Wild fluctuation in credibility weights

Previous analyses have shown that users’ CWs were incredibly volatile, swinging aggressively up and down over time. Red flag: CW cannot possibly be a good indicator of the degree to which we trust in someone’s reviews if it fluctuates so wildly for most users.

Fig. 4: Change in CW from the previous iteration. Note the volatility and wild fluctuation in credibility weights.

It was therefore concluded that volatility in CW is undesirable, and that we ought to figure out how to configure CW such that:

It converges to some semblance of true credibility for each user;
It provides a stable, even if evolving, sense of the trust to assign to a reviewer.
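One way to put a number on "volatility" is the mean absolute change between consecutive CW values in a user's history. The metric and the function name `cw_volatility` are assumptions chosen for illustration; the earlier analyses may have used a different measure.

```python
from statistics import mean


def cw_volatility(history):
    """Mean absolute change between consecutive CW values (0 means perfectly stable)."""
    if len(history) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(history, history[1:]))


volatile = cw_volatility([0.5, 0.9, 0.2, 0.8, 0.1])       # large swings
converging = cw_volatility([0.5, 0.55, 0.6, 0.62, 0.63])  # settles down
```

A well-configured CW should look like the second series: early movement followed by small adjustments, which gives a low volatility value.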

4.2 Credibility weight only going down and not up

During the Kenya 2021 Challenge (August), it was raised that the CW was mainly going down and not up for most users.

It is, however, important to note that there are cases where the CW does not change because the student did an OK review, for which they get a half-points increase. Nonetheless, it is always essential to investigate.

4.3 Consistency in how CWs are updated relative to # of reviews performed

In a previous iteration, it was reported that 16% of users (~280) who made reviews had more credibility update entries than the number of reviews performed.

4.4 Moderation is pulling a lot of scores to 0

In a previous iteration, it was reported that Moderation seemed to pull a lot of scores to 0, as shown in the figure below.

5 Methodology

In light of the above, we sought to review CW in terms of:

the extent to which reviewers agree with one another [Consistency], and
the extent to which we trust each reviewer’s suggestions [Credibility].
~~the extent to which credibility (accuracy) is maintained over iterations.~~

5.1 Univariate analysis

Descriptive measures under the univariate setting will be applied to summarize central tendencies and distribution for metric variables.

Categorical variables such as gender and nationality will be summarized using frequencies and percentages.
Categorical variables derived from continuous variables will be categorized using acceptable limits derived from the literature.
Continuous variables such as age, CW, and scores will be summarized using the mean and corresponding SD, assuming a Gaussian distribution.
Continuous variables that are skewed will be summarized as median and the corresponding interquartile range (IQR).
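The choice between mean (SD) and median (IQR) can be sketched as follows. The function name `summarise` and the `skewed` flag are hypothetical; how skewness is judged is left to the analyst, and the quartiles use the default method of Python's `statistics.quantiles`.

```python
from statistics import mean, stdev, median, quantiles


def summarise(values, skewed=False):
    """Return (centre, spread): mean and SD, or median and (Q1, Q3) for skewed data."""
    if skewed:
        q1, _, q3 = quantiles(values, n=4)  # quartile cut points
        return median(values), (q1, q3)
    return mean(values), stdev(values)


# Roughly symmetric data: report mean and SD.
centre, spread = summarise([1, 2, 3, 4, 5])

# One extreme value: the median and IQR are less affected by it.
med, (q1, q3) = summarise([1, 2, 3, 4, 100], skewed=True)
```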

5.2 Bivariate setting

Association between categorical variables will be assessed using Pearson’s Chi-Square test.
Association between categorical and continuous variables will be assessed using a two-sample t-test, assuming a Gaussian distribution for continuous variables.
Association between categorical and continuous variables that are not normally distributed will be assessed using a two-sample Wilcoxon rank-sum test.
The Kruskal-Wallis test (one-way ANOVA on ranks) will be used if the categorical variable has more than two levels.
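As an illustration of the first test in the list, the sketch below computes Pearson's chi-square statistic for a 2x2 table using only the standard library. It is restricted to one degree of freedom (where the p-value has a closed form via `math.erfc`) and applies no continuity correction; the function name is hypothetical. The usage example borrows the female/male counts for students and teachers from Table 1 purely as sample input.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 df, no correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row[i] * col[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: student, teacher. Columns: female, male (counts from Table 1).
stat, p_value = chi_square_2x2([[2229, 3134], [4, 87]])
```

For larger tables or small expected counts, a statistics package (or Fisher's exact test, as used in Table 1) would be the more appropriate tool.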

5.3 Regression Modeling

Where applicable:

For a continuous response, a linear regression model will be fitted. Effect modifiers and potential confounders will be adjusted for. Regression coefficients and the corresponding 95% confidence intervals will be reported.
For a binary response, the association between multiple risk factors and the response variable will be assessed by fitting multiple logistic regression models. Effect modifiers and potential confounders will be adjusted for. Odds ratios and the corresponding 95% confidence intervals will be reported.
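Reporting an odds ratio with its confidence interval from a fitted logistic model amounts to exponentiating the coefficient and its Wald interval. The sketch below shows that conversion only; it does not fit a model, and the inputs `beta` and `se` are assumed to come from whichever regression routine is used.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(beta, se, level=0.95):
    """Convert a logistic coefficient and its standard error to an odds ratio with a Wald CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return math.exp(beta), (math.exp(beta - z * se), math.exp(beta + z * se))


# A coefficient of 0 corresponds to an odds ratio of 1 (no association).
odds_ratio, (lower, upper) = odds_ratio_ci(0.0, 0.1)
```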

6 Data sources

Game data was collected over a six-week period from 16 August 2021 to 20 October 2021 (excluding school holidays and weeks in which learners wrote exams) using an online web app hosted on the AWS platform.

We are leveraging four main datasets from Wavumbuzi game-play logs:

6.1 Submission

Data summary
Name	submission.df
Number of rows	175976
Number of columns	26
_______________________
Column type frequency:
logical	4
numeric	15
POSIXct	7
________________________
Group variables	None

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
test	0	1	0	FAL: 175976
demo	0	1	0	FAL: 175976
admin_score	175976	0	NaN	:
appeal_score	175976	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	88342.80	50881.12	105	44304.75	88335.5	132419.2	176413	▇▇▇▇▇
student_id	0	1.00	5127.63	2844.56	20	2746.00	4796.0	7366.0	10814	▆▇▆▅▅
challenge_level_id	0	1.00	423.69	329.99	8	124.00	356.0	690.0	1127	▇▅▃▃▂
teacher_id	172719	0.02	5511.96	3502.07	305	777.00	8083.0	8150.0	8156	▅▁▁▁▇
final_score	4400	0.97	108.36	125.16	0	40.00	60.0	146.0	1000	▇▁▁▁▁
teacherReview_id	172719	0.02	189516.22	98677.71	49554	105456.00	166570.0	279875.0	367988	▇▆▃▅▅
reported_count	0	1.00	0.00	0.00	0	0.00	0.0	0.0	0	▁▁▇▁▁
number_of_reviews_for_finalisation	111434	0.37	3.13	0.37	3	3.00	3.0	3.0	5	▇▁▁▁▁
number_of_seconds_until_finalisation	8126	0.95	32350.63	97845.30	0	1.00	2.0	13659.0	1636998	▇▁▁▁▁
moderator_score	173358	0.01	102.36	155.98	0	0.00	50.0	131.0	1000	▇▁▁▁▁
inappropriate_score	174334	0.01	0.00	0.00	0	0.00	0.0	0.0	0	▁▁▇▁▁
remarked_score	175189	0.00	302.63	256.77	0	128.50	213.0	459.5	1000	▇▃▃▁▁
system_score	7604	0.96	109.97	124.72	0	40.00	60.0	149.0	1000	▇▁▁▁▁
finalization_reviews_stdev	0	1.00	8.53	17.60	0	0.00	0.0	11.8	150	▇▁▁▁▁
finalization_reviews_flat_average	0	1.00	74.55	125.53	0	0.00	0.0	126.0	1000	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
submitted_on	4400	0.97	2021-10-25 05:21:11	2021-12-07 13:23:15	2021-11-11 14:36:10	163021
updated_on	3195	0.98	2021-10-25 05:21:11	2021-12-07 13:23:15	2021-11-11 14:59:32	164083
started_on	0	1.00	2021-10-25 05:20:33	2021-12-07 13:44:54	2021-11-11 11:54:41	166579
finalised_on	4400	0.97	2021-10-25 05:21:12	2021-12-09 07:41:43	2021-11-12 05:19:05	165335
deletedAt	175976	0.00	NA	NA	NA	0
moderation_started_on	172719	0.02	2021-11-01 10:42:37	2021-12-08 22:49:43	2021-11-16 09:30:21	3256
moderation_completed_on	172726	0.02	2021-11-01 10:46:34	2021-12-09 07:41:43	2021-11-16 17:56:07	3247

6.2 Credibility Weighting History

Data summary
Name	submission_reviews.df
Number of rows	357492
Number of columns	13
_______________________
Column type frequency:
character	1
logical	3
numeric	7
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
status	0	1	8	10	0	2	0

Variable type: logical

skim_variable	complete_rate	mean	count
test	1	0.00	FAL: 357492
discretionary	1	0.03	FAL: 346287, TRU: 11205
demo	1	0.00	FAL: 357492

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	1.863538e+05	105727.05	43	9.546975e+04	187373.5	277830.2	367988	▇▇▇▇▇
submission_id	0	1.00	1.027353e+05	48199.84	168	6.396375e+04	109838.0	145208.0	176411	▃▅▆▇▇
initiated_submission_id	190117	0.47	1.063469e+05	47591.87	171	6.948150e+04	115612.0	147284.0	176410	▃▅▅▇▇
reviewer_id	0	1.00	3.684960e+03	3172.83	17	1.033000e+03	2645.0	5817.0	10799	▇▃▂▂▂
points_awarded	0	1.00	1.840700e+02	138.77	0	1.000000e+02	150.0	219.0	1000	▇▂▁▁▁
points_available_to_be_awarded	5140	0.99	2.865800e+02	172.75	0	1.500000e+02	249.0	500.0	1000	▇▅▅▁▁
review_last_fetch_time	6411	0.98	1.637115e+09	1000615.97	1635149204	1.636271e+09	1637222344.0	1637957019.0	1639580509	▆▆▇▇▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
created_on	0	1	2021-10-25 08:06:44	2021-12-08 22:49:43	2021-11-18 06:27:03	331489
deletedAt	357492	0	NA	NA	NA	0

6.3 Submission Score History

Data summary
Name	sub_score_hist.df
Number of rows	179451
Number of columns	11
_______________________
Column type frequency:
character	2
logical	2
numeric	5
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
scoreType	0	1.00	12	19	0	4	0
comment	176020	0.02	3	240	0	1684	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
deletedAt	179451	0	NaN	:
deletedBy_id	179451	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	102583.36	56748.20	1	54293.5	105621	151442.5	196453	▇▆▇▇▇
submission_id	0	1.00	88583.78	50676.96	1	44972.5	88874	132237.5	176412	▇▇▇▇▇
score	0	1.00	112.49	127.85	0	40.0	60	150.0	1000	▇▁▁▁▁
createdBy_id	5310	0.97	4362.07	3214.22	17	1561.0	3918	7190.0	10814	▇▅▅▃▃
updatedBy_id	5310	0.97	4362.07	3214.22	17	1561.0	3918	7190.0	10814	▇▅▅▃▃

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
createdAt	0	1	2021-10-21 17:20:05	2021-12-09 07:41:44	2021-11-12 11:46:29	172717
updatedAt	0	1	2021-10-21 17:20:05	2021-12-09 07:41:44	2021-11-12 11:46:29	172717

6.4 Users

Truncated due to large table

Data summary
Name	users.df[25:50]
Number of rows	10732
Number of columns	26
_______________________
Column type frequency:
character	5
logical	8
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
gender	5270	0.51	4	6	3
id_number	10066	0.06	1	44	605
title	9314	0.13	2	5	12
role	10545	0.02	12	18	2
token	1477	0.86	20	20	9255

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
allan_gray_scholar	1477	0.86	0.00	FAL: 9255
subscribed_to_whatsapp	1	1.00	0.21	FAL: 8440, TRU: 2291
new_registration	9263	0.14	0.31	FAL: 1017, TRU: 452
survey_completed_at	10732	0.00	NaN	:
test	1477	0.86	0.00	FAL: 9255
nationality	10732	0.00	NaN	:
hubspot_association	0	1.00	0.11	FAL: 9580, TRU: 1152
late_sign_up	1477	0.86	0.00	FAL: 9255

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
school_class_id	10441	0.03	193.15	107.34	13	49	239	287.00	313	▅▁▂▁▇
verified	2938	0.73	1.00	0.00	1	1	1	1.00	1	▁▁▇▁▁
leaderboard_all_rank	0	1.00	3774.31	1554.62	1	2683	4922	4922.00	4922	▁▁▁▂▇
leaderboard_week_rank	0	1.00	0.00	0.00	0	0	0	0.00	0	▁▁▇▁▁
number_of_submissions	0	1.00	15.99	36.71	0	0	0	11.00	158	▇▁▁▁▁
points	0	1.00	2110.08	5664.28	0	0	0	1170.00	32916	▇▁▁▁▁
week_points	0	1.00	0.00	0.00	0	0	0	0.00	0	▁▁▇▁▁
challenge_points	0	1.00	1839.14	5152.45	0	0	0	896.25	30600	▇▁▁▁▁
other_points	0	1.00	242.06	555.18	0	0	0	220.00	4438	▇▁▁▁▁
average_points_per_challenge	0	1.00	31.05	48.26	0	0	0	60.00	227	▇▂▁▁▁
average_points_per_week	0	1.00	603.86	1748.06	0	0	0	575.00	30450	▇▁▁▁▁
best_week_points	0	1.00	596.66	1942.49	0	0	0	550.00	30450	▇▁▁▁▁
best_challenge_points	0	1.00	155.04	254.94	0	0	0	200.00	1000	▇▁▂▁▁

7 Results

7.1 Study Cohort

7.1.1 Engagement & Peer-review Summary

Table 1: Demographics, Engagement and Moderation Summary
Characteristic	Overall, N = 10,732¹	student, N = 9,255¹	teacher, N = 1,469¹	user, N = 8¹	p-value²
age	17.58 (4.01)	17.58 (4.01)	NA (NA)	NA (NA)
N/A	6,807	5,330	1,469	8
gender					<0.001
female	41% (2,233)	42% (2,229)	4.4% (4)	NA% (0)
male	59% (3,221)	58% (3,134)	96% (87)	NA% (0)
other	0.1% (8)	0.1% (8)	0% (0)	NA% (0)
N/A	5,270	3,884	1,378	8
average_review_score	34 (104)	39 (111)	0 (0)	0 (0)	<0.001
average_points_per_challenge	31 (48)	36 (50)	0 (0)	0 (0)	<0.001
credibilityWeighting	0.48 (0.20)	0.44 (0.16)	0.71 (0.27)	0.00 (NA)	<0.001
N/A	7	0	0	7
number_of_submissions	16 (37)	19 (39)	0 (0)	0 (0)	<0.001
challenge_points	1,839 (5,152)	2,133 (5,492)	0 (0)	0 (0)	<0.001
best_challenge_points	155 (255)	180 (266)	0 (0)	0 (0)	<0.001
average_points_per_week	604 (1,748)	700 (1,864)	0 (0)	0 (0)	<0.001
login_count	479 (1,293)	550 (1,378)	38 (140)	152 (206)	<0.001
school_grade_id					>0.9
1	1.3% (70)	1.3% (70)	NA% (0)	NA% (0)
2	2.4% (127)	2.4% (127)	NA% (0)	NA% (0)
3	7.4% (401)	7.4% (401)	NA% (0)	NA% (0)
4	10% (566)	10% (566)	NA% (0)	NA% (0)
5	28% (1,493)	28% (1,493)	NA% (0)	NA% (0)
6	24% (1,294)	24% (1,294)	NA% (0)	NA% (0)
7	27% (1,444)	27% (1,444)	NA% (0)	NA% (0)
N/A	5,337	3,860	1,469	8
home_schooled	1.0% (109)	1.2% (109)	0% (0)	0% (0)	<0.001
profile_percentage_complete
0	88% (9,407)	100% (9,255)	9.8% (144)	100% (8)
22	<0.1% (10)	0% (0)	0.7% (10)	0% (0)
33	2.0% (217)	0% (0)	15% (217)	0% (0)
44	6.6% (706)	0% (0)	48% (706)	0% (0)
56	3.5% (372)	0% (0)	25% (372)	0% (0)
67	0.2% (20)	0% (0)	1.4% (20)	0% (0)
¹ Mean (SD); % (n)
² Fisher's exact test; Kruskal-Wallis rank sum test

7.1.2 Peer-review submission distribution by User-type

Teachers perform more reviews than students on average. Note that there are fewer teachers than learners.

7.1.3 Incomplete peer reviews by User-type

Students: 4,790 incomplete reviews (93%)
Teachers: 346 incomplete reviews (7%)
Point of concern: Students have higher incompletion rates than teachers.
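The percentages above are each group's share of all incomplete reviews. The incompletion rate within each group can be derived from the same counts together with the total reviews per group reported in Table 4, as this short calculation shows (the variable names are illustrative):

```python
# Counts taken from Table 4 of this report.
incomplete = {"student": 4790, "teacher": 346}
total_reviews = {"student": 187181, "teacher": 187099}

all_incomplete = sum(incomplete.values())
share = {}  # share of all incomplete reviews contributed by each group
rate = {}   # incompletion rate within each group
for group in incomplete:
    share[group] = incomplete[group] / all_incomplete
    rate[group] = incomplete[group] / total_reviews[group]
    print(f"{group}: share={share[group]:.1%}, rate={rate[group]:.2%}")
```

Because both groups submitted almost the same number of reviews in total, the within-group rates differ by roughly the same factor as the shares.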

7.1.4 Peer-review submission distribution by Points Awarded

Points awarded through peer review are higher for teachers than for students.

7.1.5 Peer-review submission distribution by CW scores

8,215/187,099 (4.4%) reviews had CW > 1.
Point of concern: The teacher’s maximum CW is 2, while the student’s maximum CW is 1. Was this intentional?

7.2 Credibility Analysis

Remember: credibility is the extent to which we trust each reviewer’s suggestions

7.2.1 CW directionality for the entire cohort

By fitting a regression line across the entire cohort, we can determine the overall direction of change in credibility weights.

Overall, the change in CW has a positive slope and a positive correlation coefficient, meaning that, on average, updates in CW are positively correlated with time (day).
The change in CW (slope) is more pronounced from 20 November onwards.

This rules out the problem, reported during the Kenya 2021 Challenge (August), of CW updates mainly going down and not up.
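The directionality check can be sketched as an ordinary least-squares fit of CW change against day. The data below are made up for illustration, and the function name `ols_fit` is hypothetical; the report's own fit was run on the full cohort.

```python
def ols_fit(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data: day index versus average change in CW on that day.
days = [0, 1, 2, 3, 4]
cw_change = [-0.02, 0.00, 0.01, 0.03, 0.03]
slope, intercept = ols_fit(days, cw_change)
```

A positive slope, as in this toy series, indicates that CW updates trend upwards over time rather than only decreasing.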

7.2.2 CW Volatility assessment

Comparing this with the changes in CW from the previous iteration (depicted in Fig. 4):

There is more trust in how the credibility scores are updated.
The updates no longer fluctuate as wildly for most users as they did in the previous report.
For most users, CW updates are smoother and more predictable than in previous iterations.
Volatility in CW (an undesirable characteristic) has reduced significantly.

Questions to investigate

Why do we see a significant change in CW only during the first 15 days and the last 20 days of the challenge?

7.2.3 Correlation analysis

The correlation coefficient can also be a good indicator to assess credibility in terms of the directionality of CW updates.


A positive or negative correlation for change in CW over time (day) is desirable. Why? If 90% of users had a coefficient of 0 for CW change over time, it would mean our peer review algorithm is not working.

A correlation above +0.1 or below -0.1 is even more desirable, as it lends more credibility to the peer review algorithm.
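The per-user check described above can be sketched as follows: compute each user's Pearson correlation between update order and CW, then report the share of users beyond the ±0.1 threshold. The helper names, the use of update order as a stand-in for day, and the treatment of a flat CW series as zero correlation are assumptions for illustration.

```python
def pearson(x, y):
    """Pearson correlation coefficient; a constant series is treated as r = 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat CW history counts as zero correlation
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def share_with_trend(cw_by_user, threshold=0.1):
    """Fraction of users whose CW-versus-time correlation exceeds the threshold in size."""
    coefficients = [pearson(list(range(len(cw))), cw) for cw in cw_by_user.values()]
    return sum(abs(r) > threshold for r in coefficients) / len(coefficients)


# Toy histories: two users with a clear trend, two whose CW never moves.
toy = {"a": [0.5, 0.6, 0.7], "b": [0.5, 0.5, 0.5], "c": [0.9, 0.7, 0.4], "d": [0.5, 0.5, 0.5]}
trend_share = share_with_trend(toy)
```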

7.3 Consistency Analysis

Remember: consistency is the extent to which reviewers agree with one another

7.3.1 Consistency in how CWs are updated

As a rule of thumb, the number of CW updates should equal the number of reviews submitted. However, in a previous iteration, it was reported that 16% of users (~280) who made reviews had more credibility update entries than the number of reviews performed.

Table 4: Trust in how CW are updated
Characteristic	Overall, N = 2,503 (95% CI)¹,²	student, N = 2,031 (95% CI)¹,²	teacher, N = 472 (95% CI)¹,²	p-value³
complete reviews	369,144 (127, 168)	182,391 (85, 95)	186,753 (292, 499)	0.5
incomplete reviews	5,136 (1.9, 2.2)	4,790 (2.2, 2.5)	346 (0.64, 0.82)	<0.001
total reviews	374,280 (129, 170)	187,181 (87, 97)	187,099 (293, 500)	>0.9
CW updates	213,962 (85, 108)	125,779 (66, 73)	88,183 (160, 284)	<0.001
N/A	290	215	75
CW updates / reviews submitted	1,484.51 (0.66, 0.68)	1,312.17 (0.72, 0.73)	172.34 (0.42, 0.45)	<0.001
N/A	290	215	75
Reviews submitted - CW updates	155,046 (58, 82)	56,582 (29, 33)	98,464 (186, 310)	<0.001
N/A	290	215	75
CW updates > reviews performed	0% (0) (0.00%, 0.22%)	0% (0) (0.00%, 0.26%)	0% (0) (0.00%, 1.2%)
N/A	290	215	75
CW updates < reviews performed	91% (2,011) (90%, 92%)	90% (1,626) (88%, 91%)	97% (385) (95%, 98%)	<0.001
N/A	290	215	75
¹ Sum; % (n)
² CI = Confidence Interval
³ Wilcoxon rank sum test; Pearson's Chi-squared test

It seems that the bug reported in the previous iteration was fixed, since our analysis found no users whose number of CW updates exceeded the number of reviews performed.
Point of concern: Why do 91% (2,011) of users have fewer CW updates than reviews performed?
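The comparison behind Table 4 can be sketched as a per-user count of reviews against CW update entries. The function name and the input format (one user id per review row and per CW update row) are assumptions; the real tables would first need to be reduced to these id lists.

```python
from collections import Counter


def classify_users(review_user_ids, cw_update_user_ids):
    """Count reviewers whose CW updates are more than, equal to, or fewer than their reviews."""
    reviews = Counter(review_user_ids)
    updates = Counter(cw_update_user_ids)
    result = Counter()
    for user, n_reviews in reviews.items():
        n_updates = updates.get(user, 0)
        if n_updates > n_reviews:
            result["more"] += 1
        elif n_updates < n_reviews:
            result["fewer"] += 1
        else:
            result["equal"] += 1
    return result


# Toy data: user 1 has fewer updates than reviews, user 2 matches, user 3 has more.
summary = classify_users([1, 1, 1, 2, 2, 3], [1, 1, 2, 2, 3, 3])
```

Under the rule of thumb stated above, every reviewer would fall into the "equal" bucket; the "more" bucket corresponds to the earlier bug and the "fewer" bucket to the current point of concern.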

7.3.2 Consistency between Moderated Score and Weighted Score across the entire cohort

In a previous iteration, it was reported that Moderation seemed to pull a lot of scores to 0, as shown in Figure 5 under the Problem Statement.

Overall, the scores are highly correlated with the final score (moderated score), with a correlation coefficient of 0.99.

However, when looking only at scores that were moderated, the correlation coefficient drops by 27%, a good indication that moderation is working as expected.
Only 18% of the moderated scores were pushed to 0.
Compared to the previous iteration, moderation no longer pushes a considerable number of scores to zero, which suggests the issue was fixed.
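The share of moderated scores pushed to 0 can be computed along these lines. This is a sketch under an assumed definition: a score counts as "pushed to 0" when the moderator score is 0 and the pre-moderation score was above 0. The function name and input format are hypothetical, and the report's 18% figure may rest on a different definition.

```python
def share_pushed_to_zero(pairs):
    """pairs: (score_before_moderation, moderator_score) for moderated submissions only."""
    if not pairs:
        raise ValueError("no moderated submissions supplied")
    pushed = sum(1 for before, after in pairs if before > 0 and after == 0)
    return pushed / len(pairs)


# Toy data: only the first submission goes from a positive score to 0.
toy_share = share_pushed_to_zero([(50, 0), (60, 60), (0, 0), (100, 80)])
```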

Wavumbuzi Peer Review Algorithm Audit


8 References

9 Appendix

Table 1: Demographics, Engagement and Moderation Summary
Characteristic	Overall, N = 10,732¹	student, N = 9,255¹	teacher, N = 1,469¹	user, N = 8¹	p-value²
age	17.58 (4.01)	17.58 (4.01)	NA (NA)	NA (NA)
N/A	6,807	5,330	1,469	8
gender					<0.001
female	41% (2,233)	42% (2,229)	4.4% (4)	NA% (0)
male	59% (3,221)	58% (3,134)	96% (87)	NA% (0)
other	0.1% (8)	0.1% (8)	0% (0)	NA% (0)
N/A	5,270	3,884	1,378	8
average_review_score	34 (104)	39 (111)	0 (0)	0 (0)	<0.001
average_points_per_challenge	31 (48)	36 (50)	0 (0)	0 (0)	<0.001
credibilityWeighting	0.48 (0.20)	0.44 (0.16)	0.71 (0.27)	0.00 (NA)	<0.001
N/A	7	0	0	7
number_of_submissions	16 (37)	19 (39)	0 (0)	0 (0)	<0.001
challenge_points	1,839 (5,152)	2,133 (5,492)	0 (0)	0 (0)	<0.001
best_challenge_points	155 (255)	180 (266)	0 (0)	0 (0)	<0.001
average_points_per_week	604 (1,748)	700 (1,864)	0 (0)	0 (0)	<0.001
login_count	479 (1,293)	550 (1,378)	38 (140)	152 (206)	<0.001
school_grade_id					>0.9
1	1.3% (70)	1.3% (70)	NA% (0)	NA% (0)
2	2.4% (127)	2.4% (127)	NA% (0)	NA% (0)
3	7.4% (401)	7.4% (401)	NA% (0)	NA% (0)
4	10% (566)	10% (566)	NA% (0)	NA% (0)
5	28% (1,493)	28% (1,493)	NA% (0)	NA% (0)
6	24% (1,294)	24% (1,294)	NA% (0)	NA% (0)
7	27% (1,444)	27% (1,444)	NA% (0)	NA% (0)
N/A	5,337	3,860	1,469	8
home_schooled	1.0% (109)	1.2% (109)	0% (0)	0% (0)	<0.001
profile_percentage_complete
0	88% (9,407)	100% (9,255)	9.8% (144)	100% (8)
22	<0.1% (10)	0% (0)	0.7% (10)	0% (0)
33	2.0% (217)	0% (0)	15% (217)	0% (0)
44	6.6% (706)	0% (0)	48% (706)	0% (0)
56	3.5% (372)	0% (0)	25% (372)	0% (0)
67	0.2% (20)	0% (0)	1.4% (20)	0% (0)
¹ Mean (SD); % (n)
² Fisher's exact test; Kruskal-Wallis rank sum test

Table 4: Trust in how CW are updated
Characteristic	Overall, N = 2,503 (95% CI)¹,²	student, N = 2,031 (95% CI)¹,²	teacher, N = 472 (95% CI)¹,²	p-value³
complete reviews	369,144 (127, 168)	182,391 (85, 95)	186,753 (292, 499)	0.5
incomplete reviews	5,136 (1.9, 2.2)	4,790 (2.2, 2.5)	346 (0.64, 0.82)	<0.001
total reviews	374,280 (129, 170)	187,181 (87, 97)	187,099 (293, 500)	>0.9
CW updates	213,962 (85, 108)	125,779 (66, 73)	88,183 (160, 284)	<0.001
N/A	290	215	75
CW updates / reviews submitted	1,484.51 (0.66, 0.68)	1,312.17 (0.72, 0.73)	172.34 (0.42, 0.45)	<0.001
N/A	290	215	75
Reviews submitted - CW updates	155,046 (58, 82)	56,582 (29, 33)	98,464 (186, 310)	<0.001
N/A	290	215	75
CW updates > reviews performed	0% (0) (0.00%, 0.22%)	0% (0) (0.00%, 0.26%)	0% (0) (0.00%, 1.2%)
N/A	290	215	75
CW updates < reviews performed	91% (2,011) (90%, 92%)	90% (1,626) (88%, 91%)	97% (385) (95%, 98%)	<0.001
N/A	290	215	75
¹ Sum; % (n)
² CI = Confidence Interval
³ Wilcoxon rank sum test; Pearson's Chi-squared test