Effects of repeating the Wavumbuzi Entrepreneurship Challenge on Engagement

K.R.U

1 Background

This report presents findings on the effects of repeating the Wavumbuzi Entrepreneurship Challenge (hereafter WEC). WEC is a free, annual, 6-week online challenge offered to learners in all secondary/high schools across Kenya and Rwanda. It is a gamified experiential learning process designed to equip learners with the competencies to become the next generation of global leaders, change-makers, and innovative thinkers. Focusing on entrepreneurship (a widely recognised countermeasure to unemployment), WEC is designed to stimulate and develop learners’ entrepreneurial mindset and 21st-century skills.

Learners are not taught. Instead, every week over the 6-week period, learners receive a set of challenges via mobile phones and computers that stimulate them to think like entrepreneurs. Each task requires learners to apply new concepts and use their knowledge and skills to solve real-world challenges. Teachers are trained to guide and encourage learners to engage in and complete the challenges.

2 Premise

This is a longitudinal study that aims to answer questions relating to user re-engagement across WEC iterations/installations.

2.1 Primary Objective

To assess the effects of repeating WEC on engagement in the following iteration. Apart from repeat gameplay the following year, what other variables are affected by re-engagement across iterations?

2.2 Secondary Objective

To identify the key predictors of a learner’s likelihood of repeating WEC. Can we predict how likely a given learner is to re-engage next year? Does participating in one iteration make a learner more likely to participate in the next one?

3 Methodology

3.1 Univariate analysis

Descriptive measures will be applied in the univariate setting to summarize the central tendency and distribution of each variable.

  • Categorical variables such as gender and nationality will be summarized using frequencies and percentages.
  • Categorical variables derived from continuous variables will be categorized using acceptable limits derived from the literature.
  • Continuous variables such as age, submissions, and scores will be summarized using the mean and standard deviation (SD), assuming a Gaussian distribution.
  • Continuous variables that are skewed will be summarized using the median and the corresponding interquartile range (IQR).
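The summaries described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample values are made up, and the `skewed` flag stands in for whatever normality check the analyst applies.

```python
import statistics
from collections import Counter

def summarize_categorical(values):
    """Frequency and percentage for each level of a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {level: (n, 100 * n / total) for level, n in counts.items()}

def summarize_continuous(values, skewed=False):
    """Mean and SD for roughly Gaussian data; median and IQR for skewed data."""
    if skewed:
        # quantiles(n=4) returns the three quartile cut points.
        q1, median, q3 = statistics.quantiles(values, n=4)
        return {"median": median, "iqr": (q1, q3)}
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}

# Made-up illustrative values, not WEC data.
genders = ["F", "M", "F", "F"]
ages = [14, 15, 15, 16, 17, 17, 18]
submissions = [0, 1, 1, 2, 3, 5, 40]  # one heavy user skews the distribution

gender_summary = summarize_categorical(genders)
age_summary = summarize_continuous(ages)
submission_summary = summarize_continuous(submissions, skewed=True)
```

In the skewed example the single value of 40 would pull the mean well above the typical learner, which is why the median and IQR are preferred there.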

3.2 Bivariate setting

  • Association between categorical variables will be assessed using Pearson’s Chi-Square test.
  • Association between a two-level categorical variable and a continuous variable will be assessed using a two-sample t-test, assuming a Gaussian distribution for the continuous variable.
  • Association between a two-level categorical variable and a continuous variable that is not normally distributed will be assessed using a two-sample Wilcoxon rank-sum test.
  • The Kruskal-Wallis test (one-way ANOVA on ranks) will be used if the categorical variable has more than two levels.
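As an illustration of the first test in the list above, the sketch below computes Pearson’s Chi-Square statistic for a 2x2 table without a continuity correction. The counts are invented for the example, and the closed-form p-value used here is valid only for one degree of freedom; larger tables would need a general chi-square distribution function.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = gender, columns = (repeat user, first-time user).
example_table = [[30, 70], [50, 50]]
stat, p_value = chi_square_2x2(example_table)
```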

3.3 Regression Modeling

Where applicable:

  • For a continuous response, a linear regression model will be fitted. Effect modifiers and potential confounders will be adjusted for. Regression coefficients and the corresponding 95% confidence intervals will be reported.
  • For a binary response, the association between multiple risk factors and the response variable will be assessed by fitting multiple logistic regression models. Effect modifiers and potential confounders will be adjusted for. Odds ratios and the corresponding 95% confidence intervals will be reported.
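For the logistic case, an odds ratio and its confidence interval are usually obtained by exponentiating the fitted coefficient and its Wald limits. The sketch below assumes a coefficient and standard error are already available from a fitted model; the numbers are hypothetical and are not taken from this report’s model.

```python
import math

def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Turn a log-odds coefficient into an odds ratio with a Wald 95% CI."""
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return math.exp(coef), lower, upper

def is_significant(lower, upper):
    """At the 5% level the effect is significant if the CI excludes 1."""
    return not (lower <= 1.0 <= upper)

# Hypothetical coefficient and standard error for one predictor.
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(coef=math.log(2), std_err=0.1)
```

An odds ratio of 1 corresponds to a coefficient of 0, which is why significance is read off from whether the interval contains 1.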

3.4 Participant Record Linkage

3.4.1 Probabilistic record linkage

Probabilistic linking is a method for combining information from records on different datasets to form a new linked dataset. It has been described as a process that attempts to link records on different files that have the greatest probability of belonging to the same person/organisation. Whereas deterministic (or exact) linking uses a unique identifier to link datasets, probabilistic linking uses a number of identifiers, in combination, to identify and evaluate links. Probabilistic linking is generally used when a unique identifier is not available or is of insufficient quality.

There are various methods of probabilistic record linkage:

  • Fellegi and Sunter (1969): requires sophisticated software to perform the calculations. The references at the end of this report provide more information about linking algorithms.
  • Active learning: an extension of Fellegi and Sunter (1969) that adds a semi-supervised machine learning implementation to probabilistic record linkage.

3.4.2 The Fellegi and Sunter method

It is a probabilistic approach to solving record linkage problems based on a decision model. Records in data sources are assumed to represent observations of entities taken from a particular population (individuals, companies, enterprises, farms, geographic regions, families, households…). The records are assumed to contain some attributes identifying an individual entity. Examples of identifying attributes are name, address, age, and gender when dealing with people; and a firm’s style (or name), legal form, address, number of local units, number of employees, and turnover value when dealing with businesses.

According to the method, given two (or more) sources of data, all pairs coming from the Cartesian product of the two sources have to be classified into three independent and mutually exclusive subsets: the set of matches, the set of non-matches, and the set of pairs requiring manual review. In order to classify the pairs, the comparisons on common attributes are used to estimate, for each pair, the probabilities of belonging to the set of matches and to the set of non-matches. The pair classification criteria are based on the ratio between these conditional probabilities. The decision model aims to minimize both the misclassification errors and the probability of classifying a pair as belonging to the subset of pairs requiring manual review.
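A minimal sketch of this decision rule is given below, using the common log-ratio form of the match weight and assuming that fields agree or disagree independently of one another. The field names, the m- and u-probabilities, and the two thresholds are all hypothetical values chosen for illustration; in practice they would be estimated from the data.

```python
import math

# Hypothetical per-field probabilities:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PROBS = {
    "first_name": (0.95, 0.01),
    "school": (0.90, 0.05),
    "birth_year": (0.98, 0.20),
}

def match_weight(agreements, field_probs=FIELD_PROBS):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = field_probs[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def classify(weight, lower=0.0, upper=8.0):
    """Assign a pair to one of the three Fellegi-Sunter subsets."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "clerical review"

all_agree = {"first_name": True, "school": True, "birth_year": True}
mixed = {"first_name": True, "school": False, "birth_year": True}
none_agree = {"first_name": False, "school": False, "birth_year": False}
decisions = [classify(match_weight(p)) for p in (all_agree, mixed, none_agree)]
```

Moving the two thresholds closer together shrinks the clerical review set at the cost of more misclassified pairs, which is the trade-off the decision model balances.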

The key steps of probabilistic linking (as shown in Diagram 1) are:

  1. Data cleaning and standardisation
  2. Blocking
  3. Linking
  4. Clerical review
  5. Evaluating data quality

Key terms used in this section:

  • Blocking – divides datasets into groups, called blocks, in order to reduce the number of comparisons that need to be conducted to find which pairs of records should be linked. Only records in corresponding blocks on each dataset are compared to identify possible links.
  • Fields – types of information, such as name, address, and date of birth, on the records in datasets.
  • Linked dataset – the result of linking different datasets is a new dataset whose records contain some information from each of the original datasets.
  • Links – records that have been combined after being assessed as referring to the same entity (i.e., person/family/organisation/region).
  • Unique identifier – a number or code that uniquely identifies a person, business, or organisation, such as passport number or Australian Business Number (ABN).
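To make the blocking idea concrete, the sketch below groups two small datasets by a blocking field and compares only records that fall in the same block. The record layout and the choice of `school` as the blocking key are assumptions made for this example, not a description of the fields used in the WEC linkage.

```python
from collections import defaultdict
from itertools import product

def block_by(records, key):
    """Group records into blocks that share the same value of `key`."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks

def candidate_pairs(left, right, key):
    """Yield only the pairs whose records sit in corresponding blocks."""
    left_blocks = block_by(left, key)
    right_blocks = block_by(right, key)
    for value in left_blocks.keys() & right_blocks.keys():
        yield from product(left_blocks[value], right_blocks[value])

# Hypothetical records from two consecutive iterations.
year_one = [
    {"id": 1, "name": "A. Mwangi", "school": "School A"},
    {"id": 2, "name": "B. Otieno", "school": "School A"},
    {"id": 3, "name": "C. Uwase", "school": "School B"},
]
year_two = [
    {"id": 10, "name": "A Mwangi", "school": "School A"},
    {"id": 11, "name": "C. Uwase", "school": "School B"},
    {"id": 12, "name": "D. Kamau", "school": "School C"},
]

pairs = list(candidate_pairs(year_one, year_two, "school"))
full_comparisons = len(year_one) * len(year_two)
```

Here blocking cuts the nine possible comparisons down to three. The cost is that a learner who changed school between iterations would never be compared, so the blocking field needs to be stable and well recorded.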

3.4.3 Active Learning

The main hypothesis in active learning is that if a learning algorithm can choose the data it wants to learn from, it can perform better than traditional methods with substantially less training data. In active learning, there are typically three scenarios or settings in which the learner queries the labels of instances. The three main scenarios considered in the literature are membership query synthesis, stream-based selective sampling, and pool-based sampling.

Pool-based sampling assumes that there is a large pool of unlabelled data, as in stream-based selective sampling. Instances are then drawn from the pool according to some informativeness measure. This measure is applied to all instances in the pool (or to some subset if the pool is very large), and the most informative instance(s) are selected.
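One common informativeness measure is uncertainty: the pairs whose predicted match probability is closest to 0.5 are the ones the current model is least sure about. The sketch below applies that measure to a small pool. The pair identifiers and probabilities are hypothetical, and uncertainty sampling is only one of several possible measures.

```python
def uncertainty(match_probability):
    """1.0 when the model is maximally unsure (p = 0.5), 0.0 when p is 0 or 1."""
    return 1.0 - 2.0 * abs(match_probability - 0.5)

def select_queries(pool, batch_size=2):
    """Return the ids of the most informative unlabelled pairs in the pool."""
    ranked = sorted(pool, key=lambda item: uncertainty(item[1]), reverse=True)
    return [pair_id for pair_id, _ in ranked[:batch_size]]

# Hypothetical pool: (candidate pair id, current predicted match probability).
pool = [
    ("pair_a", 0.02),
    ("pair_b", 0.48),
    ("pair_c", 0.91),
    ("pair_d", 0.60),
    ("pair_e", 0.99),
]

to_label = select_queries(pool)
```

The selected pairs would be sent for manual labelling, the model refitted with the new labels, and the selection repeated on the remaining pool.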

3.5 Sample Linkage Results

4 Data sources

Game data were collected over the last two years using an online web app hosted on the AWS platform. WEC was first launched in Kenya in 2019 and in Rwanda in 2020. As such:

  • For Kenya, we use the WEC engagement datasets for 2019 and 2020.
  • For Rwanda, we use the WEC engagement datasets for 2020 and 2021.

5 Results

5.1 User Loyalty Overview

5.1.1 Kenya and Rwanda

5.1.2 Country Disaggregation

5.1.2.1 Kenya

5.1.2.2 Rwanda

5.2 Effects of repeating WEC on engagement

To assess the effects of repeating WEC on engagement in the following iteration. Apart from repeat gameplay the following year, what other variables are affected by re-engagement across iterations?

5.2.1 Country Comparison

5.2.2 Kenya and Rwanda

5.2.3 Country Disaggregation

5.2.3.1 Kenya

5.2.3.2 Rwanda

5.3 Predictors of re-engagement

5.3.1 Kenya and Rwanda

To identify the key predictors of a learner repeating WEC. Can we predict how likely a given learner is to re-engage next year? Does participating in one iteration make a learner more likely to participate in the next one?

5.3.1.1 Model Tabulation - Predictors of re-engagement | KE & RW

5.3.1.2 Model visualization | KE & RW

The precision plot shows, for each predictor, the range within which the population odds ratio (OR) is estimated to lie.

A confidence interval helps us evaluate the range of values within which the true population value is likely to lie. Essentially, a 95% CI means that if the study were repeated many times and an interval computed each time, about 95% of those intervals would contain the true value. In our analysis we see precise estimates, i.e., narrow confidence intervals, for most predictors.

5.3.1.3 Average Marginal Effect | KE & RW

The average marginal effect (AME) expresses an effect on the probability scale, i.e., as a change in probability between -1 and 1. It is the average change in the predicted probability when a predictor increases by one unit, holding the other variables constant (for a categorical predictor, the change relative to the reference level).

Since logistic regression is a non-linear model, that effect differs from individual to individual. Therefore, the marginal effect is computed for each individual and then averaged. To express the effect in percentage points, multiply by 100, so that the chance of a learner becoming re-engaged increases by x percentage points.
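The averaging step can be sketched as follows for a continuous predictor in a logistic model, where the marginal effect for one individual is the coefficient multiplied by p(1 - p). The intercept, coefficients, and rows are hypothetical. For a binary predictor the usual approach is instead to average the difference in predicted probability between the two levels, which this sketch does not cover.

```python
import math

def predicted_probability(intercept, coefs, row):
    """Logistic model prediction for one individual."""
    linear = intercept + sum(coefs[name] * row[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-linear))

def average_marginal_effect(intercept, coefs, rows, variable):
    """Mean over individuals of d p / d x = beta * p * (1 - p)."""
    effects = []
    for row in rows:
        p = predicted_probability(intercept, coefs, row)
        effects.append(coefs[variable] * p * (1.0 - p))
    return sum(effects) / len(effects)

# Hypothetical fitted model and data, for illustration only.
intercept = -1.0
coefs = {"login_count": 0.05, "age": -0.02}
rows = [
    {"login_count": 3, "age": 15},
    {"login_count": 20, "age": 17},
    {"login_count": 8, "age": 16},
]

ame_login = average_marginal_effect(intercept, coefs, rows, "login_count")
ame_login_percentage_points = 100 * ame_login
```

Because p(1 - p) never exceeds 0.25, the AME of a continuous predictor in a logistic model is at most a quarter of its coefficient in absolute value.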

5.3.1.4 Model Validation Checks | KE & RW

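An ROC (receiver operating characteristic) curve is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters:

  • True Positive Rate: the proportion of actual repeat users that the model classifies as repeat users.
  • False Positive Rate: the proportion of actual first-time users that the model classifies as repeat users.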

## 
## Call:
## roc.formula(formula = Loyalty ~ final.logit$fitted.values, data = model.df1,     plot = TRUE, grid = TRUE, print.auc = TRUE, show.thres = TRUE,     ci = TRUE, boot.n = 100, ci.alpha = 0.9, stratified = FALSE,     main = "ROC Curve", col = "blue")
## 
## Data: final.logit$fitted.values in 7260 controls (Loyalty 1st Time Users) < 2638 cases (Loyalty Repeat Users).
## Area under the curve: 0.6965
## 95% CI: 0.6848-0.7082 (DeLong)
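To show where a figure such as the area under the curve comes from, the sketch below builds the ROC points by sweeping the threshold over the predicted scores and integrates them with the trapezoidal rule. It is a plain Python illustration with made-up labels and scores, not the pROC routine that produced the output above, and it does not compute the DeLong confidence interval.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) at every score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Classify as positive every case scoring at or above the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up example: 1 = repeat user, 0 = first-time user.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
```

An AUC of 0.5 corresponds to ranking learners at random and 1.0 to perfect separation of repeat users from first-time users.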

5.3.1.5 Model Interpretation | KE & RW

Model Interpretation | Re-engagement

Each odds ratio (OR) is reported with its 95% confidence interval, holding all other variables constant. For a continuous predictor, the OR is the multiplicative change in the odds of re-engagement per one-unit increase; for a categorical predictor, it compares the odds at that level with the odds at the reference level. An OR above 1 indicates higher odds of re-engagement and an OR below 1 indicates lower odds. Values are rounded, so an OR or interval limit shown as 1 may differ slightly from exactly 1.

Variable | OR (95% CI) | Significant
Age | 0.99 (0.98, 1) | No
number_of_submissions | 0.99 (0.95, 1.02) | No
points | 0.98 (0.94, 1.02) | No
challenge_points | 1.02 (0.98, 1.06) | No
other_points | 1.02 (0.98, 1.06) | No
average_points_per_challenge | 1 (0.99, 1) | No
average_points_per_week | 1 (1, 1) | No
best_week_points | 1 (1, 1) | Yes
best_challenge_points | 1 (1, 1.01) | No
average_review_score | 1 (0.99, 1) | Yes
number_of_finalisations | 1.02 (1, 1.04) | No
average_number_of_seconds_until_finalisation | 1 (1, 1) | Yes
login_count | 1 (1, 1) | No
leaderboard_all_rank | 1 (1, 1) | Yes
password_requested | 1.77 (1.3, 2.38) | Yes
ever_login | 0.85 (0.59, 1.23) | No
email_setup | 0.8 (0.63, 1.01) | No
mobile_setup | 1 (0.81, 1.23) | No
school_setup | 1.48 (1.22, 1.79) | Yes
avatar_setup | 0.9 (0.75, 1.07) | No
id_number_setup | 1.94 (1.53, 2.47) | Yes
school_gradeForm 2 / Senior 2 | 1.21 (0.87, 1.69) | No
school_gradeForm 3 / Senior 3 | 1.33 (0.97, 1.84) | No
school_gradeForm 4 / Senior 4 | 1.33 (0.98, 1.8) | No
school_gradeIncomplete Profile | 0.81 (0.52, 1.28) | No
school_gradeNot in high school | 2.4 (1.06, 5.01) | Yes
school_gradeSenior 5 | 1.38 (1.03, 1.87) | Yes
school_gradeSenior 6 | 1.2 (0.88, 1.65) | No
leaderboard_school_rank | 1 (1, 1) | Yes
enabledTRUE | 1.51 (0.88, 2.64) | No
Verified | 0.52 (0.3, 0.88) | Yes
countryRwanda | 5.15 (3.28, 8.14) | Yes

6 References

  • https://www.datacamp.com/community/tutorials/active-learning
  • Fellegi, I. and Sunter, A. 1969, ‘A Theory for Record Linkage’, Journal of the American Statistical Association, Vol.64, no.328, pp. 1183-1210.
  • Winkler, W.E. 2006, ‘Overview of Record Linkage and Current Research Directions’, Research Report Series, no. 2006–2, Statistical Research Division, U.S. Census Bureau, Washington.
  • Christen, P. 2012, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer, Canberra.
  • Herzog, T.N., Scheuren, F.J. and Winkler, W.E. 2007, Data quality and record linkage techniques, Springer, New York.
  • Jaro, M. 1995, ‘Probabilistic Linkage of Large Public Health Data Files’, Statistics in Medicine, Vol. 14, pp. 491-498.
  • Samuels, C. 2012, ‘Using the EM Algorithm to Estimate the Parameters of the Fellegi-Sunter Model for Data Linking research paper’, Methodology Advisory Committee Paper, Cat. no. 1352.0.55.120, Australian Bureau of Statistics, Canberra.
  • Statistics New Zealand 2006, Data Integration Manual, Statistics New Zealand, Wellington.
  • Winkler, W.E. 1990, ‘String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage’, Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359.

7 Appendix