The omitted variable: could DuoTest enable a new way to assess the link between individual and team performance in team-based learning?

WELCOME TO OUR PRESENTATION

If the video does not load automatically, you can find it here: How to assess online the individual performance in team projects?

Our article is available here: link

Corresponding author: Riccardo Bonazzi (University of Applied Sciences Western Switzerland, HES-SO)

We have decided to treat the current situation as an opportunity to try a new presentation format that combines (a) a short video (2 minutes) and (b) a set of slides that describe our idea in more detail. Hence, this storyboard illustrates our algorithm in a way that is …

structured. Thanks to R Markdown, the text of each slide is embedded with the code of the algorithm, so you can see how the data is created.
dynamic. To test our models, we split our data into a training sample and a testing sample. In the original version executed in RStudio, the results change every time you reload the page, which shows that we did not pick a specific set of parameters to obtain good results. In this static version on the web, the results do not change over time.
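
The idea of a split that changes on every run can be sketched as follows. This is a minimal illustration in Python rather than the R code of the original dashboard; the function name `split_train_test`, the student identifiers, and the default 70% training share (the proportion mentioned in the evaluation slide) are assumptions made for the example.

```python
import random

def split_train_test(records, train_share=0.7, rng=None):
    """Shuffle the records and split them into training and testing samples."""
    rng = rng or random.Random()  # no fixed seed: a different split on every run
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical student identifiers standing in for the real exam records.
students = [f"student_{i:02d}" for i in range(1, 21)]
train, test = split_train_test(students)
print(len(train), len(test))  # 14 6
```

Passing a seeded generator, for example `rng=random.Random(1)`, would freeze the split, which is roughly what the static web version does.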

At the end of the dashboard, we share some of the new insights that we are obtaining from the data of this semester.

02) PROBLEM STATEMENT

In this article we study learning analytics for experiential learning.

Team projects can teach students how to work together, but assessing each individual's contribution to a team project is hard, and no existing tool can easily assess group effects on individual performance.

To solve our problem, we assess “performativity”, which measures how much students build on each other’s contributions, and we ask ourselves the following research question: How to rapidly assess performativity in project-based learning scenarios?

In this presentation, we want to improve the immediate feedback assessment technique (IF-AT) by developing an online solution that lets students take the final exam by themselves and then earn partial credit if they manage to correct their mistakes by discussing them with their team members. This way, we could measure the degree of performativity in each team.
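
A partial-credit rule of this kind could look like the sketch below. It is only an illustration: the function `duo_score`, the weight of 0.5 for a corrected mistake, and the sample answers are hypothetical and are not taken from our exam.

```python
def duo_score(individual_correct, team_correct, partial_credit=0.5):
    """Score one question answered first alone, then again with the team.

    partial_credit is a hypothetical weight, not a value from the study.
    """
    if individual_correct:
        return 1.0             # right the first time: full credit
    if team_correct:
        return partial_credit  # mistake corrected after the team discussion
    return 0.0                 # still wrong after the discussion

# Hypothetical answers to four questions: (alone, with the team)
answers = [(True, True), (False, True), (False, False), (True, True)]
total = sum(duo_score(alone, team) for alone, team in answers)
print(total)  # 2.5
```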

03) DESIGN SCIENCE METHODOLOGY

We position our study in the field of design science research and we describe how we developed an artefact in the shape of a prototype, by following the guidelines of Peffers et al. (2007). This presentation has been built around the 6 steps suggested by the authors.

(i - ii) Identify the problem and define the objectives of the solution –> Slide 02
(iii) Design and development of the artefact –> Slide 04
(iv) Demonstration of our methodology –> Slide 05
(v) Evaluation (preliminary data) –> Slide 06
(vi) Communication: besides presenting our results at HICSS, we have shared our preliminary insights with colleagues whose courses have been disrupted by the Covid-19 situation.

04) DESIGN AND DEVELOPMENT: The DuoTest prototype to assess group effects

The underlying idea of DuoTest is simple: to allow students to do their final exams twice in a row.

The first time, participants do their exam individually (Exa01); the second time, they solve the same exam in groups (Exa02).

By comparing individual and team performances, the system infers the positive (or negative) effect of each group on the individual performances.

Previous studies have already used Rapid Assessment Tests both at the individual and group level; nonetheless, they did not use those tests during the final evaluation and they did try to minimize the effort required to perform such assessments.

05) DEMONSTRATION: OUR ARTEFACT

The figure illustrates in detail how DuoTest can be built with an open-source learning management system (Moodle) and how the data can be analyzed with RStudio to assess team health and transactivity.

This example comes from 70 bachelor students who took the exam in January 2020 after having done a group project with a firm during the fall semester.

Step 1: Students submit their individual answers to a test online
Step 2: Students solve the same test in groups and submit their individual answers
Step 3: A (multilevel) model uses the two tests to assess the group effect
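
To give an intuition for Step 3, the sketch below computes a much simpler quantity than our model: the mean gain between the individual test (Exa01) and the group test (Exa02) for each team. The records, the group labels, and the function `mean_gain_by_group` are hypothetical, and the mean gain is a deliberate simplification of the regression with a group factor shown in the appendix.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group, individual score Exa01, group-round score Exa02).
scores = [
    ("A", 4.0, 5.0), ("A", 3.5, 5.0), ("A", 4.5, 5.5),
    ("B", 4.5, 4.0), ("B", 5.0, 4.5),
]

def mean_gain_by_group(records):
    """Average Exa02 - Exa01 per group: positive means the team lifted its members."""
    gains = defaultdict(list)
    for group, exa01, exa02 in records:
        gains[group].append(exa02 - exa01)
    return {group: mean(values) for group, values in gains.items()}

effects = mean_gain_by_group(scores)
for group, gain in sorted(effects.items()):
    print(group, round(gain, 2))
```

In this toy data, group A shows a positive effect and group B a negative one.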

The overall test can be implemented with open-source technology (we used Moodle for our demonstration). A test of this type takes about 4 hours to design and, once the rules are set, it is corrected automatically, like a multiple-choice questionnaire (MCQ).

06) EVALUATION: PRELIMINARY DATA ANALYSIS

We tested our prototype with three classes of undergraduate students taking the same course, for a total of 71 students attending the final exam in Sierre (Switzerland) on 20 January 2020.

We claim that the exam was (a) valid, since the chosen questions provide useful information about the concepts seen in class, (b) reliable, thanks to the rule-driven correction of each question, and (c) recognizable, since it fully replicated the way students work during the semester.

The second graph shows that we can assign to each team a trendline, which describes the positive or negative group effect. The adjusted R² of the linear regression is high (0.70).

In the next slides we show the detailed analysis. We take group 9 as reference, since it has the lowest score.

07) COMMUNICATION: IF WE CHANGE THE FINAL EXAM, DO WE STILL NEED A MIDTERM EVALUATION?

Insight 01: If high midterm scores lead to low final exam scores, what is the function of the midterm exam?
Insight 02: Each class might have a different degree of peer-learning effect.

The decision tree model produces a set of rules to predict the Exa02 scores. Let’s try to assess its predictive power.

In this table, the predicted values are shown as rows and the real values as columns. The diagonal of the table shows how many times the system has correctly guessed the scores of the students.

    
     9 8 10 7
  9  5 1  2 0
  8  1 3  1 3
  10 1 1  1 1
  7  0 0  0 0

The accuracy of the model can be calculated as the sum of the values on the diagonal divided by the total number of predictions in the table.

Accuracy 
    0.45
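
As a quick check, the accuracy can be recomputed by hand from the table above. The sketch below is written in Python rather than in the R of the original dashboard; the variable names are ours, while the counts are copied from the table.

```python
# Confusion matrix from the table above: rows are predicted scores, columns are real scores.
labels = ["9", "8", "10", "7"]
matrix = [
    [5, 1, 2, 0],
    [1, 3, 1, 3],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

correct = sum(matrix[i][i] for i in range(len(labels)))  # values on the diagonal
total = sum(sum(row) for row in matrix)                  # all predictions
accuracy = correct / total
print(correct, total, accuracy)  # 9 20 0.45
```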

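Is this value reliable? We took a random sample of 70% of the data to train the algorithm and 30% to test it, so the accuracy depends on which students fall into each sample. The reported p-value is consistent with a one-sided test of whether the accuracy exceeds the no-information rate, that is, the share of the most frequent real score in the test sample (7 out of 20, or 0.35).
The p-value is not very good, probably because we do not have enough data to obtain a very reliable estimate.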

AccuracyPValue 
     0.2376224
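
The reported value can be reproduced under one assumption: that it comes from a one-sided binomial test of the accuracy (9 correct predictions out of 20) against the no-information rate, taken as the share of the most frequent real score. The sketch below is in Python, and the helper `binomial_upper_tail` is ours, not a function of the original analysis.

```python
from math import comb

def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Column totals of the confusion matrix (real scores 9, 8, 10, 7).
column_totals = [7, 5, 4, 4]
n = sum(column_totals)         # 20 students in the test sample
nir = max(column_totals) / n   # no-information rate: 7 / 20 = 0.35
p_value = binomial_upper_tail(9, n, nir)  # 9 correct predictions on the diagonal
print(round(p_value, 4))  # 0.2376
```

Read this way, the p-value says that always guessing the most frequent score would reach an accuracy of 0.45 or more in roughly one test sample out of four, so the tree is not clearly better than that baseline.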

CONCLUSION: CAN WE SHIFT AWAY FROM MIDTERM EXAMS FOR TBL? A QUICK VIEW OF PRELIMINARY DATA FROM 2020

This year, the students did not have a midterm exam; instead, they had peer evaluations every week during their blended learning course.

The first graph shows that peer reviews lead to normally distributed scores between 1.0/6.0 and 6.0/6.0.

The second graph shows the evolution over time of the weekly scores.
A violin plot was chosen instead of a boxplot to better show the distribution of scores (link here to see why).

Phase 01 (W01 to W04): the overall set of students splits into two subsets, visible as a bimodal distribution: (1) active participants and (2) those lagging behind.
Phase 02 (W05 to W08): the two modes of the distribution stretch apart.
Phase 03 (W09 to W12): those lagging behind seem to try to catch up and to commit more.

Week 08 and week 12 were Q&A sessions and do not have a score; W13 is the simulation of the exam and it can be used to test how well our model can predict individual scores.

APPENDIX: USING OTHER DATA FROM THE SEMESTER TO PREDICT THE FINAL SCORE (the DETOTUS approach)


Call:
lm(formula = scale(Exa02) ~ scale(Exa01) + relevel(as.factor(Group), 
    ref = "9") + lowAttendance, data = final_scores)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.62263 -0.18849  0.03382  0.22472  0.86861 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                            -2.18969    0.27916  -7.844 4.36e-10 ***
scale(Exa01)                            0.03951    0.07667   0.515 0.608799    
relevel(as.factor(Group), ref = "9")1   0.12268    0.41865   0.293 0.770777    
relevel(as.factor(Group), ref = "9")2   1.55657    0.38007   4.096 0.000165 ***
relevel(as.factor(Group), ref = "9")3   2.29889    0.44744   5.138 5.27e-06 ***
relevel(as.factor(Group), ref = "9")4   1.54871    0.44384   3.489 0.001063 ** 
relevel(as.factor(Group), ref = "9")5   2.12212    0.41624   5.098 6.03e-06 ***
relevel(as.factor(Group), ref = "9")6   2.28810    0.45089   5.075 6.53e-06 ***
relevel(as.factor(Group), ref = "9")7   2.02183    0.39589   5.107 5.85e-06 ***
relevel(as.factor(Group), ref = "9")8   1.94661    0.36708   5.303 3.00e-06 ***
relevel(as.factor(Group), ref = "9")10  1.23299    0.41765   2.952 0.004912 ** 
relevel(as.factor(Group), ref = "9")11  1.99172    0.40104   4.966 9.43e-06 ***
relevel(as.factor(Group), ref = "9")12  2.14648    0.45263   4.742 2.00e-05 ***
relevel(as.factor(Group), ref = "9")13  0.77245    0.41805   1.848 0.070937 .  
relevel(as.factor(Group), ref = "9")14  2.31402    0.38762   5.970 2.99e-07 ***
relevel(as.factor(Group), ref = "9")15  2.90654    0.40139   7.241 3.53e-09 ***
relevel(as.factor(Group), ref = "9")16  2.75175    0.39660   6.938 1.01e-08 ***
lowAttendance                           0.48309    0.25262   1.912 0.061942 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5257 on 47 degrees of freedom
Multiple R-squared:  0.797, Adjusted R-squared:  0.7236 
F-statistic: 10.86 on 17 and 47 DF,  p-value: 4.884e-11

We create a new model that includes an “Attendance” score, which is measured every week.
The adjusted R² increases slightly (from 0.70 to 0.72), leading us to believe that attendance in a team-based learning environment should not be scored by a weighted grade, but should be treated as crucial for the final score.
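
For readers who want to check the figure, the adjusted R² can be recomputed from the regression summary above with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch is in Python rather than R; the function name is ours, and the number of observations is inferred from the degrees of freedom in the summary (47 residual degrees of freedom plus 17 predictors plus the intercept).

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

r_squared = 0.797     # "Multiple R-squared" in the summary
n_predictors = 17     # Exa01, 15 group dummies and lowAttendance
residual_df = 47      # "on 47 degrees of freedom"
n_obs = residual_df + n_predictors + 1  # 65 observations

adj = adjusted_r_squared(r_squared, n_obs, n_predictors)
print(round(adj, 4))  # 0.7236
```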