Were some questions answered by more skillful workers?

The goal of the study

The distribution of worker skill level across questions has the potential to bias the outcome of the larger crowdsourcing task. The reason is that a software fault can be overlooked if the questions* covering it were answered mostly by lower-skill workers.

*Each question covers a certain number of source code lines. Questions have the following format: “Do you believe the source code between lines 35 and 45 is related to the described failure?”

Therefore, my goal is to investigate how worker skill level is distributed across questions. Worker skill level was measured by the following three attributes.

Test score: workers received a score (zero to five) after taking a computer programming test used to filter out workers without enough programming knowledge. Workers who scored 2 or less were not allowed to participate, so every participating worker has a score of 3, 4, or 5.

Years of programming experience: worker quality can also be measured by years of programming experience, which workers self-declared before taking any test.

Profession: worker profession is also an indicator of quality. Professions consisted of professional programmers, hobbyists, graduate students, undergraduate students, and others.

Focus of the current analysis

The current analysis focuses only on the worker test score.

Fig. 1 shows that 20 questions each had 6 workers with score level 3, while 25 questions each had 5 workers with score level 3. The shape of the histogram resembles a normal distribution. Nonetheless, the data does not pass the Shapiro-Wilk normality test, shapiro.test(tableScore_3$scoreCount).

This is also the case for score levels 4 and 5, as the test results below show.

## 
##  Shapiro-Wilk normality test
## 
## data:  tableScore_4$scoreCount
## W = 0.92156, p-value = 1.407e-06

## 
##  Shapiro-Wilk normality test
## 
## data:  tableScore_5$scoreCount
## W = 0.95365, p-value = 0.0002321
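The normality check reported above can be reproduced along these lines. This is only a sketch: the data frame below is simulated (the per-question counts of the study are not included here), and the number of questions (100) and the Poisson rate are hypothetical choices made for illustration.

```r
# Hypothetical stand-in for the study data: one row per question, where
# scoreCount is the number of workers with score level 3 on that question.
set.seed(42)
tableScore_3 <- data.frame(scoreCount = rpois(100, lambda = 5))

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
result <- shapiro.test(tableScore_3$scoreCount)
print(result)

# Normality is rejected when the p-value falls below the chosen alpha.
alpha <- 0.05
rejectNormality <- result$p.value < alpha
```

Count data such as this is discrete and has many ties, which is one reason a histogram can look bell-shaped and still fail the test.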

The distribution shapes are distinct, but I performed a non-parametric test (the Wilcoxon rank sum test) to verify whether the averages are also distinct. Since I am making multiple comparisons, I applied the Bonferroni adjustment. With 3 comparisons, the corrected significance level is 0.05/3 = 0.0167, which corresponds to a confidence level of 1 - 0.0167 = 0.983.

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  tableScore_3$scoreCount and tableScore_4$scoreCount
## W = 5991.5, p-value = 8.678e-05
## alternative hypothesis: true location shift is not equal to 0

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  tableScore_3$scoreCount and tableScore_5$scoreCount
## W = 2203, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  tableScore_5$scoreCount and tableScore_4$scoreCount
## W = 13964, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

In all three tests the null hypothesis was rejected, since every p-value is below the corrected significance level of 0.0167. This implies that the three data sets have different averages at a confidence level of 98.3%.

The previous analysis shows that workers are not equally distributed across questions in terms of test score. This might be an issue because faults covered by questions that were answered mostly by lower-scoring workers might have been overlooked.
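The three pairwise comparisons and the Bonferroni threshold can be sketched as follows. The counts are simulated and the list name counts is hypothetical; only wilcox.test and p.adjust are taken as given, since both are part of base R.

```r
# Hypothetical per-question worker counts for score levels 3, 4 and 5.
set.seed(1)
counts <- list(
  score3 = rpois(100, lambda = 5),
  score4 = rpois(100, lambda = 6),
  score5 = rpois(100, lambda = 9)
)

alpha <- 0.05
pairs <- combn(names(counts), 2, simplify = FALSE)  # the 3 pairwise comparisons
correctedAlpha <- alpha / length(pairs)             # Bonferroni: 0.05 / 3

# exact = FALSE requests the normal approximation, which is what R falls
# back to anyway when the data contain ties.
pValues <- sapply(pairs, function(p) {
  wilcox.test(counts[[p[1]]], counts[[p[2]]], exact = FALSE)$p.value
})
names(pValues) <- sapply(pairs, paste, collapse = " vs ")

# Reject the null hypothesis where the raw p-value is below alpha / 3.
rejected <- pValues < correctedAlpha
print(rejected)
```

Comparing raw p-values against alpha/3 gives the same decisions as comparing p.adjust(pValues, method = "bonferroni") against alpha.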

However, the average worker score per question might still be more evenly distributed across questions. This is what I investigate next.

##        V1       
##  Min.   :3.632  
##  1st Qu.:4.053  
##  Median :4.200  
##  Mean   :4.178  
##  3rd Qu.:4.300  
##  Max.   :4.600
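A summary like the one above could be obtained by averaging the worker scores per question. The sketch below assumes a hypothetical data frame answers with one row per answer and columns questionID and score; the actual layout of the study data is not shown in this report, and the values are simulated.

```r
# Hypothetical answers table: 100 questions, 20 answers each, scores 3 to 5.
set.seed(7)
answers <- data.frame(
  questionID = rep(1:100, each = 20),
  score = sample(3:5, size = 2000, replace = TRUE)
)

# Mean worker score for each question.
averageScores <- aggregate(score ~ questionID, data = answers, FUN = mean)
names(averageScores)[2] <- "averageScores"

# Five-number summary plus mean of the per-question averages.
print(summary(averageScores$averageScores))
```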

We can see that the interval of average scores is reasonably tight, mostly from 3.8 to 4.6. The minimum of 3.632 in the summary above indicates at least one outlier below 3.8.

## 
##  Shapiro-Wilk normality test
## 
## data:  averageScores$averageScores
## W = 0.98248, p-value = 0.1146

I can also quantify how concentrated the average score is around score level 4. I can do this by computing the probability of a question having an average score above 4.

## Probability of average score above 4= 89.12 %
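The report does not state how this probability was computed, so the sketch below shows two plausible options on simulated data: the empirical proportion of questions above 4, and the tail probability of a normal distribution fitted to the averages. The sample is hypothetical; its mean is taken from the summary above, while the standard deviation of 0.15 is an assumption.

```r
# Hypothetical per-question average scores (simulated, not the study data).
set.seed(3)
avg <- rnorm(100, mean = 4.178, sd = 0.15)

# Option 1: empirical proportion of questions with average score above 4.
empiricalProb <- mean(avg > 4)

# Option 2: upper-tail probability of a normal fitted to the sample,
# which is justified only if normality is not rejected.
normalProb <- pnorm(4, mean = mean(avg), sd = sd(avg), lower.tail = FALSE)

cat("Empirical probability of average score above 4 =",
    round(100 * empiricalProb, 2), "%\n")
cat("Normal-model probability of average score above 4 =",
    round(100 * normalProb, 2), "%\n")
```

When the sample is close to normal the two estimates should be similar; a large gap between them would suggest the normal model is a poor fit in the lower tail.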

Conclusion

Although workers are not equally distributed across questions in terms of individual score levels, the average worker score per question is consistent with a normal distribution (the Shapiro-Wilk test does not reject normality, p = 0.1146). This allows me to infer that roughly 89% of questions have an average worker score above 4. Since the maximum score is 5, this represents a small variation across questions.


Christian Medeiros Adriano

May 8, 2017
