About the data

Mechanical Turk workers answered questions about different popular open source projects. More specifically, workers were asked whether they believed that a set of lines of Java source code was related to a software failure (e.g., a null pointer exception). More than 600 workers provided 2580 answers to 129 different questions (each question was asked to 20 different workers). The experiment that generated this data is described in this paper.

About the larger study

I am studying how to delegate tasks to workers in a way that allows me to identify software faults with more precision while asking fewer questions than in the original study. My approach is to investigate how to predict which answers from which workers have the highest accuracy. Therefore, my first step is to understand which factors affect answer accuracy, for example: worker skill, workers' confidence in their answers, workers' perceived difficulty of the questions, and the size and complexity of the source code.

The goal of this analysis

I tried to answer the following questions:
  • How is answer accuracy distributed over all the questions asked?
  • What are the questions with the most and least accurate answers?
  • Do these questions have anything in common?

Compute the number of correct answers for each of the 129 questions.
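
A minimal sketch of this aggregation in R; the raw-data file name and the columns QuestionID and Correct are assumptions, while CorrectAnswers matches the output below:

    answers <- read.csv("answers.csv")               # hypothetical file name
    cat("Data successfully loaded. Number of entries:", nrow(answers), "\n")
    # Count the number of correct answers per question
    summaryTable <- aggregate(
      list(CorrectAnswers = answers$Correct),        # Correct assumed to be a 0/1 flag
      by = list(QuestionID = answers$QuestionID),
      FUN = sum
    )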

    ## Data successfully loaded. Number of entries: 2530

Plotting the distribution
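
A minimal base-R sketch of the plot, assuming the summaryTable built above:

    # Histogram of correct answers per question (each question had 20 answers)
    hist(summaryTable$CorrectAnswers,
         main = "Correct answers per question",
         xlab = "Number of correct answers (out of 20)")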

Since the shape of the histogram resembles a normal distribution, I performed a Shapiro-Wilk normality test. This test is based on the null hypothesis that the sample comes from a normal distribution. This means that if I reject the null hypothesis (p-value < 0.05), then there is evidence that the data does not come from a normal distribution. As shown in the output of the test below, the distribution of correct answers per question is not normally distributed.
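
The test call in R, as reflected in the data label of the output below:

    # Shapiro-Wilk normality test on the per-question counts of correct answers
    shapiro.test(summaryTable$CorrectAnswers)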

    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  summaryTable$CorrectAnswers
    ## W = 0.97883, p-value = 0.0411

Why do some questions have, on average, different levels of answer accuracy?

This is an important consideration if we want to explore the data by filtering. The reason is that filtering can reinforce or offset certain biases already present in the data. Consider, for instance, that some questions might have been answered by a disproportionate number of less skilled workers, or by workers from certain professions. Therefore, if we filter by skill, we might favor certain questions more than others. The problem is that this effect is arbitrary, i.e., another dataset would produce a different result.
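
As a purely hypothetical illustration (the WorkerSkill column and the 0.8 threshold are assumptions, not part of the dataset), filtering by skill changes how many answers each question retains, and that uneven per-question coverage is the bias in question:

    # Compare per-question answer coverage before and after a skill filter
    skilled <- subset(answers, WorkerSkill >= 0.8)   # assumed column and arbitrary threshold
    coverageBefore <- table(answers$QuestionID)      # 20 answers per question by design
    coverageAfter  <- table(skilled$QuestionID)      # typically uneven across questions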

With respect to the possible biases, I envisaged at least two sources: variation in worker quality and differences in question difficulty. I will investigate these two factors in the next analyses.