Carrying out a one way ANOVA

1 Assessment

This session is assessed using MCQs (the questions are highlighted below). The actual MCQs can be found on the BS1070/MB1080 Blackboard site under Assessments and Feedback/Data analysis MCQs. The deadline is listed there and on the front page of the BS1070/MB1080 Blackboard site. This assessment contributes 5% of module marks. You will receive feedback on this assessment after the submission deadline.

I would like you to upload to Blackboard the script you used to carry out this session. As well as every command you need, please add comments briefly explaining what each line does. This is not marked, but it will serve as the plagiarism check for this assessment. It is very likely the commands will be very similar, but if the comments are identical, I will become suspicious. You may need to upload this as a docx file to Blackboard. If so, just copy and paste your script into MS Word. Don’t worry about formatting etc. The upload can be found on the BS1070/MB1080 Blackboard site under Assessments and Feedback/Data analysis scripts.

2 Does the quality of a lecture affect exam results?

In today’s session, we are going to go through a full analysis of some simulated data available in the openintro package. The data (classData) represent the scores of students from three different lectures who were all given the same exam.

A good general framework for any analysis is Plot -> Model -> Check assumptions -> Interpret -> Plot again. We will follow this below.

3 What is the data like?

The first thing to do is use the dfSummary command (package summarytools) to have a quick look at the classData data. The data is available in the openintro package, so you’ll need to install and load this. We first did this in session 1, where we also used dfSummary. If you are having problems with dfSummary (Mac X11 or RCurl), just use the base R function summary() instead.
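As a rough sketch of this first look, here is the base R fallback applied to a small made-up data frame. The object exam_demo and all its values are invented for illustration. Only the column names lecture and m1 are meant to mirror classData, so replace it with the real data once openintro is loaded.

```r
# Stand-in data (invented values) with the same two columns as classData:
# lecture = which lecture the student attended, m1 = exam score.
set.seed(1)
exam_demo <- data.frame(
  lecture = factor(rep(c("a", "b", "c"), each = 20)),
  m1 = c(rnorm(20, mean = 70, sd = 8),
         rnorm(20, mean = 75, sd = 8),
         rnorm(20, mean = 80, sd = 8))
)

# Structure: variable names, types and the first few values
str(exam_demo)

# Per-column summary: counts per lecture, quartiles and mean of m1
summary(exam_demo)

# With the real data, and assuming both packages are installed,
# the equivalent would be:
# library(openintro); library(summarytools); dfSummary(classData)
```

Things to look for: how many students are in each lecture, whether there are missing values, and whether the range of scores looks sensible.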

Next you want to have a visual check to see if lecture had an effect on exam performance (m1). Plot exam performance (y-axis) using a boxplot with lecture as the x-axis. You encountered boxplots back in session 2.
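A minimal sketch of such a plot, again on invented stand-in data (exam_demo is hypothetical; only the column names follow classData). Base R's boxplot() is used here purely to keep the example free of extra packages; the ggplot2 approach from session 2 does the same job.

```r
# Stand-in data with invented values; swap in classData for the real analysis
set.seed(1)
exam_demo <- data.frame(
  lecture = factor(rep(c("a", "b", "c"), each = 20)),
  m1 = c(rnorm(20, mean = 70, sd = 8),
         rnorm(20, mean = 75, sd = 8),
         rnorm(20, mean = 80, sd = 8))
)

# Response on the left of ~ (y-axis), grouping factor on the right (x-axis)
boxplot(m1 ~ lecture, data = exam_demo,
        xlab = "Lecture", ylab = "Exam score (m1)")

# The median of each group, i.e. the thick line in each box
group_medians <- tapply(exam_demo$m1, exam_demo$lecture, median)
group_medians
```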

Blackboard MCQ: Based just on your initial work above, which lecture is the best?

4 Build your linear model

I would use the example code I gave you in the lecture. There, I built a model called model_weight. You can build one called model_exam (or anything you like, it’s just a name). model_exam should look at the effect of lecture on exam result.

By the way, this build model -> build anova -> create anova table route is a bit long-winded. An ANOVA can be coded slightly more simply in R. I’m teaching you this way because 1) it introduces the idea of linear models, which will be important next year, and 2) it makes it easier to look at the assumptions. Do it my way, but you won’t lose marks if you do it a different way and get the right answer.
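The two routes can be sketched as follows. This assumes the lecture code follows the lm -> aov -> summary pattern described here. The data frame exam_demo and the name anova_exam are invented for illustration, so the numbers will not match classData.

```r
# Stand-in data with invented values; swap in classData for the real analysis
set.seed(1)
exam_demo <- data.frame(
  lecture = factor(rep(c("a", "b", "c"), each = 20)),
  m1 = c(rnorm(20, mean = 70, sd = 8),
         rnorm(20, mean = 75, sd = 8),
         rnorm(20, mean = 80, sd = 8))
)

# Long route, step 1: the linear model (response ~ explanatory variable)
model_exam <- lm(m1 ~ lecture, data = exam_demo)

# Step 2: the anova built from that model
anova_exam <- aov(model_exam)

# Step 3: the anova table (Df, Sum Sq, Mean Sq, F value, Pr(>F))
summary(anova_exam)

# Short route: the same table in one line, without a separate lm object
anova_short <- summary(aov(m1 ~ lecture, data = exam_demo))
anova_short
```

Both routes give the same F value and degrees of freedom. The long route leaves you with model_exam for checking assumptions and anova_exam for post-hoc tests.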

Blackboard MCQ: Correctly (as described in the slides) report the results of your ANOVA for the effects of lecture on exam result.

5 Check the assumptions of your model

Remember the three important assumptions of an ANOVA are

  • Independence of observations.
  • Normality – the residuals are normally distributed. (Robust)
  • Homoscedasticity – the variance of the data in each group should be the same.

The first one is to do with how the data was collected. Was each sample independent of the others? Imagine you did an experiment where you measured the same individual each day. Here the observation collected each day would not be independent of the other observations. FYI, such data are called repeated measures. ANOVA can handle them, but you have to tell it (repeated measures ANOVA). The data you are looking at today are independent.

The next two can be examined using the second (Q-Q plot) and third (scale-location) graphs produced by the autoplot function in the ggfortify package (shown during the lecture). Remember that autoplot takes the linear model (model_exam) as its input.
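As a hedged sketch, the same two diagnostic plots can be drawn with base R's plot() method for linear models. That choice is mine, made only to avoid depending on ggfortify here. The data are invented stand-ins, not classData.

```r
# Stand-in data with invented values; swap in classData for the real analysis
set.seed(1)
exam_demo <- data.frame(
  lecture = factor(rep(c("a", "b", "c"), each = 20)),
  m1 = c(rnorm(20, mean = 70, sd = 8),
         rnorm(20, mean = 75, sd = 8),
         rnorm(20, mean = 80, sd = 8))
)
model_exam <- lm(m1 ~ lecture, data = exam_demo)

# Two panels side by side
old_par <- par(mfrow = c(1, 2))

# Plot 2: normal Q-Q plot of the residuals.
# Points lying close to the line suggest roughly normal residuals.
plot(model_exam, which = 2)

# Plot 3: scale-location plot.
# A similar vertical spread across the groups suggests equal variances.
plot(model_exam, which = 3)

par(old_par)  # restore the previous plotting layout

# With ggfortify installed, the version shown in the lecture would be:
# library(ggfortify); autoplot(model_exam)
```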

6 Which lecture was the best?

Spoiler alert! Your ANOVA should have found a significant effect of lecture on exam score. But which lecture is the best? We used the Tukey post-hoc test to answer a similar question in the lecture. Remember the Tukey test works on the anova (the output of aov), not the linear model (the output of lm).
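A minimal sketch of the post-hoc step using base R's TukeyHSD(), which I am assuming is the function used in the lecture. The data and the names anova_exam and tukey_exam are invented for illustration.

```r
# Stand-in data with invented values; swap in classData for the real analysis
set.seed(1)
exam_demo <- data.frame(
  lecture = factor(rep(c("a", "b", "c"), each = 20)),
  m1 = c(rnorm(20, mean = 70, sd = 8),
         rnorm(20, mean = 75, sd = 8),
         rnorm(20, mean = 80, sd = 8))
)
model_exam <- lm(m1 ~ lecture, data = exam_demo)
anova_exam <- aov(model_exam)

# TukeyHSD needs the aov object, not the lm object
tukey_exam <- TukeyHSD(anova_exam)

# One row per pair of lectures: difference in means (diff),
# confidence interval (lwr, upr) and adjusted p-value (p adj)
tukey_exam
```

A row such as b-a reports the mean of b minus the mean of a, so a positive diff with a small adjusted p-value means b scored higher than a.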

Blackboard MCQ: According to the Tukey test, which lecture is better than which?

7 Imagine your data isn’t normal, it’s easy if you try

If your data was not normal, the first thing to try would be to normalise it by transformation. But don’t do that here. Rather, as an exercise, I would like you to use nonparametric tests to carry out the same analysis. Does lecture have an effect on exam result (Kruskal-Wallis test)? And if so, which lecture was the best (Dunn’s test)?
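The nonparametric route might look like the sketch below, on invented stand-in data. kruskal.test() is in base R. For Dunn's test I am assuming the dunn.test package with a Bonferroni correction; the lecture may have used a different package or correction, so treat that part as a placeholder.

```r
# Stand-in data with invented values; swap in classData for the real analysis
set.seed(1)
exam_demo <- data.frame(
  lecture = factor(rep(c("a", "b", "c"), each = 20)),
  m1 = c(rnorm(20, mean = 70, sd = 8),
         rnorm(20, mean = 75, sd = 8),
         rnorm(20, mean = 80, sd = 8))
)

# Kruskal-Wallis: rank-based test for any difference between the lectures
kw_exam <- kruskal.test(m1 ~ lecture, data = exam_demo)
kw_exam

# Dunn's test: pairwise comparisons after a significant Kruskal-Wallis.
# ASSUMPTION: the dunn.test package; this step is skipped if it is not installed.
if (requireNamespace("dunn.test", quietly = TRUE)) {
  dunn.test::dunn.test(exam_demo$m1, exam_demo$lecture, method = "bonferroni")
}
```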

Blackboard MCQ: Are the results of the non-parametric analysis qualitatively different from those of the parametric analysis?

8 Replot your data for publication

The last step in Plot -> Model -> Check assumptions -> Interpret -> Plot again is to plot your data the way you want people to see them. There is nothing wrong with the boxplot you did at the beginning. But for Sunday best, I would at a minimum add dots to the boxplot (session 2). What about colour? Is there a better graph type you’d like to play with? Basically, use this time to have a play with ggplot2.
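One possible starting point is sketched below, on invented stand-in data. The colours, theme and jitter settings are only my suggestions; change them freely.

```r
library(ggplot2)

# Stand-in data with invented values; swap in classData for the real analysis
set.seed(1)
exam_demo <- data.frame(
  lecture = factor(rep(c("a", "b", "c"), each = 20)),
  m1 = c(rnorm(20, mean = 70, sd = 8),
         rnorm(20, mean = 75, sd = 8),
         rnorm(20, mean = 80, sd = 8))
)

p <- ggplot(exam_demo, aes(x = lecture, y = m1, colour = lecture)) +
  # Boxes without outlier points, so no score is drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Raw scores as dots, spread sideways only so the y values stay exact
  geom_jitter(width = 0.15, height = 0, alpha = 0.6) +
  labs(x = "Lecture", y = "Exam score") +
  theme_classic() +
  # The x-axis already names the lectures, so the legend is redundant
  theme(legend.position = "none")

print(p)
```

Showing the raw points alongside the boxes lets a reader judge sample size and spread directly.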

Eamonn Mallon

2020-01-17