1 Assessment

This session is assessed using MCQs (questions highlighted below). The actual MCQs can be found on the BS2004 Blackboard site under Assessments and Feedback. The deadline is listed there and on the front page of the BS2004 blackboard site. This assessment is formative. Remember to save your commented script and copy and paste it where you answer the MCQs (don’t worry about making it neat). You will receive feedback on this assessment after the submission deadline.

2 ANOVA for dioecious trees

A plant species is dioecious if each individual produces all male flowers or all female flowers. The dataset dioecious trees contains data from 50 trees of one particular dioecious species, from a ten hectare area of mixed woodland. For each individual, the SEX was recorded (coded as 1 for male and 2 for female), the diameter at breast height in millimetres (DBH), and the number of flowers on the tree at the time of measurement (FLOWERS). This dataset is taken from Grafen and Halls’ book. It is available on blackboard.

A good general framework for any analysis is Plot -> Model -> Check assumptions -> Interpret -> Plot again. We will follow this below.

2.1 What is the data like?

The first thing to do is use the skim command (package skimr) or summary command to just have a quick look at the dioecious data. The data is available on blackboard (dioecious.csv file). We looked at how to load data into R here. We also used skim back then.

When I ran this I could see a small problem. SEX is classed as a continous variable because it was coded as 1 and 2. We want it to be a factor.

dioecious$SEX <- as.factor(dioecious$SEX)

Why do we need to do this? Otherwise R thinks 2 is a number and the double of 1, whereas in reality 1 is just a code for males and 2 was the code for females.

Next you want to have a visual check to see if SEX had an effect on the number of flowers (FLOWERS). Plot FLOWERS (y-axis) using a boxplot with SEX as the x-axis. You encountered boxplots back in BS1040 session 2.

Blackboard MCQ 1: From your plot results, what are the median number of flowers for male and female plants?

2.2 Build your linear model

Test the null hypothesis that male and female trees produce the same number of flowers. I would use the example code I gave you in the first year lecture. There, I build a model called model_ozone. You can build one called model_flowers (or anything you like, it’s just a name). model_flowers should look at the effect of SEX on FLOWERS.

By the way, this build model->build anova->create anova table is a bit long winded. An anova can be coded slightly more simply in R. I’m teaching you this way because 1) it introduces the idea of linear models, which will be important this year and 2) it makes it easier to look at the assumptions.

Blackboard MCQ 2: Correctly (as described in the slides) report and interpret the results of your ANOVA for the effects of SEX on FLOWERS.

2.3 Check the assumptions of your model

Remember the three important assumptions of an ANOVA are

  • Independence of observations .
  • Normality – the distributions of the residuals are normal. (Robust)
  • Homoscedasticity — the variance of data in groups should be the same.

The first one is to do with how the data was collected. Were each of the samples independent of each other? Imagine if you did an experiment where you measured the same individual each day. Here the observation collected each day would not be independent from the other observations. FYI, these data are called repeated measures and ANOVA can handle them but you have to tell it (repeated measures ANOVA). The data you are looking at today are independent.

The next two can be examined using the second (qqplot) and third (scale location) graph produced by the autoplot function in the ggfortify package (shown during lecture). Here is an excellent resource for interpreting the diagnostic plots. Remember that autoplot tests the linear model (model_flowers).

2.4 Replot your data for publication

The last step in Plot -> Model -> Check assumptions -> Interpret -> Plot again is to plot your data the way you want people to see them. There is noting wrong with the boxplot you did at the beginning. But for sunday best, I would at a minimum add dots to the boxplot. What about colour? Is there a better graph type you’d like to play with. Use this time to have a play with ggplot2.

3 Regression for IQ on test results

3.1 Does IQ affect GPA results?

In this part, we are going to go through a full analysis on some simulated data available in openintro. This data (gpa_iq) represents students’ IQ scores and gpa exam scores. So the relationship between intelligence, if that’s what IQ measures, and academic achievement in secondary school.

A good general framework for any analysis is Plot -> Model -> Check assumptions -> Interpret -> Plot again. We will follow this below.

3.2 What is the data like?

The first thing to do is use the summary command to just have a quick look at the gpa_iq data. The data is available in the openintro package, so you’ll need to install and load this.

Next you want to have a visual check to see if IQ (iq) had an effect on exam performance (gpa). Plot exam performance (y-axis) using a scatterplot with IQ as the x-axis. You encountered scatterplots (geom_point) back in session 2.

Blackboard MCQ 3: Based just on your intial work above, how does IQ affect GPA?

3.3 Build your linear model

I would use the example code I gave you in the first year lecture. There, I build a model of how tannin affects growth. You can build one called model_iq (or anything you like, it’s just a name). model_iq should look at the effect of IQ on GPA. Remember ~ in R means depends on. So gpa~IQ, says GPA depends on IQ, exactly what we want here.

Blackboard MCQ 4: Correctly (as described in the slides) report the statistics of your regression for the effects of IQ on GPA.

3.4 Check the assumptions of your model

Remember the three important assumptions of a regression are

  • Independence of observations .
  • Normality – the distributions of the residuals are normal. (Robust)
  • Homoscedasticity — the variance of data in groups should be the same.

The first one is to do with how the data was collected. Were each of the samples independent of each other? Imagine if you did an experiment where you measured the same individual each day. Here the observation collected each day would not be independent from the other observations. The data you are looking at today are independent.

The next two can be examined using the second (qqplot) and third (scale location) graph produced by the autoplot function in the ggfortify package (shown during lecture). Remember that autoplot tests the linear model (model_iq). Here is a nice explanation of a qqplot from a student’s perspective.

3.5 Replot your data for publication

The last step in Plot -> Model -> Check assumptions -> Interpret -> Plot again is to plot your data the way you want people to see them. There is noting wrong with the scatterplot you did at the beginning. But for sunday best, I would at a minimum add a trend line. Add + geom_smooth(method = “lm”) to your ggplot script.

If you wanted to go the whole hog with equations etc., this is the simplest way I could think of.

library("ggplot2")
library("ggpubr")
library("openintro")
ggscatter(gpa_iq, x = "iq", y = "gpa", add = "reg.line") + stat_cor(label.y = 10,
    aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) + stat_regline_equation(label.y = 12)