1 Assessment

This session is assessed using MCQs (questions highlighted below). The actual MCQs can be found on the BS1070/MB1080 Blackboard site under Assessments and Feedback/Data analysis MCQs. The deadline is listed there and on the front page of the BS1070/MB1080 Blackboard site. This assessment contributes 5% of the module marks. You will receive feedback on this assessment after the submission deadline.

2 Does IQ affect GPA results?

In today’s session, we are going to go through a full analysis of some simulated data available in openintro. The data (gpa_iq) represent students’ IQ scores and GPA exam scores, so we are looking at the relationship between intelligence, if that’s what IQ measures, and academic achievement in secondary school.

A good general framework for any analysis is Plot -> Model -> Check assumptions -> Interpret -> Plot again. We will follow this below.

3 What is the data like?

The first thing to do is use the summary command to just have a quick look at the gpa_iq data. The data is available in the openintro package, so you’ll need to install and load this.
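As a minimal sketch of these steps: the install and load lines are shown as comments for you to run yourself, and the summary call is demonstrated on R's built-in cars data, which is only a stand-in here (an assumption for illustration, not part of this session's data).

```r
# Run once per machine to install the package, then load it in each session:
# install.packages("openintro")
# library(openintro)
# After loading, gpa_iq is available and summary(gpa_iq) is the call you need.

# The same idea on the built-in cars data (stand-in only):
cars_summary <- summary(cars)
print(cars_summary)
```

The output gives the minimum, quartiles, mean and maximum of each column, which is a quick way to spot odd values before plotting.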

Next you want to have a visual check to see if IQ (iq) had an effect on exam performance (gpa). Plot exam performance (y-axis) using a scatterplot with IQ as the x-axis. You encountered scatterplots (geom_point) back in session 2.
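A scatterplot of this kind might look like the sketch below. It again uses the built-in cars data as a stand-in (speed on the x-axis, dist on the y-axis); for the session data you would swap in gpa_iq, iq and gpa.

```r
library(ggplot2)

# Response on the y-axis, explanatory variable on the x-axis
scatter_cars <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()

print(scatter_cars)
```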

Blackboard MCQ: Based just on your initial work above, how does IQ affect GPA?

4 Build your linear model

I would use the example code I gave you in the lecture. There, I built a model of how tannin affects growth. You can build one called model_iq (or anything you like, it’s just a name). model_iq should look at the effect of IQ on GPA. Remember ~ in R means depends on. So gpa ~ iq says GPA depends on IQ, exactly what we want here.
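The lecture code is not reproduced here, so the following is only a sketch of the general pattern, fitted to the built-in cars data as a stand-in (the name model_cars is just an illustration). The formula reads "dist depends on speed", in the same way that gpa ~ iq reads "GPA depends on IQ".

```r
# Fit a linear model: stopping distance depends on speed
model_cars <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line:
# dist = intercept + slope * speed
print(coef(model_cars))

# Full output: coefficient table, R-squared, F statistic,
# degrees of freedom and p-value
print(summary(model_cars))
```

The two numbers returned by coef() are what you need to write the equation of the line, and summary() holds the statistics you would report.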

Blackboard MCQ: What is the equation of the line of your regression for the effects of IQ on GPA?

Blackboard MCQ: Correctly (as described in the slides) report the statistics of your regression for the effects of IQ on GPA.

5 Check the assumptions of your model

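Remember the three important assumptions of a regression are:

Independence of observations.
Normality – the residuals are normally distributed. (Regression is fairly robust to departures from this.)
Homoscedasticity – the variance of the residuals should be the same across the range of fitted values (the regression equivalent of equal variance between groups).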

The first one is to do with how the data were collected. Was each sample independent of the others? Imagine you did an experiment where you measured the same individual each day. Here the observation collected each day would not be independent of the other observations. The data you are looking at today are independent.

The next two can be examined using the second (qqplot) and third (scale-location) graphs produced by the autoplot function in the ggfortify package (shown during the lecture). Remember that autoplot is run on the linear model (model_iq), not on the raw data. Here is a nice explanation of a qqplot from a student’s perspective.
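As a rough sketch of how those diagnostic plots are produced, the example below fits a stand-in model to the built-in cars data and passes it to autoplot. The which argument is used here to request only the second and third plots; leaving it out gives the full default set.

```r
library(ggfortify)  # provides autoplot() for lm objects; loads ggplot2 too

# Stand-in model (not the session data)
model_cars <- lm(dist ~ speed, data = cars)

# which = 2 is the normal Q-Q plot, which = 3 is the scale-location plot
diag_plots <- autoplot(model_cars, which = 2:3)
diag_plots
```

In the Q-Q plot, points lying close to the line suggest roughly normal residuals. In the scale-location plot, a roughly flat trend with even scatter suggests constant variance.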

Blackboard MCQ: From just your qqplot and the information in the linked qqplot webpage, in which direction is your data skewed?

6 Replot your data for publication

The last step in Plot -> Model -> Check assumptions -> Interpret -> Plot again is to plot your data the way you want people to see them. There is nothing wrong with the scatterplot you did at the beginning. But for Sunday best, I would at a minimum add a trend line. Add + geom_smooth(method = "lm") to your ggplot script.
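For instance, a scatterplot with a fitted trend line could be sketched as follows, once more on the built-in cars data as a stand-in for gpa_iq.

```r
library(ggplot2)

trend_cars <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  # method = "lm" draws the straight regression line;
  # the shaded band is its confidence interval
  geom_smooth(method = "lm")

print(trend_cars)
```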

If you wanted to go the whole hog with equations etc., this is the simplest way I could think of.

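library("ggplot2")
library("ggpubr")
library("openintro")
# Scatterplot of gpa against iq with the fitted regression line added
ggscatter(gpa_iq, x = "iq", y = "gpa", add = "reg.line") +
  # Add R-squared and the p-value, placed at y = 10
  stat_cor(label.y = 10,
           aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
  # Add the equation of the regression line, placed at y = 12
  stat_regline_equation(label.y = 12)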

Carrying out a linear regression