1 Assessment

This session is assessed using MCQs (the questions are highlighted below). The actual MCQs can be found on the BS2004 Blackboard site for this week. The deadline is listed there and on the front page of the BS2004 Blackboard site. This assessment contributes about 4% of module marks. You will receive feedback on this assessment on the Friday after the submission deadline.

2 High school awards

Awards2.csv is a simulated data set. In this example, num_awards is the response variable and indicates the number of awards earned by students at a high school in a year; math is a continuous predictor variable and represents students’ scores on their math final exam; and prog is a categorical predictor variable with three levels indicating the type of programme in which the students were enrolled. It is coded as General, Academic and Vocational.

Before doing anything else, make sure id and prog are factors.
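As a minimal sketch of the factor conversion, the code below uses a small made-up data frame (`toy`, a hypothetical stand-in) so that it runs on its own. With the real file you would presumably start from something like `read.csv("Awards2.csv")` instead; the column names follow the description above.

```r
## toy stand-in for Awards2.csv (values are made up for illustration only)
toy <- data.frame(
    id = 1:6,
    num_awards = c(2, 1, 0, 1, 3, 0),
    prog = c("General", "Academic", "Vocational", "Academic", "Vocational", "General"),
    math = c(41, 60, 45, 72, 66, 50),
    stringsAsFactors = FALSE
)

## convert the student identifier and the programme type to factors
toy$id <- factor(toy$id)
toy$prog <- factor(toy$prog)

## check the structure and the factor levels
str(toy)
levels(toy$prog)
```

Calling `str()` afterwards is a quick way to confirm that both columns are now listed as factors.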

Blackboard MCQ 1: From your exploratory data analysis, ignoring maths scores, students in which programme on average get the most awards? (Hint: In the last lecture, I used tapply to calculate group means).
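For reference, here is a small sketch of how `tapply` computes group means. The data frame `toy` and its values are made up and say nothing about the real Awards2.csv data; only the pattern of the call carries over.

```r
## made-up data: the values are arbitrary and do not reflect Awards2.csv
toy <- data.frame(
    num_awards = c(2, 1, 0, 1, 3, 0),
    prog = factor(c("General", "Academic", "Vocational",
                    "Academic", "Vocational", "General"))
)

## mean of num_awards within each level of prog
group_means <- tapply(toy$num_awards, toy$prog, mean)
group_means
```

The first argument is the variable to summarise, the second is the grouping factor, and the third is the function applied within each group.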

3 A generalized linear model

As these are count data (number of awards), a generalized linear model using a Poisson distribution would make sense. Create this model using the linear predictor \[ \eta_i = \beta_0 + \beta_1 \text{prog}_i + \beta_2 \text{math}_i \] That is, with no interaction term (I have already checked that it is not interesting).
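The following is a sketch of how such a model can be fitted with `glm`, assuming the usual log link for a Poisson family, so that \(\log(\mu_i) = \eta_i\). It uses simulated data (`toy`, with arbitrary coefficients chosen for illustration) rather than Awards2.csv, and the object name `toy_model` is hypothetical.

```r
## simulate a toy data set (arbitrary values, not Awards2.csv)
set.seed(1)
n <- 200
toy <- data.frame(
    prog = factor(sample(c("General", "Academic", "Vocational"), n, replace = TRUE)),
    math = round(runif(n, 35, 75))
)
## counts drawn from a Poisson distribution whose log-mean is linear in math
toy$num_awards <- rpois(n, lambda = exp(-3 + 0.05 * toy$math))

## Poisson GLM with main effects only (no prog:math interaction)
toy_model <- glm(num_awards ~ prog + math,
                 family = poisson(link = "log"), data = toy)
summary(toy_model)

## coefficients are on the log scale; exponentiate for multiplicative effects
exp(coef(toy_model))
```

Because prog has three levels, R estimates two coefficients for it (each a contrast with the reference level), so the fitted model has four coefficients in total.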

Blackboard MCQ 2: From your model, what is the effect of maths score on number of awards won by a student?

Blackboard MCQ 3: How much variance in awards won is explained by the above model? (Hint: pseudo \(R^2\))
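Several definitions of pseudo \(R^2\) exist. The sketch below shows one common deviance-based version, \(1 - \text{residual deviance} / \text{null deviance}\), on simulated data; I am assuming this is the version intended here, and `toy` and `toy_model` are hypothetical names.

```r
## simulate a toy data set and fit a Poisson GLM (arbitrary values, not Awards2.csv)
set.seed(1)
n <- 200
toy <- data.frame(
    prog = factor(sample(c("General", "Academic", "Vocational"), n, replace = TRUE)),
    math = round(runif(n, 35, 75))
)
toy$num_awards <- rpois(n, lambda = exp(-3 + 0.05 * toy$math))
toy_model <- glm(num_awards ~ prog + math, family = poisson, data = toy)

## deviance-based pseudo R^2: proportion of the null deviance explained
pseudo_r2 <- 1 - toy_model$deviance / toy_model$null.deviance
pseudo_r2
```

Both deviances are also printed at the bottom of `summary(toy_model)`, so the same number can be worked out by hand.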

4 Model validation

Blackboard MCQ 4: Is the data overdispersed?
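A rough check for overdispersion is to compare a dispersion statistic with 1. The sketch below computes two common versions on simulated data (`toy` and `toy_model` are hypothetical names); values well above 1 are usually read as a sign of overdispersion, although where to draw the line is a judgement call.

```r
## simulate a toy data set and fit a Poisson GLM (arbitrary values, not Awards2.csv)
set.seed(1)
n <- 200
toy <- data.frame(
    prog = factor(sample(c("General", "Academic", "Vocational"), n, replace = TRUE)),
    math = round(runif(n, 35, 75))
)
toy$num_awards <- rpois(n, lambda = exp(-3 + 0.05 * toy$math))
toy_model <- glm(num_awards ~ prog + math, family = poisson, data = toy)

## residual deviance divided by residual degrees of freedom
deviance_ratio <- deviance(toy_model) / df.residual(toy_model)

## Pearson chi-squared statistic divided by residual degrees of freedom
pearson_ratio <- sum(residuals(toy_model, type = "pearson")^2) / df.residual(toy_model)

c(deviance = deviance_ratio, pearson = pearson_ratio)
```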

For future reference, Stack Exchange has a nice description of what to do when you have overdispersion.

4.1 Model misfit

I’ve produced the model diagnostic plots from autoplot below.

Going by the link given in the lecture, I would judge the residuals versus fitted plot to show a good model fit. That link also has a nice discussion of what to do if this is not the case.

5 Which programme produces the most awards?

Blackboard MCQ 5: Using the commands in the emmeans package, which programme produces the most awards?
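As a sketch of the kind of call involved, the code below computes estimated marginal means for prog on simulated data, back-transformed to the count scale with `type = "response"`. It assumes the emmeans package is installed (the `requireNamespace` guard simply skips the step otherwise), and `toy` and `toy_model` are hypothetical names.

```r
## simulate a toy data set and fit a Poisson GLM (arbitrary values, not Awards2.csv)
set.seed(1)
n <- 200
toy <- data.frame(
    prog = factor(sample(c("General", "Academic", "Vocational"), n, replace = TRUE)),
    math = round(runif(n, 35, 75))
)
toy$num_awards <- rpois(n, lambda = exp(-3 + 0.05 * toy$math))
toy_model <- glm(num_awards ~ prog + math, family = poisson, data = toy)

## run only if emmeans is available
if (requireNamespace("emmeans", quietly = TRUE)) {
    library(emmeans)
    ## estimated marginal means for each programme, averaged over math
    emm <- emmeans(toy_model, ~ prog, type = "response")
    print(emm)
    ## pairwise comparisons between programmes
    print(pairs(emm))
}
```

With `type = "response"` the means are reported as expected counts, and the pairwise comparisons are reported as ratios rather than differences.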

Here’s a graph I produced to look at the predicted values. You might find it useful. Note that the lines are just trends, not regression lines.

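## load ggplot2 for plotting (assumes ggplot2 >= 3.4.0 for the linewidth argument)
library(ggplot2)

## calculate and store predicted values on the response (count) scale
programmes$phat <- predict(awards_model, type = "response")

## order by program and then by math so each line is drawn from left to right
programmes <- programmes[with(programmes, order(prog, math)), ]

## create the plot: jittered observed counts plus one line of predictions per programme
ggplot(programmes, aes(x = math, y = phat, colour = prog)) +
    geom_point(aes(y = num_awards), alpha = 0.5,
        position = position_jitter(height = 0.2)) +
    geom_line(linewidth = 1) +
    labs(x = "Math Score", y = "Expected number of awards")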