In Topic 8 we introduced the concept of analysing data using correlation and simple linear regression. In this computer lab, we will practice describing the relationship between two numeric variables, using these statistical techniques.
For some of this computer lab’s questions, we will consider happiness and average income data (Gapminder 2021a), also (Gapminder 2021b). You might recall that we assessed some of this data back in Computer Lab 3. Check back there if you’d like a refresher on the details of the happiness and income variables.
Note: In this lab, the version of the file we will be using has income expressed in thousands of dollars ($000) - so e.g. $20,000 would be recorded as 20.
🏡 Suppose that we are interested in determining if there is a correlation between the average happiness rating and the average income level of citizens in a country. In this question we will consider ways to assess this correlation numerically.
💻 Download the file hi_2019_000s.csv
from LMS and save it to your working directory.
💻 Open up RStudio, and load in the hi_2019_000s.csv
data using the following code:
gapminder.2019 <- read.csv(file = "hi_2019_000s.csv", header = T)
This data set contains happiness scores (happiness_2019
) and average income expressed in thousands of dollars (income_2019
), for 164 countries in 2019.
Note: Make sure you have saved this file to your working directory.
💻 To begin, use the plot
function to create a scatter plot of happiness_2019
on the y-axis versus income_2019
on the x-axis.
💻 Based on the scatter plot produced in 1.3, it would appear that there is some positive correlation between the happiness_2019
and income_2019
data.
If we use the R function cor
, as shown in the code below, we find that this correlation is 0.739.
cor(gapminder.2019$income_2019, gapminder.2019$happiness_2019, use = "complete.obs")
🏡 While the results of 1.4 certainly suggest there is a strong correlation between these two variables, a more sophisticated approach is to use the R function cor.test
.
Using the code in 1.4 as a guide, use the cor.test
function to check if there is a statistically significant association between the happiness ratings and average incomes of the countries in our 2019 data set.
Based on the output of this test, what do you conclude?
Note: The default correlation method in the cor.test
function is method = "pearson"
. While this is fine for situations where we have data following a straight or relatively straight line, we should use the method = "spearman"
option when dealing with data that exhibits curvature.
🏡 Let us now consider ways to fit a simple linear regression (SLR) model to our hi_2019_000s.csv
data in R.
💻 There are many ways to fit a straight line to data. If we use the lm
R function, this will ensure that the sum of the squared residuals of our fitted model will be minimised, which means that our fitted line should fit the data well (if the relationship between the two continuous variables is linear).
Take a look at the code below, which shows the general layout of the arguments for the lm
function. Once you are confident you understand the function, try to fit a simple linear regression to the hi_2019_000s.csv
data set, where:
happiness_2019
is the dependent variable, andincome_2019
is the independent variablelinear.model <- lm(formula = y ~ x, data = data.set)
Hint: You will have to replace the y
, x
and data.set
arguments, and you may also like to use a different name for the model.
💻 The best way to assess the lm
function output is to use the summary
command. Run the R code summary(...)
, replacing the ...
with whatever you called your fitted linear regression model (e.g. hi.linear.model
).
What are the estimated coefficients for your model?
🏡 Using the coefficient estimates obtained in 2.2, write down your estimated i.e. fitted linear regression model.
🏡 Use your fitted linear regression model to answer the following question:
income_2019 = 20
(i.e. $20,000), what would be the estimated value of happiness_2019
?
🏡 Interpret the value of \(\widehat{\beta}_1\).
🏡 Do we have evidence of a significant linear association between income_2019
and happiness_2019
? Does this align with your results from 1?
🏡 What is the multiple R-squared value?
💻 Using the \(R^2\) value, evaluate the fit of the model.
💻 Plot a residuals versus fits plot and a Normal Q-Q plot for the linear model you fitted in 2.1, using the following R code:
# Replace the ... below with the name of your fitted model
windows(height = 3.5) # N.B. For MacOS users, use the quartz function instead
par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)
plot(..., which = c(1, 2))
🏡 Using the plots created in 2.3, comment on whether or not you believe the following assumptions have been violated:
🏡 For this question we will try to develop a better understanding of residuals versus fits plots, by considering several examples. Recall that the response variable for an SLR model is denoted by \(y\), and the explanatory variable is denoted by \(x\).
Before you begin this question, you might like to take a look back at section 2.4 of the Topic 8 readings.
🏡 In Figure 3.1 below we present 9 plots, labelled Plot 1
, Plot 2
,Plot 3
and so on.
These are scatter plots of 9 simulated data sets, with responses (\(y_i\)’s) plotted against the explanatory variable values (\(x_i\)’s). Each data set consists of \(n = 100\) simulated pairs of \((x_i, y_i)\) observations - i.e. \((x_1, y_1)\) for the first pair of observations, etc.
Each plot also includes the fitted line associated with the estimated SLR model fitted to the simulated data.
Figure 3.1: Plots of the \(y_i\)’s versus \(x_i\)’s including the estimated least squares line for each of 9 data sets of size \(n=100\).
Take a close look at the plots in Figure 3.1. Using your knowledge of fits and residuals, try and imagine what the associated residuals versus fits plots (also known as the residuals versus fitted plots) would look like for each of these 9 plots.
🏡 The residual versus fits plots for each of the 9 data sets are shown below in Figure 3.2.
However, the order of these plots in Figure 3.2 is random so that, for example, residuals versus fits plot A
is not necessarily the plot associated with Plot 1
in Figure 3.1.
Figure 3.2: Residuals versus fits plots associated with the data sets considered in Figure 4.1. The order of these plots have been chosen randomly so that A
does not necessarily associate with Plot 1
in Figure 4.1 and so on.
Your task is to match the plots in Figure 3.1 with their respective residuals versus fits plot in Figure 3.2. To do so, complete Table 3.1 by writing the Figure 3.2 label (i.e. A, B etc) under what you believe is the correct (matching) Figure 3.1 label.
Discuss this question and your results with your classmates and demonstrator.
Figure 1 label | Plot 1 | Plot 2 | Plot 3 | Plot 4 | Plot 5 | Plot 6 | Plot 7 | Plot 8 | Plot 9 |
Figure 2 label |
Once you think you have the correct matches, you can check your results below:
D E G B I A C H F
Using the residuals versus fits plots in Figure 3.2 and your completed table above, are you still happy with your choice of data
sets that satisfy the simple linear regression model?
What about for those that do not satisfy the simple linear regression model?
If you were previously unsure whether some data sets did or didn’t satisfy the simple linear regression model, did the residuals versus fitted plots help?
💻 In this question, we will carry out a simple linear regression analysis using some simulated data.
💻 Run the R code below to simulate a small data set. Don’t worry if you do not understand the code; it simply needs to be run as a step towards creating the data set that we will then analyse.
set.seed(5050)
sim.x <- abs(2*sample(round(rnorm(50, mean = 2, sd = 2), 1), replace = T))
sim.y <- 2*sim.x + round(rpois(50, 8), 2)
simulated.data <- data.frame(sim.x, sim.y)
💻 Use the plot
function to create a scatter plot of your simulated.data
data set.
💻 The abline
R function can be used to add a straight line to an existing plot. The two main arguments for this function are a
and b
, which are the user-specified values for the intercept and slope of the line respectively.
Use the abline
function to fit a straight line of best fit to the scatter plot you created in 4.2, as accurately as you can. You may have to try several different values of a
and b
before you are happy with the fit.
💻 Use the lm
function to fit a simple linear model to your simulated.data
data set.
💻 Use the summary
function to assess the output of your linear model.
Note down the following information from the output:
💻 Using the coefficient estimates obtained in 4.5, write out the estimated linear regression model.
💻 Compare the coefficient estimates obtained via the lm
function to your selected values of a
and b
for your fitted abline
line from 4.3.
Were you close? Don’t worry if you weren’t, it is typically quite difficult to accurately fit such a line by visual assessment alone - which is why we use R!
To include the fitted linear regression model line to your 4.2 scatter plot, you can use the R code:
# Make sure to change the colour here
abline(linear.model, col = ..., lwd = 2)
Replot your scatter plot from 4.2, add your original abline from 4.3, and then run the code above to add a second, new abline with the SLR specifications, to compare your guess with the fitted model.
💻 Use the estimated SLR model to answer the following question:
sim.x = 5
, what would be the estimated value of sim.y
?
💻 Provide an interpretation of the value of \(\widehat{\beta}_1\).
💻 Do we have evidence of a significant linear association between sim.x
and sim.y
?
💻 What is the multiple R-squared value?
💻 Using the \(R^2\) value, evaluate the fit of the model.
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.