Topic 8: Correlation and Simple Linear Regression


In Topic 8 we introduced the concept of analysing data using correlation and simple linear regression. In this computer lab, we will practice describing the relationship between two numeric variables, using these statistical techniques.


For some of this computer lab’s questions, we will consider happiness and average income data (Gapminder 2021a), also (Gapminder 2021b). You might recall that we assessed some of this data back in Computer Lab 3. Check back there if you’d like a refresher on the details of the happiness and income variables.

Note: In this lab, the version of the file we will be using has income expressed in thousands of dollars ($000) - so e.g. $20,000 would be recorded as 20.

1 Correlation

🏡 Suppose that we are interested in determining if there is a correlation between the average happiness rating and the average income level of citizens in a country. In this question we will consider ways to assess this correlation numerically.

1.1

💻 Download the file hi_2019_000s.csv from LMS and save it to your working directory.

1.2

💻 Open up RStudio, and load in the hi_2019_000s.csv data using the following code:

gapminder.2019 <- read.csv(file = "hi_2019_000s.csv", header = T)

This data set contains happiness scores (happiness_2019) and average income expressed in thousands of dollars (income_2019), for 164 countries in 2019.

Note: Make sure you have saved this file to your working directory.

1.3

💻 To begin, use the plot function to create a scatter plot of happiness_2019 on the y-axis versus income_2019 on the x-axis.

1.4

💻 Based on the scatter plot produced in 1.3, it would appear that there is some positive correlation between the happiness_2019 and income_2019 data.

If we use the R function cor, as shown in the code below, we find that this correlation is 0.739.

cor(gapminder.2019$income_2019, gapminder.2019$happiness_2019, use = "complete.obs")

1.5

🏡 While the results of 1.4 certainly suggest there is a strong correlation between these two variables, a more sophisticated approach is to use the R function cor.test.

Using the code in 1.4 as a guide, use the cor.test function to check if there is a statistically significant association between the happiness ratings and average incomes of the countries in our 2019 data set.

Based on the output of this test, what do you conclude?

Note: The default correlation method in the cor.test function is method = "pearson". While this is fine for situations where we have data following a straight or relatively straight line, we should use the method = "spearman" option when dealing with data that exhibits curvature.

🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.

2 Simple Linear Regression (SLR) Models in R

🏡 Let us now consider ways to fit a simple linear regression (SLR) model to our hi_2019_000s.csv data in R.

2.1 Fitting an SLR Model

💻 There are many ways to fit a straight line to data. If we use the lm R function, this will ensure that the sum of the squared residuals of our fitted model will be minimised, which means that our fitted line should fit the data well (if the relationship between the two continuous variables is linear).

Take a look at the code below, which shows the general layout of the arguments for the lm function. Once you are confident you understand the function, try to fit a simple linear regression to the hi_2019_000s.csv data set, where:

  • happiness_2019 is the dependent variable, and
  • income_2019 is the independent variable
linear.model <- lm(formula = y ~ x, data = data.set)

Hint: You will have to replace the y, x and data.set arguments, and you may also like to use a different name for the model.

2.2 Assessing our fitted SLR Model

💻 The best way to assess the lm function output is to use the summary command. Run the R code summary(...), replacing the ... with whatever you called your fitted linear regression model (e.g. hi.linear.model).

What are the estimated coefficients for your model?

2.2.1

🏡 Using the coefficient estimates obtained in 2.2, write down your estimated i.e. fitted linear regression model.

2.2.2

🏡 Use your fitted linear regression model to answer the following question:

For an observation that has a value of income_2019 = 20 (i.e. $20,000), what would be the estimated value of happiness_2019?

2.2.3

🏡 Interpret the value of \(\widehat{\beta}_1\).

2.2.4

🏡 Do we have evidence of a significant linear association between income_2019 and happiness_2019? Does this align with your results from 1?

2.2.5

🏡 What is the multiple R-squared value?

2.2.6

💻 Using the \(R^2\) value, evaluate the fit of the model.

🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.

2.3

💻 Plot a residuals versus fits plot and a Normal Q-Q plot for the linear model you fitted in 2.1, using the following R code:

# Replace the ... below with the name of your fitted model
windows(height = 3.5) # N.B. For MacOS users, use the quartz function instead
par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)

plot(..., which = c(1, 2))

2.4 SLR Assumption Checks

🏡 Using the plots created in 2.3, comment on whether or not you believe the following assumptions have been violated:

  • Linearity (refer to the residuals versus fits plot)
  • Constant variance (refer to the residuals versus fits plot)
  • Normality (refer to the Normal Q-Q plot)
🎧 Online students 💬 Volunteer to share your screen and explain your answers to this question.

3 Understanding Residuals versus Fits Plots

🏡 For this question we will try to develop a better understanding of residuals versus fits plots, by considering several examples. Recall that the response variable for an SLR model is denoted by \(y\), and the explanatory variable is denoted by \(x\).

Before you begin this question, you might like to take a look back at section 2.4 of the Topic 8 readings.

3.1

🏡 In Figure 3.1 below we present 9 plots, labelled Plot 1, Plot 2,Plot 3 and so on.

These are scatter plots of 9 simulated data sets, with responses (\(y_i\)’s) plotted against the explanatory variable values (\(x_i\)’s). Each data set consists of \(n = 100\) simulated pairs of \((x_i, y_i)\) observations - i.e. \((x_1, y_1)\) for the first pair of observations, etc.

Each plot also includes the fitted line associated with the estimated SLR model fitted to the simulated data.

Plots of the $y_i$'s versus $x_i$'s including the estimated least squares line for each of 9 data sets of size $n=100$.

Figure 3.1: Plots of the \(y_i\)’s versus \(x_i\)’s including the estimated least squares line for each of 9 data sets of size \(n=100\).

Take a close look at the plots in Figure 3.1. Using your knowledge of fits and residuals, try and imagine what the associated residuals versus fits plots (also known as the residuals versus fitted plots) would look like for each of these 9 plots.

  • For which of these plots do you think the simple linear regression model (including the linearity and constant variance assumptions) holds at least approximately?
  • Which plots suggest clear violations of the simple linear regression model?
  • Which plots are you not sure about?
🎧 Online students 💬 Enter your answers next to the question in the shared Google Doc.

3.2

🏡 The residual versus fits plots for each of the 9 data sets are shown below in Figure 3.2. However, the order of these plots in Figure 3.2 is random so that, for example, residuals versus fits plot A is not necessarily the plot associated with Plot 1 in Figure 3.1.

Residuals versus fits plots associated with the data sets considered in Figure 4.1.  The order of these plots have been chosen randomly so that `A` does not necessarily associate with `Plot 1` in Figure 4.1 and so on.

Figure 3.2: Residuals versus fits plots associated with the data sets considered in Figure 4.1. The order of these plots have been chosen randomly so that A does not necessarily associate with Plot 1 in Figure 4.1 and so on.


Your task is to match the plots in Figure 3.1 with their respective residuals versus fits plot in Figure 3.2. To do so, complete Table 3.1 by writing the Figure 3.2 label (i.e. A, B etc) under what you believe is the correct (matching) Figure 3.1 label.

Discuss this question and your results with your classmates and demonstrator.

Table 3.1: Matching the Figure 1 and Figure 2 plots
Figure 1 label Plot 1 Plot 2 Plot 3 Plot 4 Plot 5 Plot 6 Plot 7 Plot 8 Plot 9
Figure 2 label


🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.


Once you think you have the correct matches, you can check your results below:

D E G B I A C H F

Using the residuals versus fits plots in Figure 3.2 and your completed table above, are you still happy with your choice of data sets that satisfy the simple linear regression model?
What about for those that do not satisfy the simple linear regression model?

If you were previously unsure whether some data sets did or didn’t satisfy the simple linear regression model, did the residuals versus fitted plots help?

4 Extension: Simple Linear Regression Practice

💻 In this question, we will carry out a simple linear regression analysis using some simulated data.

4.1

💻 Run the R code below to simulate a small data set. Don’t worry if you do not understand the code; it simply needs to be run as a step towards creating the data set that we will then analyse.

set.seed(5050)
sim.x <- abs(2*sample(round(rnorm(50, mean = 2, sd = 2), 1), replace = T)) 
sim.y <- 2*sim.x + round(rpois(50, 8), 2) 
simulated.data <- data.frame(sim.x, sim.y)

4.2

💻 Use the plot function to create a scatter plot of your simulated.data data set.

4.3

💻 The abline R function can be used to add a straight line to an existing plot. The two main arguments for this function are a and b, which are the user-specified values for the intercept and slope of the line respectively.

Use the abline function to fit a straight line of best fit to the scatter plot you created in 4.2, as accurately as you can. You may have to try several different values of a and b before you are happy with the fit.

4.4

💻 Use the lm function to fit a simple linear model to your simulated.data data set.

4.5

💻 Use the summary function to assess the output of your linear model.

Note down the following information from the output:

  • What is the intercept coefficient estimate \(\widehat{\beta}_0\)?
  • What is the slope coefficient estimate \(\widehat{\beta}_1\)?

4.6

💻 Using the coefficient estimates obtained in 4.5, write out the estimated linear regression model.

4.7

💻 Compare the coefficient estimates obtained via the lm function to your selected values of a and b for your fitted abline line from 4.3.

Were you close? Don’t worry if you weren’t, it is typically quite difficult to accurately fit such a line by visual assessment alone - which is why we use R!

To include the fitted linear regression model line to your 4.2 scatter plot, you can use the R code:

# Make sure to change the colour here
abline(linear.model, col = ..., lwd = 2)

Replot your scatter plot from 4.2, add your original abline from 4.3, and then run the code above to add a second, new abline with the SLR specifications, to compare your guess with the fitted model.

4.8

💻 Use the estimated SLR model to answer the following question:

For an observation that has a value of sim.x = 5, what would be the estimated value of sim.y?

4.9

💻 Provide an interpretation of the value of \(\widehat{\beta}_1\).

4.10

💻 Do we have evidence of a significant linear association between sim.x and sim.y?

4.11

💻 What is the multiple R-squared value?

4.12

💻 Using the \(R^2\) value, evaluate the fit of the model.

🎧 Online students 💬 If you were able to reach this point during the lab, volunteer to share your screen and explain your answers to this question.


Well done, that’s everything for this computer lab!


References

Gapminder. 2021a. “Happiness Score (WHR) [.csv File].” 2021. http://gapm.io/dhapiscore\_whr.
———. 2021b. “Income Per Person [.csv File].” 2021. http://gapm.io/dgdppc.


These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.