Topic 8: Correlation and Simple Linear Regression


In Topic 8 we introduced the concept of analysing data using correlation and simple linear regression. In this computer lab, we will practise describing the relationship between two numeric variables using these statistical techniques.


For some of this computer lab’s questions, we will consider happiness and average income data (Gapminder 2021a, 2021b). You might recall that we assessed some of this data back in Computer Lab 3. Check back there if you’d like a refresher on the details of the happiness and income variables. However, note that the version of the file we will be using in this computer lab has income expressed in thousands of dollars ($000).

1 Correlation

🏡 Suppose that we are interested in determining if there is a correlation between the average happiness rating and the average income level of citizens in a country. In this question we will consider ways to assess this correlation.

1.1

💻 Download the hi_2019_000s.csv file from the LMS, and save it in a relevant location on your PC.

Once you have done so, import the hi_2019_000s.csv file in jamovi.

This data set contains happiness scores (happiness_2019) and average income (income_2019, in $000s) for 16 countries in 2019.

1.2

💻 To begin, create a scatter plot of happiness_2019 on the \(y\)-axis versus income_2019 on the \(x\)-axis. 💬

1.3

💻 Carry out a correlation analysis for happiness_2019 and income_2019 in jamovi. Be sure to include a confidence interval for \(\rho\) in your output by selecting Confidence intervals in the Additional Options section.

Note: The default correlation method is “Pearson”. While this is fine for situations where the data follow a (relatively) straight line, we should use the “Spearman” option when dealing with data that exhibit curvature.
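To see why the choice of method matters, here is an optional Python sketch (jamovi does all of this for you). It uses hypothetical, deliberately curved data rather than the lab data, and the helper names (`pearson_r`, `average_ranks`, `spearman_rho`, `fisher_ci_95`) are made up for illustration. The interval uses the Fisher z-transformation with the 1.96 normal quantile, which is a common approximation; we are assuming, not asserting, that jamovi's interval is computed in a similar way.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks from 1 to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def fisher_ci_95(r, n):
    """Approximate 95% CI for rho via Fisher z (needs n > 3 and |r| < 1)."""
    z = math.atanh(r)
    half_width = 1.96 / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Hypothetical curved (exponential) data, NOT the happiness/income data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
lower, upper = fisher_ci_95(r, len(x))

print(f"Pearson r    = {r:.3f}, approx. 95% CI ({lower:.3f}, {upper:.3f})")
print(f"Spearman rho = {rho:.3f}")
```

Because these hypothetical \(y\) values always increase with \(x\), the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient is noticeably smaller because the points do not lie on a straight line.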

Based on the output of this test, what do you conclude? 💬

2 Understanding Residuals versus Fits Plots

🏡 For this question we will try to develop a better understanding of simple linear regression (SLR) model residuals versus fits plots, by considering several examples. Recall that the response variable for an SLR model is denoted by \(y\), and the explanatory variable is denoted by \(x\).

Before you begin this question, you might like to take a look back at section 2.4 of the Topic 8 readings.

2.1

🏡 In Figure 2.1 below we present 9 plots, labelled Plot 1, Plot 2, Plot 3 and so on.

These are scatter plots of 9 simulated data sets, with responses (\(y_i\)’s) plotted against the explanatory variable values (\(x_i\)’s). Each data set consists of \(n = 100\) simulated pairs of \((x_i, y_i)\) observations.

Each plot also includes the fitted line associated with the estimated least squares model fitted to the simulated data.

Figure 2.1: Plots of the \(y_i\)’s versus \(x_i\)’s including the estimated least squares line for each of 9 data sets of size \(n=100\).

Take a close look at the plots in Figure 2.1. Using your knowledge of fits and residuals, try to imagine what the associated residuals versus fits plots (also known as residuals versus fitted plots) would look like for each of these 9 plots.

  • For which of these plots do you think the simple linear regression model (including the linearity and constant variance assumptions) holds at least approximately?
  • Which plots suggest clear violations of the simple linear regression model?
  • Which plots are you not sure about?
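If it helps to picture how a residuals versus fits plot is built, the optional Python sketch below computes the fits and residuals by hand. The data are hypothetical (a deliberately curved relationship, \(y = x^2\)) and are not one of the 9 simulated data sets; the function name `least_squares` is also just an illustrative choice.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical curved data: y = x^2 for x = 0, 1, ..., 10.
x = list(range(11))
y = [v ** 2 for v in x]

b0, b1 = least_squares(x, y)

# Fit = value on the fitted line; residual = observed value minus fit.
fits = [b0 + b1 * v for v in x]
residuals = [obs - fit for obs, fit in zip(y, fits)]

# Each (fit, residual) pair is one point on a residuals versus fits plot.
for fit, res in zip(fits, residuals):
    print(f"fit = {fit:6.1f}   residual = {res:6.1f}")
```

For these data the residuals are positive at both ends and negative in the middle. That U-shaped pattern in a residuals versus fits plot is the kind of signal that suggests the linearity assumption does not hold.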

2.2

🏡 The residuals versus fits plots for each of the 9 data sets are shown below in Figure 2.2. However, the order of the plots in Figure 2.2 is random so that, for example, residuals versus fits plot A is not necessarily the plot associated with Plot 1 in Figure 2.1.

Figure 2.2: Residuals versus fits plots associated with the data sets considered in Figure 2.1. The order of these plots has been chosen randomly, so A is not necessarily associated with Plot 1 in Figure 2.1, and so on.


Your task is to match the plots in Figure 2.1 with their respective residuals versus fits plots in Figure 2.2. To do so, complete Table 2.1 by writing the Figure 2.2 label (i.e. A, B, etc.) under what you believe is the correct (matching) Figure 2.1 label.

Discuss this question and your results with your classmates and demonstrator.

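Table 2.1: Matching the Figure 2.1 and Figure 2.2 plots

| Figure 2.1 label | Plot 1 | Plot 2 | Plot 3 | Plot 4 | Plot 5 | Plot 6 | Plot 7 | Plot 8 | Plot 9 |
|------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Figure 2.2 label |        |        |        |        |        |        |        |        |        |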


Once you think you have the correct matches, you can check your results against the answers below, listed in order from Plot 1 to Plot 9:

D E G B I A C H F

Using the residuals versus fits plots in Figure 2.2 and your completed table above, are you still happy with your choice of data sets that satisfy the simple linear regression model? What about for those that do not satisfy the simple linear regression model?

If you were previously unsure whether some data sets did or didn’t satisfy the simple linear regression model, did the residuals versus fitted plots help?

3 Simple Linear Regression

3.1

💻 Using jamovi, fit a simple linear regression model, modelling happiness_2019 (dependent variable) against income_2019 (independent variable, or ‘covariate’).

3.2

💻 Add a “line of best fit” (the line that represents the simple linear regression model) to your scatter plot of the data. To do so:

  • In jamovi, click on the scatter plot you created earlier
  • In the Analyses window, under the Regression Line heading, select Linear.

By looking at your scatter plot and line of best fit, do you believe the variables are linearly associated? 💬

Note: When answering the questions that follow, you may assume that the linear regression assumptions have been met. We will check these assumptions more formally later.

3.3

🏡 What are the estimated coefficients? 💬

3.4

🏡 Using the coefficient estimates obtained, write down the estimated linear regression model. 💬
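As a reminder of the general form (using the notation from the start of Question 2, with \(x\) standing for income_2019 in $000s and \(\widehat{y}\) for the estimated happiness_2019), an estimated simple linear regression model is written as:

\[
\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 x
\]

Your answer should have this shape, with \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) replaced by the estimates from your jamovi output.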

3.5

🏡 Use the estimated model to answer the following question:

For an observation that has a value of income_2019 = 20 (i.e., $20,000), what would be the estimated value of happiness_2019? 💬
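The calculation itself is just a substitution into the estimated model. The optional Python sketch below shows the idea on a small hypothetical data set (not the lab data); the function names `fit_slr` and `predict` are illustrative only. In the lab you would plug income_2019 = 20 into the model you wrote down from your jamovi output.

```python
def fit_slr(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(b0, b1, new_x):
    """Estimated response for a given value of the explanatory variable."""
    return b0 + b1 * new_x


# Hypothetical data, used only to illustrate the arithmetic.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_slr(x, y)
estimate = predict(b0, b1, 6)

print(f"Estimated model: y-hat = {b0:.2f} + {b1:.2f} x")
print(f"Estimated y at x = 6: {estimate:.2f}")
```

For these hypothetical values the estimates are approximately 0.09 (intercept) and 1.97 (slope), so the estimated response at \(x = 6\) is about 11.91.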

3.6

🏡 Interpret the value of \(\widehat{\beta}_1\). 💬
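One way to see what the slope measures is to compare the estimated responses at \(x + 1\) and at \(x\):

\[
\bigl(\widehat{\beta}_0 + \widehat{\beta}_1 (x + 1)\bigr) - \bigl(\widehat{\beta}_0 + \widehat{\beta}_1 x\bigr) = \widehat{\beta}_1
\]

So \(\widehat{\beta}_1\) is the estimated change in the response for a one-unit increase in the explanatory variable. Keep the units in mind: because income_2019 is recorded in $000s, one unit here is $1,000.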

3.7

🏡 Do we have evidence of a significant linear association between income_2019 and happiness_2019? 💬 Does this align with your results from Question 1?
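When comparing with the correlation analysis in Question 1, it can help to know that, in simple linear regression, the \(t\) statistic for testing the slope is algebraically the same as the \(t\) statistic for testing a Pearson correlation of zero. The optional Python sketch below checks this numerically on hypothetical data (not the lab data); the function name `slope_and_correlation_t` is illustrative. Both statistics are compared with a \(t\) distribution on \(n - 2\) degrees of freedom, and jamovi reports the resulting p-value for you.

```python
import math


def slope_and_correlation_t(x, y):
    """Return (t for the slope, t for the Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    # Least squares estimates and the residual sum of squares.
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

    # Standard error of the slope: s / sqrt(Sxx), with s^2 = SSE / (n - 2).
    se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    t_slope = b1 / se_b1

    # t statistic for testing that the Pearson correlation is zero.
    r = sxy / math.sqrt(sxx * syy)
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t_slope, t_corr


# Hypothetical data, used only to illustrate the equivalence.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 1.5, 3.5, 3.0, 5.5, 4.5]

t_slope, t_corr = slope_and_correlation_t(x, y)
print(f"t (slope)       = {t_slope:.4f}")
print(f"t (correlation) = {t_corr:.4f}")
```

The two printed values agree (up to rounding error), which is why the regression test of the slope and the Pearson correlation test lead to the same conclusion.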

3.8

🏡 What is the multiple R-squared value? 💬

3.8.1

🏡 Using the \(R^2\) value, evaluate the fit of the model. 💬
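As a reminder of what \(R^2\) measures, the optional Python sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares, and confirms that in simple linear regression it equals the square of the Pearson correlation. The data are hypothetical (not the lab data) and the function name `r_squared` is illustrative.

```python
import math


def r_squared(x, y):
    """Return (R^2, Pearson r) for the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)  # total sum of squares (SST)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    # Residual sum of squares (SSE): variation left unexplained by the line.
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

    r = sxy / math.sqrt(sxx * syy)
    return 1 - sse / syy, r


# Hypothetical data, used only to illustrate the calculation.
x = [2, 4, 6, 8, 10]
y = [3.0, 4.5, 4.0, 6.5, 7.0]

r2, r = r_squared(x, y)
print(f"R-squared           = {r2:.4f}")
print(f"Pearson r, squared  = {r ** 2:.4f}")
```

The value is the proportion of the variation in the response that is explained by the fitted line, so values closer to 1 indicate a closer fit.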

3.9

🏡 Comment on whether or not you believe the following assumptions have been violated:

  • Linearity (refer to residuals versus fits plot) 💬
  • Constant variance (refer to residuals versus fits plot) 💬
  • Normality (refer to Normal Q-Q plot) 💬
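For the Normal Q-Q plot, it may help to see how the points are constructed: the sorted residuals are paired with the quantiles we would expect from a standard normal distribution. The optional Python sketch below does this for a handful of hypothetical residuals (not from the lab data). The plotting positions \((i - 0.5)/n\) are one common convention and are an assumption here; jamovi may use a slightly different one, and it may also standardise the residuals first. The function name `normal_qq_points` is illustrative.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n for i = 1, ..., n (one common choice).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals, used only to illustrate the construction.
residuals = [-1.2, 0.4, 0.1, -0.3, 0.9, -0.6, 0.7]

points = normal_qq_points(residuals)
for theoretical, observed in points:
    print(f"theoretical = {theoretical:6.3f}   observed = {observed:5.2f}")
```

If the residuals are approximately normally distributed, these points fall close to a straight line; systematic curvature or points far from the line at the ends suggest a departure from normality.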

4 Simple Linear Regression: more practice

💻 In this question, we will carry out a simple linear regression analysis using some simulated data. Download the file called sim_data.csv from the LMS, and import it into jamovi.

The data set has two variables of interest:

  • sim.x
  • sim.y

4.1

Using jamovi, create a scatter plot of the sim_data.csv data, with sim.y on the \(y\)-axis and sim.x on the \(x\)-axis.

4.2

By looking at your scatter plot, where do you think the most appropriate “line of best fit” belongs? More specifically:

  • What value would you guess for the \(y\)-intercept of the line?
  • What value would you guess for the slope of the line?

4.3

Now use jamovi to add a line of best fit to your scatter plot. By looking at the line, do you think your guesses were close?

Don’t worry if they weren’t. It is typically quite difficult to fit such a line accurately by visual assessment alone, which is why we use jamovi!

4.4

Using jamovi, fit a simple linear regression model, modelling sim.y (dependent variable) against sim.x (independent variable, or ‘covariate’).

Note down the following information from the output:

  • What is the intercept coefficient estimate \(\widehat{\beta}_0\)?
  • What is the slope coefficient estimate \(\widehat{\beta}_1\)?

Now, by looking at the estimated intercept and slope coefficients from the simple linear regression analysis, do you think your guesses were close?

4.5

Using the coefficient estimates obtained in 4.4, write down the estimated linear regression model.

4.6

Use the estimated model to answer the following question:

For an observation that has a value of sim.x = 5, what would be the estimated value of sim.y?

4.7

Interpret the value of \(\widehat{\beta}_1\).

4.8

Do we have evidence of a significant linear association between sim.x and sim.y?

4.9

What is the multiple R-squared value?

4.10

Using the \(R^2\) value, evaluate the fit of the model.


Well done, that’s everything for today! If you still have time, you may like to have a go at Quiz 9, which is based on the Topic 9 readings.

Before you finish up, remember to save your work (e.g. your jamovi and Word files) somewhere safe (e.g. OneDrive) so that you can access it at a later time.


References

Gapminder. 2021a. “Happiness Score (WHR) [.csv File].” 2021. http://gapm.io/dhapiscore_whr.
———. 2021b. “Income Per Person [.csv File].” 2021. http://gapm.io/dgdppc.


These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.