In Topic 8 we introduced the concept of analysing data using correlation and simple linear regression. In this computer lab, we will practice describing the relationship between two numeric variables, using these statistical techniques.
For some of this computer lab’s questions, we will consider happiness and average income data (Gapminder 2021a, 2021b). You might recall that we assessed some of this data back in Computer Lab 3. Check back there if you’d like a refresher on the details of the happiness and income variables. Note, however, that the version of the file used in this computer lab has income expressed in thousands of dollars ($000).
🏡 Suppose that we are interested in determining if there is a correlation between the average happiness rating and the average income level of citizens in a country. In this question we will consider ways to assess this correlation.
💻 Download the hi_2019_000s.csv file from the LMS, and save it in a relevant location on your PC.
Once you have done so, import the hi_2019_000s.csv file into jamovi.
This data set contains happiness scores (happiness_2019) and average income (income_2019, in $000s) for 16 countries in 2019.
💻 To begin, create a scatter plot of happiness_2019 on the \(y\)-axis versus income_2019 on the \(x\)-axis. 💬
💻 Carry out a correlation analysis for happiness_2019 and income_2019 in jamovi. Be sure to include a confidence interval for \(\rho\) in your output by selecting Confidence intervals in the Additional Options section.
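To see where such an interval can come from, here is a minimal Python sketch of one common approach, the Fisher \(z\)-transformation. This lab does not describe the method jamovi uses internally, so treat the choice of method as an assumption. The data values and the function names `pearson_r` and `fisher_ci` are made up for illustration; they are not taken from hi_2019_000s.csv.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for rho via the Fisher z-transformation."""
    z = math.atanh(r)                    # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Hypothetical values for illustration only (NOT the lab data).
income = [2.0, 5.0, 9.0, 14.0, 22.0, 30.0, 41.0, 55.0]
happiness = [4.1, 4.6, 5.0, 5.4, 6.1, 6.3, 6.9, 7.2]

r = pearson_r(income, happiness)
lower, upper = fisher_ci(r, len(income))
print(f"r = {r:.3f}, 95% CI for rho: ({lower:.3f}, {upper:.3f})")
```

If the interval excludes 0, this is consistent with rejecting the null hypothesis \(\rho = 0\) at the corresponding significance level.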
Note: The default correlation method is “Pearson”. This is fine when the data follow a (relatively) straight line, but we should use the “Spearman” option when the data exhibit curvature.
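The difference between the two methods can be sketched in Python. The example below uses made-up data that increase along a curve; Spearman’s coefficient is computed as the Pearson correlation of the ranks. The function names are hypothetical helpers written for this illustration, not jamovi functions.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up data: y always increases with x, but along a curve.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

print(f"Pearson:  {pearson_r(x, y):.3f}")
print(f"Spearman: {spearman_rho(x, y):.3f}")
```

Because \(y\) always increases as \(x\) increases, the ranks agree perfectly and Spearman’s coefficient is 1, whereas Pearson’s coefficient is below 1 because the points do not lie on a straight line.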
Based on the output of this test, what do you conclude? 💬
🏡 For this question we will try to develop a better understanding of simple linear regression (SLR) model residuals versus fits plots, by considering several examples. Recall that the response variable for an SLR model is denoted by \(y\), and the explanatory variable is denoted by \(x\).
Before you begin this question, you might like to take a look back at section 2.4 of the Topic 8 readings.
🏡 In Figure 2.1 below we present 9 plots, labelled Plot 1, Plot 2, Plot 3 and so on.
These are scatter plots of 9 simulated data sets, with responses (\(y_i\)’s) plotted against the explanatory variable values (\(x_i\)’s). Each data set consists of \(n = 100\) simulated pairs of \((x_i, y_i)\) observations.
Each plot also includes the fitted line associated with the estimated least squares model fitted to the simulated data.
Figure 2.1: Plots of the \(y_i\)’s versus \(x_i\)’s including the estimated least squares line for each of 9 data sets of size \(n=100\).
Take a close look at the plots in Figure 2.1. Using your knowledge of fits and residuals, try and imagine what the associated residuals versus fits plots (also known as the residuals versus fitted plots) would look like for each of these 9 plots.
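As a reminder of how fits and residuals are obtained, the Python sketch below fits a least squares line to a small made-up data set that follows a curve, then lists each fitted value with its residual. The data and the helper name `fit_line` are assumptions for illustration; they are not one of the 9 simulated data sets.

```python
def fit_line(x, y):
    """Least squares estimates (intercept, slope) for a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up curved data: y = x^2, so a straight line is the wrong model.
x = [0, 1, 2, 3, 4, 5, 6]
y = [v ** 2 for v in x]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * v for v in x]                  # the "fits"
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus fitted

for f, e in zip(fitted, residuals):
    print(f"fitted = {f:6.1f}, residual = {e:5.1f}")
```

Here the residuals are positive at both ends and negative in the middle (5, 0, -3, -4, -3, 0, 5), which is the kind of systematic pattern a residuals versus fits plot reveals when the straight-line model is inappropriate.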
🏡 The residuals versus fits plots for each of the 9 data sets are shown below in Figure 2.2. However, the order of these plots in Figure 2.2 is random so that, for example, residuals versus fits plot A is not necessarily the plot associated with Plot 1 in Figure 2.1.
Figure 2.2: Residuals versus fits plots associated with the data sets considered in Figure 2.1. The order of these plots has been chosen randomly so that A does not necessarily associate with Plot 1 in Figure 2.1 and so on.
Your task is to match the plots in Figure 2.1 with their respective residuals versus fits plot in Figure 2.2. To do so, complete Table 2.1 by writing the Figure 2.2 label (i.e. A, B etc) under what you believe is the correct (matching) Figure 2.1 label.
Discuss this question and your results with your classmates and demonstrator.
Figure 2.1 label | Plot 1 | Plot 2 | Plot 3 | Plot 4 | Plot 5 | Plot 6 | Plot 7 | Plot 8 | Plot 9 |
Figure 2.2 label |  |  |  |  |  |  |  |  |  |
Once you think you have the correct matches, you can check your results below (click the Code button):
D E G B I A C H F
Using the residuals versus fits plots in Figure 2.2 and your completed table above, are you still happy with your choice of data sets that satisfy the simple linear regression model? What about for those that do not satisfy the simple linear regression model?
If you were previously unsure whether some data sets did or didn’t satisfy the simple linear regression model, did the residuals versus fitted plots help?
💻 Using jamovi, fit a simple linear regression model, modelling happiness_2019 (dependent variable) against income_2019 (independent variable, or ‘covariate’).
💻 Add a “line of best fit” (the line that represents the simple linear regression model) to your scatter plot of the data. To do so:
Linear.
By looking at your scatter plot and line of best fit, do you believe the variables are linearly associated? 💬
Note: When answering the questions that follow, you may assume that the linear regression assumptions have been met. We will check these more formally later, but for now, please assume the assumptions have been met.
🏡 What are the estimated coefficients? 💬
🏡 Using the coefficient estimates obtained, write down the estimated linear regression model. 💬
🏡 Use the estimated model to answer the following question:
If income_2019 = 20 (i.e., $20,000), what would be the estimated value of happiness_2019? 💬
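The calculation behind this kind of question can be sketched in Python. The example below estimates the intercept and slope by least squares and then substitutes a value of the explanatory variable into the estimated model. The data values are made up for illustration and are not taken from hi_2019_000s.csv, so the numbers printed will differ from your jamovi output; `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least squares estimates (intercept, slope) for a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical values for illustration only (NOT the lab data).
income = [2.0, 5.0, 9.0, 14.0, 22.0, 30.0, 41.0, 55.0]   # in $000s
happiness = [4.1, 4.6, 5.0, 5.4, 6.1, 6.3, 6.9, 7.2]

b0, b1 = fit_line(income, happiness)

# Substitute income = 20 (i.e. $20,000) into the estimated model.
predicted = b0 + b1 * 20.0

print(f"Estimated model: happiness = {b0:.3f} + {b1:.3f} * income")
print(f"Estimated happiness when income = 20: {predicted:.3f}")
```

Note that the income value is entered as 20, not 20000, because the variable is recorded in thousands of dollars.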
🏡 Interpret the value of \(\widehat{\beta}_1\). 💬
🏡 Do we have evidence of a significant linear association between income_2019 and happiness_2019? 💬 Does this align with your results from 1?
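One reason the two analyses should agree is that, in simple linear regression, the \(t\) statistic for testing \(\beta_1 = 0\) equals the \(t\) statistic for testing \(\rho = 0\) with the Pearson correlation. The Python sketch below checks this numerically on made-up data (not the lab data); the function names are hypothetical. It computes the test statistics only, so a p-value would still need to come from jamovi or from a \(t\) distribution with \(n - 2\) degrees of freedom.

```python
import math


def slope_t_statistic(x, y):
    """t statistic for H0: beta_1 = 0 in a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return b1 / se_b1


def correlation_t_statistic(x, y):
    """t statistic for H0: rho = 0 based on the Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical values for illustration only (NOT the lab data).
income = [2.0, 5.0, 9.0, 14.0, 22.0, 30.0, 41.0, 55.0]
happiness = [4.1, 4.6, 5.0, 5.4, 6.1, 6.3, 6.9, 7.2]

t_slope = slope_t_statistic(income, happiness)
t_corr = correlation_t_statistic(income, happiness)
print(f"t (slope) = {t_slope:.4f}, t (correlation) = {t_corr:.4f}")
```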
🏡 What is the multiple R-squared value? 💬
🏡 Using the \(R^2\) value, evaluate the fit of the model. 💬
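Recall that \(R^2\) is the proportion of the variation in the response that is explained by the model. The Python sketch below computes it as \(1 - SSE/SST\) on made-up data (not the lab data) and confirms that, for simple linear regression, it equals the square of the Pearson correlation. The function name `r_squared` is a hypothetical helper.

```python
import math


def r_squared(x, y):
    """R^2 = 1 - SSE/SST for a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - mean_y) ** 2 for yi in y)                      # total
    return 1 - sse / sst


# Hypothetical values for illustration only (NOT the lab data).
income = [2.0, 5.0, 9.0, 14.0, 22.0, 30.0, 41.0, 55.0]
happiness = [4.1, 4.6, 5.0, 5.4, 6.1, 6.3, 6.9, 7.2]

# Pearson correlation, computed directly for comparison.
n = len(income)
mx, my = sum(income) / n, sum(happiness) / n
r = sum((a - mx) * (b - my) for a, b in zip(income, happiness)) / math.sqrt(
    sum((a - mx) ** 2 for a in income) * sum((b - my) ** 2 for b in happiness)
)

r2 = r_squared(income, happiness)
print(f"R^2 = {r2:.4f}, r^2 = {r * r:.4f}")
```

A value of \(R^2\) close to 1 indicates that the fitted line accounts for most of the variation in the response, while a value close to 0 indicates that it accounts for very little.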
🏡 Comment on whether or not you believe the following assumptions have been violated:
💻 In this question, we will carry out a simple linear regression analysis using some simulated data. Download the file called sim_data.csv from the LMS, and import it into jamovi.
The data set has two variables of interest:
sim.x
sim.y
Using jamovi, create a scatter plot of the sim_data.csv data, with sim.y on the \(y\)-axis and sim.x on the \(x\)-axis.
By looking at your scatter plot, where do you think the most appropriate “line of best fit” belongs? More specifically:
Now use jamovi to add a line of best fit to your scatter plot. By looking at the line, do you think your guesses were close?
Don’t worry if you weren’t: it is typically quite difficult to accurately fit such a line by visual assessment alone, which is why we use jamovi!
Using jamovi, fit a simple linear regression model, modelling sim.y (dependent variable) against sim.x (independent variable, or ‘covariate’).
Note down the following information from the output:
Now, by looking at the estimated intercept and slope coefficients from the simple linear regression analysis, do you think your guesses were close?
Using the coefficient estimates obtained from the regression output, write down the estimated linear regression model.
Use the estimated model to answer the following question:
If sim.x = 5, what would be the estimated value of sim.y?
Interpret the value of \(\widehat{\beta}_1\).
Do we have evidence of a significant linear association between sim.x and sim.y?
What is the multiple R-squared value?
Using the \(R^2\) value, evaluate the fit of the model.
These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.