Topic 8: Correlation and Simple Linear Regression


These are the solutions for Computer Lab 9. Data analysed is from Gapminder (2021a), also (Gapminder 2021b).


1 Correlation

1.1

No answer required.

1.2

gapminder.2019 <- read.csv(file = "hi_2019_000s.csv", header = T)

1.3

Example R code is provided below:

plot(gapminder.2019$income_2019, gapminder.2019$happiness_2019, pch = 21, bg = "chartreuse3",
     cex = 1.2, main = "2019 Happiness index versus Income per person",
     xlab = "Average income per person", ylab = "Average happiness score")

1.4

cor(gapminder.2019$income_2019, gapminder.2019$happiness_2019, use = "complete.obs")
## [1] 0.7389331

1.5

Example R code is provided below:

cor.test(gapminder.2019$income_2019, gapminder.2019$happiness_2019, use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  gapminder.2019$income_2019 and gapminder.2019$happiness_2019
## t = 13.387, df = 149, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6567159 0.8037912
## sample estimates:
##       cor 
## 0.7389331

Based on this output, we conclude that there is a statistically significant correlation between the 2019 average happiness levels and 2019 average income levels of citizens of the countries in our data set. Note that the \(p\)-value is approximately \(0\), and the confidence interval does not contain \(0\). These both suggest that we should reject the null hypothesis that the correlation between these variables is \(0\).

Since our correlation sample estimate is 0.739, it seems safe to conclude that there is a strong positive correlation between these two variables.

Note that, since the scatter plot in 1.3 exhibits some curvature, you may have chosen to compute the spearman correlation. Example R code for this approach is provided below:

cor.test(gapminder.2019$income_2019, gapminder.2019$happiness_2019,
         use = "complete.obs", method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  gapminder.2019$income_2019 and gapminder.2019$happiness_2019
## S = 114194, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8009868

The high rho value of 0.8 here supports our conclusion that there is a strong positive correlation between these two variables.

2 Simple Linear Regression (SLR) Models in R

No answer required.

2.1 Fitting an SLR Model

Example code is provided below:

hi.linear.model <- lm(formula = gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019)

2.2 Assessing our fitted SLR Model

Example code is provided below:

summary(hi.linear.model)
## 
## Call:
## lm(formula = gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4665  -4.2288   0.4431   4.9295  18.1519 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                46.19680    0.88234   52.36   <2e-16 ***
## gapminder.2019$income_2019  0.43363    0.03239   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.553 on 149 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.546,  Adjusted R-squared:  0.543 
## F-statistic: 179.2 on 1 and 149 DF,  p-value: < 2.2e-16
  • The intercept coefficient estimate \(\widehat{\beta}_0\) is 46.2.
  • The income_2019 coefficient estimate \(\widehat{\beta}_1\) is 0.43.

2.2.1

The estimated linear regression model is

\[\widehat{\text{happiness_2019}} = 46.20 + 0.43 \times \text{income_2019}.\]

2.2.2

For income_2019 = 20, our estimated value of happiness_2019 is

\[46.20 + 0.43 \times 20 = 54.8.\]

2.2.3

An interpretation of \(\widehat{\beta}_1\) is as follows: We estimate that on average, a 1 unit increase (i.e. a $1000 increase) in average income will lead to an increase of 0.43 in the average happiness rating.

2.2.4

The \(p\)-value for the slope coefficient is very close to 0 (2e-16). This \(p\)-value is for a test of the form \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\). Since we have \(p < 0.05\), we reject \(H_0\) and conclude that \(\beta_1\) is not equal to zero. Therefore, we have evidence of a significant linear association between income_2019 and happiness_2019.

2.2.5

The multiple \(R^2\) value is \(0.546\).

2.2.6

The multiple \(R^2\) value of \(0.546\) indicates that \(54.6\%\) of the variation in the response can be explained by our linear regression model, suggesting that the model is a good fit.

2.3

Example code is provided below:

# Replace the ... below with the name of your fitted model
windows(height = 3.5)
par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)

plot(hi.linear.model, which = c(1, 2))

2.4 SLR Assumption Checks

  • The residuals plot appears to show signs of curvature, meaning a pattern is visible. Therefore, it is likely that the linearity assumption has been violated.

  • Although the variance of the residuals appears to decrease as the fitted values increase (that is, as we look from left to right, the dots become less and less spread out), we also observe less observations as the fitted values increase. It is natural to expect less variation where we see less observations, so we do not have major concerns with regard to the constant variance assumptions.

  • The Normal Q-Q plot looks very good in the middle of the plot, although there is some slight deviation away at the tails. Overall, there are no major concerns regarding normality, especially when considering the sample size.

3 Understanding Residuals versus Fits Plots

3.1

No answer required.

3.2

The correct sequence is

Table 3.1: Matching the Figure 1 and Figure 2 plots
Figure 1 label Plot 1 Plot 2 Plot 3 Plot 4 Plot 5 Plot 6 Plot 7 Plot 8 Plot 9
Figure 2 label D E G B I A C H F

4 Extension: Simple Linear Regression Practice

4.1

Simulation code is as follows.

set.seed(5050)

sim.x <- abs(2*sample(round(rnorm(50, mean = 2, sd = 2), 1), replace = T)) 
sim.y <- 2*sim.x + round(rpois(50, 8), 2) 
simulated.data <- data.frame(sim.x, sim.y)

4.2

Example code is provided below:

plot(sim.x, sim.y, main = "Simulated Data")

4.3

Example code is provided below:

# We could try something like the following
abline(a = 4.6, b = 2.6, col = "red", lwd = 2)

4.4

Example code is provided below:

linear.model <- lm(formula = sim.y ~ sim.x, data = simulated.data)

4.5

Example code is provided below:

summary(linear.model)
## 
## Call:
## lm(formula = sim.y ~ sim.x, data = simulated.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5816 -2.6011 -0.5427  2.3184  6.4184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.5298     0.9127    8.25 9.26e-11 ***
## sim.x         2.0185     0.1627   12.41  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.2 on 48 degrees of freedom
## Multiple R-squared:  0.7623, Adjusted R-squared:  0.7574 
## F-statistic:   154 on 1 and 48 DF,  p-value: < 2.2e-16

We note that

  • The intercept coefficient estimate \(\widehat{\beta}_0\) is 7.5298.
  • The slope coefficient estimate \(\widehat{\beta}_1\) is 2.0185.

Both of these estimates are statistically significantly different from 0.

4.6

The estimated linear regression model is

\[\widehat{\text{sim.y}} = 7.5298 + 2.0185\times \text{sim.x}.\]

4.7

Example R code is shown below, with the guessed line in red, and the fitted line in blue.

plot(sim.x, sim.y, main = "Simulated Data")
abline(a = 4.6, b = 2.6, col = "red", lwd = 2)
abline(linear.model, col = "blue", lwd =2)

4.8

For sim.x=5, our estimated value of sim.y is \[7.5298 + 2.0185\times 5 = 17.6223.\]

4.9

The value of \(\widehat{\beta}_1\) is \(2.0185\). This means that for every one-unit increase in sim.x, we estimate that on average the value of sim.y will be \(2.0185\) units higher.

4.10

The \(p\)-value for the slope coefficient is very close to 0 (2e-16). This \(p\)-value is for a test of the form \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\). Since we have \(p < 0.05\), we reject \(H_0\) and conclude that \(\beta_1\) is not equal to zero. Therefore, we have evidence of a significant linear association between sim.x and sim.y.

4.10.1

The multiple R-squared value is 0.7623.

4.10.2

The multiple R-squared value of 0.7623 indicates that \(76.23\%\) of the variation in the response can be explained by our linear regression model, which is a good fit.


That’s all the questions covered! If there were any parts you were unsure about, take a look back over the relevant sections of the Topic 8 material.


References

Gapminder. 2021a. “Happiness Score (WHR) [.csv File].” 2021. http://gapm.io/dhapiscore\_whr.
———. 2021b. “Income Per Person [.csv File].” 2021. http://gapm.io/dgdppc.


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.