No solution required.
Note that this scatter plot also includes a line of best fit:
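A scatter plot of this kind can be produced in base R with code along the following lines. This is only a sketch: the data frame gapminder.2019 and its columns income_2019 and happiness_2019 are taken from the R output later in this section, while the axis labels and plotting choices are assumptions.

```r
# Scatter plot of 2019 happiness against 2019 income
plot(gapminder.2019$income_2019, gapminder.2019$happiness_2019,
     xlab = "Average income, 2019", ylab = "Average happiness, 2019")

# Add the least-squares line of best fit
abline(lm(happiness_2019 ~ income_2019, data = gapminder.2019), col = "blue")
```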
Compare your results with the R output as follows:
##
## Pearson's product-moment correlation
##
## data: gapminder.2019$income_2019 and gapminder.2019$happiness_2019
## t = 13.387, df = 149, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6567159 0.8037912
## sample estimates:
## cor
## 0.7389331
Based on this output, we conclude that there is a statistically significant correlation between the 2019 average happiness levels and 2019 average income levels of citizens of the countries in our data set. Note that the \(p\)-value is approximately \(0\), and the confidence interval does not contain \(0\). These both suggest that we should reject the null hypothesis that the correlation between these variables is \(0\).
Since our correlation sample estimate is 0.739, it seems safe to conclude that there is a strong positive correlation between these two variables.
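For reference, output of the form shown above is produced by a call such as the one below (a sketch; cor.test() computes Pearson's product-moment correlation by default).

```r
# Pearson correlation test between 2019 income and 2019 happiness
cor.test(gapminder.2019$income_2019, gapminder.2019$happiness_2019)
```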
Note that, since the scatter plot in 1.2 exhibits some curvature, you may have chosen to compute the Spearman correlation instead. Example R output for this approach is provided below:
##
## Spearman's rank correlation rho
##
## data: gapminder.2019$income_2019 and gapminder.2019$happiness_2019
## S = 114194, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8009868
The high rho value of 0.8 here supports our conclusion that there is a strong positive correlation between these two variables.
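The Spearman output above corresponds to a call of the following form (a sketch; setting method = "spearman" requests the rank correlation rho).

```r
# Spearman rank correlation test between 2019 income and 2019 happiness
cor.test(gapminder.2019$income_2019, gapminder.2019$happiness_2019,
         method = "spearman")
```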
No answer required.
The correct sequence is
| Figure 1 label | Plot 1 | Plot 2 | Plot 3 | Plot 4 | Plot 5 | Plot 6 | Plot 7 | Plot 8 | Plot 9 |
|---|---|---|---|---|---|---|---|---|---|
| Figure 2 label | D | E | G | B | I | A | C | H | F |
Compare your results with the R output below:
##
## Call:
## lm(formula = gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4665 -4.2288 0.4431 4.9295 18.1519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.19680 0.88234 52.36 <2e-16 ***
## gapminder.2019$income_2019 0.43363 0.03239 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.553 on 149 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.546, Adjusted R-squared: 0.543
## F-statistic: 179.2 on 1 and 149 DF, p-value: < 2.2e-16
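For reference, this summary corresponds to a model fit of the following form (a sketch; the formula is taken from the Call line above, while the object name happiness.lm is a hypothetical choice).

```r
# Fit the simple linear regression of happiness on income and print the summary
happiness.lm <- lm(gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019)
summary(happiness.lm)
```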
For the scatter plot, refer to Question 1.2.
Looking at the scatter plot and the line of best fit, the variables do not appear to be strictly linearly associated; the relationship shows some curvature.
The intercept coefficient estimate \(\widehat{\beta}_0\) is 46.197, and the income_2019 coefficient estimate \(\widehat{\beta}_1\) is 0.434.
The estimated linear regression model is
\[\widehat{\text{happiness\_2019}} = 46.197 + 0.434 \times \text{income\_2019}.\]
For income_2019 = 20, our estimated value of happiness_2019 is
\[46.197 + 0.434 \times 20 = 54.877.\]
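As a check, this fitted value can also be computed in R directly from the coefficient estimates (a sketch; small differences from the hand calculation above are due to rounding of the coefficients).

```r
# Fitted value of happiness_2019 at income_2019 = 20, using the unrounded estimates
b <- coef(lm(gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019))
unname(b[1] + b[2] * 20)   # roughly 54.87, agreeing with the hand calculation up to rounding
```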
We estimate that, on average, a 1 unit increase (i.e. a $1000 increase) in average income is associated with an increase of 0.434 in the average happiness rating.
The \(p\)-value for the slope coefficient is very close to 0 (reported as < 2e-16). This \(p\)-value is for a test of the form \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\). Since we have \(p < 0.05\), we reject \(H_0\) and conclude that \(\beta_1\) is not equal to zero. Therefore, we have evidence of a significant linear association between income_2019 and happiness_2019.
The multiple \(R^2\) value is \(0.546\).
The multiple \(R^2\) value of \(0.546\) indicates that \(54.6\%\) of the variation in the response can be explained by our linear regression model, suggesting that the model is a good fit.
The residuals plot appears to show signs of curvature, meaning a pattern is visible. Therefore, it is likely that the linearity assumption has been violated.
Although the variance of the residuals appears to decrease as the fitted values increase (that is, as we look from left to right, the dots become less and less spread out), we also observe fewer observations as the fitted values increase. It is natural to expect less variation where there are fewer observations, so we do not have major concerns with regard to the constant variance assumption.
The Normal Q-Q plot looks very good in the middle of the plot, although there is some slight deviation at the tails. Overall, there are no major concerns regarding normality, especially when considering the sample size.
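Diagnostic plots of this kind can be obtained directly from the fitted model object, for example as sketched below (reusing the hypothetical happiness.lm fit from the earlier sketch).

```r
# Residuals vs fitted values (linearity / constant variance) and Normal Q-Q plot
happiness.lm <- lm(gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019)
par(mfrow = c(1, 2))
plot(happiness.lm, which = 1)   # Residuals vs Fitted
plot(happiness.lm, which = 2)   # Normal Q-Q
par(mfrow = c(1, 1))
```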
No solutions required.
No solutions required.
Compare your results with the following R output:
##
## Call:
## lm(formula = sim.y ~ sim.x, data = simulated.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5816 -2.6011 -0.5427 2.3184 6.4184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5298 0.9127 8.25 9.26e-11 ***
## sim.x 2.0185 0.1627 12.41 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.2 on 48 degrees of freedom
## Multiple R-squared: 0.7623, Adjusted R-squared: 0.7574
## F-statistic: 154 on 1 and 48 DF, p-value: < 2.2e-16
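Output of this form is produced by a fit such as the following (a sketch; the formula and data frame name are taken from the Call line above, while the object name sim.lm is a hypothetical choice).

```r
# Fit the simple linear regression of sim.y on sim.x and print the summary
sim.lm <- lm(sim.y ~ sim.x, data = simulated.data)
summary(sim.lm)
```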
We note that the intercept coefficient estimate \(\widehat{\beta}_0\) is 7.5298, and that the slope coefficient estimate \(\widehat{\beta}_1\) is 2.0185. Both of these estimates are statistically significantly different from 0.
The estimated linear regression model is
\[\widehat{\text{sim.y}} = 7.5298 + 2.0185\times \text{sim.x}.\]
For sim.x = 5, our estimated value of sim.y is
\[7.5298 + 2.0185 \times 5 = 17.6223.\]
The value of \(\widehat{\beta}_1\) is \(2.0185\). This means that for every one-unit increase in sim.x, we estimate that, on average, the value of sim.y will be \(2.0185\) units higher.
The \(p\)-value for the slope coefficient is very close to 0 (reported as < 2e-16). This \(p\)-value is for a test of the form \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\). Since we have \(p < 0.05\), we reject \(H_0\) and conclude that \(\beta_1\) is not equal to zero. Therefore, we have evidence of a significant linear association between sim.x and sim.y.
The multiple R-squared value is 0.7623.
The multiple R-squared value of 0.7623 indicates that \(76.23\%\) of the variation in the response can be explained by our linear regression model, suggesting that the model is a good fit.
These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.