These are the solutions for Computer Lab 9. The data analysed are from Gapminder (2021a) and Gapminder (2021b).
No answer required.
# Read in the 2019 Gapminder happiness and income data
gapminder.2019 <- read.csv(file = "hi_2019_000s.csv", header = TRUE)
Example R code is provided below:
plot(gapminder.2019$income_2019, gapminder.2019$happiness_2019, pch = 21, bg = "chartreuse3",
cex = 1.2, main = "2019 Happiness index versus Income per person",
xlab = "Average income per person", ylab = "Average happiness score")
cor(gapminder.2019$income_2019, gapminder.2019$happiness_2019, use = "complete.obs")
## [1] 0.7389331
Example R code is provided below:
cor.test(gapminder.2019$income_2019, gapminder.2019$happiness_2019, use = "complete.obs")
##
## Pearson's product-moment correlation
##
## data: gapminder.2019$income_2019 and gapminder.2019$happiness_2019
## t = 13.387, df = 149, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6567159 0.8037912
## sample estimates:
## cor
## 0.7389331
Based on this output, we conclude that there is a statistically significant correlation between the 2019 average happiness levels and 2019 average income levels of citizens of the countries in our data set. Note that the \(p\)-value is approximately \(0\), and the confidence interval does not contain \(0\). These both suggest that we should reject the null hypothesis that the correlation between these variables is \(0\).
Since our correlation sample estimate is 0.739, it seems safe to conclude that there is a strong positive correlation between these two variables.
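As an optional check (not required for the lab), the \(t\) statistic reported by cor.test() can be reproduced from the sample correlation using the formula \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), with \(n - 2 = 149\) degrees of freedom as in the output above:
r <- 0.7389331
r * sqrt(149) / sqrt(1 - r^2) # approximately 13.39, matching the t value reported by cor.test()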
Note that, since the scatter plot in 1.3 exhibits some curvature, you may have chosen to compute the Spearman correlation instead. Example R code for this approach is provided below:
cor.test(gapminder.2019$income_2019, gapminder.2019$happiness_2019,
use = "complete.obs", method = "spearman")
##
## Spearman's rank correlation rho
##
## data: gapminder.2019$income_2019 and gapminder.2019$happiness_2019
## S = 114194, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8009868
The high rho value of 0.8 here supports our conclusion that there is a strong positive correlation between these two variables.
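As an aside (not part of the required answer), Spearman's rho is the Pearson correlation computed on the ranks of the data, so the value above can be reproduced manually. The sketch below assumes we restrict attention to complete cases, since some countries have missing values:
complete <- complete.cases(gapminder.2019$income_2019, gapminder.2019$happiness_2019)
cor(rank(gapminder.2019$income_2019[complete]),
    rank(gapminder.2019$happiness_2019[complete])) # approximately 0.80, matching the rho value above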
No answer required.
Example code is provided below:
hi.linear.model <- lm(formula = gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019)
Example code is provided below:
summary(hi.linear.model)
##
## Call:
## lm(formula = gapminder.2019$happiness_2019 ~ gapminder.2019$income_2019)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4665 -4.2288 0.4431 4.9295 18.1519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.19680 0.88234 52.36 <2e-16 ***
## gapminder.2019$income_2019 0.43363 0.03239 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.553 on 149 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.546, Adjusted R-squared: 0.543
## F-statistic: 179.2 on 1 and 149 DF, p-value: < 2.2e-16
The income_2019 coefficient estimate \(\widehat{\beta}_1\) is 0.43. The estimated linear regression model is
\[\widehat{\text{happiness_2019}} = 46.20 + 0.43 \times \text{income_2019}.\]
For income_2019 = 20, our estimated value of happiness_2019 is
\[46.20 + 0.43 \times 20 = 54.8.\]
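The same prediction can also be obtained with predict(). This requires refitting the model with a data argument (a hypothetical name hi.model.alt is used below), so that predict() can match the variable name income_2019 in the new data frame:
hi.model.alt <- lm(happiness_2019 ~ income_2019, data = gapminder.2019)
predict(hi.model.alt, newdata = data.frame(income_2019 = 20))
# approximately 54.87; slightly different from 54.8 because the hand calculation rounds the coefficients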
An interpretation of \(\widehat{\beta}_1\) is as follows: We estimate that on average, a 1 unit increase (i.e. a $1000 increase) in average income will lead to an increase of 0.43 in the average happiness rating.
The \(p\)-value for the slope coefficient is very close to 0 (< 2e-16). This \(p\)-value is for a test of the form \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\). Since we have \(p < 0.05\), we reject \(H_0\) and conclude that \(\beta_1\) is not equal to zero. Therefore, we have evidence of a significant linear association between income_2019 and happiness_2019.
The multiple \(R^2\) value is \(0.546\).
The multiple \(R^2\) value of \(0.546\) indicates that \(54.6\%\) of the variation in the response can be explained by our linear regression model, suggesting that the model is a good fit.
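As a quick check (not required for the lab), in simple linear regression the multiple \(R^2\) is the square of the Pearson correlation between the two variables, so it can be recovered from the correlation computed earlier with cor():
0.7389331^2 # approximately 0.546, matching the Multiple R-squared in the summary() output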
Example code is provided below:
# Replace the ... below with the name of your fitted model
# windows() opens a new graphics window on Windows; on macOS or Linux, quartz() or x11() can be used instead
windows(height = 3.5)
par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)
plot(hi.linear.model, which = c(1, 2))
The residuals plot appears to show signs of curvature, meaning a pattern is visible. Therefore, it is likely that the linearity assumption has been violated.
Although the variance of the residuals appears to decrease as the fitted values increase (that is, as we look from left to right, the dots become less and less spread out), we also observe fewer observations as the fitted values increase. It is natural to expect less variation where there are fewer observations, so we do not have major concerns with regard to the constant variance assumption.
The Normal Q-Q plot looks very good in the middle of the plot, although there is some slight deviation at the tails. Overall, there are no major concerns regarding normality, especially when considering the sample size.
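For completeness (not required for the lab), the remaining default diagnostic plots for the fitted model, the Scale-Location and Residuals vs Leverage plots, can be displayed as follows:
par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)
plot(hi.linear.model, which = c(3, 5))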
No answer required.
The correct sequence is
| Figure 1 label | Plot 1 | Plot 2 | Plot 3 | Plot 4 | Plot 5 | Plot 6 | Plot 7 | Plot 8 | Plot 9 |
|---|---|---|---|---|---|---|---|---|---|
| Figure 2 label | D | E | G | B | I | A | C | H | F |
Simulation code is as follows.
set.seed(5050) # set the seed so the simulation is reproducible
sim.x <- abs(2*sample(round(rnorm(50, mean = 2, sd = 2), 1), replace = T)) # simulate 50 non-negative x values
sim.y <- 2*sim.x + round(rpois(50, 8), 2) # y is a linear function of x plus Poisson noise
simulated.data <- data.frame(sim.x, sim.y) # store both variables in a data frame
Example code is provided below:
plot(sim.x, sim.y, main = "Simulated Data")
Example code is provided below:
# We could try something like the following
abline(a = 4.6, b = 2.6, col = "red", lwd = 2)
Example code is provided below:
linear.model <- lm(formula = sim.y ~ sim.x, data = simulated.data)
Example code is provided below:
summary(linear.model)
##
## Call:
## lm(formula = sim.y ~ sim.x, data = simulated.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5816 -2.6011 -0.5427 2.3184 6.4184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5298 0.9127 8.25 9.26e-11 ***
## sim.x 2.0185 0.1627 12.41 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.2 on 48 degrees of freedom
## Multiple R-squared: 0.7623, Adjusted R-squared: 0.7574
## F-statistic: 154 on 1 and 48 DF, p-value: < 2.2e-16
We note that the intercept estimate is \(\widehat{\beta}_0 = 7.5298\) and the slope estimate is \(\widehat{\beta}_1 = 2.0185\). Both of these estimates are statistically significantly different from 0.
The estimated linear regression model is
\[\widehat{\text{sim.y}} = 7.5298 + 2.0185\times \text{sim.x}.\]
Example R code is shown below, with the guessed line in red, and the fitted line in blue.
plot(sim.x, sim.y, main = "Simulated Data")
abline(a = 4.6, b = 2.6, col = "red", lwd = 2)
abline(linear.model, col = "blue", lwd =2)
For sim.x = 5, our estimated value of sim.y is
\[7.5298 + 2.0185 \times 5 = 17.6223.\]
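The same prediction can be obtained directly with predict(), since linear.model was fitted with a data argument:
predict(linear.model, newdata = data.frame(sim.x = 5)) # approximately 17.62, agreeing with the hand calculation above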
The value of \(\widehat{\beta}_1\) is \(2.0185\). This means that for every one-unit increase in sim.x, we estimate that on average the value of sim.y will be \(2.0185\) units higher.
The \(p\)-value for the slope coefficient is very close to 0 (< 2e-16). This \(p\)-value is for a test of the form \(H_0 : \beta_1 = 0\) versus \(H_1 : \beta_1 \neq 0\). Since we have \(p < 0.05\), we reject \(H_0\) and conclude that \(\beta_1\) is not equal to zero. Therefore, we have evidence of a significant linear association between sim.x and sim.y.
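A 95% confidence interval for \(\beta_1\) can also be obtained with confint(); consistent with the small \(p\)-value, this interval should not contain 0:
confint(linear.model, level = 0.95)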
The multiple R-squared value is 0.7623.
The multiple R-squared value of 0.7623 indicates that \(76.23\%\) of the variation in the response can be explained by our linear regression model, suggesting that the model is a good fit.
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University, including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) License.