Using the “swiss” dataset, conduct a regression of two variables of interest. Interpret the assumptions and results. Post your solutions and R code here.
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
Education and Examination have the strongest pairwise correlation in the matrix (r = 0.70), so we’ll examine them.
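The pair can also be picked out programmatically rather than by eye. This is a minimal sketch using only base R; the names `cm`, `idx`, and `strongest` are illustrative and not part of the original post.

```r
# Correlation matrix of the built-in swiss data
cm <- cor(swiss)

# Hide the diagonal (always 1) and the lower triangle (a mirror of the upper)
cm[lower.tri(cm, diag = TRUE)] <- NA

# Find the position of the largest absolute correlation that remains
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)

# Translate the row/column position back into variable names
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
strongest
cm[idx]
```

Taking the absolute value matters here, because several of the strong correlations in the matrix are negative.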
plot(swiss$Examination, swiss$Education)
model <- lm(swiss$Education ~ swiss$Examination)
summary(model)
##
## Call:
## lm(formula = swiss$Education ~ swiss$Examination)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1427 -3.4877 -0.8833 2.7212 24.7560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.9015 2.3507 -1.234 0.223
## swiss$Examination 0.8418 0.1286 6.546 4.81e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.958 on 45 degrees of freedom
## Multiple R-squared: 0.4878, Adjusted R-squared: 0.4764
## F-statistic: 42.85 on 1 and 45 DF, p-value: 4.811e-08
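In a regression with a single predictor, the multiple R-squared is the square of the Pearson correlation between the two variables. The sketch below checks that against the output above; it refits the same model with the `data = swiss` formula interface, and the names `fit`, `r`, and `r2` are illustrative.

```r
# Same model as above, written with the data argument
fit <- lm(Education ~ Examination, data = swiss)

# Pearson correlation between response and predictor
r <- cor(swiss$Education, swiss$Examination)

# R-squared reported by the model summary
r2 <- summary(fit)$r.squared

# The two quantities should agree up to floating-point error
all.equal(r^2, r2)
```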
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.4
ggplot(swiss, aes(x = Examination, y = Education)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'
anova(model)
## Analysis of Variance Table
##
## Response: swiss$Education
## Df Sum Sq Mean Sq F value Pr(>F)
## swiss$Examination 1 2074.5 2074.53 42.853 4.811e-08 ***
## Residuals 45 2178.4 48.41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(model)
## 2.5 % 97.5 %
## (Intercept) -7.6360982 1.833026
## swiss$Examination 0.5827811 1.100760
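The interval for the slope can be reproduced by hand as estimate ± t critical value × standard error, using the 45 residual degrees of freedom. This is a sketch with illustrative names (`fit`, `est`, `se`, `tcrit`, `manual_ci`), refitting the same model with the `data = swiss` interface.

```r
fit <- lm(Education ~ Examination, data = swiss)

# Slope estimate and its standard error from the coefficient table
coefs <- coef(summary(fit))
est <- coefs["Examination", "Estimate"]
se  <- coefs["Examination", "Std. Error"]

# Two-sided 95% critical value of the t distribution on the residual df
tcrit <- qt(0.975, df = df.residual(fit))

# Manual interval, to compare with the Examination row of confint(fit)
manual_ci <- c(est - tcrit * se, est + tcrit * se)
manual_ci
```

Because the whole interval lies above zero, it agrees with the significant t-test on the slope.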
plot(model)
The slope for Examination is 0.84 (95% CI 0.58 to 1.10), and the F-test on 1 and 45 degrees of freedom gives a p-value of 4.811e-08, so the linear relationship is statistically significant and we reject the null hypothesis that the slope is zero. However, with an adjusted R-squared of 0.4764, Examination explains only about 48% of the variability in Education.
The diagnostic plots raise concerns about the assumptions. The residuals show a pattern that suggests heteroscedasticity, the Q-Q plot departs from normality in the upper tail, and although the relationship is roughly linear, the extreme observations at the ends skew the fit. Taken together, these three points suggest a simple linear regression may not be the best model for these data, even though the evidence against the null hypothesis is strong.
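The visual reading of the diagnostic plots can be backed up with formal tests. The sketch below is an addition, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and a studentized (Koenker) Breusch-Pagan test computed by hand in base R as n × R-squared from regressing the squared residuals on the predictor. All object names are illustrative.

```r
fit <- lm(Education ~ Examination, data = swiss)
res <- residuals(fit)

# Normality of residuals: a small p-value indicates departure from normality
sw <- shapiro.test(res)
sw$p.value

# Studentized Breusch-Pagan test via an auxiliary regression:
# regress squared residuals on the predictor, then n * R^2 ~ chi-square(1)
aux <- lm(I(res^2) ~ Examination, data = swiss)
bp_stat <- nobs(aux) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

A small p-value from the second test would support the heteroscedasticity seen in the residual plot. With only 47 observations, both tests have limited power, so they complement the plots rather than replace them.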