Using the “swiss” dataset, conduct a regression of two variables of interest. Interpret the assumptions and results. Post your solutions and R code here.

str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
cor(swiss)
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000

The strongest pairwise correlation in the matrix is between Education and Examination (r = 0.698), so we will model that relationship, regressing Education on Examination.
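As a quick aside, the strongest pair can also be located programmatically rather than by eyeballing the matrix; a minimal sketch, where zeroing the diagonal simply masks the trivial self-correlations:

cors <- cor(swiss)
diag(cors) <- 0                                  # ignore self-correlations
idx <- which(abs(cors) == max(abs(cors)), arr.ind = TRUE)[1, ]
rownames(cors)[idx]                              # the pair with the largest |r|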

plot(swiss$Examination, swiss$Education)  # predictor on x, response on y

model <- lm(swiss$Education ~ swiss$Examination)
summary(model)
## 
## Call:
## lm(formula = swiss$Education ~ swiss$Examination)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1427  -3.4877  -0.8833   2.7212  24.7560 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.9015     2.3507  -1.234    0.223    
## swiss$Examination   0.8418     0.1286   6.546 4.81e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.958 on 45 degrees of freedom
## Multiple R-squared:  0.4878, Adjusted R-squared:  0.4764 
## F-statistic: 42.85 on 1 and 45 DF,  p-value: 4.811e-08
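As a quick sanity check, in a simple regression the Multiple R-squared is just the squared correlation between the two variables, which matches the summary above:

cor(swiss$Education, swiss$Examination)^2        # 0.6984153^2 ≈ 0.4878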
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.4
ggplot(swiss, aes(x = Examination, y = Education)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'

anova(model)
## Analysis of Variance Table
## 
## Response: swiss$Education
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## swiss$Examination  1 2074.5 2074.53  42.853 4.811e-08 ***
## Residuals         45 2178.4   48.41                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
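Note that with a single predictor the ANOVA F value is simply the square of the slope's t statistic from the summary (6.546^2 ≈ 42.85), so the two tables agree.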
confint(model)
##                        2.5 %   97.5 %
## (Intercept)       -7.6360982 1.833026
## swiss$Examination  0.5827811 1.100760
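To read the slope in concrete terms, the model can be refit with the data argument so that predict() works on new data; a minimal sketch (the Examination value of 20 is just an illustrative input):

model2 <- lm(Education ~ Examination, data = swiss)
predict(model2, newdata = data.frame(Examination = 20),
        interval = "confidence")                 # predicted Education when Examination = 20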
plot(model)

Given that the adjusted R-squared is 0.4764 and the p-value of the F-test on 1 and 45 degrees of freedom is 4.811e-08, the relationship is statistically significant; however, Examination explains only about 48% of the variability in Education. The diagnostic plots raise three concerns: the residuals-versus-fitted plot shows a pattern consistent with heteroscedasticity, the normal Q-Q plot shows poor normality for values in the upper tail, and while the relationship is roughly linear, the extreme observations at the tails pull on the fit. Together these suggest that a simple linear regression may not be the best model, yet there is still strong evidence to reject the null hypothesis that the slope is zero.
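These visual impressions can be backed up with formal tests; a minimal sketch, assuming the lmtest package is installed (shapiro.test is in base R):

library(lmtest)
bptest(model)                                    # Breusch-Pagan test for heteroscedasticity
shapiro.test(resid(model))                       # Shapiro-Wilk test for residual normality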