I am using the “swiss” dataset, a cross-sectional dataset of fertility
and socio-economic indicators for the Swiss population around 1888. The
dataset includes six variables, all of which are quantitative, each
describing a characteristic of one of 47 French-speaking Swiss provinces.
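The description above can be checked directly. The following is a minimal sketch that relies only on base R (the `swiss` data frame ships with the `datasets` package) and confirms the number of observations and variables and that every column is numeric.

```r
# swiss is part of the base R "datasets" package, so nothing needs to be installed.
data(swiss)

dim(swiss)                      # number of rows (provinces) and columns (variables)
all(sapply(swiss, is.numeric))  # TRUE if every column is quantitative
anyNA(swiss)                    # TRUE if any value is missing
```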
head(swiss, n=10)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Broye 83.8 70.2 16 7 92.85
## Glane 92.4 67.8 14 8 97.16
## Gruyere 82.4 53.3 12 7 97.67
## Sarine 82.9 45.2 16 13 91.38
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
## Broye 23.6
## Glane 24.9
## Gruyere 21.0
## Sarine 24.4
summary(swiss)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
library(ggplot2)
ggplot(data = swiss, aes(x = Education, y = Infant.Mortality)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Education", y = "Infant Mortality",
       title = "The relationship between Education and Infant Mortality")
In my regression, I will continue investigating
the relationship between education levels and infant mortality rates. I
will additionally include the Agriculture variable, since agriculture-heavy
areas tend to be less wealthy and more remote, which may prevent their
inhabitants from accessing quality healthcare. My other controls are
Examination and Catholic. In this dataset, Examination records the share of
draftees receiving the highest mark on the army examination rather than
pre-natal examinations, so it acts as a further proxy for educational
attainment, which I expect to be negatively related to infant mortality.
Catholic is the share of Catholics within a population, included because
they may be more likely to reject treatment during some high-risk
pregnancies.
\(Infant\ Mortality_{i} = \beta_{0} + \beta_{1} Education_{i} + \beta_{2} Agriculture_{i} + \beta_{3} Examination_{i} + \beta_{4} Catholic_{i} + \epsilon_{i}\)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
OLS <- lm(Infant.Mortality~Education+Agriculture+Examination+Catholic, data=swiss)
tab_model(OLS)
Dependent variable: Infant.Mortality

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 22.41 | 17.16 – 27.66 | <0.001 |
| Education | -0.08 | -0.23 – 0.06 | 0.254 |
| Agriculture | -0.05 | -0.10 – 0.01 | 0.109 |
| Examination | -0.00 | -0.21 – 0.20 | 0.982 |
| Catholic | 0.02 | -0.01 – 0.05 | 0.184 |
| Observations | 47 | | |
| R² / R² adjusted | 0.097 / 0.011 | | |
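For readers without sjPlot installed, the same quantities can be pulled from the fitted model with base R. This is a sketch, assuming the table's intervals are ordinary 95% t-based confidence intervals (the default for `confint()` on an `lm` object); the names `coef_table`, `ci`, and `fit` are introduced here for illustration only.

```r
# Refit the same model so that this block runs on its own.
OLS <- lm(Infant.Mortality ~ Education + Agriculture + Examination + Catholic,
          data = swiss)

# Estimates, standard errors, t statistics and p-values.
fit <- summary(OLS)
coef_table <- coef(fit)

# 95% confidence intervals (confint() defaults to level = 0.95).
ci <- confint(OLS)

# Estimates, interval bounds and p-values side by side, rounded to three decimals.
round(cbind(Estimate = coef_table[, "Estimate"], ci,
            p = coef_table[, "Pr(>|t|)"]), 3)

# R-squared and adjusted R-squared.
round(c(R2 = fit$r.squared, adj_R2 = fit$adj.r.squared), 3)
```

With only 47 observations and no coefficient other than the intercept significant at conventional levels, the low adjusted R-squared is worth reading alongside the individual estimates.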
# Response vector y and one vector per regressor
InfantMortality <- swiss$Infant.Mortality
Education <- swiss$Education
Agriculture <- swiss$Agriculture
Examination <- swiss$Examination
Catholic <- swiss$Catholic
# Column of ones so that the first coefficient is the intercept
BetaZero <- rep(x = 1, length.out = nrow(swiss))
# Design matrix X (47 rows, 5 columns)
Matrix <- cbind(BetaZero, Education, Agriculture, Examination, Catholic)
# OLS estimator: (X'X)^(-1) X'y
Slopes <- solve(t(Matrix) %*% Matrix) %*% t(Matrix) %*% InfantMortality
Slopes <- round(x = Slopes, digits = 2)
Slopes
## [,1]
## BetaZero 22.41
## Education -0.08
## Agriculture -0.05
## Examination 0.00
## Catholic 0.02
As we can see, we first generate six vectors: one for the dependent
variable, four for our independent variables, and a vector of ones
corresponding to our \(\beta_{0}\) coefficient. We then bind the vector of
ones and the four regressors together column-wise, creating the design
matrix \(X\). Afterwards, we input the OLS formula
\(\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y\) into R, storing the result
under the label “Slopes”. The results, rounded to two decimals, are
identical to the ones we got from the OLS regression.
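The comparison can also be made without rounding. The sketch below is one possible variant, not the method used above: it lets `model.matrix()` build the design matrix, solves the normal equations with `solve(A, b)` instead of forming the inverse explicitly, and compares the result with `lm()` through `all.equal()`. The names `f`, `X`, `y`, `beta_hat`, and `ols_coef` are introduced here for illustration.

```r
# Design matrix built by R itself: an "(Intercept)" column of ones plus the
# four regressors, in the order given in the formula.
f <- Infant.Mortality ~ Education + Agriculture + Examination + Catholic
X <- model.matrix(f, data = swiss)
y <- swiss$Infant.Mortality

# Solve the normal equations (X'X) b = X'y directly;
# crossprod(X) is t(X) %*% X and crossprod(X, y) is t(X) %*% y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare against lm() at full precision; all.equal() tolerates
# floating-point noise.
ols_coef <- coef(lm(f, data = swiss))
all.equal(as.vector(beta_hat), unname(ols_coef))
```

Solving the linear system rather than inverting \(X^{\top}X\) is generally the more numerically stable choice, although for a small, well-behaved dataset like this one both routes should agree.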