I. Dataset ~ Swiss


      I am using the “swiss” dataset, a cross-sectional dataset of socio-economic indicators for the Swiss population around 1888. The dataset includes six variables, all of them quantitative, each describing a characteristic of one of 47 French-speaking Swiss provinces.
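A quick structural check can confirm these claims about the shape of the data. The sketch below uses base R only and assumes the built-in copy of `swiss` from the datasets package; the object names `n_obs`, `n_vars`, and `all_numeric` are illustrative.

```r
# Structural check of the built-in swiss data frame
data(swiss)

n_obs  <- nrow(swiss)                           # number of observations (rows)
n_vars <- ncol(swiss)                           # number of variables (columns)
all_numeric <- all(sapply(swiss, is.numeric))   # TRUE if every column is quantitative

n_obs
n_vars
all_numeric
str(swiss)                                      # compact overview of each column
```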

head(swiss, n=10)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
## Broye             83.8        70.2          16         7    92.85
## Glane             92.4        67.8          14         8    97.16
## Gruyere           82.4        53.3          12         7    97.67
## Sarine            82.9        45.2          16        13    91.38
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
## Broye                    23.6
## Glane                    24.9
## Gruyere                  21.0
## Sarine                   24.4
summary(swiss)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60


library(ggplot2)
# Scatter plot of infant mortality against education with a fitted OLS trendline
ggplot(data = swiss, aes(x = Education, y = Infant.Mortality)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Education", y = "Infant Mortality",
       title = "The relationship between Education and Infant Mortality")

      Since we are looking at cross-sectional data, a scatter plot with a trendline is a natural way to display the relationship graphically.
      While the trendline slopes downwards, which is consistent with the hypothesis that more education is associated with lower infant mortality, there is no visible discrete jump in the distribution of the data. Ideally, there would be such a jump, so that we could confirm that completing high school, for example, decreases infant mortality. Several factors could explain its absence; I suspect that, in this case, it is due to the relatively small number of observations, but it could also be that the effect simply does not exist.
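The downward slope of the trendline can also be quantified rather than read off the plot. The sketch below refits the same one-regressor line with `lm()` and computes the Pearson correlation; `simple_fit`, `slope`, and `r` are illustrative names, not part of the original analysis.

```r
# One-regressor fit corresponding to the trendline in the scatter plot
data(swiss)

simple_fit <- lm(Infant.Mortality ~ Education, data = swiss)
slope <- unname(coef(simple_fit)["Education"])          # slope of the fitted line
r     <- cor(swiss$Education, swiss$Infant.Mortality)   # Pearson correlation

slope
r
r^2   # share of the variance in infant mortality explained by education alone
```

A slope close to zero together with a small squared correlation would suggest that the visual trend is weak, which fits the caution expressed above.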

II. Regression Model


      In my regression, I will continue investigating the impact of education levels on infant mortality rates. I will additionally include the Agriculture variable, since agriculture-heavy areas tend to be less wealthy and more remote, possibly preventing their inhabitants from accessing quality healthcare. My other controls are the Examination variable, which in this dataset records the percentage of draftees receiving the highest mark on the army examination and so serves as a further proxy for schooling, which I expect to be negatively associated with infant mortality, and the share of Catholics in the population, on the hypothesis that they are more likely to reject treatment during some high-risk pregnancies.

\(Infant\ Mortality_{i} = \beta_{0} + \beta_{1} Education_{i} + \beta_{2} Agriculture_{i} + \beta_{3} Examination_{i} + \beta_{4} Catholic_{i} + \epsilon_{i}\)


III. Linear Regression Results


library(sjPlot)
library(sjmisc)
library(sjlabelled)

OLS <- lm(Infant.Mortality~Education+Agriculture+Examination+Catholic, data=swiss)

tab_model(OLS)
                      Infant.Mortality
Predictors            Estimates   CI              p
(Intercept)           22.41       17.16 – 27.66   <0.001
Education             -0.08       -0.23 – 0.06    0.254
Agriculture           -0.05       -0.10 – 0.01    0.109
Examination           -0.00       -0.21 – 0.20    0.982
Catholic              0.02        -0.01 – 0.05    0.184
Observations          47
R2 / R2 adjusted      0.097 / 0.011


      As we can see from the regression results, none of our variables has a statistically significant association with infant mortality at conventional levels. Since the Education variable is measured in percentage points, the interpretation of the \(\beta_{1}\) coefficient would be that for each additional percentage point of education in a province, infant mortality (the percentage of live births who die before the age of one) decreases on average by 0.08 percentage points, ceteris paribus.
      There are some clear issues with the model. Most importantly, the dataset does not include enough variables to be used as controls, and the adjusted R2 of 0.011 shows that the model explains almost none of the variation in infant mortality. Controls such as average income level and employment rate, among others, would be necessary to draw any conclusions from the results of this regression.
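The figures in the regression table can be cross-checked with base R alone, without sjPlot. The sketch below refits the same model and pulls out the estimates, 95% confidence intervals, and p-values; `coef_table`, `ci`, and `p_values` are illustrative names.

```r
# Reproduce the regression table with base R
data(swiss)
OLS <- lm(Infant.Mortality ~ Education + Agriculture + Examination + Catholic,
          data = swiss)

coef_table <- coef(summary(OLS))            # estimates, std. errors, t values, p-values
ci         <- confint(OLS, level = 0.95)    # 95% confidence intervals
p_values   <- coef_table[-1, "Pr(>|t|)"]    # p-values of the slopes (intercept dropped)

round(cbind(Estimate = coef_table[, "Estimate"], ci,
            p = coef_table[, "Pr(>|t|)"]), 3)
all(p_values > 0.05)                        # TRUE if no slope is significant at 5%
round(summary(OLS)$r.squared, 3)
round(summary(OLS)$adj.r.squared, 3)
```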

IV. Results Confirmation

      To confirm my results, I will use the formula:
\(\beta=(X'X)^{-1}X'y\).
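For reference, this formula follows from minimizing the sum of squared residuals. A short sketch of the derivation, assuming that \(X'X\) is invertible (no perfect collinearity among the regressors), where \(X\) is the 47-by-5 matrix holding a column of ones and the four regressors and \(y\) is the vector of infant mortality rates:

\[ S(\beta) = (y - X\beta)'(y - X\beta) \]
\[ \frac{\partial S}{\partial \beta} = -2X'(y - X\beta) = 0 \]
\[ X'X\beta = X'y \]
\[ \beta = (X'X)^{-1}X'y \]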

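# Dependent variable and the four regressors as plain numeric vectors
InfantMortality <- as.vector(swiss$Infant.Mortality)
Education <- as.vector(swiss$Education)
Agriculture <- as.vector(swiss$Agriculture)
Examination <- as.vector(swiss$Examination)
Catholic <- as.vector(swiss$Catholic)
# Column of ones for the intercept, one entry per observation
BetaZero <- rep(x = 1, length.out = nrow(swiss))

# Design matrix: intercept column followed by the regressors
Matrix <- cbind(BetaZero, Education, Agriculture, Examination, Catholic)

# OLS estimator (X'X)^{-1} X'y
Slopes <- solve(t(Matrix) %*% Matrix) %*% t(Matrix) %*% InfantMortality
Slopes <- round(x = Slopes, digits = 2)
Slopes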
##              [,1]
## BetaZero    22.41
## Education   -0.08
## Agriculture -0.05
## Examination  0.00
## Catholic     0.02


      As we can see, after generating six vectors (one for the dependent variable, four for the independent variables, and a column of ones for the \(\beta_{0}\) coefficient), we bind the last five together into a matrix. We then enter the formula in R, storing the result in an object I labeled “Slopes”. The results, rounded to two decimals, match the ones we got from the OLS regression; the Examination estimate prints as 0.00 here and as -0.00 in the regression table only because of how each output displays a small negative number after rounding.
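The comparison can also be made programmatically instead of by eye. The sketch below takes the design matrix straight from the fitted model with `model.matrix()` and compares the unrounded matrix solution with `coef()`; the names `X`, `y`, `beta_hat`, and `max_abs_diff` are illustrative.

```r
# Compare the matrix-formula estimates with lm() without rounding
data(swiss)
OLS <- lm(Infant.Mortality ~ Education + Agriculture + Examination + Catholic,
          data = swiss)

X <- model.matrix(OLS)          # design matrix, including the intercept column
y <- swiss$Infant.Mortality     # dependent variable

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y

# Largest absolute difference between the two sets of estimates
max_abs_diff <- max(abs(as.vector(beta_hat) - unname(coef(OLS))))
max_abs_diff
isTRUE(all.equal(as.vector(beta_hat), unname(coef(OLS))))
```

Any difference should be on the order of floating-point error, since `lm()` solves the same least-squares problem, although internally it uses a QR decomposition rather than inverting \(X'X\) directly.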