Dataset: Swiss
Tell us what the dependent and independent variables are.
#Using the Swiss dataset in R
swiss
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Broye 83.8 70.2 16 7 92.85
## Glane 92.4 67.8 14 8 97.16
## Gruyere 82.4 53.3 12 7 97.67
## Sarine 82.9 45.2 16 13 91.38
## Veveyse 87.1 64.5 14 6 98.61
## Aigle 64.1 62.0 21 12 8.52
## Aubonne 66.9 67.5 14 7 2.27
## Avenches 68.9 60.7 19 12 4.43
## Cossonay 61.7 69.3 22 5 2.82
## Echallens 68.3 72.6 18 2 24.20
## Grandson 71.7 34.0 17 8 3.30
## Lausanne 55.7 19.4 26 28 12.11
## La Vallee 54.3 15.2 31 20 2.15
## Lavaux 65.1 73.0 19 9 2.84
## Morges 65.5 59.8 22 10 5.23
## Moudon 65.0 55.1 14 3 4.52
## Nyone 56.6 50.9 22 12 15.14
## Orbe 57.4 54.1 20 6 4.20
## Oron 72.5 71.2 12 1 2.40
## Payerne 74.2 58.1 14 8 5.23
## Paysd'enhaut 72.0 63.5 6 3 2.56
## Rolle 60.5 60.8 16 10 7.72
## Vevey 58.3 26.8 25 19 18.46
## Yverdon 65.4 49.5 15 8 6.10
## Conthey 75.5 85.9 3 2 99.71
## Entremont 69.3 84.9 7 6 99.68
## Herens 77.3 89.7 5 2 100.00
## Martigwy 70.5 78.2 12 6 98.96
## Monthey 79.4 64.9 7 3 98.22
## St Maurice 65.0 75.9 9 9 99.06
## Sierre 92.2 84.6 3 3 99.46
## Sion 79.3 63.1 13 13 96.83
## Boudry 70.4 38.4 26 12 5.62
## La Chauxdfnd 65.7 7.7 29 11 13.79
## Le Locle 72.7 16.7 22 13 11.22
## Neuchatel 64.4 17.6 35 32 16.92
## Val de Ruz 77.6 37.6 15 7 4.97
## ValdeTravers 67.6 18.7 25 7 8.65
## V. De Geneve 35.0 1.2 37 53 42.34
## Rive Droite 44.7 46.6 16 29 50.43
## Rive Gauche 42.8 27.7 22 29 58.33
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
## Broye 23.6
## Glane 24.9
## Gruyere 21.0
## Sarine 24.4
## Veveyse 24.5
## Aigle 16.5
## Aubonne 19.1
## Avenches 22.7
## Cossonay 18.7
## Echallens 21.2
## Grandson 20.0
## Lausanne 20.2
## La Vallee 10.8
## Lavaux 20.0
## Morges 18.0
## Moudon 22.4
## Nyone 16.7
## Orbe 15.3
## Oron 21.0
## Payerne 23.8
## Paysd'enhaut 18.0
## Rolle 16.3
## Vevey 20.9
## Yverdon 22.5
## Conthey 15.1
## Entremont 19.8
## Herens 18.3
## Martigwy 19.4
## Monthey 20.2
## St Maurice 17.8
## Sierre 16.3
## Sion 18.1
## Boudry 20.3
## La Chauxdfnd 20.5
## Le Locle 18.9
## Neuchatel 23.0
## Val de Ruz 20.0
## ValdeTravers 19.5
## V. De Geneve 18.0
## Rive Droite 18.2
## Rive Gauche 19.3
library(psych)
describe(swiss)
## vars n mean sd median trimmed mad min max range
## Fertility 1 47 70.14 12.49 70.40 70.66 10.23 35.00 92.5 57.50
## Agriculture 2 47 50.66 22.71 54.10 51.16 23.87 1.20 89.7 88.50
## Examination 3 47 16.49 7.98 16.00 16.08 7.41 3.00 37.0 34.00
## Education 4 47 10.98 9.62 8.00 9.38 5.93 1.00 53.0 52.00
## Catholic 5 47 41.14 41.70 15.14 39.12 18.65 2.15 100.0 97.85
## Infant.Mortality 6 47 19.94 2.91 20.00 19.98 2.82 10.80 26.6 15.80
## skew kurtosis se
## Fertility -0.46 0.26 1.82
## Agriculture -0.32 -0.89 3.31
## Examination 0.45 -0.14 1.16
## Education 2.27 6.14 1.40
## Catholic 0.48 -1.67 6.08
## Infant.Mortality -0.33 0.78 0.42
The independent variable is Education, and the dependent variable is Infant.Mortality. Estimating equation: \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \] \[ \hat{Infant.Mortality}_i = \hat{\beta_0} + \hat{\beta_1} Education_i \] where y = infant mortality (live births who live less than 1 year) and x = the percentage of draftees with education beyond primary school. The fitted value \(\hat{y}_i\) carries no error term; the residual is the gap between the observed and fitted values, \(\hat{\epsilon_i} = y_i - \hat{y}_i\).
plot(Infant.Mortality ~ Education, data = swiss)
swiss.lm <- lm(Infant.Mortality ~ Education, data=swiss)
coef(swiss.lm)
## (Intercept) Education
## 20.27286508 -0.03008655
# Add the line of best fit directly from the fitted model
# (equivalent to passing the two values returned by coef(swiss.lm))
abline(swiss.lm)
The slope can be interpreted as follows: each 1 percentage point increase in the share of draftees with education beyond primary school is associated with a decrease of about 0.03 in predicted infant mortality. The intercept is the predicted infant mortality when education is 0% (that is, no draftees were educated beyond primary school), which is ~20.27.
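This reading of the coefficients can be checked with `predict()`. The sketch below refits the same model under a new name (`fit`, chosen only for this example) and asks for fitted values at 0% and 1% education. Note that `describe(swiss)` reports a minimum Education of 1, so the value at 0% is an extrapolation just outside the observed data.

```r
# Refit the simple regression (fit is a name used only in this sketch)
fit <- lm(Infant.Mortality ~ Education, data = swiss)

# Predicted infant mortality at 0% and 1% post-primary education
preds <- predict(fit, newdata = data.frame(Education = c(0, 1)))

# The prediction at Education = 0 is the intercept
intercept_check <- unname(preds[1])

# The change in prediction for a one-point increase is the slope
slope_check <- unname(preds[2] - preds[1])

print(c(intercept = intercept_check, slope = slope_check))
```

The two printed numbers should agree with the values returned by `coef()` for the same model.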
cov(swiss$Education,swiss$Infant.Mortality)
## [1] -2.781684
cor(swiss$Education,swiss$Infant.Mortality)
## [1] -0.09932185
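A correlation of about -0.10 is weak. In a regression with a single predictor, the squared correlation equals the model's R-squared, so Education accounts for only about 1% of the variation in Infant.Mortality. The sketch below illustrates that identity; `fit`, `r_squared_from_cor`, and `r_squared_from_lm` are names introduced here for the example.

```r
# Refit the simple regression (fit is a name used only in this sketch)
fit <- lm(Infant.Mortality ~ Education, data = swiss)

# Squared sample correlation between x and y
r <- cor(swiss$Education, swiss$Infant.Mortality)
r_squared_from_cor <- r^2

# R-squared as reported by the fitted model
r_squared_from_lm <- summary(fit)$r.squared

print(c(from_cor = r_squared_from_cor, from_lm = r_squared_from_lm))
```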
# Calculate the slope using the covariance and the variance of x
slope <- cov(swiss$Education,swiss$Infant.Mortality)/var(swiss$Education)
print(slope)
## [1] -0.03008655
# Calculate the slope using the correlation coefficient and the sample
# standard deviations of y and x (Infant.Mortality and Education)
slope_2 <- cor(swiss$Education,swiss$Infant.Mortality) * (sd(swiss$Infant.Mortality)/sd(swiss$Education))
print(slope_2)
## [1] -0.03008655
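The intercept can be recovered by hand in the same way, because the OLS line passes through the point of sample means: \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\). The sketch below recomputes the slope so that it runs on its own; `slope_by_hand`, `intercept_by_hand`, and `fit` are names introduced for this example.

```r
# Slope from covariance and variance, as above
slope_by_hand <- cov(swiss$Education, swiss$Infant.Mortality) /
  var(swiss$Education)

# Intercept from the sample means: b0 = mean(y) - b1 * mean(x)
intercept_by_hand <- mean(swiss$Infant.Mortality) -
  slope_by_hand * mean(swiss$Education)

# Compare with the coefficients estimated by lm()
fit <- lm(Infant.Mortality ~ Education, data = swiss)
print(c(by_hand = intercept_by_hand, from_lm = coef(fit)[["(Intercept)"]]))
```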
\[ \hat{Infant.Mortality}_i = 20.27 - 0.03Education_i \]
The Gauss-Markov theorem states that the Ordinary Least Squares (OLS) estimator is BLUE if the classical assumptions hold. BLUE stands for Best Linear Unbiased Estimator. Here, best means that OLS has the smallest variance among all linear unbiased estimators.
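The "unbiased" part can be illustrated with a small simulation. This is only a sketch on made-up data: the true coefficients, sample size, noise level, and seed below are arbitrary choices for the example and have nothing to do with the swiss data. It shows that the slope estimates average out to the true slope across repeated samples; it does not demonstrate the "best" (minimum variance) part.

```r
set.seed(1)                          # arbitrary seed so the run is repeatable
true_beta0 <- 2                      # hypothetical true intercept
true_beta1 <- 0.5                    # hypothetical true slope
x <- runif(50, min = 0, max = 10)    # regressor values, held fixed across samples

# Draw fresh errors 2000 times and re-estimate the slope each time
slope_estimates <- replicate(2000, {
  y <- true_beta0 + true_beta1 * x + rnorm(50, mean = 0, sd = 2)
  coef(lm(y ~ x))[["x"]]
})

# The average estimate should sit close to the true slope of 0.5
print(mean(slope_estimates))
```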
The Gauss-Markov assumptions/conditions for OLS to be BLUE are as follows:
Linearity: The regression model must be linear in its parameters. For example, \(\beta_1\) must enter the model as \(\beta_1\), not as \(\beta_1^2\). A model that becomes linear in the betas after a transformation, such as taking logarithms, is also acceptable.
Exogeneity: All explanatory variables should be uncorrelated with the error term, that is, \(cov(x_i,\epsilon_i) = 0\). Otherwise, the OLS estimate of the slope coefficient will be biased. Stated more strongly, the error term has an expected value of zero conditional on the independent variables: \(E (\epsilon_i|x_i) = 0\)
Homoskedasticity: The errors should have an equal spread (variance) at every value of x: \(var(\epsilon_i | x_i) = \sigma^2\). In other words, the variance of the error is constant. If this does not hold, we have heteroskedasticity, or unequal variances at different values of x.
Non-collinearity: No independent variable can be perfectly correlated with another independent variable in the model (more generally, none can be an exact linear combination of the others).
Random: The data must be collected at random or have been randomly sampled from the population.
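Some of these conditions can be looked at informally using the residuals of the Education model. The sketch below is one possible set of checks, not a formal test procedure. The names `fit`, `res`, `res_sq`, and `aux` are introduced for this example, and the auxiliary regression is only loosely in the spirit of a Breusch-Pagan test.

```r
# Refit the simple regression (fit is a name used only in this sketch)
fit <- lm(Infant.Mortality ~ Education, data = swiss)
res <- resid(fit)

# By construction, OLS residuals average to zero and are uncorrelated with
# the regressor in the sample. These values are ~0 for any OLS fit with an
# intercept, so they do NOT test exogeneity of the true errors.
mean_res <- mean(res)
cov_res_x <- cov(res, swiss$Education)
print(c(mean_res = mean_res, cov_res_x = cov_res_x))

# Informal homoskedasticity check: residuals against fitted values.
# A fan or funnel shape would hint at heteroskedasticity.
plot(fitted(fit), res,
     xlab = "Fitted infant mortality", ylab = "Residual")
abline(h = 0, lty = 2)

# Auxiliary regression of squared residuals on Education: a slope far
# from zero would suggest the error variance changes with x
res_sq <- res^2
aux <- lm(res_sq ~ Education, data = swiss)
print(summary(aux)$coefficients)
```

Exogeneity and random sampling concern how the data were generated, so they have to be argued from the study design and cannot be confirmed from the residuals alone.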
Source: Old econometrics notes from undergrad, https://www.statisticshowto.com/gauss-markov-theorem-assumptions/, and https://edooko.com/blog/assumptions-under-which-the-ols-estimators-are-blue/