1

Dataset: Swiss

1a.

State the dependent and independent variables.

#Using the Swiss dataset in R
swiss
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
## Broye             83.8        70.2          16         7    92.85
## Glane             92.4        67.8          14         8    97.16
## Gruyere           82.4        53.3          12         7    97.67
## Sarine            82.9        45.2          16        13    91.38
## Veveyse           87.1        64.5          14         6    98.61
## Aigle             64.1        62.0          21        12     8.52
## Aubonne           66.9        67.5          14         7     2.27
## Avenches          68.9        60.7          19        12     4.43
## Cossonay          61.7        69.3          22         5     2.82
## Echallens         68.3        72.6          18         2    24.20
## Grandson          71.7        34.0          17         8     3.30
## Lausanne          55.7        19.4          26        28    12.11
## La Vallee         54.3        15.2          31        20     2.15
## Lavaux            65.1        73.0          19         9     2.84
## Morges            65.5        59.8          22        10     5.23
## Moudon            65.0        55.1          14         3     4.52
## Nyone             56.6        50.9          22        12    15.14
## Orbe              57.4        54.1          20         6     4.20
## Oron              72.5        71.2          12         1     2.40
## Payerne           74.2        58.1          14         8     5.23
## Paysd'enhaut      72.0        63.5           6         3     2.56
## Rolle             60.5        60.8          16        10     7.72
## Vevey             58.3        26.8          25        19    18.46
## Yverdon           65.4        49.5          15         8     6.10
## Conthey           75.5        85.9           3         2    99.71
## Entremont         69.3        84.9           7         6    99.68
## Herens            77.3        89.7           5         2   100.00
## Martigwy          70.5        78.2          12         6    98.96
## Monthey           79.4        64.9           7         3    98.22
## St Maurice        65.0        75.9           9         9    99.06
## Sierre            92.2        84.6           3         3    99.46
## Sion              79.3        63.1          13        13    96.83
## Boudry            70.4        38.4          26        12     5.62
## La Chauxdfnd      65.7         7.7          29        11    13.79
## Le Locle          72.7        16.7          22        13    11.22
## Neuchatel         64.4        17.6          35        32    16.92
## Val de Ruz        77.6        37.6          15         7     4.97
## ValdeTravers      67.6        18.7          25         7     8.65
## V. De Geneve      35.0         1.2          37        53    42.34
## Rive Droite       44.7        46.6          16        29    50.43
## Rive Gauche       42.8        27.7          22        29    58.33
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
## Broye                    23.6
## Glane                    24.9
## Gruyere                  21.0
## Sarine                   24.4
## Veveyse                  24.5
## Aigle                    16.5
## Aubonne                  19.1
## Avenches                 22.7
## Cossonay                 18.7
## Echallens                21.2
## Grandson                 20.0
## Lausanne                 20.2
## La Vallee                10.8
## Lavaux                   20.0
## Morges                   18.0
## Moudon                   22.4
## Nyone                    16.7
## Orbe                     15.3
## Oron                     21.0
## Payerne                  23.8
## Paysd'enhaut             18.0
## Rolle                    16.3
## Vevey                    20.9
## Yverdon                  22.5
## Conthey                  15.1
## Entremont                19.8
## Herens                   18.3
## Martigwy                 19.4
## Monthey                  20.2
## St Maurice               17.8
## Sierre                   16.3
## Sion                     18.1
## Boudry                   20.3
## La Chauxdfnd             20.5
## Le Locle                 18.9
## Neuchatel                23.0
## Val de Ruz               20.0
## ValdeTravers             19.5
## V. De Geneve             18.0
## Rive Droite              18.2
## Rive Gauche              19.3
library(psych)
describe(swiss)
##                  vars  n  mean    sd median trimmed   mad   min   max range
## Fertility           1 47 70.14 12.49  70.40   70.66 10.23 35.00  92.5 57.50
## Agriculture         2 47 50.66 22.71  54.10   51.16 23.87  1.20  89.7 88.50
## Examination         3 47 16.49  7.98  16.00   16.08  7.41  3.00  37.0 34.00
## Education           4 47 10.98  9.62   8.00    9.38  5.93  1.00  53.0 52.00
## Catholic            5 47 41.14 41.70  15.14   39.12 18.65  2.15 100.0 97.85
## Infant.Mortality    6 47 19.94  2.91  20.00   19.98  2.82 10.80  26.6 15.80
##                   skew kurtosis   se
## Fertility        -0.46     0.26 1.82
## Agriculture      -0.32    -0.89 3.31
## Examination       0.45    -0.14 1.16
## Education         2.27     6.14 1.40
## Catholic          0.48    -1.67 6.08
## Infant.Mortality -0.33     0.78 0.42

Answer 1a.

The independent variable is Education, and the dependent variable is Infant.Mortality. Population model: \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \] Estimating equation: \[ \hat{Infant.Mortality}_i = \hat{\beta_0} + \hat{\beta_1} Education_i \] where y = infant mortality (the percentage of live births who live less than 1 year) and x = the percentage of draftees with education beyond primary school. The residual \(\hat{\epsilon_i}\) is the difference between the observed and fitted values of y, so it does not appear in the equation for the fitted value.

1b. Estimate the linear regression in R using the lm() command.

plot(Infant.Mortality ~ Education, data = swiss)
swiss.lm <- lm(Infant.Mortality ~ Education, data=swiss)
coef(swiss.lm)
## (Intercept)   Education 
## 20.27286508 -0.03008655
#Add line of best fit, taking the intercept and slope directly from the fitted model
abline(swiss.lm)

Answer 1c.

The slope can be interpreted as follows: for each 1 percentage point increase in the share of draftees educated beyond primary school, predicted infant mortality decreases by about 0.03 percentage points. The intercept can be interpreted as follows: when education is 0% (i.e., no draftees have education beyond primary school), predicted infant mortality is ~20.27.
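Both interpretations can be checked with predict(). This is a minimal sketch: the data frame new_provinces and its three Education values are hypothetical and chosen only for illustration.

```r
# Refit the model from 1b so this block runs on its own
swiss.lm <- lm(Infant.Mortality ~ Education, data = swiss)

# Hypothetical provinces with 0%, 10% and 11% of draftees educated beyond primary school
new_provinces <- data.frame(Education = c(0, 10, 11))
pred <- predict(swiss.lm, newdata = new_provinces)
print(pred)

# The prediction at Education = 0 is the intercept
at_zero <- unname(pred[1])
print(at_zero)

# The change in the prediction from 10% to 11% is the slope
one_point_change <- unname(pred[3] - pred[2])
print(one_point_change)
```

The first printed prediction should match the intercept from coef(swiss.lm), and the one-point change should match the Education coefficient.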

Question 1D. Replicate the slope and intercept parameters using the covariance/variance formulas.

Answer 1D:

cov(swiss$Education,swiss$Infant.Mortality)
## [1] -2.781684
cor(swiss$Education,swiss$Infant.Mortality)
## [1] -0.09932185
#Calculate slope using covariance and variance
slope <- cov(swiss$Education,swiss$Infant.Mortality)/var(swiss$Education)
print(slope)
## [1] -0.03008655
#Calculate slope using correlation coefficient and the sample standard deviations of the Y and x data (Infant mortality and education)
slope_2 <- cor(swiss$Education,swiss$Infant.Mortality) * (sd(swiss$Infant.Mortality)/sd(swiss$Education))
print(slope_2)
## [1] -0.03008655
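The question also asks for the intercept. One way to replicate it, sketched here, is the standard formula intercept = mean(y) - slope * mean(x). The names x, y and intercept are introduced only for this example.

```r
# Shorthand for the regressor and the outcome (names chosen for this sketch)
x <- swiss$Education
y <- swiss$Infant.Mortality

# Slope from covariance and variance, as above
slope <- cov(x, y) / var(x)

# Intercept: the fitted line passes through the point of sample means
intercept <- mean(y) - slope * mean(x)
print(intercept)

# Compare with the coefficients reported by lm()
lm_coefs <- unname(coef(lm(Infant.Mortality ~ Education, data = swiss)))
print(all.equal(lm_coefs, c(intercept, slope)))
```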

Final form of Sample Regression Function (estimated):

\[ \hat{Infant.Mortality}_i = 20.27 - 0.03Education_i \]

Question 2: Gauss-Markov Assumptions/Theorem

The Gauss-Markov theorem states that the Ordinary Least Squares (OLS) estimator is BLUE if the classical assumptions hold. BLUE stands for Best Linear Unbiased Estimator. "Best" means that OLS has the smallest variance among all linear unbiased estimators.
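
For the slope of a simple regression such as the one in Question 1, "best" can be written as an inequality. This is a sketch rather than part of the original notes: \(\tilde{\beta_1}\) is an assumed symbol for any other linear unbiased estimator of \(\beta_1\), and the variance formula on the right relies on the errors being homoskedastic and uncorrelated with each other.

\[ var(\hat{\beta_1} | x) \le var(\tilde{\beta_1} | x), \qquad var(\hat{\beta_1} | x) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]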

The Gauss-Markov assumptions/conditions for OLS to be BLUE are as follows:

  1. Linearity: The regression model must be linear in its parameters. For example, \(\beta_1\) must enter the model as \(\beta_1\) itself, not as \(\beta_1^2\). Models that can be made linear in the parameters through a transformation, such as taking logarithms, are also acceptable.

  2. Exogeneity: The error term has an expected value of zero, conditional on the independent variables: \(E(\epsilon_i|x_i) = 0\). This implies that every explanatory variable is uncorrelated with the error term: \(cov(x_i,\epsilon_i) = 0\). Otherwise, the OLS estimate of the slope coefficient will be biased.

  3. Homoskedasticity: The errors should have the same spread (variance) at every value of x: \(var(\epsilon_i | x_i) = \sigma^2\). In other words, the variance of the error is constant. If it is not, we have heteroskedasticity, i.e. unequal error variances at different values of x.

  4. Non-collinearity: No independent variable can be perfectly correlated with, or be an exact linear combination of, the other independent variables in the model.

  5. Random sampling: The data must have been randomly sampled from the population.
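
Some of these conditions can be explored informally on the model from Question 1. The sketch below is illustrative only: the variable Education2 and the data frame swiss2 are made up for the collinearity demonstration, and the residual checks do not prove that the assumptions hold.

```r
# Refit the model from Question 1 so this block runs on its own
swiss.lm <- lm(Infant.Mortality ~ Education, data = swiss)
res <- resid(swiss.lm)

# OLS residuals average to zero and are uncorrelated with the regressor by construction.
# This is mechanical and does NOT test exogeneity of the unobserved error term.
print(mean(res))
print(cov(res, swiss$Education))

# Homoskedasticity: look for a roughly constant vertical spread around zero
plot(fitted(swiss.lm), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Perfect collinearity: Education2 (hypothetical) is an exact multiple of Education,
# so lm() cannot estimate both coefficients and reports NA for the second one
swiss2 <- transform(swiss, Education2 = 2 * Education)
collinear_fit <- lm(Infant.Mortality ~ Education + Education2, data = swiss2)
print(coef(collinear_fit))
```

A fan or funnel shape in the residual plot would suggest heteroskedasticity. The NA coefficient shows why condition 4 is needed for the estimates to be defined at all.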

Source: Old econometrics notes from undergrad, https://www.statisticshowto.com/gauss-markov-theorem-assumptions/, and https://edooko.com/blog/assumptions-under-which-the-ols-estimators-are-blue/