rm(list = ls())

I.

library('stargazer')
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

1.

data <- swiss
summary(data)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

The multivariate regression model is given by:

\[ Fertility_i = \beta_0 + \beta_1 \times Agriculture_i + \beta_2 \times Examination_i + \beta_3 \times Education_i + \beta_4 \times Catholic_i + \beta_5 \times Infant.Mortality_i + \varepsilon_i \]

# Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality

model <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = data)

# Print summary statistics of the model
summary(model)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education + 
##     Catholic + Infant.Mortality, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
par(mfrow = c(2, 2))
plot(model)

# Display regression results
stargazer(model, title = "Multivariate Regression Results", align = TRUE, out = "regression_output.html")
## 
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: Tue, Dec 12, 2023 - 01:46:30
## % Requires LaTeX packages: dcolumn 
## \begin{table}[!htbp] \centering 
##   \caption{Multivariate Regression Results} 
##   \label{} 
## \begin{tabular}{@{\extracolsep{5pt}}lD{.}{.}{-3} } 
## \\[-1.8ex]\hline 
## \hline \\[-1.8ex] 
##  & \multicolumn{1}{c}{\textit{Dependent variable:}} \\ 
## \cline{2-2} 
## \\[-1.8ex] & \multicolumn{1}{c}{Fertility} \\ 
## \hline \\[-1.8ex] 
##  Agriculture & -0.172^{**} \\ 
##   & (0.070) \\ 
##   & \\ 
##  Examination & -0.258 \\ 
##   & (0.254) \\ 
##   & \\ 
##  Education & -0.871^{***} \\ 
##   & (0.183) \\ 
##   & \\ 
##  Catholic & 0.104^{***} \\ 
##   & (0.035) \\ 
##   & \\ 
##  Infant.Mortality & 1.077^{***} \\ 
##   & (0.382) \\ 
##   & \\ 
##  Constant & 66.915^{***} \\ 
##   & (10.706) \\ 
##   & \\ 
## \hline \\[-1.8ex] 
## Observations & \multicolumn{1}{c}{47} \\ 
## R$^{2}$ & \multicolumn{1}{c}{0.707} \\ 
## Adjusted R$^{2}$ & \multicolumn{1}{c}{0.671} \\ 
## Residual Std. Error & \multicolumn{1}{c}{7.165 (df = 41)} \\ 
## F Statistic & \multicolumn{1}{c}{19.761$^{***}$ (df = 5; 41)} \\ 
## \hline 
## \hline \\[-1.8ex] 
## \textit{Note:}  & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\ 
## \end{tabular} 
## \end{table}

2.

Interpretation:

The intercept (66.915) gives the predicted fertility rate when all predictors are zero, i.e., a baseline level. A higher share of males employed in agriculture is associated with lower fertility (-0.172 per percentage point), and this effect is statistically significant at the 5% level. Examination shows a negative association with fertility, but it is not statistically significant. More education beyond primary school is associated with lower fertility (-0.871), and this association is highly significant. A larger percentage of Catholics is associated with higher fertility (0.104), which is statistically significant. Finally, higher infant mortality is associated with higher fertility (1.077), and this association is also statistically significant.
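
To read these magnitudes directly off the fitted object, the point estimates can be reported together with 95% confidence intervals; a small sketch using base R functions on the model fitted above:

# Coefficient estimates with 95% confidence intervals (rounded for readability)
round(cbind(Estimate = coef(model), confint(model)), 3)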

3.

The residuals-versus-fitted plot shows no clear systematic pattern, which supports the linearity assumption of the regression model. The normal Q-Q plot suggests that the residuals are roughly normally distributed, although the points in the tails hint at potential outliers or extreme values. The scale-location plot, however, shows some heteroscedasticity: the residual variability is not constant across the range of fitted values. Finally, all observations fall within Cook’s distance in the residuals-versus-leverage plot, implying that no single observation exerts undue influence on the fitted model.
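
These visual impressions can be checked with formal tests; the following is a minimal sketch, assuming the lmtest package is installed, using the Breusch-Pagan test for heteroscedasticity and the Shapiro-Wilk test for normality of the residuals:

library(lmtest)

# Breusch-Pagan test: the null hypothesis is constant error variance (homoscedasticity)
bptest(model)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
shapiro.test(residuals(model))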

4.

Gauss-Markov Assumptions:

  1. Linearity: The OLS method assumes that the model is linear in its parameters; that is, the dependent variable can be written as a linear combination of the regressors plus an error term.

  2. Random Sampling: To ensure generalizability, the data must be a random sample from the larger population to make unbiased inferences about the population based on the observed sample.

  3. Non-Collinearity: The variables I am using in the regression should not be perfectly correlated with each other. This avoids redundancy in the information provided by the predictors.

  4. Exogeneity: The independent variables, or regressors, must be uncorrelated with the error term (zero conditional mean). This assumption ensures that the estimated coefficients are not biased by omitted variables that are correlated with the regressors.

  5. Homoscedasticity: The variance of the errors remains constant across all levels of the independent variables. This means that the spread of the residuals is consistent, regardless of the values of the predictors.

cor_matrix <- cor(data[, c("Agriculture", "Examination", "Education", "Catholic", "Infant.Mortality")])

# Print the correlation matrix
print(cor_matrix)
##                  Agriculture Examination   Education   Catholic
## Agriculture       1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000
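
As a complementary check on non-collinearity, variance inflation factors can be computed for the fitted model; this is a sketch assuming the car package is available:

library(car)

# Variance inflation factors; values well below the conventional threshold of 10
# suggest that no regressor is close to a linear combination of the others
vif(model)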

Based on the correlation matrix and the diagnostic plots, all of the assumptions except homoscedasticity appear to be approximately satisfied: no pair of regressors is perfectly correlated, while the residual plots point to some non-constant error variance.

5.

“OLS is BLUE” is shorthand for “Ordinary Least Squares is the Best Linear Unbiased Estimator.” Under the Gauss-Markov assumptions, each part of the acronym describes a property of the OLS estimator used in linear regression:

  1. Best: OLS delivers the most efficient estimates among the class of linear unbiased estimators. “Efficiency” here means that OLS has the smallest variance of any estimator in that class, making its parameter estimates the most precise.

  2. Linear: The OLS estimator is a linear function of the observed values of the dependent variable, and the underlying model is linear in its parameters.

  3. Unbiased: OLS produces unbiased estimates of the true population parameters. Unbiasedness means that, in expectation, the estimates equal the true values.

  4. Estimator: OLS is a method for estimating the parameters of a linear regression model.

Because of the BLUE property, OLS is a suitable approach for parameter estimation in linear regression: it produces unbiased estimates with the smallest possible variance among linear estimators, provided the Gauss-Markov assumptions are met.
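
Unbiasedness can also be illustrated with a small Monte Carlo simulation: data are generated from a known linear model, OLS is fitted repeatedly, and the average of the estimated slopes is compared with the true slope. The parameter values below are made up purely for illustration:

set.seed(123)

true_beta <- 2        # known slope used to generate the data
n_sims    <- 1000     # number of simulated samples
estimates <- numeric(n_sims)

for (i in seq_len(n_sims)) {
  x <- rnorm(100)                               # regressor
  y <- 1 + true_beta * x + rnorm(100, sd = 2)   # linear model with noise
  estimates[i] <- coef(lm(y ~ x))["x"]          # store the OLS slope estimate
}

mean(estimates)  # should be very close to the true slope of 2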

II.

Taking the log of a variable in your linear regression can be beneficial for several reasons:

1. Improved normality: Real-world data often exhibit skewed or non-normal distributions. A log transformation can help “normalize” the distribution, making it more symmetric and bell-shaped so that the model’s error term is closer to the normality assumed for inference. This improves the reliability of the model.

2. Constant variance: Linear regression assumes homoscedasticity, meaning the variance of the error term should be constant across all values of the independent variables. When the data show heteroscedasticity (unequal variance), as the diagnostic plots here suggest, taking the log of the dependent variable can help stabilize the variance and improve the model’s fit.

3. Interpretation: Log-transforming the dependent variable changes the interpretation of the model coefficients. Instead of representing the absolute change in the dependent variable, each coefficient now represents an approximate percentage change (roughly 100 times the coefficient) in response to a one-unit change in the independent variable. This can be more intuitive and meaningful in certain contexts; a brief sketch of such a log-level model follows below.
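
As an illustration with the same swiss data, the dependent variable can be log-transformed; in this log-level specification each coefficient is interpreted as an approximate proportional change in Fertility per one-unit change in the regressor. This sketch only shows the mechanics and is not a claim that the transformation is required here:

# Log-level model: log(Fertility) regressed on the same set of predictors
log_model <- lm(log(Fertility) ~ Agriculture + Examination + Education +
                  Catholic + Infant.Mortality, data = data)

# A coefficient b corresponds to roughly a 100 * b percent change in Fertility
# per one-unit increase in the corresponding regressor (for small b)
summary(log_model)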

Here are some resources for further reading:

Interpreting Log Transformations in a Linear Model: http://library.virginia.edu/data/articles/interpreting-log-transformations-in-a-linear-model

Logarithmic Transformation in Linear Regression Models: Why & When: https://www.researchgate.net/post/When_is_better_to_use_log_transformation_to_obtain_a_linear_regression_model