library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
data(swiss)
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(swiss)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60   
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
plot(swiss)
# 2 Talk about what you find in a few lines, i.e., interpret a few slopes.
# Is the sign in the expected direction, and is the magnitude meaningful?
# What about the statistical significance?
library(psych)
data <- lm(swiss$Fertility~swiss$Education + swiss$Examination)
summary(data)
##
## Call:
## lm(formula = swiss$Fertility ~ swiss$Education + swiss$Examination)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9935  -6.8894  -0.3621   7.1640  19.2634 
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        85.2533     3.0855  27.630   <2e-16 ***
## swiss$Education    -0.5395     0.1924  -2.803   0.0075 ** 
## swiss$Examination  -0.5572     0.2319  -2.402   0.0206 *  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.982 on 44 degrees of freedom
## Multiple R-squared: 0.5055, Adjusted R-squared: 0.483
## F-statistic: 22.49 on 2 and 44 DF, p-value: 1.87e-07
# Diagnostic plots for the fitted model, arranged in a 2 x 2 grid
par(mfrow = c(2,2))
plot(data)
#The summary above reports the residual distribution and the coefficient estimates for a linear regression of Fertility on Education and Examination. The residuals have a median close to zero (-0.3621) and quartiles of similar size (-6.8894 and 7.1640), which is consistent with a roughly symmetric distribution; the range, from -15.9935 to 19.2634, does not by itself establish normality. The intercept estimate is 85.2533, the expected fertility measure when both independent variables, Education and Examination, are zero. The coefficient for Education is -0.5395 and is statistically significant at the 1% level (p = 0.0075). The coefficient for Examination is -0.5572 and is statistically significant at the 5% level (p = 0.0206).
#Regarding the model fit, the multiple R-squared value is 0.5055, meaning that approximately 50.55% of the variation in the fertility rate is explained by the independent variables Examination and Education. The adjusted R-squared value of 0.483 penalizes this figure for the number of predictors in the model. The F-statistic of 22.49 on 2 and 44 degrees of freedom, with a p-value of 1.87e-07, indicates that the two predictors are jointly statistically significant.
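#The fit statistics quoted above can be reproduced by hand from the residual and total sums of squares. The sketch below refits the same regression under the illustrative name `fit` (the code above stores it as `data`); the variable names `rss`, `tss`, `r2`, and `adj_r2` are assumptions made for this example.

```r
# Refit the same model; `fit` is an illustrative name, not the one used above.
fit <- lm(Fertility ~ Education + Examination, data = swiss)

rss <- sum(residuals(fit)^2)                             # residual sum of squares
tss <- sum((swiss$Fertility - mean(swiss$Fertility))^2)  # total sum of squares
n   <- nrow(swiss)                                       # number of observations
k   <- length(coef(fit)) - 1                             # number of predictors

r2     <- 1 - rss / tss                                  # share of variance explained
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)           # penalized for k predictors

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

#Both values should agree with `summary(fit)$r.squared` and `summary(fit)$adj.r.squared`.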
#Both slopes have the expected negative sign and both are statistically significant. Holding Examination constant, a one-unit increase in Education is associated with a fertility measure about 0.54 points lower; moving from the first to the third quartile of Education (6 to 12) corresponds to a predicted fertility about 3.2 points lower, against a mean of 70.14, so the magnitude is modest but meaningful. Holding Education constant, a one-unit increase in Examination is associated with a fertility measure about 0.56 points lower; this estimate is less precise (significant at the 5% level rather than the 1% level), so the evidence for it is somewhat weaker.
#In conclusion, the model indicates that both Education and Examination are negatively associated with the fertility rate, with the statistical evidence being strongest for Education. These are associations in observational data, not estimates of causal effects.
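#One way to read sign, magnitude, and significance side by side is to put the estimates, their 95% confidence intervals, and the p-values in one table. This is a minimal sketch using base R only; `fit`, `coef_table`, `ci`, and `out` are illustrative names introduced here.

```r
# Refit the same model under an illustrative name.
fit <- lm(Fertility ~ Education + Examination, data = swiss)

coef_table <- coef(summary(fit))      # Estimate, Std. Error, t value, Pr(>|t|)
ci         <- confint(fit, level = 0.95)

out <- data.frame(
  estimate         = coef_table[, "Estimate"],
  lower_95         = ci[, 1],
  upper_95         = ci[, 2],
  p_value          = coef_table[, "Pr(>|t|)"],
  significant_5pct = coef_table[, "Pr(>|t|)"] < 0.05
)
out
```

#An interval that excludes zero corresponds to a coefficient that is significant at the 5% level.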
par(mfrow = c(2,2))
plot(data)
#1. Residuals vs Fitted: The residuals are scattered both above and below zero, but the smoothed trend line slopes downward instead of staying flat near zero, and the spread of the residuals does not look constant across the fitted fertility values. This pattern suggests some heteroscedasticity and possibly mild nonlinearity.
#2. Q-Q: The quantile-quantile plot compares the standardized residuals with the quantiles of a normal distribution; points lying on the reference line indicate normality. Here the points depart from the line at both the low and the high end, which suggests that the residuals are not exactly normally distributed.
#3. Scale-Location: This plot shows the square root of the absolute standardized residuals against the fitted values. The spread appears more even here than in the first plot, which is closer to homoskedasticity, so the visual evidence on constant variance is mixed.
#4. Residuals vs Leverage: This plot identifies influential observations. Points that fall outside the dashed Cook’s distance contours have a large influence on the fitted coefficients, and the plot labels the observations that deserve a closer look.
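#Visual diagnostics can be complemented with numeric checks. The sketch below, in base R only, runs a Shapiro-Wilk test on the residuals, computes a studentized (Koenker) Breusch-Pagan statistic by hand from an auxiliary regression, and flags observations by Cook's distance. The names `fit`, `aux`, `bp_stat`, and `influential` are illustrative, and the 4/n cutoff is a common rule of thumb rather than a fixed standard.

```r
# Refit the same model under an illustrative name.
fit <- lm(Fertility ~ Education + Examination, data = swiss)
res <- residuals(fit)
n   <- length(res)

# Normality of the residuals (numeric companion to the Q-Q plot).
sw <- shapiro.test(res)

# Heteroscedasticity: regress squared residuals on the predictors;
# n * R^2 of this auxiliary regression is chi-squared under constant variance.
aux     <- lm(I(res^2) ~ Education + Examination, data = swiss)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Influence: Cook's distance per observation, flagged with the 4/n rule of thumb.
cd          <- cooks.distance(fit)
influential <- names(cd)[cd > 4 / n]

list(shapiro_p = sw$p.value, bp_p = bp_p, influential = influential)
```

#Small p-values would point against normality or constant variance, respectively; with only 47 observations, both tests have limited power, so they are best read together with the plots.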
#The Gauss-Markov assumptions are a set of sufficient conditions under which the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE) of the coefficients in a linear regression model. They are linearity, uncorrelated errors, homoscedasticity, no perfect multicollinearity, and exogeneity. The linearity assumption means that the dependent variable is a linear function of the parameters plus an error term. The independence assumption states that the error terms are uncorrelated with each other. The homoscedasticity assumption states that the error terms have constant variance at all levels of the independent variables. The no perfect multicollinearity assumption states that no independent variable is an exact linear combination of the others. Finally, the exogeneity assumption states that the error term has mean zero conditional on the independent variables, so it is uncorrelated with them. All of these assumptions concern the unobserved errors; the residuals are only estimates of them.
#It is essential to check whether these assumptions plausibly hold for the given dataset and regression model. Diagnostic tests and residual plots are the usual tools for this. They cannot prove that the assumptions hold, but they can reveal clear violations and so indicate how far the OLS estimates and their standard errors can be trusted.
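#As one concrete check, multicollinearity between the two predictors can be quantified with their correlation and a variance inflation factor (VIF). The sketch below computes the VIF by hand as 1 / (1 - R^2) from an auxiliary regression, using base R only; `aux`, `r`, and `vif` are illustrative names. The last lines also show why exogeneity cannot be verified from residuals: OLS residuals are uncorrelated with the regressors by construction.

```r
# Correlation between the two predictors used in the model.
r <- cor(swiss$Education, swiss$Examination)

# VIF for Education: regress it on the other predictor and use 1 / (1 - R^2).
aux <- lm(Education ~ Examination, data = swiss)
vif <- 1 / (1 - summary(aux)$r.squared)

c(correlation = r, vif = vif)

# OLS residuals are orthogonal to the regressors by construction,
# so this correlation is numerically zero whether or not exogeneity holds.
fit <- lm(Fertility ~ Education + Examination, data = swiss)
resid_cor <- cor(residuals(fit), swiss$Education)
resid_cor
```

#A VIF of 1 means no collinearity; values far above 1 indicate inflated standard errors. Perfect multicollinearity would make the VIF infinite and the coefficients unidentified.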
#The phrase “OLS is BLUE” means that the Ordinary Least Squares (OLS) estimator of the coefficients in a linear regression model is the Best Linear Unbiased Estimator: among all estimators that are linear in the dependent variable and unbiased, OLS has the smallest variance. This optimality holds only under the Gauss-Markov assumptions of linearity, uncorrelated errors, homoscedasticity, no perfect multicollinearity, and exogeneity of the regressors. When these assumptions are met, OLS gives the most precise linear unbiased estimates possible.
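#A small simulation can illustrate the claim. The sketch below generates data that satisfy the Gauss-Markov assumptions and compares the OLS slope with another linear unbiased estimator, the slope through the first and last observations. All parameter values (`beta0`, `beta1`, the error standard deviation, the seed, and the number of replications) are hypothetical choices made for this example.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
x     <- seq(1, 10, length.out = 20)   # fixed regressor values
beta0 <- 1                         # hypothetical true intercept
beta1 <- 2                         # hypothetical true slope
reps  <- 2000

ols      <- numeric(reps)
endpoint <- numeric(reps)
for (i in seq_len(reps)) {
  # Homoskedastic, uncorrelated, mean-zero errors.
  y <- beta0 + beta1 * x + rnorm(length(x), sd = 3)
  ols[i]      <- cov(x, y) / var(x)                              # OLS slope
  endpoint[i] <- (y[length(x)] - y[1]) / (x[length(x)] - x[1])   # two-point slope
}

c(mean_ols = mean(ols), mean_endpoint = mean(endpoint),
  var_ols  = var(ols),  var_endpoint  = var(endpoint))
```

#Both estimators should average close to the true slope, which reflects unbiasedness, while the OLS slope should show the smaller variance, which is what "best" means here.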
#Log transformation is a helpful tool for variables with a right-skewed distribution, which is common in economic and biological data sets. In such cases, most observations cluster at the lower end of the scale while a few extreme values lie far out at the upper end. This skewness often goes together with heteroscedasticity, where the variance of the error term differs across the values of an independent variable. Taking logarithms, which is possible only for strictly positive values, compresses the upper tail and yields a more symmetric distribution. This often brings the model closer to the Gauss-Markov assumption of homoscedasticity, and thus to more efficient estimates and more reliable standard errors.
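#The effect on skewness can be seen on simulated data. The sketch below draws a right-skewed, strictly positive variable from a log-normal distribution and compares the sample skewness before and after taking logs. The variable name `income`, the distribution parameters, and the helper `skewness()` are assumptions made for this example; base R has no built-in skewness function.

```r
set.seed(42)                                       # arbitrary seed
income <- rlnorm(1000, meanlog = 10, sdlog = 1)    # simulated right-skewed variable

# Moment-based sample skewness (helper defined here for illustration).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(income)
skew_log <- skewness(log(income))

c(raw = skew_raw, logged = skew_log)
```

#The raw variable should show strong positive skewness, while the logged variable should be close to symmetric.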
#Additionally, log transformation can make the regression coefficients easier to interpret in economic terms. For example, in a model where both the dependent variable Y and an independent variable X are log-transformed, the regression coefficient β of X can be interpreted as the elasticity of Y with respect to X. This means that a 1% increase in X is associated with approximately a β% change in Y, which is a clear and direct economic interpretation. This makes it easier for decision-makers to understand and act upon the findings of the regression analysis.
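#The elasticity interpretation can be checked on simulated data with a known elasticity. In the sketch below, Y is generated as a power function of X with multiplicative noise, and a log-log regression recovers the exponent. The true elasticity of 0.8, the scale factor, and the noise level are hypothetical values chosen for this example.

```r
set.seed(123)                          # arbitrary seed
n          <- 500
elasticity <- 0.8                      # hypothetical true elasticity
x <- runif(n, min = 1, max = 100)      # strictly positive regressor
y <- 5 * x^elasticity * exp(rnorm(n, sd = 0.2))   # multiplicative error

# Log-log model: the slope on log(x) estimates the elasticity.
loglog  <- lm(log(y) ~ log(x))
est_elasticity <- coef(loglog)[["log(x)"]]
est_elasticity
```

#The estimated slope should be close to 0.8, meaning that a 1% increase in X goes with roughly a 0.8% increase in Y in this simulated setting.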