# Load required packages
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
data(swiss)
# Display the structure of the dataset
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
# Show the first few rows of the dataset
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
# Run multivariate regression
model <- lm(Fertility ~ Agriculture + Education, data = swiss)
# Present regression results with stargazer
stargazer(model, type = "text")
##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                              Fertility
## -----------------------------------------------
## Agriculture                   -0.066
##                               (0.080)
##
## Education                    -0.963***
##                               (0.189)
##
## Constant                     84.080***
##                               (5.782)
##
## -----------------------------------------------
## Observations                    47
## R2                             0.449
## Adjusted R2                    0.424
## Residual Std. Error       9.479 (df = 44)
## F Statistic            17.945*** (df = 2; 44)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
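The same stargazer call can also write the table to a file for use in a report; a minimal sketch, where the type argument controls the output format and the file names are purely illustrative placeholders:
# Sketch: export the regression table as HTML or LaTeX
# (file names are placeholders; files are written to the working directory)
stargazer(model, type = "html", out = "fertility_model.html")
stargazer(model, type = "latex", out = "fertility_model.tex")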
# Display the detailed model summary
summary(model)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education, data = swiss)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -17.3072  -6.6157  -0.9443   8.7028  20.5291
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  84.08005    5.78180  14.542  < 2e-16 ***
## Agriculture  -0.06648    0.08005  -0.830    0.411
## Education    -0.96276    0.18906  -5.092  7.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.479 on 44 degrees of freedom
## Multiple R-squared: 0.4492, Adjusted R-squared: 0.4242
## F-statistic: 17.95 on 2 and 44 DF, p-value: 2e-06
This output summarizes the model we just ran: the call, the distribution of the residuals, the estimated coefficients, and the overall fit statistics.
Residuals: The residuals have a minimum value of -17.3072 and a maximum value of 20.5291. The distribution appears to be approximately symmetric, with a median close to zero.
Coefficients: The intercept term is estimated at 84.08005, indicating the expected fertility rate when both independent variables (Agriculture and Education) are zero. The coefficient for Agriculture is -0.06648, but it is not statistically significant (p-value = 0.411). The coefficient for Education is -0.96276, and it is statistically significant (p-value < 0.001).
Model Fit: The model has a multiple R-squared value of 0.4492, indicating that approximately 44.92% of the variation in the fertility rate can be explained by the independent variables Agriculture and Education. The adjusted R-squared value of 0.4242 accounts for the number of predictors in the model. The F-statistic of 17.95 with a p-value of 2e-06 suggests that the model is statistically significant as a whole.
Overall, the results indicate that the coefficient for Education is statistically significant and has the expected negative sign, suggesting a meaningful relationship between education level and fertility rate. On the other hand, the coefficient for Agriculture is not statistically significant, and therefore, we cannot make strong conclusions about its relationship with fertility rate.
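To complement the p-values discussed above, the point estimates and their 95% confidence intervals can be extracted directly from the fitted model; a short sketch using base R:
# Coefficient table as a matrix: estimate, std. error, t value, p value
coef(summary(model))
# 95% confidence intervals for the intercept, Agriculture, and Education;
# the Education interval lying entirely below zero matches its significant
# negative coefficient
confint(model, level = 0.95)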
# Generate the four standard diagnostic plots for the fitted model
plot(model)
The Residuals vs. Fitted plot hints at some heteroscedasticity: the residuals are scattered around zero, with both positive and negative values across the range of fitted Fertility values, but the smoothed trend line drifts downward at the larger fitted values rather than staying flat, so the spread is not perfectly constant.
The Normal Q-Q plot suggests the residuals are only approximately normally distributed: most points follow the reference line, but the most extreme positive and negative residuals drift away from it in the tails. I am not fully confident in this reading.
The Scale-Location plot tells a similar story: the spread of the standardized residuals stays roughly constant across the fitted values, which is consistent with homoscedasticity.
The Residuals vs. Leverage plot shows each observation relative to the Cook's distance contours, so we can identify which data points are the most influential and check whether any of them fall outside the Cook's distance threshold.
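By default plot(model) draws these four panels one after another; to view them together on a single page, one option is:
# Arrange the four diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(model)
par(op)  # restore the previous graphics settings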
The Gauss-Markov assumptions, also known as the classical linear regression assumptions, are a set of assumptions that underpin the ordinary least squares (OLS) regression method. These assumptions are as follows:
Linearity: The relationship between the dependent variable and the independent variables is linear in the parameters. This assumption implies that the true relationship can be accurately represented by a linear equation.
Independence: The observations in the dataset are independent of each other. This assumption ensures that the errors or residuals of one observation do not affect the errors of other observations.
Homoscedasticity: The variance of the errors is constant across all levels of the independent variables. Homoscedasticity implies that the spread of the residuals is the same throughout the range of the predictors. Violations of this assumption result in heteroscedasticity.
No Autocorrelation: The errors or residuals are not correlated with each other. Autocorrelation suggests that the errors in one observation are related to the errors in previous or subsequent observations.
Zero Mean: The errors have a mean of zero. This assumption ensures that the model is not systematically biased in predicting the dependent variable.
No Perfect Multicollinearity: There is no perfect linear relationship among the independent variables. Perfect multicollinearity occurs when one independent variable can be perfectly predicted by a linear combination of the other independent variables, leading to issues in estimating the regression coefficients.
Normality: The errors or residuals follow a normal distribution. Strictly speaking, normality is not required for the Gauss-Markov theorem itself, but it is usually added because it allows for valid hypothesis testing, confidence interval estimation, and determination of statistical significance, especially in small samples.
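Several of these assumptions can also be checked formally in R. A minimal sketch, assuming the lmtest and car packages are installed (they are not used elsewhere in this document):
library(lmtest)  # provides bptest() and dwtest()
library(car)     # provides vif()
# Homoscedasticity: Breusch-Pagan test (H0: constant error variance)
bptest(model)
# No autocorrelation: Durbin-Watson test (H0: no first-order autocorrelation)
dwtest(model)
# No perfect multicollinearity: variance inflation factors (values near 1 are reassuring)
vif(model)
# Normality of residuals: Shapiro-Wilk test (H0: residuals are normally distributed)
shapiro.test(residuals(model))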
BLUE stands for Best Linear Unbiased Estimators. The statement that “OLS is BLUE” refers to the desirable properties of the OLS estimators in the context of linear regression.
Here’s what each term means:
Best: The OLS estimators are considered “best” because, under certain assumptions, they have the smallest variance among all linear unbiased estimators. In other words, among all unbiased linear estimators for the regression coefficients, the OLS estimators have the minimum amount of variability.
Linear: OLS estimators are linear combinations of the observed values of the dependent variable. They are obtained by minimizing the sum of squared residuals, which leads to a linear equation for estimating the coefficients.
Unbiased: OLS estimators are “unbiased” because, on average, they produce estimates equal to the true population parameters. Unbiasedness means that, over repeated sampling, the expected value of the OLS estimators equals the population parameter being estimated.
Estimators: OLS provides estimates for the regression coefficients. These estimates allow us to make inferences about the relationships between the independent variables and the dependent variable in the population.
The BLUE property of OLS is desirable because it ensures that the estimated coefficients are not only unbiased but also have the smallest variance among all linear unbiased estimators. This property makes OLS estimators efficient and provides a basis for valid statistical inference, including hypothesis testing, confidence interval estimation, and model selection.
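Unbiasedness in particular can be illustrated with a small simulation: repeatedly draw samples from a model with known coefficients, fit OLS each time, and check that the estimates average out to the true values. A minimal sketch, where the "true" coefficients and sample size are chosen only to mimic the regression above:
set.seed(123)
true_beta <- c(84, -0.07, -0.96)  # illustrative intercept and slopes
estimates <- replicate(1000, {
  agri <- runif(47, 0, 100)  # simulated Agriculture values
  educ <- runif(47, 0, 60)   # simulated Education values
  y <- true_beta[1] + true_beta[2] * agri + true_beta[3] * educ + rnorm(47, sd = 9.5)
  coef(lm(y ~ agri + educ))  # OLS estimates for this simulated sample
})
# Average of the 1,000 estimates per coefficient: close to true_beta, as unbiasedness implies
rowMeans(estimates)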
Taking the logarithm of a variable in linear regression can be beneficial for several reasons. Firstly, it can help improve the interpretation of the relationship between variables. By taking the logarithm, we can transform a variable with a nonlinear relationship into a more linear relationship, making it easier to understand and interpret the effect of the variable on the dependent variable. For example, in economic contexts, taking the logarithm of variables like income or GDP can convert exponential growth rates into constant percentage changes, which are often more meaningful and intuitive.
Secondly, using the logarithm can improve the fit of the regression model. In cases where the relationship between the variables is nonlinear or exhibits heteroscedasticity (varying levels of spread), taking the logarithm can help stabilize the variance and achieve a more consistent spread of residuals. This can lead to a better fit of the model and more accurate estimations of the regression coefficients.
Lastly, taking the logarithm can address the issue of skewed or heavily skewed variables. Logarithmic transformation can help normalize the distribution of skewed variables, making them more suitable for linear regression analysis. This is particularly useful when dealing with variables that have a large range or contain extreme values, as the logarithmic transformation can help mitigate the influence of outliers and improve the overall model performance.
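As a concrete sketch of this idea (an illustrative re-specification, not part of the analysis above), the same regression can be refit with the dependent variable in logs, so that the coefficients are read approximately as proportional changes in Fertility:
# Refit the model with log(Fertility) as the outcome
log_model <- lm(log(Fertility) ~ Agriculture + Education, data = swiss)
summary(log_model)
# Interpretation sketch: a coefficient of -0.01 on Education would mean that one
# additional unit of Education is associated with roughly a 1% lower Fertility,
# holding Agriculture constant.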