1. Find a dataset and run a multivariate regression in R (have at least 2 independent variables). Make sure to type out the estimating equation with subscripts, provide summary statistics of the dataset, and present the final regression with the stargazer package (you can try to present a few different specifications).
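The estimating equation for the main specification fitted later in this document (Fertility regressed on Education and Examination) can be written with observation subscripts as follows. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and $\varepsilon_i$ are notation introduced here for illustration.

```latex
\text{Fertility}_i = \beta_0 + \beta_1\,\text{Education}_i + \beta_2\,\text{Examination}_i + \varepsilon_i,
\qquad i = 1, \dots, 47
```

Here $i$ indexes the 47 observations in the swiss dataset, and $\varepsilon_i$ is the error term.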
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
data(swiss)
str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(swiss)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
plot(swiss)
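#The task asks for summary statistics and a final regression table from stargazer with a few specifications. A minimal sketch of how that could look is given here. The model names m1, m2, m3 and the choice of the extra specifications (Education alone, and adding Catholic) are assumptions made for illustration; only m2 matches the model analyzed in this document.

```r
# Summary statistics and a side-by-side regression table for the swiss data.
# Assumption: m1 and m3 are illustrative specifications, not part of the
# original analysis; m2 is the model used in the rest of this document.
data(swiss)

m1 <- lm(Fertility ~ Education, data = swiss)
m2 <- lm(Fertility ~ Education + Examination, data = swiss)
m3 <- lm(Fertility ~ Education + Examination + Catholic, data = swiss)

# Only print the tables if the stargazer package is installed
if (requireNamespace("stargazer", quietly = TRUE)) {
  # Summary statistics for every variable in the data frame
  stargazer::stargazer(swiss, type = "text", title = "Summary statistics")
  # Three specifications in one table, one column per model
  stargazer::stargazer(m1, m2, m3, type = "text",
                       title = "OLS estimates, dependent variable: Fertility")
}
```

#Using type = "text" prints the tables to the console; type = "latex" or type = "html" would produce output for a typeset report instead.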

# 2 Talk about what you find in a few lines i.e. interpret a few slopes. Is the sign in the expected direction, and is the magnitude meaningful? What about the statistical significance?

library(psych)

data <- lm(swiss$Fertility~swiss$Education + swiss$Examination)
summary(data)
## 
## Call:
## lm(formula = swiss$Fertility ~ swiss$Education + swiss$Examination)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9935  -6.8894  -0.3621   7.1640  19.2634 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        85.2533     3.0855  27.630   <2e-16 ***
## swiss$Education    -0.5395     0.1924  -2.803   0.0075 ** 
## swiss$Examination  -0.5572     0.2319  -2.402   0.0206 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.982 on 44 degrees of freedom
## Multiple R-squared:  0.5055, Adjusted R-squared:  0.483 
## F-statistic: 22.49 on 2 and 44 DF,  p-value: 1.87e-07
# Diagnostic plots for the fitted model
par(mfrow = c(2,2))
plot(data)

#The following summarizes the residuals and coefficients of the fitted model. The residuals range from -15.9935 to 19.2634 with a median of -0.3621, close to zero, and the first and third quartiles (-6.8894 and 7.1640) are of similar magnitude, so the distribution looks roughly symmetric. The intercept estimate is 85.2533, the expected fertility rate when both independent variables, Education and Examination, are zero. The coefficient for Education is -0.5395 and is statistically significant at the 1% level, with a p-value of 0.0075: holding Examination constant, a one-point increase in Education is associated with a fertility rate that is about 0.54 points lower. The coefficient for Examination is -0.5572 and is statistically significant at the 5% level, with a p-value of 0.0206.

#Regarding the model fit, the multiple R-squared value is 0.5055, meaning that approximately 50.55% of the variation in the fertility rate is explained by the independent variables Education and Examination. The adjusted R-squared value of 0.483 penalizes the fit for the number of predictors in the model. The F-statistic of 22.49 on 2 and 44 degrees of freedom, with a p-value of 1.87e-07, indicates that the two predictors are jointly statistically significant.
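#The fit statistics can be reproduced by hand from the residuals, which makes their definitions explicit. This is a sketch that refits the same model with a formula and data argument; the names fit, rss, tss, r2, adj_r2 and f_stat are introduced here for illustration.

```r
# Recompute R-squared, adjusted R-squared and the F-statistic by hand
data(swiss)
fit <- lm(Fertility ~ Education + Examination, data = swiss)

rss <- sum(residuals(fit)^2)                              # residual sum of squares
tss <- sum((swiss$Fertility - mean(swiss$Fertility))^2)   # total sum of squares
n <- nobs(fit)   # number of observations
k <- 2           # number of slope coefficients

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(c(r2 = r2, adj_r2 = adj_r2, f_stat = f_stat), 4))
```

#The three values should agree with the Multiple R-squared, Adjusted R-squared and F-statistic lines in the summary output.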

#Both slope coefficients have the expected negative sign: higher education and higher examination results are associated with lower fertility. The coefficient for Education is statistically significant at the 1% level, and the coefficient for Examination is statistically significant at the 5% level. The magnitudes are also meaningful. For example, a 10-point increase in Education is associated with a fertility rate that is about 5.4 points lower, which is sizeable relative to the sample mean of 70.14.
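#To back up statements about sign, magnitude and significance, the coefficient table and confidence intervals can be pulled out of the fitted model directly. This is a minimal sketch; the object names and the 10-point change used for the magnitude calculation are illustrative assumptions.

```r
# Extract estimates, p-values and 95% confidence intervals
data(swiss)
fit <- lm(Fertility ~ Education + Examination, data = swiss)

coefs <- coef(summary(fit))        # estimates, std. errors, t values, p-values
ci <- confint(fit, level = 0.95)   # 95% confidence intervals

# Which slopes are significant at the 5% level? (drop the intercept row)
significant <- coefs[-1, "Pr(>|t|)"] < 0.05

# Magnitude: predicted change in Fertility for a 10-point rise in Education,
# holding Examination fixed
effect_10 <- 10 * coef(fit)["Education"]

print(round(cbind(coefs, ci), 4))
print(significant)
print(effect_10)
```

#A confidence interval that excludes zero corresponds to significance at the matching level, so the intervals and the p-values tell the same story.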

#In conclusion, the model indicates that both Education and Examination are negatively and significantly associated with the fertility rate, with Education showing the stronger statistical evidence.

3 More importantly, interpret the residuals.

par(mfrow = c(2,2))
plot(data)

#1. Residuals vs Fitted: The residuals scatter both above and below zero, so positive and negative errors are both present. However, the smoothed line trends downward instead of staying flat around zero, which hints that the linear specification of Fertility on Education and Examination may miss some structure. The vertical spread of the points also does not look perfectly constant across the fitted values, so some heteroscedasticity may be present.

#2. Q-Q: The quantile-quantile plot compares the standardized residuals with the quantiles of a normal distribution. The points do not follow the reference line throughout: the most negative and the most positive residuals depart from it in the tails, which suggests that the residuals are not exactly normally distributed.
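#A formal test can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R to the residuals; the choice of this particular test is an assumption made here, not part of the original analysis.

```r
# Shapiro-Wilk normality test on the regression residuals
data(swiss)
fit <- lm(Fertility ~ Education + Examination, data = swiss)
res <- residuals(fit)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(res)
print(sw)

# Q-Q plot of the raw residuals with a reference line
qqnorm(res)
qqline(res)
```

#A small p-value would be evidence against normality. Normality of the errors matters for exact t and F inference in small samples, but it is not one of the Gauss-Markov assumptions.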

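#3. Scale-Location: This plot shows the square root of the absolute standardized residuals against the fitted values. As in the first plot, the smoothed line trends downward. The spread of the points looks roughly even over most of the range, so the constant-variance (homoskedasticity) assumption appears to hold approximately.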
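#Because the plots give mixed signals about constant variance, a Breusch-Pagan style test is a useful cross-check. The sketch below computes the studentized (Koenker) version by hand in base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The object names are illustrative assumptions.

```r
# Breusch-Pagan test (studentized form) computed by hand
data(swiss)
fit <- lm(Fertility ~ Education + Examination, data = swiss)

# Auxiliary regression: squared residuals on the same regressors
u2 <- residuals(fit)^2
aux <- lm(u2 ~ Education + Examination, data = swiss)

n <- nobs(fit)
lm_stat <- n * summary(aux)$r.squared                    # LM test statistic
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)   # df = number of regressors

print(c(statistic = lm_stat, p_value = p_value))
```

#The null hypothesis is homoskedasticity, so a small p-value would indicate that the error variance depends on Education or Examination.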

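#4. Residuals vs Leverage: This plot identifies influential observations by plotting the standardized residuals against leverage, with dashed contours for Cook's distance. It shows which data points fall outside the Cook's distance contours; such points have a large influence on the estimated coefficients and should be examined individually.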
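#The influential observations can also be listed numerically instead of being read off the plot. This is a sketch under an assumption: the 4/n cutoff for Cook's distance is a common rule of thumb chosen here for illustration, not a formal test.

```r
# List influential and high-leverage observations
data(swiss)
fit <- lm(Fertility ~ Education + Examination, data = swiss)

cd <- cooks.distance(fit)   # influence of each observation on the fit
lev <- hatvalues(fit)       # leverage of each observation
n <- nobs(fit)

# Rule-of-thumb cutoff (an assumption, not a formal test)
cutoff <- 4 / n
influential <- sort(cd[cd > cutoff], decreasing = TRUE)
print(influential)

# The three observations with the highest leverage
print(head(sort(lev, decreasing = TRUE), 3))
```

#Refitting the model without the flagged observations and comparing the coefficients is one way to judge how much they matter.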

4 What are the Gauss-Markov assumptions, and did they hold?

#The Gauss-Markov assumptions are a set of sufficient conditions under which the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE) of the coefficients in a linear regression model. They cover linearity, independence, homoscedasticity, no perfect multicollinearity, and exogeneity. The linearity assumption means that the dependent variable is a linear function of the coefficients plus an error term. The independence assumption states that the errors are uncorrelated with each other across observations. The homoscedasticity assumption states that the errors have constant variance at all values of the independent variables. The no perfect multicollinearity assumption requires that no independent variable is an exact linear combination of the others. Finally, the exogeneity assumption states that the error term has mean zero conditional on the independent variables, which implies that the errors are uncorrelated with them.

#These assumptions concern the unobserved errors, so they cannot be verified directly; diagnostic tests and residual plots are used to check whether they are plausible. For this model, no perfect multicollinearity holds, since R was able to estimate separate coefficients for Education and Examination. The residual plots in part 3 raise some doubt about linearity and constant variance, and exogeneity cannot be confirmed from the plots alone, so the assumptions appear to hold at best approximately.
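#The multicollinearity condition is the easiest one to check numerically. With only two predictors, the variance inflation factor (VIF) follows directly from their correlation. The sketch below uses base R only; the variable names r and vif are illustrative.

```r
# Check how strongly the two predictors are related
data(swiss)

r <- cor(swiss$Education, swiss$Examination)

# With two regressors the VIF is the same for both: 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

print(c(correlation = r, vif = vif))
```

#Perfect multicollinearity would require a correlation of exactly 1 or -1. A VIF above 1 means the standard errors are inflated relative to uncorrelated predictors, but this does not violate the Gauss-Markov assumptions.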

5 What does “OLS is BLUE” mean?

#The phrase “OLS is BLUE” means that the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator of the coefficients in a linear regression model. “Linear” means the estimator is a linear function of the observed values of the dependent variable, “unbiased” means its expected value equals the true coefficient, and “best” means it has the smallest variance among all linear unbiased estimators. This optimality holds only under the Gauss-Markov assumptions: linearity, independence, homoscedasticity, no perfect multicollinearity, and exogeneity of the regressors. When these assumptions are met, OLS gives the most precise linear unbiased estimates possible.
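#A small simulation can illustrate what “best” means in practice. The sketch below compares the OLS slope with another linear unbiased estimator, the slope of the line through the first and last observations. Everything in it is hypothetical: the true parameters, the sample size and the choice of competing estimator are assumptions made for illustration.

```r
# Simulation: OLS versus another linear unbiased estimator of the slope
set.seed(1)
x <- 1:20
beta0 <- 2      # hypothetical true intercept
beta1 <- 0.5    # hypothetical true slope
n_sims <- 2000

ols <- numeric(n_sims)
endpoints <- numeric(n_sims)
for (s in seq_len(n_sims)) {
  # Errors are homoskedastic and uncorrelated, as Gauss-Markov requires
  y <- beta0 + beta1 * x + rnorm(length(x), sd = 1)
  ols[s] <- coef(lm(y ~ x))[2]
  # Competing estimator: slope through the first and last points only
  endpoints[s] <- (y[20] - y[1]) / (x[20] - x[1])
}

print(c(mean_ols = mean(ols), mean_endpoints = mean(endpoints)))
print(c(var_ols = var(ols), var_endpoints = var(endpoints)))
```

#Both estimators should average close to the true slope, since both are unbiased, while the OLS estimates should vary less across simulations, which is the sense in which OLS is “best”.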