# Clear the workspace
  rm(list = ls()) # Remove all objects from the environment
  gc()            # Free unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 521626 27.9    1160167   62   660385 35.3
## Vcells 950167  7.3    8388608   64  1769879 13.6

1. Gauss-Markov Assumptions.

The Gauss-Markov theorem is a fundamental result in statistics, particularly in the context of linear regression models. It states that in a linear regression model whose error terms have zero mean, are uncorrelated, and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is the ordinary least squares (OLS) estimator. The theorem thus establishes the efficiency of OLS under these conditions: no other linear unbiased estimator can estimate the regression coefficients with smaller variance.
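
As a quick illustration (a minimal sketch on simulated data, not part of the original write-up), the OLS estimator can be computed directly from its closed form, beta_hat = (X'X)^(-1) X'y, and checked against what lm() returns:

# Minimal sketch (illustrative, simulated data): compute the OLS estimator from
# its closed form, beta_hat = (X'X)^(-1) X'y, and check it against lm().
set.seed(123)                # for reproducibility
n <- 100
x <- rnorm(n)                # one regressor
y <- 2 + 3 * x + rnorm(n)    # true intercept 2, slope 3, iid errors
X <- cbind(1, x)             # design matrix with an intercept column

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # closed-form OLS estimate
beta_hat
coef(lm(y ~ x))              # should match up to numerical precision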

The Gauss-Markov theorem relies on several critical assumptions for its validity:

  1. Linearity
  2. No perfect multicollinearity
  3. Exogeneity
  4. No autocorrelation
  5. Homoscedasticity
  6. Normality of errors

2. Explain each assumption (what it means) and why we need to make it, as if the audience is non-technical, i.e., understands only plain English

  1. Linearity: By this assumption, the relationship between the independent variables (the ones we use to make predictions) and the dependent variable (the one we want to predict or explain) can be represented by a straight line. It is predicated on the idea that the independent variables' effects on the dependent variable are additive and constant.

  2. No perfect multicollinearity: This assumption means that the independent variables are not perfectly correlated with each other. It rules out situations where one independent variable is an exact duplicate or combination of the others, which would make it impossible to estimate the unique contribution of each variable.

  3. Exogeneity: It’s states that the error term (residual) in our regression model is unrelated to the independent variables. It means that there are no systematic factors or omitted variables that are affecting both the dependent variable and the independent variables.

  4. No autocorrelation: This assumption states that the error term does not exhibit any systematic patterns or correlations with itself across observations. It means that the residuals are not dependent on the order or timing of the observations.

  5. Homoscedasticity: This assumption means that the variability of the error term (residual) is constant across all levels of the independent variables. In other words, the spread or dispersion of the errors is the same for all observations.

  6. Normality of errors: The error term follows a normal distribution. This is not required for OLS to be the best linear unbiased estimator, but it makes hypothesis testing much more straightforward.

3. Explain each assumption and why we need to make it, as if the audience has some technical background

According to the first assumption, linearity, there is a linear relationship between X and y. Consequently,

y = Xβ + ϵ

There is no perfect multicollinearity among the column vectors according to the second assumption, full column rank: the columns of X are linearly independent of one another (the identification condition). According to the third assumption, zero conditional mean, the errors show no systematic pattern and the error term averages out to zero for any value of X. Consequently,

E[ϵ|X] = 0,

because the disturbances are assumed to average out to zero for all values of X; the assumption therefore implies E(y|X) = Xβ. The fourth assumption, homoskedasticity and no autocorrelation, means that every error term has the same spread under each value of the regressors and that positive and negative errors occur at random rather than in systematic runs. Consequently,

E(ϵϵ′|X) = σ²I

Homoskedasticity states that the variance of ϵi is the same for all i, and no autocorrelation implies that cov(ϵi, ϵj|X) = 0 for all i ≠ j. The fifth assumption concerns how the data are generated: we should take a random sample from the population, so the regressors cannot be generated in relation to the error term. Finally, while it is not necessary for the Gauss-Markov result, assuming that the errors follow a normal distribution simplifies hypothesis testing; it means that values near the mean occur more frequently and that the distribution of the errors is symmetric about the mean.
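
To make the E(ϵϵ′|X) = σ²I assumption concrete, here is a minimal sketch (continuing the simulated example from the sketch above; the variable names are illustrative) that builds the classical OLS variance estimator σ̂²(X′X)⁻¹ by hand and compares it with vcov():

# Minimal sketch (illustrative, simulated data): under homoskedasticity and
# no autocorrelation, Var(beta_hat | X) = sigma^2 (X'X)^(-1).
# Estimate it by hand and compare with vcov() on the fitted lm object.
set.seed(123)
n   <- 100
x   <- rnorm(n)
y   <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

X        <- model.matrix(fit)           # design matrix (intercept column + x)
e        <- residuals(fit)
sigma2   <- sum(e^2) / (n - ncol(X))    # unbiased estimate of sigma^2
vcov_hat <- sigma2 * solve(t(X) %*% X)  # sigma^2 (X'X)^(-1)

vcov_hat
vcov(fit)                               # should match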

Find a cross-sectional dataset for simplicity with more than 120 rows/observations

data("airquality")
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
# Store the airquality data as a data frame
df <- as.data.frame(airquality)

# Create a scatter plot 
plot(x = airquality$Temp, 
     y = airquality$Ozone,
     xlab = "Temp",         
     ylab = "Ozone", 
     main = "Airquality"
     )
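
As a quick sanity check on the size requirement (a small aside, not asked for explicitly), we can count the rows and missing values in airquality:

# Check the number of observations and the missing values
nrow(airquality)            # 153 observations, which exceeds the 120-row requirement
colSums(is.na(airquality))  # Ozone (37 NAs) and Solar.R (7 NAs) contain missing values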

library(lmtest)  # for significance testing
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Fit linear regression model
lm_model <- lm(Temp ~ Ozone, data = airquality)

# Summary of regression results
summary(lm_model)
## 
## Call:
## lm(formula = Temp ~ Ozone, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.147  -4.858   1.828   4.342  12.328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 69.41072    1.02971   67.41   <2e-16 ***
## Ozone        0.20081    0.01928   10.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.819 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.4877, Adjusted R-squared:  0.4832 
## F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16
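
Since lmtest was loaded for significance testing, here is a hedged sketch of how it could be used on this model: coeftest() reproduces the coefficient table from summary(), and robust standard errors could later be plugged in if heteroskedasticity becomes a concern.

# Coefficient significance tests via lmtest (loaded above).
# coeftest() reproduces the t-tests from summary(); a heteroskedasticity-consistent
# covariance matrix (e.g. from the sandwich package) could be supplied via its vcov. argument.
coeftest(lm_model)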

Now, create the 4 linear regression plots we saw in class using the “plot(my_reg)” command in R.

The plot() function applied to an lm object produces a four-panel display: a plot of the residuals vs. fitted values, a normal Q-Q plot of the residuals, a scale-location plot, and a plot of the residuals vs. leverage.

# Set up the plotting region / graphical parameters to have four subplots
par(mfrow = c(2, 2))

# Create the four diagnostic plots
plot(lm_model)

Tell us what each of the 4 charts measures. What is the logic behind the chart setup? Also, are there any trends/rules of thumb you should rely on to analyze the model fit from these charts?

# Create a residual plot of residuals vs. fitted values

plot(lm_model, 
     which = 1
     )

This produces a plot of the residuals against the fitted values. In a "residuals vs. fitted" plot we examine the relationship between the residuals (errors) and the predicted (fitted) values from the regression model. This examination helps assess the appropriateness of the model and identify potential patterns or issues in its performance.

Based on the residuals vs. fitted values plot, we can judge whether the type of model used is appropriate. In this case we are using a level-level model, and the smoothed line through the residuals starts near zero but then trends downward rather than staying flat, with several residuals dipping to between -10 and -20. This could pose a problem, as it suggests an issue of heteroskedasticity.
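
Because the plot hints at heteroskedasticity, a formal check is possible with the Breusch-Pagan test from the already-loaded lmtest package (a sketch; the conclusion should be read off the actual p-value):

# Breusch-Pagan test for heteroskedasticity (lmtest).
# H0: constant error variance; a small p-value would support
# the visual impression of non-constant variance.
bptest(lm_model)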

library(ggplot2)

# Create a data frame of fitted values and residuals
df <- data.frame(fitted   = lm_model$fitted.values, 
                 residual = lm_model$residuals
                 )

# Create a residual plot using ggplot2
ggplot(data = df, 
       mapping = aes(x = fitted, 
                     y = residual)
       ) +
  geom_point() +
  geom_smooth(method = "loess") + 
  xlab("Fitted Values") +
  ylab("Residuals") +
  ggtitle("Residual Plot")
## `geom_smooth()` using formula = 'y ~ x'

Normal QQ plot of the residuals

# Create a QQ plot of the residuals

plot(lm_model, 
     which = 2
     )

The normal Q-Q plot reveals that most of our residuals lie within the -2 to 2 range and follow the reference line, consistent with an approximately normal distribution. However, observation 117 deviates from this trend and might be an outlier, while the other labelled points, 15 and 24, sit much closer to the line.
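
To complement the visual check, here is a short sketch of a formal normality test on the residuals (the Shapiro-Wilk test from base R):

# Shapiro-Wilk test of normality on the residuals.
# H0: the residuals are normally distributed; a small p-value
# would cast doubt on the normality assumption.
shapiro.test(residuals(lm_model))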

library(ggplot2)

# Create a data frame of the residuals
df <- data.frame(residuals = lm_model$residuals)

# Create a QQ plot using ggplot2
ggplot(data = df, 
       mapping = aes(sample = residuals)
       ) +
  stat_qq() +
  stat_qq_line() +
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles") +
  ggtitle("Normal QQ Plot of Residuals")

Scale-Location

# Create a scale location plot

plot(lm_model, 
     which = 3
     )

This is also known as the spread-location plot. It shows whether the residuals are spread equally across our predictions, which is a check for homoscedasticity. Ideally the dots should be scattered evenly with no discernible pattern, and the red line should be roughly horizontal. Unfortunately, this plot mirrors our residuals vs. fitted plot: the points sit near the red line initially but deviate as we move along, and the spread of the values changes across the range of fitted values, so the red line is not flat.
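
Mirroring the ggplot2 versions above, here is a sketch of the same scale-location idea rebuilt by hand: the square root of the absolute standardized residuals (from rstandard()) plotted against the fitted values.

library(ggplot2)

# Create a data frame of fitted values and sqrt(|standardized residuals|)
df_sl <- data.frame(fitted       = fitted(lm_model),
                    sqrt_abs_std = sqrt(abs(rstandard(lm_model))))

# Create a scale-location plot using ggplot2
ggplot(data = df_sl,
       mapping = aes(x = fitted,
                     y = sqrt_abs_std)
       ) +
  geom_point() +
  geom_smooth(method = "loess") +
  xlab("Fitted Values") +
  ylab("Sqrt(|Standardized Residuals|)") +
  ggtitle("Scale-Location Plot")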

# Create a residuals vs. leverage plot
plot(lm_model, which = 5)

Moreover, the plot of residuals versus leverage confirms that observation 117 is an anomaly. In summary, our model appears to violate the homoscedasticity assumption and may be underfitting. To lessen this, we might consider adding more explanatory variables or enlarging the dataset used in the linear regression model.
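
To put numbers behind the residuals vs. leverage impression, here is a small sketch that extracts the leverage values and Cook's distances and lists the most influential observations (the row labels refer to the original airquality rows):

# Influence diagnostics: hat values (leverage) and Cook's distance.
# Observations with large Cook's distance are the most influential.
infl <- data.frame(leverage = hatvalues(lm_model),
                   cooks_d  = cooks.distance(lm_model))

# Show the observations with the largest Cook's distance
head(infl[order(-infl$cooks_d), ])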

Sometimes it helps to transform a variable (log, square root, etc.) for a better model fit (to overcome problems due to non-constant variance or nonlinearity), but it can make the interpretation of coefficients harder. Try to play around by transforming the variables and tell us if the model fit improves or not.

# Add natural-log transforms of Temp and Ozone
data_transform <- transform(airquality, 
                            ln_Temp  = log(Temp),
                            ln_Ozone = log(Ozone))
# Run the simple linear regression again with the log transformed y and x variables
my_reg2 <- lm(ln_Temp ~ ln_Ozone, data = data_transform)
summary(my_reg2)
## 
## Call:
## lm(formula = ln_Temp ~ ln_Ozone, data = data_transform)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24155 -0.04616  0.01970  0.05486  0.19084 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.984743   0.032582  122.30   <2e-16 ***
## ln_Ozone    0.106090   0.009242   11.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08578 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5362, Adjusted R-squared:  0.5321 
## F-statistic: 131.8 on 1 and 114 DF,  p-value: < 2.2e-16
# Plot the diagnostics for the regression of the log-transformed y and x variables
plot(my_reg2)

After looking at the my_reg2 statistics, there is a noticeable improvement in the model. The updated residuals vs. fitted values plot shows a better fit than the original model, with residuals more evenly balanced between positive and negative values, and the R-squared has risen from 0.4877 to 0.5362.
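
As one additional hedged check, re-running the Breusch-Pagan test on both models shows whether the heteroskedasticity concern has eased after the transformation (the conclusion should be read from the actual p-values rather than assumed):

# Compare the Breusch-Pagan test before and after the log transformation (lmtest)
bptest(lm_model)   # original level-level model
bptest(my_reg2)    # log-log model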