I. 1) State the Gauss-Markov Assumptions:

The Gauss-Markov theorem states that the OLS estimator is the Best Linear Unbiased Estimator (BLUE) if certain assumptions are met, which justifies using the least squares estimators on the sample data. If an estimator is BLUE, there is no other linear unbiased estimator with a lower sampling variance. The six assumptions are:

  1. Linearity

  2. Full column rank

  3. Exogeneity of the independent variables (Zero conditional mean)

  4. Homoscedasticity and nonautocorrelation

  5. Data generation

  6. Normal distribution

2) Explain each assumption to the non-technical crowd:

  1. Linearity: This assumption means that the relationship between the independent variables and the dependent variable follows a straight line. It’s like saying the effect of each independent variable on the dependent variable is constant, i.e., linear, and doesn’t curve or bend.

  2. Full column rank:

    This assumption says that there’s no perfect linear relationship between the independent variables. It means you can’t perfectly predict one independent variable from another; each predictor should carry its own distinct information.

  3. Exogeneity of the independent variables (Zero conditional mean of error):

    This assumption says that knowing the values of the predictors gives no information about the error term: on average, the error is zero no matter what values the independent variables take. In other words, the factors bundled into the error term are unrelated to the independent variables, so the predictors don’t help us guess whether an observation will fall above or below the regression line.

  4. Homoscedasticity and nonautocorrelation:

    If the spread of the errors stays roughly constant along the regression line, the errors are homoscedastic. This means that the variability of the errors is the same across all values of the independent variables; the errors are equally scattered around the regression line. Nonautocorrelation means that the errors are not related to each other: one error doesn’t depend on or affect the others.

  5. Data generation:

    The data, both the independent and the dependent variables, should be a random sample from the population. This means that each observation is equally likely to be picked from the population, and that all of the sample data comes from the same population.

  6. Normal distribution:

    The disturbances are normally distributed.

    3) Explain each assumption to the technical crowd:

    1. Linearity: In the regression equation,

      \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \] The parameters have to be linear, but the variables can be non-linear. Equation (1) below is completely fine, but equation (2) does not satisfy the linearity assumption.

      \[ y_i = \beta_0 + \beta_1 x_i^2 + \epsilon_i \tag{1} \]

      \[ y_i = \beta_0\beta_1 + \beta_2 x_i + \epsilon_i \tag{2} \]
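
      As a quick sanity check (simulated data; the variable names x and y are made up for illustration), equation (1) can be estimated directly with lm() because it is linear in the parameters even though the regressor enters as a square, while equation (2) cannot be, because the product \(\beta_0\beta_1\) makes it non-linear in the parameters:

      # Hypothetical illustration: equation (1) is linear in the parameters,
      # so lm() can estimate it even though x enters as a square.
      set.seed(123)
      x <- runif(100, 1, 5)
      y <- 2 + 0.5 * x^2 + rnorm(100)

      eq1 <- lm(y ~ I(x^2))   # linear in beta_0 and beta_1
      coef(eq1)               # estimates should land near 2 and 0.5

      # Equation (2) contains the product beta_0 * beta_1, so it cannot be
      # written as X %*% beta and cannot be estimated by lm() as written.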

    2. Full column rank:

      There is no exact linear relationship among any of the predictors: no perfect multicollinearity. The columns of the design matrix are linearly independent. If the columns were linearly dependent, the determinant of \(X'X\) would be zero and the matrix could not be inverted, so the OLS estimator could not be computed.
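
      A minimal sketch of what perfect multicollinearity does, using simulated data (the variables x1 and x2 are made up for illustration):

      set.seed(1)
      x1 <- rnorm(50)
      x2 <- 2 * x1               # x2 is an exact linear function of x1
      y  <- 1 + x1 + rnorm(50)

      X <- cbind(1, x1, x2)
      det(t(X) %*% X)            # essentially zero: X'X is singular and cannot be inverted

      lm(y ~ x1 + x2)            # lm() returns NA for x2 because the columns are linearly dependent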

    3. Exogeneity of the independent variables (Zero conditional mean of error):

      The expected value of the disturbance (error term) at observation i is NOT a function of the predictors observed at any observation:

      \[ E[\epsilon_i \mid X] = 0, \quad i = 1, 2, \ldots, n \]
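
      A small simulation (hypothetical data) of what happens when this assumption fails, i.e., when a predictor is correlated with the error term:

      set.seed(2)
      u <- rnorm(500)            # the disturbance
      x <- rnorm(500) + u        # x is correlated with the disturbance, violating the assumption
      y <- 1 + 2 * x + u         # the true slope is 2

      coef(lm(y ~ x))            # the OLS slope is biased (it lands around 2.5, not 2)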

    4. Homoscedasticity and nonautocorrelation:

      Each disturbance has the same finite variance (homoscedasticity), and disturbances at different observations are uncorrelated (nonautocorrelation). The variability of the errors is the same across all levels of the independent variables, and knowing one error tells us nothing about another. Compactly,

      \[ E[\epsilon\epsilon' \mid X] = \sigma^2 I \]

      where the diagonal elements of \(\sigma^2 I\) encode the constant variance and the zero off-diagonal elements encode the absence of autocorrelation.
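
      A quick simulated contrast (made-up data) between errors that satisfy this assumption and errors that do not:

      set.seed(3)
      x <- runif(200, 1, 10)
      e_homo   <- rnorm(200, sd = 1)    # constant variance: Var(e_i) = sigma^2 for every i
      e_hetero <- rnorm(200, sd = x)    # variance grows with x, violating homoscedasticity

      plot(x, e_homo,   main = "Homoscedastic errors")     # even band around zero
      plot(x, e_hetero, main = "Heteroscedastic errors")   # fan shape that widens with x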

    5. Data generation:

      \[ (x_i, y_i), \quad i = 1, 2, \ldots, n \]

      All of the data \((x_i, y_i)\) is drawn from the same population, typically by random sampling.

    6. Normal distribution

      The disturbances, conditional on X, are normally distributed with mean zero and variance \(\sigma^2 I\). The central limit theorem is typically invoked to justify this assumption.

      \[ \epsilon \mid X \sim N[0, \sigma^2 I] \]

    II. Find a cross-sectional dataset for simplicity with more than 120 rows/observations. Run a linear regression, and store the regression results in an object called “my_reg”.

    I chose a cross-sectional dataset that ships with R. The “iris” dataset contains measurements of sepal and petal length and width for three different species of iris flowers. It has 150 rows/observations, which satisfies the requirement of more than 120 observations.

    data(iris)
    
    my_reg <- lm(Sepal.Length ~ Sepal.Width, data = iris)
    summary(my_reg)
    ## 
    ## Call:
    ## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -1.5561 -0.6333 -0.1120  0.5579  2.2226 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
    ## Sepal.Width  -0.2234     0.1551   -1.44    0.152    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.8251 on 148 degrees of freedom
    ## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
    ## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

    In this linear regression, the dependent variable (response) is “Sepal.Length,” which represents the length of the sepal of an iris flower. The independent variable (predictor) is “Sepal.Width,” representing the width of the sepal.

    The estimating equation for this simple linear regression is:

    \[ Sepal.Length_i = \beta_0 + \beta_1 \cdot Sepal.Width_i + \epsilon_i \]

    \(\beta_0\) is the intercept, representing the estimated sepal length when the sepal width is zero.

    \(\beta_1\) is the slope, representing the change in sepal length for a one-unit change in sepal width.

    Both the independent and dependent variables are measured in centimeters (cm). From the regression output, the intercept is approximately 6.5262. This is the estimated value of the dependent variable (Sepal.Length) when the independent variable (Sepal.Width) is zero. In this context, a sepal width of zero is not meaningful, so the intercept has no practical interpretation.

    The coefficient for “Sepal.Width” is approximately -0.2234. This coefficient represents the change in the dependent variable (Sepal.Length) for a one-unit change in the independent variable (Sepal.Width): for each additional centimeter of sepal width, sepal length is estimated to decrease by approximately 0.2234 cm. The negative sign indicates an inverse relationship between sepal width and sepal length; as sepal width increases, sepal length tends to decrease.

    To assess the statistical significance of the coefficients, we look at the t-values and the associated p-values. For the intercept, the t-value is 13.63 and the p-value is very close to zero, which indicates that the intercept is highly statistically significant.

    For the coefficient of Sepal.Width, the t-value is -1.44 and the p-value is approximately 0.152. The coefficient is therefore not statistically significant at the conventional 0.05 significance level, so the relationship between sepal width and sepal length in this dataset may not be statistically significant. We cannot reject the null hypothesis that the slope is zero: there is insufficient evidence of a statistically significant linear relationship between sepal width and sepal length.
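
    To double-check the numbers quoted above, the coefficient table and confidence intervals can be pulled directly from the fitted model:

    coef(summary(my_reg))   # estimates, standard errors, t-values, and p-values
    confint(my_reg)         # 95% confidence intervals; the Sepal.Width interval includes zero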

    III. Now, create the 4 linear regression plots we saw in class using the “plot(my_reg)” command in R. 

    # Creating a scatter plot
    plot(x = iris$Sepal.Width,
         y = iris$Sepal.Length,
         xlab = "Sepal Width (in cms)",
         ylab = "Sepal Length (in cms)",
         main = "Iris Flower")

    plot(my_reg)
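
    By default, plot(my_reg) draws the four diagnostic plots one at a time; an optional tweak using base R graphics shows them in a single window:

    par(mfrow = c(2, 2))    # arrange the four diagnostic plots in a 2 x 2 grid
    plot(my_reg)
    par(mfrow = c(1, 1))    # reset the plotting layout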

    The four plots are:

    1. Residuals vs. Fitted Values

      This plot helps assess the linearity and homoscedasticity assumptions. If the points in the plot are randomly scattered around the horizontal line at zero, it suggests that the linearity assumption is met. Homoscedasticity is satisfied if the spread of the points is relatively consistent across the range of fitted values. Deviations from these patterns might indicate issues with linearity or heteroscedasticity.

    2. Normal Q-Q Plot

      This plot checks the assumption of normality in the residuals. If the points in the plot closely follow the diagonal line, it suggests that the residuals are normally distributed. Deviations from the diagonal line may indicate departures from normality in the residuals.

    3. Scale-Location Plot

      This plot helps assess the assumption of constant variance of residuals. It displays the square root of the standardized residuals against the fitted values. A horizontal line with relatively constant spread of points suggests homoscedasticity. A fan-shaped pattern or changing spread could indicate heteroscedasticity.

    4. Residuals vs. Leverage

      This plot helps identify influential data points and outliers. Leverage is shown on the x-axis and standardized residuals on the y-axis, so points far to the right have high leverage. Cook’s Distance measures the influence of each data point on the regression coefficients; points with a high Cook’s Distance are potential outliers or influential observations that can disproportionately affect the model. (These quantities can also be computed directly, as sketched after this list.)

    These plots help assess the model fit, normality of residuals, homoscedasticity, and influential data points.
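
    The quantities behind plots 3 and 4 can also be computed directly; a short sketch using base R:

    rstandard(my_reg)       # standardized residuals (the Scale-Location plot uses the square root of their absolute value)
    hatvalues(my_reg)       # leverage of each observation
    cooks.distance(my_reg)  # Cook's Distance for each observation

    # Observations whose Cook's Distance exceeds 4/n, a common rule of thumb for influence:
    which(cooks.distance(my_reg) > 4 / nrow(iris))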

    What I infer from these plots:

    1. From the Residuals vs Fitted Values plot, the points are randomly scattered around the horizontal line at zero, which suggests that the linearity assumption is met. The homoscedasticity assumption also appears to be satisfied, as the spread of the points is relatively consistent across the range of fitted values.

    2. In the Normal Q-Q plot, the points closely follow the diagonal line, which suggests that the residuals are approximately normally distributed. The plot also shows no obvious outliers.

    3. In the Scale-Location plot, the points are spread fairly evenly across the range of fitted values and there is no funnel shape, which indicates that the homoscedasticity condition has been met.

    4. From the Residuals vs Leverage plot, most of the points do not have enough leverage to affect the regression. One point has a leverage of about 0.07, but it is not an outlier since its standardized residual is close to zero. Observations 42, 118, and 132 are flagged as outliers, but they have little leverage, so they are unlikely to have much effect on the regression coefficients.

      From the linear regression I ran, the Gauss-Markov assumptions do not appear to be seriously violated; the formal tests sketched below can be used to double-check the visual diagnostics.
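
      To complement the visual diagnostics, formal tests can be run on the residuals. This is a sketch assuming the lmtest package is available for the Breusch-Pagan test (shapiro.test ships with base R):

      shapiro.test(residuals(my_reg))   # null hypothesis: the residuals are normally distributed

      # install.packages("lmtest")      # if the package is not already installed
      library(lmtest)
      bptest(my_reg)                    # Breusch-Pagan test; null hypothesis: homoscedastic errors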

      Transforming the variable and plotting the graph:

      my_reg_1 <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)
      summary(my_reg_1)
      ## 
      ## Call:
      ## lm(formula = Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)
      ## 
      ## Residuals:
      ##      Min       1Q   Median       3Q      Max 
      ## -1.63153 -0.62177 -0.08282  0.50531  2.33336 
      ## 
      ## Coefficients:
      ##                  Estimate Std. Error t value Pr(>|t|)  
      ## (Intercept)        2.4594     2.4020   1.024   0.3076  
      ## Sepal.Width        2.4312     1.5445   1.574   0.1176  
      ## I(Sepal.Width^2)  -0.4246     0.2458  -1.727   0.0862 .
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ## 
      ## Residual standard error: 0.8196 on 147 degrees of freedom
      ## Multiple R-squared:  0.03344,    Adjusted R-squared:  0.02029 
      ## F-statistic: 2.543 on 2 and 147 DF,  p-value: 0.08209
      plot(my_reg_1)

      After transforming the variable, the Residuals vs Fitted Values plot clearly shows that the points are no longer randomly scattered around the horizontal line, so the linearity assumption is not met, even though I only introduced non-linearity in the variables and not in the parameters. There isn’t a large change in the Q-Q plot, which suggests the residuals are still approximately normally distributed. The Scale-Location plot now shows a fan shape, indicating that the transformation introduced heteroscedasticity. The new thing I observed in the Residuals vs Leverage plot is that observation 16 now has enough leverage to affect the regression.
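
      To judge formally whether the squared term improves on the original model, the two fitted objects can be compared; a minimal sketch using base R:

      anova(my_reg, my_reg_1)    # F-test for adding the I(Sepal.Width^2) term
      AIC(my_reg, my_reg_1)      # information criteria for the two models; lower is better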