1 I.

The Gauss-Markov assumptions set out the criteria under which OLS is the best linear unbiased estimator (BLUE). Specifically: the relationship between the dependent and independent variables is linear in parameters; observations are randomly drawn from the population; there is no perfect collinearity among the predictors; the error term has zero mean conditional on the predictors; the errors are homoskedastic; and the errors are not autocorrelated. Normality of the errors is not one of the Gauss-Markov assumptions; it is an additional assumption used for exact t and F inference in small samples.
The relationship between dependent and independent variables being “linear in parameters” means that our concern lies in how the coefficients enter the model rather than in the variables themselves. Consider trying to determine how much a wage increases with each additional year of education. In the simplest specification, wage equals an intercept plus a coefficient times education plus an error, so each additional year changes the expected wage by the same amount and the prediction traces a straight, readily interpretable line. The assumption would still hold if education entered as a squared or logged term, because the coefficients would still enter the model additively.

Random sampling’s importance is evident when we aim to ensure our sample accurately represents the entire population. For example, if we seek to estimate the average height of students on a campus and only sample students from the basketball team, our results will likely skew higher than the true average for the entire campus. This discrepancy introduces bias into our results.
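
The sampling example can be sketched with simulated numbers. Everything below is hypothetical: the heights are drawn from an assumed normal distribution, and the "team" is simply taken to be the 15 tallest students.

```r
# Hypothetical campus heights in cm (simulated, not real data)
set.seed(10)
campus <- rnorm(10000, mean = 170, sd = 8)

# Stand-in for the basketball team: the 15 tallest students
team <- sort(campus, decreasing = TRUE)[1:15]

# A simple random sample of the same size
random_sample <- sample(campus, 15)

# Compare the three averages
c(population = mean(campus), random = mean(random_sample), team = mean(team))
```

The team average sits far above the campus average by construction, while the random sample differs from it only by sampling noise.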

The “No perfect collinearity” assumption requires that no predictor be an exact linear combination of the other predictors. For instance, when trying to predict whether a tumor is malignant, using diameter and radius simultaneously leads to issues. Diameter is always exactly twice the radius, so the two variables carry identical information and it is impossible to separate the effect of one from the effect of the other.
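
A minimal sketch of what perfect collinearity does in R, using simulated measurements. The variable names and the continuous outcome `score` are assumptions made for illustration; they are not from the original exercise.

```r
# Hypothetical tumor measurements (simulated)
set.seed(1)
radius   <- runif(50, min = 0.5, max = 3)
diameter <- 2 * radius                              # exact linear function of radius
score    <- 1 + 0.8 * radius + rnorm(50, sd = 0.2)  # made-up outcome

# lm() cannot separate the two effects and reports NA for the aliased term
fit <- lm(score ~ radius + diameter)
coef(fit)

# The design matrix has 3 columns but only 2 linearly independent ones
X <- cbind(1, radius, diameter)
qr(X)$rank
```

Rather than failing, `lm()` drops the redundant column and returns `NA` for its coefficient, which is a useful warning sign in practice.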

The “Zero conditional mean” assumption states that, for any values of the predictors, the expected value of the error term is zero; in other words, the unobserved factors are unrelated to the predictors. Let’s consider forecasting ice cream sales based on seasons. Simplistically, we’d predict maximum sales during summer. However, other factors, like individual preferences to consume ice cream irrespective of the season, act as “hidden variables” affecting sales. If such a variable is related to the season and we leave it out, the model consistently overestimates or underestimates sales, and the errors no longer average out to zero for a given season.
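
The effect of a hidden variable can be sketched with a small simulation. All names and coefficients below are assumptions: `heat` plays the role of the seasonal predictor and `tourists` is a hypothetical hidden variable that is correlated with it.

```r
# Simulated data with a hidden variable correlated with the predictor
set.seed(42)
n        <- 5000
heat     <- rnorm(n)                       # observed seasonal predictor
tourists <- 0.7 * heat + rnorm(n)          # hidden variable, related to heat
sales    <- 2 + 1.0 * heat + 1.5 * tourists + rnorm(n)

full  <- lm(sales ~ heat + tourists)       # hidden variable included
short <- lm(sales ~ heat)                  # hidden variable left in the error term

coef(full)[["heat"]]    # close to the true value 1.0
coef(short)[["heat"]]   # close to 1.0 + 1.5 * 0.7 = 2.05
```

In the short model the error term contains `tourists`, so its mean depends on `heat` and the slope absorbs part of the omitted effect.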

The “Homoskedasticity” assumption requires the variance of the error term to be the same across all values of the predictors. Suppose we’re estimating house prices based on size. If the prediction errors are tightly clustered for smaller houses but widely spread for larger ones, the model exhibits heteroskedasticity. Under homoskedasticity, the spread of the errors remains the same whether we’re predicting the price of a small or a large house.
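
A rough sketch of the house-price example with simulated data, where the error standard deviation is assumed to grow in proportion to size. The units and coefficients are made up.

```r
# Simulated house prices whose noise grows with size
set.seed(7)
size  <- runif(400, min = 50, max = 300)            # hypothetical size
price <- 20 + 1.5 * size + rnorm(400, sd = 0.2 * size)

fit   <- lm(price ~ size)
small <- size < median(size)

# Residual spread for small versus large houses
sd_small <- sd(resid(fit)[small])
sd_large <- sd(resid(fit)[!small])
c(small = sd_small, large = sd_large)
```

A clearly larger spread for the large houses is the numerical counterpart of the funnel shape seen in a residual plot.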

Lastly, the “No serial correlation” assumption states that the error in one period is uncorrelated with the errors in other periods. For instance, when modeling GDP season by season, each season’s error should be unrelated to the errors of earlier seasons, so that past errors carry no information about future ones.
The Gauss-Markov assumptions concern linearity in parameters rather than in variables. A model like the following violates linearity in parameters, because the coefficient on X1 is the square of beta1. To be more specific, whether beta1 is positive or negative, squaring it always produces a non-negative value, so the model cannot capture a negative relationship between the x and y variables.
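
As an illustration of what a violation looks like, the sketch below simulates a trending series with AR(1) errors and checks the lag-1 correlation of the residuals. The series, trend, and AR coefficient of 0.8 are all assumptions.

```r
# Simulated "GDP" with a linear trend and autocorrelated errors
set.seed(3)
n       <- 200
t_index <- seq_len(n)
e       <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
gdp     <- 100 + 0.5 * t_index + e

fit <- lm(gdp ~ t_index)
r   <- resid(fit)

# Correlation between each residual and the previous one
lag1_cor <- cor(r[-1], r[-n])
lag1_cor
```

A lag-1 correlation far from zero indicates that the error in one period carries over to the next.
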

\[ y = \beta_0 + \beta_1^2X_1 + \epsilon \]
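
For contrast, and as an illustration that is not part of the original exercise, the following models remain linear in parameters even though the variables are transformed, because each coefficient still enters additively and to the first power:

\[ y = \beta_0 + \beta_1 X_1^2 + \epsilon, \qquad \log(y) = \beta_0 + \beta_1 \log(X_1) + \epsilon \]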

The second and third assumptions relate to the existence of the inverse of X’X, which the OLS solution requires. If observations are not randomly sampled, certain rows might be dependent on one another, so the sample carries less independent information than its size suggests. The “No perfect multicollinearity” assumption ensures that the columns of X are linearly independent, so that X’X is invertible and the coefficients have a unique solution. Without this assumption, multiple beta values could fit the data equally well, causing ambiguity in interpretation.
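
To make the role of the inverse concrete, here is the standard OLS solution in matrix form, assuming X is the n-by-k matrix of regressors (including a column of ones) and y is the vector of outcomes; this notation is an assumption, since the text above does not define it:

\[ \hat{\beta} = (X'X)^{-1} X'y \]

The inverse exists only when X has full column rank k, which is exactly what perfect collinearity rules out.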

The “Zero conditional mean” assumption is fundamental for unbiasedness. If it is violated, the OLS coefficient estimates are biased, so hypothesis tests and confidence intervals are centered on the wrong value and become unreliable even when the standard errors are computed correctly.
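
A short sketch of how this assumption enters the estimates themselves, using the standard matrix form of the model, \( y = X\beta + \epsilon \), which is assumed here rather than written out above:

\[ \hat{\beta} = (X'X)^{-1}X'(X\beta + \epsilon) = \beta + (X'X)^{-1}X'\epsilon \quad \Rightarrow \quad E(\hat{\beta} \mid X) = \beta + (X'X)^{-1}X'E(\epsilon \mid X) \]

The last term vanishes when \( E(\epsilon \mid X) = 0 \), in which case the estimator is centered on the true coefficients.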

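Homoskedasticity means that every error term has the same variance conditional on the predictors, so the diagonal entries of the error covariance matrix are all equal.

Lastly, the “No serial correlation” assumption implies that the errors of different observations are uncorrelated with one another, so the off-diagonal entries of the error covariance matrix are zero. Combined with homoskedasticity, the covariance matrix of the errors takes the following form: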

\[ \Sigma = E(\epsilon \epsilon') = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} \]
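
As a sketch of why this form matters, and assuming the usual matrix notation with X as the regressor matrix, substituting \( \Sigma = \sigma^2 I \) into the general variance of the OLS estimator gives the familiar simplified expression:

\[ \mathrm{Var}(\hat{\beta} \mid X) = (X'X)^{-1}X' \, \Sigma \, X(X'X)^{-1} = \sigma^2 (X'X)^{-1} \]

The standard errors reported by summary() for an lm fit are based on the right-hand side, so they are valid only when the errors are homoskedastic and uncorrelated.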

2 II.

airquality

##     Ozone Solar.R Wind Temp Month Day
## 1      41     190  7.4   67     5   1
## 2      36     118  8.0   72     5   2
## 3      12     149 12.6   74     5   3
## 4      18     313 11.5   62     5   4
## 5      NA      NA 14.3   56     5   5
## 6      28      NA 14.9   66     5   6
## 7      23     299  8.6   65     5   7
## 8      19      99 13.8   59     5   8
## 9       8      19 20.1   61     5   9
## 10     NA     194  8.6   69     5  10
## 11      7      NA  6.9   74     5  11
## 12     16     256  9.7   69     5  12
## 13     11     290  9.2   66     5  13
## 14     14     274 10.9   68     5  14
## 15     18      65 13.2   58     5  15
## 16     14     334 11.5   64     5  16
## 17     34     307 12.0   66     5  17
## 18      6      78 18.4   57     5  18
## 19     30     322 11.5   68     5  19
## 20     11      44  9.7   62     5  20
## 21      1       8  9.7   59     5  21
## 22     11     320 16.6   73     5  22
## 23      4      25  9.7   61     5  23
## 24     32      92 12.0   61     5  24
## 25     NA      66 16.6   57     5  25
## 26     NA     266 14.9   58     5  26
## 27     NA      NA  8.0   57     5  27
## 28     23      13 12.0   67     5  28
## 29     45     252 14.9   81     5  29
## 30    115     223  5.7   79     5  30
## 31     37     279  7.4   76     5  31
## 32     NA     286  8.6   78     6   1
## 33     NA     287  9.7   74     6   2
## 34     NA     242 16.1   67     6   3
## 35     NA     186  9.2   84     6   4
## 36     NA     220  8.6   85     6   5
## 37     NA     264 14.3   79     6   6
## 38     29     127  9.7   82     6   7
## 39     NA     273  6.9   87     6   8
## 40     71     291 13.8   90     6   9
## 41     39     323 11.5   87     6  10
## 42     NA     259 10.9   93     6  11
## 43     NA     250  9.2   92     6  12
## 44     23     148  8.0   82     6  13
## 45     NA     332 13.8   80     6  14
## 46     NA     322 11.5   79     6  15
## 47     21     191 14.9   77     6  16
## 48     37     284 20.7   72     6  17
## 49     20      37  9.2   65     6  18
## 50     12     120 11.5   73     6  19
## 51     13     137 10.3   76     6  20
## 52     NA     150  6.3   77     6  21
## 53     NA      59  1.7   76     6  22
## 54     NA      91  4.6   76     6  23
## 55     NA     250  6.3   76     6  24
## 56     NA     135  8.0   75     6  25
## 57     NA     127  8.0   78     6  26
## 58     NA      47 10.3   73     6  27
## 59     NA      98 11.5   80     6  28
## 60     NA      31 14.9   77     6  29
## 61     NA     138  8.0   83     6  30
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 65     NA     101 10.9   84     7   4
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 72     NA     139  8.6   82     7  11
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 75     NA     291 14.9   91     7  14
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 83     NA     258  9.7   81     7  22
## 84     NA     295 11.5   82     7  23
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     39      83  6.9   81     8   1
## 94      9      24 13.8   81     8   2
## 95     16      77  7.4   82     8   3
## 96     78      NA  6.9   86     8   4
## 97     35      NA  7.4   85     8   5
## 98     66      NA  4.6   87     8   6
## 99    122     255  4.0   89     8   7
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 102    NA     222  8.6   92     8  10
## 103    NA     137 11.5   86     8  11
## 104    44     192 11.5   86     8  12
## 105    28     273 11.5   82     8  13
## 106    65     157  9.7   80     8  14
## 107    NA      64 11.5   79     8  15
## 108    22      71 10.3   77     8  16
## 109    59      51  6.3   79     8  17
## 110    23     115  7.4   76     8  18
## 111    31     244 10.9   78     8  19
## 112    44     190 10.3   78     8  20
## 113    21     259 15.5   77     8  21
## 114     9      36 14.3   72     8  22
## 115    NA     255 12.6   75     8  23
## 116    45     212  9.7   79     8  24
## 117   168     238  3.4   81     8  25
## 118    73     215  8.0   86     8  26
## 119    NA     153  5.7   88     8  27
## 120    76     203  9.7   97     8  28
## 121   118     225  2.3   94     8  29
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4
## 128    47      95  7.4   87     9   5
## 129    32      92 15.5   84     9   6
## 130    20     252 10.9   80     9   7
## 131    23     220 10.3   78     9   8
## 132    21     230 10.9   75     9   9
## 133    24     259  9.7   73     9  10
## 134    44     236 14.9   81     9  11
## 135    21     259 15.5   76     9  12
## 136    28     238  6.3   77     9  13
## 137     9      24 10.9   71     9  14
## 138    13     112 11.5   71     9  15
## 139    46     237  6.9   78     9  16
## 140    18     224 13.8   67     9  17
## 141    13      27 10.3   76     9  18
## 142    24     238 10.3   68     9  19
## 143    16     201  8.0   82     9  20
## 144    13     238 12.6   64     9  21
## 145    23      14  9.2   71     9  22
## 146    36     139 10.3   81     9  23
## 147     7      49 10.3   69     9  24
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    NA     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30

help("airquality")

my_reg = lm(airquality$Temp ~ airquality$Ozone)
summary(my_reg)

## 
## Call:
## lm(formula = airquality$Temp ~ airquality$Ozone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.147  -4.858   1.828   4.342  12.328 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      69.41072    1.02971   67.41   <2e-16 ***
## airquality$Ozone  0.20081    0.01928   10.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.819 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.4877, Adjusted R-squared:  0.4832 
## F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16

I choose temperature as the dependent variable and ozone as the independent variable, regressing temperature on the ozone level to explain the variation in temperature.
The estimated equation is:

\[ \widehat{Temp} = 69.41072 + 0.200811 \times Ozone \]
Based on the dataset dictionary, ozone is measured in ppb and temperature is measured in degrees Fahrenheit.
Interpretation of the slope: a one-ppb increase in the ozone level is associated with an increase of 0.200811 degrees F in the predicted temperature.
Interpretation of the intercept: the predicted temperature is 69.41072 degrees F when the ozone level equals zero.
The p-value is the probability of observing a t-statistic at least as extreme as the one obtained, assuming the null hypothesis is true. Based on the summary statistics, the t-statistic for the ozone coefficient is 10.42 and its p-value is below 2e-16. The t-statistic measures how many standard errors the estimated ozone coefficient lies from zero. Under the null hypothesis that ozone has no effect on temperature, the chance of seeing a t-statistic as extreme as 10.42 is less than 2e-16. Thus, we reject the null hypothesis and conclude that the ozone coefficient is statistically significant, meaning the ozone level helps predict temperature.
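
The t-statistic and p-value in the summary can be reproduced by hand. The sketch below refits the same model so that it runs on its own; the object names `coefs`, `t_stat`, and `p_val` are introduced here for illustration.

```r
# Refit the model from above
my_reg <- lm(airquality$Temp ~ airquality$Ozone)
coefs  <- summary(my_reg)$coefficients

# t-statistic = estimate divided by its standard error
t_stat <- coefs[2, "Estimate"] / coefs[2, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = df.residual(my_reg), lower.tail = FALSE)

c(t = t_stat, p = p_val)
```

The summary prints the p-value as `<2e-16` because it falls below the smallest value R displays by default, not because it is exactly that number.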

3 III.

The Residuals vs. Fitted plot shows the residuals (the differences between actual y and predicted y) against the fitted values. Ideally, the residuals scatter randomly around 0 with no systematic pattern; they would all equal 0 only if the model predicted perfectly. Thus, this plot tells us whether a linear model is appropriate for capturing the relationship between x and y.

The Normal Q-Q plot illustrates whether the residuals are normally distributed by comparing their quantiles with those of a theoretical normal distribution. This plot relates to the normality of the errors, which is not a Gauss-Markov assumption but matters for exact inference in small samples.

The Scale-Location plot shows whether the residuals have equal variance. Thus, we expect the dots in the graph to be randomly scattered without a noticeable pattern.

The Residuals vs. Leverage plot shows outliers that have a significant effect on the model. In this plot, we look for points in the upper or lower right corners, which indicate observations with both high leverage and large residuals.

plot(my_reg)
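
As a convenience, the four diagnostic plots can be drawn on a single page. This is a small sketch using base graphics; the model is refit so the snippet runs on its own.

```r
# Refit the model from above
my_reg <- lm(airquality$Temp ~ airquality$Ozone)

# Arrange the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(my_reg)
par(old_par)   # restore the previous graphics settings
```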

From the Residuals vs. Fitted plot, we observe that the residuals are close to 0 initially, but they diverge as the predicted y (Temp) value increases. This suggests a potential issue of heteroskedasticity, indicating that prediction accuracy is not consistent across the fitted range. The Normal Q-Q plot reveals that most of the points lie within the -2 to 2 range and align with a normal distribution. However, observation 117 deviates from this trend and might be an outlier. The Scale-Location plot, which ideally should show dots evenly scattered without discernible patterns, unfortunately mirrors the Residuals vs. Fitted plot: the points are near the red line initially but deviate as we move along. Furthermore, the Residuals vs. Leverage plot confirms that observation 117 stands out as an outlier. In conclusion, the model appears to violate the homoskedasticity assumption and may have an under-fitting problem. To mitigate this, we might consider incorporating more variables or expanding the dataset used in the linear regression model.
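
The visual impression of an outlier can be cross-checked numerically. The following is a sketch using base R influence measures; the object names are introduced here, and the model is refit so the snippet is self-contained.

```r
# Refit the model from above
my_reg <- lm(airquality$Temp ~ airquality$Ozone)

res     <- resid(my_reg)            # raw residuals, named by original row
std_res <- rstandard(my_reg)        # standardized residuals
cooks   <- cooks.distance(my_reg)   # influence of each observation on the fit

# Observation with the largest absolute residual
worst <- names(which.max(abs(res)))
worst

# The three most influential observations by Cook's distance
head(sort(cooks, decreasing = TRUE), 3)
```

The names refer to row numbers in `airquality`, so they can be compared directly with the labels printed on the diagnostic plots.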

# Adding a quadratic term for Ozone
my_reg_1 <- lm(airquality$Temp ~ airquality$Ozone + I(airquality$Ozone^2))
summary(my_reg_1)

## 
## Call:
## lm(formula = airquality$Temp ~ airquality$Ozone + I(airquality$Ozone^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1553  -3.9374   0.9296   4.0393  12.9195 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           63.8614538  1.3163562  48.514  < 2e-16 ***
## airquality$Ozone       0.4896669  0.0524715   9.332 1.07e-15 ***
## I(airquality$Ozone^2) -0.0023198  0.0003987  -5.818 5.64e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.008 on 113 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.6058, Adjusted R-squared:  0.5988 
## F-statistic: 86.83 on 2 and 113 DF,  p-value: < 2.2e-16

plot(my_reg_1)

Upon comparing the summary statistics between my_reg and my_reg_1, there’s a noticeable improvement in model fit. Firstly, the R-squared value has risen from 0.4877 to 0.6058, indicating better explanatory power. Because R-squared never decreases when a term is added, the adjusted R-squared is the fairer comparison, and it also rises, from 0.4832 to 0.5988, while the residual standard error falls from 6.819 to 6.008. Additionally, the updated Residuals vs. Fitted plot reveals a more random scattering of both negative and positive residuals, further suggesting a better model fit.
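
One possible way to formalize this comparison, offered as a sketch rather than as part of the original analysis, is a nested-model F-test with `anova()` together with AIC. Both models are refit here so the snippet runs on its own.

```r
# Refit both models from above
my_reg   <- lm(airquality$Temp ~ airquality$Ozone)
my_reg_1 <- lm(airquality$Temp ~ airquality$Ozone + I(airquality$Ozone^2))

# F-test of the quadratic term: does the larger model reduce the residual sum of squares?
cmp <- anova(my_reg, my_reg_1)
cmp

# Lower AIC indicates a better trade-off between fit and complexity
aic <- AIC(my_reg, my_reg_1)
aic
```

With a single added term, the F-statistic equals the square of that term's t-statistic, so this test agrees with the significance of the quadratic coefficient in the summary.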

Weekly Discussion: Gauss Markov Assumptions and Residual Analysis

Yuqi

2023-09-24
