Discussion - Gauss-Markov Assumptions and Residual Analysis
Author
Gina Occhipinti
Gauss-Markov Assumptions
State the Gauss-Markov Assumptions
The Gauss-Markov assumptions state the following:
There is a linear relationship between X and Y.
The data must be randomly sampled from a population.
There is no perfect multicollinearity - in other words, the columns of the data (the independent variables) are linearly independent.
Exogeneity - the independent variables are not affected by factors that influence the error term.
The disturbances (error terms) in the model average out to 0 for any value of X.
There is homoskedasticity - the error variance is constant.
Non-Technical Description:
To effectively study the relationship between given inputs and outputs, the relationship should look like a line when plotted: a straight line can be drawn through the data points of the inputs and outputs together. The input and output must change proportionally together. Ideally, a small change in the input produces a correspondingly small change in the output, and the same holds for large changes.
The data must also be randomly sampled; otherwise, the sample can distort the results in unwanted ways. Ideally, each row of data in the sample should have an equal or near-equal likelihood of occurring.
For effective analysis, the inputs should not predict each other; knowing one input should not indicate how another input variable will behave. Otherwise, the model is less reliable because the effects of a given input are not fully isolated - the output appears to change partly because another input is also changing.
Additionally, the input variables cannot be affected by anything that also affects the errors in our data. This matters for a similar reason as above - to isolate the effects of the input on the output. For example, if there is another input not included in the study that affects both an included input and the output, it could introduce bias or cause the study to be inaccurate.
The last point is about how the data points fall around the best-fit line. The points should be scattered evenly around the line, keeping roughly the same spread everywhere, rather than varying widely around the line in some regions. If the spread of the data changes widely across the plot, the analysis won't work well.
Technical Description:
Linearity: The relationship between X and Y must be linear to effectively perform a regression analysis. A line is fit through the sample points to estimate the relationship that best represents the true population. If X and Y are linearly related, the beta coefficients can be correctly interpreted (i.e., a one-unit change in X changes Y by beta).
Random sample: A simple random sample is a set of n objects drawn from a population of N objects such that all possible samples of size n are equally likely (or at least nearly equally likely) to be chosen. There are other ways to sample randomly, such as stratified or systematic sampling, but simple random sampling is the most common.
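In R, a simple random sample of rows can be drawn with the base sample() function. A minimal sketch on a placeholder data frame (the names df and srs and the sample size of 100 are arbitrary choices for illustration):

set.seed(1)                                       # make the draw reproducible
df <- data.frame(id = 1:1000, x = rnorm(1000))    # placeholder data frame
srs <- df[sample(nrow(df), 100), ]                # without replacement: every 100-row subset is equally likely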
Non-collinearity: Under this criterion, the X variables cannot be correlated with each other, at least not strongly; otherwise the effects of each X variable on Y cannot be isolated. The independent variables should truly be independent. At its core, regression shows the isolated influence of each X on Y, holding the other X's constant. If the other X's are not constant and change together, this condition is violated. Collinearity also inflates the variance and standard errors of the coefficient estimates. It reduces the reliability of the model, and the p-value used to judge whether an X is statistically significant cannot be trusted (Stratascratch).
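As an illustrative check (not part of the original output), one could inspect pairwise correlations among some of the numeric predictors in the dataset used below; values near +1 or -1 would flag potential collinearity:

library(AER)                           # provides the Medicaid1986 dataset
data("Medicaid1986", package = "AER")
# pairwise correlations among three candidate predictors
round(cor(Medicaid1986[, c("age", "income", "school")]), 2)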
Exogeneity: Under this assumption, the X variables are not correlated with the error term. In other words, an exogenous X variable influences the system without being influenced back by it. With a strictly exogenous variable, the error term, which captures the effects of variables omitted from the system, is completely unaffected by X; the expectation of the error term is zero given any value of X.
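A small simulation (illustrative only) of what happens when exogeneity fails: an omitted variable z affects both x and y, so z ends up in the error term, x is correlated with that error term, and the slope estimate is biased away from its true value of 2:

set.seed(2)
z <- rnorm(1000)                       # omitted variable
x <- z + rnorm(1000)                   # x is correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(1000)   # z also drives y
coef(lm(y ~ x))                        # slope comes out near 3.5, not 2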
Homoscedasticity: Under this assumption, the variance of the error term is constant across all values of X. Visually, the data points sit at roughly the same spread around the regression line everywhere; the standard deviations (and variances) of the errors are equal for all points.
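A minimal simulated sketch of how homoskedastic versus heteroskedastic errors look to a formal test, here the Breusch-Pagan test from the lmtest package (illustrative, not part of the Medicaid analysis):

library(lmtest)
set.seed(42)
x <- runif(200, 1, 10)
y_homo <- 2 + 0.5 * x + rnorm(200, sd = 1)        # constant error variance
y_hetero <- 2 + 0.5 * x + rnorm(200, sd = x / 2)  # variance grows with x
bptest(lm(y_homo ~ x))    # should return a large p-value (no evidence of heteroskedasticity)
bptest(lm(y_hetero ~ x))  # should return a small p-value (heteroskedasticity flagged)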
Dataset Example
Medicaid Utilization Data (Medicaid1986 Dataset)
About the Data
library(AER)        # provides the Medicaid1986 dataset
data("Medicaid1986", package = "AER")
library(stargazer)  # summary-statistics tables
stargazer(Medicaid1986, type = "text")
=============================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------
visits 996 1.931 3.354 0 50
exposure 996 104.060 9.145 32 120
children 996 1.314 1.509 0 9
age 996 55.206 24.961 16 105
income 996 8.191 3.631 0.500 17.500
health1 996 -0.00001 1.437 -2.264 7.217
health2 996 0.00002 0.740 -2.177 3.048
access 996 0.398 0.184 0.000 1.000
school 996 9.029 4.354 0 18
---------------------------------------------
# cite: Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
This dataset contains cross-section data originating from the 1986 Medicaid Consumer Survey. The data comprise two groups of Medicaid eligibles at two sites in California (Santa Barbara and Ventura counties): a group enrolled in a managed-care demonstration program and a fee-for-service comparison group of non-enrollees.
my_reg <- lm(health1 ~ age, data = Medicaid1986)
summary(my_reg)
Call:
lm(formula = health1 ~ age, data = Medicaid1986)
Residuals:
Min 1Q Median 3Q Max
-2.3943 -1.0802 -0.2948 0.8634 7.2274
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.258869 0.110248 -2.348 0.0191 *
age 0.004689 0.001820 2.577 0.0101 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.433 on 994 degrees of freedom
Multiple R-squared: 0.006635, Adjusted R-squared: 0.005635
F-statistic: 6.639 on 1 and 994 DF, p-value: 0.01012
The output compares the independent variable age against the dependent variable health1. We can see from the summary of the regression model that when age is 0 years old, the predicted health1 is -0.259. health1 is the first principal component (divided by 1000) of health indicators including functional limitations, acute conditions, and chronic conditions. The intercept is not useful in this example because age would not be 0, except perhaps for newborns under 1, and the minimum age in the dataset is 16 years.
The coefficients show that as age increases by 1 year, health1 increases by about 0.005. The p-value, which indicates whether the relationship observed in the sample also holds in the larger population, is 0.0101, so age is statistically significant at a significance level (alpha) of 0.05. This indicates there is sufficient evidence in our sample to conclude that a non-zero relationship exists (Statistics By Jim).
The effect does not have a large economic magnitude, since an additional year of age increases health1 by only a small amount (0.005). The R-squared is 0.006635 and the adjusted R-squared is 0.005635; both are very low and indicate the model explains only about 0.7% of the variance in health1. Overall, this suggests that age is not a strong predictor of health.
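As a quick complementary check (not shown in the original output), the confidence interval for the age coefficient tells the same story; an interval that excludes zero is consistent with significance at the 5% level:

confint(my_reg, level = 0.95)  # 95% confidence intervals for the intercept and age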
Regression Plots
plot(my_reg)
The Residuals vs. Fitted plot shows whether the appropriate type of model (linear) was used for the dataset. The residuals are the differences between the observed and predicted values; in the my_reg model, each residual is the difference between the actual health1 and the health1 the model predicted.
The Scale-Location plot shows whether the residuals are spread equally across the fitted values, as a check for homoskedasticity.
The Residuals vs. Leverage plot shows influential data points that have a big effect on the linear model.
The Q-Q Residuals plot shows whether the residuals follow a normal distribution. Ideally, the points should fall along the diagonal reference line.
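One convenient way to view all four diagnostic plots at once (a presentation choice, not required by the analysis):

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(my_reg)
par(mfrow = c(1, 1))  # reset the plotting layout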
Violations of the Gauss-Markov Assumptions
From the plots, there is some clustering in the data, which could point to a missing predictor correlated with age, or the relationship with health1 may not be linear. The trend line is roughly horizontal at 0 (in the Residuals vs. Leverage plot, for example), which is important for an accurate model. However, there are some outliers far from the mean of age, indicating those points could affect the model's estimates. There could be some heteroskedasticity as well, because the spread of the residuals varies across the fitted values, as shown in both the Residuals vs. Fitted and Scale-Location plots. From the Q-Q plot, there are deviations from the diagonal line in the tails, indicating the residuals are not completely normally distributed.
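These visual impressions can be backed up with formal tests; a sketch assuming the lmtest package is installed (neither test appears in the original output):

library(lmtest)
bptest(my_reg)                   # Breusch-Pagan: a small p-value suggests heteroskedasticity
shapiro.test(residuals(my_reg))  # Shapiro-Wilk: a small p-value suggests non-normal residuals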
my_reg2 <- lm(health1 ~ log(age), data = Medicaid1986)
summary(my_reg2)
Call:
lm(formula = health1 ~ log(age), data = Medicaid1986)
Residuals:
Min 1Q Median 3Q Max
-2.4157 -1.0682 -0.2877 0.8717 7.1935
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.11080 0.34212 -3.247 0.00121 **
log(age) 0.28570 0.08722 3.276 0.00109 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.43 on 994 degrees of freedom
Multiple R-squared: 0.01068, Adjusted R-squared: 0.009684
F-statistic: 10.73 on 1 and 994 DF, p-value: 0.001091
plot(my_reg2)
As a test, one can apply transformations to the variables. It is not possible to take the natural log of health1 because it contains negative values, but one can take the natural log of age. This transformation increased the coefficient while maintaining statistical significance: a one-unit increase in log(age) increases health1 by about 0.286 (equivalently, a 1% increase in age is associated with an increase of roughly 0.003 in health1), significant at the 0.01 level (p = 0.0011). However, analyzing the plots, the Gauss-Markov assumptions are still violated in similar ways, showing heteroskedasticity in the residuals.
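When heteroskedasticity persists, a common remedy is to keep the coefficient estimates but recompute heteroskedasticity-consistent (robust) standard errors, sketched here with the sandwich and lmtest packages (both loaded by AER):

library(sandwich)
library(lmtest)
# same coefficients as summary(my_reg2), but with HC1 robust standard errors
coeftest(my_reg2, vcov = vcovHC(my_reg2, type = "HC1"))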
my_reg3 <- lm(health1 ~ access, data = Medicaid1986)
summary(my_reg3)
Call:
lm(formula = health1 ~ access, data = Medicaid1986)
Residuals:
Min 1Q Median 3Q Max
-2.3549 -1.1322 -0.3342 0.8470 6.9726
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2444 0.1081 2.261 0.0240 *
access -0.6139 0.2463 -2.492 0.0129 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.433 on 994 degrees of freedom
Multiple R-squared: 0.006209, Adjusted R-squared: 0.005209
F-statistic: 6.21 on 1 and 994 DF, p-value: 0.01286
plot(my_reg3)
Trying other variables to predict health1, such as access, shows less violation of the Gauss-Markov assumptions and points to an effect of the availability of health services: greater access is associated with a lower health1 score (coefficient -0.614). However, this is still not a strong model, as shown by the R-squared of 0.006209.
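A natural next step (exploratory, with an arbitrary choice of predictors rather than a claim about the true model) would be to combine several candidate variables in a multiple regression and re-examine the diagnostics:

my_reg4 <- lm(health1 ~ age + access + income + school, data = Medicaid1986)
summary(my_reg4)
par(mfrow = c(2, 2))  # view the four diagnostic plots together
plot(my_reg4)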