Gauss-Markov Assumptions and Residual Analysis

Part I

Gauss-Markov Assumptions

The Gauss-Markov assumptions are the conditions we usually rely on when we run an ordinary least squares (OLS) regression. First, the model has to be linear in its parameters, so the dependent variable equals a weighted sum of the explanatory variables plus an error term. Second, the data should come from a random sample. Third, the explanatory variables cannot be perfect copies (exact linear combinations) of each other. Fourth, the error term should have an average value of zero at every value of the explanatory variables. Fifth, the variance of the error term should be constant across observations. When these assumptions hold, OLS is the best linear unbiased estimator, meaning it has the smallest variance among all linear unbiased estimators. Normality is often discussed with regression too, but technically it is not one of the core Gauss-Markov assumptions.

Plain-English Explanation

In plain English, these assumptions matter because they help make the regression results believable. We need the model to be set up in a reasonable way, because if the relationship is written down badly, the regression can point us in the wrong direction before we even start. We want a random sample so the data are not stacked in favor of one type of observation, which would make the results less trustworthy. We also need the explanatory variables to provide different information, because if two variables are basically the same thing, the model cannot tell which one is doing the work. Another big assumption is that the things we leave out of the model are not secretly connected to the variables we included. If they are connected, the regression may assign too much or too little effect to the wrong variable. We also hope the amount of unexplained noise is fairly stable across the sample, because if the noise changes a lot from one group to another, then the usual measures of precision can be misleading. Overall, these assumptions are what allow us to treat the regression as more than just a pattern in one sample and instead as a useful estimate of a real relationship.

Technical Explanation

For a more technical explanation, the first assumption is that the model is linear in parameters and correctly specified, so OLS is estimating the relationship we actually intend to study. The random sampling assumption means the observations are independent draws from the same population, which supports consistency and the usual asymptotic results. No perfect multicollinearity means no regressor is an exact linear combination of the others, so each coefficient can be separately identified. The zero conditional mean assumption, written as $E(u \mid X) = 0$, is the most important one: it says the expected error is zero at every value of the regressors, which implies that the regressors are uncorrelated with the error term. Together with the first three assumptions, that is what gives us unbiased coefficient estimates. If it fails because of omitted variables, simultaneity, or measurement error, the coefficients are generally biased and inconsistent. The homoskedasticity assumption, $Var(u \mid X) = \sigma^2$, means the variance of the error term is constant across observations. It is needed for the classic Gauss-Markov result that OLS is BLUE (the best linear unbiased estimator). If homoskedasticity does not hold, OLS is still unbiased under exogeneity, but it is no longer the minimum-variance linear unbiased estimator, and the usual standard errors are unreliable unless we replace them with heteroskedasticity-robust ones.
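
To make the zero conditional mean assumption concrete, here is a minimal simulation sketch of omitted-variable bias. Everything in it is made up for illustration: the variable names (x1, x2, y), the sample size, and the true coefficients are assumptions, not part of the diamonds analysis. Because x2 is correlated with x1, leaving it out pushes it into the error term, and the slope on x1 absorbs part of its effect.

```r
# Simulated illustration of omitted-variable bias (all numbers are hypothetical)
set.seed(123)
n  <- 10000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)       # x2 is correlated with x1
u  <- rnorm(n)                  # error term, independent of x1 and x2
y  <- 1 + 2 * x1 + 3 * x2 + u   # true slope on x1 is 2

long_fit  <- lm(y ~ x1 + x2)    # correctly specified model
short_fit <- lm(y ~ x1)         # x2 omitted, so it ends up in the error term

coef(long_fit)["x1"]            # close to the true value, 2
coef(short_fit)["x1"]           # close to 2 + 3 * 0.5 = 3.5
```

The gap between the two slopes does not shrink as n grows, which is why a violation of $E(u \mid X) = 0$ makes OLS inconsistent as well as biased.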

Part II

The dataset I am using here is diamonds from the ggplot2 package. This is a cross-sectional dataset with 53,940 observations, far more than 120, so it works well for this assignment. In my regression, the dependent variable is price, which is measured in U.S. dollars, and the independent variable is carat, which measures the weight of the diamond in carats. The simple regression I am estimating can be written as:

\[ price_i = \beta_0 + \beta_1 carat_i + u_i \]

In this equation, $\beta_0$ is the intercept, which represents the predicted price of a diamond when carat is equal to zero. $\beta_1$ is the slope coefficient, and it tells us how much the predicted price changes when carat increases by one unit. The $u_i$ term captures all the other factors that affect diamond price but are not included in this simple model. I stored the regression results in an object called my_reg, and the code also prints the coefficients so I can use them later when I interpret the regression results.

# Load dataset
library(ggplot2)
data("diamonds")

# Keep only the two variables we need
my_data <- diamonds[, c("price", "carat")]

# Run simple linear regression and store results in my_reg
my_reg <- lm(price ~ carat, data = my_data)

# Print regression output
summary(my_reg)

## 
## Call:
## lm(formula = price ~ carat, data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

# Print coefficients only
coef(my_reg)

## (Intercept)       carat 
##   -2256.361    7756.426

The regression results show a strong positive relationship between diamond weight and diamond price. The estimated equation is:

\[ \widehat{price}_i = -2256.36 + 7756.43 \, carat_i \]

The slope coefficient on carat is 7756.43, which means that, on average, a one-carat increase in diamond weight is associated with about a $7,756 increase in price. A full carat is a large step for a diamond, so the coefficient is large in dollar terms and the economic magnitude is clearly meaningful. The intercept is -2256.36, which would be the predicted price when carat is equal to zero. In practice, that number is not very meaningful on its own, because a diamond cannot have a weight of zero carats. It is an extrapolation outside the data, and it is mainly there to position the regression line.
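
As a quick check on this interpretation, the sketch below plugs a few carat values into the estimated equation. It uses only the two coefficients reported above; the helper name predict_price and the carat values chosen are hypothetical, picked for illustration.

```r
# Coefficients copied from the regression output above
b0 <- -2256.36   # intercept
b1 <-  7756.43   # slope on carat

# Fitted price implied by the estimated line (hypothetical helper)
predict_price <- function(carat) b0 + b1 * carat

predict_price(1.0)                        # 5500.07
predict_price(1.0) - predict_price(0.9)   # a 0.1-carat step: about 775.64
predict_price(0.2)                        # negative: about -705.07
-b0 / b1                                  # fitted price crosses zero near 0.29 carat
```

The negative fitted price for a small stone is one more sign that the intercept should not be read literally and that a straight line may be too simple at the low end of the carat range.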

Both coefficients are statistically significant at the 1% level, since their p-values are far below 0.01. The p-values are reported as smaller than 2e-16, so under the usual OLS assumptions there is extremely strong evidence against the null hypothesis that either coefficient is zero. The model also has an $R^2$ of 0.8493, which means that about 84.9% of the sample variation in diamond prices is explained by carat alone. That is a very high share for a simple regression with only one explanatory variable. Overall, this suggests that carat is a very important predictor of diamond price in this dataset.
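
The $R^2$ figure can be rebuilt by hand, which shows what "variation explained" means. This is a sketch that refits the same regression so it runs on its own; the object names (fit, ssr, sst, r2_manual) are mine and are not used elsewhere in the report.

```r
library(ggplot2)   # only needed for the diamonds data
data("diamonds")

fit <- lm(price ~ carat, data = diamonds)

ssr <- sum(resid(fit)^2)                                # unexplained variation
sst <- sum((diamonds$price - mean(diamonds$price))^2)   # total variation
r2_manual <- 1 - ssr / sst

r2_manual                               # matches summary(fit)$r.squared
cor(diamonds$price, diamonds$carat)^2   # same number in a one-regressor model
```

With a single regressor and an intercept, $R^2$ is just the squared sample correlation between price and carat, so a high value describes how tightly the two move together, not whether the line is the right functional form.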

Part III

# Show the 4 default regression diagnostic plots in one window
par(mfrow = c(2, 2))
plot(my_reg)

par(mfrow = c(1, 1))

When I look at the four default regression diagnostic plots, I use each one to check a different part of model fit. The Residuals vs Fitted plot helps me see whether the model is missing some pattern, because the residuals should look like a random cloud around zero if the linear form fits well. If I see a curve, a funnel shape, or some other clear pattern, that tells me the relationship may not be fully captured by a straight-line model or that the error variance is changing. The Normal Q-Q plot compares my residuals to what they would look like if they followed a normal distribution, and I usually want the points to stay close to the reference line. If the points bend away a lot in the tails, that suggests outliers or non-normal residuals. The Scale-Location plot is another way to check whether the spread of the residuals stays fairly constant across fitted values, and ideally I want a flat band with no strong upward or downward pattern. The Residuals vs Leverage plot helps me identify observations that have unusual predictor values and may have too much influence on the regression line. As a rule of thumb, I look for randomness in the first plot, closeness to the line in the Q-Q plot, a roughly even spread in the Scale-Location plot, and no highly influential points dominating the Leverage plot. If several of these plots show patterns at the same time, I take that as a sign that the model may fit the data poorly or that some assumptions need more attention.
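
The plots can be backed up with the numbers they are drawn from. The sketch below pulls those quantities out with base R functions (rstandard, hatvalues, cooks.distance). The cutoffs of 3 and twice the mean leverage are common rules of thumb that I am assuming here, not formal tests, and the object names are my own.

```r
library(ggplot2)   # only needed for the diamonds data
data("diamonds")
fit <- lm(price ~ carat, data = diamonds)

std_res <- rstandard(fit)        # standardized residuals (Q-Q and Scale-Location plots)
lev     <- hatvalues(fit)        # leverage (x-axis of the Residuals vs Leverage plot)
cooks   <- cooks.distance(fit)   # influence measure drawn as contours in that plot

# Rule-of-thumb screens; the cutoffs are conventions, not formal tests
sum(abs(std_res) > 3)                     # unusually large residuals
sum(lev > 2 * mean(lev))                  # high-leverage observations
head(sort(cooks, decreasing = TRUE), 5)   # five most influential rows
```

In a sample this large, some flagged observations are expected by chance, so the counts are best read alongside the plots and not as a verdict on their own.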

For my regression, the plots suggest that the simple model of price on carat has some clear problems. I see evidence of nonconstant variance, because the spread of the residuals gets larger as the fitted values increase, and I also see departures from normality in the Q-Q plot. There are also a few influential and unusual observations, and the Residuals vs Fitted plot suggests the relationship may not be perfectly captured by one straight line. So, I would say some Gauss-Markov assumptions, especially constant variance, are not holding very well. Heteroskedasticity by itself does not bias the coefficient estimates, but it makes the usual standard errors and t-statistics unreliable, and the curvature in the Residuals vs Fitted plot means the linear functional form is also questionable. The slope is still informative as an average association across the sample, but the reported precision should not be taken at face value.
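
One common response to nonconstant variance is to keep the OLS coefficients and recompute the standard errors with a heteroskedasticity-robust formula. The sketch below builds the HC1 (White) sandwich estimator by hand in base R. The object names are my own, and choosing HC1 over the other HC variants is an assumption made for illustration.

```r
library(ggplot2)   # only needed for the diamonds data
data("diamonds")
fit <- lm(price ~ carat, data = diamonds)

X <- model.matrix(fit)   # n x k design matrix (intercept and carat)
u <- resid(fit)          # OLS residuals
n <- nrow(X)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * u)      # sum over i of u_i^2 * x_i x_i'
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

classical_se <- sqrt(diag(vcov(fit)))   # the standard errors printed by summary()
robust_se    <- sqrt(diag(vcov_hc1))    # heteroskedasticity-robust versions
cbind(classical_se, robust_se)
```

The coefficients themselves do not change; only the standard errors, and therefore the t-statistics and p-values, are affected. If the sandwich package is installed, vcovHC(fit, type = "HC1") should reproduce the same matrix.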

If I transform the variables, I would expect the model fit to improve, especially because diamond prices often rise nonlinearly with carat and the residual spread increases at higher fitted values. A log transformation such as lm(log(price) ~ log(carat), data = my_data) would probably reduce the funnel shape and make the relationship closer to linear. The tradeoff is that the coefficients become a little harder to explain: the slope in a log-log model is an elasticity, so it gives the approximate percentage change in price associated with a 1% increase in carat, not a dollar change.
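
A sketch of that comparison is below. It refits the level model next to the log-log model so the two Residuals vs Fitted plots can be viewed side by side; the object names level_fit and log_fit are my own, and the block is meant as a starting point, not a finished analysis.

```r
library(ggplot2)   # only needed for the diamonds data
data("diamonds")

level_fit <- lm(price ~ carat, data = diamonds)
log_fit   <- lm(log(price) ~ log(carat), data = diamonds)

coef(log_fit)   # slope is the estimated elasticity of price with respect to carat

# Residuals vs Fitted for each model, side by side
par(mfrow = c(1, 2))
plot(level_fit, which = 1)
plot(log_fit, which = 1)
par(mfrow = c(1, 1))
```

The $R^2$ values of the two models are not directly comparable, because one explains variation in price and the other explains variation in log(price). The residual plots are the better guide to whether the transformation helped.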


Pawemi Kumwenda

2026-04-01
