The Gauss-Markov assumptions are the basic conditions we usually rely on when we run an ordinary least squares (OLS) regression. First, the model has to be set up correctly, so the dependent variable is linear in the parameters: it equals a weighted sum of the explanatory variables plus some leftover error. Second, the data should come from a random sample. Third, no explanatory variable can be an exact linear combination of the others. Fourth, the error term should have an average value of zero given any values of the explanatory variables. Fifth, the variance of the error term should be constant across observations. When these assumptions hold, OLS is the best linear unbiased estimator (BLUE), meaning it has the smallest variance among all linear unbiased estimators. Normality is often discussed with regression too, but technically it is not one of the core Gauss-Markov assumptions.
In plain English, these assumptions matter because they help make the regression results believable. We need the model to be set up in a reasonable way, because if the relationship is written down badly, the regression can point us in the wrong direction before we even start. We want a random sample so the data are not stacked in favor of one type of observation, which would make the results less trustworthy. We also need the explanatory variables to provide different information, because if two variables are basically the same thing, the model cannot tell which one is doing the work. Another big assumption is that the things we leave out of the model are not secretly connected to the variables we included. If they are connected, the regression may assign too much or too little effect to the wrong variable. We also hope the amount of unexplained noise is fairly stable across the sample, because if the noise changes a lot from one group to another, then the usual measures of precision can be misleading. Overall, these assumptions are what allow us to treat the regression as more than just a pattern in one sample and instead as a useful estimate of a real relationship.
For a more technical explanation, the first assumption is that the
model is correctly specified and linear in parameters, so OLS is
estimating the relationship we actually intend to study. The random
sampling assumption means the observations are drawn independently from
the population, which supports consistency and the usual asymptotic
results. No perfect multicollinearity means none of the regressors is an
exact combination of the others, so each coefficient can be separately
identified. The zero conditional mean assumption, written as
\(E(u \mid X) = 0\), is the most important one because it implies that
the regressors are uncorrelated with the error term (it is actually a
stronger condition than zero correlation). That is what gives us
unbiased coefficient estimates. If that assumption fails because of
omitted variables, simultaneity, or measurement error, then the
coefficients can become biased and inconsistent. The homoskedasticity
assumption means the variance of the error term is constant across
observations, which is needed for the classic Gauss-Markov result that
OLS is BLUE. If homoskedasticity does not hold, OLS can still be
unbiased under exogeneity, but the usual standard errors are no longer
reliable unless we correct for that problem.
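To make the zero conditional mean assumption more concrete, here is a minimal simulation sketch. It does not use the diamonds data: the variables x, z, and y, the sample size, and all coefficient values are made up for illustration. The true slope on x is set to 2, and the variable z is correlated with x, so leaving z out pushes it into the error term and breaks \(E(u \mid X) = 0\).

```r
# Simulated illustration (not the diamonds data): omitted variable bias.
set.seed(42)
n <- 10000

# x and z are correlated, and both affect y. The true coefficient on x is 2.
z <- rnorm(n)
x <- 0.8 * z + rnorm(n)
y <- 1 + 2 * x + 3 * z + rnorm(n)

# Model that includes z: the error term is unrelated to x.
full_fit <- lm(y ~ x + z)

# Model that omits z: z is absorbed into the error term,
# which is now correlated with x.
short_fit <- lm(y ~ x)

slope_full  <- unname(coef(full_fit)["x"])
slope_short <- unname(coef(short_fit)["x"])

print(c(true = 2, with_z = slope_full, without_z = slope_short))
```

Under this assumed setup, the population bias in the short regression is \(3 \times 0.8 / 1.64 \approx 1.46\), so the slope without z should land well above 2 no matter how large the sample is, while the slope with z should be close to 2.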
The dataset I am using here is diamonds from the
ggplot2 package. This is a cross-sectional dataset, and it
has 53,940 observations, far more than 120, so it works well for this
assignment. In my regression, the dependent variable is
price, which is measured in U.S. dollars, and the
independent variable is carat, which measures the weight of
the diamond in carats. The simple regression I am estimating can be
written as:
\[ price_i = \beta_0 + \beta_1 carat_i + u_i \]
In this equation, \(\beta_0\) is the
intercept, which represents the predicted price of a diamond when carat
is equal to zero. \(\beta_1\) is the
slope coefficient, and it tells us how much the predicted price changes
when carat increases by one unit. The \(u_i\) term captures all the other factors
that affect diamond price but are not included in this simple model. I
stored the regression results in an object called my_reg,
and the code also prints the coefficients so I can use them later when I
interpret the regression results.
# Load dataset
library(ggplot2)
data("diamonds")
# Keep only the two variables we need
my_data <- diamonds[, c("price", "carat")]
# Run simple linear regression and store results in my_reg
my_reg <- lm(price ~ carat, data = my_data)
# Print regression output
summary(my_reg)
##
## Call:
## lm(formula = price ~ carat, data = my_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18585.3   -804.8    -18.9    537.4  12731.7
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
# Print coefficients only
coef(my_reg)
## (Intercept)       carat
##   -2256.361    7756.426
The regression results show a strong positive relationship between diamond weight and diamond price. The estimated equation is:
\[ \widehat{price}_i = -2256.36 + 7756.43 \, carat_i \]
The slope coefficient on carat is 7756.43, which means
that, on average, a one-carat increase in diamond weight is associated
with about a $7,756 increase in price. A full carat is a large change
for most diamonds, so it can be easier to think of this as roughly $776
for each additional 0.1 carat, which is still economically meaningful.
The intercept is -2256.36, which would be the predicted price when carat
is equal to zero. In practice, that number is not meaningful on its own,
because a diamond cannot have a weight of zero carats and a price cannot
be negative, so the intercept is mainly there to help fit the regression
line.
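As a quick worked example of how the estimated equation turns carat into a predicted price, the sketch below plugs a few carat values into the fitted line. The coefficients are copied from the regression output; the carat values 0.25, 0.50, and 1.00 are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the regression output
b0 <- -2256.36   # intercept
b1 <- 7756.43    # slope on carat

# Hypothetical carat values, chosen only for illustration
carat_values <- c(0.25, 0.50, 1.00)

# Predicted price = intercept + slope * carat
predicted_price <- b0 + b1 * carat_values

print(data.frame(carat = carat_values, predicted_price = predicted_price))
```

The one-carat prediction works out to about $5,500, while the 0.25-carat prediction is negative (about -$317). A negative predicted price is a hint that a straight line may not describe small diamonds well.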
Both coefficients are statistically significant at the 1% level, since their p-values are less than 0.001. In fact, the p-values are reported as smaller than 2e-16, which is extremely strong evidence against the null hypothesis that each coefficient is zero. These p-values rely on the usual OLS standard errors, which assume constant error variance. The model also has an \(R^2\) of 0.8493, which means that about 84.9% of the variation in diamond prices in this sample is explained by carat alone. That is a very high share for a simple regression with only one explanatory variable. Overall, this suggests that carat is a very important predictor of diamond price in this dataset.
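The significance statement can be checked by hand from the reported estimate and standard error. The sketch below recomputes the t statistic for carat and an approximate 95% confidence interval. It assumes the normal critical value is a good stand-in for the t critical value, which is reasonable with 53,938 degrees of freedom, and it uses the rounded numbers from the output, so the t statistic differs slightly from the printed 551.4.

```r
# Estimate and standard error for carat, copied from the regression output
est <- 7756.43
se  <- 14.07

# t statistic for the null hypothesis that the slope is zero
t_stat <- est / se

# Approximate 95% confidence interval using the normal critical value
# (an assumption that is reasonable here because the sample is very large)
crit  <- qnorm(0.975)
ci_95 <- est + c(-1, 1) * crit * se

print(t_stat)
print(ci_95)
```

The interval runs from roughly $7,729 to $7,784 per carat, which is narrow relative to the estimate. Like the reported p-values, it is only as reliable as the standard error it is built from.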
# Show the 4 default regression diagnostic plots in one window
par(mfrow = c(2, 2))
plot(my_reg)
par(mfrow = c(1, 1))
When I look at the four default regression diagnostic plots, I use each one to check a different part of model fit. The Residuals vs Fitted plot helps me see whether the model is missing some pattern, because the residuals should look like a random cloud around zero if the linear form fits well. If I see a curve, a funnel shape, or some other clear pattern, that tells me the relationship may not be fully captured by a straight-line model or that the error variance is changing. The Normal Q-Q plot compares my residuals to what they would look like if they followed a normal distribution, and I usually want the points to stay close to the reference line. If the points bend away a lot in the tails, that suggests outliers or non-normal residuals. The Scale-Location plot is another way to check whether the spread of the residuals stays fairly constant across fitted values, and ideally I want a flat band with no strong upward or downward pattern. The Residuals vs Leverage plot helps me identify observations that have unusual predictor values and may have too much influence on the regression line. As a rule of thumb, I look for randomness in the first plot, closeness to the line in the Q-Q plot, a roughly even spread in the Scale-Location plot, and no highly influential points dominating the Leverage plot. If several of these plots show patterns at the same time, I take that as a sign that the model may fit the data poorly or that some assumptions need more attention.
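The visual check for constant variance can be backed up with a formal test. The sketch below is one possible way to do it: a hand-written version of the studentized (Koenker) Breusch-Pagan test using only base R. The function name bp_test is hypothetical, and the demonstration data are simulated, not the diamonds data. The idea is to regress the squared residuals on the regressors and see whether the regressors explain them.

```r
# Studentized (Koenker) Breusch-Pagan test, written with base R only.
# bp_test is a hypothetical helper name, not a function from any package.
bp_test <- function(model) {
  u2 <- resid(model)^2               # squared residuals
  X  <- model.matrix(model)          # regressors, including the intercept column
  aux <- lm.fit(X, u2)               # auxiliary regression of u2 on the regressors
  r2 <- 1 - sum(aux$residuals^2) / sum((u2 - mean(u2))^2)
  stat <- length(u2) * r2            # LM statistic: n times auxiliary R-squared
  df <- ncol(X) - 1                  # number of slope terms
  p <- pchisq(stat, df = df, lower.tail = FALSE)
  c(statistic = stat, df = df, p_value = p)
}

# Simulated example: the error spread grows with x, so variance is not constant.
set.seed(123)
n <- 2000
x <- runif(n, 0.2, 3)
y <- 1 + 2 * x + rnorm(n, sd = x)

het_fit <- lm(y ~ x)
bp_result <- bp_test(het_fit)
print(bp_result)
```

A small p-value rejects the null hypothesis of constant error variance. Applied to the model in this document, the call would be bp_test(my_reg).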
For my regression, the plots suggest that the simple model of
price on carat has some clear problems. I see
evidence of nonconstant variance, because the spread of the residuals
gets larger as the fitted values increase, and I also see departures
from normality in the Q-Q plot. There are also a few influential and
unusual observations, and the Residuals vs Fitted plot suggests the
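relationship may not be perfectly captured by one straight line. So, I
would say some Gauss-Markov assumptions, especially constant variance,
are not holding very well. Nonconstant variance by itself does not bias
the coefficient estimates, but it makes the usual standard errors
unreliable, and the curvature suggests the linear functional form may be
misspecified. A large sample does not fix either problem, so the
estimates are still informative as a summary of the average
relationship but should be interpreted with caution.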
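One common correction when constant variance fails is to keep the OLS coefficients and replace the usual standard errors with heteroskedasticity-robust ones. The sketch below computes HC1 (White) standard errors by hand in base R. The function name robust_se is hypothetical, and the demonstration uses simulated data rather than the diamonds data.

```r
# Heteroskedasticity-robust (HC1) standard errors, written with base R only.
# robust_se is a hypothetical helper name, not a function from any package.
robust_se <- function(model) {
  X <- model.matrix(model)
  e <- resid(model)
  n <- nrow(X)
  k <- ncol(X)
  bread <- solve(crossprod(X))       # (X'X)^{-1}
  meat  <- crossprod(X * e)          # sum over i of e_i^2 * x_i x_i'
  vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
  sqrt(diag(vcov_hc1))
}

# Simulated example with error spread that grows with x
set.seed(123)
n <- 5000
x <- runif(n, 0.2, 3)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

classical <- sqrt(diag(vcov(fit)))   # usual OLS standard errors
robust    <- robust_se(fit)          # robust standard errors

print(rbind(classical = classical, robust = robust))
```

The coefficients themselves do not change; only the standard errors, and therefore the t statistics and p-values, are affected. For the model in this document, the call would be robust_se(my_reg).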
If I transform the variables, I would expect the model fit to
improve, especially because diamond prices often rise nonlinearly with
carat and the residual spread increases at higher fitted values. A log
transformation such as
lm(log(price) ~ log(carat), data = my_data) would probably
reduce the funnel shape and make the relationship more stable. The
tradeoff is that the coefficients are interpreted differently: in a
log-log model the slope is an elasticity, so it gives the approximate
percentage change in price associated with a 1% increase in carat rather
than a dollar change per carat.
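To show how the log-log slope reads as an elasticity, here is a minimal sketch on simulated data. It does not use the diamonds data: the price process below, including the constant 4000, the exponent 1.7, and the noise level, is an arbitrary assumption chosen so the true elasticity is known in advance.

```r
# Simulated illustration (not the diamonds data): log-log regression.
set.seed(1)
n <- 5000
carat <- runif(n, 0.2, 3)

# Assumed data-generating process with a true elasticity of 1.7
# and multiplicative noise, so the spread of price grows with carat.
price <- 4000 * carat^1.7 * exp(rnorm(n, sd = 0.25))

# Level model: a straight line in dollars
level_fit <- lm(price ~ carat)

# Log-log model: the slope is the elasticity of price with respect to carat
log_fit <- lm(log(price) ~ log(carat))

elasticity <- unname(coef(log_fit)["log(carat)"])
print(elasticity)
```

In this setup the estimated slope should be close to 1.7, meaning a 1% increase in carat goes with about a 1.7% increase in price. Because the two models have different dependent variables, their \(R^2\) values are not directly comparable, so the diagnostic plots are a better guide than \(R^2\) when judging the transformation.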