1. State the Gauss-Markov Assumptions. 7 sentences max.
1) Linearity:
The dependent variable is a linear function of the independent variables; more precisely, the model is linear in its parameters, plus an additive error term.
2) Full Rank (No Perfect Multicollinearity):
The independent variables are not perfectly correlated with each other, ensuring that each provides unique information.
3) Exogeneity (Zero Conditional Mean of Errors):
The error term has an expected value of zero given the independent variables.
4) Homoscedasticity:
The variance of the error terms is constant across all observations.
5) No Autocorrelation:
The error terms for different observations are uncorrelated.
6) Independence of X and Error Terms:
The independent variables are generated in a way that is not correlated with the errors.
7) Normality of Errors:
Errors are normally distributed, used mainly for hypothesis testing.
2. Explain each assumption (what does it mean) and why we need to make it as if the crowd is non-technical, i.e. understands only plain English. Avoid mathematical terms like matrix, inverse, rank, linearity, et cetera. This is a good interview question by the way. Try to explain the intuition/logic behind the assumption. 7 sentences min.
1) Linearity:
We assume that the relationship between what we are trying to predict (like a home price) and the things that affect it (like size or location) is simple enough to be captured by a straight-line rule, where each factor adds or subtracts a steady amount. This makes the prediction easier to understand and work with.
2) Full Rank (No Perfect Multicollinearity):
The factors we use to predict something should not overlap entirely. For example, if we use both a house’s number of bedrooms and total square footage, they shouldn’t provide the same information, or it will confuse the prediction process.
3) Exogeneity (Zero Conditional Mean of Errors):
The errors, or mistakes, we make when predicting should have an average value of zero, no matter what the inputs are. This ensures that our predictions are not biased in one direction.
4) Homoscedasticity:
No matter how high or low the prediction, the range of possible errors should be roughly the same. If errors are bigger for more expensive houses, our predictions could become unreliable.
5) No Autocorrelation:
The mistakes made in one prediction should not affect the mistakes in another prediction. This ensures that the errors are independent, allowing each prediction to stand on its own.
6) Independence of X and Error Terms:
The factors influencing the prediction should not be related to the unpredictable parts of the model. This ensures the model is capturing the real effect of the factors rather than just noise. The inputs may be fixed in advance or random, but they must come from a process that is unrelated (independent) to the errors.
7) Normality of Errors:
This assumption says that the unpredictable part of the model follows a regular pattern (like a bell curve). It makes statistical tests more reliable but isn’t always required.
3. Explain each assumption and why we need to make it as if the crowd has some technical background (like I assumed some matrix algebra familiarity in the lecture). 7 sentences min.
1) Linearity:
This assumes the model is of the form \[
y = X\beta + \epsilon,
\] where y is the dependent variable, X is the matrix of independent variables, and ϵ is the error term. This form allows us to solve for the coefficients β using the method of Ordinary Least Squares (OLS).
2) Full Rank (No Perfect Multicollinearity):
The matrix X must have full column rank, meaning no column is an exact linear combination of the others. If perfect multicollinearity exists, X'X is not invertible, so we cannot uniquely solve for β because some variables provide redundant information.
3) Exogeneity (Zero Conditional Mean of Errors):
The assumption \[
E[\epsilon \mid X] = 0
\] ensures that the errors have no systematic relationship with the independent variables. This prevents bias in the estimation of β, making the model’s predictions reliable.
4) Homoscedasticity:
This requires that the error terms have constant variance, meaning \[
E[\epsilon \epsilon' \mid X] = \sigma^2 I.
\] If the errors have varying variance (heteroscedasticity), the standard errors of the estimated coefficients become unreliable.
5) No Autocorrelation:
The assumption \[
E[\epsilon_i \epsilon_j \mid X] = 0 \quad \forall \, i \neq j
\] ensures that the errors for different observations are uncorrelated. Autocorrelation results in inefficiencies in coefficient estimation, particularly in time series data.
6) Independence of X and Error Terms:
The assumption that X and ϵ are independent implies that the independent variables are not affected by the random errors. This is crucial for unbiased and consistent estimates.
7) Normality of Errors:
\[
\epsilon \mid X \sim N(0, \sigma^2 I)
\] is often assumed for hypothesis testing. Under this assumption, the OLS estimates of β are themselves normally distributed, allowing the use of t-tests and F-tests. However, this is not required for the Gauss-Markov theorem to hold, only for exact finite-sample inference.
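As a quick illustration of why these assumptions matter, here is a small simulation sketch in R (the variable names and the "true" coefficient values are my own illustrative choices, not part of the assignment): data are generated to satisfy the assumptions, and OLS recovers coefficients close to the true ones.

# Minimal simulation sketch: data generated to satisfy the Gauss-Markov assumptions
set.seed(123)
n <- 200
x <- rnorm(n)                      # regressor, generated independently of the errors
eps <- rnorm(n, mean = 0, sd = 1)  # errors: mean zero, constant variance, uncorrelated
beta0 <- 2                         # illustrative "true" intercept
beta1 <- -0.5                      # illustrative "true" slope
y <- beta0 + beta1 * x + eps       # linear model of the form y = X*beta + eps
fit <- lm(y ~ x)                   # OLS fit
coef(fit)                          # estimates should be close to (2, -0.5)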
Part 2: Find a cross-sectional dataset for simplicity with more than 120 rows/observations (traditionally considered a large sample size).
Test the “Boston” dataset to check whether it meets the size requirement.
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.3.3
# The Boston dataset is typically provided by the MASS package (loaded here in case it is not already attached)
library(MASS)
# Count the rows in the Boston dataset
number_of_rows <- nrow(Boston)
# Print the number of rows
print(number_of_rows)
[1] 506
Result: 506 rows, which means we have more than 120 observations in the “Boston” dataset.
1. Run a simple linear regression. Of course, I am expecting you to explain the linear regression you are running i.e. be sure to type out the estimating equation, tell me the units of the dependent and the independent variable, and interpret the slope parameter along with the intercept term. Do you find your coefficients are statistically significant (good interview question again), and if so at what level (alpha)? Is the economic magnitude meaningful?
Use medv (median value of owner-occupied homes in $1000s) as the dependent variable (Y), and lstat (percentage of lower status of the population) as the independent variable (X). The estimating equation is: \[
medv = \beta_0 + \beta_1 \cdot lstat + \epsilon
\]
where β_0 is the intercept term: the value of medv when lstat is zero, i.e. the baseline median home value (in $1000s) when the lower-status population percentage is zero. β_1 is the slope parameter measuring the effect of lstat on medv; it represents the change in median home value (in $1000s) for each additional percentage-point increase in the lower-status share of the population. ϵ is the error term. The dependent variable medv is measured in thousands of dollars, and the independent variable lstat is measured in percentage points. A negative slope coefficient suggests that as the percentage of lower-status population increases, the median home value decreases. To assess significance, check the p-value of the slope coefficient β_1: if it is less than the chosen significance level α, typically 0.05, the coefficient is statistically significant, indicating a strong relationship between lstat and medv.
2. Store the regression results in an object called “my_reg”.
# Simple linear regression using 'medv' (median value of owner-occupied homes) as the
# dependent variable and 'lstat' (percentage of lower status of the population) as the
# independent variable.
my_reg <- lm(medv ~ lstat, data = Boston)
# Summary of the regression
summary(my_reg)
Call:
lm(formula = medv ~ lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
# Create a scatter plot with the regression line
ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(color = "blue", alpha = 0.6) +                 # Scatter plot of data points
  geom_smooth(method = "lm", se = FALSE, color = "red") +   # Add regression line
  labs(title = "Linear Regression of Median Home Value on Lower Status Population",
       x = "Lower Status Population (%)",
       y = "Median Value of Homes ($1000s)") +
  theme_minimal()                                           # Clean theme for the plot
`geom_smooth()` using formula = 'y ~ x'
When lstat is 0, the estimated β_0 is about 34.55, i.e. a baseline median home value of roughly $34,554. The estimated β_1 is about -0.95: for each 1-percentage-point increase in the lower-status population, the median home value decreases by about $950. The negative sign shows an inverse relationship between lstat and medv. Both p-values are far below 0.05 (and even 0.001), so the coefficients are statistically significant at any common level, indicating a clear relationship between x and y. The Multiple R-squared (0.5441) means approximately 54.41% of the variation in median home value (medv) is explained by the percentage of the lower-status population (lstat). The F-statistic of 601.6 with p-value < 2.2e-16 shows that the model is statistically significant overall.
The economic magnitude is also meaningful: a change of roughly $950 in median home value for each 1-percentage-point change in the lower-status population is substantial, showing that housing prices are sensitive to socio-economic factors. For potential buyers and sellers, an effect of this size matters.
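As a quick check on the significance levels cited above, the coefficient table and 95% confidence intervals can be pulled directly from the stored my_reg object; a short sketch:

# Coefficient table: estimates, standard errors, t-values, p-values
summary(my_reg)$coefficients
# 95% confidence intervals for the intercept and the lstat slope
confint(my_reg, level = 0.95)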
Part 3: Now, create the 4 linear regression plots we saw in class using the “plot(my_reg)” command in R. This video may be a good review, and so would Chapter 3 of MARR (textbook is in Dropbox folder as you already know). You can pick up the Valid and Invalid Regression Models: Anscombe’s Four Data Sets example code to play with if you wish.
1. Tell us what each of the 4 charts measures and what the logic is behind the chart setup. Also, are there any trends/rules of thumb you should rely on to analyze the model fit from these charts? Please try to make your own notes.
1) Residuals vs. Fitted Values Plot:
This plot helps check for non-linearity and homoscedasticity (constant variance of errors). The residuals should be randomly scattered around the horizontal line at zero, with no distinct patterns or funnel shapes. A clear pattern or trend in the residuals indicates that the model may not be capturing the relationship properly or that variance changes across fitted values. Random scatter in Residuals vs. Fitted plot: Indicates a good fit.
2) Normal Q-Q Plot:
This plot compares the distribution of residuals to a normal distribution. If the residuals follow a normal distribution, the points will align closely along the 45-degree line. Deviations from this line suggest that the residuals are not normally distributed, which can affect hypothesis testing and confidence intervals. Q-Q Plot close to a straight line: Suggests normality of residuals.
3) Scale-Location Plot (Spread-Location Plot):
This plot displays the square root of the standardized residuals versus the fitted values. It checks the assumption of homoscedasticity. A horizontal line with equally spread points indicates constant variance of residuals. A funnel-shaped pattern suggests heteroscedasticity, meaning the variance of residuals changes with the level of the fitted values. Constant spread in Scale-Location plot: Suggests homoscedasticity.
4) Residuals vs. Leverage Plot:
This plot helps identify influential data points that might disproportionately affect the model. Points that combine high leverage with large residuals have the greatest potential to pull the regression line toward them. Look for points in the top-right or bottom-right corners, especially beyond the dashed Cook's distance contours, as these are particularly influential. No clear outliers or high-leverage points: indicates no unduly influential observations.
In other words,
If the residuals vs. fitted plot shows random scatter, the model captures the linearity well. If the normal Q-Q plot is close to a straight line, the normality assumption holds. A consistent spread in the Scale-Location plot indicates no heteroscedasticity. The absence of outliers in the Residuals vs. Leverage plot suggests no major influential points. Overall, if these conditions are met, the Gauss-Markov assumptions are not seriously violated.
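For reference, the four diagnostic charts interpreted below come directly from the stored regression object; a minimal sketch of the command:

# Arrange the four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(my_reg)          # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))  # reset the plotting layout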
Residuals vs. Fitted Plot: The plot shows a clear pattern, indicating non-random residuals. This suggests a potential issue with non-linearity or heteroscedasticity in the model. Residuals tend to be heavily concentrated in the 15-30 range of fitted values, and they deviate significantly below the horizontal line, indicating a poor fit in this range.
Normal Q-Q Plot: There is a deviation from the 45-degree line, especially in the range of theoretical quantiles greater than 1. This suggests that the residuals are not normally distributed, possibly due to outliers or skewness in the data. The pattern parallel to the 45-degree line at higher values suggests heavy-tailed distributions.
Scale-Location Plot: The plot shows increasing variability of residuals with increasing fitted values, indicating heteroscedasticity. The spread of the residuals increases as the fitted values increase, especially between 15-30, suggesting that the variance of the errors is not constant.
Residuals vs. Leverage Plot: Most observations have low leverage, but there are a few points with high residuals and leverage. This suggests potential influential points that could disproportionately affect the model.
2. Now, given the linear regression you ran, what is the chart suggesting? Are Gauss-Markov Assumptions are seriously violated? 4 sentences max.
The Gauss-Markov assumptions appear to be violated in this model. The Residuals vs. Fitted Plot shows a clear non-random pattern, indicating non-linearity or heteroscedasticity, which violates the assumption of homoscedasticity. The Normal Q-Q Plot shows deviations from the 45-degree line, suggesting non-normality of residuals. The Scale-Location Plot indicates increasing spread of residuals with fitted values, further confirming heteroscedasticity.
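As a supplementary check beyond the visual diagnostics, formal tests can be run on the residuals; a hedged sketch, assuming the lmtest package (not loaded above) is installed:

# Breusch-Pagan test for heteroscedasticity (requires the lmtest package)
library(lmtest)
bptest(my_reg)                     # small p-value suggests non-constant error variance
# Shapiro-Wilk test for normality of the residuals
shapiro.test(residuals(my_reg))    # small p-value suggests non-normal residuals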
3. Sometimes it helps if you transform a variable (log, square root, et cetera) in terms of better model fit (overcoming problems due to nonconstant variance or nonlinearity), but it can make the interpretation of coefficients harder. Try to play around by transforming the variables and tell us if the model fit improves or not. 3 sentences max.
# Log transformation on 'medv' and 'lstat'
Boston$log_medv <- log(Boston$medv)
Boston$log_lstat <- log(Boston$lstat + 1)  # Adding 1 to avoid log(0)
# Fit a linear regression model on the transformed variables
model <- lm(log_medv ~ log_lstat, data = Boston)
# Print the summary of the regression model
summary(model)
Call:
lm(formula = log_medv ~ log_lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-0.98609 -0.13181 -0.00025 0.14011 0.80156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.58347 0.04835 94.80 <2e-16 ***
log_lstat -0.62569 0.01908 -32.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2312 on 504 degrees of freedom
Multiple R-squared: 0.6808, Adjusted R-squared: 0.6802
F-statistic: 1075 on 1 and 504 DF, p-value: < 2.2e-16
# Create a regression plot using ggplot2
ggplot(Boston, aes(x = log_lstat, y = log_medv)) +
  geom_point(color = "blue") +                             # Scatter plot
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # Regression line
  labs(title = "Regression of log(medv) on log(lstat)",
       x = "Log of LSTAT (Lower Status Population)",
       y = "Log of MEDV (Median Home Value)") +
  theme_minimal()
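The diagnostic comparison described next uses the same four plots, regenerated for the transformed model object fitted above; a brief sketch:

# Re-run the diagnostic plots on the log-log model for comparison with the original fit
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # reset the plotting layout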
The log transformation improved the spread and symmetry of residuals across the fitted values, indicating better homoscedasticity in the residual vs. fitted plot. The normal Q-Q plot now shows residuals closer to the 45-degree line, suggesting improved normality of residuals. Although the residuals vs. leverage plot is still similar to before, the spread of residuals appears more evenly distributed after the transformation.