Discussion W3-1

Part 1

1. Gauss-Markov Assumptions

There is a linear relationship between y and X.
There is no perfect multicollinearity.
The disturbances average out to 0 for any value of X.
Errors are homoskedastic and uncorrelated with one another.
X can be chosen ahead of time or come from random data, but however it’s picked, it can’t be influenced by the unmeasured factors (the error term).
Normality (optional, not required by Gauss-Markov): given X, the errors are normally distributed with a mean of zero and constant variance.

2. Each Assumption Explained (Non-Technical)

Linearity: We assume that the outcome changes by a steady amount whenever a predictor (X) changes by one unit. This makes the relationship easy to understand and lets us interpret the impact of a one-unit change.
No perfect multicollinearity: Each predictor needs to add something new. For example, you cannot include “height in inches” and “height in centimeters” together, because they carry the same information and the model can’t decide which variable is doing the work.
Zero conditional mean: On average, the model’s misses cancel out, so it does not systematically overestimate or underestimate for any subgroup.
Homoskedasticity and no autocorrelation: The points scatter around the trendline by about the same amount everywhere along it, and one observation’s error should not help predict another’s.
Strict exogeneity: The predictors should not be related to any hidden factors that also affect the outcome.
Normal errors: Assuming the errors follow a bell-shaped curve makes it easier to calculate things like p-values and confidence intervals.
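
The height example under “No perfect multicollinearity” can be tried out with simulated data. This is only a minimal sketch: the variable names (height_in, height_cm, weight) and the data-generating numbers are made up for illustration and do not come from the diamonds data used later.

```r
# Simulated data; every name and number here is hypothetical.
set.seed(1)
n <- 100
height_in <- rnorm(n, mean = 67, sd = 3)        # height in inches
height_cm <- height_in * 2.54                   # the same information in centimeters
weight    <- 2 * height_in + rnorm(n, sd = 5)   # outcome driven by height plus noise

# The two height columns are exact multiples of each other,
# so the model cannot separate their effects.
fit <- lm(weight ~ height_in + height_cm)
coef(fit)   # lm() reports NA for the redundant column
```

R does not stop with an error here. Instead, lm() drops the redundant predictor and reports its coefficient as NA, which is how perfect multicollinearity usually shows up in practice.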

3. Each Assumption Explained (Technical)

Linearity: The relationship between the dependent variable and the independent variables is linear in the parameters. In matrix notation, y = Xβ + ε. This is the form the OLS formulas are built on; unbiasedness and consistency of the estimators then follow once the remaining assumptions hold.
No perfect multicollinearity: The independent variables are not perfectly collinear. In matrix terms, X has full column rank, so X’X is invertible. This ensures that a unique OLS solution exists.
Zero conditional mean: Given the independent variables, the expected value of the error term is 0, that is, E(ε|X) = 0. Together with linearity, this implies that E(y|X) = Xβ.
Homoskedasticity and no correlation: Errors are homoskedastic and uncorrelated: E(εε’|X) = σ^2I, where I is the identity matrix. The diagonal entries mean that the variance of each error term is the same (σ^2) for all observations. The zero off-diagonal entries mean that the covariance between the error terms of any two different observations is zero.
Strict exogeneity: The independent variables may be fixed or random, but they must be generated by a process that is unrelated to the error terms. Formally, E(ε|X) = 0, which also implies that X and ε are uncorrelated.
Normal errors: Given X, we assume the error terms are normally distributed with a mean of zero and a constant variance, ε|X ~ N(0, σ^2I). This is not needed for OLS to be the best linear unbiased estimator, but it makes t-tests and F-tests exact in finite samples.
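
To see where each assumption is used, the standard results for the OLS estimator can be sketched in three lines. The hat notation for the estimator is the usual textbook convention and is an addition here, since it is not defined elsewhere in this post.

\[ \hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon \]

\[ E(\hat{\beta} \mid X) = \beta + (X'X)^{-1}X'\,E(\varepsilon \mid X) = \beta \]

\[ Var(\hat{\beta} \mid X) = (X'X)^{-1}X'\,E(\varepsilon\varepsilon' \mid X)\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1} \]

The first line needs linearity and an invertible X’X, the second needs the zero conditional mean, and the third needs homoskedastic, uncorrelated errors.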

Part 2

I am using the “diamonds” dataset, which ships with the ggplot2 package in R, since it has over 50,000 rows.

library(ggplot2)
head(diamonds)

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

nrow(diamonds)

[1] 53940

1. Simple Linear Regression

Estimating Equation:

\[ price_i = \beta_0 + \beta_1 \times carat_i + \varepsilon_i \]

Price: Diamond price in USD (dependent)
Carat: Diamond weight in carats (independent)

Interpretation:

The slope tells us how much the price is expected to change for each 1-carat increase in the weight of the diamond.
The intercept tells us the estimated price when carat = 0.

my_reg <- lm(price ~ carat, data = diamonds)
summary(my_reg)


Call:
lm(formula = price ~ carat, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-18585.3   -804.8    -18.9    537.4  12731.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2256.36      13.06  -172.8   <2e-16 ***
carat        7756.43      14.07   551.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared:  0.8493,    Adjusted R-squared:  0.8493 
F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

Interpretation of Results:

The intercept is -2,256.36, which means -$2,256.36 is the predicted price of a diamond when carat = 0. A 0-carat diamond does not exist in real life and a negative price is impossible, so the intercept has no practical meaning on its own; it simply positions the regression line.
The slope is 7,756.43. This means that for each additional carat, the predicted price of a diamond increases by $7,756.43 on average. Because carat is the only predictor, nothing else is being held constant in this model.
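
One way to make the slope concrete is to compare predictions for two stones that differ by exactly one carat. The sketch below refits the same model so it runs on its own; the data frame name new_stones is made up for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

my_reg <- lm(price ~ carat, data = diamonds)

# Hypothetical stones of 1 and 2 carats
new_stones <- data.frame(carat = c(1, 2))
pred <- predict(my_reg, newdata = new_stones)

pred        # predicted prices for the two stones
diff(pred)  # gap between them, which equals the estimated slope
```

Using the coefficients reported above, the 1-carat prediction is -2,256.36 + 7,756.43, or about $5,500, and the gap between the two predictions is the slope itself.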

Statistical Significance:

Both the slope and the intercept have p-values below 2e-16, so both are statistically significant at the 0.01, 0.05, and 0.10 alpha levels.
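
Confidence intervals express the same conclusion in a different form: a 99% interval that excludes zero corresponds to significance at the 0.01 level. A minimal sketch, refitting the model so the block is self-contained (ci_99 is just an illustrative name):

```r
library(ggplot2)  # provides the diamonds dataset

my_reg <- lm(price ~ carat, data = diamonds)

# 99% confidence intervals for the intercept and the slope
ci_99 <- confint(my_reg, level = 0.99)
ci_99
```

With standard errors as small as the ones reported above, both intervals should be narrow and well away from zero.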

Economic Magnitude:

The estimated effect is economically meaningful: one extra carat is associated with a predicted price about $7,756 higher, which is roughly five times the residual standard error of $1,549.
The R-squared value of 0.8493 means that about 85% of the variation in price can be explained by carat alone.
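
The R-squared figure can be rebuilt by hand to show what “variation explained” means. This is a sketch under the usual definition, 1 minus the residual sum of squares over the total sum of squares; the names rss, tss, r2_manual, and r2_cor are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

my_reg <- lm(price ~ carat, data = diamonds)

rss <- sum(residuals(my_reg)^2)                         # unexplained variation
tss <- sum((diamonds$price - mean(diamonds$price))^2)   # total variation in price
r2_manual <- 1 - rss / tss

# With a single predictor, R-squared is also the squared correlation
r2_cor <- cor(diamonds$price, diamonds$carat)^2

c(manual = r2_manual,
  squared_correlation = r2_cor,
  from_summary = summary(my_reg)$r.squared)
```

All three numbers should agree, which confirms that the 0.8493 in the summary is simply the share of price variation that the fitted line accounts for.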

Part 3

plot(my_reg)

1. Chart Logic

Residuals vs Fitted Plot

This plot checks for non-linearity, heteroskedasticity, and outliers. If the model fits well, the points should be randomly scattered around zero with no pattern. Patterns such as a curve or a funnel are what to look for since they suggest issues.

Q-Q Residuals

This plot checks whether the residuals are normally distributed. If the points fall far from the reference line, especially in the tails, this suggests the errors are not normal.

Scale-Location Plot

This plot checks whether the residuals are spread equally across the range of fitted values. This is how the homoskedasticity assumption is checked. A funnel shape or a sloped trend line suggests heteroskedasticity.

Residuals vs Leverage Plot

This plot helps identify influential points, if there are any. Points with high leverage and large residuals, which fall outside the Cook’s distance contours, can be problematic because they pull the fitted line toward themselves.
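
The four diagnostic plots can also be requested one at a time and arranged in a single grid, which makes them easier to compare. In plot() for lm objects, the which argument selects the plot: 1 is Residuals vs Fitted, 2 is the Q-Q plot, 3 is Scale-Location, and 5 is Residuals vs Leverage. A minimal sketch, refitting the model so it runs on its own:

```r
library(ggplot2)  # provides the diamonds dataset

my_reg <- lm(price ~ carat, data = diamonds)

old_par <- par(mfrow = c(2, 2))  # arrange the plots in a 2 x 2 grid

plot(my_reg, which = 1)  # Residuals vs Fitted
plot(my_reg, which = 2)  # Q-Q Residuals
plot(my_reg, which = 3)  # Scale-Location
plot(my_reg, which = 5)  # Residuals vs Leverage

par(old_par)             # restore the previous plotting layout
```

With nearly 54,000 points the plots can take a moment to draw.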

2. Are Assumptions Violated?

In the residuals vs fitted plot, there is a clear funnel shape, meaning the spread of the residuals changes with the fitted values. This suggests heteroskedasticity, which is a violation of the constant-variance assumption. There is also some deviation in the tails of the Q-Q residuals plot, which could mean that the errors are not perfectly normally distributed. However, this is likely a minor concern with a dataset this large, since inference on the coefficients relies less on normality as the sample grows. Overall, constant variance appears to be violated. The model might still be useful, but the reported standard errors may be unreliable.
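
The visual impression can be backed up with a formal check. The sketch below is not part of the original analysis: it computes a studentized Breusch-Pagan statistic by hand in base R, regressing the squared residuals on carat and comparing n times the R-squared of that auxiliary regression to a chi-squared distribution with 1 degree of freedom. The names aux_data, aux, lm_stat, and p_value are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

my_reg <- lm(price ~ carat, data = diamonds)

# Auxiliary regression: do the squared residuals depend on carat?
aux_data <- data.frame(res_sq = residuals(my_reg)^2,
                       carat  = diamonds$carat)
aux <- lm(res_sq ~ carat, data = aux_data)

# Test statistic: n * R-squared of the auxiliary regression
lm_stat <- nrow(aux_data) * summary(aux)$r.squared

# One regressor in the auxiliary model, so 1 degree of freedom
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)
```

A small p-value rejects the null hypothesis of constant variance, which would agree with the funnel shape seen in the plot.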

3. Transforming

# Log-transform both variables
my_reg_log <- lm(log(price) ~ log(carat), data = diamonds)
summary(my_reg_log)


Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.50833 -0.16951 -0.00591  0.16637  1.33793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
log(carat)  1.675817   0.001934   866.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 53938 degrees of freedom
Multiple R-squared:  0.933, Adjusted R-squared:  0.933 
F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

plot(my_reg_log)

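Applying a log-log transformation to both price and carat raised the R-squared value to 0.933, compared to 0.85 in the original model, although the two values are not strictly comparable because the dependent variable changed from price to log(price). The slope is now an elasticity: a 1% increase in carat is associated with roughly a 1.68% increase in price. Additionally, the residuals are more evenly spread, so the transformation helped address the heteroskedasticity issue.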
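
Because both variables are logged, the slope can be converted into percentage changes in price for a given percentage change in carat. A minimal sketch, refitting the log-log model so it runs on its own; the names b1, pct_increase_carat, and pct_change_price are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

my_reg_log <- lm(log(price) ~ log(carat), data = diamonds)

b1 <- unname(coef(my_reg_log)["log(carat)"])  # estimated elasticity

# Implied percent change in predicted price for several carat increases
pct_increase_carat <- c(1, 10, 50)
pct_change_price <- ((1 + pct_increase_carat / 100)^b1 - 1) * 100

data.frame(pct_increase_carat, pct_change_price)
```

Since the estimated elasticity is above 1, price rises more than proportionally with weight: each percentage increase in carat goes with a larger percentage increase in predicted price.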