Gauss-Markov Assumptions and Residual Analysis

Author

Song

Linear Regression of Iris

df <- iris

my_reg <- lm(df)
my_reg

Call:
lm(formula = df)

Coefficients:
      (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
           2.1713             0.4959             0.8292            -0.3152  
Speciesversicolor   Speciesvirginica  
          -0.7236            -1.0235  
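The call `lm(df)` works because base R turns a data frame passed as the formula into a formula whose response is the first column and whose predictors are all remaining columns. A minimal sketch of the more explicit spelling, assuming the column order of the built-in `iris` data set (the names `fit_implicit` and `fit_explicit` are illustrative only):

```r
# Implicit form used above: the data frame itself is passed as the "formula".
fit_implicit <- lm(iris)

# Explicit form: name the response and use `.` for every other column.
fit_explicit <- lm(Sepal.Length ~ ., data = iris)

# Both fits should report the same coefficients.
all.equal(coef(fit_implicit), coef(fit_explicit))
```

The explicit form is usually easier to read and does not depend on which column happens to come first.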

The estimation equation is:

\(\widehat{\text{Sepal.Length}} = 2.1713 + 0.4959 \times \text{Sepal.Width} + 0.8292 \times \text{Petal.Length} - 0.3152 \times \text{Petal.Width} - 0.7236 \times \text{Speciesversicolor} - 1.0235 \times \text{Speciesvirginica}\)
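To see the equation at work, the following sketch evaluates it by hand for one flower and compares the result with `predict()`. It refits the same model under the illustrative name `fit`; the choice of the first row of `iris` is arbitrary.

```r
fit <- lm(Sepal.Length ~ ., data = iris)
b <- coef(fit)

# First row of iris is a setosa flower, so both species dummies are 0.
flower <- iris[1, ]

by_hand <- b[["(Intercept)"]] +
  b[["Sepal.Width"]] * flower$Sepal.Width +
  b[["Petal.Length"]] * flower$Petal.Length +
  b[["Petal.Width"]] * flower$Petal.Width +
  b[["Speciesversicolor"]] * 0 +
  b[["Speciesvirginica"]] * 0

# predict() applies the same equation internally.
from_predict <- unname(predict(fit, newdata = flower))

all.equal(by_hand, from_predict)
```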

Dependent Variable:

Sepal.Length: Centimeters (cm)

Independent Variables:

Sepal.Width: Centimeters (cm)

Petal.Length: Centimeters (cm)

Petal.Width: Centimeters (cm)

Species: Categorical variable (setosa, versicolor, virginica), with setosa as the reference level

Interpretation of Coefficients:

Intercept: 2.1713 is the expected value of Sepal.Length for a setosa flower (the reference species) when all three numeric predictors are zero. However, since zero values for these predictors are not realistic in this context, the intercept mainly serves as a baseline value.

Sepal.Width: For each additional centimeter increase in Sepal.Width, the Sepal.Length is expected to increase by approximately 0.496 cm, holding all other variables constant.

Petal.Length: For each additional centimeter increase in Petal.Length, the Sepal.Length is expected to increase by approximately 0.829 cm, holding all other variables constant.

Petal.Width: For each additional centimeter increase in Petal.Width, the Sepal.Length is expected to decrease by approximately 0.315 cm, holding all other variables constant.

Speciesversicolor: Speciesversicolor is a dummy variable equal to 1 for versicolor and 0 otherwise, with setosa as the reference species. If the species is versicolor, the Sepal.Length is expected to be approximately 0.724 cm lower than for setosa, holding all other variables constant.

Speciesvirginica: Speciesvirginica is a dummy variable equal to 1 for virginica and 0 otherwise, with setosa as the reference species. If the species is virginica, the Sepal.Length is expected to be approximately 1.024 cm lower than for setosa, holding all other variables constant.
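The dummy-variable interpretation can be checked with a short sketch: predict Sepal.Length for two flowers that differ only in species, and compare the gap with the Speciesversicolor coefficient. The measurement values below are hypothetical and chosen only for illustration.

```r
fit <- lm(Sepal.Length ~ ., data = iris)

# Two hypothetical flowers with identical measurements but different species.
flowers <- data.frame(
  Sepal.Width = 3.0,
  Petal.Length = 4.0,
  Petal.Width = 1.2,
  Species = factor(c("setosa", "versicolor"), levels = levels(iris$Species))
)

preds <- predict(fit, newdata = flowers)

# The versicolor-minus-setosa gap should equal the dummy coefficient.
gap <- unname(preds[2] - preds[1])
all.equal(gap, coef(fit)[["Speciesversicolor"]])
```

Because the numeric predictors are held fixed, the gap does not depend on which measurement values are chosen.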

Statistical Importance:

summary(my_reg)

Call:
lm(formula = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79424 -0.21874  0.00899  0.20255  0.73103 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3068 on 144 degrees of freedom
Multiple R-squared:  0.8673,    Adjusted R-squared:  0.8627 
F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

Based on the p-values of the coefficients, Sepal.Width and Petal.Length are statistically significant at the 0.001 alpha level; Speciesversicolor and Speciesvirginica are significant at the 0.01 alpha level; and Petal.Width is the least significant, reaching only the 0.05 alpha level.

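Alpha levels are meaningful because they set the threshold a p-value must fall below for a coefficient to be called statistically significant. An alpha level of 0.05 means we accept a 5% chance of rejecting the null hypothesis (that the coefficient is zero) when it is actually true. The smaller the alpha level at which a coefficient is still significant, the stronger the evidence that the coefficient differs from zero.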
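As a sketch of how an alpha level is applied in practice, the p-values can be pulled out of the summary table and compared with a chosen threshold. The variable names and the choice of `alpha <- 0.05` are illustrative assumptions.

```r
fit <- lm(Sepal.Length ~ ., data = iris)
alpha <- 0.05

# Column "Pr(>|t|)" of the coefficient table holds the two-sided p-values.
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
significant <- p_values < alpha
print(significant)

# Equivalent view: a (1 - alpha) confidence interval that excludes zero.
ci <- confint(fit, level = 1 - alpha)
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
print(cbind(ci, excludes_zero))
```

A coefficient is significant at a given alpha exactly when its matching confidence interval does not contain zero, so the two checks should agree.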

Linear Regression Plots

plot(my_reg)

Residuals vs. Fitted

The chart checks for linearity. The residuals should scatter randomly around the horizontal line at 0. Any trend or pattern could indicate non-linearity.

Normal Q-Q

The chart checks for normal distribution. If the residuals are normally distributed, they should fall along the reference line.

Scale-Location

The chart checks for homoscedasticity. The residuals should spread equally along the horizontal line. Deviations could indicate heteroscedasticity.

Residuals vs. Leverage

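The chart checks for influential observations. Points that sit far from the main cluster, especially those with high leverage and large standardized residuals that fall outside the Cook's distance contours, can have a disproportionate impact on the model.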
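The quantities behind these four charts can also be inspected directly. The sketch below collects them in one data frame and flags observations by Cook's distance; the `4 / n` cutoff is a common rule of thumb and an assumption here, not something `plot()` itself uses.

```r
fit <- lm(Sepal.Length ~ ., data = iris)

diagnostics <- data.frame(
  fitted = fitted(fit),                 # x-axis of Residuals vs. Fitted
  residual = residuals(fit),            # y-axis of Residuals vs. Fitted
  std_residual = rstandard(fit),        # used in Q-Q, Scale-Location, Leverage
  leverage = hatvalues(fit),            # x-axis of Residuals vs. Leverage
  cooks_distance = cooks.distance(fit)  # influence measure
)

# Rule-of-thumb cutoff (assumption): flag points with Cook's distance > 4 / n.
threshold <- 4 / nrow(iris)
flagged <- diagnostics[diagnostics$cooks_distance > threshold, ]
flagged[order(-flagged$cooks_distance), ]
```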

Based on these charts, the Gauss-Markov assumptions are not seriously violated in our regression. The residuals show no pattern that would indicate non-linearity or a departure from normality. In addition, the residuals are spread evenly along the line in the Scale-Location chart, indicating that the variance is roughly constant across fitted values, which suggests homoscedasticity.
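Visual checks can be complemented with formal tests. The following is a minimal sketch in base R: a Shapiro-Wilk test on the residuals, and a hand-rolled Breusch-Pagan style test (the studentized variant, built from an auxiliary regression of squared residuals on the predictors). The auxiliary-regression construction and all variable names are illustrative choices, not part of the original analysis.

```r
fit <- lm(Sepal.Length ~ ., data = iris)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (null hypothesis: normal).
sw <- shapiro.test(res)
sw$p.value

# Constant variance: regress squared residuals on the same predictors.
sq_res <- res^2
aux <- lm(sq_res ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = iris)

# Test statistic n * R^2 is compared with a chi-squared distribution
# whose degrees of freedom equal the number of slope coefficients.
lm_stat <- nrow(iris) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(lm_stat, df = bp_df, lower.tail = FALSE)
bp_p
```

In both tests a small p-value would count as evidence against the assumption being checked, so they should be read alongside the charts rather than in place of them.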

Log Transformation

data(iris)
iris$Log_Sepal.Width <- log(iris$Sepal.Width)

my_reg <- lm(Sepal.Length ~ Log_Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)

summary(my_reg)

Call:
lm(formula = Sepal.Length ~ Log_Sepal.Width + Petal.Length + 
    Petal.Width + Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.80370 -0.21803  0.02392  0.22908  0.76674 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.1171     0.3080   6.874 1.75e-10 ***
Log_Sepal.Width     1.4321     0.2669   5.365 3.15e-07 ***
Petal.Length        0.8281     0.0696  11.897  < 2e-16 ***
Petal.Width        -0.3142     0.1542  -2.037  0.04347 *  
Speciesversicolor  -0.7417     0.2470  -3.003  0.00315 ** 
Speciesvirginica   -1.0426     0.3426  -3.044  0.00278 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3107 on 144 degrees of freedom
Multiple R-squared:  0.8639,    Adjusted R-squared:  0.8592 
F-statistic: 182.9 on 5 and 144 DF,  p-value: < 2.2e-16

The coefficient for Log_Sepal.Width is 1.4321. Because this predictor is the natural log of Sepal.Width, the coefficient means that a 1% increase in Sepal.Width is associated with an expected increase in Sepal.Length of approximately 1.4321 / 100 ≈ 0.014 cm, holding all other variables constant.
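The arithmetic behind the percentage interpretation can be sketched as follows. The code refits the log model on a copy of `iris` (named `df_log` here for illustration) and compares the exact effect of a 1% increase, which multiplies the coefficient by log(1.01), with the usual divide-by-100 shortcut.

```r
df_log <- iris
df_log$Log_Sepal.Width <- log(df_log$Sepal.Width)

fit_log <- lm(Sepal.Length ~ Log_Sepal.Width + Petal.Length +
                Petal.Width + Species, data = df_log)
b_log <- coef(fit_log)[["Log_Sepal.Width"]]

# A 1% increase in Sepal.Width raises log(Sepal.Width) by log(1.01).
exact_effect <- b_log * log(1.01)

# Shortcut: log(1.01) is close to 0.01, so divide the coefficient by 100.
approx_effect <- b_log / 100

c(exact = exact_effect, approx = approx_effect)
```

The two values differ only slightly, which is why the divide-by-100 rule is a reasonable reading for small percentage changes.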