Betty Wang - Residual Analysis

Author

Betty Wang

data("iris")

# Linear Regression Explanation:
# We are running a simple linear regression using the iris dataset, where we
# model the relationship between Sepal Length (dependent variable) and
# Petal Length (independent variable). The goal is to see how changes in
# the petal length can predict changes in sepal length.

# Estimating Equation:

\[ \text{Sepal.Length}_i = \beta_0 + \beta_1 \, \text{Petal.Length}_i + \epsilon_i \]

# Run the simple linear regression
my_reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Display the results
summary(my_reg)


Call:
lm(formula = Sepal.Length ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24675 -0.29657 -0.01515  0.27676  1.00269 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.30660    0.07839   54.94   <2e-16 ***
Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared:   0.76, Adjusted R-squared:  0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

\[ \text{Sepal.Length}_i = 4.31 + 0.41 \, \text{Petal.Length}_i + \hat{\epsilon}_i \]
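
The two estimates in the fitted equation can be reproduced by hand, which may help connect the equation to the `lm()` output. The block below is a minimal sketch using the closed-form OLS formulas for a single regressor; the names `x`, `y`, `b0` and `b1` are chosen here for illustration and do not appear in the original analysis.

```r
data("iris")

x <- iris$Petal.Length   # independent variable (cm)
y <- iris$Sepal.Length   # dependent variable (cm)

# Closed-form OLS estimates for a simple linear regression
b1 <- cov(x, y) / var(x)       # slope: covariance of x and y over variance of x
b0 <- mean(y) - b1 * mean(x)   # intercept: the line passes through the two means

# Compare with the coefficients reported by lm()
manual <- c(intercept = b0, slope = b1)
from_lm <- coef(lm(Sepal.Length ~ Petal.Length, data = iris))
round(manual, 2)
round(from_lm, 2)
```

Both vectors should agree up to floating-point error, since `lm()` solves the same least-squares problem.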

# Units:
# Dependent variable (Sepal.Length): The sepal length is measured
# in centimeters (cm).
# Independent variable (Petal.Length): The petal length is also measured
# in centimeters (cm).

# Interpretation:
# Intercept (beta_0): The intercept is 4.31 cm. When the petal length is zero,
# the expected sepal length is 4.31 cm. No flower in the data has a petal
# length near zero (the shortest petal in iris is 1.0 cm), so this value is an
# extrapolation that anchors the line rather than a realistic prediction.
# Slope (beta_1): The slope is 0.41. For every 1 cm increase in petal length,
# the sepal length is expected to increase by 0.41 cm on average, assuming a
# linear relationship.
# The model indicates a positive relationship between petal length and
# sepal length: longer petals predict longer sepals.
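
One way to make the slope interpretation concrete is to predict sepal length at two petal lengths that are 1 cm apart. This is only a sketch: the petal lengths 3 cm and 4 cm and the name `new_flowers` are hypothetical values picked for illustration, chosen to lie inside the observed range.

```r
data("iris")
my_reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Hypothetical flowers with petal lengths of 3 cm and 4 cm
new_flowers <- data.frame(Petal.Length = c(3, 4))
pred <- predict(my_reg, newdata = new_flowers)

pred                       # expected sepal length (cm) at each petal length
step <- unname(diff(pred)) # difference between the two predictions
step                       # equals the estimated slope
```

The difference between the two predictions equals the slope coefficient, which is what "a 1 cm increase in petal length" means in this model.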

# Statistical significance:
# To determine whether the slope is statistically significant, check the
# p-value associated with the coefficient of Petal.Length in the regression
# output. The reported p-value is below 2e-16, far smaller than the usual
# significance levels (alpha) of 5% or 1%, so the coefficient is highly
# statistically significant. If the true slope were zero, a t-statistic as
# extreme as the observed 21.65 would almost never occur by chance. The result
# also holds at more stringent levels such as 0.1% or 0.01%. The intercept has
# a p-value below 2e-16 as well, so both regression coefficients are
# statistically significant at all conventional alpha levels.
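
The printed `<2e-16` is a display floor used by `summary()`, not the exact p-value. As a hedged sketch, the exact value and a confidence interval for the slope can be pulled from the fitted object; the names `coef_table`, `p_slope`, `t_slope` and `ci_slope` are introduced here for illustration.

```r
data("iris")
my_reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Coefficient table as a numeric matrix
coef_table <- summary(my_reg)$coefficients

t_slope <- coef_table["Petal.Length", "t value"]
p_slope <- coef_table["Petal.Length", "Pr(>|t|)"]

# Is the slope significant at the 5%, 1% and 0.1% levels?
p_slope < c(0.05, 0.01, 0.001)

# 95% confidence interval for the slope
ci_slope <- confint(my_reg, "Petal.Length", level = 0.95)
ci_slope
```

A confidence interval that excludes zero leads to the same conclusion as the p-value at the matching significance level, and it also shows the range of slopes compatible with the data.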

# Economic Magnitude:
# The slope coefficient represents the expected change in Sepal Length for a
# one-unit change in Petal Length. That is, a 1 cm longer petal is associated
# with a sepal that is 0.41 cm longer on average. Relative to the sepal lengths
# in iris, which span 4.3 cm to 7.9 cm, a 0.41 cm change is modest but still
# meaningful.
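
To judge the size of the effect, the slope can be scaled by how much petal length actually varies in the data. The sketch below is one possible way to do this; the names `slope`, `petal_range`, `effect_range`, `effect_sd` and `std_slope` are assumptions made for this example.

```r
data("iris")
my_reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

slope <- coef(my_reg)[["Petal.Length"]]
petal_range <- range(iris$Petal.Length)   # shortest and longest petal (cm)

# Predicted difference in sepal length across the observed petal range
effect_range <- slope * diff(petal_range)

# Predicted difference for a one-standard-deviation change in petal length
effect_sd <- slope * sd(iris$Petal.Length)

# Standardized slope; in a simple regression this equals the correlation
std_slope <- effect_sd / sd(iris$Sepal.Length)

c(effect_range = effect_range, effect_sd = effect_sd, std_slope = std_slope)
```

Expressing the effect per standard deviation, rather than per centimeter, makes it easier to compare with other predictors measured in different units.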

# The 4 linear regression plots:
plot(my_reg)

# 1. The Residuals vs Fitted plot is used to assess the quality of a linear
# regression model by visualizing the relationship between the residuals
# (errors) and the fitted (predicted) values. It helps identify whether the
# assumptions of the regression model, particularly linearity and
# homoscedasticity, are being met. The X-axis represents the fitted values,
# which are the predicted values from the regression model. The Y-axis
# represents the residuals, which are the differences between the observed and
# predicted values. In a well-fitted linear regression model, the residuals
# should be randomly scattered around the horizontal line at zero
# (mean residual = 0) and no clear pattern should be observed among the
# residuals.
# 2. The Q-Q (Quantile-Quantile) Residuals plot is used to assess whether the
# residuals (errors) of a regression model follow a normal distribution.
# Normality of the errors is the assumption behind exact inference
# (e.g., t-tests, F-tests), and it matters most in small samples.
# The X-axis shows the theoretical quantiles, which are the values expected
# from a standard normal distribution. The Y-axis shows the sample quantiles,
# which are the sorted standardized residuals from the regression model. If the
# residuals are normally distributed, the points will lie close to the
# straight diagonal reference line in the plot. If the points deviate from
# the line at the ends (i.e., the tails), the residuals may have heavier or
# lighter tails than a normal distribution, that is, more or fewer extreme
# values than normal.
# 3. The Scale-Location plot is a diagnostic tool that checks for
# homoscedasticity in a linear regression model. Homoscedasticity means that
# the residuals (errors) have constant variance across all levels of the fitted
# values, which is an important assumption in linear regression. The X-axis is
# the predicted values from the regression model. The Y-axis is the square root
# of the absolute standardized residuals. Taking the absolute value folds the
# negative residuals upward so that only their size is shown, and the square
# root reduces the skewness of these values, which makes trends in the spread
# easier to see. A horizontal band of points with no pattern indicates
# homoscedasticity: the residuals are equally spread along the range of fitted
# values, so the variance of the errors is constant.
# 4. The Residuals vs Leverage plot detects influential data points in a linear
# regression model, that is, points that disproportionately affect the
# regression results. It combines information about the size of the residuals
# and the leverage of the data points. The X-axis represents the leverage of
# each data point, which measures how far its values for the independent
# variables are from their mean and therefore how strongly that observation
# pulls its own fitted value toward its observed value. The Y-axis represents
# the standardized residuals (scaled residuals), which show how far the actual
# values are from the fitted (predicted) values. The plot also includes
# Cook's distance contour lines, which show thresholds for identifying
# influential data points based on both residual size and leverage. Leverage
# values range from 0 to 1, and a value close to 1 indicates that the point
# can have a large influence on the model. Points with both high leverage and
# large residuals are potentially influential. As a rule of thumb, a
# Cook's distance > 1 suggests the point may be highly influential.
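
The quantities drawn in the four plots described above can also be extracted directly, which makes it easier to inspect individual observations. The following is a minimal sketch using base R accessor functions; the data frame name `diag_df` and its column names are chosen here for illustration.

```r
data("iris")
my_reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

diag_df <- data.frame(
  fitted    = fitted(my_reg),                # x-axis of plots 1 and 3
  resid     = resid(my_reg),                 # y-axis of plot 1
  std_resid = rstandard(my_reg),             # y-axis of plots 2 and 4
  sqrt_abs  = sqrt(abs(rstandard(my_reg))),  # y-axis of plot 3
  leverage  = hatvalues(my_reg),             # x-axis of plot 4
  cooks_d   = cooks.distance(my_reg)         # contour lines in plot 4
)

head(diag_df)

# OLS residuals average to zero when the model has an intercept
mean(diag_df$resid)

# Leverages sum to the number of estimated coefficients (2 here)
sum(diag_df$leverage)

# Observations exceeding the Cook's distance rule of thumb, if any
which(diag_df$cooks_d > 1)
```

Sorting `diag_df` by `cooks_d` or by the absolute value of `std_resid` is a quick way to list the observations that stand out in the plots.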

# The 4 charts for the regression suggest the following:
# 1. Residuals vs Fitted: a curved pattern is observed, indicating that the
# relationship between Y and X may be non-linear. The residuals are also not
# evenly spread around zero, so the homoscedasticity assumption does not
# appear to be met and the accuracy of the predicted sepal length varies
# across the range of fitted values.
# 2. Q-Q Residuals: the residuals are approximately normally distributed, with
# a few points deviating from the diagonal line at the ends (i.e., the tails),
# indicating that the residuals may have slightly heavy or light tails.
# 3. Scale-Location: a curve appears instead of a random scatter, suggesting
# that the spread of the residuals changes with the fitted values
# (heteroscedasticity). This may itself be a symptom of non-linearity in the
# relationship between the independent (petal length) and dependent
# (sepal length) variables that the linear model is not fully capturing.
# 4. Residuals vs Leverage: points with large standardized residuals (far from
# the horizontal line at zero) indicate that the model is not fitting these
# observations well.
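
Visual readings of diagnostic plots are subjective, so it can help to back them with formal tests. The sketch below is not part of the original analysis: it uses the Shapiro-Wilk test for normality and a hand-computed studentized (Koenker) Breusch-Pagan statistic for heteroscedasticity, chosen here because both need only base R. The names `e`, `aux`, `lm_stat` and `bp_p` are illustrative.

```r
data("iris")
my_reg <- lm(Sepal.Length ~ Petal.Length, data = iris)
e <- resid(my_reg)

# Normality of the residuals: small p-values speak against normality
sw <- shapiro.test(e)

# Studentized Breusch-Pagan test computed by hand:
# regress the squared residuals on the regressor, then LM = n * R^2,
# which is compared with a chi-squared distribution (df = 1 regressor)
e2 <- e^2
aux <- lm(e2 ~ iris$Petal.Length)
lm_stat <- nobs(aux) * summary(aux)$r.squared
bp_p <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = lm_stat, bp_p = bp_p)
```

A small `bp_p` would support the reading of non-constant variance. This version of the test only detects variance that changes linearly with petal length, so a large value does not prove homoscedasticity.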

# At least 2 of the Gauss-Markov Assumptions appear to be violated:
# 1. Linearity: The model is linear in its parameters, but the curved
# residuals suggest that the expected sepal length is not a straight-line
# function of petal length, so the specified functional form is inadequate.
# 2. Homoscedasticity: The variance of the error term is not constant across
# all levels of the independent variables.
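
One simple way to probe the linearity assumption is to add a squared term and test whether it improves the fit. This is a sketch of such a check, not a model proposed in the original analysis; the name `my_reg_quad` and the choice of a quadratic term are assumptions made here for illustration.

```r
data("iris")
my_reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Same model with an added squared term for petal length
my_reg_quad <- lm(Sepal.Length ~ Petal.Length + I(Petal.Length^2), data = iris)

# F-test of the nested models: does the squared term reduce the
# residual sum of squares by more than chance would explain?
cmp <- anova(my_reg, my_reg_quad)
cmp

p_quad <- cmp[["Pr(>F)"]][2]   # p-value for the added term
p_quad
```

A small `p_quad` would indicate that a straight line misses curvature in the data, which is consistent with a curved Residuals vs Fitted plot.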

# Transformation: dealing with heteroscedasticity
# Apply a logarithmic transformation to the dependent variable (Sepal.Length)
my_reg_log <- lm(log(Sepal.Length) ~ Petal.Length, data = iris)
# log(Sepal.Length): apply the natural logarithm to Sepal.Length. This can help
# stabilize the variance and make the relationship more linear.
# Petal.Length: The independent variable remains unchanged.
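
With a logged dependent variable, the slope is no longer measured in centimeters. It is approximately the proportional change in sepal length per 1 cm of petal length. The sketch below converts the coefficient into a percentage; the names `b1_log` and `pct_change` are introduced here for illustration.

```r
data("iris")
my_reg_log <- lm(log(Sepal.Length) ~ Petal.Length, data = iris)

b1_log <- coef(my_reg_log)[["Petal.Length"]]

# Exact percentage change in sepal length for a 1 cm longer petal
pct_change <- (exp(b1_log) - 1) * 100

# The common approximation simply reads the coefficient as a proportion
approx_pct <- b1_log * 100

c(exact = pct_change, approximate = approx_pct)
```

The two values are close because the coefficient is small. For larger coefficients the exact formula should be preferred.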

# Display the summary of the new model
summary(my_reg_log)


Call:
lm(formula = log(Sepal.Length) ~ Petal.Length, data = iris)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.218301 -0.046171 -0.000453  0.051148  0.182225 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.491305   0.013026  114.49   <2e-16 ***
Petal.Length 0.070273   0.003139   22.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06764 on 148 degrees of freedom
Multiple R-squared:  0.772, Adjusted R-squared:  0.7705 
F-statistic: 501.1 on 1 and 148 DF,  p-value: < 2.2e-16

# The 4 linear regression plots:
plot(my_reg_log)

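# After the transformation, the Multiple R-squared and the Adjusted R-squared
# are slightly higher, 0.772 versus 0.76 and 0.7705 versus 0.7583
# respectively. The two models have different dependent variables
# (Sepal.Length versus log(Sepal.Length)), so these values are not directly
# comparable and the increase should not be read as a better fit on its own.
# The p-value of the F-test remains below 2.2e-16.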
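
R-squared values from models with different dependent variables are measured on different scales. One way to compare the two fits on an equal footing is to convert the log-model predictions back to centimeters first. The sketch below does this with a naive `exp()` back-transform, which ignores retransformation bias; the helper `rmse` and the names `pred_lin`, `pred_log`, `r2_lin` and `r2_log` are assumptions made for this example.

```r
data("iris")
my_reg     <- lm(Sepal.Length ~ Petal.Length, data = iris)
my_reg_log <- lm(log(Sepal.Length) ~ Petal.Length, data = iris)

y <- iris$Sepal.Length
pred_lin <- fitted(my_reg)            # predictions in cm
pred_log <- exp(fitted(my_reg_log))   # naive back-transform to cm

# Root mean squared error on the original scale (cm)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(linear = rmse(y, pred_lin), log_model = rmse(y, pred_log))

# Squared correlation between observed and predicted values (cm scale)
r2_lin <- cor(y, pred_lin)^2
r2_log <- cor(y, pred_log)^2
c(linear = r2_lin, log_model = r2_log)
```

For the untransformed model, the squared correlation equals the Multiple R-squared reported by `summary()`, so it serves as a like-for-like benchmark for the back-transformed log model.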

# The 4 plots have not changed much, indicating that the log transformation of
# the dependent variable does not materially improve the model fit. The
# problem of non-linearity remains, and the relationship needs a better model.