2025-04-10

Introduction

Residuals are the differences between the observed data and the values a model predicts for them. They are an important tool in Machine Learning and statistics for checking the assumptions behind a fitted model. This presentation covers methods of residual analysis and visualization using the Iris dataset in R.

The Iris Dataset

The Iris dataset contains 150 observations of Iris flowers. Each observation has four features (sepal length, sepal width, petal length, and petal width) and is labeled with the species of the flower.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
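
The code that produced this preview is not shown. A minimal sketch that would display the same rows, assuming the built-in `iris` data frame from base R's datasets package:

```r
# iris ships with base R (datasets package), so nothing needs to be downloaded
data(iris)

dim(iris)    # number of observations and number of columns
head(iris)   # first six observations, as in the preview above
str(iris)    # column types: four numeric features plus the Species factor
```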

Fitting a Model

Let’s fit a linear model predicting sepal length based on the other three features, and examine the residuals.

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82816 -0.21989  0.01875  0.19709  0.84570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
## Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
## Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
## Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
## F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16
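
The summary above can be reproduced with a call along the following lines. The formula is taken from the `Call:` line of the output; the object name `fit` is an assumption made for illustration.

```r
# Fit sepal length on the other three numeric features
# (formula copied from the Call: line of the summary above)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

# Coefficient table, residual standard error, R-squared and F-statistic
summary(fit)

# Quantiles of the residuals, comparable to the "Residuals:" block
quantile(residuals(fit))
```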

Residual Visualization using ggplot

This plot shows us the following:

  • Most of the residuals are centered around 0, indicating that on average, the model’s predictions are fairly accurate.
  • However, the spread of the residuals shows how much the errors vary from one observation to the next. A perfectly accurate model would have every residual equal to 0.
  • The roughly bell-shaped distribution suggests that the residuals are approximately normally distributed, which is the assumption behind the usual t-tests and confidence intervals in linear regression.
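
The plotting code is not included here. One plausible way to draw a plot of this kind is a histogram of the residuals, sketched below. The histogram geometry, the bin count, and the names `fit`, `res_df` and `p` are assumptions, not the original code.

```r
library(ggplot2)

# Refit the model so this block runs on its own
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

# One row per observation, holding its residual
res_df <- data.frame(residual = residuals(fit))

# Histogram of residuals with a dashed reference line at zero
p <- ggplot(res_df, aes(x = residual)) +
  geom_histogram(bins = 20, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(title = "Distribution of residuals", x = "Residual", y = "Count")

print(p)
```

Because the model includes an intercept, the residuals average to zero by construction, so the centering itself is guaranteed. The shape and spread around zero are what carry information.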

More Residual Plots with ggplot

  • The QQ (Quantile-Quantile) plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
  • In the context of residuals, it’s often used to check if the residuals are normally distributed.
  • The points in the QQ plot follow the line fairly closely, especially in the central portion. This is an indication that the residuals are approximately normally distributed. Deviations from the line at the tails might suggest potential outliers or heavier tails than a normal distribution.
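
A QQ plot like the one described can be sketched with ggplot2's `stat_qq()` and `stat_qq_line()`. Using the raw residuals (rather than standardized ones) and the names `fit`, `res_df` and `qq` are assumptions.

```r
library(ggplot2)

# Refit the model so this block runs on its own
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
res_df <- data.frame(residual = residuals(fit))

# Sample quantiles of the residuals against theoretical normal quantiles;
# stat_qq_line() draws the reference line through the first and third quartiles
qq <- ggplot(res_df, aes(sample = residual)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal QQ plot of residuals",
       x = "Theoretical quantiles", y = "Sample quantiles")

print(qq)
```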

3D Residual Visualization using plotly
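
The interactive figure for this slide is not preserved in the text. As a rough sketch of what such a view could look like, the block below plots each residual against two of the predictors using plotly. The choice of axes, the coloring by species, and the names `fit`, `plot_df` and `fig` are all assumptions.

```r
library(plotly)

# Refit the model so this block runs on its own
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

# Attach the residuals to the original data
plot_df <- iris
plot_df$residual <- residuals(fit)

# 3D scatter: two predictors on the horizontal axes, residual on the vertical axis
fig <- plot_ly(
  data = plot_df,
  x = ~Petal.Length, y = ~Sepal.Width, z = ~residual,
  color = ~Species,
  type = "scatter3d", mode = "markers"
)
fig <- layout(fig, scene = list(
  xaxis = list(title = "Petal length"),
  yaxis = list(title = "Sepal width"),
  zaxis = list(title = "Residual")
))

fig
```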

Understanding Residuals: Mathematical Formulation

Residuals are the differences between the observed values and the predicted values from the model. Mathematically, this can be represented as:

\[ e_i = y_i - \hat{y}_i \]

where:

  • \(e_i\) is the residual for the \(i\)-th observation,
  • \(y_i\) is the observed value, and
  • \(\hat{y}_i\) is the value the model predicts for that observation.
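
To connect the formula to R, the residuals can be computed by hand and compared with what `residuals()` returns. This is a minimal sketch; the variable names `fit`, `y`, `y_hat` and `e` are chosen for illustration.

```r
# Refit the model so this block runs on its own
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

y     <- iris$Sepal.Length   # observed values y_i
y_hat <- fitted(fit)         # predicted values y-hat_i
e     <- y - y_hat           # residuals e_i = y_i - y-hat_i

# The hand-computed residuals agree with the ones stored in the model
all.equal(unname(e), unname(residuals(fit)))

# Observed, predicted and residual side by side for the first few flowers
head(data.frame(observed = y, predicted = y_hat, residual = e))
```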

Residual Sum of Squares (RSS)

The Residual Sum of Squares measures the total squared discrepancy between the observed and predicted values; ordinary least squares chooses the coefficients that minimize it. It is calculated using the following formula:

\[ RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

where:

  • \(n\) is the number of observations,
  • \(e_i\) is the residual, and
  • \(y_i\) and \(\hat{y}_i\) are the observed and predicted values, respectively.
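
As a sketch of how RSS relates to the model summary, the block below computes it from the residuals and derives the residual standard error as \(\sqrt{RSS / (n - p)}\), where \(p\) is the number of estimated coefficients. The variable names are chosen for illustration.

```r
# Refit the model so this block runs on its own
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

e   <- residuals(fit)
rss <- sum(e^2)              # RSS = sum of squared residuals

n <- length(e)               # number of observations
p <- length(coef(fit))       # estimated coefficients, including the intercept

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / (n - p))

# For lm objects, deviance() returns the RSS and sigma() the residual standard error
c(RSS = rss, deviance = deviance(fit), RSE = rse, sigma = sigma(fit))
```

With 150 observations and 4 coefficients, \(n - p = 146\), which matches the degrees of freedom reported alongside the residual standard error in the model summary.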