Purpose :
As analysts, we often use regression analysis as a descriptive tool to explain the relationship between two or more variables. However, there can be a disconnect between reviewing the regression metrics and intuitively understanding them. To help build that intuitive understanding, I will create a series of visualizations and then we will review the metrics.

Consider a typical multivariate regression equation:

\[ \hat{y} = b + m_1x_1 + m_2x_2 + \dots + m_nx_n \]

The metrics we typically look at are:

- R-Squared: The proportion of variance in the dependent variable (y) that is explained by the independent variables (x) in the model, on a scale of 0 to 1; in other words, the “explained variance” (see the formula after this list).
- Effect Size: The magnitude of the impact of the independent variables (x) on the dependent variable (y), measured by the coefficients (m) in the model. Each coefficient represents the expected change in y for a one-unit increase in the corresponding x, holding all other x variables constant.
- P-Value: The probability of obtaining a coefficient (m) as extreme as or more extreme than the one observed in the sample, under the null hypothesis that the coefficient is zero. A low p-value indicates that the coefficient is statistically significant, meaning that it is unlikely to be zero by chance.
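
For reference, R-Squared can be written directly in terms of the model’s residuals and the total variance of y:

\[ R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} \]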

Note that I and other practitioners may refer to the effect size as the slope, the coefficient, or m, while R-Squared and P-Value typically go by no other names.
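
To see where each of these metrics lives in R’s standard output, here is a minimal sketch on fabricated data (the numbers and variable names are purely illustrative):

set.seed(42)
# Fabricated example: X centered at 50, Y = 2X plus random noise
ExampleDF = data.frame(AxisX = rnorm(30, mean = 50, sd = 25))
ExampleDF$AxisY = 2 * ExampleDF$AxisX + rnorm(30, mean = 0, sd = 50)

ExampleSummary = summary(lm(AxisY ~ AxisX, data = ExampleDF))
ExampleSummary$r.squared                          # R-Squared
ExampleSummary$coefficients["AxisX", "Estimate"]  # Effect size / slope (m)
ExampleSummary$coefficients["AxisX", "Pr(>|t|)"]  # P-Value for the slope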

# Load the packages used throughout this post: tidyverse (tibble, crossing,
# dplyr, stringr, ggplot2) and scales (comma, percent)
library(tidyverse)
library(scales)

SampleSize = 50
MeanX = 50
InterceptY = -10
# Standard deviation of X: half the mean, i.e. 25
SDX = 0.5 * MeanX
Slope = seq(1, 5, by = 1)
# Mean of the error term added to Y (equal to the mean of X, i.e. 50)
MeanError = 1 * MeanX
# Standard deviation of the error term: 0.2 to 1.0 times the mean error (10 to 50)
SDError = seq(0.2, 1, by = .2) * MeanError
ReplaceRate = 0.1
  
ParameterDF = crossing(Slope, MeanError, SDError, MeanX, SDX) %>%
  arrange(Slope, MeanError, SDError, MeanX, SDX)

 
for (i in 1:nrow(ParameterDF)) {
  set.seed(1234)
  TempDF = tibble(
      SetID = i,
      SetLabel = paste0("(",str_pad(i, 2, side="left", pad="0"), ") Sl ",  
                        ParameterDF$Slope[i] %>% comma(accuracy=0.1),
                        " Er ", ParameterDF$MeanError[i]  %>% comma(accuracy=0.1),
                        " SDEr ", ParameterDF$SDError[i]  %>% comma(accuracy=0.1)),
      AxisX = rnorm(SampleSize, ParameterDF$MeanX[i], ParameterDF$SDX[i]),
      AxisY = InterceptY +
              (AxisX * ParameterDF$Slope[i]) 
               + rnorm(SampleSize, ParameterDF$MeanError[i], ParameterDF$SDError[i])) %>%
    # Randomly replace a share of X and Y with uniform noise
    # (replacement rate cycles from 0.1 to 0.5 across the five columns)
    mutate(AxisX = ReplaceRandom(AxisX, if_else(i %% 5 == 0, 5, i %% 5)*ReplaceRate, 1, 100),
           AxisY = ReplaceRandom(AxisY, if_else(i %% 5 == 0, 5, i %% 5)*ReplaceRate, 1, 400))
  TempLM = lm(AxisY ~ AxisX, data = TempDF)
  TempDF = mutate(TempDF, ModelLabel = paste0("(",str_pad(i, 2, side="left", pad="0"), ") ",
                                              GetModelLabel(TempLM)))
  if (i == 1) {
    CoordinateDF = TempDF
  } else {
    CoordinateDF = bind_rows(CoordinateDF, TempDF)
  }
}
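
The loop above relies on two helper functions that are defined elsewhere and not shown in this section: ReplaceRandom() and GetModelLabel(). To keep things self-contained, here are minimal sketches of how they might work; the signatures and label format are my assumptions (the label format is inferred from the “R: ” pattern parsed later in this post), not necessarily the original definitions.

# Hypothetical sketch of ReplaceRandom(): replace a given proportion of a
# numeric vector with uniform random values between MinVal and MaxVal
ReplaceRandom = function(Values, Proportion, MinVal, MaxVal) {
  ReplaceIndex = sample(seq_along(Values), round(length(Values) * Proportion))
  Values[ReplaceIndex] = runif(length(ReplaceIndex), MinVal, MaxVal)
  Values
}

# Hypothetical sketch of GetModelLabel(): summarize a fitted lm() as a short
# facet label such as "M: 2.1 R: 39 P: 0.00"
GetModelLabel = function(Model) {
  ModelSummary = summary(Model)
  paste0("M: ", comma(ModelSummary$coefficients[2, "Estimate"], accuracy = 0.1),
         " R: ", round(ModelSummary$r.squared * 100),
         " P: ", comma(ModelSummary$coefficients[2, "Pr(>|t|)"], accuracy = 0.01))
}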

We will generate some random coordinate sets to see how these metrics apply to both highly correlated and noisy two-dimensional data. Here is how we will do that:

- Sample Size : We’ll look at the changes in regression metrics as the sample size increases from 50 to 100 and finally 500.
- Randomized X : We will keep this fairly constant using a normal distribution with a mean of 50 and standard deviation of 25.
- Randomized Y : We will initially apply a multiplier or slope (M) to X of 1 to 5 and then add a mean error of 50 with a standard deviation from 10 to 50.
- Further Randomization of Y : In order to further disconnect Y from X, we will randomly replace between 10% and 50% of the coordinates with random values (via the ReplaceRandom() helper sketched above).

Let’s visualize those randomized coordinate sets:

CoordinateDF %>%
  ggplot(aes(AxisX, AxisY)) +
  geom_point(na.rm = T, color = "grey50", size = 1) +
  geom_smooth(formula = y ~ x, method = "lm", se = F, na.rm = T, 
              color = "darkred", alpha = 0.5) +
  facet_wrap(~ModelLabel, ncol = 5) +
  scale_x_continuous(limits = c(0, 100)) + 
  scale_y_continuous(limits = c(0, max(CoordinateDF$AxisY))) +
  theme_classic() +
  theme(legend.position="none") +
  labs(title = paste0("Regression Examples - n = ", SampleSize),
       subtitle = paste0("Slope (M) Targets by Row: ", paste(Slope, collapse = ", "),
                         " Std Dev Y by Column", paste(SDError, collapse = ", "),
                         " Noise Rate by Column: ", paste(seq(1:5)*ReplaceRate, collapse = ", ")),
       x = "", y = "")

Row 1 : At our lowest sample size of 50, we can see that the first row, with the lowest target slope, has high P values across the board; in other words, we cannot statistically rule out a slope of zero. That makes sense because, even without the noise in column 1, the measured effect size is low to begin with. Likewise, the R-Squared or explained variance is low as well, and both P and R-Squared generally get worse as progressively more noise is introduced in columns 2-5.

Columns 3-5 : Regardless of the row, which corresponds to our target slope (M), columns 3-5 have high noise, with random replacement and standard deviation effects working together to create high P values and low R-Squared values. The more noise that is introduced, the lower the explained variance will be.

Rows 4-5, Columns 1-2 : In these examples we have our highest target slopes with relatively low levels of noise. That means our P values will be low, since it is statistically unlikely that the true slope is zero, while our R-Squared / explained variance is relatively high.

You can see that the measured slopes and R-Squared values generally, but not always, decrease steadily across columns 1-5, because the noise is truly random and has inconsistent effects.

Now what happens if we increase our sample size keeping everything else constant?

SampleSize = 100
MeanX = 50
InterceptY = -10
# Standard deviation of X: half the mean, i.e. 25
SDX = 0.5 * MeanX
Slope = seq(1, 5, by = 1)
# Mean of the error term added to Y (equal to the mean of X, i.e. 50)
MeanError = 1 * MeanX
# Standard deviation of the error term: 0.2 to 1.0 times the mean error (10 to 50)
SDError = seq(0.2, 1, by = .2) * MeanError
ReplaceRate = 0.1
  
ParameterDF = crossing(Slope, MeanError, SDError, MeanX, SDX) %>%
  arrange(Slope, MeanError, SDError, MeanX, SDX)

 
for (i in 1:nrow(ParameterDF)) {
  set.seed(1234)
  TempDF = tibble(
      SetID = i,
      SetLabel = paste0("(",str_pad(i, 2, side="left", pad="0"), ") Sl ",  
                        ParameterDF$Slope[i] %>% comma(accuracy=0.1),
                        " Er ", ParameterDF$MeanError[i]  %>% comma(accuracy=0.1),
                        " SDEr ", ParameterDF$SDError[i]  %>% comma(accuracy=0.1)),
      AxisX = rnorm(SampleSize, ParameterDF$MeanX[i], ParameterDF$SDX[i]),
      AxisY = InterceptY +
              (AxisX * ParameterDF$Slope[i]) 
               + rnorm(SampleSize, ParameterDF$MeanError[i], ParameterDF$SDError[i])) %>%
    # Randomly replace a share of X and Y with uniform noise
    # (replacement rate cycles from 0.1 to 0.5 across the five columns)
    mutate(AxisX = ReplaceRandom(AxisX, if_else(i %% 5 == 0, 5, i %% 5)*ReplaceRate, 1, 100),
           AxisY = ReplaceRandom(AxisY, if_else(i %% 5 == 0, 5, i %% 5)*ReplaceRate, 1, 400))
  TempLM = lm(AxisY ~ AxisX, data = TempDF)
  TempDF = mutate(TempDF, ModelLabel = paste0("(",str_pad(i, 2, side="left", pad="0"), ") ",
                                              GetModelLabel(TempLM)))
  if (i == 1) {
    CoordinateDF = TempDF
  } else {
    CoordinateDF = bind_rows(CoordinateDF, TempDF)
  }
}
 
CoordinateDF %>%
  ggplot(aes(AxisX, AxisY)) +
  geom_point(na.rm = T, color = "grey50", size = 1) +
  geom_smooth(formula = y ~ x, method = "lm", se = F, na.rm = T, 
              color = "darkred", alpha = 0.5) +
  facet_wrap(~ModelLabel, ncol = 5) +
  scale_x_continuous(limits = c(0, 100)) + 
  scale_y_continuous(limits = c(0, max(CoordinateDF$AxisY))) +
  theme_classic() +
  theme(legend.position="none") +
  labs(title = paste0("Regression Examples - n = ", SampleSize),
       subtitle = paste0("Slope (M) Targets by Row: ", paste(Slope, collapse = ", "),
                         " Std Dev Y by Column", paste(SDError, collapse = ", "),
                         " Noise Rate by Column: ", paste(seq(1:5)*ReplaceRate, collapse = ", ")),
       x = "", y = "")

By increasing the sample size to 100, we notice two things:

  1. Many of the slopes flatten slightly because the randomized noise becomes more influential at the higher sample size.
  2. Most importantly, we see that the P values begin to decrease across the board, because the probability of observing a slope this large by chance, if the true slope were zero, shrinks as the sample size grows.

The other metric trends across columns and down rows are similar to those at the smaller sample size.
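
The driver behind the P value trend is the standard error of the slope, which shrinks with the square root of the sample size; the t-statistic behind the p-value is the slope divided by this standard error:

\[ SE(m) = \frac{\sigma_{error}}{\sigma_x \sqrt{n - 1}}, \qquad t = \frac{m}{SE(m)} \]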

Finally, we’ll increase our sample size to 500.

SampleSize = 500
MeanX = 50
InterceptY = -10
# Standard deviation of X: half the mean, i.e. 25
SDX = 0.5 * MeanX
Slope = seq(1, 5, by = 1)
# Mean of the error term added to Y (equal to the mean of X, i.e. 50)
MeanError = 1 * MeanX
# Standard deviation of the error term: 0.2 to 1.0 times the mean error (10 to 50)
SDError = seq(0.2, 1, by = .2) * MeanError
ReplaceRate = 0.1
  
ParameterDF = crossing(Slope, MeanError, SDError, MeanX, SDX) %>%
  arrange(Slope, MeanError, SDError, MeanX, SDX)

 
for (i in 1:nrow(ParameterDF)) {
  set.seed(1234)
  TempDF = tibble(
      SetID = i,
      SetLabel = paste0("(",str_pad(i, 2, side="left", pad="0"), ") Sl ",  
                        ParameterDF$Slope[i] %>% comma(accuracy=0.1),
                        " Er ", ParameterDF$MeanError[i]  %>% comma(accuracy=0.1),
                        " SDEr ", ParameterDF$SDError[i]  %>% comma(accuracy=0.1)),
      AxisX = rnorm(SampleSize, ParameterDF$MeanX[i], ParameterDF$SDX[i]),
      AxisY = InterceptY +
              (AxisX * ParameterDF$Slope[i]) 
               + rnorm(SampleSize, ParameterDF$MeanError[i], ParameterDF$SDError[i])) %>%
    # Randomly replace a share of X and Y with uniform noise
    # (replacement rate cycles from 0.1 to 0.5 across the five columns)
    mutate(AxisX = ReplaceRandom(AxisX, if_else(i %% 5 == 0, 5, i %% 5)*ReplaceRate, 1, 100),
           AxisY = ReplaceRandom(AxisY, if_else(i %% 5 == 0, 5, i %% 5)*ReplaceRate, 1, 400))
  TempLM = lm(AxisY ~ AxisX, data = TempDF)
  TempDF = mutate(TempDF, ModelLabel = paste0("(",str_pad(i, 2, side="left", pad="0"), ") ",
                                              GetModelLabel(TempLM)))
  if (i == 1) {
    CoordinateDF = TempDF
  } else {
    CoordinateDF = bind_rows(CoordinateDF, TempDF)
  }
}
 
CoordinateDF %>%
  ggplot(aes(AxisX, AxisY)) +
  geom_point(na.rm = T, color = "grey50", size = 1) +
  geom_smooth(formula = y ~ x, method = "lm", se = F, na.rm = T, 
              color = "darkred", alpha = 0.5) +
  facet_wrap(~ModelLabel, ncol = 5) +
  scale_x_continuous(limits = c(0, 100)) + 
  scale_y_continuous(limits = c(0, max(CoordinateDF$AxisY))) +
  theme_classic() +
  theme(legend.position="none") +
  labs(title = paste0("Regression Examples - n = ", SampleSize),
       subtitle = paste0("Slope (M) Targets by Row: ", paste(Slope, collapse = ", "),
                         " Std Dev Y by Column", paste(SDError, collapse = ", "),
                         " Noise Rate by Column: ", paste(seq(1:5)*ReplaceRate, collapse = ", ")),
       x = "", y = "")

With the increased sample size, P values below Row 1 are all very low, as the probability of observing these slopes by chance, under a true slope of zero, shrinks further at n = 500.
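
To isolate the sample-size effect, here is a small standalone simulation; the slope and noise values below are illustrative choices of mine rather than the grid parameters above. With everything else held fixed, the slope's p-value typically shrinks as n grows.

# Fit the same weak, noisy relationship at each sample size and report the
# slope's p-value; only n changes
SlopePValue = function(n) {
  set.seed(1234)
  SimX = rnorm(n, 50, 25)
  SimY = 0.5 * SimX + rnorm(n, 0, 50)   # weak slope buried in noise
  summary(lm(SimY ~ SimX))$coefficients["SimX", "Pr(>|t|)"]
}
sapply(c(50, 100, 500), SlopePValue)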

Finally, what can we do with regression models at certain R-Squared values? I have a personal rule of thumb that any R-Squared above 40% can be useful. Let’s look at a quick classification example.

Let’s pick one of the models where the R-Squared is near 40% and pretend that our X variable is an admissions test that predicts whether a student will pass the course, where passing means scoring at or above the 70th percentile of actual results. The actual course result will be our Y variable.

If the admissions test predicts a score at or above the pass mark and the student actually scores at or above it, OR the test predicts below the pass mark and the student actually scores below it, we’ll consider that to be an accurate prediction.

We will of course do a 75% train / 25% test split, with each data set having a similar distribution of Y values. The model will be fit on the training data and then applied to the test data to estimate how well the model would do against new data.

ModelDF = CoordinateDF %>%
  # Recover each set's R-Squared from its facet label, e.g. "R: 39" -> 0.39
  mutate(RSQ = as.numeric(str_extract(ModelLabel, "(?<=R:\\s)\\d*")) / 100) %>%
  filter(RSQ >= 0.37 & RSQ <= 0.6) %>%
  # Keep the single set with the lowest qualifying R-Squared,
  # breaking any tie on the lowest SetID
  filter(RSQ == min(RSQ)) %>%
  filter(SetID == min(SetID))

TrainProportion = .75
RandomSeed = as.numeric(as.Date("2063-04-05")) + 100

# Create split; initial_split(), training() and testing() come from the
# rsample package (tidymodels). Stratifying on the continuous AxisY bins it
# into quantiles so train and test get similar Y distributions.
set.seed(RandomSeed)
ModelDF_Split = initial_split(ModelDF,
                              strata = AxisY, prop = TrainProportion)

ModelDF_Train = training(ModelDF_Split)
ModelDF_Test = testing(ModelDF_Split)

ModelLM = lm(AxisY ~ AxisX, 
             data = ModelDF_Train)

ModelDF_Test = ModelDF_Test %>%
  mutate(PredY = predict(ModelLM, ModelDF_Test),
         # Pass mark = 70th percentile of actual course results in the test set
         PassY = quantile(AxisY, 0.70),
         Accurate = if_else((PredY >= PassY & AxisY >= PassY)
                            | (PredY < PassY & AxisY < PassY), 1, 0),
         Result = if_else(Accurate == 1, "Accurate Prediction", "Bad Prediction"))

PrettyModelTable(ModelLM,
                 paste0("Set ID (", ModelDF$SetID[1] ,") with Overall R-Squared ", percent(ModelDF$RSQ[1], accuracy = 0.1), 
                        " - Regression Metrics on Training Subset"))

Set ID (12) with Overall R-Squared 39.0% - Regression Metrics on Training Subset

| Variable    | Estimate   | StdError | TValue | PValue | Significant |
|-------------|------------|----------|--------|--------|-------------|
| R-Squared   | 0.3956397  | NA       | NA     | NA     |             |
| (Intercept) | 83.8790000 | 7.712    | 10.876 | 0      |             |
| AxisX       | 2.0980000  | 0.135    | 15.563 | 0      | Yes         |

ModelDF_Test %>%
  mutate(TotalCount = n()) %>%
  group_by(Result) %>%
  summarize(`Accuracy Rate` = percent(n() / min(TotalCount), accuracy = 0.1),
         .groups = "drop") %>%
PrettyTable(paste0("Model Prediction Accuracy Where Pass = 70th Percentile - Test Data n=", nrow(ModelDF_Test)))

Model Prediction Accuracy Where Pass = 70th Percentile - Test Data n=128

| Result              | Accuracy Rate |
|---------------------|---------------|
| Accurate Prediction | 83.6%         |
| Bad Prediction      | 16.4%         |

Here we can see that a linear classifier built on a model with an R-Squared around 40% can be reasonably accurate. For context, because the pass mark is the 70th percentile, naively predicting “fail” for every student would be right only about 70% of the time, so 83.6% is a meaningful improvement over that baseline.
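
As a quick sanity check of that naive baseline on the test data (by construction, roughly 70% of test students fall below the pass mark):

ModelDF_Test %>%
  summarize(`Naive Fail Rate` = percent(mean(AxisY < PassY), accuracy = 0.1))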

Conclusions :

By visualizing a grid of 25 regression models, we can see how noisy data and larger sample sizes impact our model metrics: the slope / effect size (M), R-Squared, and the P value.

In closing, don’t be regressive.

Be Savvy