Linear Regression: Looking Back at the Old School

Some formulas

\(\text{Independent variable} = x\)

\(i^{th}~\text{observation} = x_i\)

\(\text{Number (frequency) of observation} = n\)

\(\text{Mean} = \bar{x}\)

\(\text{Mean (M)} = \frac{Total}{Frequency} = \frac{\sum x_i}{n},~i = 1~to~n\)

\(\text{Deviation from mean} = x_i - \bar{x}\)

\(\text{Squared deviation from mean} = (x_i - \bar{x})^2 \Rightarrow \text{Total Sum of Squares (TSS)} = \sum(x_i - \bar{x})^2\)

\(\text{Mean sum of squares} = \frac{1}{n}\sum(x_i - \bar{x})^2 = \text{Variance}\)

\(\text{Positive square root of variance }= \sqrt{\frac{1}{n}\sum(x_i - \bar{x})^2} = \text{Standard~deviation~(SD)}\)

The standard deviation measures the typical deviation of the observations from their mean, expressed in the same units as the data.
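
A quick R illustration of these definitions using a made-up vector (the values are arbitrary). Note that base R's var() and sd() divide by n - 1, whereas the formulas above divide by n.

x = c(2, 4, 4, 4, 5, 5, 7, 9)              # made-up observations
n = length(x)
mean(x)                                    # mean
sum((x - mean(x))^2)                       # total sum of squares (TSS)
sum((x - mean(x))^2) / n                   # variance as defined above (divide by n)
sqrt(sum((x - mean(x))^2) / n)             # standard deviation (population form)
var(x); sd(x)                              # base R divides by n - 1 instead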

Regression equation: \(y_i = \alpha + \beta X_i + \epsilon_i\)

\(\text{Dependent~variable} = y_i\) [observed values]

\(\text{Estimated value of dependent variable} = \hat{y_i} = \alpha + \beta X_i ~~\therefore~\text{Residual or Error},~ \epsilon_i = y_i - \hat{y_i}\)

\(\text{Mean value of dependent variable} = \bar{y}\)

\(\text{Total Sum of Squares (TSS)} = \sum(y_i - \bar{y})^2\)

\(\text{Mean Sum of Squares (MSS)} = \frac{1}{n} \sum(y_i - \bar{y})^2\)

Now, consider the deviation of the observed values from the predicted (modeled or estimated) values.

\(\text{Deviation of observation from estimated values} = y_i - \hat{y_i} = \text{Error = Residuals}\)

Mean Absolute Error (MAE) = \(\frac{1}{n} \sum{|\hat{y}_i - y_i|}\) = Mean of the absolute residuals

\(\text{Residual Squares} = (y_i - \hat{y_i})^2 \Rightarrow \text{Residual Sum of squares (RSS)} = \sum(y_i - \hat{y_i})^2\)

\(\text{Residual Mean Sum of Squares (RMSS)} = \frac{1}{n}\sum(y_i - \hat{y_i})^2 = MSE_{error} = MSE_{residual}\)

\(\sqrt{RMSS} = \sqrt{\frac{1}{n}\sum(y_i - \hat{y_i})^2} = RMSE~(Root~Mean~Square~Error)\)

\(\text{MSE} = \frac{1}{n}\sum(\hat{y}_i - y_i)^2 = \frac{1}{n}\sum(\hat{y}_i - \bar{\hat{y}})^2 + \frac{1}{n}\sum(\bar{\hat{y}} - y_i)^2 + \text{Irreducible Error}\)

\(\text{MSE} = \frac{1}{n}\sum(y_i - \hat{y}_i)^2 = \text{Variance} + \text{Bias}^2 + \text{Irreducible Error}\)

Variance indicates the mean squared deviation (MSS) of the individual predicted observations from the mean predicted observation. Variance \(= \frac{1}{n}\sum(y_{pred}−\bar{y}_{pred})^2\)

\(Bias^2\) indicates the squared deviation (MSS) of the mean predicted observation from the true observations: \(Bias^2 = \frac{1}{n} \sum(\bar{y}_{pred}−y_{true})^2\). In many practical applications this reduces to \(Bias^2 = (\bar{y}_{pred} - \bar{y}_{true})^2\).

\(High~Variance\): The predictions are very sensitive to the input (training) data. The model is overfitting, meaning it fits the training data well but performs poorly on the test data.

\(High~Bias^2\): The model has not captured the underlying pattern in the data well. The model is underfitting because a model that is too simple has been used instead of a more complex one.

These are the basic equations used frequently in different analyses. Now we will see how the bivariate formulas of correlation and regression are derived. Let us start with the formula of variance.

\(var_x = \frac{1}{n}\sum(x_i - \bar{x})^2, var_y = \frac{1}{n}\sum(y_i - \bar{y})^2\)

\(\text{Sum product of x and y},~sp_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})\)

\(\text{Covariance},~cov_{xy} = \frac{1}{n}\sum(x_i - \bar{x})(y_i - \bar{y}) \Rightarrow \text{How does y vary with x?}\)

\(\text{Correlation},~r = \frac{cov_{xy}}{sd_x sd_y}\Rightarrow \text{Unitless, range: -1 to +1}\)

\(\text{Regression},~\beta = \frac{cov_{xy}}{var_x} = r\frac{sd_y}{sd_x} = Slope = \text{change in y per x}\Rightarrow y = f(x)\)

\(\text{Standardized regression coefficient},~\beta^*~or~b~or~B = \beta\frac{sd_x}{sd_y} = change~in~sd_y~per~sd_x\)

For standardized (z-scores, centered and scaled) variables, \(\beta^* =\beta\frac{sd_x}{sd_y}= r\frac{sd_y}{sd_x}\frac{sd_x}{sd_y} = r\)
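
A short R check of these identities on made-up data (values arbitrary). Base R's cov(), var(), and sd() all use the n - 1 denominator, so the identities hold exactly.

set.seed(1)
x = rnorm(50); y = 2 * x + rnorm(50)        # made-up bivariate data
cov(x, y) / var(x)                          # beta from the covariance formula
cor(x, y) * sd(y) / sd(x)                   # the same beta via the correlation
coef(lm(y ~ x))[2]                          # beta from lm()
coef(lm(scale(y) ~ scale(x)))[2]            # standardized beta
cor(x, y)                                   # equals the standardized beta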

\(\text{Coefficient of determination, }R^2 = 1-\frac{RSS}{TSS} = \frac{TSS-RSS}{TSS} = \frac{SS_{effect}}{SS_{total}}\)

\(Adjusted~R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)},~\text{for n observations, d predictors}\)

\(ANOVA \Rightarrow F=\frac{MSE_{effect}}{MSE_{residual}},~\eta^2 = \frac{SS_{effect}}{SS_{total}} \Rightarrow \text{ Similar to }R^2\)

\(ANOVA \Rightarrow \text{Model CV(%)} = \frac{RMSE}{\bar{y}}\times100, \text{Dependent variable CV(%)} = \frac{sd_y}{\bar{y}}\times 100\)

\(\text{Variance Inflation Factor, } VIF_j = \frac{1}{1-R_j^2}\), where \(R_j^2\) is obtained by regressing predictor \(j\) on the other predictors; if it is greater than 4, the model may have multicollinearity.

Simple Linear Regression (Ordinary Least Squares, OLS)

Load the required libraries using the p_load() function of the pacman package. Install the pacman package once, if needed, by running install.packages('pacman') in the console; you do not need to install it each time. Set the working directory properly before continuing the analysis.

options(repr.plot.width=10, repr.plot.height=7) # for notebook figure width
pacman::p_load(ISLR, glmtoolbox, DALEX, MASS, lmtest, jtools, car, vip, cowplot, glmnet, caret, tidymodels, tidyverse)

Import dataset:

data(Credit)
write.csv(Credit, 'Credit.csv', row.names = FALSE)
credit = read.csv('Credit.csv')[-1] # excluding the ID column
str(credit)
## 'data.frame':    400 obs. of  11 variables:
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : chr  " Male" "Female" " Male" "Female" ...
##  $ Student  : chr  "No" "Yes" "No" "No" ...
##  $ Married  : chr  "Yes" "Yes" "No" "No" ...
##  $ Ethnicity: chr  "Caucasian" "Asian" "Asian" "Asian" ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

The name of the dataset is credit. It contains 400 observations and 11 variables. You can view the dataset in the Environment tab by clicking on it. Some of the variables are read as chr (character) but are actually factors. We need to convert them to factors before continuing the analysis.

factors = c('Gender', 'Student', 'Married', 'Ethnicity')

credit = credit %>% mutate_at(factors, factor)

str(credit)
## 'data.frame':    400 obs. of  11 variables:
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

We will use the customer Balance (in US$) as the dependent variable and Income (in thousands of US$) as the independent variable. We want to check the effect of Income on Balance.

In simple linear regression there is only one independent variable; in multiple linear regression there is more than one. The dependent variable in linear regression must be continuous or discrete, and the residuals should be normally distributed with equal variance (homoskedastic). We need to estimate the coefficients of the regression equation.

Regression equation: \(y_i = \alpha + \beta X_i + \epsilon_i\)

Fit the model using the lm() function. The model is significant, as indicated by F(1,398) = 108.99 with p = 0.00. This means that the model significantly explains the variance in the balance. Furthermore, \(R^2=0.21\) indicates that 21% of the variance in the balance can be explained by the income of the customer. The estimate (regression coefficient) of income is 6.05, which is highly significant (p < 0.001). This means that if income increases by one unit (US$1,000), the balance increases by 6.05 dollars. The intercept (\(\alpha = 246.51\)) indicates the average balance when income (the independent variable) is zero.

mod = lm(Balance ~ Income, credit)
summ(mod, confint = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(1,398) = 108.99, p = 0.00
## R² = 0.21
## Adj. R² = 0.21 
## 
## Standard errors:OLS
## ------------------------------------------------------------
##                       Est.     2.5%    97.5%   t val.      p
## ----------------- -------- -------- -------- -------- ------
## (Intercept)         246.51   181.25   311.78     7.43   0.00
## Income                6.05     4.91     7.19    10.44   0.00
## ------------------------------------------------------------
# Summary of the variables
credit %>% select(Balance, Income) %>% summary()
##     Balance            Income      
##  Min.   :   0.00   Min.   : 10.35  
##  1st Qu.:  68.75   1st Qu.: 21.01  
##  Median : 459.50   Median : 33.12  
##  Mean   : 520.01   Mean   : 45.22  
##  3rd Qu.: 863.00   3rd Qu.: 57.47  
##  Max.   :1999.00   Max.   :186.63

We know the actual average balance (520.01 dollars). We can predict this (\(\hat y\)) by the model: \(\hat y = 246.51 + 6.05 \cdot Income\)
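
As a quick check with the rounded coefficients above: the OLS line passes through the point of means, so plugging in the mean income reproduces the mean balance (up to rounding).

coef(mod)[1] + coef(mod)[2] * mean(credit$Income)   # ~ 520, the OLS line passes through the point of means
mean(credit$Balance)                                # the actual mean balance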

We can obtain predictions using predict(). Since we also know the actual balances, we can calculate the residuals (errors, \(\epsilon\)) for each observation by subtracting the predicted values from the actual values.

actual = credit$Balance # actual values
credit$predicted = predict(mod) # add to the dataset
credit$residuals = actual - credit$predicted # add to the dataset
data.frame(actual = actual, predicted = credit$predicted, residual = credit$residuals)[1:5,]
##   actual predicted   residual
## 1    333  336.5809   -3.58093
## 2    903  887.7925   15.20752
## 3    580  879.1312 -299.13122
## 4    964 1147.2612 -183.26122
## 5    331  584.5094 -253.50939

Let us visualize the residuals. The blue dots are the actual observations. The red line is the regression line (the predicted or estimated y values) based on the estimated coefficients. The arrows represent the deviations (residuals) of the observations from the predicted y values (the red line). The residuals of the model can also be accessed directly with mod$residuals.

# Only 301 to 400 observations are plotted for better visualization
ggplot(credit[301:400, ], aes(x = Income, y = Balance)) +
  geom_point(color = 'blue', shape = 19, size = 5, alpha = 0.5) + 
  geom_line(aes(y = predicted), linewidth = 2, color = 'red') + # predicted values (regression line)
  geom_segment(aes(xend = Income, yend = predicted), color = 'grey50', linewidth = 1, 
               arrow = arrow(type='open', ends = 'first', length = unit(0.1, 'inches'))) +
  labs(x = 'Income (US$)', y = 'Balance (US$)') +
  theme_test() 

For the regression to be valid, the residuals should be normally distributed and homoskedastic (free of heteroskedasticity). The fitted model object contains several components.

names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

These components can be accessed with mod$… commands; for example, mod$fitted.values contains the predicted values.

mod$fitted.values[1:5] %>% round(digits = 1)
##      1      2      3      4      5 
##  336.6  887.8  879.1 1147.3  584.5

Let us check the residuals for normality and heteroskedasticity.

boxplot(mod$residuals, col = 'red')

The boxplot suggests the residuals are roughly symmetric around zero, with many outliers in both tails.

qqPlot(mod)

## [1]  97 223

The QQ plot, however, indicates the residuals may not be normally distributed, with many outliers in the upper tail and systematic deviations in the middle.

plot(mod, which = 1)

The red line should align with the dotted horizontal line to indicate a linear relationship between the independent and dependent variables; this is one of the important assumptions of linear regression. The plot should also show equal variance across the fitted values to meet the assumption of homoskedasticity.

If the spread of the residuals around the dotted line is not homogeneous across the fitted values, a funnel-shaped pattern appears, meaning the variances are heteroskedastic. This is a problem for linear regression. In that case we can estimate heteroskedasticity-robust standard errors, which give more reliable p-values; the coefficient estimates themselves are unchanged and remain unbiased.

Visualization of the funnel-shaped spread of the residuals.

ggplot(credit, aes(y = residuals, x = predicted)) +
  geom_point(shape = 1, size = 2) + theme_test() +
  geom_hline(yintercept = 0, linetype = 'dotted', linewidth = 1) +
  geom_hline(yintercept = 700, linetype = 'dashed', linewidth = 1, color = 'blue') +
  geom_hline(yintercept = -700, linetype = 'dashed', linewidth = 1, color = 'blue') +
  geom_abline(intercept = -10, slope = 1.0, linewidth = 2, linetype = 'dotted') +
  geom_abline(intercept = 10, slope = -1.0) +
  geom_abline(intercept = 1300, slope = -.50) 

Heteroskedasticity can also be checked with the Breusch-Pagan (BP) test (\(H_0\): the residuals are homoskedastic). Our model shows heteroskedasticity at the 10% level of significance (p = 0.07).

To get robust standard errors in the case of a funnel-shaped (heteroskedastic) spread of the residuals, we can use the summ(mod, robust = TRUE, confint = TRUE) command.
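
Equivalent heteroskedasticity-robust (HC3) tests can also be obtained with coeftest() from the already-loaded lmtest package together with the sandwich package; this sketch assumes sandwich is installed, since it is not loaded by the p_load() call above.

lmtest::coeftest(mod, vcov = sandwich::vcovHC(mod, type = 'HC3'))   # robust (HC3) coefficient tests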

bptest(mod)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod
## BP = 3.2191, df = 1, p-value = 0.07279
summ(mod, robust = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(1,398) = 108.99, p = 0.00
## R² = 0.21
## Adj. R² = 0.21 
## 
## Standard errors: Robust, type = HC3
## ------------------------------------------------------------
##                       Est.     2.5%    97.5%   t val.      p
## ----------------- -------- -------- -------- -------- ------
## (Intercept)         246.51   182.22   310.81     7.54   0.00
## Income                6.05     4.84     7.25     9.88   0.00
## ------------------------------------------------------------

The residual standard error of the regression model can be accessed with summary(mod)$sigma.

summary(mod)$sigma %>% round(2)
## [1] 407.86

Let us examine the bias and variance from this model.

Bias-variance trade-off

Figure: the bias-variance trade-off (adapted from https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/).

mod = lm(Balance ~ Income, credit)
predicted = predict(mod)
average_predicted = mean(predicted)
true = credit$Balance
residuals = mod$residuals

Mean square error (MSE): The mean square error is the average of the squared differences between the actual and predicted values, i.e. the average of the squared residuals. The lower the MSE, the better the model. The MSE is calculated with mean(mod$residuals^2).

Root mean square error (RMSE): The root mean square error is the square root of the MSE. The lower the RMSE, the better the model. The RMSE is calculated with sqrt(mean(mod$residuals^2)).

# RMSE in tidy way
mod$residuals^2 %>% mean() %>% sqrt() %>% round(2)
## [1] 406.84
# Other error indicators
(MSE = mean(residuals^2))
## [1] 165521.9
(RMSE = sqrt(MSE))
## [1] 406.8438
(variance = mean((predicted - average_predicted)^2))
## [1] 45327.92
(bias2 = mean((average_predicted - true)^2))
## [1] 210849.8
(irreducible_error = MSE - variance - bias2)
## [1] -90655.84
# RMSE is not equal to standard deviation of the residuals, which takes degree of freedom into account.
# Similarly, variance is not equal to var(residuals), which takes degree of freedom into account.
(MSE = bias2 + variance + irreducible_error)
## [1] 165521.9
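
Because variance, bias², and the residual term above are computed from a single fitted model rather than over repeated training sets, this empirical decomposition is only illustrative, which is why the "irreducible error" term comes out negative here. The variance of a model is more naturally estimated by refitting it on many resampled training sets; a minimal bootstrap sketch, reusing the credit data and the simple formula above:

set.seed(123)
B = 200                                                    # number of bootstrap refits
grid = data.frame(Income = seq(min(credit$Income), max(credit$Income), length.out = 50))
pred_mat = sapply(seq_len(B), function(b) {
  idx = sample(nrow(credit), replace = TRUE)               # bootstrap training set
  predict(lm(Balance ~ Income, data = credit[idx, ]), newdata = grid)
})
mean(apply(pred_mat, 1, var))                              # average variance of the predictions across refits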

Multiple linear regression

\(y = \alpha + \beta_1X_1 + \beta_2X_2 + ... ... + \beta_nX_n + \epsilon\)

In multiple linear regression, there is more than one independent variable. The assumptions are:

  1. The dependent (criterion, response) variable is continuous (or discrete) and normally distributed.

  2. The independent (predictor) variables can be continuous (covariate or quantitative), discrete, or categorical (factor or qualitative) [James, et al. 2013, p.83]

  3. The independent variables are orthogonal, meaning they are uncorrelated with each other. In reality they may be correlated, and if the correlation between two variables is strong, a multicollinearity problem may arise: either of the correlated variables can explain the same variance, so we drop the one with the higher VIF (variance inflation factor). A high VIF (> 4, > 6, or at most > 10, depending on the rule of thumb) indicates multicollinearity. The formula is related to the coefficient of determination: \(VIF_j = \frac{1}{1-R_j^2}\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors (a by-hand check is sketched after this list).

  4. The independent variables are linearly related to the dependent variable. This is checked with the residuals vs fitted plot; the red line should lie roughly straight along the zero line.

  5. The variance of the residuals should be homoskedastic, i.e. constant variance of the error terms. The residual plot should show a fairly equal spread of points on both sides of the zero line.

  6. The residuals should be uncorrelated and normally distributed, which is checked with the residual qqPlot; the residuals should lie along the diagonal line.

  7. The residuals should be random and uncorrelated with the independent variables. If the QQ plot of the residuals shows any systematic deviation from the diagonal line, this indicates that systematic variation remains in the residuals and the independent variables have not captured all the variance of the dependent variable.

  8. Outliers usually have a smaller effect on the regression line, but high-leverage points can pull the regression line towards them.
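
To make the VIF in item 3 concrete, here is a small by-hand sketch using the credit data loaded above: the VIF of Limit is computed from the \(R^2\) of regressing Limit on the other numeric predictors and compared with car::vif().

r2_limit = summary(lm(Limit ~ Income + Rating + Cards + Age + Education, data = credit))$r.squared
1 / (1 - r2_limit)                                                  # VIF of Limit from the definition
vif(lm(Balance ~ Income + Limit + Rating + Cards + Age + Education, data = credit))  # car::vif() for comparison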

Let us explore the dataset.

credit = read.csv('Credit.csv')[-1] # excluding the ID column
factors = c('Gender', 'Student', 'Married', 'Ethnicity')
credit[factors] = lapply(credit[factors], as.factor)

scatterplotMatrix(formula = ~ Balance + Age + Cards + Education + Income + Limit + Rating, data = credit,
                  smooth = TRUE, regLine = FALSE, ellipse = FALSE, plot.points = TRUE,
                  cex = 1, cex.axis = 2, cex.labels = 2)

summary(credit)
##      Income           Limit           Rating          Cards      
##  Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000  
##  1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000  
##  Median : 33.12   Median : 4622   Median :344.0   Median :3.000  
##  Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958  
##  3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000  
##  Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000  
##       Age          Education        Gender    Student   Married  
##  Min.   :23.00   Min.   : 5.00    Male :193   No :360   No :155  
##  1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40   Yes:245  
##  Median :56.00   Median :14.00                                   
##  Mean   :55.67   Mean   :13.45                                   
##  3rd Qu.:70.00   3rd Qu.:16.00                                   
##  Max.   :98.00   Max.   :20.00                                   
##             Ethnicity      Balance       
##  African American: 99   Min.   :   0.00  
##  Asian           :102   1st Qu.:  68.75  
##  Caucasian       :199   Median : 459.50  
##                         Mean   : 520.01  
##                         3rd Qu.: 863.00  
##                         Max.   :1999.00

Meaning of regression coefficient for qualitative predictors

In this example, we will use Student as the independent factor and Balance as the criterion variable.

mod_qual = lm(Balance ~ Student, credit)
summ(mod_qual)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(1,398) = 28.62, p = 0.00
## R² = 0.07
## Adj. R² = 0.06 
## 
## Standard errors:OLS
## --------------------------------------------------
##                       Est.    S.E.   t val.      p
## ----------------- -------- ------- -------- ------
## (Intercept)         480.37   23.43    20.50   0.00
## StudentYes          396.46   74.10     5.35   0.00
## --------------------------------------------------

We see that the regression coefficient (\(\beta_1\)) is 396.46 dollars. It means that a student customer carries, on average, 396.46 dollars more balance than a non-student customer, whose average balance is the intercept (\(\beta_0 = 480.37\) dollars). Our \(H_0: \beta_1 = 0\) is rejected because p < 0.05. The prediction equation is:

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i^{th} \text{ person is a student } (x_i = 1) \\ \beta_0 + \epsilon_i & \text{if the } i^{th} \text{ person is not a student } (x_i = 0) \end{cases} \]
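
To see how R encodes this, the design matrix contains a 0/1 dummy for Student; predict() on the two levels then returns \(\beta_0\) and \(\beta_0 + \beta_1\). A quick illustration:

contrasts(credit$Student)                                           # reference level 'No' = 0, 'Yes' = 1
head(model.matrix(~ Student, data = credit))                        # the dummy column used by lm()
predict(mod_qual, newdata = data.frame(Student = c('No', 'Yes')))   # beta0 and beta0 + beta1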

Qualitative predictors with more than two levels

Now we will fit a model of balance with ethnicity. Although the results are not significant, we can still demonstrate how to interpret them.

mod_eth = lm(Balance ~ Ethnicity, credit)
summ(mod_eth)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(2,397) = 0.04, p = 0.96
## R² = 0.00
## Adj. R² = -0.00 
## 
## Standard errors:OLS
## ---------------------------------------------------------
##                              Est.    S.E.   t val.      p
## ------------------------ -------- ------- -------- ------
## (Intercept)                531.00   46.32    11.46   0.00
## EthnicityAsian             -18.69   65.02    -0.29   0.77
## EthnicityCaucasian         -12.50   56.68    -0.22   0.83
## ---------------------------------------------------------

Here, ‘African American’ is the reference group against which the Asian and Caucasian groups are compared. However, we can change the reference group (level) by changing the order of the levels.

\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_2 + \epsilon_i & \text{if the } i^{th} \text{ person is Caucasian } (x_{i1}=0,~x_{i2}=1) \\ \beta_0 + \beta_1 + \epsilon_i & \text{if the } i^{th} \text{ person is Asian } (x_{i1}=1,~x_{i2}=0) \\ \beta_0 + \epsilon_i & \text{if the } i^{th} \text{ person is African American (reference group)} \end{cases} \]
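
An equivalent, slightly shorter way to change the reference level is relevel(); it is shown here only as an alternative to the factor() call used below.

credit$Ethnicity = relevel(credit$Ethnicity, ref = 'Asian')   # set 'Asian' as the reference level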

credit$Ethnicity = factor(credit$Ethnicity, levels = c('Asian', 'Caucasian', 'African American'))
mod_eth = lm(Balance ~ Ethnicity, credit)
summ(mod_eth)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(2,397) = 0.04, p = 0.96
## R² = 0.00
## Adj. R² = -0.00 
## 
## Standard errors:OLS
## ----------------------------------------------------------------
##                                     Est.    S.E.   t val.      p
## ------------------------------- -------- ------- -------- ------
## (Intercept)                       512.31   45.63    11.23   0.00
## EthnicityCaucasian                  6.18   56.12     0.11   0.91
## EthnicityAfrican American          18.69   65.02     0.29   0.77
## ----------------------------------------------------------------

\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_2 + \epsilon_i & \text{if the } i^{th} \text{ person is African American } (x_{i1}=0,~x_{i2}=1) \\ \beta_0 + \beta_1 + \epsilon_i & \text{if the } i^{th} \text{ person is Caucasian } (x_{i1}=1,~x_{i2}=0) \\ \beta_0 + \epsilon_i & \text{if the } i^{th} \text{ person is Asian (reference group)} \end{cases} \]

Multiple linear regression with both qualitative and quantitative predictors

Full model: the model with all the independent variables. Different combinations of the independent variables give a total of \(2^p\) possible models, so selecting the best model can be tricky. Stepwise regression can be done with step() or stepAIC() in R; these compare models by AIC, so the model with the lowest AIC (or BIC) is considered the best, which for a fixed number of predictors corresponds to the lowest RSS (\(\equiv\) largest \(R^2\)). In later sessions, however, we will discuss how to estimate cross-validated test errors to compare the models and select the best one.

mod_full = lm(Balance ~ ., credit)
summ(mod_full)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(11,388) = 750.34, p = 0.00
## R² = 0.96
## Adj. R² = 0.95 
## 
## Standard errors:OLS
## -----------------------------------------------------------------
##                                      Est.    S.E.   t val.      p
## ------------------------------- --------- ------- -------- ------
## (Intercept)                       -462.40   35.26   -13.11   0.00
## Income                              -7.80    0.23   -33.31   0.00
## Limit                                0.19    0.03     5.82   0.00
## Rating                               1.14    0.49     2.32   0.02
## Cards                               17.72    4.34     4.08   0.00
## Age                                 -0.61    0.29    -2.09   0.04
## Education                           -1.10    1.60    -0.69   0.49
## GenderFemale                       -10.65    9.91    -1.07   0.28
## StudentYes                         425.75   16.72    25.46   0.00
## MarriedYes                          -8.53   10.36    -0.82   0.41
## EthnicityCaucasian                  -6.70   12.12    -0.55   0.58
## EthnicityAfrican American          -16.80   14.12    -1.19   0.23
## -----------------------------------------------------------------

Best model selection using stepwise regression: step() or stepAIC()?

Summary shows only the final selected model.

mod_aic = stepAIC(mod_full, direction = 'both', trace = FALSE)
summ(mod_aic)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(6,393) = 1380.03, p = 0.00
## R² = 0.95
## Adj. R² = 0.95 
## 
## Standard errors:OLS
## ---------------------------------------------------
##                        Est.    S.E.   t val.      p
## ----------------- --------- ------- -------- ------
## (Intercept)         -493.73   24.82   -19.89   0.00
## Income                -7.80    0.23   -33.40   0.00
## Limit                  0.19    0.03     5.98   0.00
## Rating                 1.09    0.48     2.25   0.02
## Cards                 18.21    4.32     4.22   0.00
## Age                   -0.62    0.29    -2.14   0.03
## StudentYes           425.61   16.51    25.78   0.00
## ---------------------------------------------------
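
The same search can be run with a BIC penalty by setting k = log(n) in stepAIC() (a brief sketch; the default k = 2 corresponds to AIC, and trace = TRUE shows each step).

mod_bic = stepAIC(mod_full, direction = 'both', trace = FALSE, k = log(nrow(credit)))  # BIC penalty
formula(mod_bic)                                                                       # selected model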

Variable selection using ridge and lasso regression (shrinkage methods)

Ridge regression uses a tuning parameter called lambda (\(\lambda\)) that shrinks the regression coefficients towards zero. When \(\lambda=0\), ridge regression equals least squares (OLS), and as \(\lambda \rightarrow \infty\), all coefficients are shrunk towards zero. We need to find the optimum \(\lambda\) that gives the lowest RMSE. Please note that as \(\lambda\) increases, variance (how much the predictions are dispersed around the mean prediction) decreases but bias (how far the mean prediction deviates from the true mean, i.e. coefficient estimation error) increases. Ridge regression keeps all predictors in the model, unlike stepwise regression, which removes non-significant predictors; it shrinks coefficients towards zero but not exactly to zero. Lasso regression, in contrast, shrinks some coefficients exactly to zero when the tuning parameter \(\lambda\) is sufficiently large. Thus the lasso truly selects a subset of predictors and gives a more interpretable result than the ridge. Cross-validation helps to select the best value of \(\lambda\) (the regularization parameter). Please note that in ridge regression alpha = 0 (L2 penalty), whereas in lasso regression alpha = 1 (L1 penalty).
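
For reference, the two penalties differ only in the form of the shrinkage term added to the residual sum of squares (standard definitions, stated here for completeness):

\[ \text{Ridge: } \min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2, \qquad \text{Lasso: } \min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \]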

# Ridge regression: alpha = 0
# All data need to be numeric.
# credit <- data.frame(lapply(credit, function(x) if(is.factor(x) || is.character(x)) as.numeric(as.factor(x)) else x))

credit <- data.frame(lapply(credit, as.numeric))
set.seed(123)
y = credit %>% select(Balance) %>% scale(center = F, scale = F) %>%  as.numeric()
x = credit %>% select(-Balance)  %>% as.matrix()

lambda = seq(38, 40, length.out = 20)

control = trainControl(method = "cv", number = 10)

tuning = expand.grid(alpha = 0, lambda = lambda)

mod_ridge = train(x=x, y=y, method = 'glmnet', trControl = control, tuneGrid = tuning)

mod_ridge
## glmnet 
## 
## 400 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 359, 360, 360, 360, 361, 360, ... 
## Resampling results across tuning parameters:
## 
##   lambda    RMSE      Rsquared   MAE      
##   38.00000  117.7548  0.9437119   99.90069
##   38.10526  117.7548  0.9437119   99.90069
##   38.21053  117.7548  0.9437119   99.90069
##   38.31579  117.7548  0.9437119   99.90069
##   38.42105  117.7548  0.9437119   99.90069
##   38.52632  117.7562  0.9437117   99.90174
##   38.63158  117.7588  0.9437114   99.90365
##   38.73684  117.7614  0.9437112   99.90556
##   38.84211  117.7640  0.9437109   99.90747
##   38.94737  117.7666  0.9437106   99.90938
##   39.05263  117.7693  0.9437103   99.91129
##   39.15789  117.7719  0.9437100   99.91320
##   39.26316  117.7746  0.9437097   99.91512
##   39.36842  117.7772  0.9437094   99.91703
##   39.47368  117.7800  0.9437090   99.91903
##   39.57895  117.7896  0.9437068   99.92596
##   39.68421  117.8123  0.9436972   99.94350
##   39.78947  117.8495  0.9436778   99.97281
##   39.89474  117.8950  0.9436527  100.00891
##   40.00000  117.9406  0.9436276  100.04501
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 38.42105.
plot(mod_ridge)

# Lasso regression: alpha = 1
# All data need to be numeric.

# lambda should be sufficiently large to get 0 betas, find by trial and error
lambda = seq(-10,10, length.out = 20)

control = trainControl(method = "cv", number = 10)
tuning = expand.grid(alpha = 1, lambda = lambda)
mod_lasso = train(x=x, y=y, method = 'glmnet', trControl = control, tuneGrid = tuning)

mod_lasso
## glmnet 
## 
## 400 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 360, 360, 360, 359, 360, 360, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE     
##   -10.0000000   99.30002  0.9552521  80.30791
##    -8.9473684   99.30002  0.9552521  80.30791
##    -7.8947368   99.30002  0.9552521  80.30791
##    -6.8421053   99.30002  0.9552521  80.30791
##    -5.7894737   99.30002  0.9552521  80.30791
##    -4.7368421   99.30002  0.9552521  80.30791
##    -3.6842105   99.30002  0.9552521  80.30791
##    -2.6315789   99.30002  0.9552521  80.30791
##    -1.5789474   99.30002  0.9552521  80.30791
##    -0.5263158   99.30002  0.9552521  80.30791
##     0.5263158   99.30002  0.9552521  80.30791
##     1.5789474   99.50294  0.9551710  80.93982
##     2.6315789   99.84410  0.9550171  81.73106
##     3.6842105  100.24745  0.9548479  82.50507
##     4.7368421  100.72856  0.9546355  83.30046
##     5.7894737  101.32852  0.9543257  84.17740
##     6.8421053  102.11710  0.9538498  85.19073
##     7.8947368  103.06773  0.9532326  86.29724
##     8.9473684  104.14100  0.9525183  87.42997
##    10.0000000  105.32219  0.9517162  88.59304
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.5263158.
# May give warning because of missing values above certain lambdas.

# best lambda for ridge, the lambda_min
(best_lambda = mod_ridge$bestTune$lambda)
## [1] 38.42105
plot(mod_lasso)

# Best lambda can also be identified using 10-fold cross-validation
mod_ridge_cv = cv.glmnet(x, y, alpha = 0, nfolds = 10)
(best_lambda = mod_ridge_cv$lambda.min)
## [1] 39.65627
# coefficients
(ridge_coef = coef(mod_ridge$finalModel, s = best_lambda))
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) -742.8986477
## Income        -5.1766259
## Limit          0.1144897
## Rating         1.6610783
## Cards         15.7939335
## Age           -0.9548837
## Education     -0.4878440
## Gender        -4.8936724
## Student      381.4666857
## Married      -12.1695805
## Ethnicity     -7.1882522
# best lambda for lasso lambda.1se in the cross-validated model
mod_lasso_cv = cv.glmnet(x, y, alpha = 1, nfolds = 10)
(lambda_1se = mod_lasso_cv$lambda.1se)
## [1] 7.967869
# coefficients
(lasso_coef = coef(mod_lasso_cv, s = lambda_1se))
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) -869.9462203
## Income        -6.7535701
## Limit          0.1382901
## Rating         1.6760049
## Cards          9.9781334
## Age           -0.3161273
## Education      .        
## Gender         .        
## Student      394.2455495
## Married        .        
## Ethnicity      .

The variables with non-zero coefficients are selected for the final regression model. If you compare the stepwise, ridge, and lasso results, you will get a good idea of which variables should be selected as predictors for your model. For learning purposes, we select the predictors suggested by the stepwise regression. In practice, you can compare the RMSE of different models to select the best one. Stepwise regression suggests the following model: Balance ~ Income + Limit + Rating + Cards + Age + Student

mod_final = lm(Balance ~ Income + Limit + Rating + Cards + Age + Student, data = credit)
summ(mod_final, vif = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(6,393) = 1380.03, p = 0.00
## R² = 0.95
## Adj. R² = 0.95 
## 
## Standard errors:OLS
## ------------------------------------------------------------
##                        Est.    S.E.   t val.      p      VIF
## ----------------- --------- ------- -------- ------ --------
## (Intercept)         -919.34   30.75   -29.90   0.00         
## Income                -7.80    0.23   -33.40   0.00     2.78
## Limit                  0.19    0.03     5.98   0.00   229.24
## Rating                 1.09    0.48     2.25   0.02   230.87
## Cards                 18.21    4.32     4.22   0.00     1.44
## Age                   -0.62    0.29    -2.14   0.03     1.04
## Student              425.61   16.51    25.78   0.00     1.01
## ------------------------------------------------------------

Our \(H_0 : \beta_{Income} = \beta_{Limit} = \beta_{Rating} = \beta_{Cards} = \beta_{Age} = \beta_{Student} = 0\). If the associated p-value is < 0.05, we reject \(H_0\). We see that F(6,393) = 1380.03, p = 0.00, which indicates that at least one \(\beta \neq 0\). The p-values associated with the individual predictors indicate the significance of each particular predictor. We see that all the p-values are < 0.05, so all predictors have significant effects on the balance.

Multicollinearity: The VIFs of Limit and Rating are very high, so they are likely highly correlated with each other. Therefore, one of them can explain the dependent variable well without the help of the other. We will remove the variable with the highest VIF (Rating) and run the model again. This will not substantially reduce the \(R^2\).

Besides, we may need to remove non-significant variables from the model to preserve degrees of freedom. If the \(R^2\) is noticeably higher than the \(Adjusted~R^2\), that reduction (penalty) in the coefficient of determination is due to the inclusion of non-significant variables in the model.

mod_multcol = lm(Balance ~ Income + Limit + Cards + Age + Student, data = credit)
summ(mod_multcol, vif = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95 
## 
## Standard errors:OLS
## ----------------------------------------------------------
##                        Est.    S.E.   t val.      p    VIF
## ----------------- --------- ------- -------- ------ ------
## (Intercept)         -895.71   29.05   -30.83   0.00       
## Income                -7.76    0.23   -33.15   0.00   2.76
## Limit                  0.27    0.00    75.30   0.00   2.70
## Cards                 23.55    3.63     6.49   0.00   1.00
## Age                   -0.62    0.29    -2.12   0.03   1.04
## Student              428.38   16.55    25.89   0.00   1.00
## ----------------------------------------------------------

So, there are no substantial multicollinearity issues now. However, we need to check the residuals to confirm whether we need to calculate robust standard errors. The residual plots show that the relationship is not linear, the variance is not homoskedastic, and the residuals are not normally distributed.

However, for learning purposes, we continue with this model and interpret it using robust standard errors.

# Residual plots
par(mfrow = c(1,2))
plot(mod_multcol, which = 1)
plot(mod_multcol, which = 2)

par(mfrow = c(1,1))

The distribution of the dependent variable is also not normal.

par(mfrow = c(1,2))
boxplot(credit$Balance)
qqPlot(credit$Balance)

## [1] 324  29
par(mfrow = c(1,1))
# Check heteroskedasticity. p < 0.05, so heteroskedasticity exists. Calculate robust standard error.
bptest(mod_multcol)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_multcol
## BP = 128.37, df = 5, p-value < 2.2e-16
summ(mod_multcol, confint = FALSE, robust = TRUE, scale = FALSE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95 
## 
## Standard errors: Robust, type = HC3
## ---------------------------------------------------
##                        Est.    S.E.   t val.      p
## ----------------- --------- ------- -------- ------
## (Intercept)         -895.71   30.47   -29.40   0.00
## Income                -7.76    0.22   -35.55   0.00
## Limit                  0.27    0.00    59.00   0.00
## Cards                 23.55    3.32     7.10   0.00
## Age                   -0.62    0.30    -2.08   0.04
## Student              428.38   15.06    28.44   0.00
## ---------------------------------------------------

This result shows the t-values based on heteroskedasticity-robust standard errors. Notice that the coefficients are the same. The regression coefficient of income is -7.76, which indicates that a one-unit (US$1,000) increase in income decreases the balance by 7.76 dollars, holding the other variables constant. Similarly, a student is likely to have a balance 428.38 dollars greater than a comparable non-student, holding the other variables constant.

We still do not know which variable has the largest effect on the balance. For this we need to calculate standardized regression coefficients (unit-free, using scale = TRUE) to compare the variables.

summ(mod_multcol, scale = T, robust = T)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95 
## 
## Standard errors: Robust, type = HC3
## ---------------------------------------------------
##                        Est.    S.E.   t val.      p
## ----------------- --------- ------- -------- ------
## (Intercept)          477.18    5.37    88.93   0.00
## Income              -273.48    7.69   -35.55   0.00
## Limit                614.32   10.41    59.00   0.00
## Cards                 32.29    4.55     7.10   0.00
## Age                  -10.73    5.17    -2.08   0.04
## Student              428.38   15.06    28.44   0.00
## ---------------------------------------------------
## 
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
paste('sd of age =', sd(credit$Age))
## [1] "sd of age = 17.2498067622027"
paste('sd of balance =', sd(credit$Balance)) 
## [1] "sd of balance = 459.758877389383"

Now we see that the effect of the credit limit on the balance is the largest; the smallest effect is that of age.

The interpretation of a standardized coefficient is somewhat different from that of a non-standardized coefficient. Because only the predictors are scaled here (the outcome remains in its original units, as noted in the output), we conclude that a one standard deviation increase in age decreases the balance by 10.73 dollars, holding the other variables constant. We calculate the standard deviations of age (17.2) and balance (459.8) to understand the practical implications of the standardized coefficients.
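
A quick arithmetic check of the scaled output, reusing the unstandardized coefficients from above: the scaled estimate for Age is approximately the unstandardized coefficient multiplied by the standard deviation of Age.

coef(mod_multcol)['Age'] * sd(credit$Age)   # roughly -0.62 * 17.25 = -10.7 dollars of Balance per SD of Age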

We must state "holding the other variables constant" because this rate of change may not be realized if the other variables are allowed to change. However, for writing the prediction equation and for easier interpretation, we use the non-standardized regression coefficients.

summ(mod_multcol)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95 
## 
## Standard errors:OLS
## ---------------------------------------------------
##                        Est.    S.E.   t val.      p
## ----------------- --------- ------- -------- ------
## (Intercept)         -895.71   29.05   -30.83   0.00
## Income                -7.76    0.23   -33.15   0.00
## Limit                  0.27    0.00    75.30   0.00
## Cards                 23.55    3.63     6.49   0.00
## Age                   -0.62    0.29    -2.12   0.03
## Student              428.38   16.55    25.89   0.00
## ---------------------------------------------------

This model is significant, as indicated by the F value, and it explains 95% of the variance in the balance. We can also estimate the individual contributions of the variables through the anova(model) output (sums of squares), similar to an ANOVA.

The model prediction equation is: \(\hat y = -895.71 - 7.76 \cdot Income + 0.27 \cdot Limit + 23.55 \cdot Cards - 0.62 \cdot Age + 428.38 \cdot Student\). Using this equation, we can predict an unknown balance by plugging in the values of the independent variables.
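
predict() does this plugging-in for us. The customer values below are made up purely for illustration; note that after the as.numeric() conversion above, Student is coded 1 = No and 2 = Yes.

new_customer = data.frame(Income = 50, Limit = 5000, Cards = 3, Age = 40, Student = 1)  # hypothetical customer
predict(mod_multcol, newdata = new_customer, interval = 'prediction')                   # point prediction with interval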

RMSE of this model = 98.37 on the whole dataset (known data)

sqrt(mean(mod_multcol$residuals^2)) 
## [1] 98.37279

The residuals are not normally distributed, they are not homoskedastic, and they are not random: they remain correlated with the independent variables, and the relationship with the dependent variable is not fully linear. So the model is not as good as we would like. The QQ plot of the residuals exhibits systematic deviation from the diagonal line in the tails.

A good model should have the following qualities:

  1. The dependent variable is continuous and normally distributed, checked with a boxplot and a QQ plot.
  2. The independent variables are continuous (or properly coded categorical variables).
  3. The independent variables are orthogonal, meaning they are uncorrelated with each other; the VIFs are low (less than 2).
  4. The independent variables are linearly related to the dependent variable. This has been checked with the residuals vs fitted plot; the red line lies nearly straight along the zero line.
  5. The variance of the residuals is homoskedastic. The residual plot shows a fairly equal spread of points on both sides of the zero line.
  6. The residuals are normally distributed, checked with the residual QQ plot; the residuals lie along the diagonal line.
  7. The residuals are random and uncorrelated with the independent variables; the QQ plot of the residuals does not show any systematic deviation from the diagonal line.
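
One compact way to run these checks for the final model (a sketch: shapiro.test() is from base stats, bptest() from lmtest, and vif() and durbinWatsonTest() from car, all available with the packages loaded above):

shapiro.test(residuals(mod_multcol))      # normality of the residuals (item 6)
bptest(mod_multcol)                       # homoskedasticity of the residuals (item 5)
vif(mod_multcol)                          # multicollinearity among the predictors (item 3)
durbinWatsonTest(mod_multcol)             # autocorrelation of the residuals (item 7)
par(mfrow = c(2, 2)); plot(mod_multcol); par(mfrow = c(1, 1))   # standard residual plots (items 4-6)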