\(\text{Independent variable} = x\)
\(i_{th}~\text{observation} = x_i\)
\(\text{Number (frequency) of observations} = n\)
\(\text{Mean} = \bar{x}\)
\(\text{Mean (M)} = \frac{Total}{Frequency} = \frac{\sum x_i}{n},~i = 1~to~n\)
\(\text{Deviation from mean} = x_i - \bar{x}\)
\(\text{Squared deviation from mean} = (x_i - \bar{x})^2 \Rightarrow \text{Total Sum of Squares (TSS)} = \sum(x_i - \bar{x})^2\)
\(\text{Mean sum of squares} = \frac{1}{n}\sum(x_i - \bar{x})^2 = \text{Variance}\)
\(\text{Positive square root of variance }= \sqrt{\frac{1}{n}\sum(x_i - \bar{x})^2} = \text{Standard~deviation~(SD)}\)
Standard deviation is the deviation measured with reference to the mean of the observations.
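A minimal sketch of these definitions in base R, using a small made-up vector (note that R's built-in var() and sd() divide by n - 1, so the 1/n versions above are computed by hand):
x = c(4, 8, 6, 5, 3, 7)            # hypothetical observations
n = length(x)                      # number (frequency) of observations
x_bar = sum(x) / n                 # mean, identical to mean(x)
tss = sum((x - x_bar)^2)           # total sum of squares
variance = tss / n                 # mean sum of squares (variance, 1/n version)
sd_x = sqrt(variance)              # standard deviation (1/n version)
c(mean = x_bar, TSS = tss, variance = variance, SD = sd_x)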
Regression equation: \(y_i = \alpha + \beta X_i + \epsilon_i\)
\(\text{Dependent~variable} = y_i\) [observed values]
\(\text{Estimated value of dependent variable} = \hat{y_i} = \alpha + \beta X_i ~~\therefore~\text{Residual or Error},~ \epsilon_i = y_i - \hat{y_i}\)
\(\text{Mean value of dependent variable} = \bar{y}\)
\(\text{Total Sum of Squares (TSS)} = \sum(y_i - \bar{y})^2\)
\(\text{Mean Sum of Squares (MSS)} = \frac{1}{n} \sum(y_i - \bar{y})^2\)
Now, consider the deviation of the observed values from the predicted (modeled or estimated) values.
\(\text{Deviation of observation from estimated values} = y_i - \hat{y_i} = \text{Error = Residuals}\)
Mean Absolute Error (MAE) = \(\frac{1}{n} \sum{|\hat{y}_i - y_i|}\) = Mean of the absolute residuals
\(\text{Residual Squares} = (y_i - \hat{y_i})^2 \Rightarrow \text{Residual Sum of squares (RSS)} = \sum(y_i - \hat{y_i})^2\)
\(\text{Residual Mean Sum of Squares (RMSS)} = \frac{1}{n}\sum(y_i - \hat{y_i})^2 = MSE_{error} = MSE_{residual}\)
\(\sqrt{RMSS} = \sqrt{\frac{1}{n}\sum(y_i - \hat{y_i})^2} = RMSE~(Root~Mean~Square~Error)\)
\(\text{MSE} = \frac{1}{n}\sum(\hat{y}_i - y_i)^2 = \frac{1}{n}\sum(\hat{y}_i - \bar{\hat{y}})^2 + \frac{1}{n}\sum(\bar{\hat{y}} - y_i)^2 + \text{Irreducible Error}\)
\(\text{MSE} = \frac{1}{n}\sum(y_i - \hat{y}_i)^2 = \text{Variance} + \text{Bias}^2 + \text{Irreducible Error}\)
Variance indicates the mean squared deviation (MSS) of the individual predicted observations from the mean predicted observation. Variance \(= \frac{1}{n}\sum(y_{pred}−\bar{y}_{pred})^2\)
\(Bias^2\) indicates the squared deviation (MSS) of the mean predicted observation from the true observations: \(Bias^2 = \frac{1}{n} \sum(\bar{y}_{pred}−y_{true})^2\). In most practical applications this reduces to \(Bias^2 = (\bar{y}_{pred} - \bar{y}_{true})^2\).
\(High~Variance\): Predictions are very sensitive to the input (training) data. The model is overfitting, meaning it fits the training data well but performs poorly on the test data.
\(High~Bias^2\): The model has not captured the underlying pattern in the data well. The model is underfitting, typically because too simple a model has been used where a more complex one is needed.
These are the basic equations used frequently in different analyses. Now we will see how the bivariate formulas of correlation and regression are derived. Let us start with the formula of variance.
\(var_x = \frac{1}{n}\sum(x_i - \bar{x})^2, var_y = \frac{1}{n}\sum(y_i - \bar{y})^2\)
\(\text{Sum product of x and y},~sp_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})\)
\(\text{Covariance},~cov_{xy} = \frac{1}{n-1}\sum(x_i - \bar{x})(y_i - \bar{y}) \Rightarrow \text{How y varies with x?}\)
\(\text{Correlation},~r = \frac{cov_{xy}}{sd_x sd_y}\Rightarrow \text{Unitless, range: -1 to +1}\)
\(\text{Regression},~\beta = \frac{cov_{xy}}{var_x} = r\frac{sd_y}{sd_x} = Slope = \text{change in y per x}\Rightarrow y = f(x)\)
\(\text{Standardized regression coefficient},~\beta^*~or~b~or~B = \beta\frac{sd_x}{sd_y} = change~in~sd_y~per~sd_x\)
For standardized (z-scores, centered and scaled) variables, \(\beta^* =\beta\frac{sd_x}{sd_y}= r\frac{sd_y}{sd_x}\frac{sd_x}{sd_y} = r\)
\(\text{Coefficient of determination, }R^2 = 1-\frac{RSS}{TSS} = \frac{TSS-RSS}{TSS} = \frac{SS_{effect}}{SS_{total}}\)
\(Adjusted~R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)},~\text{for n observations, d predictors}\)
\(ANOVA \Rightarrow F=\frac{MSE_{effect}}{MSE_{residual}},~\eta^2 = \frac{SS_{effect}}{SS_{total}} \Rightarrow \text{ Similar to }R^2\)
\(ANOVA \Rightarrow \text{Model CV(%)} = \frac{RMSE}{\bar{y}}\times100, \text{Dependent variable CV(%)} = \frac{sd_y}{\bar{y}}\times 100\)
\(\text{Variance Inflation Factor, } VIF = \frac{1}{1-R^2}\), if greater than 4, the model may have multicollinearity.
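These identities are easy to verify numerically. The sketch below uses two short made-up vectors and the sample (n - 1) denominator throughout; the ratios r and \(\beta\) come out the same whichever denominator is used, as long as it is applied consistently.
x = c(1, 2, 3, 4, 5, 6)                        # hypothetical predictor
y = c(2, 1, 4, 3, 6, 5)                        # hypothetical response
sp_xy  = sum((x - mean(x)) * (y - mean(y)))    # sum of products
cov_xy = sp_xy / (length(x) - 1)               # covariance by hand
r      = cov_xy / (sd(x) * sd(y))              # correlation by hand
beta   = cov_xy / var(x)                       # slope = cov/var = r * sd(y)/sd(x)
c(cov = cov_xy, r = r, beta = beta)
cov(x, y); cor(x, y); coef(lm(y ~ x))[2]       # built-in equivalents agree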
Load the required libraries using the p_load() function of the pacman package. First install the pacman package, if needed, with install.packages('pacman') in the console, because you do not need to install it each time. Set the working directory properly before continuing the analysis.
options(repr.plot.width=10, repr.plot.height=7) # for notebook figure width
pacman::p_load(ISLR, glmtoolbox, DALEX, MASS, lmtest, jtools, car, vip, cowplot, glmnet, caret, tidymodels, tidyverse)
Import dataset:
data(Credit)
write.csv(Credit, 'Credit.csv', row.names = FALSE)
credit = read.csv('Credit.csv')[-1] # excluding the ID column
str(credit)
## 'data.frame': 400 obs. of 11 variables:
## $ Income : num 14.9 106 104.6 148.9 55.9 ...
## $ Limit : int 3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
## $ Rating : int 283 483 514 681 357 569 259 512 266 491 ...
## $ Cards : int 2 3 4 3 2 4 2 2 5 3 ...
## $ Age : int 34 82 71 36 68 77 37 87 66 41 ...
## $ Education: int 11 15 11 11 16 10 12 9 13 19 ...
## $ Gender : chr " Male" "Female" " Male" "Female" ...
## $ Student : chr "No" "Yes" "No" "No" ...
## $ Married : chr "Yes" "Yes" "No" "No" ...
## $ Ethnicity: chr "Caucasian" "Asian" "Asian" "Asian" ...
## $ Balance : int 333 903 580 964 331 1151 203 872 279 1350 ...
The name of the dataset is credit. It contains 400 observations and 11 variables. You can see the dataset in the Environment tab by clicking on it. Some of the variables are read as chr (character) although they are actually factors. We need to convert them to factors before continuing the analysis.
factors = c('Gender', 'Student', 'Married', 'Ethnicity')
credit = credit %>% mutate_at(factors, factor)
str(credit)
## 'data.frame': 400 obs. of 11 variables:
## $ Income : num 14.9 106 104.6 148.9 55.9 ...
## $ Limit : int 3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
## $ Rating : int 283 483 514 681 357 569 259 512 266 491 ...
## $ Cards : int 2 3 4 3 2 4 2 2 5 3 ...
## $ Age : int 34 82 71 36 68 77 37 87 66 41 ...
## $ Education: int 11 15 11 11 16 10 12 9 13 19 ...
## $ Gender : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
## $ Student : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
## $ Married : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
## $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
## $ Balance : int 333 903 580 964 331 1151 203 872 279 1350 ...
We will use the customer Balance (in US$) as the dependent variable and Income (in thousands of US$) as the independent variable. We want to check the effect of Income on Balance.
In simple linear regression, there is only one independent variable; in multiple linear regression, there is more than one. The dependent variable in linear regression must be continuous or discrete, and the residuals should be normally distributed with equal variance (homoskedastic). We need to estimate the coefficients of the regression equation.
Regression equation: \(y_i = \alpha + \beta X_i + \epsilon_i\)
Fit the model using the lm() function. We see that the model is significant, indicated by F(1,398) = 108.99 with p = 0.00. This means that the model significantly explains the variance in the balance. Furthermore, \(R^2=0.21\) indicates that 21% of the variance in the balance can be explained by the income of the customer. The estimate (regression coefficient) of income is 6.05, which is highly significant (p < 0.001). This means that if income increases by one unit (US$1,000), the balance increases by US$6.05. The intercept (\(\alpha = 246.51\)) indicates the expected balance when income (the independent variable) is zero.
mod = lm(Balance ~ Income, credit)
summ(mod, confint = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,398) = 108.99, p = 0.00
## R² = 0.21
## Adj. R² = 0.21
##
## Standard errors:OLS
## ------------------------------------------------------------
## Est. 2.5% 97.5% t val. p
## ----------------- -------- -------- -------- -------- ------
## (Intercept) 246.51 181.25 311.78 7.43 0.00
## Income 6.05 4.91 7.19 10.44 0.00
## ------------------------------------------------------------
# Summary of the variables
Credit %>% select(Balance, Income) %>% summary()
## Balance Income
## Min. : 0.00 Min. : 10.35
## 1st Qu.: 68.75 1st Qu.: 21.01
## Median : 459.50 Median : 33.12
## Mean : 520.01 Mean : 45.22
## 3rd Qu.: 863.00 3rd Qu.: 57.47
## Max. :1999.00 Max. :186.63
We know the actual average balance (520.01 dollars). We can predict this (\(\hat y\)) by the model: \(\hat y = 246.51 + 6.05 \cdot Income\)
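As a quick sanity check (a short sketch using the mod object fitted above), plugging the mean income into the fitted equation reproduces the mean balance of 520.01 dollars, because an OLS regression line always passes through \((\bar{x}, \bar{y})\):
coef(mod)                                         # intercept and slope used in the equation above
coef(mod)[1] + coef(mod)[2] * mean(credit$Income) # prediction at the mean income
mean(credit$Balance)                              # actual mean balance (520.01)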
We can generate the predictions with the predict() function. We also know the actual balance, so we can calculate the residuals (error, \(\epsilon\)) for each observation by subtracting the predicted values from the actual values.
actual = credit$Balance # actual values
credit$predicted = predict(mod) # add to the dataset
credit$residuals = actual - credit$predicted # add to the dataset
data.frame(actual = actual, predicted = credit$predicted, residual = credit$residuals)[1:5,]
## actual predicted residual
## 1 333 336.5809 -3.58093
## 2 903 887.7925 15.20752
## 3 580 879.1312 -299.13122
## 4 964 1147.2612 -183.26122
## 5 331 584.5094 -253.50939
Let us visualize the residuals. The blue dots are the actual observations. The red line is the regression line (the predicted or estimated y values) based on the estimated coefficients. The arrows represent the deviations (residuals) of the observations from the predicted y values (the red line). The residuals of the model can also be accessed directly with the mod$residuals command.
# Only 301 to 400 observations are plotted for better visualization
ggplot(credit[301:400, ], aes(x = Income, y = Balance)) +
geom_point(color = 'blue', shape = 19, size = 5, alpha = 0.5) +
geom_line(aes(y = predicted), linewidth = 2, color = 'red') + # predicted values (regression line)
geom_segment(aes(xend = Income, yend = predicted), color = 'grey50', linewidth = 1,
arrow = arrow(type='open', ends = 'first', length = unit(0.1, 'inches'))) +
labs(x = 'Income (US$)', y = 'Balance (US$)') +
theme_test()
# +scale_x_continuous(labels = ~ .x/1000) +
# scale_y_continuous(labels = ~ .x/1000)
For the regression to be valid, the residuals should be normally distributed and homoskedastic (free of heteroskedasticity). The fitted model object contains several components.
names(mod)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
These components can be accessed with mod$... commands. For example, mod$fitted.values contains the predicted values.
mod$fitted.values[1:5] %>% round(digits = 1)
## 1 2 3 4 5
## 336.6 887.8 879.1 1147.3 584.5
Let us examine the residuals to check their normality and heteroskedasticity.
boxplot(mod$residuals, col = 'red')
The boxplot suggests the residuals are roughly symmetric (consistent with normality), but with many outliers in both tails.
qqPlot(mod)
## [1] 97 223
The QQ-plot, however, suggests the residuals may not be normally distributed, with outliers in the upper tail and systematic deviations in the middle.
plot(mod, which = 1)
The red line should align with the dotted horizontal line, indicating a linear relationship between the independent and dependent variables; this is one of the important assumptions of linear regression. The plot should also show equal variance of the residuals across the fitted values to meet the assumption of homoskedasticity.
If the spread of the residuals around the dotted line is not homogeneous across the fitted values (for example, a funnel-shaped spread), the variances are heteroskedastic, which is a problem for linear regression. In that case we can estimate heteroskedasticity-robust standard errors, which give more reliable p-values; the coefficient estimates themselves are unchanged and remain unbiased.
ggplot(credit, aes(y = residuals, x = predicted)) +
geom_point(shape = 1, size = 2) + theme_test() +
geom_hline(yintercept = 0, linetype = 'dotted', linewidth = 1) +
geom_hline(yintercept = 700, linetype = 'dashed', linewidth = 1, color = 'blue') +
geom_hline(yintercept = -700, linetype = 'dashed', linewidth = 1, color = 'blue') +
geom_abline(intercept = -10, slope = 1.0, linewidth = 2, linetype = 'dotted') +
geom_abline(intercept = 10, slope = -1.0) +
geom_abline(intercept = 1300, slope = -.50)
Heteroskedasticity can also be checked with the Breusch-Pagan (BP) test (\(H_0\): the residuals are homoskedastic). Our model shows heteroskedasticity at the 10% level of significance (p = 0.07).
To get robust standard errors when the residuals show a funnel-shaped (heteroskedastic) spread, we can use the summ(mod, robust = TRUE, confint = TRUE) command.
bptest(mod)
##
## studentized Breusch-Pagan test
##
## data: mod
## BP = 3.2191, df = 1, p-value = 0.07279
summ(mod, robust = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,398) = 108.99, p = 0.00
## R² = 0.21
## Adj. R² = 0.21
##
## Standard errors: Robust, type = HC3
## ------------------------------------------------------------
## Est. 2.5% 97.5% t val. p
## ----------------- -------- -------- -------- -------- ------
## (Intercept) 246.51 182.22 310.81 7.54 0.00
## Income 6.05 4.84 7.25 9.88 0.00
## ------------------------------------------------------------
The standard error of the regression (residual standard error) can be accessed with summary(mod)$sigma.
summary(mod)$sigma %>% round(2)
## [1] 407.86
Bias-variance trade-off: an illustrative figure can be found at https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/
mod = lm(Balance ~ Income, credit)
predicted = predict(mod)
average_predicted = mean(predicted)
true = credit$Balance
residuals = mod$residuals
Mean square error (MSE): The mean square error is the average of the squared differences between the actual and predicted values, i.e. the average of the squared residuals. The lower the MSE, the better the model. The MSE is calculated by mean(mod$residuals^2).
Root mean square error (RMSE): The root mean square error is the square root of the MSE (the average of the squared residuals). The lower the RMSE, the better the model. The RMSE is calculated by sqrt(mean(mod$residuals^2)).
# RMSE in tidy way
mod$residuals^2 %>% mean() %>% sqrt() %>% round(2)
## [1] 406.84
# Other error indicators
(MSE = mean(residuals^2))
## [1] 165521.9
(RMSE = sqrt(MSE))
## [1] 406.8438
(variance = mean((predicted - average_predicted)^2))
## [1] 45327.92
(bias2 = mean((average_predicted - true)^2))
## [1] 210849.8
(irreducible_error = MSE - variance - bias2)
## [1] -90655.84
# RMSE is not equal to the standard deviation of the residuals, which takes degrees of freedom into account.
# Similarly, variance here is not equal to var(residuals), which also takes degrees of freedom into account.
# Note: this plug-in decomposition on a single fitted model does not recover the theoretical
# bias-variance split, which is why the 'irreducible error' term comes out negative here.
(MSE = bias2 + variance + irreducible_error)
## [1] 165521.9
\(y = \alpha + \beta_1X_1 + \beta_2X_2 + ... ... + \beta_nX_n + \epsilon\)
In multiple linear regression, there is more than one independent variable. The assumptions are:
The dependent (criterion, response) variable is continuous (or discrete) and normally distributed.
The independent (predictor) variables can be continuous (covariate or quantitative), discrete, or categorical (factor or qualitative) [James, et al. 2013, p.83]
The independent variables are orthogonal, meaning they are uncorrelated with each other. In reality, they may be correlated. If the correlation between two variables is strong, a multicollinearity problem may arise: either of the correlated variables can explain the same variance, so we need to drop the one with the higher VIF (variance inflation factor). A high VIF (> 4, 6, or at most 10) indicates multicollinearity. The formula is related to the coefficient of determination \(R^2\): \(VIF = \frac{1}{1-R^2}\), where \(R^2\) is obtained by regressing that predictor on the other predictors (see the sketch after this list).
The independent variables are linearly correlated with the dependent variables. This is checked by the residuals vs fitted plot. The residual plot should show a straight line along the zero line.
The variance of the residuals should be homoskedastic, i.e. constant variances of the error terms. The residual plot should show the spread of the residuals fairly equal in number of points at both sides of the zero line.
The residuals should be uncorrelated and normally distributed, which is checked with the residual qqPlot. The residuals should lie along the diagonal line.
The residuals should be random and uncorrelated with the independent variables. If the qqPlot of the residuals shows any systematic deviation from the diagonal line, this indicates that unexplained variance remains in the residuals and the independent variables have not captured all the variance of the dependent variable.
Outliers usually have a small effect on the regression line, but high-leverage points can pull the regression line towards them.
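To make the VIF formula concrete, here is a minimal sketch that computes it by hand for one predictor of the credit data loaded above: regress that predictor on the other quantitative predictors and apply \(VIF = \frac{1}{1-R^2}\). The result should be comparable to car::vif() applied to a model containing the same predictors.
# VIF of Limit, computed from its R-squared against the other quantitative predictors
r2_limit = summary(lm(Limit ~ Income + Rating + Cards + Age + Education, data = credit))$r.squared
1 / (1 - r2_limit)   # large values signal multicollinearity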
Let us explore the dataset.
credit = read.csv('Credit.csv')[-1] # excluding the ID column
factors = c('Gender', 'Student', 'Married', 'Ethnicity')
credit[factors] = lapply(credit[factors], as.factor)
scatterplotMatrix(formula = ~ Balance + Age + Cards + Education + Income + Limit + Rating, data = credit,
smooth = TRUE, regLine = FALSE, ellipse = FALSE, plot.points = TRUE,
cex = 1, cex.axis = 2, cex.labels = 2)
summary(credit)
## Income Limit Rating Cards
## Min. : 10.35 Min. : 855 Min. : 93.0 Min. :1.000
## 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2 1st Qu.:2.000
## Median : 33.12 Median : 4622 Median :344.0 Median :3.000
## Mean : 45.22 Mean : 4736 Mean :354.9 Mean :2.958
## 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2 3rd Qu.:4.000
## Max. :186.63 Max. :13913 Max. :982.0 Max. :9.000
## Age Education Gender Student Married
## Min. :23.00 Min. : 5.00 Male :193 No :360 No :155
## 1st Qu.:41.75 1st Qu.:11.00 Female:207 Yes: 40 Yes:245
## Median :56.00 Median :14.00
## Mean :55.67 Mean :13.45
## 3rd Qu.:70.00 3rd Qu.:16.00
## Max. :98.00 Max. :20.00
## Ethnicity Balance
## African American: 99 Min. : 0.00
## Asian :102 1st Qu.: 68.75
## Caucasian :199 Median : 459.50
## Mean : 520.01
## 3rd Qu.: 863.00
## Max. :1999.00
In this example, we will use Student as the independent factor and Balance as the criterion variable.
mod_qual = lm(Balance ~ Student, credit)
summ(mod_qual)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,398) = 28.62, p = 0.00
## R² = 0.07
## Adj. R² = 0.06
##
## Standard errors:OLS
## --------------------------------------------------
## Est. S.E. t val. p
## ----------------- -------- ------- -------- ------
## (Intercept) 480.37 23.43 20.50 0.00
## StudentYes 396.46 74.10 5.35 0.00
## --------------------------------------------------
We see that the regression coefficient (\(\beta_1\)) is 396.46 dollars. It means that, on average, a student customer has a balance 396.46 dollars higher than a non-student customer, whose average balance is the intercept (\(\beta_0\) = 480.37 dollars). Our \(H_0: \beta_1 = 0\) is rejected as p < 0.05. The prediction equation is:
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & if~the~i^{th}~person~is~a~student~(x_i = 1) \\ \beta_0 + \epsilon_i & if~the~i^{th}~person~is~not~a~student~(x_i = 0) \end{cases} \]
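Since Student is the only predictor, the intercept and the StudentYes coefficient are simply the two group means in disguise; a quick sketch to confirm this, using dplyr (loaded via tidyverse):
credit %>%
  group_by(Student) %>%
  summarise(mean_balance = mean(Balance))   # No: 480.37 (the intercept); Yes: 480.37 + 396.46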
Now we will fit the model of balance on ethnicity. Though the results are not significant, we can still use it to illustrate the interpretation.
mod_eth = lm(Balance ~ Ethnicity, credit)
summ(mod_eth)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(2,397) = 0.04, p = 0.96
## R² = 0.00
## Adj. R² = -0.00
##
## Standard errors:OLS
## ---------------------------------------------------------
## Est. S.E. t val. p
## ------------------------ -------- ------- -------- ------
## (Intercept) 531.00 46.32 11.46 0.00
## EthnicityAsian -18.69 65.02 -0.29 0.77
## EthnicityCaucasian -12.50 56.68 -0.22 0.83
## ---------------------------------------------------------
Here, ‘African American’ is the reference group against which Asian and Caucasian are compared. However, we can change the reference group (level) by reordering the factor levels.
\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_2 + \epsilon_i & if~the~i^{th}~person~is~Caucasian~(x_{i2}=1) \\ \beta_0 + \beta_1 + \epsilon_i & if~the~i^{th}~person~is~Asian~(x_{i1}=1) \\ \beta_0 + \epsilon_i & if~the~i^{th}~person~is~African~American~(reference~group) \end{cases} \]
credit$Ethnicity = factor(credit$Ethnicity, levels = c('Asian', 'Caucasian', 'African American'))
mod_eth = lm(Balance ~ Ethnicity, credit)
summ(mod_eth)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(2,397) = 0.04, p = 0.96
## R² = 0.00
## Adj. R² = -0.00
##
## Standard errors:OLS
## ----------------------------------------------------------------
## Est. S.E. t val. p
## ------------------------------- -------- ------- -------- ------
## (Intercept) 512.31 45.63 11.23 0.00
## EthnicityCaucasian 6.18 56.12 0.11 0.91
## EthnicityAfrican American 18.69 65.02 0.29 0.77
## ----------------------------------------------------------------
\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_2 + \epsilon_i & if~the~i^{th}~person~is~African~American~(x_{i2}=1) \\ \beta_0 + \beta_1 + \epsilon_i & if~the~i^{th}~person~is~Caucasian~(x_{i1}=1) \\ \beta_0 + \epsilon_i & if~the~i^{th}~person~is~Asian~(reference~group) \end{cases} \]
Full model: the model with all the independent variables. Different combinations of independent variables can give a total of \(2^p\) models, so selecting the best model can be tricky. Stepwise regression can be done using step() or stepAIC() in R. Stepwise regression compares the candidate models, favoring the lowest RMSE (\(\equiv\) largest \(R^2\)); the model with the lowest AIC or BIC is also considered the best. In later sessions, we will discuss how to estimate cross-validated test errors to compare the models and select the best one.
mod_full = lm(Balance ~ ., credit)
summ(mod_full)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(11,388) = 750.34, p = 0.00
## R² = 0.96
## Adj. R² = 0.95
##
## Standard errors:OLS
## -----------------------------------------------------------------
## Est. S.E. t val. p
## ------------------------------- --------- ------- -------- ------
## (Intercept) -462.40 35.26 -13.11 0.00
## Income -7.80 0.23 -33.31 0.00
## Limit 0.19 0.03 5.82 0.00
## Rating 1.14 0.49 2.32 0.02
## Cards 17.72 4.34 4.08 0.00
## Age -0.61 0.29 -2.09 0.04
## Education -1.10 1.60 -0.69 0.49
## GenderFemale -10.65 9.91 -1.07 0.28
## StudentYes 425.75 16.72 25.46 0.00
## MarriedYes -8.53 10.36 -0.82 0.41
## EthnicityCaucasian -6.70 12.12 -0.55 0.58
## EthnicityAfrican American -16.80 14.12 -1.19 0.23
## -----------------------------------------------------------------
We use stepAIC() here (step() works similarly); the summary shows only the final selected model.
mod_aic = stepAIC(mod_full, direction = 'both', trace = FALSE)
summ(mod_aic)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(6,393) = 1380.03, p = 0.00
## R² = 0.95
## Adj. R² = 0.95
##
## Standard errors:OLS
## ---------------------------------------------------
## Est. S.E. t val. p
## ----------------- --------- ------- -------- ------
## (Intercept) -493.73 24.82 -19.89 0.00
## Income -7.80 0.23 -33.40 0.00
## Limit 0.19 0.03 5.98 0.00
## Rating 1.09 0.48 2.25 0.02
## Cards 18.21 4.32 4.22 0.00
## Age -0.62 0.29 -2.14 0.03
## StudentYes 425.61 16.51 25.78 0.00
## ---------------------------------------------------
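Since model selection by AIC or BIC was mentioned above, here is a quick sketch comparing the full and the stepwise-selected models directly (base R's AIC() and BIC() accept several fitted models at once):
AIC(mod_full, mod_aic)   # lower AIC is preferred; stepAIC minimizes this criterion
BIC(mod_full, mod_aic)   # BIC penalizes the number of parameters more heavily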
Ridge regression uses a tuning parameter called lambda (\(\lambda\)), which shrinks the regression coefficients towards zero. When \(\lambda=0\), ridge regression equals least squares (OLS), and as \(\lambda \to \infty\), all coefficients are shrunk towards zero. We need to find the optimum \(\lambda\) that gives the lowest RMSE. Note that as \(\lambda\) increases, variance (how dispersed the predictions are around the mean prediction) decreases, but bias (how far the mean prediction is from the true mean; coefficient estimation error) increases. Ridge regression keeps all predictors in the model, unlike stepwise regression, which removes non-significant predictors. Ridge regression shrinks coefficients towards zero (but not exactly to zero), whereas lasso regression shrinks some coefficients exactly to zero when the tuning parameter \(\lambda\) is sufficiently large. Thus the lasso truly performs subset selection, and its result is more interpretable than the ridge. Cross-validation helps to select the best value of \(\lambda\) (the regularization parameter). Note that in ridge regression alpha = 0 (L2 penalty), whereas in lasso regression alpha = 1 (L1 penalty).
# Ridge regression: alpha = 0
# All data need to be numeric.
# credit <- data.frame(lapply(credit, function(x) if(is.factor(x) || is.character(x)) as.numeric(as.factor(x)) else x))
credit <- data.frame(lapply(credit, as.numeric))
set.seed(123)
y = credit %>% select(Balance) %>% scale(center = F, scale = F) %>% as.numeric()
x = credit %>% select(-Balance) %>% as.matrix()
lambda = seq(38, 40, length.out = 20)
control = trainControl(method = "cv", number = 10)
tuning = expand.grid(alpha = 0, lambda = lambda)
mod_ridge = train(x=x, y=y, method = 'glmnet', trControl = control, tuneGrid = tuning)
mod_ridge
## glmnet
##
## 400 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 359, 360, 360, 360, 361, 360, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 38.00000 117.7548 0.9437119 99.90069
## 38.10526 117.7548 0.9437119 99.90069
## 38.21053 117.7548 0.9437119 99.90069
## 38.31579 117.7548 0.9437119 99.90069
## 38.42105 117.7548 0.9437119 99.90069
## 38.52632 117.7562 0.9437117 99.90174
## 38.63158 117.7588 0.9437114 99.90365
## 38.73684 117.7614 0.9437112 99.90556
## 38.84211 117.7640 0.9437109 99.90747
## 38.94737 117.7666 0.9437106 99.90938
## 39.05263 117.7693 0.9437103 99.91129
## 39.15789 117.7719 0.9437100 99.91320
## 39.26316 117.7746 0.9437097 99.91512
## 39.36842 117.7772 0.9437094 99.91703
## 39.47368 117.7800 0.9437090 99.91903
## 39.57895 117.7896 0.9437068 99.92596
## 39.68421 117.8123 0.9436972 99.94350
## 39.78947 117.8495 0.9436778 99.97281
## 39.89474 117.8950 0.9436527 100.00891
## 40.00000 117.9406 0.9436276 100.04501
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 38.42105.
plot(mod_ridge)
# Lasso regression: alpha = 1
# All data need to be numeric.
# lambda should be sufficiently large to get 0 betas, find by trial and error
lambda = seq(-10,10, length.out = 20)
control = trainControl(method = "cv", number = 10)
tuning = expand.grid(alpha = 1, lambda = lambda)
mod_lasso = train(x=x, y=y, method = 'glmnet', trControl = control, tuneGrid = tuning)
mod_lasso
## glmnet
##
## 400 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 360, 360, 360, 359, 360, 360, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## -10.0000000 99.30002 0.9552521 80.30791
## -8.9473684 99.30002 0.9552521 80.30791
## -7.8947368 99.30002 0.9552521 80.30791
## -6.8421053 99.30002 0.9552521 80.30791
## -5.7894737 99.30002 0.9552521 80.30791
## -4.7368421 99.30002 0.9552521 80.30791
## -3.6842105 99.30002 0.9552521 80.30791
## -2.6315789 99.30002 0.9552521 80.30791
## -1.5789474 99.30002 0.9552521 80.30791
## -0.5263158 99.30002 0.9552521 80.30791
## 0.5263158 99.30002 0.9552521 80.30791
## 1.5789474 99.50294 0.9551710 80.93982
## 2.6315789 99.84410 0.9550171 81.73106
## 3.6842105 100.24745 0.9548479 82.50507
## 4.7368421 100.72856 0.9546355 83.30046
## 5.7894737 101.32852 0.9543257 84.17740
## 6.8421053 102.11710 0.9538498 85.19073
## 7.8947368 103.06773 0.9532326 86.29724
## 8.9473684 104.14100 0.9525183 87.42997
## 10.0000000 105.32219 0.9517162 88.59304
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.5263158.
# May give warning because of missing values above certain lambdas.
# best lambda for ridge, the lambda_min
(best_lambda = mod_ridge$bestTune$lambda)
## [1] 38.42105
plot(mod_lasso)
# Best lambda can also be identified using 10-fold cross-validation
mod_ridge_cv = cv.glmnet(x, y, alpha = 0, nfolds = 10)
(best_lambda = mod_ridge_cv$lambda.min)
## [1] 39.65627
# coefficients
(ridge_coef = coef(mod_ridge$finalModel, s = best_lambda))
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -742.8986477
## Income -5.1766259
## Limit 0.1144897
## Rating 1.6610783
## Cards 15.7939335
## Age -0.9548837
## Education -0.4878440
## Gender -4.8936724
## Student 381.4666857
## Married -12.1695805
## Ethnicity -7.1882522
# best lambda for lasso: lambda.1se from the cross-validated model
mod_lasso_cv = cv.glmnet(x, y, alpha = 1, nfolds = 10)
(lambda_1se = mod_lasso_cv$lambda.1se)
## [1] 7.967869
# coefficients
(lasso_coef = coef(mod_lasso_cv, s = lambda_1se))
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -869.9462203
## Income -6.7535701
## Limit 0.1382901
## Rating 1.6760049
## Cards 9.9781334
## Age -0.3161273
## Education .
## Gender .
## Student 394.2455495
## Married .
## Ethnicity .
The variables with non-zero coefficients are selected for the final regression model. If you compare the stepwise, ridge, and lasso results, you will get a good idea of which variables should be selected as predictors for your model. For learning purposes, we select the predictors suggested by the stepwise regression; in real life, you can compare the RMSE of the different models to select the best one. Stepwise regression suggests the following model:
Balance ~ Income + Limit + Rating + Cards + Age + Student
mod_final = lm(Balance ~ Income + Limit + Rating + Cards + Age + Student, data = credit)
summ(mod_final, vif = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(6,393) = 1380.03, p = 0.00
## R² = 0.95
## Adj. R² = 0.95
##
## Standard errors:OLS
## ------------------------------------------------------------
## Est. S.E. t val. p VIF
## ----------------- --------- ------- -------- ------ --------
## (Intercept) -919.34 30.75 -29.90 0.00
## Income -7.80 0.23 -33.40 0.00 2.78
## Limit 0.19 0.03 5.98 0.00 229.24
## Rating 1.09 0.48 2.25 0.02 230.87
## Cards 18.21 4.32 4.22 0.00 1.44
## Age -0.62 0.29 -2.14 0.03 1.04
## Student 425.61 16.51 25.78 0.00 1.01
## ------------------------------------------------------------
Our \(H_0 : \beta_{Income} = \beta_{Limit} = \beta_{Rating} = \beta_{Cards} = \beta_{Age} = \beta_{Student} = 0\). If the associated p-value is < 0.05, we reject \(H_0\). We see that F(6,393) = 1380.03, p = 0.00, which indicates that at least one \(\beta \neq 0\). The p-values associated with the individual predictors indicate the significance of each particular predictor. We see that all the p-values are < 0.05, so all of these predictors have significant effects on the balance.
Multicollinearity: The VIFs of Limit and Rating are very high, so they are likely highly correlated with each other. In that case, either one of them can explain the dependent variable well without the help of the other. So we will remove the variable with the highest VIF (Rating) and run the model again. This will not substantially reduce the \(R^2\).
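Before dropping Rating, we can confirm the suspected collinearity directly with a one-line check; a correlation close to 1 would explain the huge VIFs seen above:
cor(credit$Limit, credit$Rating)   # expected to be very close to 1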
Besides, we may need to remove the non-significant variables from the model to preserve more degrees of freedom. If the \(R^2\) is higher than the \(Adjusted~R^2\), such reduction (penalty) in the coefficient of determination is due to the inclusion of non-significant variables in the model.
mod_multcol = lm(Balance ~ Income + Limit + Cards + Age + Student, data = credit)
summ(mod_multcol, vif = TRUE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95
##
## Standard errors:OLS
## ----------------------------------------------------------
## Est. S.E. t val. p VIF
## ----------------- --------- ------- -------- ------ ------
## (Intercept) -895.71 29.05 -30.83 0.00
## Income -7.76 0.23 -33.15 0.00 2.76
## Limit 0.27 0.00 75.30 0.00 2.70
## Cards 23.55 3.63 6.49 0.00 1.00
## Age -0.62 0.29 -2.12 0.03 1.04
## Student 428.38 16.55 25.89 0.00 1.00
## ----------------------------------------------------------
So there are no substantial multicollinearity issues now. However, we still need to check the residuals to decide whether we need to calculate robust standard errors. The residual plots show that the relationship is not perfectly linear, the variance is not homoskedastic, and the residuals are not normally distributed.
However, for learning purposes, we continue with this model and interpret it using robust standard errors.
# Residual plots
par(mfrow = c(1,2))
plot(mod_multcol, which = 1)
plot(mod_multcol, which = 2)
par(mfrow = c(1,1))
par(mfrow = c(1,2))
boxplot(credit$Balance)
qqPlot(credit$Balance)
## [1] 324 29
par(mfrow = c(1,1))
# Check heteroskedasticity. p < 0.05, so heteroskedasticity exists. Calculate robust standard error.
bptest(mod_multcol)
##
## studentized Breusch-Pagan test
##
## data: mod_multcol
## BP = 128.37, df = 5, p-value < 2.2e-16
summ(mod_multcol, confint = FALSE, robust = TRUE, scale = FALSE)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95
##
## Standard errors: Robust, type = HC3
## ---------------------------------------------------
## Est. S.E. t val. p
## ----------------- --------- ------- -------- ------
## (Intercept) -895.71 30.47 -29.40 0.00
## Income -7.76 0.22 -35.55 0.00
## Limit 0.27 0.00 59.00 0.00
## Cards 23.55 3.32 7.10 0.00
## Age -0.62 0.30 -2.08 0.04
## Student 428.38 15.06 28.44 0.00
## ---------------------------------------------------
This result shows the t-values based on heteroskedasticity-robust standard errors. Notice that the coefficients are the same. The regression coefficient of income is -7.76, which indicates that a one-unit (US$1,000) increase in income decreases the balance by US$7.76, holding the other variables constant. Similarly, a student is likely to have a balance 428.38 dollars greater than a non-student, holding the other variables constant.
Still, we do not know which variable has the largest effect on the balance. For this, we need to calculate the standardized regression coefficients (unit free, using scale = TRUE) to compare the variables.
summ(mod_multcol, scale = T, robust = T)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95
##
## Standard errors: Robust, type = HC3
## ---------------------------------------------------
## Est. S.E. t val. p
## ----------------- --------- ------- -------- ------
## (Intercept) 477.18 5.37 88.93 0.00
## Income -273.48 7.69 -35.55 0.00
## Limit 614.32 10.41 59.00 0.00
## Cards 32.29 4.55 7.10 0.00
## Age -10.73 5.17 -2.08 0.04
## Student 428.38 15.06 28.44 0.00
## ---------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
paste('sd of age =', sd(credit$Age))
## [1] "sd of age = 17.2498067622027"
paste('sd of balance =', sd(credit$Balance))
## [1] "sd of balance = 459.758877389383"
Now we see that the effect of card limit on the balance is the highest. The lowest effect is seen in the case of age.
The interpretation of a standardized coefficient is somewhat different from that of a non-standardized coefficient. Here only the predictors are standardized (the outcome stays in its original units, as the output notes), so a one standard deviation increase in age (about 17.2 years) decreases the balance by about 10.73 dollars, holding the other variables constant. We have calculated the standard deviations of age (17.2) and balance (459.8) to understand the practical implications of the standardized coefficients.
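Because summ(..., scale = TRUE) scales only the continuous predictors by one standard deviation while the outcome stays in dollars, each standardized coefficient is simply the unstandardized coefficient multiplied by that predictor's standard deviation. A quick sketch to verify this for Age and Income, using the objects already in the workspace:
coef(mod_multcol)['Age'] * sd(credit$Age)       # should reproduce the scaled Age coefficient (about -10.73)
coef(mod_multcol)['Income'] * sd(credit$Income) # should reproduce the scaled Income coefficient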
We must state that the other variables are held constant, because this rate of change may not be realized if the other variables are allowed to change. However, for writing the prediction equation and for easier interpretation, we use the non-standardized regression coefficients.
summ(mod_multcol)
## MODEL INFO:
## Observations: 400
## Dependent Variable: Balance
## Type: OLS linear regression
##
## MODEL FIT:
## F(5,394) = 1638.12, p = 0.00
## R² = 0.95
## Adj. R² = 0.95
##
## Standard errors:OLS
## ---------------------------------------------------
## Est. S.E. t val. p
## ----------------- --------- ------- -------- ------
## (Intercept) -895.71 29.05 -30.83 0.00
## Income -7.76 0.23 -33.15 0.00
## Limit 0.27 0.00 75.30 0.00
## Cards 23.55 3.63 6.49 0.00
## Age -0.62 0.29 -2.12 0.03
## Student 428.38 16.55 25.89 0.00
## ---------------------------------------------------
This model is significant, as indicated by the F value, and it explains 95% of the variance in the balance. We can also estimate the individual contribution of each variable through the anova(model) output (sums of squares), similar to an ANOVA.
The model prediction equation is: \(\hat y = -895.71 - 7.76 \cdot Income + 0.27 \cdot Limit + 23.55 \cdot Cards - 0.62 \cdot Age + 428.38 \cdot Student\). Using this equation, we can predict the unknown balance by plugging in the values of the Independent variables.
RMSE of this model = 98.37 on the whole dataset (known data)
sqrt(mean(mod_multcol$residuals^2))
## [1] 98.37279
The residuals are not normally distributed, they are not homoskedastic, and they are not random: they remain correlated with the independent variables, and the relationship with the dependent variable is not strictly linear. So the model is not as good as we would like. The QQ-plot of the residuals exhibits systematic deviation from the diagonal line in the tail.
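The visual impression of non-normality can be backed up with a formal test; a short sketch using the Shapiro-Wilk test from base R (\(H_0\): the residuals are normally distributed), where a small p-value would agree with the QQ-plot:
shapiro.test(mod_multcol$residuals)   # formal normality check of the residuals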