After reading Chapters 3 and 4, solve the following exercises in a step-by-step fashion:
(Section 3.7) 3.8, 3.9, and 4.10.
(Section 4.7) 4.10 and 4.11.
This question involves the use of simple linear regression on the Auto data set.
library(ISLR2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.
attach(Auto)
lm.fit <- lm(mpg ~ horsepower)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ horsepower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
For example:
- i. Is there a relationship between the predictor and the response
- ii. How strong is the relationship between the predictor and the response?
- iii. Is the relationship between the predictor and the response positive or negative?
- iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?
We can infer that there is a strong negative relationship between the predictor and the response because we see a small p-value. The predicted mpg associated with a horsepower of 98 and the associated 95 % confidence and prediction intervals are:
# 95 % Confidence Interval
predict(lm.fit, data.frame(horsepower = (c(98))), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
# Predicted is 24.46708
# Interval is (23.97308, 24.96108)
# 95 % Prediction Interval
predict(lm.fit, data.frame(horsepower = (c(98))), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
# Predicted is 24.46708
# Interval is (14.8094, 34.12476)
(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(horsepower, mpg)
abline(lm.fit)
(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow = c(2,2))
plot(lm.fit)
This question involves the use of multiple linear regression on the Auto data set.
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
plot(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, cor() which is qualitative.
quantityAuto <- Auto %>% select(-name)
cor(quantityAuto)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
lm.fit <- lm(mpg ~., data = quantityAuto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ ., data = quantityAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance:
+ i. Is there a relationship between the predictors and the response?
+ ii. Which predictors appear to have a statistically significant relationship to the response?
+ iii. What does the coefficient for the year variable suggest?
There is a relationship between some of the predictors and the response. The predictors that appear to have a statistically significant relationship to the response are displacement, weight, year, and origin. The coefficient for the year variable suggests that for every change in mpg, year affects the regression by 0.75 with a significant positive relationship.
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.
par(mfrow = c(2,2))
plot(lm.fit)
Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
We can see that the residuals suggest unusually large outliers starting at x = 30 and beyond. We can also notice that there are some observations in the Residuals vs Leverage plot that falls outside the Cook’s distance (red dashed line at the bottom), this indicates that they are influential points, high leverage.
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm.fit2 <- lm(mpg~displacement*weight, data = quantityAuto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ displacement * weight, data = quantityAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8664 -2.4801 -0.3355 1.8071 17.9429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.372e+01 1.940e+00 27.697 < 2e-16 ***
## displacement -7.831e-02 1.131e-02 -6.922 1.85e-11 ***
## weight -8.931e-03 8.474e-04 -10.539 < 2e-16 ***
## displacement:weight 1.744e-05 2.789e-06 6.253 1.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.097 on 388 degrees of freedom
## Multiple R-squared: 0.7265, Adjusted R-squared: 0.7244
## F-statistic: 343.6 on 3 and 388 DF, p-value: < 2.2e-16
lm.fit3 <- lm(mpg~displacement*year, data = quantityAuto)
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ displacement * year, data = quantityAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8530 -2.4250 -0.2234 2.0823 16.9933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.288e+01 8.368e+00 -8.709 < 2e-16 ***
## displacement 2.523e-01 4.059e-02 6.216 1.32e-09 ***
## year 1.408e+00 1.102e-01 12.779 < 2e-16 ***
## displacement:year -4.080e-03 5.453e-04 -7.482 4.96e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.729 on 388 degrees of freedom
## Multiple R-squared: 0.7735, Adjusted R-squared: 0.7718
## F-statistic: 441.7 on 3 and 388 DF, p-value: < 2.2e-16
lm.fit4 <- lm(mpg~year*origin, data = quantityAuto)
summary(lm.fit4)
##
## Call:
## lm(formula = mpg ~ year * origin, data = quantityAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3141 -3.7120 -0.6513 3.3621 15.5859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -83.3809 12.0000 -6.948 1.57e-11 ***
## year 1.3089 0.1576 8.305 1.68e-15 ***
## origin 17.3752 6.8325 2.543 0.0114 *
## year:origin -0.1663 0.0889 -1.871 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.199 on 388 degrees of freedom
## Multiple R-squared: 0.5596, Adjusted R-squared: 0.5562
## F-statistic: 164.4 on 3 and 388 DF, p-value: < 2.2e-16
lm.fit5 <- lm(mpg~year*weight, data = quantityAuto)
summary(lm.fit5)
##
## Call:
## lm(formula = mpg ~ year * weight, data = quantityAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0397 -1.9956 -0.0983 1.6525 12.9896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.105e+02 1.295e+01 -8.531 3.30e-16 ***
## year 2.040e+00 1.718e-01 11.876 < 2e-16 ***
## weight 2.755e-02 4.413e-03 6.242 1.14e-09 ***
## year:weight -4.579e-04 5.907e-05 -7.752 8.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.193 on 388 degrees of freedom
## Multiple R-squared: 0.8339, Adjusted R-squared: 0.8326
## F-statistic: 649.3 on 3 and 388 DF, p-value: < 2.2e-16
#displacement, weight, year, and origin
The interaction between year:origin seems to not be statistically significant.
(f) Try a few different transformations of the variables, such as log(X), √X, X2. Comment on your findings.
lm.fit <- lm(mpg~year + I(year^2))
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ year + I(year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2681 -5.0887 -0.8619 4.6922 18.2275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 565.72999 146.93208 3.850 0.000138 ***
## year -15.53584 3.87206 -4.012 7.21e-05 ***
## I(year^2) 0.11028 0.02546 4.331 1.89e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.223 on 389 degrees of freedom
## Multiple R-squared: 0.3675, Adjusted R-squared: 0.3643
## F-statistic: 113 on 2 and 389 DF, p-value: < 2.2e-16
lm.fit <- lm(mpg~year + log(year))
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ year + log(year))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2727 -5.1030 -0.8339 4.7437 18.1913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4177.461 975.784 4.281 2.34e-05 ***
## year 18.042 3.863 4.670 4.15e-06 ***
## log(year) -1276.157 293.169 -4.353 1.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.222 on 389 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3646
## F-statistic: 113.2 on 2 and 389 DF, p-value: < 2.2e-16
lm.fit <- lm(mpg~year + I(sqrt(year)))
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ year + I(sqrt(year)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2719 -5.0997 -0.8405 4.7305 18.2006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2480.086 586.528 4.228 2.94e-05 ***
## year 34.833 7.729 4.507 8.71e-06 ***
## I(sqrt(year)) -585.630 134.688 -4.348 1.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.222 on 389 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3645
## F-statistic: 113.1 on 2 and 389 DF, p-value: < 2.2e-16
Using different transformations for the variable year, we noticed that the near-zero p-value associated with the quadratic term suggests that it leads to an improved model, the Std. Error is also lowered for year^2. This is not the case for the other transformations. The other transformations are still significant but the Std. Error is way higher while also the p-value is a little bit higher too, for theses cases the coefficient is only showing the relationship between the variables, the coefficients make sense since we are using a log function and also √year, hence why these values are negatives.
This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.fit <- lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Each coefficient indicates how it is affecting Sales. Price has a negative high significant relationship while the UrbanYes (which is a 1 because it is a qualitative variable) has no significant relationship with Sales, finally USYes (again qualitative) has a positive significant relationship.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.043469 -0.054459Price - 0.021916UrbanYes +1.200573*USYes + error
We do not include “No” for US or Urban as the dummy variable is zero, we would be multiplying by zero.
(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
We can reject the null hypothesis for Price and USYes.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit <- lm(Sales~Price+US, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data? Model (a) has a non significant variable which is UrbanYes, comparing this model to model (e) which every variable is significant, model (e) has a lower R-squared compared to model (a). Model (e) also has lower Std. Error for some predictors while Multiple R-squared for both is the same. We would need to choose model (e) since this model has every single predictor rejecting the null hypothesis while model (a) has one predictor accepting H0.
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
# 95 % Confidence Interval
predict(lm.fit, data.frame(Price = 150, US = "Yes"), interval = "confidence")
## fit lwr upr
## 1 6.058791 5.602926 6.514655
# Predicted is 6.058791
# Interval is (5.602926, 6.514655)
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2,2))
plot(lm.fit)
There is evidence of high leverage observations, outliers seem to be present although not drastically away from the data.
This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
#install.packages("ISLR")
#install.packages("MASS")
#install.packages("olsrr")
library("ISLR")
##
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.2.0 ✓ forcats 0.5.1
## ✓ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("MASS")
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:ISLR2':
##
## Boston
library ("e1071")
library("class")
library("rsample")
##
## Attaching package: 'rsample'
## The following object is masked from 'package:e1071':
##
## permutations
library("olsrr")
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
data("Weekly")
weekly_db<-Weekly
summary(weekly_db)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
There´s not a clear difference in the predictors in terms of the objective variable.
ggplot(data=weekly_db, aes(x=Lag1, group=Direction, fill=Direction)) +
geom_density(alpha=.4)
ggplot(data=weekly_db, aes(x=Lag2, group=Direction, fill=Direction)) +
geom_density(alpha=.4)
ggplot(data=weekly_db, aes(x=Lag3, group=Direction, fill=Direction)) +
geom_density(alpha=.4)
ggplot(data=weekly_db, aes(x=Lag4, group=Direction, fill=Direction)) +
geom_density(alpha=.4)
ggplot(data=weekly_db, aes(x=Lag5, group=Direction, fill=Direction)) +
geom_density(alpha=.4)
ggplot(data=weekly_db, aes(x=Volume, group=Direction, fill=Direction)) +
geom_density(alpha=.4)
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
The predictor Lag 2 is statistically significant with a 95% confidence interval and the intercept also is significant with a 99% confidence interval.
glm.fits <- glm (Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume ,
data=weekly_db , family = binomial)
summary (glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = weekly_db)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm_probs <- predict (glm.fits ,data=weekly_db)
glm_pred <- rep (" Down ", 1089)
glm_pred[glm_probs > .5] = "Up"
Confussion Matrix
cm1<-table (glm_pred , weekly_db$Direction)
cm1
##
## glm_pred Down Up
## Down 465 563
## Up 19 42
Accuracy Total
acc1<-(cm1[1]+cm1[4])/nrow(weekly_db)
acc1
## [1] 0.4655647
Accuracy Up (Sensitivity)
acc_up1<-cm1[4]/(cm1[4]+cm1[3])
acc_up1
## [1] 0.06942149
Accuracy Down (Specificity)
acc_dw1<-cm1[1]/(cm1[2]+cm1[1])
acc_dw1
## [1] 0.9607438
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
weekly_train<-weekly_db[which(weekly_db$Year<=2008),]
weekly_test<-weekly_db[which(weekly_db$Year>=2009),]
weekly_dir_test<-weekly_test$Direction
glm.fits2 <- glm (Direction ~ Lag2 , data=weekly_train , family = binomial)
glm_probs2 <- predict (glm.fits2 , newdata=weekly_test, type = "response")
glm_pred2 <- rep (" Down ", 104)
glm_pred2[glm_probs2 > .5] = "Up"
Confussion Matrix
cm2<-table (glm_pred2 , weekly_dir_test)
cm2
## weekly_dir_test
## glm_pred2 Down Up
## Down 9 5
## Up 34 56
Accuracy Total
acc2<-(cm2[1]+cm2[4])/nrow(weekly_test)
acc2
## [1] 0.625
Accuracy Up (Sensitivity)
acc_up2<-cm2[4]/(cm2[4]+cm2[3])
acc_up2
## [1] 0.9180328
Accuracy Down (Specificity)
acc_dw2<-cm2[1]/(cm2[2]+cm2[1])
acc_dw2
## [1] 0.2093023
(e) Repeat (d) using LDA.
lda.fit <- lda (Direction ~ Lag2 , data=weekly_train)
lda_pred <- predict (lda.fit , weekly_test)
Confussion Matrix
cm3<-table (lda_pred$class , weekly_dir_test)
cm3
## weekly_dir_test
## Down Up
## Down 9 5
## Up 34 56
Accuracy Total
acc3<-(cm3[1]+cm3[4])/nrow(weekly_test)
acc3
## [1] 0.625
Accuracy Up (Sensitivity)
acc_up3<-cm3[4]/(cm3[4]+cm3[3])
acc_up3
## [1] 0.9180328
Accuracy Down (Specificity)
acc_dw3<-cm3[1]/(cm3[2]+cm3[1])
acc_dw3
## [1] 0.2093023
(f) Repeat (d) using QDA.
qda.fit <- qda (Direction ~ Lag2 , data=weekly_train)
qda_pred <- predict (qda.fit , weekly_test)
Confussion Matrix
cm4<-table (qda_pred$class , weekly_dir_test)
cm4
## weekly_dir_test
## Down Up
## Down 0 0
## Up 43 61
Accuracy Total
acc4<-(cm4[1]+cm4[4])/nrow(weekly_test)
acc4
## [1] 0.5865385
Accuracy Up (Sensitivity)
acc_up4<-cm4[4]/(cm4[4]+cm4[3])
acc_up4
## [1] 1
Accuracy Down (Specificity)
acc_dw4<-cm4[1]/(cm4[2]+cm4[1])
acc_dw4
## [1] 0
(g) Repeat (d) using KNN with K = 1.
weekly_dir_train<-weekly_train$Direction
weekly_train_x<- weekly_train[,2:8]
weekly_test_x<- weekly_test[,2:8]
set.seed (123)
knn_pred <- knn(weekly_train_x, weekly_test_x, weekly_dir_train , k = 1)
Confussion Matrix
cm5<-table (knn_pred , weekly_dir_test)
cm5
## weekly_dir_test
## knn_pred Down Up
## Down 35 9
## Up 8 52
Accuracy Total
acc5<-(cm5[1]+cm5[4])/nrow(weekly_test)
acc5
## [1] 0.8365385
Accuracy Up (Sensitivity)
acc_up5<-cm5[4]/(cm5[4]+cm5[3])
acc_up5
## [1] 0.852459
Accuracy Down (Specificity)
acc_dw5<-cm5[1]/(cm5[2]+cm5[1])
acc_dw5
## [1] 0.8139535
(h) Repeat (d) using naive Bayes.
nb.fit <- naiveBayes (Direction ~ Lag2 , data=weekly_train)
nb_pred <- predict (nb.fit , weekly_test)
Confussion Matrix
cm6<-table (nb_pred , weekly_dir_test)
cm6
## weekly_dir_test
## nb_pred Down Up
## Down 0 0
## Up 43 61
Accuracy Total
acc6<-(cm6[1]+cm6[4])/nrow(weekly_test)
acc6
## [1] 0.5865385
Accuracy Up (Sensitivity)
acc_up6<-cm6[4]/(cm6[4]+cm6[3])
acc_up6
## [1] 1
Accuracy Down (Specificity)
acc_dw6<-cm6[1]/(cm6[2]+cm6[1])
acc_dw6
## [1] 0
(i) Which of these methods appears to provide the best results on this data?
The model that show the best performance is the K-Nearest Neighbors.
model <- c("GLM", "LDA","QDA","KNN","NB")
total_acc <- c(acc2,acc3,acc4,acc5,acc6)
sensitivity <- c(acc_up2, acc_up3, acc_up4, acc_up5, acc_up6)
specificity <- c(acc_dw2, acc_dw3, acc_dw4, acc_dw5, acc_dw6)
results_weekly <- data.frame(model, total_acc,sensitivity,specificity)
results_weekly
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
data(Auto)
auto_db<- Auto
summary(auto_db)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
auto_db$mpg01<-0
auto_db[which(auto_db$mpg>=median(auto_db$mpg)),10]<-1
auto_db$mpg01<-as.factor(auto_db$mpg01)
summary(auto_db)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## 0:196
## 1:196
##
##
##
##
##
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
In all the variables there is a clear difference in terms of the objective variable. The ones with the more visual difference are (cylinders, displacement, horsepower, weight and origin).
ggplot(aes(y=cylinders,fill=mpg01), data=auto_db) + geom_boxplot()
ggplot(aes(y=displacement,fill=mpg01), data=auto_db) + geom_boxplot()
ggplot(aes(y=horsepower,fill=mpg01), data=auto_db) + geom_boxplot()
ggplot(aes(y=weight,fill=mpg01), data=auto_db) + geom_boxplot()
ggplot(aes(y=acceleration,fill=mpg01), data=auto_db) + geom_boxplot()
ggplot(aes(y=year,fill=mpg01), data=auto_db) + geom_boxplot()
ggplot(aes(y=origin,fill=mpg01), data=auto_db) + geom_boxplot()
(c) Split the data into a training set and a test set.
set.seed(123)
split_auto<- initial_split(
data = auto_db, # Datos
prop = 0.8, # Proporcion conjunto entrenamiento
strata = mpg01 # Variable objetivo
)
auto_train <- training(split_auto)
auto_test <- testing(split_auto)
auto_test_mpg01<-auto_test$mpg01
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
lda.fit_auto <- lda (mpg01 ~ cylinders + displacement + horsepower + weight + origin, data=auto_train)
lda_pred_auto <- predict (lda.fit_auto , auto_test)
Confussion Matrix
cm1_auto<-table (lda_pred_auto$class , auto_test_mpg01)
cm1_auto
## auto_test_mpg01
## 0 1
## 0 32 2
## 1 8 38
Accuracy Total
acc1_auto<-(cm3[1]+cm3[4])/nrow(auto_test)
acc1_auto
## [1] 0.8125
Accuracy 1 - over median (Sensitivity)
acc1_1_auto<-cm1_auto[4]/(cm1_auto[4]+cm1_auto[3])
acc1_1_auto
## [1] 0.95
Accuracy 0 - under median (Specificity)
acc1_0_auto<-cm1_auto[1]/(cm1_auto[2]+cm1_auto[1])
acc1_0_auto
## [1] 0.8
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit_auto <- qda (mpg01 ~ cylinders + displacement + horsepower + weight + origin, data=auto_train)
qda_pred <- predict (qda.fit_auto , auto_test)
Confussion Matrix
cm2_auto<-table (qda_pred$class , auto_test_mpg01)
cm2_auto
## auto_test_mpg01
## 0 1
## 0 34 3
## 1 6 37
Accuracy Total
acc2_auto<-(cm2_auto[1]+cm2_auto[4])/nrow(auto_test)
acc2_auto
## [1] 0.8875
Accuracy 1 - over median (Sensitivity)
acc2_1_auto<-cm2_auto[4]/(cm2_auto[4]+cm2_auto[3])
acc2_1_auto
## [1] 0.925
Accuracy 0 - under median (Specificity)
acc2_0_auto<-cm2_auto[1]/(cm2_auto[2]+cm2_auto[1])
acc2_0_auto
## [1] 0.85
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm.fits_auto <- glm (mpg01 ~ cylinders + displacement + horsepower + weight + origin, data=auto_train, family = binomial)
glm_probs_auto <- predict (glm.fits_auto , newdata=auto_test, type = "response")
glm_pred_auto <- rep (0, nrow(auto_test))
glm_pred_auto[glm_probs_auto > .5] = 1
Confussion Matrix
cm3_auto<-table (glm_pred_auto , auto_test_mpg01)
cm3_auto
## auto_test_mpg01
## glm_pred_auto 0 1
## 0 33 3
## 1 7 37
Accuracy Total
acc3_auto<-(cm3_auto[1]+cm3_auto[4])/nrow(auto_test)
acc3_auto
## [1] 0.875
Accuracy 1 - over median (Sensitivity)
acc3_1_auto<-cm3_auto[4]/(cm3_auto[4]+cm3_auto[3])
acc3_1_auto
## [1] 0.925
Accuracy 0 - under median (Specificity)
acc3_0_auto<-cm3_auto[1]/(cm3_auto[2]+cm3_auto[1])
acc3_0_auto
## [1] 0.825
(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained
nb.fit_auto <- naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data=auto_train)
nb_pred_auto <- predict (nb.fit_auto , auto_test)
Confussion Matrix
cm4_auto<-table (nb_pred_auto , auto_test_mpg01)
cm4_auto
## auto_test_mpg01
## nb_pred_auto 0 1
## 0 32 2
## 1 8 38
Accuracy Total
acc4_auto<-(cm4_auto[1]+cm4_auto[4])/nrow(auto_test)
acc4_auto
## [1] 0.875
Accuracy 1 - over median (Sensitivity)
acc4_1_auto<-cm4_auto[4]/(cm4_auto[4]+cm4_auto[3])
acc4_1_auto
## [1] 0.95
Accuracy 0 - over median (Specificity)
acc4_0_auto<-cm4_auto[1]/(cm4_auto[2]+cm4_auto[1])
acc4_0_auto
## [1] 0.8
(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain?
auto_train_mpg01<-auto_train$mpg01
auto_train_x<- auto_train[,c(2,3,4,5,8)]
auto_test_x<- auto_test[,c(2,3,4,5,8)]
set.seed (123)
knn_pred_auto <- knn(auto_train_x, auto_test_x, auto_train_mpg01 , k = 1)
Confussion Matrix
cm5_auto<-table (knn_pred_auto , auto_test_mpg01)
cm5_auto
## auto_test_mpg01
## knn_pred_auto 0 1
## 0 31 4
## 1 9 36
Accuracy Total
acc5_auto<-(cm5_auto[1]+cm5_auto[4])/nrow(auto_test)
acc5_auto
## [1] 0.8375
Accuracy 1 - over median (Sensitivity)
acc5_1_auto<-cm5_auto[4]/(cm5_auto[4]+cm5_auto[3])
acc5_1_auto
## [1] 0.9
Accuracy 0 - over median (Specificity)
acc5_0_auto<-cm5_auto[1]/(cm5_auto[2]+cm5_auto[1])
acc5_0_auto
## [1] 0.775
Note that some details are missing for all the following examples, the problems lack a complete explanation, and the code may need adequate comments. In this form, you must present a proper mathematical formulation, a brief background of the problem (and its bibliographical references) and, a much better explanation.
# Install release version from CRAN
# install.packages("olsrr")
Ordinary least squares regression.
library(olsrr)
ols_regress(mpg ~ disp + hp + wt + qsec, data = mtcars)
## Model Summary
## --------------------------------------------------------------
## R 0.914 RMSE 2.622
## R-Squared 0.835 Coef. Var 13.051
## Adj. R-Squared 0.811 MSE 6.875
## Pred R-Squared 0.771 MAE 1.858
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 940.412 4 235.103 34.195 0.0000
## Residual 185.635 27 6.875
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 27.330 8.639 3.164 0.004 9.604 45.055
## disp 0.003 0.011 0.055 0.248 0.806 -0.019 0.025
## hp -0.019 0.016 -0.212 -1.196 0.242 -0.051 0.013
## wt -4.609 1.266 -0.748 -3.641 0.001 -7.206 -2.012
## qsec 0.544 0.466 0.161 1.166 0.254 -0.413 1.501
## ----------------------------------------------------------------------------------------
Build regression model from a set of candidate predictor variables by entering and removing predictors based on p-values, in a stepwise manner until there is no variable left to enter or remove any more.
model <- lm(y ~ ., data = surgical)
ols_step_both_p(model)
##
## Stepwise Selection Summary
## ------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------------------
## 1 liver_test addition 0.455 0.444 62.5120 771.8753 296.2992
## 2 alc_heavy addition 0.567 0.550 41.3680 761.4394 266.6484
## 3 enzyme_test addition 0.659 0.639 24.3380 750.5089 238.9145
## 4 pindex addition 0.750 0.730 7.5370 735.7146 206.5835
## 5 bcs addition 0.781 0.758 3.1920 730.6204 195.4544
## ------------------------------------------------------------------------------------------
Build regression model from a set of candidate predictor variables by removing predictors based on Akaike Information Criteria, in a stepwise manner until there is no variable left to remove any more.
model <- lm(y ~ ., data = surgical)
k <- ols_step_backward_aic(model)
k
##
##
## Backward Elimination Summary
## ---------------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## ---------------------------------------------------------------------------
## Full Model 736.390 1825905.713 6543614.824 0.78184 0.74305
## alc_mod 734.407 1826477.828 6543042.709 0.78177 0.74856
## gender 732.494 1829435.617 6540084.920 0.78142 0.75351
## age 730.620 1833716.447 6535804.090 0.78091 0.75808
## ---------------------------------------------------------------------------
Breusch Pagan test is used to test for herteroskedasticity (non-constant error variance). It tests whether the variance of the errors from a regression is dependent on the values of the independent variables. It is a χ2 test.
Test for constant variance. It assumes that the error terms are normally distributed.
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : mpg
## Variables: fitted values of mpg
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 1.429672
## Prob > Chi2 = 0.231818
Variance inflation factor, tolerance, eigenvalues and condition indices. Collinearity implies two variables are near perfect linear combinations of one another. Multicollinearity involves more than two variables. In the presence of multicollinearity, regression estimates are unstable and have high standard errors.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_coll_diag(model)
## Tolerance and Variance Inflation Factor
## ---------------------------------------
## Variables Tolerance VIF
## 1 disp 0.1252279 7.985439
## 2 hp 0.1935450 5.166758
## 3 wt 0.1445726 6.916942
## 4 qsec 0.3191708 3.133119
##
##
## Eigenvalue and Condition Index
## ------------------------------
## Eigenvalue Condition Index intercept disp hp wt
## 1 4.721487187 1.000000 0.000123237 0.001132468 0.001413094 0.0005253393
## 2 0.216562203 4.669260 0.002617424 0.036811051 0.027751289 0.0002096014
## 3 0.050416837 9.677242 0.001656551 0.120881424 0.392366164 0.0377028008
## 4 0.010104757 21.616057 0.025805998 0.777260487 0.059594623 0.7017528428
## 5 0.001429017 57.480524 0.969796790 0.063914571 0.518874831 0.2598094157
## qsec
## 1 0.0001277169
## 2 0.0046789491
## 3 0.0001952599
## 4 0.0024577686
## 5 0.9925403056
Residual Diagnostics: Includes plots to examine residuals to validate OLS assumptions
Variable selection: Different variable selection procedures such as all possible regression, best subset regression, stepwise regression, stepwise forward regression and stepwise backward regression
Heteroskedasticity: Tests for heteroskedasticity include bartlett test, breusch pagan test, score test and f test
Measures of influence: Includes 10 different plots to detect and identify influential observations
Collinearity diagnostics: VIF, Tolerance and condition indices to detect collinearity and plots for assessing mode fit and contributions of variables
Ordinary least squares regression. In the presence of interaction terms in the model, the predictors are scaled and centered before computing the standardized betas, you can indicate the presence of interaction terms by setting iterm to TRUE.
ols_regress(mpg ~ disp + hp + wt + qsec, data = mtcars)
## Model Summary
## --------------------------------------------------------------
## R 0.914 RMSE 2.622
## R-Squared 0.835 Coef. Var 13.051
## Adj. R-Squared 0.811 MSE 6.875
## Pred R-Squared 0.771 MAE 1.858
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 940.412 4 235.103 34.195 0.0000
## Residual 185.635 27 6.875
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 27.330 8.639 3.164 0.004 9.604 45.055
## disp 0.003 0.011 0.055 0.248 0.806 -0.019 0.025
## hp -0.019 0.016 -0.212 -1.196 0.242 -0.051 0.013
## wt -4.609 1.266 -0.748 -3.641 0.001 -7.206 -2.012
## qsec 0.544 0.466 0.161 1.166 0.254 -0.413 1.501
## ----------------------------------------------------------------------------------------
Plot to detect non-linearity, unequal error variances, and outliers. Scatter plot of residuals on the y axis and fitted values on the x axis to detect non-linearity, unequal error variances, and outliers.
Characteristics of a well behaved residual vs fitted plot:
The residuals spread randomly around the 0 line indicating that the relationship is linear.
The residuals form an approximate horizontal band around the 0 line indicating homogeneity of error variance.
No one residual is visibly away from the random pattern of the residuals indicating that there are no outliers.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_fit(model)
DFBETAs measure the difference in each parameter estimate with and without the influential observation. dfbetas_panel creates plots to detect influential observations using DFBETAs. There is a DFBETA for each data point i.e if there are n observations and k variables, there will be 𝑛∗𝑘DFBETAs. In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and 2/(√𝑛) as a size-adjusted cutoff.
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
ols_plot_dfbetas(model)
Plot to detect non-linearity, influential observations and outliers. Consists of side-by-side quantile plots of the centered fit and the residuals. It shows how much variation in the data is explained by the fit and how much remains in the residuals. For inappropriate models, the spread of the residuals in such a plot is often greater than the spread of the centered fit.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_fit_spread(model)
Breusch Pagan test is used to test for herteroskedasticity (non-constant error variance). It tests whether the variance of the errors from a regression is dependent on the values of the independent variables. It is a 𝜒2 test.
Null Hypothesis: Equal/constant variances
Alternative Hypothesis: Unequal/non-constant variances
Computation
Fit a regression model
Regress the squared residuals from the above model on the independent variables
Compute 𝑛𝑅2. It follows a chi square distribution with p -1 degrees of freedom, where p is the number of independent variables, n is the sample size and 𝑅2 is the coefficient of determination from the regression in step 2.
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : mpg
## Variables: fitted values of mpg
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 1.429672
## Prob > Chi2 = 0.231818
Collinearity implies two variables are near perfect linear combinations of one another. Multicollinearity involves more than two variables. In the presence of multicollinearity, regression estimates are unstable and have high standard errors.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_coll_diag(model)
## Tolerance and Variance Inflation Factor
## ---------------------------------------
## Variables Tolerance VIF
## 1 disp 0.1252279 7.985439
## 2 hp 0.1935450 5.166758
## 3 wt 0.1445726 6.916942
## 4 qsec 0.3191708 3.133119
##
##
## Eigenvalue and Condition Index
## ------------------------------
## Eigenvalue Condition Index intercept disp hp wt
## 1 4.721487187 1.000000 0.000123237 0.001132468 0.001413094 0.0005253393
## 2 0.216562203 4.669260 0.002617424 0.036811051 0.027751289 0.0002096014
## 3 0.050416837 9.677242 0.001656551 0.120881424 0.392366164 0.0377028008
## 4 0.010104757 21.616057 0.025805998 0.777260487 0.059594623 0.7017528428
## 5 0.001429017 57.480524 0.969796790 0.063914571 0.518874831 0.2598094157
## qsec
## 1 0.0001277169
## 2 0.0046789491
## 3 0.0001952599
## 4 0.0024577686
## 5 0.9925403056
Build regression model from a set of candidate predictor variables by entering and removing predictors based on p values, in a stepwise manner until there is no variable left to enter or remove any more.
model <- lm(y ~ ., data = surgical)
ols_step_both_p(model)
##
## Stepwise Selection Summary
## ------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------------------
## 1 liver_test addition 0.455 0.444 62.5120 771.8753 296.2992
## 2 alc_heavy addition 0.567 0.550 41.3680 761.4394 266.6484
## 3 enzyme_test addition 0.659 0.639 24.3380 750.5089 238.9145
## 4 pindex addition 0.750 0.730 7.5370 735.7146 206.5835
## 5 bcs addition 0.781 0.758 3.1920 730.6204 195.4544
## ------------------------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_both_p(model)
plot(k)
Build regression model from a set of candidate predictor variables by removing predictors based on Akaike Information Criteria, in a stepwise manner until there is no variable left to remove any more.
model <- lm(y ~ ., data = surgical)
k <- ols_step_backward_aic(model)
k
##
##
## Backward Elimination Summary
## ---------------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## ---------------------------------------------------------------------------
## Full Model 736.390 1825905.713 6543614.824 0.78184 0.74305
## alc_mod 734.407 1826477.828 6543042.709 0.78177 0.74856
## gender 732.494 1829435.617 6540084.920 0.78142 0.75351
## age 730.620 1833716.447 6535804.090 0.78091 0.75808
## ---------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_backward_aic(model)
plot(k)
All subset regression tests all possible subsets of the set of potential independent variables. If there are K potential independent variables (besides the constant), then there are 2𝑘distinct subsets of them to be tested. For example, if you have 10 candidate independent variables, the number of subsets to be tested is 2^10, which is 1024, and if you have 20 candidate variables, the number is 2^20, which is more than one million.
ols_step_all_possible() Fits all regressions involving one regressor, two regressors, three regressors, and so on. It tests all possible subsets of the set of potential independent variables.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_all_possible(model)
The plot method shows the panel of fit criteria for all possible regression methods.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
k <- ols_step_all_possible(model)
plot(k)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
Select the subset of predictors that do the best at meeting some well-defined objective criterion, such as having the largest R2 value or the smallest MSE, Mallow’s Cp or AIC.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_best_subset(model)
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
k <- ols_step_best_subset(model)
plot(k)
Build regression model from a set of candidate predictor variables by entering predictors based on p values, in a stepwise manner until there is no variable left to enter any more. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
Build regression model from a set of candidate predictor variables by entering predictors based on p values, in a stepwise manner until there is no variable left to enter any more.
model <- lm(y ~ ., data = surgical)
ols_step_forward_p(model)
##
## Selection Summary
## ------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------
## 1 liver_test 0.4545 0.4440 62.5119 771.8753 296.2992
## 2 alc_heavy 0.5667 0.5498 41.3681 761.4394 266.6484
## 3 enzyme_test 0.6590 0.6385 24.3379 750.5089 238.9145
## 4 pindex 0.7501 0.7297 7.5373 735.7146 206.5835
## 5 bcs 0.7809 0.7581 3.1925 730.6204 195.4544
## ------------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_forward_p(model)
plot(k)
model <- lm(y ~ ., data = surgical)
ols_step_forward_p(model, details = TRUE)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. bcs
## 2. pindex
## 3. enzyme_test
## 4. liver_test
## 5. age
## 6. gender
## 7. alc_mod
## 8. alc_heavy
##
## We are selecting variables based on p value...
##
##
## Forward Selection: Step 1
##
## - liver_test
##
## Model Summary
## -----------------------------------------------------------------
## R 0.674 RMSE 296.299
## R-Squared 0.455 Coef. Var 42.202
## Adj. R-Squared 0.444 MSE 87793.232
## Pred R-Squared 0.386 MAE 212.857
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 3804272.477 1 3804272.477 43.332 0.0000
## Residual 4565248.060 52 87793.232
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------
## (Intercept) 15.191 111.869 0.136 0.893 -209.290 239.671
## liver_test 250.305 38.025 0.674 6.583 0.000 174.003 326.607
## -------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 2
##
## - alc_heavy
##
## Model Summary
## -----------------------------------------------------------------
## R 0.753 RMSE 266.648
## R-Squared 0.567 Coef. Var 37.979
## Adj. R-Squared 0.550 MSE 71101.387
## Pred R-Squared 0.487 MAE 187.393
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 4743349.776 2 2371674.888 33.356 0.0000
## Residual 3626170.761 51 71101.387
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) -5.069 100.828 -0.050 0.960 -207.490 197.352
## liver_test 234.597 34.491 0.632 6.802 0.000 165.353 303.841
## alc_heavy 342.183 94.156 0.338 3.634 0.001 153.157 531.208
## --------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 3
##
## - enzyme_test
##
## Model Summary
## -----------------------------------------------------------------
## R 0.812 RMSE 238.914
## R-Squared 0.659 Coef. Var 34.029
## Adj. R-Squared 0.639 MSE 57080.128
## Pred R-Squared 0.567 MAE 170.603
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 5515514.136 3 1838504.712 32.209 0.0000
## Residual 2854006.401 50 57080.128
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------
## (Intercept) -344.559 129.156 -2.668 0.010 -603.976 -85.141
## liver_test 183.844 33.845 0.495 5.432 0.000 115.865 251.823
## alc_heavy 319.662 84.585 0.315 3.779 0.000 149.769 489.555
## enzyme_test 6.263 1.703 0.335 3.678 0.001 2.843 9.683
## ---------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 4
##
## - pindex
##
## Model Summary
## -----------------------------------------------------------------
## R 0.866 RMSE 206.584
## R-Squared 0.750 Coef. Var 29.424
## Adj. R-Squared 0.730 MSE 42676.744
## Pred R-Squared 0.669 MAE 146.473
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6278360.060 4 1569590.015 36.779 0.0000
## Residual 2091160.477 49 42676.744
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## -----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -----------------------------------------------------------------------------------------------
## (Intercept) -789.012 153.372 -5.144 0.000 -1097.226 -480.799
## liver_test 125.474 32.358 0.338 3.878 0.000 60.448 190.499
## alc_heavy 359.875 73.754 0.355 4.879 0.000 211.660 508.089
## enzyme_test 7.548 1.503 0.404 5.020 0.000 4.527 10.569
## pindex 7.876 1.863 0.335 4.228 0.000 4.133 11.620
## -----------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 5
##
## - bcs
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
##
##
## No more variables to be added.
##
## Variables Entered:
##
## + liver_test
## + alc_heavy
## + enzyme_test
## + pindex
## + bcs
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
## Selection Summary
## ------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------
## 1 liver_test 0.4545 0.4440 62.5119 771.8753 296.2992
## 2 alc_heavy 0.5667 0.5498 41.3681 761.4394 266.6484
## 3 enzyme_test 0.6590 0.6385 24.3379 750.5089 238.9145
## 4 pindex 0.7501 0.7297 7.5373 735.7146 206.5835
## 5 bcs 0.7809 0.7581 3.1925 730.6204 195.4544
## ------------------------------------------------------------------------------
Build regression model from a set of candidate predictor variables by removing predictors based on p values, in a stepwise manner until there is no variable left to remove any more. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model <- lm(y ~ ., data = surgical)
ols_step_backward_p(model)
##
##
## Elimination Summary
## --------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------
## 1 alc_mod 0.7818 0.7486 7.0141 734.4068 199.2637
## 2 gender 0.7814 0.7535 5.0870 732.4942 197.2921
## 3 age 0.7809 0.7581 3.1925 730.6204 195.4544
## --------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_backward_p(model)
plot(k)
model <- lm(y ~ ., data = surgical)
ols_step_backward_p(model, details = TRUE)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . bcs
## 2 . pindex
## 3 . enzyme_test
## 4 . liver_test
## 5 . age
## 6 . gender
## 7 . alc_mod
## 8 . alc_heavy
##
## We are eliminating variables based on p value...
##
## - alc_mod
##
## Backward Elimination: Step 1
##
## Variable alc_mod Removed
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 199.264
## R-Squared 0.782 Coef. Var 28.381
## Adj. R-Squared 0.749 MSE 39706.040
## Pred R-Squared 0.678 MAE 137.053
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6543042.709 7 934720.387 23.541 0.0000
## Residual 1826477.828 46 39706.040
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1145.971 238.536 -4.804 0.000 -1626.119 -665.822
## bcs 62.274 24.187 0.251 2.575 0.013 13.589 110.959
## pindex 8.987 1.850 0.382 4.857 0.000 5.262 12.711
## enzyme_test 9.875 1.720 0.528 5.743 0.000 6.414 13.337
## liver_test 50.763 44.379 0.137 1.144 0.259 -38.567 140.093
## age -0.911 2.599 -0.025 -0.351 0.728 -6.142 4.320
## gender 15.786 57.840 0.020 0.273 0.786 -100.639 132.212
## alc_heavy 315.854 73.849 0.312 4.277 0.000 167.202 464.505
## ------------------------------------------------------------------------------------------------
##
##
## - gender
##
## Backward Elimination: Step 2
##
## Variable gender Removed
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 197.292
## R-Squared 0.781 Coef. Var 28.101
## Adj. R-Squared 0.754 MSE 38924.162
## Pred R-Squared 0.692 MAE 138.160
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6540084.920 6 1090014.153 28.004 0.0000
## Residual 1829435.617 47 38924.162
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1143.080 235.943 -4.845 0.000 -1617.737 -668.424
## bcs 61.424 23.748 0.248 2.586 0.013 13.649 109.199
## pindex 8.974 1.832 0.382 4.900 0.000 5.290 12.659
## enzyme_test 9.852 1.700 0.527 5.794 0.000 6.431 13.273
## liver_test 54.053 42.288 0.146 1.278 0.207 -31.019 139.125
## age -0.850 2.563 -0.024 -0.332 0.742 -6.007 4.307
## alc_heavy 314.585 72.974 0.310 4.311 0.000 167.781 461.390
## ------------------------------------------------------------------------------------------------
##
##
## - age
##
## Backward Elimination: Step 3
##
## Variable age Removed
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## ------------------------------------------------------------------------------------------------
##
##
##
## No more variables satisfy the condition of p value = 0.3
##
##
## Variables Removed:
##
## - alc_mod
## - gender
## - age
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## ------------------------------------------------------------------------------------------------
##
##
## Elimination Summary
## --------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------
## 1 alc_mod 0.7818 0.7486 7.0141 734.4068 199.2637
## 2 gender 0.7814 0.7535 5.0870 732.4942 197.2921
## 3 age 0.7809 0.7581 3.1925 730.6204 195.4544
## --------------------------------------------------------------------------
Build regression model from a set of candidate predictor variables by entering and removing predictors based on p values, in a stepwise manner until there is no variable left to enter or remove any more. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model <- lm(y ~ ., data = surgical)
ols_step_both_p(model)
##
## Stepwise Selection Summary
## ------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------------------
## 1 liver_test addition 0.455 0.444 62.5120 771.8753 296.2992
## 2 alc_heavy addition 0.567 0.550 41.3680 761.4394 266.6484
## 3 enzyme_test addition 0.659 0.639 24.3380 750.5089 238.9145
## 4 pindex addition 0.750 0.730 7.5370 735.7146 206.5835
## 5 bcs addition 0.781 0.758 3.1920 730.6204 195.4544
## ------------------------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_both_p(model)
plot(k)
model <- lm(y ~ ., data = surgical)
ols_step_both_p(model, details = TRUE)
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. bcs
## 2. pindex
## 3. enzyme_test
## 4. liver_test
## 5. age
## 6. gender
## 7. alc_mod
## 8. alc_heavy
##
## We are selecting variables based on p value...
##
##
## Stepwise Selection: Step 1
##
## - liver_test added
##
## Model Summary
## -----------------------------------------------------------------
## R 0.674 RMSE 296.299
## R-Squared 0.455 Coef. Var 42.202
## Adj. R-Squared 0.444 MSE 87793.232
## Pred R-Squared 0.386 MAE 212.857
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 3804272.477 1 3804272.477 43.332 0.0000
## Residual 4565248.060 52 87793.232
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------
## (Intercept) 15.191 111.869 0.136 0.893 -209.290 239.671
## liver_test 250.305 38.025 0.674 6.583 0.000 174.003 326.607
## -------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 2
##
## - alc_heavy added
##
## Model Summary
## -----------------------------------------------------------------
## R 0.753 RMSE 266.648
## R-Squared 0.567 Coef. Var 37.979
## Adj. R-Squared 0.550 MSE 71101.387
## Pred R-Squared 0.487 MAE 187.393
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 4743349.776 2 2371674.888 33.356 0.0000
## Residual 3626170.761 51 71101.387
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) -5.069 100.828 -0.050 0.960 -207.490 197.352
## liver_test 234.597 34.491 0.632 6.802 0.000 165.353 303.841
## alc_heavy 342.183 94.156 0.338 3.634 0.001 153.157 531.208
## --------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -----------------------------------------------------------------
## R 0.753 RMSE 266.648
## R-Squared 0.567 Coef. Var 37.979
## Adj. R-Squared 0.550 MSE 71101.387
## Pred R-Squared 0.487 MAE 187.393
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 4743349.776 2 2371674.888 33.356 0.0000
## Residual 3626170.761 51 71101.387
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) -5.069 100.828 -0.050 0.960 -207.490 197.352
## liver_test 234.597 34.491 0.632 6.802 0.000 165.353 303.841
## alc_heavy 342.183 94.156 0.338 3.634 0.001 153.157 531.208
## --------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 3
##
## - enzyme_test added
##
## Model Summary
## -----------------------------------------------------------------
## R 0.812 RMSE 238.914
## R-Squared 0.659 Coef. Var 34.029
## Adj. R-Squared 0.639 MSE 57080.128
## Pred R-Squared 0.567 MAE 170.603
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 5515514.136 3 1838504.712 32.209 0.0000
## Residual 2854006.401 50 57080.128
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------
## (Intercept) -344.559 129.156 -2.668 0.010 -603.976 -85.141
## liver_test 183.844 33.845 0.495 5.432 0.000 115.865 251.823
## alc_heavy 319.662 84.585 0.315 3.779 0.000 149.769 489.555
## enzyme_test 6.263 1.703 0.335 3.678 0.001 2.843 9.683
## ---------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -----------------------------------------------------------------
## R 0.812 RMSE 238.914
## R-Squared 0.659 Coef. Var 34.029
## Adj. R-Squared 0.639 MSE 57080.128
## Pred R-Squared 0.567 MAE 170.603
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 5515514.136 3 1838504.712 32.209 0.0000
## Residual 2854006.401 50 57080.128
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------
## (Intercept) -344.559 129.156 -2.668 0.010 -603.976 -85.141
## liver_test 183.844 33.845 0.495 5.432 0.000 115.865 251.823
## alc_heavy 319.662 84.585 0.315 3.779 0.000 149.769 489.555
## enzyme_test 6.263 1.703 0.335 3.678 0.001 2.843 9.683
## ---------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 4
##
## - pindex added
##
## Model Summary
## -----------------------------------------------------------------
## R 0.866 RMSE 206.584
## R-Squared 0.750 Coef. Var 29.424
## Adj. R-Squared 0.730 MSE 42676.744
## Pred R-Squared 0.669 MAE 146.473
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6278360.060 4 1569590.015 36.779 0.0000
## Residual 2091160.477 49 42676.744
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## -----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -----------------------------------------------------------------------------------------------
## (Intercept) -789.012 153.372 -5.144 0.000 -1097.226 -480.799
## liver_test 125.474 32.358 0.338 3.878 0.000 60.448 190.499
## alc_heavy 359.875 73.754 0.355 4.879 0.000 211.660 508.089
## enzyme_test 7.548 1.503 0.404 5.020 0.000 4.527 10.569
## pindex 7.876 1.863 0.335 4.228 0.000 4.133 11.620
## -----------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -----------------------------------------------------------------
## R 0.866 RMSE 206.584
## R-Squared 0.750 Coef. Var 29.424
## Adj. R-Squared 0.730 MSE 42676.744
## Pred R-Squared 0.669 MAE 146.473
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6278360.060 4 1569590.015 36.779 0.0000
## Residual 2091160.477 49 42676.744
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## -----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -----------------------------------------------------------------------------------------------
## (Intercept) -789.012 153.372 -5.144 0.000 -1097.226 -480.799
## liver_test 125.474 32.358 0.338 3.878 0.000 60.448 190.499
## alc_heavy 359.875 73.754 0.355 4.879 0.000 211.660 508.089
## enzyme_test 7.548 1.503 0.404 5.020 0.000 4.527 10.569
## pindex 7.876 1.863 0.335 4.228 0.000 4.133 11.620
## -----------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 5
##
## - bcs added
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
##
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
## Stepwise Selection Summary
## ------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------------------
## 1 liver_test addition 0.455 0.444 62.5120 771.8753 296.2992
## 2 alc_heavy addition 0.567 0.550 41.3680 761.4394 266.6484
## 3 enzyme_test addition 0.659 0.639 24.3380 750.5089 238.9145
## 4 pindex addition 0.750 0.730 7.5370 735.7146 206.5835
## 5 bcs addition 0.781 0.758 3.1920 730.6204 195.4544
## ------------------------------------------------------------------------------------------
Build regression model from a set of candidate predictor variables by entering predictors based on Akaike Information Criteria, in a stepwise manner until there is no variable left to enter any more. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model <- lm(y ~ ., data = surgical)
ols_step_forward_aic(model)
##
## Selection Summary
## ----------------------------------------------------------------------------
## Variable AIC Sum Sq RSS R-Sq Adj. R-Sq
## ----------------------------------------------------------------------------
## liver_test 771.875 3804272.477 4565248.060 0.45454 0.44405
## alc_heavy 761.439 4743349.776 3626170.761 0.56674 0.54975
## enzyme_test 750.509 5515514.136 2854006.401 0.65900 0.63854
## pindex 735.715 6278360.060 2091160.477 0.75015 0.72975
## bcs 730.620 6535804.090 1833716.447 0.78091 0.75808
## ----------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_forward_aic(model)
plot(k)
model <- lm(y ~ ., data = surgical)
ols_step_forward_aic(model, details = TRUE)
## Forward Selection Method
## ------------------------
##
## Candidate Terms:
##
## 1 . bcs
## 2 . pindex
## 3 . enzyme_test
## 4 . liver_test
## 5 . age
## 6 . gender
## 7 . alc_mod
## 8 . alc_heavy
##
## Step 0: AIC = 802.606
## y ~ 1
##
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## liver_test 1 771.875 3804272.477 4565248.060 0.455 0.444
## enzyme_test 1 782.629 2798309.881 5571210.656 0.334 0.322
## pindex 1 794.100 1479766.754 6889753.784 0.177 0.161
## alc_heavy 1 794.301 1454057.255 6915463.282 0.174 0.158
## bcs 1 797.697 1005151.658 7364368.879 0.120 0.103
## alc_mod 1 802.828 271062.330 8098458.207 0.032 0.014
## gender 1 802.956 251808.570 8117711.967 0.030 0.011
## age 1 803.834 118862.559 8250657.978 0.014 -0.005
## --------------------------------------------------------------------------------
##
##
## - liver_test
##
##
## Step 1 : AIC = 771.8753
## y ~ liver_test
##
## -------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## -------------------------------------------------------------------------------
## alc_heavy 1 761.439 939077.300 3626170.761 0.567 0.550
## enzyme_test 1 762.077 896004.331 3669243.729 0.562 0.544
## pindex 1 770.387 285591.786 4279656.274 0.489 0.469
## alc_mod 1 771.141 225396.238 4339851.822 0.481 0.461
## gender 1 773.802 6162.222 4559085.838 0.455 0.434
## age 1 773.831 3726.297 4561521.763 0.455 0.434
## bcs 1 773.867 685.256 4564562.805 0.455 0.433
## -------------------------------------------------------------------------------
##
## - alc_heavy
##
##
## Step 2 : AIC = 761.4394
## y ~ liver_test + alc_heavy
##
## -------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## -------------------------------------------------------------------------------
## enzyme_test 1 750.509 772164.360 2854006.401 0.659 0.639
## pindex 1 756.125 459358.635 3166812.126 0.622 0.599
## bcs 1 763.063 25195.587 3600975.173 0.570 0.544
## age 1 763.110 22048.109 3604122.652 0.569 0.544
## alc_mod 1 763.428 784.551 3625386.210 0.567 0.541
## gender 1 763.433 443.343 3625727.417 0.567 0.541
## -------------------------------------------------------------------------------
##
## - enzyme_test
##
##
## Step 3 : AIC = 750.5089
## y ~ liver_test + alc_heavy + enzyme_test
##
## -----------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## -----------------------------------------------------------------------------
## pindex 1 735.715 762845.924 2091160.477 0.750 0.730
## bcs 1 750.782 89836.308 2764170.093 0.670 0.643
## alc_mod 1 752.403 5607.570 2848398.831 0.660 0.632
## age 1 752.416 4896.081 2849110.320 0.660 0.632
## gender 1 752.509 5.958 2854000.443 0.659 0.631
## -----------------------------------------------------------------------------
##
## - pindex
##
##
## Step 4 : AIC = 735.7146
## y ~ liver_test + alc_heavy + enzyme_test + pindex
##
## -----------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## -----------------------------------------------------------------------------
## bcs 1 730.620 257444.030 1833716.447 0.781 0.758
## age 1 737.680 1325.880 2089834.596 0.750 0.724
## gender 1 737.712 90.186 2091070.290 0.750 0.724
## alc_mod 1 737.713 60.620 2091099.857 0.750 0.724
## -----------------------------------------------------------------------------
##
## - bcs
##
##
## Step 5 : AIC = 730.6204
## y ~ liver_test + alc_heavy + enzyme_test + pindex + bcs
##
## ---------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## ---------------------------------------------------------------------------
## age 1 732.494 4280.830 1829435.617 0.781 0.754
## gender 1 732.551 2360.288 1831356.159 0.781 0.753
## alc_mod 1 732.614 216.992 1833499.455 0.781 0.753
## ---------------------------------------------------------------------------
##
##
## No more variables to be added.
##
## Variables Entered:
##
## - liver_test
## - alc_heavy
## - enzyme_test
## - pindex
## - bcs
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
## Selection Summary
## ----------------------------------------------------------------------------
## Variable AIC Sum Sq RSS R-Sq Adj. R-Sq
## ----------------------------------------------------------------------------
## liver_test 771.875 3804272.477 4565248.060 0.45454 0.44405
## alc_heavy 761.439 4743349.776 3626170.761 0.56674 0.54975
## enzyme_test 750.509 5515514.136 2854006.401 0.65900 0.63854
## pindex 735.715 6278360.060 2091160.477 0.75015 0.72975
## bcs 730.620 6535804.090 1833716.447 0.78091 0.75808
## ----------------------------------------------------------------------------
Build regression model from a set of candidate predictor variables by removing predictors based on Akaike Information Criteria, in a stepwise manner until there is no variable left to remove any more. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model <- lm(y ~ ., data = surgical)
k <- ols_step_backward_aic(model)
k
##
##
## Backward Elimination Summary
## ---------------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## ---------------------------------------------------------------------------
## Full Model 736.390 1825905.713 6543614.824 0.78184 0.74305
## alc_mod 734.407 1826477.828 6543042.709 0.78177 0.74856
## gender 732.494 1829435.617 6540084.920 0.78142 0.75351
## age 730.620 1833716.447 6535804.090 0.78091 0.75808
## ---------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_backward_aic(model)
plot(k)
model <- lm(y ~ ., data = surgical)
ols_step_backward_aic(model, details = TRUE)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . bcs
## 2 . pindex
## 3 . enzyme_test
## 4 . liver_test
## 5 . age
## 6 . gender
## 7 . alc_mod
## 8 . alc_heavy
##
## Step 0: AIC = 736.3899
## y ~ bcs + pindex + enzyme_test + liver_test + age + gender + alc_mod + alc_heavy
##
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## alc_mod 1 734.407 572.115 1826477.828 0.782 0.749
## gender 1 734.478 2990.338 1828896.051 0.781 0.748
## age 1 734.544 5231.108 1831136.821 0.781 0.748
## liver_test 1 735.878 51016.156 1876921.869 0.776 0.742
## bcs 1 741.677 263780.393 2089686.106 0.750 0.712
## alc_heavy 1 749.210 576636.222 2402541.935 0.713 0.669
## pindex 1 756.624 930187.311 2756093.024 0.671 0.621
## enzyme_test 1 763.557 1307756.930 3133662.644 0.626 0.569
## --------------------------------------------------------------------------------
##
##
## Variables Removed:
##
## - alc_mod
##
##
## Step 1 : AIC = 734.4068
## y ~ bcs + pindex + enzyme_test + liver_test + age + gender + alc_heavy
##
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## gender 1 732.494 2957.789 1829435.617 0.781 0.754
## age 1 732.551 4878.331 1831356.159 0.781 0.753
## liver_test 1 733.921 51951.343 1878429.171 0.776 0.747
## bcs 1 739.677 263219.094 2089696.922 0.750 0.718
## alc_heavy 1 750.486 726328.685 2552806.513 0.695 0.656
## pindex 1 754.759 936543.762 2763021.590 0.670 0.628
## enzyme_test 1 761.595 1309433.007 3135910.834 0.625 0.577
## --------------------------------------------------------------------------------
##
## - gender
##
##
## Step 2 : AIC = 732.4942
## y ~ bcs + pindex + enzyme_test + liver_test + age + alc_heavy
##
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## age 1 730.620 4280.830 1833716.447 0.781 0.758
## liver_test 1 732.339 63596.190 1893031.807 0.774 0.750
## bcs 1 737.680 260398.979 2089834.596 0.750 0.724
## alc_heavy 1 748.486 723371.473 2552807.090 0.695 0.663
## pindex 1 752.777 934511.071 2763946.688 0.670 0.635
## enzyme_test 1 759.596 1306482.666 3135918.283 0.625 0.586
## --------------------------------------------------------------------------------
##
## - age
##
##
## Step 3 : AIC = 730.6204
## y ~ bcs + pindex + enzyme_test + liver_test + alc_heavy
##
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## liver_test 1 730.924 79919.825 1913636.272 0.771 0.753
## bcs 1 735.715 257444.030 2091160.477 0.750 0.730
## alc_heavy 1 747.181 752122.827 2585839.274 0.691 0.666
## pindex 1 750.782 930453.646 2764170.093 0.670 0.643
## enzyme_test 1 757.971 1324076.125 3157792.572 0.623 0.592
## --------------------------------------------------------------------------------
##
##
## No more variables to be removed.
##
## Variables Removed:
##
## - alc_mod
## - gender
## - age
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## ------------------------------------------------------------------------------------------------
##
##
## Backward Elimination Summary
## ---------------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## ---------------------------------------------------------------------------
## Full Model 736.390 1825905.713 6543614.824 0.78184 0.74305
## alc_mod 734.407 1826477.828 6543042.709 0.78177 0.74856
## gender 732.494 1829435.617 6540084.920 0.78142 0.75351
## age 730.620 1833716.447 6535804.090 0.78091 0.75808
## ---------------------------------------------------------------------------
Build regression model from a set of candidate predictor variables by entering and removing predictors based on Akaike Information Criteria, in a stepwise manner until there is no variable left to enter or remove any more. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model <- lm(y ~ ., data = surgical)
ols_step_both_aic(model)
##
##
## Stepwise Summary
## ----------------------------------------------------------------------------------------
## Variable Method AIC RSS Sum Sq R-Sq Adj. R-Sq
## ----------------------------------------------------------------------------------------
## liver_test addition 771.875 4565248.060 3804272.477 0.45454 0.44405
## alc_heavy addition 761.439 3626170.761 4743349.776 0.56674 0.54975
## enzyme_test addition 750.509 2854006.401 5515514.136 0.65900 0.63854
## pindex addition 735.715 2091160.477 6278360.060 0.75015 0.72975
## bcs addition 730.620 1833716.447 6535804.090 0.78091 0.75808
## ----------------------------------------------------------------------------------------
model <- lm(y ~ ., data = surgical)
k <- ols_step_both_aic(model)
plot(k)
model <- lm(y ~ ., data = surgical)
ols_step_both_aic(model, details = TRUE)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1 . bcs
## 2 . pindex
## 3 . enzyme_test
## 4 . liver_test
## 5 . age
## 6 . gender
## 7 . alc_mod
## 8 . alc_heavy
##
## Step 0: AIC = 802.606
## y ~ 1
##
##
## Variables Entered/Removed:
##
## Enter New Variables
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## liver_test 1 771.875 3804272.477 4565248.060 0.455 0.444
## enzyme_test 1 782.629 2798309.881 5571210.656 0.334 0.322
## pindex 1 794.100 1479766.754 6889753.784 0.177 0.161
## alc_heavy 1 794.301 1454057.255 6915463.282 0.174 0.158
## bcs 1 797.697 1005151.658 7364368.879 0.120 0.103
## alc_mod 1 802.828 271062.330 8098458.207 0.032 0.014
## gender 1 802.956 251808.570 8117711.967 0.030 0.011
## age 1 803.834 118862.559 8250657.978 0.014 -0.005
## --------------------------------------------------------------------------------
##
## - liver_test added
##
##
## Step 1 : AIC = 771.8753
## y ~ liver_test
##
## Enter New Variables
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## alc_heavy 1 761.439 4743349.776 3626170.761 0.567 0.550
## enzyme_test 1 762.077 4700276.808 3669243.729 0.562 0.544
## pindex 1 770.387 4089864.263 4279656.274 0.489 0.469
## alc_mod 1 771.141 4029668.715 4339851.822 0.481 0.461
## gender 1 773.802 3810434.699 4559085.838 0.455 0.434
## age 1 773.831 3807998.774 4561521.763 0.455 0.434
## bcs 1 773.867 3804957.732 4564562.805 0.455 0.433
## --------------------------------------------------------------------------------
##
## - alc_heavy added
##
##
## Step 2 : AIC = 761.4394
## y ~ liver_test + alc_heavy
##
## Remove Existing Variables
## -------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## -------------------------------------------------------------------------------
## alc_heavy 1 771.875 3804272.477 4565248.060 0.455 0.444
## liver_test 1 794.301 1454057.255 6915463.282 0.174 0.158
## -------------------------------------------------------------------------------
##
## Enter New Variables
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## enzyme_test 1 750.509 5515514.136 2854006.401 0.659 0.639
## pindex 1 756.125 5202708.411 3166812.126 0.622 0.599
## bcs 1 763.063 4768545.364 3600975.173 0.570 0.544
## age 1 763.110 4765397.885 3604122.652 0.569 0.544
## alc_mod 1 763.428 4744134.327 3625386.210 0.567 0.541
## gender 1 763.433 4743793.120 3625727.417 0.567 0.541
## --------------------------------------------------------------------------------
##
## - enzyme_test added
##
##
## Step 3 : AIC = 750.5089
## y ~ liver_test + alc_heavy + enzyme_test
##
## Remove Existing Variables
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## enzyme_test 1 761.439 4743349.776 3626170.761 0.567 0.550
## alc_heavy 1 762.077 4700276.808 3669243.729 0.562 0.544
## liver_test 1 773.555 3831289.024 4538231.513 0.458 0.437
## --------------------------------------------------------------------------------
##
## Enter New Variables
## ------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## ------------------------------------------------------------------------------
## pindex 1 735.715 6278360.060 2091160.477 0.750 0.730
## bcs 1 750.782 5605350.444 2764170.093 0.670 0.643
## alc_mod 1 752.403 5521121.706 2848398.831 0.660 0.632
## age 1 752.416 5520410.217 2849110.320 0.660 0.632
## gender 1 752.509 5515520.094 2854000.443 0.659 0.631
## ------------------------------------------------------------------------------
##
## - pindex added
##
##
## Step 4 : AIC = 735.7146
## y ~ liver_test + alc_heavy + enzyme_test + pindex
##
## Remove Existing Variables
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## liver_test 1 748.167 5636649.760 2732870.777 0.673 0.654
## pindex 1 750.509 5515514.136 2854006.401 0.659 0.639
## alc_heavy 1 755.099 5262294.325 3107226.212 0.629 0.606
## enzyme_test 1 756.125 5202708.411 3166812.126 0.622 0.599
## --------------------------------------------------------------------------------
##
## Enter New Variables
## ------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## ------------------------------------------------------------------------------
## bcs 1 730.620 6535804.090 1833716.447 0.781 0.758
## age 1 737.680 6279685.941 2089834.596 0.750 0.724
## gender 1 737.712 6278450.247 2091070.290 0.750 0.724
## alc_mod 1 737.713 6278420.680 2091099.857 0.750 0.724
## ------------------------------------------------------------------------------
##
## - bcs added
##
##
## Step 5 : AIC = 730.6204
## y ~ liver_test + alc_heavy + enzyme_test + pindex + bcs
##
## Remove Existing Variables
## --------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------
## liver_test 1 730.924 6455884.265 1913636.272 0.771 0.753
## bcs 1 735.715 6278360.060 2091160.477 0.750 0.730
## alc_heavy 1 747.181 5783681.263 2585839.274 0.691 0.666
## pindex 1 750.782 5605350.444 2764170.093 0.670 0.643
## enzyme_test 1 757.971 5211727.965 3157792.572 0.623 0.592
## --------------------------------------------------------------------------------
##
## Enter New Variables
## ------------------------------------------------------------------------------
## Variable DF AIC Sum Sq RSS R-Sq Adj. R-Sq
## ------------------------------------------------------------------------------
## age 1 732.494 6540084.920 1829435.617 0.781 0.754
## gender 1 732.551 6538164.378 1831356.159 0.781 0.753
## alc_mod 1 732.614 6536021.082 1833499.455 0.781 0.753
## ------------------------------------------------------------------------------
##
##
## No more variables to be added or removed.
##
## Final Model Output
## ------------------
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
##
## Stepwise Summary
## ----------------------------------------------------------------------------------------
## Variable Method AIC RSS Sum Sq R-Sq Adj. R-Sq
## ----------------------------------------------------------------------------------------
## liver_test addition 771.875 4565248.060 3804272.477 0.45454 0.44405
## alc_heavy addition 761.439 3626170.761 4743349.776 0.56674 0.54975
## enzyme_test addition 750.509 2854006.401 5515514.136 0.65900 0.63854
## pindex addition 735.715 2091160.477 6278360.060 0.75015 0.72975
## bcs addition 730.620 1833716.447 6535804.090 0.78091 0.75808
## ----------------------------------------------------------------------------------------
olsrr offers tools for detecting violation of standard regression assumptions. Here we take a look at residual diagnostics. The standard regression assumptions include the following about residuals/errors:
The error has a normal distribution (normality assumption).
The errors have mean zero.
The errors have same but unknown variance (homoscedasticity assumption).
The error are independent of each other (independent errors assumption).
This graph is used to detect a violation on normality assumption, the most closest the point (residuals) to the line, represent normality.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_qq(model)
This test is used to detect a violation on normality assumption. It calculates 4 test (Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mise, Anderson-Darling).
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_normality(model)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9366 0.0600
## Kolmogorov-Smirnov 0.1152 0.7464
## Cramer-von Mises 2.8122 0.0000
## Anderson-Darling 0.5859 0.1188
## -----------------------------------------------
Correlation between observed residuals and expected residuals under normality.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_correlation(model)
## [1] 0.970066
It is a scatter plot of residuals on the y axis and fitted values on the x axis to detect non-linearity, unequal error variances, and outliers.
Characteristics of a well behaved residual vs fitted plot:
The residuals spread randomly around the 0 line indicating that the relationship is linear.
The residuals form an approximate horizontal band around the 0 line indicating homogeneity of error variance.
No one residual is visibly away from the random pattern of the residuals indicating that there are no outliers.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_fit(model)
Histogram of residuals for detecting violation of normality assumption.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_hist(model)
Librerias
library(descriptr)
##
## Attaching package: 'descriptr'
## The following object is masked from 'package:olsrr':
##
## hsb
library(olsrr)
Use grouping variable
ols_test_bartlett(hsb, 'read', group_var = 'female')
##
## Bartlett's Test of Homogenity of Variances
## ------------------------------------------------
## Ho: Variances are equal across groups
## Ha: Variances are unequal for atleast two groups
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.1866579
## Prob > Chi2 = 0.6657129
Using variables
ols_test_bartlett(hsb, 'read', 'write')
##
## Bartlett's Test of Homogenity of Variances
## ------------------------------------------------
## Ho: Variances are equal across groups
## Ha: Variances are unequal for atleast two groups
##
## Data
## ---------------------
## Variables: read write
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.222871
## Prob > Chi2 = 0.2687979
Use fitted values of the model
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : mpg
## Variables: fitted values of mpg
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 1.429672
## Prob > Chi2 = 0.231818
Use independent variables of the model
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model, rhs = TRUE)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## --------------------------
## Response : mpg
## Variables: disp hp wt drat
##
## Test Summary
## ----------------------------
## DF = 4
## Chi2 = 1.513808
## Prob > Chi2 = 0.8241927
Use independent variables of the model and perform multiple tests
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## --------------------------
## Response : mpg
## Variables: disp hp wt drat
##
## Test Summary (Unadjusted p values)
## ----------------------------------------------
## Variable chi2 df p
## ----------------------------------------------
## disp 1.2355345 1 0.2663334
## hp 0.9209878 1 0.3372157
## wt 1.2529988 1 0.2629805
## drat 1.1668486 1 0.2800497
## ----------------------------------------------
## simultaneous 1.5138083 4 0.8241927
## ----------------------------------------------
Bonferroni p value Adjustment
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = 'bonferroni')
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## --------------------------
## Response : mpg
## Variables: disp hp wt drat
##
## Test Summary (Bonferroni p values)
## ----------------------------------------------
## Variable chi2 df p
## ----------------------------------------------
## disp 1.2355345 1 1.0000000
## hp 0.9209878 1 1.0000000
## wt 1.2529988 1 1.0000000
## drat 1.1668486 1 1.0000000
## ----------------------------------------------
## simultaneous 1.5138083 4 0.8241927
## ----------------------------------------------
Sidak p value Adjustment
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = 'sidak')
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## --------------------------
## Response : mpg
## Variables: disp hp wt drat
##
## Test Summary (Sidak p values)
## ----------------------------------------------
## Variable chi2 df p
## ----------------------------------------------
## disp 1.2355345 1 0.7102690
## hp 0.9209878 1 0.8070305
## wt 1.2529988 1 0.7049362
## drat 1.1668486 1 0.7313356
## ----------------------------------------------
## simultaneous 1.5138083 4 0.8241927
## ----------------------------------------------
Holm’s p value Adjustment
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = 'holm')
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## --------------------------
## Response : mpg
## Variables: disp hp wt drat
##
## Test Summary (Holm's p values)
## ----------------------------------------------
## Variable chi2 df p
## ----------------------------------------------
## disp 1.2355345 1 0.7990002
## hp 0.9209878 1 0.3372157
## wt 1.2529988 1 1.0000000
## drat 1.1668486 1 0.5600994
## ----------------------------------------------
## simultaneous 1.5138083 4 0.8241927
## ----------------------------------------------
Use fitted values of the model
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_score(model)
##
## Score Test for Heteroskedasticity
## ---------------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of mpg
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.5163959
## Prob > Chi2 = 0.4723832
Use independent variables of the model
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_score(model, rhs = TRUE)
##
## Score Test for Heteroskedasticity
## ---------------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: disp hp wt qsec
##
## Test Summary
## ----------------------------
## DF = 4
## Chi2 = 2.039404
## Prob > Chi2 = 0.7285114
Specify variables
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_score(model, vars = c('disp', 'hp'))
##
## Score Test for Heteroskedasticity
## ---------------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: disp hp
##
## Test Summary
## ----------------------------
## DF = 2
## Chi2 = 0.9983196
## Prob > Chi2 = 0.6070405
Use fitted values of the model
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_f(model)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of mpg
##
## Test Summary
## -------------------------
## Num DF = 1
## Den DF = 30
## F = 0.4920617
## Prob > F = 0.4884154
Use independent variables of the model
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_f(model, rhs = TRUE)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: disp hp wt qsec
##
## Test Summary
## -------------------------
## Num DF = 4
## Den DF = 27
## F = 0.4594694
## Prob > F = 0.7647271
Specify variables
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_test_f(model, vars = c('disp', 'hp'))
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: disp hp
##
## Test Summary
## -------------------------
## Num DF = 2
## Den DF = 29
## F = 0.4669306
## Prob > F = 0.631555
It is used to identify influential data points. It depends on both the residual and leverage i.e it takes it account both the x value and y value of the observation.
Steps to compute Cook’s distance:
A data point having a large cook’s d indicates that the data point strongly influences the fitted values.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_cooksd_bar(model)
Chart of Cook’s distance to detect observations that strongly influence fitted values of the model.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_cooksd_chart(model)
DFBETA measures the difference in each parameter estimate with and without the influential point. There is a DFBETA for each data point i.e if there are n observations and k variables, there will be n∗k DFBETAs. In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and 2/√n as a size-adjusted cutoff.
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
ols_plot_dfbetas(model)
Proposed by Welsch and Kuh (1977). It is the scaled difference between the i^th fitted value obtained from the full data and the i^th fitted value obtained by deleting the ith observation. DFFIT - difference in fits, is used to identify influential data points. It quantifies the number of standard deviations that the fitted value changes when the i^th data point is omitted.
Steps to compute DFFITs:
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_dffits(model)
Plot for detecting outliers. Studentized deleted residuals (or externally studentized residuals) is the deleted residual divided by its estimated standard deviation. Studentized residuals are going to be more effective for detecting outlying Y observations than standardized residuals. If an observation has an externally studentized residual that is larger than 3 (in absolute value) we can call it an outlier.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_stud(model)
Chart for detecting outliers. Standardized residual (internally studentized) is the residual divided by estimated standard deviation.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_stand(model)
Graph for detecting influential observations.
model <- lm(read ~ write + math + science, data = hsb)
ols_plot_resid_lev(model)
Graph for detecting outliers.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_stud_fit(model)
Hadi’s measure of influence based on the fact that influential observations can be present in either the response variable or in the predictors or both. The plot is used to detect influential observations based on Hadi’s measure.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_hadi(model)
Plot to aid in classifying unusual observations as high-leverage points, outliers, or a combination of both.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_pot(model)
Collinearity implies two variables are near perfect linear combinations of one another. Multicollinearity involves more than two variables. In the presence of multicollinearity, regression estimates are unstable and have high standard errors.
Variance inflation factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is “inflated” by the existence of correlation among the predictor variables in the model. A VIF of 1 means that there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of βk is not inflated at all. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
Steps to calculate VIF:
Regress the k^th predictor on rest of the predictors in the model. Compute the R^2 k
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_vif_tol(model)
Condition Index
Most multivariate statistical approaches involve decomposing a correlation matrix into linear combinations of variables. The linear combinations are chosen so that the first combination has the largest possible variance (subject to some restrictions we won’t discuss), the second combination has the next largest variance, subject to being uncorrelated with the first, the third has the largest possible variance, subject to being uncorrelated with the first and second, and so forth. The variance of each of these linear combinations is called an eigenvalue. Collinearity is spotted by finding 2 or more variables that have large proportions of variance (.50 or more) that correspond to large condition indices. A rule of thumb is to label as large those condition indices in the range of 30 or larger.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_eigen_cindex(model)
Collinearity Diagnostics
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_coll_diag(model)
## Tolerance and Variance Inflation Factor
## ---------------------------------------
## Variables Tolerance VIF
## 1 disp 0.1252279 7.985439
## 2 hp 0.1935450 5.166758
## 3 wt 0.1445726 6.916942
## 4 qsec 0.3191708 3.133119
##
##
## Eigenvalue and Condition Index
## ------------------------------
## Eigenvalue Condition Index intercept disp hp wt
## 1 4.721487187 1.000000 0.000123237 0.001132468 0.001413094 0.0005253393
## 2 0.216562203 4.669260 0.002617424 0.036811051 0.027751289 0.0002096014
## 3 0.050416837 9.677242 0.001656551 0.120881424 0.392366164 0.0377028008
## 4 0.010104757 21.616057 0.025805998 0.777260487 0.059594623 0.7017528428
## 5 0.001429017 57.480524 0.969796790 0.063914571 0.518874831 0.2598094157
## qsec
## 1 0.0001277169
## 2 0.0046789491
## 3 0.0001952599
## 4 0.0024577686
## 5 0.9925403056
Residual Fit Spread Plot
Plot to detect non-linearity, influential observations and outliers. Consists of side-by-side quantile plots of the centered fit and the residuals. It shows how much variation in the data is explained by the fit and how much remains in the residuals. For inappropriate models, the spread of the residuals in such a plot is often greater than the spread of the centered fit.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_resid_fit_spread(model)
Part & Partial Correlations
Correlations Relative importance of independent variables in determining Y. How much each variable uniquely contributes to R2 over and above that which can be accounted for by the other predictors.
Zero Order Pearson correlation coefficient between the dependent variable and the independent variables.
Part Unique contribution of independent variables. How much R2 will decrease if that variable is removed from the model?
Partial How much of the variance in Y, which is not estimated by the other independent variables in the model, is estimated by the specific variable?
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_correlations(model)
Observed vs Predicted Plot
Plot of observed vs fitted values to assess the fit of the model. Ideally, all your points should be close to a regressed diagonal line. Draw such a diagonal line within your graph and check out where the points lie. If your model had a high R Square, all the points would be close to this diagonal line. The lower the R Square, the weaker the Goodness of fit of your model, the more foggy or dispersed your points are from this diagonal line.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_obs_fit(model)
Lack of Fit F Test
Assess how much of the error in prediction is due to lack of model fit. The residual sum of squares resulting from a regression can be decomposed into 2 components:
If most of the error is due to lack of fit and not just random error, the model should be discarded and a new model must be built. The lack of fit F test works only with simple linear regression. Moreover, it is important that the data contains repeat observations i.e. replicates for at least one of the values of the predictor x. This test generally only applies to datasets with plenty of replicates.
model <- lm(mpg ~ disp, data = mtcars)
ols_pure_error_anova(model)
## Lack of Fit F Test
## -----------------
## Response : mpg
## Predictor: disp
##
## Analysis of Variance Table
## ----------------------------------------------------------------------
## DF Sum Sq Mean Sq F Value Pr(>F)
## ----------------------------------------------------------------------
## disp 1 808.8885 808.8885 314.0095 1.934413e-17
## Residual 30 317.1587 10.57196
## Lack of fit 25 304.2787 12.17115 4.724824 0.04563623
## Pure Error 5 12.88 2.576
## ----------------------------------------------------------------------
Diagnostics Panel
Panel of plots for regression diagnostics
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_diagnostics(model)
Residual vs Regressor Plots
Graph to determine whether we should add a new predictor to the model already containing other predictors. The residuals from the model is regressed on the new predictor and if the plot shows non random pattern, you should consider adding the new predictor to the model.
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
ols_plot_resid_regressor(model, 'drat')
Added Variable Plot
Added variable plot provides information about the marginal importance of a predictor variable Xk, given the other predictor variables already in the model. It shows the marginal importance of the variable in reducing the residual variability.
It enables us to visualize the regression coefficient of a new variable being considered to be included in a model. The plot can be constructed for each predictor variable.
Steps to construct an added variable plot:
A strong linear relationship in the added variable plot indicates the increased importance of the contribution of X to the model already containing the other predictors.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_added_variable(model)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Residual Plus Component Plot
The residual plus component plot was introduced by Ezekeil (1924). It was called as Partial Residual Plot by Larsen and McCleary (1972). Hadi and Chatterjee (2012) called it the residual plus component plot.
Steps to construct the plot:
The residual plus component plot indicates whether any non-linearity is present in the relationship between Y and X and can suggest possible transformations for linearizing the data.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_comp_plus_resid(model)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
a) A Short Introduction to the blorr Package
Data Set Information:
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
There are four datasets: 1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014] 2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs. 3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). 4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).
Attribute Information:
Input variables: Bank client data: 1 - age (numeric) 2 - job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’) 3 - marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed) 4 - education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’) 5 - default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’) 6 - housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’) 7 - loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
Related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: ‘cellular’,‘telephone’) 9 - month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’) 10 - day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’) 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) 14 - previous: number of contacts performed before this campaign and for this client (numeric) 15 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
Social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric) 17 - cons.price.idx: consumer price index - monthly indicator (numeric) 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 19 - euribor3m: euribor 3 month rate - daily indicator (numeric) 20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target): 21 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
To demonstrate the features of blorr, we will use the bank marketing data set. The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed. It contains a random sample (~4k) of the original data set which can be found at https://archive.ics.uci.edu/ml/datasets/bank+marketing
In order to execute, all character variables were transformed to Factor. Also the response variable was transformed to 1 for yes and 0 for now.
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
library(blorr)
library(magrittr)
library(readr)
bank_marketing <- read_delim("bank.csv", delim = ";",
escape_double = FALSE, trim_ws = TRUE)
bank_marketing[sapply(bank_marketing, is.character)] <- lapply(bank_marketing[sapply(bank_marketing, is.character)],
as.factor)
bank_marketing<-as.data.frame(bank_marketing)
bank_marketing$y<-as.factor(ifelse(bank_marketing$y=="yes",1,0))
Let us begin with careful bivariate analysis of each possible variable and the outcome variable. We will use information value and likelihood ratio chi square test for selecting the initial set of predictors for our model. The bivariate analysis is currently available for categorical predictors only.
blr_bivariate_analysis(bank_marketing, y, job, marital, education, default,
housing, loan, contact, poutcome)
## Bivariate Analysis
## ----------------------------------------------------------------------
## Variable Information Value LR Chi Square LR DF LR p-value
## ----------------------------------------------------------------------
## job 0.13 62.6171 11 0.0000
## marital 0.04 18.6302 2 1e-04
## education 0.03 14.8259 3 0.0020
## default 0.00 0.0076 1 0.9305
## housing 0.11 49.0572 1 0.0000
## loan 0.06 25.7773 1 0.0000
## contact 0.25 102.3786 2 0.0000
## poutcome 0.46 235.5376 3 0.0000
## ----------------------------------------------------------------------
Weight of evidence (WoE) is used to assess the relative risk of different attributes for a characteristic and as a means to transform characteristics into variables. It is also a very useful tool for binning. The WoE for any group with average odds is zero. A negative WoE indicates that the proportion of defaults is higher for that attribute than the overall proportion and indicates higher risk.
The information value is used to rank order variables in terms of their predictive power. A high information value indicates a high ability to discriminate. Values for the information value will always be positive and may be above 3 when assessing highly predictive characteristics. Characteristics with information values less than 0:10 are typically viewed as weak, while values over 0.30 are sought after.
blr_woe_iv(bank_marketing, job, y)
## Weight of Evidence
## --------------------------------------------------------------------------------
## levels count_0s count_1s dist_0s dist_1s woe iv
## --------------------------------------------------------------------------------
## admin. 420 58 0.10 0.11 -0.06 0.00
## blue-collar 877 69 0.22 0.13 0.50 0.04
## entrepreneur 153 15 0.04 0.03 0.28 0.00
## housemaid 98 14 0.02 0.03 -0.09 0.00
## management 838 131 0.21 0.25 -0.18 0.01
## retired 176 54 0.04 0.10 -0.86 0.05
## self-employed 163 20 0.04 0.04 0.06 0.00
## services 379 38 0.09 0.07 0.26 0.01
## student 65 19 0.02 0.04 -0.81 0.02
## technician 685 83 0.17 0.16 0.07 0.00
## unemployed 115 13 0.03 0.03 0.14 0.00
## unknown 31 7 0.01 0.01 -0.54 0.00
## --------------------------------------------------------------------------------
##
## Information Value
## -----------------------------
## Variable Information Value
## -----------------------------
## job 0.1323
## -----------------------------
k <- blr_woe_iv(bank_marketing, job, y)
plot(k)
We can generate the weight of evidence and information value for multiple variables using blr_woe_iv_stats().
blr_woe_iv_stats(bank_marketing, y, job, marital, education)
## Variable: job
##
## Weight of Evidence
## --------------------------------------------------------------------------------
## levels count_0s count_1s dist_0s dist_1s woe iv
## --------------------------------------------------------------------------------
## admin. 420 58 0.10 0.11 -0.06 0.00
## blue-collar 877 69 0.22 0.13 0.50 0.04
## entrepreneur 153 15 0.04 0.03 0.28 0.00
## housemaid 98 14 0.02 0.03 -0.09 0.00
## management 838 131 0.21 0.25 -0.18 0.01
## retired 176 54 0.04 0.10 -0.86 0.05
## self-employed 163 20 0.04 0.04 0.06 0.00
## services 379 38 0.09 0.07 0.26 0.01
## student 65 19 0.02 0.04 -0.81 0.02
## technician 685 83 0.17 0.16 0.07 0.00
## unemployed 115 13 0.03 0.03 0.14 0.00
## unknown 31 7 0.01 0.01 -0.54 0.00
## --------------------------------------------------------------------------------
##
## Information Value
## -----------------------------
## Variable Information Value
## -----------------------------
## job 0.1323
## -----------------------------
##
##
## Variable: marital
##
## Weight of Evidence
## ---------------------------------------------------------------------------
## levels count_0s count_1s dist_0s dist_1s woe iv
## ---------------------------------------------------------------------------
## divorced 451 77 0.11 0.15 -0.27 0.01
## married 2520 277 0.63 0.53 0.17 0.02
## single 1029 167 0.26 0.32 -0.22 0.01
## ---------------------------------------------------------------------------
##
## Information Value
## -----------------------------
## Variable Information Value
## -----------------------------
## marital 0.0401
## -----------------------------
##
##
## Variable: education
##
## Weight of Evidence
## ----------------------------------------------------------------------------
## levels count_0s count_1s dist_0s dist_1s woe iv
## ----------------------------------------------------------------------------
## primary 614 64 0.15 0.12 0.22 0.01
## secondary 2061 245 0.52 0.47 0.09 0.00
## tertiary 1157 193 0.29 0.37 -0.25 0.02
## unknown 168 19 0.04 0.04 0.14 0.00
## ----------------------------------------------------------------------------
##
## Information Value
## ------------------------------
## Variable Information Value
## ------------------------------
## education 0.0318
## ------------------------------
For the initial/ first cut model, all the independent variables are put into the model. Our goal is to include a limited number of independent variables (5-15) which are all significant, without sacrificing too much on the model performance. The rationale behind not-including too many variables is that the model would be over fitted and would become unstable when tested on the validation sample. The variable reduction is done using forward or backward or stepwise variable selection procedures. We will use blr_step_aic_both() to shortlist predictors for our model.
model <- glm(y ~ ., data = bank_marketing, family = binomial(link = 'logit'))
blr_step_aic_both(model)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1 . age
## 2 . job
## 3 . marital
## 4 . education
## 5 . default
## 6 . balance
## 7 . housing
## 8 . loan
## 9 . contact
## 10 . day
## 11 . month
## 12 . duration
## 13 . campaign
## 14 . pdays
## 15 . previous
## 16 . poutcome
##
##
## Variables Entered/Removed:
##
## - duration added
## - poutcome added
## - month added
## - contact added
## - loan added
## - marital added
## - housing added
## - campaign added
## - education added
## - day added
##
## No more variables to be added or removed.
##
##
## Stepwise Summary
## ---------------------------------------------------------
## Variable Method AIC BIC Deviance
## ---------------------------------------------------------
## duration addition 2705.753 2718.586 2701.753
## poutcome addition 2472.218 2504.300 2462.218
## month addition 2332.784 2435.448 2300.784
## contact addition 2282.539 2398.036 2246.539
## loan addition 2271.730 2393.643 2233.730
## marital addition 2264.149 2398.895 2222.149
## housing addition 2257.919 2399.082 2213.919
## campaign addition 2254.283 2401.862 2208.283
## education addition 2251.533 2418.362 2199.533
## day addition 2249.500 2422.745 2195.500
## ---------------------------------------------------------
As we can notice in the Stepwise Summary, as we start to add more variables, the AIC ad the deviance starts to improve. As we also can see on the following graphs.
model %>%
blr_step_aic_both() %>%
plot()
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1 . age
## 2 . job
## 3 . marital
## 4 . education
## 5 . default
## 6 . balance
## 7 . housing
## 8 . loan
## 9 . contact
## 10 . day
## 11 . month
## 12 . duration
## 13 . campaign
## 14 . pdays
## 15 . previous
## 16 . poutcome
##
##
## Variables Entered/Removed:
##
## - duration added
## - poutcome added
## - month added
## - contact added
## - loan added
## - marital added
## - housing added
## - campaign added
## - education added
## - day added
##
## No more variables to be added or removed.
We can use bivariate analysis and stepwise selection procedure to shortlist predictors and build the model using the glm(). The predictors used in the below model are for illustration purposes and not necessarily shortlisted from the bivariate analysis and variable selection procedures.
model <- glm(y ~ age + duration + previous + housing + default +
loan + poutcome + job + marital, data = bank_marketing,
family = binomial(link = 'logit'))
Using model
Let us look at the output generated from blr_regress():
blr_regress(model)
## Model Overview
## ------------------------------------------------------------------------
## Data Set Resp Var Obs. Df. Model Df. Residual Convergence
## ------------------------------------------------------------------------
## data y 4521 4520 4498 TRUE
## ------------------------------------------------------------------------
##
## Response Summary
## --------------------------------------------------------
## Outcome Frequency Outcome Frequency
## --------------------------------------------------------
## 0 4000 1 521
## --------------------------------------------------------
##
## Maximum Likelihood Estimates
## ----------------------------------------------------------------------
## Parameter DF Estimate Std. Error z value Pr(>|z|)
## ----------------------------------------------------------------------
## (Intercept) 1 -2.2351 0.4281 -5.2212 0.0000
## age 1 -0.0021 0.0067 -0.3205 0.7486
## duration 1 0.0039 2e-04 20.6861 0.0000
## previous 1 -0.0027 0.0377 -0.0728 0.9420
## housingyes 1 -0.6469 0.1203 -5.3771 0.0000
## defaultyes 1 0.5029 0.4115 1.2221 0.2217
## loanyes 1 -0.7334 0.1913 -3.8345 1e-04
## poutcomeother 1 0.4451 0.2570 1.7318 0.0833
## poutcomesuccess 1 2.3617 0.2544 9.2835 0.0000
## poutcomeunknown 1 -0.5772 0.2003 -2.8819 0.0040
## jobblue-collar 1 -0.5739 0.2262 -2.5368 0.0112
## jobentrepreneur 1 -0.3695 0.3644 -1.0141 0.3106
## jobhousemaid 1 -0.3708 0.3879 -0.9558 0.3392
## jobmanagement 1 0.1261 0.2032 0.6205 0.5349
## jobretired 1 0.6513 0.2912 2.2364 0.0253
## jobself-employed 1 -0.1754 0.3273 -0.5358 0.5921
## jobservices 1 -0.2990 0.2627 -1.1379 0.2552
## jobstudent 1 0.5049 0.3519 1.4345 0.1514
## jobtechnician 1 -0.1970 0.2196 -0.8973 0.3696
## jobunemployed 1 -0.6016 0.4056 -1.4831 0.1380
## jobunknown 1 0.3563 0.5289 0.6737 0.5005
## maritalmarried 1 -0.4384 0.1660 -2.6411 0.0083
## maritalsingle 1 -0.1642 0.1936 -0.8480 0.3964
## ----------------------------------------------------------------------
##
## Association of Predicted Probabilities and Observed Responses
## ---------------------------------------------------------------
## % Concordant 0.8735 Somers' D 0.7471
## % Discordant 0.1265 Gamma 0.7471
## % Tied 0.0000 Tau-a 0.1524
## Pairs 2084000 c 0.8735
## ---------------------------------------------------------------
If you want to examine the odds ratio estimates, set odd_conf_limit to TRUE. The odds ratio estimates are not explicitly computed as we observed considerable increase in computation time when dealing with large data sets.
Let us use the model formula and the data set to generate the above results.
blr_regress(y ~ age + duration + previous + housing + default +
loan + poutcome + job + marital, data = bank_marketing)
## Model Overview
## ------------------------------------------------------------------------
## Data Set Resp Var Obs. Df. Model Df. Residual Convergence
## ------------------------------------------------------------------------
## data y 4521 4520 4498 TRUE
## ------------------------------------------------------------------------
##
## Response Summary
## --------------------------------------------------------
## Outcome Frequency Outcome Frequency
## --------------------------------------------------------
## 0 4000 1 521
## --------------------------------------------------------
##
## Maximum Likelihood Estimates
## ----------------------------------------------------------------------
## Parameter DF Estimate Std. Error z value Pr(>|z|)
## ----------------------------------------------------------------------
## (Intercept) 1 -2.2351 0.4281 -5.2212 0.0000
## age 1 -0.0021 0.0067 -0.3205 0.7486
## duration 1 0.0039 2e-04 20.6861 0.0000
## previous 1 -0.0027 0.0377 -0.0728 0.9420
## housingyes 1 -0.6469 0.1203 -5.3771 0.0000
## defaultyes 1 0.5029 0.4115 1.2221 0.2217
## loanyes 1 -0.7334 0.1913 -3.8345 1e-04
## poutcomeother 1 0.4451 0.2570 1.7318 0.0833
## poutcomesuccess 1 2.3617 0.2544 9.2835 0.0000
## poutcomeunknown 1 -0.5772 0.2003 -2.8819 0.0040
## jobblue-collar 1 -0.5739 0.2262 -2.5368 0.0112
## jobentrepreneur 1 -0.3695 0.3644 -1.0141 0.3106
## jobhousemaid 1 -0.3708 0.3879 -0.9558 0.3392
## jobmanagement 1 0.1261 0.2032 0.6205 0.5349
## jobretired 1 0.6513 0.2912 2.2364 0.0253
## jobself-employed 1 -0.1754 0.3273 -0.5358 0.5921
## jobservices 1 -0.2990 0.2627 -1.1379 0.2552
## jobstudent 1 0.5049 0.3519 1.4345 0.1514
## jobtechnician 1 -0.1970 0.2196 -0.8973 0.3696
## jobunemployed 1 -0.6016 0.4056 -1.4831 0.1380
## jobunknown 1 0.3563 0.5289 0.6737 0.5005
## maritalmarried 1 -0.4384 0.1660 -2.6411 0.0083
## maritalsingle 1 -0.1642 0.1936 -0.8480 0.3964
## ----------------------------------------------------------------------
##
## Association of Predicted Probabilities and Observed Responses
## ---------------------------------------------------------------
## % Concordant 0.8735 Somers' D 0.7471
## % Discordant 0.1265 Gamma 0.7471
## % Tied 0.0000 Tau-a 0.1524
## Pairs 2084000 c 0.8735
## ---------------------------------------------------------------
Model fit statistics are available to assess how well the model fits the data and to compare two different models.The output includes likelihood ratio test, AIC, BIC and a host of pseudo r-squared measures.
blr_model_fit_stats(model)
## Model Fit Statistics
## ----------------------------------------------------------------------------------
## Log-Lik Intercept Only: -1615.500 Log-Lik Full Model: -1168.896
## Deviance(4498): 2337.792 LR(22): 893.208
## Prob > LR: 0.000
## MCFadden's R2 0.276 McFadden's Adj R2: 0.262
## ML (Cox-Snell) R2: 0.179 Cragg-Uhler(Nagelkerke) R2: 0.351
## McKelvey & Zavoina's R2: 0.352 Efron's R2: 0.259
## Count R2: 0.902 Adj Count R2: 0.148
## BIC: 2531.371 AIC: 2383.792
## ----------------------------------------------------------------------------------
In the event of deciding a cut-off point on the probability scores of a logistic regression model, a confusion matrix is created corresponding to a particular cut-off. The observations with probability scores above the cut-off score are predicted to be events and those below the cut-off score, as non-events. The confusion matrix, a 2X2 table, then calculates the number of correctly classified and miss-classified observations.
blr_confusion_matrix(model, cutoff = 0.5)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3919 363
## 1 81 158
##
##
## Accuracy : 0.9018
## No Information Rate : 0.8848
##
## Kappa : 0.3701
##
## McNemars's Test P-Value : 0.0000
##
## Sensitivity : 0.3033
## Specificity : 0.9798
## Pos Pred Value : 0.6611
## Neg Pred Value : 0.9152
## Prevalence : 0.1152
## Detection Rate : 0.0349
## Detection Prevalence : 0.0529
## Balanced Accuracy : 0.6415
## Precision : 0.6611
## Recall : 0.3033
##
## 'Positive' Class : 1
The validity of a cut-off is measured using sensitivity, specificity and accuracy.
Sensitivity: The % of correctly classified events out of all events = TP / (TP + FN)
Specificity: The % of correctly classified non-events out of all non-events = TN / (TN + FP)
Accuracy: The % of correctly classified observation over all observations = (TP + TN) / (TP + FP + TN + FN)
True Positive (TP) : Events correctly classified as events.
True Negative (TN) : Non-Events correctly classified as non-events.
False Positive (FP): Non-events miss-classified as events.
False Negative (FN): Events miss-classified as non-events.
For a standard logistic model, the higher is the cut-off, the lower will be the sensitivity and the higher would be the specificity. As the cut-off is decreased, sensitivity will go up, as then more events would be captured. Also, specificity will go down, as more non-events would miss-classified as events. Hence a trade-off is done based on the requirements. For example, if we are looking to capture as many events as possible, and we can afford to have miss-classified non-events, then a low cut-off is taken.
Hosmer and Lemeshow developed a goodness-of-fit test for logistic regression models with binary responses. The test involves dividing the data into approximately ten groups of roughly equal size based on the percentiles of the estimated probabilities. The observations are sorted in increasing order of their estimated probability of having an even outcome. The discrepancies between the observed and expected number of observations in these groups are summarized by the Pearson chi-square statistic, which is then compared to chi-square distribution with t degrees of freedom, where t is the number of groups minus 2. Lower values of Goodness-of-fit are preferred.
blr_test_hosmer_lemeshow(model)
## Partition for the Hosmer & Lemeshow Test
## --------------------------------------------------------------
## def = 1 def = 0
## Group Total Observed Expected Observed Expected
## --------------------------------------------------------------
## 1 453 3 6.41 450 446.59
## 2 452 2 10.34 450 441.66
## 3 452 8 13.47 444 438.53
## 4 452 6 17.06 446 434.94
## 5 452 16 21.27 436 430.73
## 6 452 25 27.10 427 424.90
## 7 452 31 34.94 421 417.06
## 8 452 58 49.75 394 402.25
## 9 452 126 87.30 326 364.70
## 10 452 246 253.36 206 198.64
## --------------------------------------------------------------
##
## Goodness of Fit Test
## ------------------------------
## Chi-Square DF Pr > ChiSq
## ------------------------------
## 43.7691 8 0.0000
## ------------------------------
A lift curve is a graphical representation of the % of cumulative events captured at a specific cut-off. The cut-off can be a particular decile or a percentile. Similar, to rank ordering procedure, the data is in descending order of the scores and is then grouped into deciles/percentiles. The cumulative number of observations and events are then computed for each decile/percentile. The lift curve is the created using the cumulative % population as the x-axis and the cumulative percentage of events as the y-axis.
blr_gains_table(model)
## decile total 1 0 ks tp tn fp fn sensitivity specificity
## 1 1 452 246 206 42.06689 246 3794 206 275 47.21689 94.850
## 2 2 452 126 326 58.10115 372 3468 532 149 71.40115 86.700
## 3 3 452 58 394 59.38359 430 3074 926 91 82.53359 76.850
## 4 4 452 31 421 54.80869 461 2653 1347 60 88.48369 66.325
## 5 5 452 25 427 48.93215 486 2226 1774 35 93.28215 55.650
## 6 6 452 16 436 41.10317 502 1790 2210 19 96.35317 44.750
## 7 7 452 6 446 31.10480 508 1344 2656 13 97.50480 33.600
## 8 8 452 8 444 21.54031 516 900 3100 5 99.04031 22.500
## 9 9 452 2 450 10.67418 518 450 3550 3 99.42418 11.250
## 10 10 453 3 450 0.00000 521 0 4000 0 100.00000 0.000
## accuracy
## 1 89.36076
## 2 84.93696
## 3 77.50498
## 4 68.87857
## 5 59.98673
## 6 50.69675
## 7 40.96439
## 8 31.32050
## 9 21.41119
## 10 11.52400
model %>%
blr_gains_table() %>%
plot()
ROC curve is a graphical representation of the validity of cut-offs for a logistic regression model. The ROC curve is plotted using the sensitivity and specificity for all possible cut-offs, i.e., all the probability scores. The graph is plotted using sensitivity on the y-axis and 1-specificity on the x-axis. Any point on the ROC curve represents a sensitivity X (1-specificity) measure corresponding to a cut-off. The area under the ROC curve is used as a validation measure for the model – the bigger the area the better is the model.
model %>%
blr_gains_table() %>%
blr_roc_curve()
The KS Statistic is again a measure of model efficiency, and it is created using the lift curve. The lift curve is created to plot % events. If we also plot % non-events on the same scale, with % population at x-axis, we would get another curve. The maximum distance between the lift curve for events and that for non-events is termed as KS. For a good model, KS should be big (>=0.3) and should occur as close to the event rate as possible.
model %>%
blr_gains_table() %>%
blr_ks_chart()
The decile lift chart displays the lift over the global mean event rate for each decile. For a model with good discriminatory power, the top deciles should have an event/conversion rate greater than the global mean.
model %>%
blr_gains_table() %>%
blr_decile_lift_chart()
If the model has good discriminatory power, the top deciles should have a higher event/conversion rate compared to the bottom deciles.
model %>%
blr_gains_table() %>%
blr_decile_capture_rate()
The Lorenz curve is a simple graphic device which illustrates the degree of inequality in the distribution of thevariable concerned. It is a visual representation of inequality used to measure the discriminatory power of the predictive model.
blr_lorenz_curve(model)
blorr can generate 22 plots for residual, influence and leverage diagnostics.
Influence Diagnostics
blr_plot_diag_influence(model)
From al the graphs above , we can focus and explain the Pearson residuals as the standardized distances between the observed and expected responses, and deviance residuals are defined as the signed square root of the individual contributions to the model deviance (i.e., the difference between the log-likelihoods of the saturated and fitted models)
On both residuals plots we can confirm that the residuals are stationary on all values.
blr_plot_diag_leverage(model)
blr_plot_diag_fit(model)
The delta deviance measures the change in the deviance goodness-of-fit statistic because of deleting a specific factor/covariate pattern. Delta deviance can be large because of a large residual (deviance or Pearson) and/or a large leverage. Minitab calculates a value for each distinct factor/covariate pattern.
The delta chi-square is the change in Pearson chi-square because of deleting all the observations with the jth factor/covariate pattern. Minitab calculates a delta chi-square value for each distinct factor/covariate pattern. Observations that are not fit well by the model. have high delta chi-square values.