For the second case study, we use the Credit dataset from the ISLR R package. The Credit dataset contains n = 400 observations and 11 variables:
| Variable | Description | Variable Values |
|---|---|---|
| Income | Income (in thousands of dollars) | A positive number with two decimal places |
| Limit | Credit limit | A positive integer from 855 to 13,913 |
| Rating | Credit rating | A positive integer from 93 to 982 |
| Cards | Number of credit cards | A positive integer from 1 to 9 |
| Age | Age in years | A positive integer from 23 to 98 |
| Education | Number of years of education | A positive integer from 5 to 20 |
| Gender | An indicator of the individual’s gender | A factor with two levels, Male and Female |
| Student | An indicator of whether the individual was a student | A factor with two levels, No and Yes |
| Married | An indicator of whether the individual was married | A factor with two levels, No and Yes |
| Ethnicity | An indicator of the individual’s ethnicity | A factor with three levels, African American, Asian, and Caucasian |
| Balance | Average credit card balance | A non-negative integer from 0 to 1999 |
The target variable is Balance, an integer-valued variable that ranges from 0 to 1,999 and represents the credit card balance (in $) for an individual.
The goal is to identify and interpret key factors that relate to a higher or lower “Balance” with the aid of appropriate linear models.
# CHUNK 1
library(ISLR)
data(Credit)
# Delete the first column containing row indices
Credit <- Credit[,-1]
head(Credit)
## Income Limit Rating Cards Age Education Gender Student Married Ethnicity
## 1 14.891 3606 283 2 34 11 Male No Yes Caucasian
## 2 106.025 6645 483 3 82 15 Female Yes Yes Asian
## 3 104.593 7075 514 4 71 11 Male No No Asian
## 4 148.924 9504 681 3 36 11 Female No No Asian
## 5 55.882 4897 357 2 68 16 Male No Yes Caucasian
## 6 80.180 8047 569 4 77 10 Male No No Caucasian
## Balance
## 1 333
## 2 903
## 3 580
## 4 964
## 5 331
## 6 1151
# CHUNK 2
library(ggplot2)
numeric.variables <- c("Income","Limit","Rating","Cards","Age","Education","Balance")
for (var in numeric.variables) {
  print(ggplot(Credit, aes(x = .data[[var]])) +
          geom_histogram() +
          xlab(var) +
          ylab("Frequency"))
}
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Credit[,numeric.variables])
## Income Limit Rating Cards
## Min. : 10.35 Min. : 855 Min. : 93.0 Min. :1.000
## 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2 1st Qu.:2.000
## Median : 33.12 Median : 4622 Median :344.0 Median :3.000
## Mean : 45.22 Mean : 4736 Mean :354.9 Mean :2.958
## 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2 3rd Qu.:4.000
## Max. :186.63 Max. :13913 Max. :982.0 Max. :9.000
## Age Education Balance
## Min. :23.00 Min. : 5.00 Min. : 0.00
## 1st Qu.:41.75 1st Qu.:11.00 1st Qu.: 68.75
## Median :56.00 Median :14.00 Median : 459.50
## Mean :55.67 Mean :13.45 Mean : 520.01
## 3rd Qu.:70.00 3rd Qu.:16.00 3rd Qu.: 863.00
## Max. :98.00 Max. :20.00 Max. :1999.00
# CHUNK 3
library(ggplot2)
ggplot(Credit, aes(x = Balance)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the target variable Balance is skewed to the right, with a large number of observations having a Balance of exactly 0. This conflicts with the normality assumption underlying linear models. However, because the main goal of this analysis is interpretability, we will not transform the target variable.
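As a quick check of these features (an illustrative snippet, not one of the numbered chunks), we can count the zero balances and recompute the mean and median reported in the CHUNK 2 summary:
# Number of observations with a zero balance
sum(Credit$Balance == 0)
# Mean (520.01) and median (459.50), as in the CHUNK 2 summary output
mean(Credit$Balance)
median(Credit$Balance)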
# CHUNK 4
# Calculate the correlation matrix for numeric variables
cor.matrix <- cor(Credit[,numeric.variables])
round(cor.matrix, digits = 3)
## Income Limit Rating Cards Age Education Balance
## Income 1.000 0.792 0.791 -0.018 0.175 -0.028 0.464
## Limit 0.792 1.000 0.997 0.010 0.101 -0.024 0.862
## Rating 0.791 0.997 1.000 0.053 0.103 -0.030 0.864
## Cards -0.018 0.010 0.053 1.000 0.043 -0.051 0.086
## Age 0.175 0.101 0.103 0.043 1.000 0.004 0.002
## Education -0.028 -0.024 -0.030 -0.051 0.004 1.000 -0.008
## Balance 0.464 0.862 0.864 0.086 0.002 -0.008 1.000
# CHUNK 5
ggplot(Credit, aes(x = Limit, y = Rating)) + geom_point()
# CHUNK 6
# Limit and Rating are almost perfectly correlated (r = 0.997), so we drop Limit
# and keep Rating to avoid severe collinearity
Credit$Limit <- NULL
head(Credit)
## Income Rating Cards Age Education Gender Student Married Ethnicity Balance
## 1 14.891 283 2 34 11 Male No Yes Caucasian 333
## 2 106.025 483 3 82 15 Female Yes Yes Asian 903
## 3 104.593 514 4 71 11 Male No No Asian 580
## 4 148.924 681 3 36 11 Female No No Asian 964
## 5 55.882 357 2 68 16 Male No Yes Caucasian 331
## 6 80.180 569 4 77 10 Male No No Caucasian 1151
# CHUNK 7
numeric.variables <- c("Income","Rating","Cards","Age","Education","Balance")
pairs(Credit[,numeric.variables])
# CHUNK 8
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Credit %>% group_by(Gender) %>% summarise(mean = mean(Balance), median = median(Balance), n = n())
## # A tibble: 2 × 4
## Gender mean median n
## <fct> <dbl> <int> <int>
## 1 " Male" 510. 463 193
## 2 "Female" 530. 456 207
Credit %>% group_by(Student) %>% summarise(mean = mean(Balance), median = median(Balance), n = n())
## # A tibble: 2 × 4
## Student mean median n
## <fct> <dbl> <dbl> <int>
## 1 No 480. 424 360
## 2 Yes 877. 953 40
Credit %>% group_by(Married) %>% summarise(mean = mean(Balance), median = median(Balance), n = n())
## # A tibble: 2 × 4
## Married mean median n
## <fct> <dbl> <int> <int>
## 1 No 523. 467 155
## 2 Yes 518. 454 245
Credit %>% group_by(Ethnicity) %>% summarise(mean = mean(Balance), median = median(Balance), n = n())
## # A tibble: 3 × 4
## Ethnicity mean median n
## <fct> <dbl> <dbl> <int>
## 1 African American 531 480 99
## 2 Asian 512. 414 102
## 3 Caucasian 518. 465 199
We produce the mean (and median) of Balance split by the levels of each of the four factor variables.
Since Balance is numeric and these predictors are categorical, split boxplots are the appropriate visual summary.
# CHUNK 9
ggplot(Credit, aes(x = Gender, y = Balance)) + geom_boxplot()
ggplot(Credit, aes(x = Married, y = Balance)) + geom_boxplot()
ggplot(Credit, aes(x = Ethnicity, y = Balance)) + geom_boxplot()
ggplot(Credit, aes(x = Student, y = Balance)) + geom_boxplot()
# CHUNK 10
ggplot(data = Credit, aes(x = Income, y = Balance, color = Student)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# CHUNK 11
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(lattice)
set.seed(8964)
partition <- createDataPartition(Credit$Balance, p = 0.75, list = FALSE)
data.train <- Credit[partition,]
data.test <- Credit[-partition,]
# CHUNK 12
model.full <- lm(Balance ~. + Income:Student, data = data.train)
summary(model.full)
##
## Call:
## lm(formula = Balance ~ . + Income:Student, data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189.17 -76.46 -15.65 66.69 296.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -552.34353 40.34543 -13.690 <2e-16 ***
## Income -7.93864 0.29465 -26.943 <2e-16 ***
## Rating 4.01246 0.06416 62.540 <2e-16 ***
## Cards 2.89772 4.69168 0.618 0.5373
## Age -0.78778 0.35994 -2.189 0.0294 *
## Education 0.88922 1.92035 0.463 0.6437
## GenderFemale -18.37030 12.00867 -1.530 0.1272
## StudentYes 395.67744 30.72302 12.879 <2e-16 ***
## MarriedYes -20.46469 12.39735 -1.651 0.0999 .
## EthnicityAsian 6.53269 17.13794 0.381 0.7033
## EthnicityCaucasian 11.95386 14.82444 0.806 0.4207
## Income:StudentYes 0.50102 0.50028 1.001 0.3174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.1 on 289 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.9502
## F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-16
The R^2 value is very high, suggesting that the model fits the training data well.
However, several variables still have high p-values: Cards, Education, Gender, Married, Ethnicity, and the Income:StudentYes interaction. These high p-values suggest that the corresponding features can be dropped and the model simplified.
In general, for linear models and GLMs, the usual practice is to set the baseline level of each factor to its most populous level, which makes the significance measures for the remaining levels more reliable.
# CHUNK 13
# Use the table() function to calculate the frequencies for each factor
#table <- as.data.frame(table(Credit$Gender))
# Determine the level with the highest frequency
#max <- which.max(table[,2])
# Save the name of the level with the highest frequency
#level.name <- as.character(table[max,1])
# Set the baseline level to the most populous level
#Credit$Gender <- relevel(Credit$Gender, ref = level.name)
# Check releveled result using summary()
#summary(Credit$Gender)
# Repeat the same steps for the other categorical predictors using a for loop
vars.categorical <- c("Gender", "Student", "Married", "Ethnicity")
for (v in vars.categorical) {
  table <- as.data.frame(table(Credit[, v]))
  max <- which.max(table[, 2])
  level.name <- as.character(table[max, 1])
  Credit[, v] <- relevel(Credit[, v], ref = level.name)
}
summary(Credit)
## Income Rating Cards Age
## Min. : 10.35 Min. : 93.0 Min. :1.000 Min. :23.00
## 1st Qu.: 21.01 1st Qu.:247.2 1st Qu.:2.000 1st Qu.:41.75
## Median : 33.12 Median :344.0 Median :3.000 Median :56.00
## Mean : 45.22 Mean :354.9 Mean :2.958 Mean :55.67
## 3rd Qu.: 57.47 3rd Qu.:437.2 3rd Qu.:4.000 3rd Qu.:70.00
## Max. :186.63 Max. :982.0 Max. :9.000 Max. :98.00
## Education Gender Student Married Ethnicity
## Min. : 5.00 Female:207 No :360 Yes:245 Caucasian :199
## 1st Qu.:11.00 Male :193 Yes: 40 No :155 African American: 99
## Median :14.00 Asian :102
## Mean :13.45
## 3rd Qu.:16.00
## Max. :20.00
## Balance
## Min. : 0.00
## 1st Qu.: 68.75
## Median : 459.50
## Mean : 520.01
## 3rd Qu.: 863.00
## Max. :1999.00
# CHUNK 14
# To make sure factors in the training set are releveled
data.train <- Credit[partition,]
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
##
## Call:
## lm(formula = Balance ~ . + Income:Student, data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189.17 -76.46 -15.65 66.69 296.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -579.22466 40.08600 -14.450 <2e-16 ***
## Income -7.93864 0.29465 -26.943 <2e-16 ***
## Rating 4.01246 0.06416 62.540 <2e-16 ***
## Cards 2.89772 4.69168 0.618 0.5373
## Age -0.78778 0.35994 -2.189 0.0294 *
## Education 0.88922 1.92035 0.463 0.6437
## Gender Male 18.37030 12.00867 1.530 0.1272
## StudentYes 395.67744 30.72302 12.879 <2e-16 ***
## MarriedNo 20.46469 12.39735 1.651 0.0999 .
## EthnicityAfrican American -11.95386 14.82444 -0.806 0.4207
## EthnicityAsian -5.42117 14.68523 -0.369 0.7123
## Income:StudentYes 0.50102 0.50028 1.001 0.3174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.1 on 289 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.9502
## F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-16
# CHUNK 15
#library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
data = Credit, fullRank = TRUE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
Credit.bin <- cbind(Credit, binarized_vars)
Credit.bin$Gender <- NULL
Credit.bin$Student <- NULL
Credit.bin$Married <- NULL
Credit.bin$Ethnicity <- NULL
data.train.bin <- Credit.bin[partition, ]
data.test.bin <- Credit.bin[-partition, ]
head(data.train.bin)
## Income Rating Cards Age Education Balance Gender..Male Student.Yes
## 1 14.891 283 2 34 11 333 1 0
## 2 106.025 483 3 82 15 903 0 1
## 3 104.593 514 4 71 11 580 1 0
## 6 80.180 569 4 77 10 1151 1 0
## 8 71.408 512 2 87 9 872 1 0
## 9 15.125 266 5 66 13 279 0 0
## Married.No Ethnicity.African.American Ethnicity.Asian
## 1 0 0 0
## 2 0 0 1
## 3 1 0 1
## 6 1 0 0
## 8 1 0 1
## 9 1 0 0
Although the lm() function automatically converts categorical predictors into dummy variables, many feature selection functions in R treat a factor as a single feature and either retain it with all of its levels or remove it completely. With binarization, we can add or drop individual factor levels when doing feature selection.
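For instance, the following hypothetical fit (purely illustrative; this model is not fitted anywhere else in the case study) keeps the Ethnicity.Asian dummy while dropping Ethnicity.African.American, something that cannot be done with the single Ethnicity factor:
# Hypothetical illustration: retain one Ethnicity level and drop the other
model.partial <- lm(Balance ~ Income + Rating + Student.Yes + Ethnicity.Asian,
                    data = data.train.bin)
summary(model.partial)$coefficients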
# CHUNK 16
# The interaction term is now Income:Student.Yes, not Income:Student
model.full.bin <- lm(Balance ~ . + Income:Student.Yes, data = data.train.bin)
summary(model.full.bin)
##
## Call:
## lm(formula = Balance ~ . + Income:Student.Yes, data = data.train.bin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189.17 -76.46 -15.65 66.69 296.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -579.22466 40.08600 -14.450 <2e-16 ***
## Income -7.93864 0.29465 -26.943 <2e-16 ***
## Rating 4.01246 0.06416 62.540 <2e-16 ***
## Cards 2.89772 4.69168 0.618 0.5373
## Age -0.78778 0.35994 -2.189 0.0294 *
## Education 0.88922 1.92035 0.463 0.6437
## Gender..Male 18.37030 12.00867 1.530 0.1272
## Student.Yes 395.67744 30.72302 12.879 <2e-16 ***
## Married.No 20.46469 12.39735 1.651 0.0999 .
## Ethnicity.African.American -11.95386 14.82444 -0.806 0.4207
## Ethnicity.Asian -5.42117 14.68523 -0.369 0.7123
## Income:Student.Yes 0.50102 0.50028 1.001 0.3174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.1 on 289 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.9502
## F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-16
AIC vs. BIC: Recall that the AIC and BIC differ only in the size of the penalty applied per parameter. For the AIC, the penalty per parameter is 2; for the BIC, it is the natural logarithm of the training set size, \(\ln n_{tr} = \ln 301 \approx 5.707\), which is greater. The BIC therefore tends to favor a model with fewer features and represents a more conservative approach to feature selection. Since our goal in this case study is to identify key factors related to the target variable Balance, it makes sense to take the conservative approach and work with as few features as necessary, using the BIC as the selection criterion.
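To see the two penalties side by side (an illustrative check, not one of the numbered chunks), we can compute the BIC penalty per parameter for the training set and compare the built-in AIC() and BIC() values of the full binarized model:
# Penalty per parameter: 2 for the AIC versus log(n) for the BIC
log(nrow(data.train.bin))  # about 5.707 for the 301 training observations
# Built-in criteria for the full binarized model (smaller is better for both)
AIC(model.full.bin)
BIC(model.full.bin)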
Backward vs. forward selection: The choice between forward and backward selection can be justified on similar grounds. With the selection criterion held fixed, forward selection is more likely to produce a final model with fewer features (because it starts from the null model, which has no features), which better aligns with our goal of identifying the key factors affecting Balance.
# CHUNK 17
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
model.backward.AIC <- stepAIC(model.full.bin)
## Start: AIC=2802.25
## Balance ~ Income + Rating + Cards + Age + Education + Gender..Male +
## Student.Yes + Married.No + Ethnicity.African.American + Ethnicity.Asian +
## Income:Student.Yes
##
## Df Sum of Sq RSS AIC
## - Ethnicity.Asian 1 1448 3071419 2800.4
## - Education 1 2278 3072249 2800.5
## - Cards 1 4052 3074024 2800.6
## - Ethnicity.African.American 1 6907 3076879 2800.9
## - Income:Student.Yes 1 10654 3080626 2801.3
## <none> 3069972 2802.3
## - Gender..Male 1 24859 3094830 2802.7
## - Married.No 1 28946 3098918 2803.1
## - Age 1 50885 3120857 2805.2
## - Rating 1 41548705 44618677 3605.9
##
## Step: AIC=2800.39
## Balance ~ Income + Rating + Cards + Age + Education + Gender..Male +
## Student.Yes + Married.No + Ethnicity.African.American + Income:Student.Yes
##
## Df Sum of Sq RSS AIC
## - Education 1 2255 3073675 2798.6
## - Cards 1 3911 3075331 2798.8
## - Ethnicity.African.American 1 5595 3077014 2798.9
## - Income:Student.Yes 1 10176 3081595 2799.4
## <none> 3071419 2800.4
## - Gender..Male 1 25165 3096584 2800.8
## - Married.No 1 29351 3100771 2801.3
## - Age 1 51462 3122881 2803.4
## - Rating 1 41550276 44621695 3603.9
##
## Step: AIC=2798.61
## Balance ~ Income + Rating + Cards + Age + Gender..Male + Student.Yes +
## Married.No + Ethnicity.African.American + Income:Student.Yes
##
## Df Sum of Sq RSS AIC
## - Cards 1 3865 3077540 2797.0
## - Ethnicity.African.American 1 5586 3079261 2797.2
## - Income:Student.Yes 1 10214 3083888 2797.6
## <none> 3073675 2798.6
## - Gender..Male 1 25155 3098830 2799.1
## - Married.No 1 28499 3102174 2799.4
## - Age 1 51652 3125327 2801.6
## - Rating 1 41550265 44623939 3601.9
##
## Step: AIC=2796.99
## Balance ~ Income + Rating + Age + Gender..Male + Student.Yes +
## Married.No + Ethnicity.African.American + Income:Student.Yes
##
## Df Sum of Sq RSS AIC
## - Ethnicity.African.American 1 5579 3083119 2795.5
## - Income:Student.Yes 1 10767 3088307 2796.0
## <none> 3077540 2797.0
## - Gender..Male 1 23659 3101198 2797.3
## - Married.No 1 27917 3105457 2797.7
## - Age 1 49660 3127200 2799.8
## - Rating 1 42161655 45239195 3604.0
##
## Step: AIC=2795.54
## Balance ~ Income + Rating + Age + Gender..Male + Student.Yes +
## Married.No + Income:Student.Yes
##
## Df Sum of Sq RSS AIC
## - Income:Student.Yes 1 9873 3092992 2794.5
## <none> 3083119 2795.5
## - Gender..Male 1 23435 3106554 2795.8
## - Married.No 1 24716 3107835 2795.9
## - Age 1 50672 3133791 2798.4
## - Rating 1 42165674 45248793 3602.1
##
## Step: AIC=2794.5
## Balance ~ Income + Rating + Age + Gender..Male + Student.Yes +
## Married.No
##
## Df Sum of Sq RSS AIC
## <none> 3092992 2794.5
## - Gender..Male 1 22314 3115306 2794.7
## - Married.No 1 25027 3118019 2794.9
## - Age 1 49133 3142125 2797.2
## - Student.Yes 1 4931600 8024592 3079.5
## - Income 1 8357794 11450786 3186.5
## - Rating 1 42247226 45340218 3600.7
summary(model.backward.AIC)
##
## Call:
## lm(formula = Balance ~ Income + Rating + Age + Gender..Male +
## Student.Yes + Married.No, data = data.train.bin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -195.16 -76.45 -14.03 65.33 285.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -565.22267 26.73779 -21.139 <2e-16 ***
## Income -7.87666 0.27946 -28.186 <2e-16 ***
## Rating 4.01180 0.06331 63.370 <2e-16 ***
## Age -0.77016 0.35638 -2.161 0.0315 *
## Gender..Male 17.32103 11.89326 1.456 0.1464
## Student.Yes 418.87031 19.34645 21.651 <2e-16 ***
## Married.No 18.73773 12.14860 1.542 0.1241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.6 on 294 degrees of freedom
## Multiple R-squared: 0.9517, Adjusted R-squared: 0.9507
## F-statistic: 964.8 on 6 and 294 DF, p-value: < 2.2e-16
# CHUNK 18
# first fit the null model (i.e., model with no predictors)
model.null <- lm(Balance ~ 1, data = data.train.bin)
model.forward.BIC <- stepAIC(model.null,
                             direction = "forward",
                             scope = list(upper = model.full.bin, lower = model.null),
                             k = log(nrow(data.train.bin)))
## Start: AIC=3698.12
## Balance ~ 1
##
## Df Sum of Sq RSS AIC
## + Rating 1 47082983 16908423 3303.2
## + Income 1 12910007 51081399 3636.0
## + Student.Yes 1 5425335 58566071 3677.2
## <none> 63991406 3698.1
## + Cards 1 163291 63828115 3703.1
## + Gender..Male 1 127201 63864205 3703.2
## + Age 1 79880 63911526 3703.5
## + Education 1 13121 63978284 3703.8
## + Married.No 1 10920 63980486 3703.8
## + Ethnicity.Asian 1 3049 63988357 3703.8
## + Ethnicity.African.American 1 0 63991406 3703.8
##
## Step: AIC=3303.21
## Balance ~ Rating
##
## Df Sum of Sq RSS AIC
## + Income 1 8650956 8257467 3093.2
## + Student.Yes 1 4817765 12090659 3208.0
## + Age 1 707560 16200863 3296.1
## <none> 16908423 3303.2
## + Married.No 1 174547 16733876 3305.8
## + Education 1 71402 16837022 3307.6
## + Ethnicity.Asian 1 49620 16858804 3308.0
## + Gender..Male 1 31887 16876536 3308.4
## + Cards 1 30049 16878374 3308.4
## + Ethnicity.African.American 1 4304 16904120 3308.8
##
## Step: AIC=3093.2
## Balance ~ Rating + Income
##
## Df Sum of Sq RSS AIC
## + Student.Yes 1 5069114 3188353 2812.5
## <none> 8257467 3093.2
## + Married.No 1 127904 8129563 3094.2
## + Age 1 95208 8162259 3095.4
## + Education 1 44496 8212971 3097.3
## + Ethnicity.Asian 1 40115 8217352 3097.4
## + Cards 1 22019 8235448 3098.1
## + Ethnicity.African.American 1 9464 8248003 3098.6
## + Gender..Male 1 244 8257223 3098.9
##
## Step: AIC=2812.47
## Balance ~ Rating + Income + Student.Yes
##
## Df Sum of Sq RSS AIC
## <none> 3188353 2812.5
## + Age 1 46257 3142096 2813.8
## + Gender..Male 1 24128 3164225 2815.9
## + Married.No 1 23759 3164595 2815.9
## + Income:Student.Yes 1 7569 3180784 2817.5
## + Ethnicity.African.American 1 2346 3186007 2817.9
## + Education 1 1600 3186753 2818.0
## + Cards 1 865 3187488 2818.1
## + Ethnicity.Asian 1 747 3187606 2818.1
summary(model.forward.BIC)
##
## Call:
## lm(formula = Balance ~ Rating + Income + Student.Yes, data = data.train.bin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -220.292 -78.120 -7.473 63.273 299.895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -589.16636 16.11012 -36.57 <2e-16 ***
## Rating 4.01634 0.06353 63.22 <2e-16 ***
## Income -7.97413 0.27691 -28.80 <2e-16 ***
## Student.Yes 421.17339 19.38206 21.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.6 on 297 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9497
## F-statistic: 1888 on 3 and 297 DF, p-value: < 2.2e-16
# CHUNK 19
# Calculate the test RMSE of model.null, model.full.bin, model.backward.AIC, and model.forward.BIC
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
rmse(data.test.bin$Balance, predict(model.null, newdata = data.test.bin))
## [1] 453.3665
rmse(data.test.bin$Balance, predict(model.full.bin, newdata = data.test.bin))
## [1] 105.0656
rmse(data.test.bin$Balance, predict(model.backward.AIC, newdata = data.test.bin))
## [1] 104.2375
rmse(data.test.bin$Balance, predict(model.forward.BIC, newdata = data.test.bin))
## [1] 102.6785
The model based on forward selection and the BIC is recommended: it has the lowest test RMSE (102.68) and, with only three features, is the easiest to interpret.
# CHUNK 20
# Draw the “Residuals vs Fitted” and “Normal Q-Q” diagnostic plots
plot(model.forward.BIC, which = 1:2)
# CHUNK 21
#install.packages("glmnet") # uncomment this line the first time you use glmnet
#library(glmnet)
#library(Matrix)
# Change data format for glmnet().
# Train the model using glmnet()
# Extract coefficient estimates
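A minimal sketch of what this chunk could contain, assuming a ridge fit (alpha = 0) over the five \(\lambda\) values referenced below; the model formula and the value of alpha are assumptions made for illustration, not necessarily the settings used to produce the original output:
library(glmnet)
# Change data format for glmnet(): numeric design matrix (intercept column dropped) and response vector
X.train <- model.matrix(Balance ~ ., data = data.train)[, -1]
y.train <- data.train$Balance
# Train the model using glmnet() over five user-supplied lambda values
m.lambda <- glmnet(x = X.train, y = y.train, family = "gaussian",
                   alpha = 0, lambda = c(1000, 500, 100, 10, 0))
# Extract coefficient estimates: one column (s0, ..., s4) per lambda value
coef(m.lambda)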
The values of \(\lambda\) are arranged in descending order across the five columns from left to right, with s0, s1, s2, s3, and s4 representing \(\lambda = 1000, 500, 100, 10, 0\), respectively.
# CHUNK 22
# CHUNK 24
# Fit the regularized model using lambda.min
# List the coefficient estimates
# Change test data format for glmnet()
# Calculate test RMSE
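A hedged sketch of what CHUNK 22 and CHUNK 24 might contain, reusing X.train and y.train from the sketch above and obtaining lambda.min from cv.glmnet(); alpha = 0 and the seed remain illustrative assumptions:
# Cross-validation to obtain lambda.min
set.seed(8964)
m.cv <- cv.glmnet(x = X.train, y = y.train, family = "gaussian", alpha = 0)
m.cv$lambda.min
# Fit the regularized model using lambda.min and list the coefficient estimates
m.min <- glmnet(x = X.train, y = y.train, family = "gaussian",
                alpha = 0, lambda = m.cv$lambda.min)
coef(m.min)
# Change test data format for glmnet() and calculate the test RMSE
X.test <- model.matrix(Balance ~ ., data = data.test)[, -1]
rmse(data.test$Balance, as.numeric(predict(m.min, newx = X.test)))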
[Comment 1: Explain results]
The mean and median are about the same for all the numeric variables except Income and Balance; a mean noticeably larger than the median indicates a non-symmetric, right-skewed distribution.
For Income and Balance in particular, the maximum lies far above the third quartile, which points to a long right tail.
By contrast, the minimum and first quartile of Income, Cards, and Balance are relatively close, so these distributions have short left tails.
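As a quick numerical check of these observations (an illustrative snippet using the numeric variables remaining after CHUNK 6), we can compare the mean with the median and look at the gaps between the quartiles and the extremes:
# For each numeric variable: mean minus median, gap above Q3, and gap below Q1
sapply(Credit[, numeric.variables], function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  c(mean_minus_median = mean(x) - median(x),  # clearly positive suggests right skew
    max_minus_q3 = max(x) - q[2],             # a large gap points to a long right tail
    q1_minus_min = q[1] - min(x))             # a small gap points to a short left tail
})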