Problem

For the second case study, we use the Credit dataset in the ISLR R library. The Credit dataset contains n = 400 observations on 11 variables:

Variable    Description                                            Values
Income      Income                                                 A positive number with two decimal places
Limit       Credit limit                                           A positive integer from 855 to 13,913
Rating      Credit rating                                          A positive integer from 93 to 982
Cards       Number of credit cards                                 A positive integer from 1 to 9
Age         Age in years                                           A positive integer from 23 to 98
Education   Number of years of education                           A positive integer from 5 to 20
Gender      An indicator of the individual’s gender                A factor with two levels, Male and Female
Student     An indicator of whether the individual was a student   A factor with two levels, No and Yes
Married     An indicator of whether the individual was married     A factor with two levels, No and Yes
Ethnicity   An indicator of the individual’s ethnicity             A factor with three levels, African American, Asian, and Caucasian
Balance     Average credit card balance                            A non-negative integer from 0 to 1999

The target variable is Balance, an integer-valued variable that ranges from 0 to 1,999 and represents the credit card balance (in $) for an individual.

  1. The first six variables, Income, Limit, Rating, Cards, Age, and Education, are numeric predictors of Balance.
  2. The other four variables, Gender, Student, Married, and Ethnicity, are categorical predictors. The first three are binary, and Ethnicity has three levels.

The goal is to identify and interpret key factors that relate to a higher or lower “Balance” with the aid of appropriate linear models.

Load the Credit dataset from the ISLR library

# CHUNK 1
library(ISLR)
data(Credit)


# Delete the first column containing row indices
Credit <- Credit[,-1]
head(Credit)
##    Income Limit Rating Cards Age Education Gender Student Married Ethnicity
## 1  14.891  3606    283     2  34        11   Male      No     Yes Caucasian
## 2 106.025  6645    483     3  82        15 Female     Yes     Yes     Asian
## 3 104.593  7075    514     4  71        11   Male      No      No     Asian
## 4 148.924  9504    681     3  36        11 Female      No      No     Asian
## 5  55.882  4897    357     2  68        16   Male      No     Yes Caucasian
## 6  80.180  8047    569     4  77        10   Male      No      No Caucasian
##   Balance
## 1     333
## 2     903
## 3     580
## 4     964
## 5     331
## 6    1151

TASK 1: Explore the numeric variables

Univariate data exploration

Descriptive statistics
# CHUNK 2
library(ggplot2)

numeric.variables <- c("Income","Limit","Rating","Cards","Age","Education","Balance")

for (var in numeric.variables)
{
  # .data[[var]] is the tidy-eval way to refer to a column by its name
  print(ggplot(Credit, aes(x = .data[[var]])) + geom_histogram() + xlab(var) + ylab("Frequency"))
}
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(The same message is printed once for each of the seven histograms.)

summary(Credit[,numeric.variables])
##      Income           Limit           Rating          Cards      
##  Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000  
##  1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000  
##  Median : 33.12   Median : 4622   Median :344.0   Median :3.000  
##  Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958  
##  3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000  
##  Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000  
##       Age          Education        Balance       
##  Min.   :23.00   Min.   : 5.00   Min.   :   0.00  
##  1st Qu.:41.75   1st Qu.:11.00   1st Qu.:  68.75  
##  Median :56.00   Median :14.00   Median : 459.50  
##  Mean   :55.67   Mean   :13.45   Mean   : 520.01  
##  3rd Qu.:70.00   3rd Qu.:16.00   3rd Qu.: 863.00  
##  Max.   :98.00   Max.   :20.00   Max.   :1999.00
[Comment 1: Explain results]
  • The mean and median are about the same for all the variables except Income and Balance, whose means are noticeably larger than their medians. A mean that differs from the median indicates a non-symmetric (here, right-skewed) distribution.

  • For Income and Balance, the gap between the third quartile and the maximum is much wider than the gap between the first quartile and the minimum. This is the signature of a heavy right tail.

  • The first quartile and minimum for Income, Cards, and Balance are close, so these distributions have short left tails rather than heavy ones. A quick numeric check is sketched below.
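
As a quick numeric check of these claims, the mean-median gaps can be computed directly (a minimal sketch, reusing the numeric.variables vector defined in CHUNK 2):

# Compare the mean and median of each numeric variable;
# a mean well above the median points to right skew
sapply(Credit[, numeric.variables], function(x) c(mean = mean(x), median = median(x)))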

Visual representations
Investigate the distribution of the target variable
# CHUNK 3
library(ggplot2)

ggplot(Credit, aes(x = Balance)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Comment 2: Explain results]

The distribution of the target variable Balance is skewed to the right, with a large number of observations having a Balance of exactly 0. Strictly speaking, the normality assumption in linear models applies to the random errors rather than to the unconditional target, but a heavily skewed target with a point mass at 0 suggests the usual model assumptions will be strained. However, the main goal of this analysis is interpretability, so we will not transform the target variable.

Bivariate data exploration

Descriptive statistics
Create a correlation matrix for all the numeric variables (including the target variable) in the dataset
# CHUNK 4
# Calculate the correlation matrix for numeric variables
cor.matrix <- cor(Credit[,numeric.variables])
round(cor.matrix, digits = 3)
##           Income  Limit Rating  Cards   Age Education Balance
## Income     1.000  0.792  0.791 -0.018 0.175    -0.028   0.464
## Limit      0.792  1.000  0.997  0.010 0.101    -0.024   0.862
## Rating     0.791  0.997  1.000  0.053 0.103    -0.030   0.864
## Cards     -0.018  0.010  0.053  1.000 0.043    -0.051   0.086
## Age        0.175  0.101  0.103  0.043 1.000     0.004   0.002
## Education -0.028 -0.024 -0.030 -0.051 0.004     1.000  -0.008
## Balance    0.464  0.862  0.864  0.086 0.002    -0.008   1.000
[Comment 3: Explain results]
  • There are 3 pairs of variables that exhibit a strong correlation:
    • (Income, Limit) - 0.792
    • (Income, Rating) - 0.791
    • (Rating, Limit) - 0.997
  • Large correlations suggest duplication of the information contained in the related variables.
  • The most alarming value here is 0.997, which suggests an almost perfect linear relationship between Limit and Rating. A programmatic way to flag such pairs is sketched below.
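
Rather than reading the matrix by eye, the strongly correlated pairs can be flagged programmatically (a sketch, reusing cor.matrix from CHUNK 4; the 0.7 cutoff is an arbitrary choice):

# Indices of off-diagonal entries with |correlation| > 0.7
idx <- which(abs(cor.matrix) > 0.7 & abs(cor.matrix) < 1, arr.ind = TRUE)
idx <- idx[idx[, 1] < idx[, 2], , drop = FALSE]  # keep each pair once
data.frame(var1 = rownames(cor.matrix)[idx[, 1]],
           var2 = colnames(cor.matrix)[idx[, 2]],
           cor  = round(cor.matrix[idx], 3))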
Draw a scatterplot between Limit and Rating
# CHUNK 5
ggplot(Credit, aes(x = Limit, y = Rating)) + geom_point()

[Comment 4: Explain the plot and decide whether we can drop one of them]
  • The plot of Rating against Limit essentially forms a straight line. This confirms the near-perfect correlation between the two variables, so one of them can be dropped: it carries information already provided by the other.
  • We keep Rating, since it has the (slightly) higher correlation with the target variable Balance (0.864 > 0.862).
Delete the variable if needed
# CHUNK 6
Credit$Limit <- NULL
head(Credit)
##    Income Rating Cards Age Education Gender Student Married Ethnicity Balance
## 1  14.891    283     2  34        11   Male      No     Yes Caucasian     333
## 2 106.025    483     3  82        15 Female     Yes     Yes     Asian     903
## 3 104.593    514     4  71        11   Male      No      No     Asian     580
## 4 148.924    681     3  36        11 Female      No      No     Asian     964
## 5  55.882    357     2  68        16   Male      No     Yes Caucasian     331
## 6  80.180    569     4  77        10   Male      No      No Caucasian    1151
Create visual representations between the numeric variables and the target variable
# CHUNK 7
numeric.variables <- c("Income","Rating","Cards","Age","Education","Balance")

pairs(Credit[,numeric.variables])

[Comment 5: Explain results]
  • Among the predictor-versus-Balance panels, only Income and Rating show a noticeable relationship with Balance. Income appears to have a moderately positive, linear relationship with Balance; Rating has a strong positive, linear (perhaps slightly quadratic) relationship with Balance. The plots of Cards, Age, and Education against Balance exhibit no noticeable pattern. These observations match the correlation matrix.
  • The only other noteworthy panel is Income against Rating which, consistent with our earlier observation, shows a strong, positive, linear relationship.
  • All other plots show no pattern in trend or spread.

TASK 3: Explore the categorical variables

Descriptive statistics: calculate the mean, median, and number of observations of Balance split by each level of the categorical variables.
# CHUNK 8
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   0.3.4      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
Credit %>% group_by_("Gender") %>%summarise(mean = mean(Balance), median = median(Balance), n = n())
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 2 × 4
##   Gender    mean median     n
##   <fct>    <dbl>  <int> <int>
## 1 " Male"   510.    463   193
## 2 "Female"  530.    456   207
Credit %>% group_by_("Student") %>%summarise(mean = mean(Balance), median = median(Balance), n = n())
## # A tibble: 2 × 4
##   Student  mean median     n
##   <fct>   <dbl>  <dbl> <int>
## 1 No       480.    424   360
## 2 Yes      877.    953    40
Credit %>% group_by_("Married") %>%  summarise(mean = mean(Balance), median = median(Balance), n = n())
## # A tibble: 2 × 4
##   Married  mean median     n
##   <fct>   <dbl>  <int> <int>
## 1 No       523.    467   155
## 2 Yes      518.    454   245
Credit %>% group_by_("Ethnicity") %>%  summarise(mean = mean(Balance), median = median(Balance), n = n())
## # A tibble: 3 × 4
##   Ethnicity         mean median     n
##   <fct>            <dbl>  <dbl> <int>
## 1 African American  531     480    99
## 2 Asian             512.    414   102
## 3 Caucasian         518.    465   199

We produce the mean (and median) of Balance split by different levels of each of the four factor variables.

[Comment 6: Explain results]
  • Across the levels of each categorical variable, the mean and median of Balance differ substantially only for Student. The mean balance for students (876.83) is much higher than that for non-students (480.37).

Since Balance is a numeric variable and we are now dealing with categorical predictors, we will use split boxplots for visual representation.

Visual representations: Draw split boxplots between balance and each categorical variables.
# CHUNK 9 
ggplot(Credit, aes(x = Gender, y = Balance)) + geom_boxplot()

ggplot(Credit, aes(x = Married, y = Balance)) + geom_boxplot()

ggplot(Credit, aes(x = Ethnicity, y = Balance)) + geom_boxplot()

ggplot(Credit, aes(x = Student, y = Balance)) + geom_boxplot()

[Comment 7: Explain results]
  • Consistent with the split means and medians above, the boxplots are similar across levels for every categorical variable except Student.

TASK 4: Investigate the interaction effect.

Is there an interaction between Income and Student? Draw a scatterplot of Balance against Income colored by Student.

# CHUNK 10
ggplot(data = Credit, aes(x = Income, y = Balance, color = Student)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

[Comment 8: Explain results]
  • The slope of Balance against Income differs depending on whether a person is a student: the balance of a non-student is more sensitive to income (steeper slope) than that of a student.
  • The differing slopes point to a possible interaction between Income and Student in their effect on Balance; the two slopes are quantified in the sketch below.
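
To put numbers on the two slopes, simple regressions of Balance on Income can be fit separately for each Student level (a minimal sketch; these are the same lines drawn by geom_smooth() in CHUNK 10):

# Slope of Balance on Income within each Student group
coef(lm(Balance ~ Income, data = Credit[Credit$Student == "No", ]))
coef(lm(Balance ~ Income, data = Credit[Credit$Student == "Yes", ]))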

TASK 5: Explore the effect of releveling factor variables

Split the data into training (75%) and test (25%) sets

# CHUNK 11
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(lattice)

set.seed(8964)
partition <- createDataPartition(Credit$Balance, p = 0.75, list = FALSE)
data.train <- Credit[partition,]
data.test <- Credit[-partition,]

Fit a multiple linear regression model for Balance on all variables on the training set and get model summary.

# CHUNK 12
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
## 
## Call:
## lm(formula = Balance ~ . + Income:Student, data = data.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.17  -76.46  -15.65   66.69  296.17 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -552.34353   40.34543 -13.690   <2e-16 ***
## Income               -7.93864    0.29465 -26.943   <2e-16 ***
## Rating                4.01246    0.06416  62.540   <2e-16 ***
## Cards                 2.89772    4.69168   0.618   0.5373    
## Age                  -0.78778    0.35994  -2.189   0.0294 *  
## Education             0.88922    1.92035   0.463   0.6437    
## GenderFemale        -18.37030   12.00867  -1.530   0.1272    
## StudentYes          395.67744   30.72302  12.879   <2e-16 ***
## MarriedYes          -20.46469   12.39735  -1.651   0.0999 .  
## EthnicityAsian        6.53269   17.13794   0.381   0.7033    
## EthnicityCaucasian   11.95386   14.82444   0.806   0.4207    
## Income:StudentYes     0.50102    0.50028   1.001   0.3174    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.1 on 289 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.9502 
## F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16
[Comment 9: Comment on the model’s summary quantities]
  • The R^2 value is very high, suggesting that the model fits the training data well.

  • However, several features still have high p-values: Cards, Education, Gender, Married, Ethnicity, and the Income:StudentYes interaction term. The high p-values suggest these features could be dropped and the model simplified.

[Comment 10: Interpret the coefficient estimates for Married and for the Asian level of Ethnicity]
  • Holding all other predictors fixed, the balance of married individuals is lower than that of unmarried individuals by 20.46469 on average.
  • Holding all other predictors fixed, the balance of Asians is higher than that of African Americans (the baseline level) by 6.53269 on average.

Relevel the factor variables so that the most frequent level becomes the baseline level

In general, for linear models and GLMs, the usual practice is to set the baseline level of each factor to its most populous level, which improves the accuracy of our measures of significance for the predictors.

# CHUNK 13
# Use the table() function to calculate the frequencies for each factor
#table <- as.data.frame(table(Credit$Gender))

# Determine the level with the highest frequency
#max <- which.max(table[,2])

# Save the name of the level with the highest frequency
#level.name <- as.character(table[max,1])

# Set the baseline level to the most populous level
#Credit$Gender <- relevel(Credit$Gender, ref = level.name)

# Check releveled result using summary()
#summary(Credit$Gender)

# Apply the same steps to the other categorical predictors using a for loop
vars.categorical <- c("Gender","Student","Married","Ethnicity")
for (v in vars.categorical)
{
  table <- as.data.frame(table(Credit[,v]))
  max <- which.max(table[,2])
  level.name <- as.character(table[max,1])
  Credit[,v] <- relevel(Credit[,v], ref = level.name)
}

summary(Credit)
##      Income           Rating          Cards            Age       
##  Min.   : 10.35   Min.   : 93.0   Min.   :1.000   Min.   :23.00  
##  1st Qu.: 21.01   1st Qu.:247.2   1st Qu.:2.000   1st Qu.:41.75  
##  Median : 33.12   Median :344.0   Median :3.000   Median :56.00  
##  Mean   : 45.22   Mean   :354.9   Mean   :2.958   Mean   :55.67  
##  3rd Qu.: 57.47   3rd Qu.:437.2   3rd Qu.:4.000   3rd Qu.:70.00  
##  Max.   :186.63   Max.   :982.0   Max.   :9.000   Max.   :98.00  
##    Education        Gender    Student   Married              Ethnicity  
##  Min.   : 5.00   Female:207   No :360   Yes:245   Caucasian       :199  
##  1st Qu.:11.00    Male :193   Yes: 40   No :155   African American: 99  
##  Median :14.00                                    Asian           :102  
##  Mean   :13.45                                                          
##  3rd Qu.:16.00                                                          
##  Max.   :20.00                                                          
##     Balance       
##  Min.   :   0.00  
##  1st Qu.:  68.75  
##  Median : 459.50  
##  Mean   : 520.01  
##  3rd Qu.: 863.00  
##  Max.   :1999.00

Refit a multiple linear regression model with the releveling.

# CHUNK 14
# To make sure factors in the training set are releveled
data.train <- Credit[partition,]
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
## 
## Call:
## lm(formula = Balance ~ . + Income:Student, data = data.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.17  -76.46  -15.65   66.69  296.17 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -579.22466   40.08600 -14.450   <2e-16 ***
## Income                      -7.93864    0.29465 -26.943   <2e-16 ***
## Rating                       4.01246    0.06416  62.540   <2e-16 ***
## Cards                        2.89772    4.69168   0.618   0.5373    
## Age                         -0.78778    0.35994  -2.189   0.0294 *  
## Education                    0.88922    1.92035   0.463   0.6437    
## Gender Male                 18.37030   12.00867   1.530   0.1272    
## StudentYes                 395.67744   30.72302  12.879   <2e-16 ***
## MarriedNo                   20.46469   12.39735   1.651   0.0999 .  
## EthnicityAfrican American  -11.95386   14.82444  -0.806   0.4207    
## EthnicityAsian              -5.42117   14.68523  -0.369   0.7123    
## Income:StudentYes            0.50102    0.50028   1.001   0.3174    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.1 on 289 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.9502 
## F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16
[Comment 11: Interpret the coefficient estimates for Married and for the Asian level of Ethnicity again]
  • The coefficient estimates of the dummy variables for Married and Ethnicity have changed.
  • The balance of the unmarried is higher than that of the married by 20.46469 on average, holding all other predictors fixed. Only the sign of the estimate has changed; its p-value has not.
  • The balance of Asians is lower than that of Caucasians (the new baseline) by 5.42117 on average, holding all other predictors fixed. The p-value for the Asian level of Ethnicity differs from that in the previous model.
  • In general, the p-value for categorical predictors with three or more levels may depend on the choice of baseline level.
[Comment 12: Is releveling essential?]
  • Backward feature selection drops the most statistically insignificant feature one at a time. As the previous comment shows, the choice of baseline level for a categorical predictor can change the p-values, and therefore which features get dropped. Releveling is thus essential for feature selection.

TASK 6: Select features using stepwise selection

Binarize factor variables manually before doing stepwise selection.

# CHUNK 15
#library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = TRUE)

binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
Credit.bin <- cbind(Credit, binarized_vars)

Credit.bin$Gender <- NULL
Credit.bin$Student <- NULL
Credit.bin$Married <- NULL
Credit.bin$Ethnicity <- NULL

data.train.bin <- Credit.bin[partition, ]
data.test.bin <- Credit.bin[-partition, ]

head(data.train.bin)
##    Income Rating Cards Age Education Balance Gender..Male Student.Yes
## 1  14.891    283     2  34        11     333            1           0
## 2 106.025    483     3  82        15     903            0           1
## 3 104.593    514     4  71        11     580            1           0
## 6  80.180    569     4  77        10    1151            1           0
## 8  71.408    512     2  87         9     872            1           0
## 9  15.125    266     5  66        13     279            0           0
##   Married.No Ethnicity.African.American Ethnicity.Asian
## 1          0                          0               0
## 2          0                          0               1
## 3          1                          0               1
## 6          1                          0               0
## 8          1                          0               1
## 9          1                          0               0

Why do we need binarization?

Although the lm() function automatically converts categorical predictors into dummy variables, many feature selection functions in R treat a factor variable as a single feature, either retaining the variable with all of its levels or removing it completely. With binarization, we can add or drop individual factor levels during feature selection.

Fit a multiple linear regression model with the binarization.

# CHUNK 16
# The interaction term is now Income:Student.Yes, not Income:Student
model.full.bin <- lm(Balance ~ . + Income:Student.Yes, data = data.train.bin)
summary(model.full.bin)
## 
## Call:
## lm(formula = Balance ~ . + Income:Student.Yes, data = data.train.bin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.17  -76.46  -15.65   66.69  296.17 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -579.22466   40.08600 -14.450   <2e-16 ***
## Income                       -7.93864    0.29465 -26.943   <2e-16 ***
## Rating                        4.01246    0.06416  62.540   <2e-16 ***
## Cards                         2.89772    4.69168   0.618   0.5373    
## Age                          -0.78778    0.35994  -2.189   0.0294 *  
## Education                     0.88922    1.92035   0.463   0.6437    
## Gender..Male                 18.37030   12.00867   1.530   0.1272    
## Student.Yes                 395.67744   30.72302  12.879   <2e-16 ***
## Married.No                   20.46469   12.39735   1.651   0.0999 .  
## Ethnicity.African.American  -11.95386   14.82444  -0.806   0.4207    
## Ethnicity.Asian              -5.42117   14.68523  -0.369   0.7123    
## Income:Student.Yes            0.50102    0.50028   1.001   0.3174    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.1 on 289 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.9502 
## F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16

AIC vs. BIC, backward vs. forward: which ones to use?

AIC vs. BIC: Recall that the AIC and BIC differ only in the size of the penalty applied to the number of parameters. For the AIC, the penalty per parameter is 2; for the BIC, it is the natural logarithm of the size of the training set (ln(n_tr) = ln(301) = 5.7071), which is greater. Thus the BIC tends to favor a model with fewer features and represents a more conservative approach to feature selection. Since our goal in this case study is to identify key factors that relate to the target variable Balance, it makes sense to take a conservative approach and work with as few features as necessary in the final model, using the BIC as the selection criterion. (The two penalties are compared directly below.)
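
The penalty comparison is a one-liner in R:

# Penalty per parameter: 2 for the AIC versus ln(n_tr) for the BIC
log(301)  # 5.7071 > 2, so the BIC demands more from each extra feature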

Backward vs. forward selection: The choice between forward and backward selection can be justified on similar grounds. With the selection criterion held fixed, forward selection is more likely to produce a final model with fewer features (because it starts from the null model, which has no features) and so better aligns with our goal of identifying the key factors affecting Balance.

Do feature selection using backward selection with AIC and get the model summary.

# CHUNK 17
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
model.backward.AIC <- stepAIC(model.full.bin)
## Start:  AIC=2802.25
## Balance ~ Income + Rating + Cards + Age + Education + Gender..Male + 
##     Student.Yes + Married.No + Ethnicity.African.American + Ethnicity.Asian + 
##     Income:Student.Yes
## 
##                              Df Sum of Sq      RSS    AIC
## - Ethnicity.Asian             1      1448  3071419 2800.4
## - Education                   1      2278  3072249 2800.5
## - Cards                       1      4052  3074024 2800.6
## - Ethnicity.African.American  1      6907  3076879 2800.9
## - Income:Student.Yes          1     10654  3080626 2801.3
## <none>                                     3069972 2802.3
## - Gender..Male                1     24859  3094830 2802.7
## - Married.No                  1     28946  3098918 2803.1
## - Age                         1     50885  3120857 2805.2
## - Rating                      1  41548705 44618677 3605.9
## 
## Step:  AIC=2800.39
## Balance ~ Income + Rating + Cards + Age + Education + Gender..Male + 
##     Student.Yes + Married.No + Ethnicity.African.American + Income:Student.Yes
## 
##                              Df Sum of Sq      RSS    AIC
## - Education                   1      2255  3073675 2798.6
## - Cards                       1      3911  3075331 2798.8
## - Ethnicity.African.American  1      5595  3077014 2798.9
## - Income:Student.Yes          1     10176  3081595 2799.4
## <none>                                     3071419 2800.4
## - Gender..Male                1     25165  3096584 2800.8
## - Married.No                  1     29351  3100771 2801.3
## - Age                         1     51462  3122881 2803.4
## - Rating                      1  41550276 44621695 3603.9
## 
## Step:  AIC=2798.61
## Balance ~ Income + Rating + Cards + Age + Gender..Male + Student.Yes + 
##     Married.No + Ethnicity.African.American + Income:Student.Yes
## 
##                              Df Sum of Sq      RSS    AIC
## - Cards                       1      3865  3077540 2797.0
## - Ethnicity.African.American  1      5586  3079261 2797.2
## - Income:Student.Yes          1     10214  3083888 2797.6
## <none>                                     3073675 2798.6
## - Gender..Male                1     25155  3098830 2799.1
## - Married.No                  1     28499  3102174 2799.4
## - Age                         1     51652  3125327 2801.6
## - Rating                      1  41550265 44623939 3601.9
## 
## Step:  AIC=2796.99
## Balance ~ Income + Rating + Age + Gender..Male + Student.Yes + 
##     Married.No + Ethnicity.African.American + Income:Student.Yes
## 
##                              Df Sum of Sq      RSS    AIC
## - Ethnicity.African.American  1      5579  3083119 2795.5
## - Income:Student.Yes          1     10767  3088307 2796.0
## <none>                                     3077540 2797.0
## - Gender..Male                1     23659  3101198 2797.3
## - Married.No                  1     27917  3105457 2797.7
## - Age                         1     49660  3127200 2799.8
## - Rating                      1  42161655 45239195 3604.0
## 
## Step:  AIC=2795.54
## Balance ~ Income + Rating + Age + Gender..Male + Student.Yes + 
##     Married.No + Income:Student.Yes
## 
##                      Df Sum of Sq      RSS    AIC
## - Income:Student.Yes  1      9873  3092992 2794.5
## <none>                             3083119 2795.5
## - Gender..Male        1     23435  3106554 2795.8
## - Married.No          1     24716  3107835 2795.9
## - Age                 1     50672  3133791 2798.4
## - Rating              1  42165674 45248793 3602.1
## 
## Step:  AIC=2794.5
## Balance ~ Income + Rating + Age + Gender..Male + Student.Yes + 
##     Married.No
## 
##                Df Sum of Sq      RSS    AIC
## <none>                       3092992 2794.5
## - Gender..Male  1     22314  3115306 2794.7
## - Married.No    1     25027  3118019 2794.9
## - Age           1     49133  3142125 2797.2
## - Student.Yes   1   4931600  8024592 3079.5
## - Income        1   8357794 11450786 3186.5
## - Rating        1  42247226 45340218 3600.7
summary(model.backward.AIC)
## 
## Call:
## lm(formula = Balance ~ Income + Rating + Age + Gender..Male + 
##     Student.Yes + Married.No, data = data.train.bin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -195.16  -76.45  -14.03   65.33  285.30 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -565.22267   26.73779 -21.139   <2e-16 ***
## Income         -7.87666    0.27946 -28.186   <2e-16 ***
## Rating          4.01180    0.06331  63.370   <2e-16 ***
## Age            -0.77016    0.35638  -2.161   0.0315 *  
## Gender..Male   17.32103   11.89326   1.456   0.1464    
## Student.Yes   418.87031   19.34645  21.651   <2e-16 ***
## Married.No     18.73773   12.14860   1.542   0.1241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.6 on 294 degrees of freedom
## Multiple R-squared:  0.9517, Adjusted R-squared:  0.9507 
## F-statistic: 964.8 on 6 and 294 DF,  p-value: < 2.2e-16
[Comment 13: What is the backward selection result]
  • In backward selection, the features are dropped in the following order: Ethnicity.Asian, Education, Cards, Ethnicity.African.American, and Income:Student.Yes. Then the process stops.
[Comment 14: Interpret summary quantities]
  • The R^2 value has decreased slightly, from 0.952 in the full model to 0.9517.
  • The features Gender..Male and Married.No still have high p-values but were not dropped by backward selection with the AIC.
  • This suggests a more parsimonious model may be attainable.

Do feature selection using forward selection with BIC and get the model summary.

# CHUNK 18
# first fit the null model (i.e., model with no predictors)
model.null <- lm(Balance ~ 1, data = data.train.bin)
model.forward.BIC <- stepAIC(model.null, direction = "forward", scope = list(upper = model.full.bin, lower = model.null), k = log(nrow(data.train.bin)))
## Start:  AIC=3698.12
## Balance ~ 1
## 
##                              Df Sum of Sq      RSS    AIC
## + Rating                      1  47082983 16908423 3303.2
## + Income                      1  12910007 51081399 3636.0
## + Student.Yes                 1   5425335 58566071 3677.2
## <none>                                    63991406 3698.1
## + Cards                       1    163291 63828115 3703.1
## + Gender..Male                1    127201 63864205 3703.2
## + Age                         1     79880 63911526 3703.5
## + Education                   1     13121 63978284 3703.8
## + Married.No                  1     10920 63980486 3703.8
## + Ethnicity.Asian             1      3049 63988357 3703.8
## + Ethnicity.African.American  1         0 63991406 3703.8
## 
## Step:  AIC=3303.21
## Balance ~ Rating
## 
##                              Df Sum of Sq      RSS    AIC
## + Income                      1   8650956  8257467 3093.2
## + Student.Yes                 1   4817765 12090659 3208.0
## + Age                         1    707560 16200863 3296.1
## <none>                                    16908423 3303.2
## + Married.No                  1    174547 16733876 3305.8
## + Education                   1     71402 16837022 3307.6
## + Ethnicity.Asian             1     49620 16858804 3308.0
## + Gender..Male                1     31887 16876536 3308.4
## + Cards                       1     30049 16878374 3308.4
## + Ethnicity.African.American  1      4304 16904120 3308.8
## 
## Step:  AIC=3093.2
## Balance ~ Rating + Income
## 
##                              Df Sum of Sq     RSS    AIC
## + Student.Yes                 1   5069114 3188353 2812.5
## <none>                                    8257467 3093.2
## + Married.No                  1    127904 8129563 3094.2
## + Age                         1     95208 8162259 3095.4
## + Education                   1     44496 8212971 3097.3
## + Ethnicity.Asian             1     40115 8217352 3097.4
## + Cards                       1     22019 8235448 3098.1
## + Ethnicity.African.American  1      9464 8248003 3098.6
## + Gender..Male                1       244 8257223 3098.9
## 
## Step:  AIC=2812.47
## Balance ~ Rating + Income + Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## <none>                                    3188353 2812.5
## + Age                         1     46257 3142096 2813.8
## + Gender..Male                1     24128 3164225 2815.9
## + Married.No                  1     23759 3164595 2815.9
## + Income:Student.Yes          1      7569 3180784 2817.5
## + Ethnicity.African.American  1      2346 3186007 2817.9
## + Education                   1      1600 3186753 2818.0
## + Cards                       1       865 3187488 2818.1
## + Ethnicity.Asian             1       747 3187606 2818.1
summary(model.forward.BIC)
## 
## Call:
## lm(formula = Balance ~ Rating + Income + Student.Yes, data = data.train.bin)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -220.292  -78.120   -7.473   63.273  299.895 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -589.16636   16.11012  -36.57   <2e-16 ***
## Rating         4.01634    0.06353   63.22   <2e-16 ***
## Income        -7.97413    0.27691  -28.80   <2e-16 ***
## Student.Yes  421.17339   19.38206   21.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.6 on 297 degrees of freedom
## Multiple R-squared:  0.9502, Adjusted R-squared:  0.9497 
## F-statistic:  1888 on 3 and 297 DF,  p-value: < 2.2e-16
[Comment 15: What is the forward selection result]
  • Features are added in the following order: Rating, Income, Student.Yes. Then the process terminates.
  • The resulting model from forward selection only has three features, whereas the backward selection model has six features.
[Comment 16: Interpret summary result]
  • This model has an R^2 value of 0.9502, only a small decrease from the other models; every model so far has an R^2 of about 0.95.
  • The p-values of all retained features are low, indicating that every variable in this model is statistically significant. A test-set comparison of the two selected models is sketched below.
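
Though not requested by the task, a natural sanity check is to compare the backward-AIC and forward-BIC models on the held-out test set (a sketch, reusing data.test.bin from CHUNK 15):

# Test RMSE of the two selected models
pred.backward <- predict(model.backward.AIC, newdata = data.test.bin)
pred.forward <- predict(model.forward.BIC, newdata = data.test.bin)
sqrt(mean((data.test.bin$Balance - pred.backward)^2))
sqrt(mean((data.test.bin$Balance - pred.forward)^2))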

TASK 8: Regularization using Elastic net

Use the elastic net to fit a regularized regression model with alpha equal to 0.5 and lambda equal to 0, 10, 100, 500, and 1000.

Be sure to use the same variables (after releveling) as the full model in Task 5.

# CHUNK 21
#install.packages("glmnet")  # uncomment this line the first time you use glmnet
#library(glmnet)
#library(Matrix)

# Change data format  for glmnet(). 


# Train the model using glmnet()




# Extract coefficient estimates
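
# --- A sketch for this chunk (not the official solution), assuming the
# --- glmnet package and the releveled data.train from CHUNK 14
library(glmnet)

# glmnet() needs a numeric design matrix and a separate target vector;
# model.matrix() dummy-codes the factors (drop the intercept column,
# since glmnet() fits its own intercept)
X.train <- model.matrix(Balance ~ . + Income:Student, data = data.train)[, -1]
y.train <- data.train$Balance

# Elastic net (alpha = 0.5) at the five prescribed lambda values,
# supplied in decreasing order
m.lambda <- glmnet(x = X.train, y = y.train, family = "gaussian",
                   alpha = 0.5, lambda = c(1000, 500, 100, 10, 0))

# One column of coefficient estimates per lambda (s0 = 1000, ..., s4 = 0)
round(as.matrix(coef(m.lambda)), digits = 4)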

The values of \(\lambda\) are arranged in descending order across the five columns from left to right, with s0, s1, s2, s3, and s4 representing \(\lambda = 1000, 500, 100, 10, 0\), respectively.

[Comment 21: Interpret the feature selection results for each lambda]

  • \(\lambda=0\):
  • \(\lambda=10\):
  • \(\lambda=100\):
  • \(\lambda=500\):
  • \(\lambda=1,000\):

Use cross-validation to select the value of lambda for fitting the regularized model (alpha = 0.5)

# CHUNK 22
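# A sketch, assuming X.train and y.train from the CHUNK 21 sketch;
# the seed is arbitrary
set.seed(1111)
m.cv <- cv.glmnet(x = X.train, y = y.train, family = "gaussian", alpha = 0.5)
m.cv$lambda.min  # the lambda value minimizing the cross-validation error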

Fit a regularized regression model with alpha equal to 0.5 and lambda equal to the value obtained from cross-validation.

# CHUNK 24
# Fit the regularized model using lambda.min

# List the coefficient estimates

# Change test data format for glmnet()

# Calculate test RMSE
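
# A sketch, assuming m.cv, X.train, and y.train from the sketches above
m.min <- glmnet(x = X.train, y = y.train, family = "gaussian",
                alpha = 0.5, lambda = m.cv$lambda.min)

# Coefficient estimates at the selected lambda
coef(m.min)

# Build the test design matrix exactly as the training one was built
X.test <- model.matrix(Balance ~ . + Income:Student, data = data.test)[, -1]

# Test RMSE
pred.min <- predict(m.min, newx = X.test)
sqrt(mean((data.test$Balance - pred.min)^2))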

[Comment 22: Interpret the regularization result and recommend which model should be used]