R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Purpose

Consumer spending is a major component of the economy in the United States. Consumer spending generates significant revenue for credit granting entities through interest in carrying balances month to month, and from monthly and annual fees. In this assignment the goal is to build an effective model that predicts customer credit card spending, but, more specifically, to predict end of month balance (Balance) on credit accounts.

#q1.    Read the data set into R a data frame.  Prepare the data for analysis.
setwd("~/Desktop/BANA 288 Predictive Analytics/HW2 Linear Regression Modeling")
dat.cb <- read.csv("hw2_credit_balance.csv")
names(dat.cb)
##  [1] "Obs_Number" "Income"     "Limit"      "Rating"     "Num_Cards" 
##  [6] "Age"        "Yrs_Ed"     "Gender"     "Student"    "Married"   
## [11] "Ethnicity"  "Balance"
dat.cb1 <- dat.cb[,2:12]
names(dat.cb1)
##  [1] "Income"    "Limit"     "Rating"    "Num_Cards" "Age"       "Yrs_Ed"   
##  [7] "Gender"    "Student"   "Married"   "Ethnicity" "Balance"
str(dat.cb1)
## 'data.frame':    450 obs. of  11 variables:
##  $ Income   : num  20.8 17.3 31.4 44.4 35 ...
##  $ Limit    : int  2672 1335 1705 5278 3327 3211 4865 8117 3388 5107 ...
##  $ Rating   : int  204 138 160 381 253 265 381 589 266 380 ...
##  $ Num_Cards: int  1 2 3 2 3 4 5 4 4 1 ...
##  $ Age      : int  70 65 81 50 54 59 67 30 74 55 ...
##  $ Yrs_Ed   : int  18 13 14 12 14 14 11 14 17 10 ...
##  $ Gender   : Factor w/ 2 levels " Male","Female": 2 1 1 1 2 2 2 1 2 1 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 1 2 2 2 ...
##  $ Ethnicity: Factor w/ 5 levels "2 or more","African American",..: 2 2 4 5 2 3 4 4 2 4 ...
##  $ Balance  : int  0 0 0 531 50 199 836 1407 155 651 ...
## Encoding all the categorical variables and create dummy variables for "Ethnicity"
dat.cb1$Gender <- -as.numeric(dat.cb1$Gender)+2
names(dat.cb1)[7] <- "Male"
dat.cb1$Student <- -as.numeric(dat.cb1$Student)+1
dat.cb1$Married <- as.numeric(dat.cb1$Married)-1

# Ethnicity <- unique(dat.cb1$Ethnicity)
# dat.cb1$Ethnicity = factor(dat.cb1$Ethnicity,
#                         levels = c('2 or more', 'African American', 'Asian', 'Caucasian', 'Unknown'),
#                         labels = c(1, 2, 3, 4, 5))

Ethnicity <- as.data.frame(dummy(dat.cb1$Ethnicity))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
names(Ethnicity)[1] <- "Ethnicity.2.or.more"
names(Ethnicity)[2] <- "Ethnicity.African.American"
names(Ethnicity)[3] <- "Ethnicity.Asian"
names(Ethnicity)[4] <- "Ethnicity.Caucasian"
names(Ethnicity)[5] <- "Ethnicity.Unknown"

dat.cb2 <- as.data.frame(append(dat.cb1[,-10], Ethnicity))
#q2.    Compute the correlation between all independent variables and the response.  List the top 3 (or more) correlated variables.  Explain why these make sense or why they are surprising.
correlation_all <- cor(dat.cb2)
correlation <- cor(dat.cb2)[10,]
sort(correlation, decreasing = TRUE)
##                    Balance                     Rating 
##                1.000000000                0.857339800 
##                      Limit                     Income 
##                0.855190700                0.446919280 
##                  Num_Cards Ethnicity.African.American 
##                0.087810661                0.030224274 
##            Ethnicity.Asian        Ethnicity.Caucasian 
##                0.012950935                0.008986931 
##                        Age                    Married 
##                0.006591191               -0.001521574 
##                       Male          Ethnicity.Unknown 
##               -0.012684800               -0.036727061 
##                     Yrs_Ed        Ethnicity.2.or.more 
##               -0.050196896               -0.055304772 
##                    Student 
##               -0.234686398
head(sort(correlation_all, decreasing = TRUE), n=25) ## The top 15 will be the correlation between the variable themselves which equal to 1. And in the matrix, there will be 2 duplicated values. So we chose 25 to see the top 5 correlation results
##  [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##  [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [15] 1.0000000 0.9965583 0.9965583 0.8573398 0.8573398 0.8551907 0.8551907
## [22] 0.7833861 0.7833861 0.7825671 0.7825671
corrplot(correlation_all)

The top 3 variables that correlated with “Balance” is “Rating”, “Limit” and “Income”, with correlation coefficients equal to 0.857339800, 0.855190700 and 0.446919280 respectively.

Top 3 Correlation of all variables: 1. Limit & Rating: cor = 0.9965583 2. Balance & Rating: cor = 0.8573398 3. Balance & Limit: cor = 0.8551907

Balance is the total amount spent at the end of the month. If the customer has higher rating, which means they have better credit history and higher credit score, they will be allowed to have higher account limit and tend to spend more with that limit(Balance). And it will be related with the customer income. Higer income means the customer can affort more expensive items or spend more monthly.

#q3.    Using the entire set of independent variables created in question 1 above, construct a linear regression model predicting Balance that includes all variables (reg.all).  Interpret the coefficient of determination.  Which variables are significant at the 5% level?
reg.all <- lm(Balance ~., data = dat.cb2)
summary(reg.all)
## 
## Call:
## lm(formula = Balance ~ ., data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -322.51  -83.26  -13.54   56.97  645.92 
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -578.65708   39.40795 -14.684  < 2e-16 ***
## Income                       -7.82159    0.24088 -32.471  < 2e-16 ***
## Limit                         0.17010    0.03215   5.291 1.93e-07 ***
## Rating                        1.43568    0.47969   2.993 0.002920 ** 
## Num_Cards                    14.86092    4.36160   3.407 0.000717 ***
## Age                          -0.41998    0.31126  -1.349 0.177939    
## Yrs_Ed                       -1.01533    1.68618  -0.602 0.547388    
## Male                         21.51224   10.39382   2.070 0.039067 *  
## Student                    -419.71899   17.39948 -24.122  < 2e-16 ***
## Married                      -7.87975   10.88566  -0.724 0.469537    
## Ethnicity.2.or.more          59.34240   29.50770   2.011 0.044932 *  
## Ethnicity.African.American   74.42929   23.17498   3.212 0.001418 ** 
## Ethnicity.Asian              90.20539   23.03071   3.917 0.000104 ***
## Ethnicity.Caucasian          83.35194   21.78507   3.826 0.000149 ***
## Ethnicity.Unknown                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.7 on 436 degrees of freedom
## Multiple R-squared:  0.9443, Adjusted R-squared:  0.9426 
## F-statistic: 568.4 on 13 and 436 DF,  p-value: < 2.2e-16

The linear regression function will be: Balance = - 578.65708 - 7.82159 * Income + 0.17010 * Limit + 1.43568 * Rating + 14.86092 * Num_Cards - 0.41998 * Age - 1.01533 * Yrs_Ed - 21.51224 * Male - 419.71899 * Student - 7.87975 * Married + 59.34240 * Ethnicity.2 or more + 74.42929 * Ethnicity.African American + 90.20539 * Ethnicity.Asian + 83.35194 * Ethnicity.Caucasian

All the variables except “Age”, “Yrs_Ed”, “Married” and “Ethnicity_Unknown” are significant at the 5% level. This linear model has p-value <- 0.05 with 0.9443 r-square, which is highly significant and have a very good performance. The variable “Student” tends to have the largest negative impact on the dependent variable balance. If the customer is a student, the balance will be -419.72 lower than non-student customers, holding other variables constant. It makes sense because students will have lower consuming capacity than the professionals. Moreover, the ethnicities also have greater impact on balance. Asian and Caucasian have higher balance than African American and people with 2 or more ethnicities.

#q4.    Run a linear regression model predicting Balance using only the Income, Limit and Gender variables.  Interpret all three slope coefficients in the context of the problem.  Which of these variables are significant at the 1% level?
reg.1 <- lm(Balance ~ Income + Limit + Male, data = dat.cb2)
summary(reg.1)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Male, data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -389.47 -118.14  -43.23   63.99  610.64 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.879e+02  2.068e+01 -18.757   <2e-16 ***
## Income      -7.500e+00  3.700e-01 -20.270   <2e-16 ***
## Limit        2.613e-01  5.695e-03  45.885   <2e-16 ***
## Male         7.675e+00  1.623e+01   0.473    0.636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 171.8 on 446 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.8592 
## F-statistic: 914.6 on 3 and 446 DF,  p-value: < 2.2e-16

The linear regression function will be: Balance = - 0.03879 - 7.5 * Income + 0.2613 * Limit + 7.675 * Male

Income and Limit are are significant at the 1% level. Income has a negative impact on Balance, with every \(1000\) customer annual income increase, the balance will be decreasing \(7.5\), holding other variables constant. Limit has a positive impact on Balance, with every \(1\) customer credit account limit increase, the balance will be decreasing \(0.26\), holding other variables constant. Male tend to spend \(7.675\) more than female though it is not significant, holding other variables constant.

#q5.    Run best subsets on the data (exhaustive search).  Use the results from best subsets to pick a model.  Provide an argument in support of the model chosen.
regfit.full <- regsubsets(Balance~., dat.cb2, nvmax = 15, method = "exhaustive")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 1 linear dependencies found
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : nvmax reduced to 13
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Balance ~ ., dat.cb2, nvmax = 15, method = "exhaustive")
## 14 Variables  (and intercept)
##                            Forced in Forced out
## Income                         FALSE      FALSE
## Limit                          FALSE      FALSE
## Rating                         FALSE      FALSE
## Num_Cards                      FALSE      FALSE
## Age                            FALSE      FALSE
## Yrs_Ed                         FALSE      FALSE
## Male                           FALSE      FALSE
## Student                        FALSE      FALSE
## Married                        FALSE      FALSE
## Ethnicity.2.or.more            FALSE      FALSE
## Ethnicity.African.American     FALSE      FALSE
## Ethnicity.Asian                FALSE      FALSE
## Ethnicity.Caucasian            FALSE      FALSE
## Ethnicity.Unknown              FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           Income Limit Rating Num_Cards Age Yrs_Ed Male Student Married
## 1  ( 1 )  " "    " "   "*"    " "       " " " "    " "  " "     " "    
## 2  ( 1 )  "*"    " "   "*"    " "       " " " "    " "  " "     " "    
## 3  ( 1 )  "*"    " "   "*"    " "       " " " "    " "  "*"     " "    
## 4  ( 1 )  "*"    "*"   "*"    " "       " " " "    " "  "*"     " "    
## 5  ( 1 )  "*"    "*"   " "    "*"       " " " "    " "  "*"     " "    
## 6  ( 1 )  "*"    "*"   "*"    "*"       " " " "    " "  "*"     " "    
## 7  ( 1 )  "*"    "*"   "*"    "*"       " " " "    "*"  "*"     " "    
## 8  ( 1 )  "*"    "*"   "*"    "*"       "*" " "    "*"  "*"     " "    
## 9  ( 1 )  "*"    "*"   "*"    "*"       "*" " "    "*"  "*"     " "    
## 10  ( 1 ) "*"    "*"   "*"    "*"       "*" " "    "*"  "*"     " "    
## 11  ( 1 ) "*"    "*"   "*"    "*"       "*" " "    "*"  "*"     "*"    
## 12  ( 1 ) "*"    "*"   "*"    "*"       "*" "*"    "*"  "*"     "*"    
## 13  ( 1 ) "*"    "*"   "*"    "*"       "*" "*"    "*"  "*"     "*"    
##           Ethnicity.2.or.more Ethnicity.African.American Ethnicity.Asian
## 1  ( 1 )  " "                 " "                        " "            
## 2  ( 1 )  " "                 " "                        " "            
## 3  ( 1 )  " "                 " "                        " "            
## 4  ( 1 )  " "                 " "                        " "            
## 5  ( 1 )  " "                 " "                        " "            
## 6  ( 1 )  " "                 " "                        " "            
## 7  ( 1 )  " "                 " "                        " "            
## 8  ( 1 )  " "                 " "                        " "            
## 9  ( 1 )  "*"                 " "                        " "            
## 10  ( 1 ) "*"                 "*"                        " "            
## 11  ( 1 ) "*"                 "*"                        " "            
## 12  ( 1 ) "*"                 "*"                        " "            
## 13  ( 1 ) "*"                 "*"                        "*"            
##           Ethnicity.Caucasian Ethnicity.Unknown
## 1  ( 1 )  " "                 " "              
## 2  ( 1 )  " "                 " "              
## 3  ( 1 )  " "                 " "              
## 4  ( 1 )  " "                 " "              
## 5  ( 1 )  " "                 "*"              
## 6  ( 1 )  " "                 "*"              
## 7  ( 1 )  " "                 "*"              
## 8  ( 1 )  " "                 "*"              
## 9  ( 1 )  " "                 "*"              
## 10  ( 1 ) " "                 "*"              
## 11  ( 1 ) " "                 "*"              
## 12  ( 1 ) " "                 "*"              
## 13  ( 1 ) "*"                 " "
summary(regfit.full)$which
##    (Intercept) Income Limit Rating Num_Cards   Age Yrs_Ed  Male Student Married
## 1         TRUE  FALSE FALSE   TRUE     FALSE FALSE  FALSE FALSE   FALSE   FALSE
## 2         TRUE   TRUE FALSE   TRUE     FALSE FALSE  FALSE FALSE   FALSE   FALSE
## 3         TRUE   TRUE FALSE   TRUE     FALSE FALSE  FALSE FALSE    TRUE   FALSE
## 4         TRUE   TRUE  TRUE   TRUE     FALSE FALSE  FALSE FALSE    TRUE   FALSE
## 5         TRUE   TRUE  TRUE  FALSE      TRUE FALSE  FALSE FALSE    TRUE   FALSE
## 6         TRUE   TRUE  TRUE   TRUE      TRUE FALSE  FALSE FALSE    TRUE   FALSE
## 7         TRUE   TRUE  TRUE   TRUE      TRUE FALSE  FALSE  TRUE    TRUE   FALSE
## 8         TRUE   TRUE  TRUE   TRUE      TRUE  TRUE  FALSE  TRUE    TRUE   FALSE
## 9         TRUE   TRUE  TRUE   TRUE      TRUE  TRUE  FALSE  TRUE    TRUE   FALSE
## 10        TRUE   TRUE  TRUE   TRUE      TRUE  TRUE  FALSE  TRUE    TRUE   FALSE
## 11        TRUE   TRUE  TRUE   TRUE      TRUE  TRUE  FALSE  TRUE    TRUE    TRUE
## 12        TRUE   TRUE  TRUE   TRUE      TRUE  TRUE   TRUE  TRUE    TRUE    TRUE
## 13        TRUE   TRUE  TRUE   TRUE      TRUE  TRUE   TRUE  TRUE    TRUE    TRUE
##    Ethnicity.2.or.more Ethnicity.African.American Ethnicity.Asian
## 1                FALSE                      FALSE           FALSE
## 2                FALSE                      FALSE           FALSE
## 3                FALSE                      FALSE           FALSE
## 4                FALSE                      FALSE           FALSE
## 5                FALSE                      FALSE           FALSE
## 6                FALSE                      FALSE           FALSE
## 7                FALSE                      FALSE           FALSE
## 8                FALSE                      FALSE           FALSE
## 9                 TRUE                      FALSE           FALSE
## 10                TRUE                       TRUE           FALSE
## 11                TRUE                       TRUE           FALSE
## 12                TRUE                       TRUE           FALSE
## 13                TRUE                       TRUE            TRUE
##    Ethnicity.Caucasian Ethnicity.Unknown
## 1                FALSE             FALSE
## 2                FALSE             FALSE
## 3                FALSE             FALSE
## 4                FALSE             FALSE
## 5                FALSE              TRUE
## 6                FALSE              TRUE
## 7                FALSE              TRUE
## 8                FALSE              TRUE
## 9                FALSE              TRUE
## 10               FALSE              TRUE
## 11               FALSE              TRUE
## 12               FALSE              TRUE
## 13                TRUE             FALSE
par(mfrow=c(2,2))
plot(summary(regfit.full)$rsq, xlab = "Number of Variables", 
     ylab ="R-squared")
plot(summary(regfit.full)$adjr2, xlab = "Number of Variables", 
     ylab ="Adj R-squared")
max.adjr2 <- which.max(summary(regfit.full)$adjr2)
points(9,summary(regfit.full)$adjr2[8], col = "red", cex = 2, pch = 20)

plot(summary(regfit.full)$cp, xlab = "Number of Variables", 
     ylab ="Mallows Cp")
min.cp <- which.min(summary(regfit.full)$cp)
points(7,summary(regfit.full)$cp[10], col = "blue", cex = 2, pch = 20)

plot(summary(regfit.full)$bic, xlab = "Number of Variables", 
     ylab ="Bayesian Info Crit")
min.bic <- which.min(summary(regfit.full)$bic)
points(6,summary(regfit.full)$bic[6], col = "green", cex = 2, pch = 20)

Adjusted R-squared, Mallows Cp and BIC all point to models with different numbers of variables (9, 7 and 6 respectively)

reg.max.adjr2 <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Age + Male + Student + Ethnicity.2.or.more + Ethnicity.Unknown, data = dat.cb2)
summary(reg.max.adjr2)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards + 
##     Age + Male + Student + Ethnicity.2.or.more + Ethnicity.Unknown, 
##     data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315.03  -79.79  -12.17   60.16  643.06 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -513.87420   26.01470 -19.753  < 2e-16 ***
## Income                -7.82753    0.23996 -32.621  < 2e-16 ***
## Limit                  0.17179    0.03168   5.422 9.75e-08 ***
## Rating                 1.41076    0.47249   2.986 0.002986 ** 
## Num_Cards             15.10378    4.33832   3.481 0.000548 ***
## Age                   -0.42029    0.30878  -1.361 0.174167    
## Male                  21.29683   10.36289   2.055 0.040459 *  
## Student             -420.25491   17.19236 -24.444  < 2e-16 ***
## Ethnicity.2.or.more  -23.87047   21.91132  -1.089 0.276568    
## Ethnicity.Unknown    -81.99667   20.90062  -3.923 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.4 on 440 degrees of freedom
## Multiple R-squared:  0.944,  Adjusted R-squared:  0.9429 
## F-statistic: 824.8 on 9 and 440 DF,  p-value: < 2.2e-16
#coef(regfit.full, min.cp)
reg.min.cp <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Male + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.cp)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards + 
##     Male + Student + Ethnicity.Unknown, data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -321.08  -79.52  -10.44   59.25  626.80 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -536.10498   20.74028 -25.848  < 2e-16 ***
## Income              -7.88078    0.23705 -33.246  < 2e-16 ***
## Limit                0.17498    0.03163   5.531 5.44e-08 ***
## Rating               1.37037    0.47193   2.904 0.003872 ** 
## Num_Cards           14.71800    4.33284   3.397 0.000743 ***
## Male                21.45497   10.36893   2.069 0.039112 *  
## Student           -421.46030   17.18070 -24.531  < 2e-16 ***
## Ethnicity.Unknown  -80.23981   20.88065  -3.843 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.6 on 442 degrees of freedom
## Multiple R-squared:  0.9437, Adjusted R-squared:  0.9428 
## F-statistic:  1058 on 7 and 442 DF,  p-value: < 2.2e-16
#coef(regfit.full, min.bic)
reg.min.bic <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards + 
##     Student + Ethnicity.Unknown, data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -324.45  -78.85   -9.15   57.60  638.08 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -525.63603   20.18805 -26.037  < 2e-16 ***
## Income              -7.86064    0.23772 -33.066  < 2e-16 ***
## Limit                0.17394    0.03175   5.479 7.19e-08 ***
## Rating               1.38156    0.47364   2.917 0.003715 ** 
## Num_Cards           14.77136    4.34878   3.397 0.000744 ***
## Student           -419.26738   17.21137 -24.360  < 2e-16 ***
## Ethnicity.Unknown  -81.14807   20.95321  -3.873 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 110 on 443 degrees of freedom
## Multiple R-squared:  0.9431, Adjusted R-squared:  0.9424 
## F-statistic:  1225 on 6 and 443 DF,  p-value: < 2.2e-16

I will go with the simple (less flexible) model with 6 variables, which has the lowest BIC. All the variables in this model are highly significant and it has the very decent adjusted r-square: 0.9424 with lowest computation cost.

#q6.    Compare the model of questions 5 to the model of question 3 using a nested F-test.  Interpret the results of the hypothesis test.  What does this tell us regarding the choice of model?
#  Ho:  Models are same
#  Ha:  Model with more variables is better
anova(reg.min.bic, reg.all)
## Analysis of Variance Table
## 
## Model 1: Balance ~ Income + Limit + Rating + Num_Cards + Student + Ethnicity.Unknown
## Model 2: Balance ~ Income + Limit + Rating + Num_Cards + Age + Yrs_Ed + 
##     Male + Student + Married + Ethnicity.2.or.more + Ethnicity.African.American + 
##     Ethnicity.Asian + Ethnicity.Caucasian + Ethnicity.Unknown
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1    443 5355995                           
## 2    436 5248014  7    107982 1.2816 0.2577

P.value = 0.2577 is NOT less than alpha = 0.05, we fail to Reject Ho (same conclusion). We do not have evidence that the Big model is better than the Small model. Thus, the Small model (model with lowest BIC) is just as good as the Big model (model all).

#q7.    Are there some extended regression models that should be considered here?  That is, are there transformations of the variables or extensions of linear regression that could be useful models for predicting Balance?  Provide two separate arguments for possible model extensions.  Note, you do not need to actually fit those models.  Hint:  See James text.  
# Interactive terms
# 1. Limit & Rating: cor = 0.9965583
# 2. Limit & Income: cor = 0.7833861
reg.min.bic.1 <- lm(Balance ~ Income + Limit + Rating + Limit*Rating + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic.1)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Limit * Rating + 
##     Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -337.90  -63.54    5.38   56.02  617.26 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.315e+02  2.626e+01 -12.623  < 2e-16 ***
## Income            -9.005e+00  2.414e-01 -37.303  < 2e-16 ***
## Limit              1.684e-01  2.858e-02   5.892 7.57e-09 ***
## Rating             4.542e-01  4.358e-01   1.042    0.298    
## Num_Cards          1.611e+01  3.916e+00   4.113 4.65e-05 ***
## Student           -4.211e+02  1.549e+01 -27.181  < 2e-16 ***
## Ethnicity.Unknown -7.853e+01  1.886e+01  -4.164 3.77e-05 ***
## Limit:Rating       1.026e-04  1.002e-05  10.238  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 98.97 on 442 degrees of freedom
## Multiple R-squared:  0.954,  Adjusted R-squared:  0.9533 
## F-statistic:  1311 on 7 and 442 DF,  p-value: < 2.2e-16
reg.min.bic.2 <- lm(Balance ~ Income + Limit + Rating + Limit*Income + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic.2)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Limit * Income + 
##     Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -316.88  -76.23   -5.41   62.67  651.26 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.453e+02  2.497e+01 -17.835  < 2e-16 ***
## Income            -9.968e+00  4.665e-01 -21.367  < 2e-16 ***
## Limit              1.744e-01  3.085e-02   5.654 2.81e-08 ***
## Rating             1.205e+00  4.616e-01   2.611 0.009323 ** 
## Num_Cards          1.547e+01  4.228e+00   3.658 0.000285 ***
## Student           -4.178e+02  1.673e+01 -24.976  < 2e-16 ***
## Ethnicity.Unknown -8.072e+01  2.036e+01  -3.964 8.59e-05 ***
## Income:Limit       2.633e-04  5.063e-05   5.200 3.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.9 on 442 degrees of freedom
## Multiple R-squared:  0.9464, Adjusted R-squared:  0.9456 
## F-statistic:  1115 on 7 and 442 DF,  p-value: < 2.2e-16
reg.min.bic.3 <- lm(Balance ~ Income + Limit + Rating + Limit*Rating + Limit*Income + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic.3)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Limit * Rating + 
##     Limit * Income + Num_Cards + Student + Ethnicity.Unknown, 
##     data = dat.cb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -469.35  -48.57    6.11   47.68  536.88 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.779e+02  2.485e+01 -11.185  < 2e-16 ***
## Income            -3.723e+00  6.233e-01  -5.973 4.80e-09 ***
## Limit              1.573e-01  2.630e-02   5.981 4.60e-09 ***
## Rating            -5.120e-01  4.145e-01  -1.235    0.217    
## Num_Cards          1.600e+01  3.600e+00   4.444 1.12e-05 ***
## Student           -4.291e+02  1.427e+01 -30.079  < 2e-16 ***
## Ethnicity.Unknown -7.555e+01  1.734e+01  -4.357 1.64e-05 ***
## Limit:Rating       2.762e-04  2.125e-05  12.999  < 2e-16 ***
## Income:Limit      -9.017e-04  9.944e-05  -9.067  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90.96 on 441 degrees of freedom
## Multiple R-squared:  0.9613, Adjusted R-squared:  0.9606 
## F-statistic:  1368 on 8 and 441 DF,  p-value: < 2.2e-16

The two models above are built on interactive terms, which is superior to the models that contain only the main effects. The p-values for Income * Limit and Rating * Limit are extremely low, indicating that there are strong evidence for Ha: β(Income * Limit) ≠ 0,and β(Rating * Limit) ≠ 0. In other words, it is clear that the true relationships are not additive, and they also result in higer adjust r-squares. (0.9613 - 0.9431)/(1 - 0.9431) = 31.99% of the variability in balance that remains after fitting the additive model has been explained by the interactive Term Income * Limit and Rating * Limit.

#q8.    Randomly select 300 of the 450 rows run the regression model in question 5 on this “training” data set.  Compute the RSS for this training model.  Now use this model to predict the customers’ Balance on the remainder of the data (test data set).  What is the RSS from these predictions?  Comment on the relationship between the two RSS values.
set.seed(654123)
split = sample.split(dat.cb2$Balance, SplitRatio = 300/450)
training_set = subset(dat.cb2, split == TRUE)
test_set = subset(dat.cb2, split == FALSE)

reg.2 <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Student + Ethnicity.Unknown, data = training_set)
summary(reg.2)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards + 
##     Student + Ethnicity.Unknown, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -301.96  -75.33   -6.80   54.84  327.23 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -552.73536   23.80155 -23.223  < 2e-16 ***
## Income              -7.94665    0.26830 -29.619  < 2e-16 ***
## Limit                0.15705    0.03795   4.139 4.57e-05 ***
## Rating               1.66470    0.57218   2.909  0.00390 ** 
## Num_Cards           17.85693    5.40051   3.307  0.00106 ** 
## Student           -421.92072   19.47304 -21.667  < 2e-16 ***
## Ethnicity.Unknown -114.65973   24.36437  -4.706 3.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.5 on 293 degrees of freedom
## Multiple R-squared:  0.9507, Adjusted R-squared:  0.9496 
## F-statistic: 940.7 on 6 and 293 DF,  p-value: < 2.2e-16
anova(reg.2)
## Analysis of Variance Table
## 
## Response: Balance
##                    Df   Sum Sq  Mean Sq   F value    Pr(>F)    
## Income              1 12553970 12553970 1194.0573 < 2.2e-16 ***
## Limit               1 41275742 41275742 3925.8976 < 2.2e-16 ***
## Rating              1   476219   476219   45.2950 8.913e-11 ***
## Num_Cards           1    62063    62063    5.9031   0.01572 *  
## Student             1  4741725  4741725  451.0040 < 2.2e-16 ***
## Ethnicity.Unknown   1   232845   232845   22.1468 3.900e-06 ***
## Residuals         293  3080517    10514                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
yhat.test <- predict(reg.2, test_set)

err.yhat.test <- test_set$Balance - yhat.test
RSS.yhat.test <- sum(err.yhat.test^2)
RMSE.yhat.test <- (RSS.yhat.test/150)^0.5

RSS.yhat.test
## [1] 2339922
RMSE.yhat.test
## [1] 124.8979

RSS for train and test are 3080517 and 2339922, and the RMSE are 102.5 and 124.8979 respectively. The model was trained on the train set, so there will be variations in the effects of over fitting. The RSS decrease in the test set might be that there are fewer observations in our test set. However, after computing RMSE, we can observe that there is a small increase in the test RMSE, suggesting that there are some variance that the trained model cannot captured in the test data.

#q9.    Repeat exercise 8 using the “All-In” model from question 3.  Is there a different result from question 8?  Explain.
set.seed(654123)
split = sample.split(dat.cb2$Balance, SplitRatio = 300/450)
training_set = subset(dat.cb2, split == TRUE)
test_set = subset(dat.cb2, split == FALSE)

reg.q9 <- lm(Balance ~ ., data = training_set)
summary(reg.q9)
## 
## Call:
## lm(formula = Balance ~ ., data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -287.39  -78.30   -7.07   55.49  339.43 
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -678.31467   44.60134 -15.208  < 2e-16 ***
## Income                       -7.89103    0.27637 -28.553  < 2e-16 ***
## Limit                         0.15252    0.03838   3.974 8.96e-05 ***
## Rating                        1.72146    0.57828   2.977  0.00316 ** 
## Num_Cards                    19.10484    5.45870   3.500  0.00054 ***
## Age                          -0.23550    0.36528  -0.645  0.51963    
## Yrs_Ed                        0.32233    1.87980   0.171  0.86398    
## Male                         27.15554   11.85614   2.290  0.02272 *  
## Student                    -423.30021   19.80574 -21.373  < 2e-16 ***
## Married                       3.97966   12.49493   0.319  0.75034    
## Ethnicity.2.or.more          68.94618   37.29803   1.849  0.06556 .  
## Ethnicity.African.American  107.63388   26.84036   4.010 7.75e-05 ***
## Ethnicity.Asian             117.29501   26.84283   4.370 1.74e-05 ***
## Ethnicity.Caucasian         117.62349   25.32576   4.644 5.21e-06 ***
## Ethnicity.Unknown                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.2 on 286 degrees of freedom
## Multiple R-squared:  0.9522, Adjusted R-squared:   0.95 
## F-statistic: 437.9 on 13 and 286 DF,  p-value: < 2.2e-16
anova(reg.q9)
## Analysis of Variance Table
## 
## Response: Balance
##                             Df   Sum Sq  Mean Sq   F value    Pr(>F)    
## Income                       1 12553970 12553970 1202.3221 < 2.2e-16 ***
## Limit                        1 41275742 41275742 3953.0711 < 2.2e-16 ***
## Rating                       1   476219   476219   45.6085 8.055e-11 ***
## Num_Cards                    1    62063    62063    5.9439   0.01538 *  
## Age                          1    22118    22118    2.1183   0.14664    
## Yrs_Ed                       1    55394    55394    5.3052   0.02198 *  
## Male                         1    15408    15408    1.4757   0.22545    
## Student                      1  4721758  4721758  452.2135 < 2.2e-16 ***
## Married                      1      418      418    0.0401   0.84147    
## Ethnicity.2.or.more          1    18184    18184    1.7415   0.18800    
## Ethnicity.African.American   1        0        0    0.0000   0.99911    
## Ethnicity.Asian              1    10325    10325    0.9889   0.32086    
## Ethnicity.Caucasian          1   225229   225229   21.5707 5.206e-06 ***
## Residuals                  286  2986251    10441                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
yhat.test.q9 <- predict(reg.q9, test_set)
## Warning in predict.lm(reg.q9, test_set): prediction from a rank-deficient fit
## may be misleading
err.yhat.test.q9 <- test_set$Balance - yhat.test.q9
RSS.yhat.test.q9 <- sum(err.yhat.test.q9^2)
RMSE.yhat.test.q9 <- (RSS.yhat.test.q9/150)^0.5

RSS.yhat.test.q9
## [1] 2383328
RMSE.yhat.test.q9
## [1] 126.051

RSS for train and test1, test.q9 are 2986251, 2339922 and 2383328, and the RMSE are 102.5, 124.8979 and 126.051 respectively. The “All-In” model is too complicated and results more errors. The reduntant variables are not contributing to the linear regression explaination of our data, so the RSS increased.

#q10.   Summarize what the bank’s managers should know about predicting end of month balance for their customers in 3 to 5 sentences.  How confident would you feel about predictions in this case?

# The customers' balance at the end oftthe month is highly correalted with rating, limit and income, and it is largely determined by whether the customer is a student or not. Rating and income, rating and limit have some interactive influence on the prediction of the balance. The bank could also target on Asian and Caucasian customers to provide more incentives to stimulate consumption using this credit card. I am more confident using the reg.2 model than the all-in model based on the calculation. And the model using two interactive terms Limit*Rating and Limit*Incomeare is 96.13% accurate on the predictions of the end of month customer balance.