This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
#q1. Read the data set into R a data frame. Prepare the data for analysis.
setwd("~/Desktop/BANA 288 Predictive Analytics/HW2 Linear Regression Modeling")
dat.cb <- read.csv("hw2_credit_balance.csv")
names(dat.cb)
## [1] "Obs_Number" "Income" "Limit" "Rating" "Num_Cards"
## [6] "Age" "Yrs_Ed" "Gender" "Student" "Married"
## [11] "Ethnicity" "Balance"
dat.cb1 <- dat.cb[,2:12]
names(dat.cb1)
## [1] "Income" "Limit" "Rating" "Num_Cards" "Age" "Yrs_Ed"
## [7] "Gender" "Student" "Married" "Ethnicity" "Balance"
str(dat.cb1)
## 'data.frame': 450 obs. of 11 variables:
## $ Income : num 20.8 17.3 31.4 44.4 35 ...
## $ Limit : int 2672 1335 1705 5278 3327 3211 4865 8117 3388 5107 ...
## $ Rating : int 204 138 160 381 253 265 381 589 266 380 ...
## $ Num_Cards: int 1 2 3 2 3 4 5 4 4 1 ...
## $ Age : int 70 65 81 50 54 59 67 30 74 55 ...
## $ Yrs_Ed : int 18 13 14 12 14 14 11 14 17 10 ...
## $ Gender : Factor w/ 2 levels " Male","Female": 2 1 1 1 2 2 2 1 2 1 ...
## $ Student : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Married : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 1 2 2 2 ...
## $ Ethnicity: Factor w/ 5 levels "2 or more","African American",..: 2 2 4 5 2 3 4 4 2 4 ...
## $ Balance : int 0 0 0 531 50 199 836 1407 155 651 ...
## Encoding all the categorical variables and create dummy variables for "Ethnicity"
dat.cb1$Gender <- -as.numeric(dat.cb1$Gender)+2
names(dat.cb1)[7] <- "Male"
dat.cb1$Student <- -as.numeric(dat.cb1$Student)+1
dat.cb1$Married <- as.numeric(dat.cb1$Married)-1
# Ethnicity <- unique(dat.cb1$Ethnicity)
# dat.cb1$Ethnicity = factor(dat.cb1$Ethnicity,
# levels = c('2 or more', 'African American', 'Asian', 'Caucasian', 'Unknown'),
# labels = c(1, 2, 3, 4, 5))
Ethnicity <- as.data.frame(dummy(dat.cb1$Ethnicity))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
names(Ethnicity)[1] <- "Ethnicity.2.or.more"
names(Ethnicity)[2] <- "Ethnicity.African.American"
names(Ethnicity)[3] <- "Ethnicity.Asian"
names(Ethnicity)[4] <- "Ethnicity.Caucasian"
names(Ethnicity)[5] <- "Ethnicity.Unknown"
dat.cb2 <- as.data.frame(append(dat.cb1[,-10], Ethnicity))
#q2. Compute the correlation between all independent variables and the response. List the top 3 (or more) correlated variables. Explain why these make sense or why they are surprising.
correlation_all <- cor(dat.cb2)
correlation <- cor(dat.cb2)[10,]
sort(correlation, decreasing = TRUE)
## Balance Rating
## 1.000000000 0.857339800
## Limit Income
## 0.855190700 0.446919280
## Num_Cards Ethnicity.African.American
## 0.087810661 0.030224274
## Ethnicity.Asian Ethnicity.Caucasian
## 0.012950935 0.008986931
## Age Married
## 0.006591191 -0.001521574
## Male Ethnicity.Unknown
## -0.012684800 -0.036727061
## Yrs_Ed Ethnicity.2.or.more
## -0.050196896 -0.055304772
## Student
## -0.234686398
head(sort(correlation_all, decreasing = TRUE), n=25) ## The top 15 will be the correlation between the variable themselves which equal to 1. And in the matrix, there will be 2 duplicated values. So we chose 25 to see the top 5 correlation results
## [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [15] 1.0000000 0.9965583 0.9965583 0.8573398 0.8573398 0.8551907 0.8551907
## [22] 0.7833861 0.7833861 0.7825671 0.7825671
corrplot(correlation_all)
The top 3 variables that correlated with “Balance” is “Rating”, “Limit” and “Income”, with correlation coefficients equal to 0.857339800, 0.855190700 and 0.446919280 respectively.
Top 3 Correlation of all variables: 1. Limit & Rating: cor = 0.9965583 2. Balance & Rating: cor = 0.8573398 3. Balance & Limit: cor = 0.8551907
Balance is the total amount spent at the end of the month. If the customer has higher rating, which means they have better credit history and higher credit score, they will be allowed to have higher account limit and tend to spend more with that limit(Balance). And it will be related with the customer income. Higer income means the customer can affort more expensive items or spend more monthly.
#q3. Using the entire set of independent variables created in question 1 above, construct a linear regression model predicting Balance that includes all variables (reg.all). Interpret the coefficient of determination. Which variables are significant at the 5% level?
reg.all <- lm(Balance ~., data = dat.cb2)
summary(reg.all)
##
## Call:
## lm(formula = Balance ~ ., data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -322.51 -83.26 -13.54 56.97 645.92
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -578.65708 39.40795 -14.684 < 2e-16 ***
## Income -7.82159 0.24088 -32.471 < 2e-16 ***
## Limit 0.17010 0.03215 5.291 1.93e-07 ***
## Rating 1.43568 0.47969 2.993 0.002920 **
## Num_Cards 14.86092 4.36160 3.407 0.000717 ***
## Age -0.41998 0.31126 -1.349 0.177939
## Yrs_Ed -1.01533 1.68618 -0.602 0.547388
## Male 21.51224 10.39382 2.070 0.039067 *
## Student -419.71899 17.39948 -24.122 < 2e-16 ***
## Married -7.87975 10.88566 -0.724 0.469537
## Ethnicity.2.or.more 59.34240 29.50770 2.011 0.044932 *
## Ethnicity.African.American 74.42929 23.17498 3.212 0.001418 **
## Ethnicity.Asian 90.20539 23.03071 3.917 0.000104 ***
## Ethnicity.Caucasian 83.35194 21.78507 3.826 0.000149 ***
## Ethnicity.Unknown NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.7 on 436 degrees of freedom
## Multiple R-squared: 0.9443, Adjusted R-squared: 0.9426
## F-statistic: 568.4 on 13 and 436 DF, p-value: < 2.2e-16
The linear regression function will be: Balance = - 578.65708 - 7.82159 * Income + 0.17010 * Limit + 1.43568 * Rating + 14.86092 * Num_Cards - 0.41998 * Age - 1.01533 * Yrs_Ed - 21.51224 * Male - 419.71899 * Student - 7.87975 * Married + 59.34240 * Ethnicity.2 or more + 74.42929 * Ethnicity.African American + 90.20539 * Ethnicity.Asian + 83.35194 * Ethnicity.Caucasian
All the variables except “Age”, “Yrs_Ed”, “Married” and “Ethnicity_Unknown” are significant at the 5% level. This linear model has p-value <- 0.05 with 0.9443 r-square, which is highly significant and have a very good performance. The variable “Student” tends to have the largest negative impact on the dependent variable balance. If the customer is a student, the balance will be -419.72 lower than non-student customers, holding other variables constant. It makes sense because students will have lower consuming capacity than the professionals. Moreover, the ethnicities also have greater impact on balance. Asian and Caucasian have higher balance than African American and people with 2 or more ethnicities.
#q4. Run a linear regression model predicting Balance using only the Income, Limit and Gender variables. Interpret all three slope coefficients in the context of the problem. Which of these variables are significant at the 1% level?
reg.1 <- lm(Balance ~ Income + Limit + Male, data = dat.cb2)
summary(reg.1)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Male, data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389.47 -118.14 -43.23 63.99 610.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.879e+02 2.068e+01 -18.757 <2e-16 ***
## Income -7.500e+00 3.700e-01 -20.270 <2e-16 ***
## Limit 2.613e-01 5.695e-03 45.885 <2e-16 ***
## Male 7.675e+00 1.623e+01 0.473 0.636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 171.8 on 446 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.8592
## F-statistic: 914.6 on 3 and 446 DF, p-value: < 2.2e-16
The linear regression function will be: Balance = - 0.03879 - 7.5 * Income + 0.2613 * Limit + 7.675 * Male
Income and Limit are are significant at the 1% level. Income has a negative impact on Balance, with every \(1000\) customer annual income increase, the balance will be decreasing \(7.5\), holding other variables constant. Limit has a positive impact on Balance, with every \(1\) customer credit account limit increase, the balance will be decreasing \(0.26\), holding other variables constant. Male tend to spend \(7.675\) more than female though it is not significant, holding other variables constant.
#q5. Run best subsets on the data (exhaustive search). Use the results from best subsets to pick a model. Provide an argument in support of the model chosen.
regfit.full <- regsubsets(Balance~., dat.cb2, nvmax = 15, method = "exhaustive")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 1 linear dependencies found
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : nvmax reduced to 13
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Balance ~ ., dat.cb2, nvmax = 15, method = "exhaustive")
## 14 Variables (and intercept)
## Forced in Forced out
## Income FALSE FALSE
## Limit FALSE FALSE
## Rating FALSE FALSE
## Num_Cards FALSE FALSE
## Age FALSE FALSE
## Yrs_Ed FALSE FALSE
## Male FALSE FALSE
## Student FALSE FALSE
## Married FALSE FALSE
## Ethnicity.2.or.more FALSE FALSE
## Ethnicity.African.American FALSE FALSE
## Ethnicity.Asian FALSE FALSE
## Ethnicity.Caucasian FALSE FALSE
## Ethnicity.Unknown FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## Income Limit Rating Num_Cards Age Yrs_Ed Male Student Married
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " "*" " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" "*" " " " " " " " " "*" " "
## 5 ( 1 ) "*" "*" " " "*" " " " " " " "*" " "
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " "*" " "
## 7 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" " "
## 8 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " "
## 9 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " "
## 10 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## Ethnicity.2.or.more Ethnicity.African.American Ethnicity.Asian
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) "*" " " " "
## 10 ( 1 ) "*" "*" " "
## 11 ( 1 ) "*" "*" " "
## 12 ( 1 ) "*" "*" " "
## 13 ( 1 ) "*" "*" "*"
## Ethnicity.Caucasian Ethnicity.Unknown
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) " " "*"
## 11 ( 1 ) " " "*"
## 12 ( 1 ) " " "*"
## 13 ( 1 ) "*" " "
summary(regfit.full)$which
## (Intercept) Income Limit Rating Num_Cards Age Yrs_Ed Male Student Married
## 1 TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## 4 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## 5 TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
## 7 TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## 9 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## 10 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## 11 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 13 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Ethnicity.2.or.more Ethnicity.African.American Ethnicity.Asian
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## 7 FALSE FALSE FALSE
## 8 FALSE FALSE FALSE
## 9 TRUE FALSE FALSE
## 10 TRUE TRUE FALSE
## 11 TRUE TRUE FALSE
## 12 TRUE TRUE FALSE
## 13 TRUE TRUE TRUE
## Ethnicity.Caucasian Ethnicity.Unknown
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE TRUE
## 6 FALSE TRUE
## 7 FALSE TRUE
## 8 FALSE TRUE
## 9 FALSE TRUE
## 10 FALSE TRUE
## 11 FALSE TRUE
## 12 FALSE TRUE
## 13 TRUE FALSE
par(mfrow=c(2,2))
plot(summary(regfit.full)$rsq, xlab = "Number of Variables",
ylab ="R-squared")
plot(summary(regfit.full)$adjr2, xlab = "Number of Variables",
ylab ="Adj R-squared")
max.adjr2 <- which.max(summary(regfit.full)$adjr2)
points(9,summary(regfit.full)$adjr2[8], col = "red", cex = 2, pch = 20)
plot(summary(regfit.full)$cp, xlab = "Number of Variables",
ylab ="Mallows Cp")
min.cp <- which.min(summary(regfit.full)$cp)
points(7,summary(regfit.full)$cp[10], col = "blue", cex = 2, pch = 20)
plot(summary(regfit.full)$bic, xlab = "Number of Variables",
ylab ="Bayesian Info Crit")
min.bic <- which.min(summary(regfit.full)$bic)
points(6,summary(regfit.full)$bic[6], col = "green", cex = 2, pch = 20)
Adjusted R-squared, Mallows Cp and BIC all point to models with different numbers of variables (9, 7 and 6 respectively)
reg.max.adjr2 <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Age + Male + Student + Ethnicity.2.or.more + Ethnicity.Unknown, data = dat.cb2)
summary(reg.max.adjr2)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards +
## Age + Male + Student + Ethnicity.2.or.more + Ethnicity.Unknown,
## data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -315.03 -79.79 -12.17 60.16 643.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -513.87420 26.01470 -19.753 < 2e-16 ***
## Income -7.82753 0.23996 -32.621 < 2e-16 ***
## Limit 0.17179 0.03168 5.422 9.75e-08 ***
## Rating 1.41076 0.47249 2.986 0.002986 **
## Num_Cards 15.10378 4.33832 3.481 0.000548 ***
## Age -0.42029 0.30878 -1.361 0.174167
## Male 21.29683 10.36289 2.055 0.040459 *
## Student -420.25491 17.19236 -24.444 < 2e-16 ***
## Ethnicity.2.or.more -23.87047 21.91132 -1.089 0.276568
## Ethnicity.Unknown -81.99667 20.90062 -3.923 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.4 on 440 degrees of freedom
## Multiple R-squared: 0.944, Adjusted R-squared: 0.9429
## F-statistic: 824.8 on 9 and 440 DF, p-value: < 2.2e-16
#coef(regfit.full, min.cp)
reg.min.cp <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Male + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.cp)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards +
## Male + Student + Ethnicity.Unknown, data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -321.08 -79.52 -10.44 59.25 626.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -536.10498 20.74028 -25.848 < 2e-16 ***
## Income -7.88078 0.23705 -33.246 < 2e-16 ***
## Limit 0.17498 0.03163 5.531 5.44e-08 ***
## Rating 1.37037 0.47193 2.904 0.003872 **
## Num_Cards 14.71800 4.33284 3.397 0.000743 ***
## Male 21.45497 10.36893 2.069 0.039112 *
## Student -421.46030 17.18070 -24.531 < 2e-16 ***
## Ethnicity.Unknown -80.23981 20.88065 -3.843 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.6 on 442 degrees of freedom
## Multiple R-squared: 0.9437, Adjusted R-squared: 0.9428
## F-statistic: 1058 on 7 and 442 DF, p-value: < 2.2e-16
#coef(regfit.full, min.bic)
reg.min.bic <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards +
## Student + Ethnicity.Unknown, data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -324.45 -78.85 -9.15 57.60 638.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -525.63603 20.18805 -26.037 < 2e-16 ***
## Income -7.86064 0.23772 -33.066 < 2e-16 ***
## Limit 0.17394 0.03175 5.479 7.19e-08 ***
## Rating 1.38156 0.47364 2.917 0.003715 **
## Num_Cards 14.77136 4.34878 3.397 0.000744 ***
## Student -419.26738 17.21137 -24.360 < 2e-16 ***
## Ethnicity.Unknown -81.14807 20.95321 -3.873 0.000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 110 on 443 degrees of freedom
## Multiple R-squared: 0.9431, Adjusted R-squared: 0.9424
## F-statistic: 1225 on 6 and 443 DF, p-value: < 2.2e-16
I will go with the simple (less flexible) model with 6 variables, which has the lowest BIC. All the variables in this model are highly significant and it has the very decent adjusted r-square: 0.9424 with lowest computation cost.
#q6. Compare the model of questions 5 to the model of question 3 using a nested F-test. Interpret the results of the hypothesis test. What does this tell us regarding the choice of model?
# Ho: Models are same
# Ha: Model with more variables is better
anova(reg.min.bic, reg.all)
## Analysis of Variance Table
##
## Model 1: Balance ~ Income + Limit + Rating + Num_Cards + Student + Ethnicity.Unknown
## Model 2: Balance ~ Income + Limit + Rating + Num_Cards + Age + Yrs_Ed +
## Male + Student + Married + Ethnicity.2.or.more + Ethnicity.African.American +
## Ethnicity.Asian + Ethnicity.Caucasian + Ethnicity.Unknown
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 443 5355995
## 2 436 5248014 7 107982 1.2816 0.2577
P.value = 0.2577 is NOT less than alpha = 0.05, we fail to Reject Ho (same conclusion). We do not have evidence that the Big model is better than the Small model. Thus, the Small model (model with lowest BIC) is just as good as the Big model (model all).
#q7. Are there some extended regression models that should be considered here? That is, are there transformations of the variables or extensions of linear regression that could be useful models for predicting Balance? Provide two separate arguments for possible model extensions. Note, you do not need to actually fit those models. Hint: See James text.
# Interactive terms
# 1. Limit & Rating: cor = 0.9965583
# 2. Limit & Income: cor = 0.7833861
reg.min.bic.1 <- lm(Balance ~ Income + Limit + Rating + Limit*Rating + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic.1)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Limit * Rating +
## Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -337.90 -63.54 5.38 56.02 617.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.315e+02 2.626e+01 -12.623 < 2e-16 ***
## Income -9.005e+00 2.414e-01 -37.303 < 2e-16 ***
## Limit 1.684e-01 2.858e-02 5.892 7.57e-09 ***
## Rating 4.542e-01 4.358e-01 1.042 0.298
## Num_Cards 1.611e+01 3.916e+00 4.113 4.65e-05 ***
## Student -4.211e+02 1.549e+01 -27.181 < 2e-16 ***
## Ethnicity.Unknown -7.853e+01 1.886e+01 -4.164 3.77e-05 ***
## Limit:Rating 1.026e-04 1.002e-05 10.238 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98.97 on 442 degrees of freedom
## Multiple R-squared: 0.954, Adjusted R-squared: 0.9533
## F-statistic: 1311 on 7 and 442 DF, p-value: < 2.2e-16
reg.min.bic.2 <- lm(Balance ~ Income + Limit + Rating + Limit*Income + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic.2)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Limit * Income +
## Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -316.88 -76.23 -5.41 62.67 651.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.453e+02 2.497e+01 -17.835 < 2e-16 ***
## Income -9.968e+00 4.665e-01 -21.367 < 2e-16 ***
## Limit 1.744e-01 3.085e-02 5.654 2.81e-08 ***
## Rating 1.205e+00 4.616e-01 2.611 0.009323 **
## Num_Cards 1.547e+01 4.228e+00 3.658 0.000285 ***
## Student -4.178e+02 1.673e+01 -24.976 < 2e-16 ***
## Ethnicity.Unknown -8.072e+01 2.036e+01 -3.964 8.59e-05 ***
## Income:Limit 2.633e-04 5.063e-05 5.200 3.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106.9 on 442 degrees of freedom
## Multiple R-squared: 0.9464, Adjusted R-squared: 0.9456
## F-statistic: 1115 on 7 and 442 DF, p-value: < 2.2e-16
reg.min.bic.3 <- lm(Balance ~ Income + Limit + Rating + Limit*Rating + Limit*Income + Num_Cards + Student + Ethnicity.Unknown, data = dat.cb2)
summary(reg.min.bic.3)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Limit * Rating +
## Limit * Income + Num_Cards + Student + Ethnicity.Unknown,
## data = dat.cb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -469.35 -48.57 6.11 47.68 536.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.779e+02 2.485e+01 -11.185 < 2e-16 ***
## Income -3.723e+00 6.233e-01 -5.973 4.80e-09 ***
## Limit 1.573e-01 2.630e-02 5.981 4.60e-09 ***
## Rating -5.120e-01 4.145e-01 -1.235 0.217
## Num_Cards 1.600e+01 3.600e+00 4.444 1.12e-05 ***
## Student -4.291e+02 1.427e+01 -30.079 < 2e-16 ***
## Ethnicity.Unknown -7.555e+01 1.734e+01 -4.357 1.64e-05 ***
## Limit:Rating 2.762e-04 2.125e-05 12.999 < 2e-16 ***
## Income:Limit -9.017e-04 9.944e-05 -9.067 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90.96 on 441 degrees of freedom
## Multiple R-squared: 0.9613, Adjusted R-squared: 0.9606
## F-statistic: 1368 on 8 and 441 DF, p-value: < 2.2e-16
The two models above are built on interactive terms, which is superior to the models that contain only the main effects. The p-values for Income * Limit and Rating * Limit are extremely low, indicating that there are strong evidence for Ha: β(Income * Limit) ≠ 0,and β(Rating * Limit) ≠ 0. In other words, it is clear that the true relationships are not additive, and they also result in higer adjust r-squares. (0.9613 - 0.9431)/(1 - 0.9431) = 31.99% of the variability in balance that remains after fitting the additive model has been explained by the interactive Term Income * Limit and Rating * Limit.
#q8. Randomly select 300 of the 450 rows run the regression model in question 5 on this “training” data set. Compute the RSS for this training model. Now use this model to predict the customers’ Balance on the remainder of the data (test data set). What is the RSS from these predictions? Comment on the relationship between the two RSS values.
set.seed(654123)
split = sample.split(dat.cb2$Balance, SplitRatio = 300/450)
training_set = subset(dat.cb2, split == TRUE)
test_set = subset(dat.cb2, split == FALSE)
reg.2 <- lm(Balance ~ Income + Limit + Rating + Num_Cards + Student + Ethnicity.Unknown, data = training_set)
summary(reg.2)
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Num_Cards +
## Student + Ethnicity.Unknown, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -301.96 -75.33 -6.80 54.84 327.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -552.73536 23.80155 -23.223 < 2e-16 ***
## Income -7.94665 0.26830 -29.619 < 2e-16 ***
## Limit 0.15705 0.03795 4.139 4.57e-05 ***
## Rating 1.66470 0.57218 2.909 0.00390 **
## Num_Cards 17.85693 5.40051 3.307 0.00106 **
## Student -421.92072 19.47304 -21.667 < 2e-16 ***
## Ethnicity.Unknown -114.65973 24.36437 -4.706 3.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.5 on 293 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9496
## F-statistic: 940.7 on 6 and 293 DF, p-value: < 2.2e-16
anova(reg.2)
## Analysis of Variance Table
##
## Response: Balance
## Df Sum Sq Mean Sq F value Pr(>F)
## Income 1 12553970 12553970 1194.0573 < 2.2e-16 ***
## Limit 1 41275742 41275742 3925.8976 < 2.2e-16 ***
## Rating 1 476219 476219 45.2950 8.913e-11 ***
## Num_Cards 1 62063 62063 5.9031 0.01572 *
## Student 1 4741725 4741725 451.0040 < 2.2e-16 ***
## Ethnicity.Unknown 1 232845 232845 22.1468 3.900e-06 ***
## Residuals 293 3080517 10514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
yhat.test <- predict(reg.2, test_set)
err.yhat.test <- test_set$Balance - yhat.test
RSS.yhat.test <- sum(err.yhat.test^2)
RMSE.yhat.test <- (RSS.yhat.test/150)^0.5
RSS.yhat.test
## [1] 2339922
RMSE.yhat.test
## [1] 124.8979
RSS for train and test are 3080517 and 2339922, and the RMSE are 102.5 and 124.8979 respectively. The model was trained on the train set, so there will be variations in the effects of over fitting. The RSS decrease in the test set might be that there are fewer observations in our test set. However, after computing RMSE, we can observe that there is a small increase in the test RMSE, suggesting that there are some variance that the trained model cannot captured in the test data.
#q9. Repeat exercise 8 using the “All-In” model from question 3. Is there a different result from question 8? Explain.
set.seed(654123)
split = sample.split(dat.cb2$Balance, SplitRatio = 300/450)
training_set = subset(dat.cb2, split == TRUE)
test_set = subset(dat.cb2, split == FALSE)
reg.q9 <- lm(Balance ~ ., data = training_set)
summary(reg.q9)
##
## Call:
## lm(formula = Balance ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -287.39 -78.30 -7.07 55.49 339.43
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -678.31467 44.60134 -15.208 < 2e-16 ***
## Income -7.89103 0.27637 -28.553 < 2e-16 ***
## Limit 0.15252 0.03838 3.974 8.96e-05 ***
## Rating 1.72146 0.57828 2.977 0.00316 **
## Num_Cards 19.10484 5.45870 3.500 0.00054 ***
## Age -0.23550 0.36528 -0.645 0.51963
## Yrs_Ed 0.32233 1.87980 0.171 0.86398
## Male 27.15554 11.85614 2.290 0.02272 *
## Student -423.30021 19.80574 -21.373 < 2e-16 ***
## Married 3.97966 12.49493 0.319 0.75034
## Ethnicity.2.or.more 68.94618 37.29803 1.849 0.06556 .
## Ethnicity.African.American 107.63388 26.84036 4.010 7.75e-05 ***
## Ethnicity.Asian 117.29501 26.84283 4.370 1.74e-05 ***
## Ethnicity.Caucasian 117.62349 25.32576 4.644 5.21e-06 ***
## Ethnicity.Unknown NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.2 on 286 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.95
## F-statistic: 437.9 on 13 and 286 DF, p-value: < 2.2e-16
anova(reg.q9)
## Analysis of Variance Table
##
## Response: Balance
## Df Sum Sq Mean Sq F value Pr(>F)
## Income 1 12553970 12553970 1202.3221 < 2.2e-16 ***
## Limit 1 41275742 41275742 3953.0711 < 2.2e-16 ***
## Rating 1 476219 476219 45.6085 8.055e-11 ***
## Num_Cards 1 62063 62063 5.9439 0.01538 *
## Age 1 22118 22118 2.1183 0.14664
## Yrs_Ed 1 55394 55394 5.3052 0.02198 *
## Male 1 15408 15408 1.4757 0.22545
## Student 1 4721758 4721758 452.2135 < 2.2e-16 ***
## Married 1 418 418 0.0401 0.84147
## Ethnicity.2.or.more 1 18184 18184 1.7415 0.18800
## Ethnicity.African.American 1 0 0 0.0000 0.99911
## Ethnicity.Asian 1 10325 10325 0.9889 0.32086
## Ethnicity.Caucasian 1 225229 225229 21.5707 5.206e-06 ***
## Residuals 286 2986251 10441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
yhat.test.q9 <- predict(reg.q9, test_set)
## Warning in predict.lm(reg.q9, test_set): prediction from a rank-deficient fit
## may be misleading
err.yhat.test.q9 <- test_set$Balance - yhat.test.q9
RSS.yhat.test.q9 <- sum(err.yhat.test.q9^2)
RMSE.yhat.test.q9 <- (RSS.yhat.test.q9/150)^0.5
RSS.yhat.test.q9
## [1] 2383328
RMSE.yhat.test.q9
## [1] 126.051
RSS for train and test1, test.q9 are 2986251, 2339922 and 2383328, and the RMSE are 102.5, 124.8979 and 126.051 respectively. The “All-In” model is too complicated and results more errors. The reduntant variables are not contributing to the linear regression explaination of our data, so the RSS increased.
#q10. Summarize what the bank’s managers should know about predicting end of month balance for their customers in 3 to 5 sentences. How confident would you feel about predictions in this case?
# The customers' balance at the end oftthe month is highly correalted with rating, limit and income, and it is largely determined by whether the customer is a student or not. Rating and income, rating and limit have some interactive influence on the prediction of the balance. The bank could also target on Asian and Caucasian customers to provide more incentives to stimulate consumption using this credit card. I am more confident using the reg.2 model than the all-in model based on the calculation. And the model using two interactive terms Limit*Rating and Limit*Incomeare is 96.13% accurate on the predictions of the end of month customer balance.