Bank Revenue Regression Analysis
Consider the model:
m6 <- lm(lrev_total ~ lbal_total + offer + age + age2 + card + loan + insur + check, data = tb.bankrev)
Write down the population model.
Write down the estimated model.
Interpret the magnitude and statistical significance of โlbal_totalโ.
Interpret the magnitude and statistical significance of โloanyesโ.
Expand your model by adding an interaction term between โloanโ and โlbal_totalโ. Interpret the interaction term.
Assess the overall quality of the model using the available metrics.
Consider a 48 year old customer with a total balance of $25,736 who received the promotional offer, has a credit card, does not have a loan or insurance, and has a high checking account activity. Predict the total revenue this customer is likely to generate based on the selected model.
Get started by loading libraries, reading and preprocessing data.
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(ggfortify)
load("data/bankrev.RData")
# Create log(rev_total) and log(bal_total)
tb.bankrev <- tb.bankrev %>%
mutate(lrev_total = log(rev_total),
lbal_total = log(bal_total))
# Format factor variables
tb.bankrev <- tb.bankrev %>% mutate(
offer = factor(offer , levels = c(0,1), labels = c("no", "yes"))
, chq = factor(chq , levels = c(0,1), labels = c("no", "yes"))
, card = factor(card , levels = c(0,1), labels = c("no", "yes"))
, sav1 = factor(sav1 , levels = c(0,1), labels = c("no", "yes"))
, loan = factor(loan , levels = c(0,1), labels = c("no", "yes"))
, mort = factor(mort , levels = c(0,1), labels = c("no", "yes"))
, insur = factor(insur , levels = c(0,1), labels = c("no", "yes"))
, pens = factor(pens , levels = c(0,1), labels = c("no", "yes"))
, check = factor(check , levels = c(0,1), labels = c("no", "yes"))
, cd = factor(cd , levels = c(0,1), labels = c("no", "yes"))
, mm = factor(mm , levels = c(0,1), labels = c("no", "yes"))
, savings = factor(savings, levels = c(0,1), labels = c("no", "yes"))
)
# Create age2
tb.bankrev <- tb.bankrev %>% mutate(age2 = age^2)
Consider the model:
m6 <- lm(lrev_total ~ lbal_total + offer + age + age2 + card + loan + insur + check, data = tb.bankrev)
\(log(rev\_total) = \beta_0 + \beta_1log(bal\_total) + \beta_2offer + \beta_3age + \beta_4 age^2 + \beta_5card + \beta_6loan + \beta_7insur + \beta_8 check + \epsilon_i\)
stargazer(m6, type = "text", no.space = TRUE)
##
## ===============================================
## Dependent variable:
## ---------------------------
## lrev_total
## -----------------------------------------------
## lbal_total 0.440***
## (0.005)
## offeryes 0.122***
## (0.039)
## age 0.004**
## (0.002)
## age2 -0.00005**
## (0.00002)
## cardyes 1.590***
## (0.048)
## loanyes -0.120***
## (0.035)
## insuryes -0.087***
## (0.027)
## checkyes -1.391***
## (0.046)
## Constant -2.649***
## (0.055)
## -----------------------------------------------
## Observations 7,398
## R2 0.588
## Adjusted R2 0.588
## Residual Std. Error 0.810 (df = 7389)
## F Statistic 1,319.844*** (df = 8; 7389)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
\(\hat{log(rev\_total)} = -2.649 + 0.440log(bal\_total) + 0.122offer + 0.004age - 0.00005age^2 + 1.590card - 0.120loan - 0.087insur - 1.391check\)
lbal_total.This is a log-log (aka constant elasticity model). Thus, on average, all else constant, a one percent increase in the account balance is associated with a 0.440% increase in revenue. This effect is statistically significant (p<0.01).
loanyes.100*(exp(-0.120)-1)
## [1] -11.30796
This is a log-level case. Thus, on average, all else constant, customers who have a loan yield -11.308% less in revenue that customers who do not.
loan and lbal_total. Interpret the interaction
term.# Regression Analysis
m7 <- lm(lrev_total ~ lbal_total + offer + age + age2 + card + loan*lbal_total + insur + check, data = tb.bankrev)
stargazer(m7, type = "html")
| Dependent variable: | |
| lrev_total | |
| lbal_total | 0.440*** |
| (0.005) | |
| offeryes | 0.115*** |
| (0.039) | |
| age | 0.004** |
| (0.002) | |
| age2 | -0.00004** |
| (0.00002) | |
| cardyes | 1.604*** |
| (0.048) | |
| loanyes | -0.892*** |
| (0.245) | |
| insuryes | -0.104*** |
| (0.027) | |
| checkyes | -1.422*** |
| (0.047) | |
| lbal_total:loanyes | 0.092*** |
| (0.029) | |
| Constant | -2.622*** |
| (0.056) | |
| Observations | 7,398 |
| R2 | 0.589 |
| Adjusted R2 | 0.588 |
| Residual Std. Error | 0.810 (df = 7388) |
| F Statistic | 1,175.767*** (df = 9; 7388) |
| Note: | p<0.1; p<0.05; p<0.01 |
On average, all else constant, when total balance increases by one percent, the increase in total revenue is 0.09% higher for customers with a loan relative to those without a loan. The difference is statistically significant (p<0.01).
autoplot(m7)
The residuals vs fitted plot do not reveal any strange pattern, so a linear model should be appropriate to the data.
The horizontal line with equally spread points in the scale-location plot suggests homoskedasticity.
The normality of the residuals may prevail due to the location of points along the 45ยบ line in the Q-Q plot.
The residuals vs leverage plot suggests the presence of outliers (points located in the right side).
Remember that whenever the dependent variable is in log form, our predictions will be logged values. To convert these values to the original variable (without the log) we use the following formula:
\[e^{\hat{ln(y)}} \times e^{\frac{\hat{\sigma}^2}{2}}\]
# Create a tibble with the new customer data:
new_cust <- tibble(lbal_total = log(25736),
offer = "yes",
age = 48,
age2 = 48^2,
card = "yes",
loan = "no",
insur = "no",
check = "yes")
# Predict log(revenue)
predict(m7, newdata = new_cust) # output: 2.2249
## 1
## 2.224856
# Predicted revenue using the formula
exp(2.2249)*exp((0.810^2)/2) # 12.845 dollars
## [1] 12.84494