Functional Form and Model Selection
Adapted from the case produced by: From Building Better Models With JMP®Pro, Chapter 4, SAS Press (2015). Grayson, Gardner and Stephens.
Lab Overview
- Log transformations
- Quadratic transformations
In this lab, we take on the role of data analysts for a reputable bank striving to understand customer banking behaviors and their impact on revenues and profitability. The bank has provided us with a comprehensive dataset that includes customer age, account activity indicators, and the total revenue generated by each customer over a 6-month period.
Our primary objective is to construct a robust predictive model that allows the bank to forecast the profitability of individual customers accurately. To achieve this, we will use “Total Revenue” as a surrogate measure for customer profitability. The resulting model will serve as a valuable tool for projecting future bank revenues and strategically guiding marketing campaigns.
Our dataset comprises information on 7,398 bank customers, featuring various key variables:
- Rev_Total Total revenue generated by the customer over a 6-month period (in dollars).
- Bal_Total Total of all account balances, across all accounts held by the customer (in dollars).
- Offer An indicator of whether the customer has received a special promotional offer in the previous one-month period. \(Offer=1\) if the offer was received, \(Offer=0\) if it was not.
- AGE The customer’s age.
- CHQ Indicator of debit card account activity. \(CHQ=0\) is low (or zero) account activity, \(CHQ=1\) is greater account activity.
- CARD Indicator of credit card account activity. \(CARD=0\) is low or zero account activity, \(CARD=1\) is greater account activity.
- SAV1 Indicator of primary savings account activity. \(SAV1=0\) is low or zero account activity, \(SAV1=1\) is greater activity.
- LOAN Indicator of personal loan account activity. \(LOAN=0\) is low or zero account activity, \(LOAN=1\) is greater activity.
- MORT Indicator of mortgage account tier. \(MORT=0\) is lower tier and less important to the bank’s portfolio. \(MORT=1\) is higher tier and indicates the account is more important to the bank’s portfolio.
- INSUR Indicator of insurance account activity. \(INSUR=0\) is low or zero account activity, \(INSUR=1\) is greater activity.
- PENS Indicator or retirement savings (pension) account tier. \(PENS=0\) is lower balance and less important to bank’s portfolio. \(PENS=1\) is higher tier and of more importance to the bank’s portfolio.
- Check Indicator of checking account activity. \(Check=0\) is low or zero account activity, \(Check=1\) is greater activity.
- CD Indicator of certificate of deposit account tier. \(CD=0\) is lower tier and of less importance to the bank’s portfolio. \(CD=1\) is higher tier and of more importance to the bank’s portfolio.
- MM Indicator of money market account activity. \(MM=0\) is low or zero account activity, \(MM=1\) is greater activity.
- Savings Indicator of savings accounts (other than primary) activity. \(Savings=0\) is low or zero account activity, \(Savings=1\) is greater activity.
- AccountAge Number of years as a customer of the bank.
1. Getting Started
Download Lab 5’s materials from Moodle:
- Save provided data set in your
datafolder in BRM-Labs project folder. - Save provided R script in your
codefolder in BRM-Labs project folder.
- Save provided data set in your
Open the provided lab 5’s R script.
Setup your R environment.
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
library(infer)
library(janitor)
- Load the data.
# Load data
load("data/bankrev.RData")
2. Data Exploration
2.1 Summary Statistics
We start by producing a summary statistics table.
# Summary statistics table
stargazer(data.frame(tb.bankrev), type = "text")
====================================================
Statistic N Mean St. Dev. Min Max
----------------------------------------------------
rev_total 7,398 2.171 3.210 0.010 94.100
bal_total 7,398 9,259.703 16,309.420 1 166,461
offer 7,398 0.549 0.498 0 1
age 7,398 47.197 21.057 0 98
chq 7,398 0.508 0.500 0 1
card 7,398 0.507 0.500 0 1
sav1 7,398 0.187 0.390 0 1
loan 7,398 0.112 0.315 0 1
mort 7,398 0.371 0.483 0 1
insur 7,398 0.303 0.459 0 1
pens 7,398 0.520 0.500 0 1
check 7,398 0.779 0.415 0 1
cd 7,398 0.112 0.315 0 1
mm 7,398 0.303 0.459 0 1
savings 7,398 0.303 0.459 0 1
account_age 7,398 7.008 4.573 0 26
----------------------------------------------------
Note that most variables in our dataset are binary. We may later recode them as factors so that we can label their levels to facilitate graph and regression interpretation.
2.2 Variable Distributions
We start by looking at the distribution of the numerical variables.
# Histogram of revenue
ggplot(data = tb.bankrev, aes(x = rev_total)) + geom_histogram()
# Histogram of account balance
ggplot(data = tb.bankrev, aes(x = bal_total)) + geom_histogram()
# Histogram of customer age
ggplot(data = tb.bankrev, aes(x = age)) + geom_histogram()
# Histogram of account age
ggplot(data = tb.bankrev, aes(x = account_age)) + geom_histogram()
Focusing on the variable of interest, total revenue (Rev_Total), we can see that it is highly positively skewed, which is fairly typical of financial data. In regression analysis, highly skewed data can result in a poorly fitting model. A transformation that can often be used to normalize highly skewed data, when all of the values are positive, is a log (natural logarithm) transformation.
We will apply a log transformation to the Rev_Total and reexamine the distribution. This transformation gives us a much less skewed and more symmetric distribution; as such, we will use Log(Rev_Total) for the rest of our analysis.
Note that the variable Bal_Total also has a very positively skewed distribution, and would also benefit from a log transformation.
# Create log(rev_total) and log(bal_total)
tb.bankrev <- tb.bankrev %>%
mutate(lrev_total = log(rev_total),
lbal_total = log(bal_total))
# Histogram of log revenue
ggplot(data = tb.bankrev, aes(x = lrev_total)) + geom_histogram()
# Histogram of log bal_total
ggplot(data = tb.bankrev, aes(x = lbal_total)) + geom_histogram()
2.3 Handling factor variables
To facilitate plot and regression interpretation, we will convert all indicator variables to factors and label their levels.
# Format factor variables
tb.bankrev <- tb.bankrev %>% mutate(
offer = factor(offer , levels = c(0,1), labels = c("no", "yes"))
, chq = factor(chq , levels = c(0,1), labels = c("no", "yes"))
, card = factor(card , levels = c(0,1), labels = c("no", "yes"))
, sav1 = factor(sav1 , levels = c(0,1), labels = c("no", "yes"))
, loan = factor(loan , levels = c(0,1), labels = c("no", "yes"))
, mort = factor(mort , levels = c(0,1), labels = c("no", "yes"))
, insur = factor(insur , levels = c(0,1), labels = c("no", "yes"))
, pens = factor(pens , levels = c(0,1), labels = c("no", "yes"))
, check = factor(check , levels = c(0,1), labels = c("no", "yes"))
, cd = factor(cd , levels = c(0,1), labels = c("no", "yes"))
, mm = factor(mm , levels = c(0,1), labels = c("no", "yes"))
, savings = factor(savings, levels = c(0,1), labels = c("no", "yes"))
)
2.4 Variable Relationships
We will now graphically explore the relationship between our outcome variable and the set of explanatory variables.
The relationship between the total revenue and total account balance is shown in the scatterplots below (with and without the log transformation). The relationship between the log-transformed variables appears to be nearly linear at lower account balances – higher account balances generally have higher revenues. This relationship, however, seems to change at higher account balances.
# Scatterplot balance - revenue
ggplot(data = tb.bankrev, aes(x = bal_total, y = rev_total)) +
geom_point() + stat_smooth()
# Scatterplot balance - log(revenue)
ggplot(data = tb.bankrev, aes(x = bal_total, y = lrev_total)) +
geom_point() + stat_smooth()
# Scatterplot log(balance) - revenue
ggplot(data = tb.bankrev, aes(x = lbal_total, y = rev_total)) +
geom_point() + stat_smooth()
# Scatterplot log(balance) - log(revenue)
ggplot(data = tb.bankrev, aes(x = lbal_total, y = lrev_total)) +
geom_point() + stat_smooth()
# Scatterplot age - lrev_total
ggplot(data = tb.bankrev, aes(x = age, y = lrev_total)) +
geom_point() + stat_smooth()
# Scatterplot account_age - lrev_total
ggplot(data = tb.bankrev, aes(x = account_age, y = lrev_total)) +
geom_point() + stat_smooth()
We now examine the categorical variables. We can see their distributions as well as their relationship to Log(Rev_Total).
# Relationships between outcome variable and categorical variables
ggplot(data = tb.bankrev, aes(x = offer , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = chq , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = card , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = sav1 , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = loan , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = mort , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = insur , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = pens , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = check , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = cd , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = mm , y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
ggplot(data = tb.bankrev, aes(x = savings, y = lrev_total)) + geom_boxplot() +
theme(text = element_text(size = 25))
Upon examining the set of boxplots, it appears that there are no major differences between the compared groups, except for customers with and without a loan, where the total revenue seems to be higher for customers with a loan. Some differences are also observed for customers with and without a mortgage and with and without a certificate of deposit (CD) account.
3. Multiple Regression Analysis
3.1 Log-transformations
Consider the following regression models:
# Regression Analysis level-level
m1 <- lm(rev_total ~ bal_total + age + account_age + cd + mort + savings
, data = tb.bankrev)
# Regression Analysis level-log
m2 <- lm(rev_total ~ lbal_total + age + account_age + cd + mort + savings
, data = tb.bankrev)
# Regression Analysis log-level
m3 <- lm(lrev_total ~ bal_total + age + account_age + cd + mort + savings
, data = tb.bankrev)
# Regression Analysis log-log
m4 <- lm(lrev_total ~ lbal_total + age + account_age + cd + mort + savings
, data = tb.bankrev)
stargazer(m1, m2, m3, m4, type = "text", no.space = TRUE)
=============================================================================
Dependent variable:
---------------------------------------------
rev_total lrev_total
(1) (2) (3) (4)
-----------------------------------------------------------------------------
bal_total 0.00005*** 0.00002***
(0.00000) (0.00000)
lbal_total 0.533*** 0.351***
(0.015) (0.005)
age -0.003* -0.001 -0.002*** -0.001**
(0.002) (0.002) (0.001) (0.0005)
account_age 0.048*** 0.073*** 0.020*** 0.041***
(0.009) (0.008) (0.003) (0.002)
cdyes 2.150*** 0.890*** 1.074*** 0.240***
(0.115) (0.114) (0.043) (0.035)
mortyes -0.498*** -1.491*** 0.067* -0.714***
(0.096) (0.091) (0.036) (0.028)
savingsyes -0.145* -0.045 -0.190*** -0.110***
(0.081) (0.076) (0.030) (0.023)
Constant 1.564*** -1.333*** -0.256*** -2.218***
(0.119) (0.142) (0.044) (0.043)
-----------------------------------------------------------------------------
Observations 7,398 7,398 7,398 7,398
R2 0.079 0.189 0.166 0.509
Adjusted R2 0.079 0.188 0.165 0.509
Residual Std. Error (df = 7391) 3.081 2.893 1.153 0.885
F Statistic (df = 6; 7391) 106.052*** 286.269*** 244.768*** 1,278.000***
=============================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Model 1:
\(\hat{\beta_1} = 0.00005\): on average, all else constant, a one dollar increase in the total account balance is associated with a 0.00005 dollar (0.005 dollar cents) increase in revenue. As the coefficient is very small we may consider a larger variation: a $10,000 increase in total account balance is associated with a 50 cents (0.5 dollars) increase in revenue. The effect is statistically significant (p<0.01).
\(\hat{\beta_6} = -0.145\): on average, all else constant, customers who have a savings account yield a $0.145 (around 14.5 dollar cents) less in revenue than customers who do not have a savings account. The effect is marginally significant (p<0.1).
Model 2:
\(\hat{\beta_1} = 0.533\): on average, all else constant, a one percent increase in the account balance is associated with a revenue increase of 0.533/100 = $0.00533 (0.5 dollar cents). The effect is statistically significant (p<0.01.)
\(\hat{\beta_6} = -0.045\) on average, all else constant, customers who have a savings account yield a $0.045 (around 4.5 dollar cents) less in revenue than customers who do not have a savings account. The effect is not statistically significant.
As both Model 1 and Model 2 are built on the same data and have the same dependent variable, we can compare them using the \(R^2\). Model 1 explains 7.9% of the variability of the outcome around its mean and Model 2 explains 18.9%. Thus, Model 2 is preferable to Model 1.
Model 3:
- \(\hat{\beta_1} = 0.00002\): on average, all else constant, a one dollar increase in the total account balance is associated with a 100*(exp(0.00002)-1) = 0.002 percent increase in total revenue. The effect is statistically significant (p<0.01).
100*(exp(0.00002)-1)
[1] 0.00200002
- \(\hat{\beta_6} = -0.190\): on average, all else constant, customers who have a savings account yield 17.3 percent less in revenue than customers who do not have a savings account. The effect is statistically significant (p<0.01).
100*(exp(-0.190)-1)
[1] -17.30409
Model 4:
\(\hat{\beta_1} = 0.351\): on average, all else constant, a one percent increase in total account balance is associated with a 0.351 percent increase in revenue. The effect is statistically significant (p<0.01).
\(\hat{\beta_6} = -0.110\): on average, all else constant, customers who have a savings account yield 10.4 percent less in revenue than customers who do not have a savings account. The effect is statistically significant (p<0.01).
100*(exp(-0.110)-1)
[1] -10.41659
As both Model 3 and Model 4 are built on the same data and have the same dependent variable, we can compare them using the \(R^2\). Model 3 explains 16.6% of the variability of the outcome around its mean and Model 4 explains 50.9%. Thus, Model 4 is preferable to Model 3.
3.2 Adding a Quadratic Term
Consider the following model in which we introduce the quadratic term \(age^2\). The model is presented side by side with the original level-level model for comparison.
# Create age2
tb.bankrev <- tb.bankrev %>% mutate(age2 = age^2)
# Regression Analysis quadratic term
m5 <- lm(rev_total ~ bal_total + age + age2 + account_age + cd + mort + savings
, data = tb.bankrev)
stargazer(m1, m5, type = "text", no.space = TRUE)
======================================================================
Dependent variable:
--------------------------------------------------
rev_total
(1) (2)
----------------------------------------------------------------------
bal_total 0.00005*** 0.00005***
(0.00000) (0.00000)
age -0.003* 0.028***
(0.002) (0.007)
age2 -0.0003***
(0.0001)
account_age 0.048*** 0.048***
(0.009) (0.009)
cdyes 2.150*** 2.100***
(0.115) (0.116)
mortyes -0.498*** -0.482***
(0.096) (0.096)
savingsyes -0.145* -0.182**
(0.081) (0.081)
Constant 1.564*** 0.945***
(0.119) (0.180)
----------------------------------------------------------------------
Observations 7,398 7,398
R2 0.079 0.082
Adjusted R2 0.079 0.081
Residual Std. Error 3.081 (df = 7391) 3.077 (df = 7390)
F Statistic 106.052*** (df = 6; 7391) 94.125*** (df = 7; 7390)
======================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
What is the marginal effect of customer’s age on revenue?
On average, all else constant, a one year increase in age is associated to a change in total revenue of \(0.028 + 2\times(-0.0003)age = 0.028 - 0.0006 \times age\). This means that the impact \(age\) has on revenue will depend on the customer’s age. For an 18-year old customer, the marginal effect is 0.0172, while for a 65-year old customer, the marginal effect is -0.011.
Note that, on average, an additional year in the customer’s age will be associated with an increase in total revenue up until age 47 and from then on be associated with a decrease in total revenue.
\(0.028−0.0006 × age=0 \Leftrightarrow 0.0006 × age = 0.028 \Leftrightarrow age = \frac{0.028}{0.0006} \Leftrightarrow age = 46.667\)