1. Dataset Introduction

This project explores the CreditCard dataset from the AER package. CreditCard is a cross-section of data on a sample of credit card applicants, covering their credit card usage history and other personal and financial characteristics.

In total, the data frame contains 1,319 observations on 12 variables.

  • card: [Categorical - Binary] Factor. Was the application for a credit card accepted?
  • reports: [Numerical] Number of major derogatory reports.
  • age: [Numerical] Age in years plus twelfths of a year.
  • income: [Numerical] Yearly income (in USD 10,000).
  • share: [Numerical] Ratio of monthly credit card expenditure to yearly income.
  • expenditure: [Numerical] Average monthly credit card expenditure.
  • owner: [Categorical - Binary] Factor. Does the individual own their home?
  • selfemp: [Categorical - Binary] Factor. Is the individual self-employed?
  • dependents: [Numerical, treated as categorical in this analysis] Number of dependents.
  • months: [Numerical] Months living at current address.
  • majorcards: [Numerical] Number of major credit cards held.
  • active: [Numerical] Number of active credit accounts.

Note: According to Greene (2003, p. 952) dependents equals 1 + number of dependents.

Acknowledgement: This dataset was originally published alongside the 5th edition of William Greene’s book Econometric Analysis.

Since this project focuses mainly on people’s credit card expenditure behavior, the variables card, income and expenditure will be analyzed with respect to the other variables.

Needed packages

library(tidyverse)
library(dplyr)
library(moderndive)
library(grid)
library(gridExtra)
library(scatterplot3d)
library(AER)
data("CreditCard")
# Treat the dependents count as a categorical variable so that plots group by it
CreditCard$dependents <- as.character(CreditCard$dependents)

2. Exploratory Data Analysis

2.1 Correlation between income and personal aspects

Income, Expenditure, Age and House Ownership

ggplot(CreditCard, aes(x = age, y = income, color = owner))+
  geom_point(alpha = 0.2)+
  geom_smooth(method = "lm", se = FALSE)+
  labs(title = "Income, Age \n& House Ownership", color = "House", x = "Age", y = "Income")+
  theme(text = element_text(size = 8), axis.text.x = element_text(size = 8), axis.text.y = element_text(size = 8), plot.title = element_text(size = 10), legend.text = element_text(size = 8))-> p1

ggplot(CreditCard, aes(x = age, y = expenditure, color = owner))+
  geom_point(alpha = 0.2)+
  geom_smooth(method = "lm", se = FALSE)+
  labs(title = "Expenditure, Age \n& House Ownership", color = "House", x = "Age", y = "Credit card expenditure")+
  theme(text = element_text(size = 8), axis.text.x = element_text(size = 8), axis.text.y = element_text(size = 8), plot.title = element_text(size = 10), legend.text = element_text(size = 8))-> p2

grid.arrange(p1, p2, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

  • Overall, age and income are positively correlated for both house owners and non-owners: older people tend to have higher incomes. At every age, house owners tend to earn more than non-owners.
  • Credit card expenditure for non-owners is roughly flat across ages. For house owners there is a downward trend: older owners tend to spend less by credit card.
  • People owning a house are typically older than those who do not.

Income, expenditure and Self-Employment

ggplot(CreditCard, aes(x = selfemp, y = income))+
  geom_boxplot(color = "#008080", fill = "#AFEEEE")+
  labs(title = "Income \n& Self-Employment", x = "Self-Employment", y = "Income") -> p3

ggplot(CreditCard, aes(x = selfemp, y = expenditure))+
  geom_boxplot(color = "#008080", fill = "#AFEEEE")+
  labs(title = "Credit card expenditure \n& Self-Employment", x = "Self-Employment", y = "Credit card expenditure") -> p4

grid.arrange(p3, p4, ncol = 2)

  • On average, self-employed people earn a higher income, but their incomes vary more than those of people who are not self-employed.
  • In general, people who are not self-employed spend more by credit card. However, the difference is minimal.

Income, expenditure and number of dependents

ggplot(CreditCard, aes(x = dependents, y = income))+
  geom_boxplot(color = "#008080", fill = "#AFEEEE")+
  labs(title = "Income and number of dependents", x = "Number of dependents", y = "Income")

ggplot(CreditCard, aes(x = dependents, y = expenditure))+
  geom_boxplot(color = "#008080", fill = "#AFEEEE")+
  labs(title = "Credit card expenditure and number of dependents", x = "Number of dependents", y = "Credit card expenditure")

Distribution of dependent numbers

ggplot(CreditCard, aes(x = dependents))+
  geom_bar(color = "#2F4F4F", fill = "#20B2AA")+
  labs(title = "Distribution of dependent numbers")

ggplot(CreditCard, aes(x = dependents))+
  geom_bar(color = "#2F4F4F", fill = "#20B2AA")+
  facet_wrap(~ owner)+
  labs(title = "Distribution of dependent numbers based on house ownership")

In general, the majority of applicants in this dataset have 0 dependents.

2.2 Credit card behaviors & Acceptance rate

Summary of income, number of active credit accounts and credit card expenditure

CreditCard %>% 
  summarise(mean_inc = mean(income), mean_exp = mean(expenditure), mean_share = mean(share), mean_active = mean(active))
mean_inc  mean_exp  mean_share  mean_active
3.365376  185.0571   0.0687322     6.996967

On average, people have a yearly income of about 33,650 USD, a monthly credit card expenditure of about 185 USD, and about 7 active credit accounts. Interestingly, the average ratio of monthly credit card expenditure to yearly income is about 6.9%.
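These figures follow from the units in the variable list: income is recorded in units of 10,000 USD and share is a ratio, so the summary values convert as follows (rounded):

\[ 3.365376 \times 10{,}000 \approx 33{,}654 \text{ USD}, \qquad 0.0687322 \times 100 \approx 6.87\% \]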

Income, Average monthly Credit Card expenditure and Self-Employment

ggplot(CreditCard, aes(x = income, y = expenditure, color = selfemp))+
  geom_point(alpha = 0.2)+
  geom_smooth(method = "lm", se = FALSE)+
  labs(title = "Income and Average monthly Credit Card expenditure", color = "Self-Employment", x = "Income", y = "Average monthly Credit Card expenditure")
## `geom_smooth()` using formula 'y ~ x'

Generally, there is a positive correlation between income and average monthly credit card expenditure, and the fitted line is slightly steeper for people who are not self-employed.

Income and Credit Card application acceptance

ggplot(CreditCard, aes(x = card, y = income))+
  geom_boxplot(color = "#008080", fill = "#AFEEEE")+
  labs(title = "Income and Credit Card application acceptance", x = "Card accepted", y = "Income")

Overall, people who had their credit card application accepted had a higher average income.

Income, Number of Active Cards and Card acceptance

ggplot(CreditCard, aes(x = income, y = active, color = card))+
  geom_point(alpha = 0.2)+
  geom_smooth(method = "lm", se = FALSE)+
  labs(title = "Income, Number of Active Cards and Card acceptance", color = "Card accepted", x = "Income", y = "Number of active credit accounts")
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between income and the number of active credit accounts.

3. Simple Linear Regression & Two-Sample Test

3.1 Income & Credit Card Expenditure

cor(CreditCard$expenditure, CreditCard$income)
## [1] 0.281104
exp_inc <- lm(expenditure ~ income, data = CreditCard)
get_regression_table(exp_inc)
term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept    33.027      16.01      2.063    0.039     1.618    64.435
income       45.175       4.25     10.630    0.000    36.838    53.512

Regression Equation:

\[ \widehat{expenditure} = 33.027 + 45.175 \times income \]

Coefficient interpretation:

  • The intercept of 33.027 represents the average monthly credit card expenditure when the card holder’s yearly income equals 0.

  • The slope of 45.175: For every increase in income of 10,000 USD (yearly income is recorded in 10,000 USD), there is an associated increase in average monthly credit card expenditure of 45.175 USD.
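To make the equation concrete, here is a small sketch that plugs a few income values into it. The coefficients are copied from the regression table above; the helper name `predict_expenditure` and the chosen income values are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the regression table (rounded to 3 decimals).
b0 <- 33.027   # intercept: USD per month at zero income
b1 <- 45.175   # slope: USD per month per 10,000 USD of yearly income

# Hypothetical helper: predicted monthly expenditure (USD)
# for a yearly income given in units of 10,000 USD.
predict_expenditure <- function(income) {
  b0 + b1 * income
}

# Illustrative incomes: 20,000 USD, the sample mean (3.365376), and 50,000 USD.
incomes <- c(2, 3.365376, 5)
pred <- predict_expenditure(incomes)
print(round(pred, 2))
```

At the sample mean income the prediction is close to the sample mean expenditure of 185.0571 USD reported earlier. This is expected, since a least-squares line with an intercept passes through the point of means; the small gap comes from the rounded coefficients.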

Test the hypothesis of interest:

  • Null: There is no statistically significant relationship between monthly credit card expenditure and yearly income.

\[ H_0: \beta_{inc} = 0 \]

  • Alternative: There is a statistically significant relationship between monthly credit card expenditure and yearly income.

\[ H_A: \beta_{inc} \ne 0 \]

According to the regression table, the p-value for income is reported as 0.000 (that is, smaller than 0.001 after rounding), so we reject the null hypothesis and conclude that there is a statistically significant relationship between income and expenditure.
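Statistical significance does not say how strong the relationship is. In a simple linear regression, the proportion of variance explained equals the squared correlation, so the correlation coefficient computed at the start of this subsection gives:

\[ R^2 = r^2 = 0.281104^2 \approx 0.079 \]

In other words, income alone accounts for roughly 8% of the variation in monthly credit card expenditure.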

3.2 Self-Employment & Credit Card Expenditure

exp_selfemp <- lm(expenditure ~ selfemp, data = CreditCard)
get_regression_table(exp_selfemp)
term          estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept      187.697      7.766     24.168    0.000   172.461   202.932
selfemp: yes   -38.264     29.567     -1.294    0.196   -96.268    19.740

Regression Equation:

\[ \widehat{expenditure} = 187.697 - 38.264 \times 1_{\mbox{selfemp}}(x) \]

Coefficient interpretation:

  • The intercept of 187.697 represents the average monthly credit card expenditure when the card holder is not self-employed.

  • The slope of -38.264 represents the predicted difference in the average monthly credit card expenditures between self-employed and non-self-employed people.

Test the hypothesis of interest:

  • Null: There is no difference between the mean monthly credit card expenditures of self-employed and non-self-employed people.

\[ H_0: \mu_{selfemp} = \mu_{non-selfemp} \]

  • Alternative: There is a difference between the mean monthly credit card expenditures of self-employed and non-self-employed people.

\[ H_A: \mu_{selfemp} \ne \mu_{non-selfemp} \]

t.test(expenditure ~ selfemp, data = CreditCard)
## 
##  Welch Two Sample t-test
## 
## data:  expenditure by selfemp
## t = 1.4502, df = 108.14, p-value = 0.1499
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -14.03430  90.56173
## sample estimates:
##  mean in group no mean in group yes 
##          187.6969          149.4332
  • Since the p-value of 0.1499 is larger than 0.05 and the confidence interval contains 0, we lack the evidence to reject the null hypothesis that self-employed and non-self-employed people have the same mean monthly credit card expenditure.
  • In other words, the data do not show a statistically significant difference in mean monthly credit card expenditure between the two groups. This is not proof that the two means are equal.
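As a consistency check, the regression and the t-test in this subsection describe the same group means. The intercept is the mean for the non-self-employed group, and adding the slope gives the mean for the self-employed group reported in the t-test output:

\[ 187.697 + (-38.264) = 149.433 \approx 149.4332 \]

The two p-values differ (0.196 versus 0.1499) because the regression pools the variance of both groups, while the Welch test allows the group variances to differ.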

3.3 House ownership and Credit Card Expenditure

exp_house <- lm(expenditure ~ owner, data = CreditCard)
get_regression_table(exp_house)
term        estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept    162.559      9.981     16.287    0.000   142.980   182.139
owner: yes    51.075     15.038      3.396    0.001    21.573    80.576

Regression Equation:

\[ \widehat{expenditure} = 162.559 + 51.075 \times 1_{\mbox{owner}}(x) \]

Coefficient interpretation:

  • The intercept of 162.559 represents the average monthly credit card expenditure when the card holder does not own a house.

  • The slope of 51.075 represents the predicted difference in the average monthly credit card expenditures between a house owner and a non-owner.

Test the hypothesis of interest:

  • Null: There is no difference between the mean monthly credit card expenditures of a house owner and a non-owner.

\[ H_0: \mu_{owner} = \mu_{non-owner} \]

  • Alternative: There is a difference between the mean monthly credit card expenditures of a house owner and a non-owner.

\[ H_A: \mu_{owner} \ne \mu_{non-owner} \]

t.test(expenditure ~ owner, data = CreditCard)
## 
##  Welch Two Sample t-test
## 
## data:  expenditure by owner
## t = -3.3408, df = 1154.6, p-value = 0.0008621
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -81.07022 -21.07906
## sample estimates:
##  mean in group no mean in group yes 
##          162.5594          213.6341

Based on the evidence (p-value is much smaller than 0.05 and 0 is not in the confidence interval), we reject the null hypothesis and conclude that there is a difference between the mean monthly credit card expenditures of a house owner and a non-owner.
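The regression statistic for owner (3.396) and the Welch statistic (-3.3408) differ in sign and slightly in size. The sketch below illustrates why, using synthetic data rather than the CreditCard sample: the dummy-variable slope equals the difference in group means, and its t statistic matches the pooled-variance two-sample test up to sign, whereas the Welch test uses separate variances. The group sizes, means and standard deviation are arbitrary assumptions chosen for illustration.

```r
set.seed(1)

# Synthetic data for illustration only; NOT the CreditCard sample.
group <- factor(rep(c("no", "yes"), times = c(60, 40)))
y <- rnorm(100, mean = ifelse(group == "yes", 210, 160), sd = 50)

# Regression on a binary factor: the slope is the difference in group means.
fit <- lm(y ~ group)
group_means <- tapply(y, group, mean)
slope <- unname(coef(fit)["groupyes"])
diff_means <- unname(group_means["yes"] - group_means["no"])

# t statistics: regression, pooled two-sample test, and Welch test.
t_lm <- unname(summary(fit)$coefficients["groupyes", "t value"])
t_pooled <- unname(t.test(y ~ group, var.equal = TRUE)$statistic)
t_welch <- unname(t.test(y ~ group)$statistic)

# t.test() computes "no" minus "yes", so its sign is opposite to the slope's.
print(c(slope = slope, diff_means = diff_means))
print(c(t_lm = t_lm, t_pooled = t_pooled, t_welch = t_welch))
```

With group sizes as unequal as in a real sample, the pooled and Welch statistics generally differ a little, which is the same pattern seen in the two outputs of this subsection.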

4. Multiple Regression

4.1 Income, Self-Employment & Credit Card Expenditure

ggplot(CreditCard, aes(x = income, y = expenditure, color = selfemp))+
  geom_point(alpha = 0.2)+
  geom_smooth(method = "lm", se = FALSE)+
  labs(title = "Credit card expenditure, income and self-employment", color = "Self-Employment", x = "Yearly Income", y = "Monthly credit card expenditure")
## `geom_smooth()` using formula 'y ~ x'

reg_para <- lm(expenditure ~ income + selfemp, data = CreditCard)
summary(reg_para)
## 
## Call:
## lm(formula = expenditure ~ income + selfemp, data = CreditCard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -544.96 -138.90  -68.04   65.74 2485.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.935     15.980   2.124   0.0339 *  
## income        46.403      4.268  10.873   <2e-16 ***
## selfempyes   -73.078     28.513  -2.563   0.0105 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 260.8 on 1316 degrees of freedom
## Multiple R-squared:  0.08359,    Adjusted R-squared:  0.0822 
## F-statistic: 60.02 on 2 and 1316 DF,  p-value: < 2.2e-16
reg_inter <- lm(expenditure ~ income * selfemp, data = CreditCard)
summary(reg_inter)
## 
## Call:
## lm(formula = expenditure ~ income * selfemp, data = CreditCard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -552.95 -138.83  -68.29   65.25 2478.33 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         31.336     16.704   1.876   0.0609 .  
## income              47.187      4.513  10.456   <2e-16 ***
## selfempyes         -43.372     62.351  -0.696   0.4868    
## income:selfempyes   -7.454     13.914  -0.536   0.5922    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 260.9 on 1315 degrees of freedom
## Multiple R-squared:  0.08379,    Adjusted R-squared:  0.0817 
## F-statistic: 40.09 on 3 and 1315 DF,  p-value: < 2.2e-16

Based on the graph and the regression summaries, the parallel slopes model is preferred: its adjusted R-squared is slightly higher (0.0822 versus 0.0817), the interaction term is not statistically significant (p = 0.5922), and it produces a simpler prediction than the interaction model.
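The adjusted R-squared values used in this comparison can be reproduced from the printed summaries. A minimal sketch, assuming the standard adjustment 1 - (1 - R²)(n - 1)/(n - p - 1) with n = 1,319 observations and p predictors; the helper name `adj_r2` is illustrative and not part of the original analysis.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p (excluding the intercept).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1319  # observations in CreditCard

# Multiple R-squared values copied from the two summary() outputs.
adj_parallel    <- adj_r2(0.08359, n, p = 2)  # income + selfemp
adj_interaction <- adj_r2(0.08379, n, p = 3)  # income * selfemp

print(round(c(parallel = adj_parallel, interaction = adj_interaction), 4))
```

The interaction model has the larger raw R-squared, but the penalty for its extra term leaves it with the smaller adjusted value.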

exp_inc_parallel <- lm(expenditure ~ income + selfemp, data = CreditCard)
get_regression_table(exp_inc_parallel)
term          estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept       33.935     15.980      2.124    0.034     2.585    65.285
income          46.403      4.268     10.873    0.000    38.031    54.776
selfemp: yes   -73.078     28.513     -2.563    0.010  -129.015   -17.141

Regression Equation:

\[ \widehat{expenditure} = 33.935 + 46.403 \times income - 73.078 \times 1_{\mbox{selfemp}}(x) \]

Coefficient interpretation:

  • The intercept of 33.935 represents the predicted average monthly credit card expenditure for a card holder who is not self-employed and whose yearly income equals zero.

  • The slope of 46.403: For every increase in income of 10,000 USD (yearly income is recorded in 10,000 USD), there is an associated increase in average monthly credit card expenditure of 46.403 USD, regardless of self-employment status.

  • The slope of -73.078: The predicted difference in the average monthly credit card expenditure between self-employed and non-self-employed people with the same income. Self-employed people are predicted to spend 73.078 USD less per month.
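As a worked example, take a hypothetical card holder with a yearly income of 30,000 USD (income = 3; the value is chosen only for illustration). The parallel slopes equation gives, for the non-self-employed and self-employed case respectively:

\[ 33.935 + 46.403 \times 3 = 173.144, \qquad 173.144 - 73.078 = 100.066 \]

The gap between the two predictions is 73.078 USD at any income level, which is what "parallel slopes" means.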

Test the hypothesis of interest:

For income:

  • Null: There is no statistically significant relationship between monthly credit card expenditure and yearly income.

\[ H_0: \beta_{inc} = 0 \]

  • Alternative: There is a statistically significant relationship between monthly credit card expenditure and yearly income.

\[ H_A: \beta_{inc} \ne 0 \]

For selfemp:

  • Null: There is no statistically significant relationship between monthly credit card expenditure and self-employment status.

\[ H_0: \beta_{selfemp} = 0 \]

  • Alternative: There is a statistically significant relationship between monthly credit card expenditure and self-employment status.

\[ H_A: \beta_{selfemp} \ne 0 \]

According to the regression table, the p-value for income is reported as 0.000 (smaller than 0.001 after rounding), so we reject the null hypothesis and conclude that there is a statistically significant relationship between income and expenditure. The p-value for selfemp is 0.010, which is smaller than 0.05, so we also reject the null hypothesis and conclude that, after accounting for income, there is a statistically significant relationship between monthly credit card expenditure and self-employment status.

4.2 Income, Credit Card Expenditure and Portion of monthly credit card expenditure to yearly income

CreditCard_share <- CreditCard %>% 
  mutate(shareInPercent = share*100)
ggplot(CreditCard_share, aes(x = income, y = shareInPercent))+
  geom_point(alpha = 0.2, color = "#90EE90")+
  geom_smooth(method = "lm", se = FALSE, color = "#006400")+
  labs(title = "Expenditure ratio \n& Income", x = "Income", y = "Expenditure ratio")+
  theme(text = element_text(size = 8), 
        axis.text.x = element_text(size = 8), 
        axis.text.y = element_text(size = 8), 
        plot.title = element_text(size = 10), 
        legend.text = element_text(size = 8))-> p5

ggplot(CreditCard_share, aes(x = expenditure, y = shareInPercent))+
  geom_point(color = "#90EE90", alpha = 0.2)+
  geom_smooth(method = "lm", se = FALSE, color = "#006400")+
  labs(title = "Expenditure ratio \n& Expenditure level", x = "Credit card expenditure", y = "Expenditure ratio")+
  theme(text = element_text(size = 8), 
        axis.text.x = element_text(size = 8), 
        axis.text.y = element_text(size = 8), 
        plot.title = element_text(size = 10), 
        legend.text = element_text(size = 8))-> p6

grid.arrange(p5, p6, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

plot <- scatterplot3d(x = CreditCard_share$income,
       y= CreditCard_share$expenditure,
       z= CreditCard_share$shareInPercent,
       xlab = "Income", ylab = "Expenditure", zlab = "Share", 
       highlight.3d = TRUE, angle = 55,
       cex.axis = 0.5,
       cex.lab = 0.8, main = "Income, Expenditure & Share", pch = 20)

fit <- lm(shareInPercent ~ income + expenditure,data = CreditCard_share)
plot$plane3d(fit, lty.box = "solid")

Because most of the observations cluster in the low-income, low-expenditure corner, most of the data was collected from people with modest incomes and modest credit card expenditure. Therefore, implications from this dataset better represent people with these characteristics than exceptionally high-income or extravagant credit card users.

share_model <- lm(shareInPercent ~ income + expenditure, data = CreditCard_share)
get_regression_table(share_model)
term         estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept       6.832      0.263     25.959        0     6.316     7.348
income         -1.761      0.073    -24.229        0    -1.903    -1.618
expenditure     0.032      0.000     71.306        0     0.031     0.033

Regression equation:

\[ \widehat{share} = 6.832 - 1.761 \times income + 0.032 \times expenditure \]

Coefficient interpretation:

  • The intercept of 6.832 represents the predicted average ratio (in %) of monthly credit card expenditure to yearly income when the card holder’s yearly income and credit card expenditure are both zero.

  • The slope of -1.761: For every increase in income of 10,000 USD (yearly income is recorded in 10,000 USD), there is an associated decrease of 1.761 percentage points in the average ratio of monthly credit card expenditure to yearly income, holding expenditure constant.

  • The slope of 0.032: For every increase in monthly credit card expenditure of 1 USD, there is an associated increase of 0.032 percentage points in the average ratio of monthly credit card expenditure to yearly income, holding income constant.
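As a worked example, consider a hypothetical card holder with a yearly income of 30,000 USD (income = 3) and a monthly expenditure of 200 USD; both values are chosen only for illustration. The predicted ratio, in %, is:

\[ 6.832 - 1.761 \times 3 + 0.032 \times 200 = 6.832 - 5.283 + 6.400 = 7.949 \]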

5. Conclusion

There are five main takeaways from analyzing the CreditCard dataset:

  1. Older people tend to earn higher incomes, and among house owners, older people tend to spend less by credit card. Additionally, house owners generally have a higher income, higher credit card expenditure and are older.

  2. While self-employed credit card users earn more on average, they tend to have lower credit card expenditure.

  3. Monthly Credit Card Expenditure has a positive correlation with Income. The levels of expenditure are on average lower for self-employed people.

\[ \widehat{expenditure} = 33.935 + 46.403 \times income - 73.078 \times 1_{\mbox{selfemp}}(x) \]

  4. In the two-sample tests, there is a statistically significant difference in credit card expenditure by house ownership (p < 0.001), but not by self-employment status (p = 0.1499). Self-employment status becomes statistically significant only once income is included in the model (p = 0.010). On its own, house ownership is therefore a better indicator of credit card expenditure than self-employment status.

  5. The average ratio (in %) of monthly credit card expenditure to yearly income is negatively associated with income level, but positively associated with expenditure level. One interesting implication is that as people’s income increases, the share of their income spent by credit card decreases.

\[ \widehat{share} = 6.832 - 1.761 \times income + 0.032 \times expenditure \]

Note:

  • This dataset is a good representation of people with relatively low incomes and credit card expenditure. However, its implications should not be extended to high-income or very heavy credit card users.
  • Correlation is not synonymous with causation; it only describes how variables change with respect to one another.