This project explores the CreditCard dataset from the AER
package. CreditCard is a collection of data on a sample of
applicants, covering their credit card usage history and other personal
and financial characteristics.
In total, the data frame contains 1,319 observations on 12 variables.
- card: [Categorical - Binary] Factor. Was the application for a credit card accepted?
- reports: [Numerical] Number of major derogatory reports.
- age: [Categorical/Numerical] Age in years plus twelfths of a year.
- income: [Numerical] Yearly income (in USD 10,000).
- share: [Numerical] Ratio of monthly credit card expenditure to yearly income.
- expenditure: [Numerical] Average monthly credit card expenditure.
- owner: [Categorical - Binary] Factor. Does the individual own their home?
- selfemp: [Categorical - Binary] Factor. Is the individual self-employed?
- dependents: [Categorical] Number of dependents.
- months: [Numerical] Months living at current address.
- majorcards: [Numerical] Number of major credit cards held.
- active: [Numerical] Number of active credit accounts.

Note: According to Greene (2003, p. 952), dependents equals 1 + number of dependents.
Acknowledgement: This dataset was originally published alongside the 5th edition of William Greene’s book Econometric Analysis.
Since this project focuses mainly on analyzing people’s credit card expenditure behavior, the variables card, income and expenditure will be analyzed with respect to the other variables.
Needed packages
library(tidyverse)
library(dplyr)
library(moderndive)
library(grid)
library(gridExtra)
library(scatterplot3d)
library(AER)
data("CreditCard")
CreditCard$dependents <- as.character(CreditCard$dependents)
Income, Expenditure, Age and House Ownership
ggplot(CreditCard, aes(x = age, y = income, color = owner))+
geom_point(alpha = 0.2)+
geom_smooth(method = "lm", se = F)+
labs(title = "Income, Age \n& House Ownership", color = "House", x = "Age", y = "Income")+
theme(text = element_text(size = 8), axis.text.x = element_text(size = 8), axis.text.y = element_text(size = 8), plot.title = element_text(size = 10), legend.text = element_text(size = 8))-> p1
ggplot(CreditCard, aes(x = age, y = expenditure, color = owner))+
geom_point(alpha = 0.2)+
geom_smooth(method = "lm", se = F)+
labs(title = "Expenditure, Age \n& House Ownership", color = "House", x = "Age", y = "Credit card expenditure")+
theme(text = element_text(size = 8), axis.text.x = element_text(size = 8), axis.text.y = element_text(size = 8), plot.title = element_text(size = 10), legend.text = element_text(size = 8))-> p2
grid.arrange(p1, p2, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
- Overall, there is a positive correlation between age and income for both house owners and non-owners: as people get older, their incomes increase. However, at every age, the fitted line for house owners lies above the line for non-owners, so owners tend to earn a higher income.
- Credit card expenditure for non-owners is relatively static across ages. Meanwhile, there is a downward trend in credit card expenditure for house owners: their credit card expenditure decreases as they age.
- People who own a house are typically older than those who do not.
Income, expenditure and Self-Employment
ggplot(CreditCard, aes(x = selfemp, y = income))+
geom_boxplot(color = "#008080", fill = "#AFEEEE")+
labs(title = "Income \n& Self-Employment", x = "Self-Employment", y = "Income") -> p3
ggplot(CreditCard, aes(x = selfemp, y = expenditure))+
geom_boxplot(color = "#008080", fill = "#AFEEEE")+
labs(title = "Credit card expenditure \n& Self-Employment", x = "Self-Employment", y = "Credit card expenditure") -> p4
grid.arrange(p3, p4, ncol = 2)
- On average, self-employed people earn a higher income, but their incomes are more variable than those of people who are not self-employed.
- In general, people who are not self-employed spend more by credit card. However, the difference is small.
Income, expenditure and number of dependents
ggplot(CreditCard, aes(x = dependents, y = income))+
geom_boxplot(color = "#008080", fill = "#AFEEEE")+
labs(title = "Income and number of dependents", x = "Number of dependents", y = "Income")
ggplot(CreditCard, aes(x = dependents, y = expenditure))+
geom_boxplot(color = "#008080", fill = "#AFEEEE")+
labs(title = "Credit card expenditure and number of dependents", x = "Number of dependents", y = "Credit card expenditure")
Distribution of dependent numbers
ggplot(CreditCard, aes(x = dependents))+
geom_bar(color = "#2F4F4F", fill = "#20B2AA")+
labs(title = "Distribution of dependent numbers")
ggplot(CreditCard, aes(x = dependents))+
geom_bar(color = "#2F4F4F", fill = "#20B2AA")+
facet_wrap(~ owner)+
labs(title = "Distribution of dependent numbers based on house ownership")
In general, the majority of applicants in this dataset have 0 dependents.
Summary of income, number of active credit cards and credit card expenditure
CreditCard %>%
summarise(mean_inc = mean(income), mean_exp = mean(expenditure), mean_share = mean(share), mean_active = mean(active))

| mean_inc | mean_exp | mean_share | mean_active |
|---|---|---|---|
| 3.365376 | 185.0571 | 0.0687322 | 6.996967 |
On average, applicants have a yearly income of about 33,650 USD, monthly credit card expenditure of about 185 USD, and about 7 active credit accounts. Interestingly, the average ratio of monthly credit card expenditure to yearly income is 6.9%.
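The unit conversions behind these figures can be checked by hand. The sketch below uses only the means printed in the summary table (it does not reload the data); the variable names `income_usd` and `share_pct` are introduced here for illustration.

```r
# Means copied from the summary table above
mean_inc   <- 3.365376    # yearly income, recorded in units of USD 10,000
mean_share <- 0.0687322   # expenditure-to-income ratio, as a proportion

# Convert to the units used in the text
income_usd <- mean_inc * 10000   # yearly income in USD
share_pct  <- mean_share * 100   # ratio expressed in percent

round(income_usd)     # 33654
round(share_pct, 1)   # 6.9
```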
Income, Average monthly Credit Card expenditure and Self-Employment
ggplot(CreditCard, aes(x = income, y = expenditure, color = selfemp))+
geom_point(alpha = 0.2)+
geom_smooth(method = "lm", se = F)+
labs(title = "Income and Average monthly Credit Card expenditure", color = "Self-Employment", x = "Income", y = "Average monthly Credit Card expenditure")
## `geom_smooth()` using formula 'y ~ x'
Generally, there is a positive correlation between income and average monthly credit card expenditure, which is slightly stronger for people who are not self-employed.
Income and Credit Card application acceptance
ggplot(CreditCard, aes(x = card, y = income))+
geom_boxplot(color = "#008080", fill = "#AFEEEE")+
labs(title = "Income and Credit Card application acceptance", x = "Card accepted", y = "Income")
Overall, people who had their credit card application accepted had a higher average income.
Income, Number of Active Cards and Card acceptance
ggplot(CreditCard, aes(x = income, y = active, color = card))+
geom_point(alpha = 0.2)+
geom_smooth(method = "lm", se = F)+
labs(title = "Income, Number of Active Cards and Card acceptance", color = "Card accepted", x = "Income", y = "Number of active cards")
## `geom_smooth()` using formula 'y ~ x'
There is a positive correlation between income and the number of active cards.
cor(CreditCard$expenditure, CreditCard$income)
## [1] 0.281104
exp_inc <- lm(expenditure ~ income, data = CreditCard)
get_regression_table(exp_inc)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 33.027 | 16.01 | 2.063 | 0.039 | 1.618 | 64.435 |
| income | 45.175 | 4.25 | 10.630 | 0.000 | 36.838 | 53.512 |
Regression Equation:
\[ \widehat{expenditure} = 33.027 + 45.175 \times income \]
Coefficient interpretation:
The intercept of 33.027 represents the average monthly credit card expenditure when the card holder’s yearly income equals 0.
The slope of 45.175: For every increase in income of 10,000 USD (yearly income is recorded in 10,000 USD), there is an associated increase in average monthly credit card expenditure of 45.175 USD.
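As a worked illustration of this interpretation, the fitted line can be evaluated by hand. This is a minimal sketch that hard-codes the rounded coefficients from the table rather than calling `predict()` on `exp_inc`; the helper name `predict_expenditure` is hypothetical.

```r
# Coefficients copied (rounded) from the regression table above
predict_expenditure <- function(income) {
  # income is in units of USD 10,000, as in the dataset
  33.027 + 45.175 * income
}

predict_expenditure(3)                            # 168.552 USD per month at 30,000 USD income
predict_expenditure(4) - predict_expenditure(3)   # 45.175, the slope
```

With the fitted model object available, `predict(exp_inc, newdata = data.frame(income = 3))` would give the same prediction up to rounding.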
Test the hypothesis of interest:
\[ H_0: \beta_{inc} = 0 \]
\[ H_A: \beta_{inc} \ne 0 \]
According to the regression table, the p-value for income is reported as 0 (that is, below 0.0005 after rounding to three decimals), so we reject the null hypothesis and conclude that there is a statistically significant relationship between income and expenditure.
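The test statistic and p-value in the table can be reproduced approximately from the reported estimate and standard error. The sketch below assumes the usual two-sided t-test with n - 2 residual degrees of freedom for a simple regression; because the inputs are rounded, the statistic differs slightly from the 10.630 shown in the table.

```r
# Values copied from the regression table above (rounded to three decimals)
estimate  <- 45.175
std_error <- 4.25
n_obs     <- 1319        # observations in CreditCard
df_resid  <- n_obs - 2   # intercept and slope are estimated

# t statistic and two-sided p-value
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

t_stat            # about 10.63
p_value < 0.001   # TRUE, which is why the table prints 0
```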
exp_selfemp <- lm(expenditure ~ selfemp, data = CreditCard)
get_regression_table(exp_selfemp)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 187.697 | 7.766 | 24.168 | 0.000 | 172.461 | 202.932 |
| selfemp: yes | -38.264 | 29.567 | -1.294 | 0.196 | -96.268 | 19.740 |
Regression Equation:
\[ \widehat{expenditure} = 187.697 - 38.264 \times 1_{\mbox{selfemp}}(x) \]
Coefficient interpretation:
The intercept of 187.697 represents the average monthly credit card expenditure when the card holder is not self-employed.
The slope of -38.264 represents the predicted difference in the average monthly credit card expenditures between self-employed and non-self-employed people.
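With a single binary predictor, the two coefficients are just a re-expression of the two group means. A quick check using the rounded table values (the variable names are introduced here for illustration):

```r
# Coefficients copied from the regression table above
intercept   <- 187.697   # mean expenditure, not self-employed
selfemp_yes <- -38.264   # difference: self-employed minus not self-employed

mean_no  <- intercept
mean_yes <- intercept + selfemp_yes

mean_yes   # 149.433
```

These agree, up to rounding, with the group means of 187.6969 and 149.4332 reported in the t-test output for the same variables.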
Test the hypothesis of interest:
\[ H_0: \mu_{selfemp} = \mu_{non-selfemp} \]
\[ H_A: \mu_{selfemp} \ne \mu_{non-selfemp} \]
t.test(expenditure ~ selfemp, data = CreditCard)
##
## Welch Two Sample t-test
##
## data: expenditure by selfemp
## t = 1.4502, df = 108.14, p-value = 0.1499
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -14.03430 90.56173
## sample estimates:
## mean in group no mean in group yes
## 187.6969 149.4332
- Since the p-value of 0.1499 is greater than 0.05 and 0 lies inside the confidence interval, we lack the evidence to reject the null hypothesis that there is no difference between the mean monthly credit card expenditures of self-employed and non-self-employed people.
- In other words, the data do not show a statistically significant difference in mean monthly credit card expenditure between self-employed and non-self-employed people.
exp_house <- lm(expenditure ~ owner, data = CreditCard)
get_regression_table(exp_house)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 162.559 | 9.981 | 16.287 | 0.000 | 142.980 | 182.139 |
| owner: yes | 51.075 | 15.038 | 3.396 | 0.001 | 21.573 | 80.576 |
Regression Equation:
\[ \widehat{expenditure} = 162.559 + 51.075 \times 1_{\mbox{owner}}(x) \]
Coefficient interpretation:
The intercept of 162.559 represents the average monthly credit card expenditure when the card holder does not own a house.
The slope of 51.075 represents the predicted difference in the average monthly credit card expenditures between a house owner and a non-owner.
Test the hypothesis of interest:
\[ H_0: \mu_{owner} = \mu_{non-owner} \]
\[ H_A: \mu_{owner} \ne \mu_{non-owner} \]
t.test(expenditure ~ owner, data = CreditCard)
##
## Welch Two Sample t-test
##
## data: expenditure by owner
## t = -3.3408, df = 1154.6, p-value = 0.0008621
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -81.07022 -21.07906
## sample estimates:
## mean in group no mean in group yes
## 162.5594 213.6341
Based on the evidence (p-value is much smaller than 0.05 and 0 is not in the confidence interval), we reject the null hypothesis and conclude that there is a difference between the mean monthly credit card expenditures of a house owner and a non-owner.
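Note that the t-test reports the difference as "no" minus "yes", so its sign is the opposite of the `owner: yes` coefficient in the regression table. A small check using the printed values (variable names are illustrative):

```r
# Values copied from the Welch t-test output above
mean_no  <- 162.5594
mean_yes <- 213.6341
ci       <- c(-81.07022, -21.07906)   # 95% CI for (no - yes)

diff_no_yes <- mean_no - mean_yes     # -51.0747

# The point estimate sits at the centre of the interval, and the interval excludes 0
abs(mean(ci) - diff_no_yes) < 0.001   # TRUE
ci[2] < 0                             # TRUE

# Flipping the sign recovers the regression coefficient (51.075, up to rounding)
-diff_no_yes
```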
ggplot(CreditCard, aes(x = income, y = expenditure, color = selfemp))+
geom_point(alpha = 0.2)+
geom_smooth(method = "lm", se = F)+
labs(title = "Credit card expenditure, income and self-employment", color = "Self-Employment", x = "Yearly Income", y = "Monthly credit card expenditure")
## `geom_smooth()` using formula 'y ~ x'
reg_para <- lm(expenditure ~ income + selfemp, data = CreditCard)
summary(reg_para)
##
## Call:
## lm(formula = expenditure ~ income + selfemp, data = CreditCard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -544.96 -138.90 -68.04 65.74 2485.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.935 15.980 2.124 0.0339 *
## income 46.403 4.268 10.873 <2e-16 ***
## selfempyes -73.078 28.513 -2.563 0.0105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260.8 on 1316 degrees of freedom
## Multiple R-squared: 0.08359, Adjusted R-squared: 0.0822
## F-statistic: 60.02 on 2 and 1316 DF, p-value: < 2.2e-16
reg_inter <- lm(expenditure ~ income * selfemp, data = CreditCard)
summary(reg_inter)
##
## Call:
## lm(formula = expenditure ~ income * selfemp, data = CreditCard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -552.95 -138.83 -68.29 65.25 2478.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.336 16.704 1.876 0.0609 .
## income 47.187 4.513 10.456 <2e-16 ***
## selfempyes -43.372 62.351 -0.696 0.4868
## income:selfempyes -7.454 13.914 -0.536 0.5922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260.9 on 1315 degrees of freedom
## Multiple R-squared: 0.08379, Adjusted R-squared: 0.0817
## F-statistic: 40.09 on 3 and 1315 DF, p-value: < 2.2e-16
Based on the graph and the regression summaries, the parallel slopes model is preferred: it has a slightly higher adjusted R-squared (0.0822 versus 0.0817), the interaction term is not statistically significant (p = 0.5922), and it gives a simpler prediction equation than the interaction model.
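The adjusted R-squared values in the two summaries can be reproduced from the multiple R-squared, the sample size and the number of predictors. The sketch below uses the standard formula; the helper name `adj_r2` is introduced here for illustration and is not part of any package used above.

```r
# Adjusted R-squared: penalises R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 1319

# Parallel slopes model: income + selfemp (p = 2)
adj_parallel <- adj_r2(0.08359, n_obs, 2)
# Interaction model: income * selfemp (p = 3)
adj_interaction <- adj_r2(0.08379, n_obs, 3)

round(adj_parallel, 4)      # 0.0822
round(adj_interaction, 4)   # 0.0817
```

The interaction model has the higher multiple R-squared, but the extra term does not explain enough to offset the penalty.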
exp_inc_parallel <- lm(expenditure ~ income + selfemp, data = CreditCard)
get_regression_table(exp_inc_parallel)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 33.935 | 15.980 | 2.124 | 0.034 | 2.585 | 65.285 |
| income | 46.403 | 4.268 | 10.873 | 0.000 | 38.031 | 54.776 |
| selfemp: yes | -73.078 | 28.513 | -2.563 | 0.010 | -129.015 | -17.141 |
Regression Equation:
\[ \widehat{expenditure} = 33.935 + 46.403 \times income - 73.078 \times 1_{\mbox{selfemp}}(x) \]
Coefficient interpretation:
The intercept of 33.935 represents the predicted average monthly credit card expenditure for a card holder who is not self-employed and whose yearly income equals zero.
The slope of 46.403: For every increase in income of 10,000 USD (yearly income is recorded in 10,000 USD), there is an associated increase in average monthly credit card expenditure of 46.403 USD, regardless of self-employment status.
The coefficient of -73.078: The predicted difference in average monthly credit card expenditure between self-employed and non-self-employed people with the same income.
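To make the parallel slopes structure concrete, the equation can be evaluated for both groups at the same income. This sketch hard-codes the rounded coefficients from the table; the function name `predict_parallel` is hypothetical.

```r
# Rounded coefficients from the parallel slopes regression table above
predict_parallel <- function(income, selfemp) {
  # income in units of USD 10,000; selfemp is TRUE or FALSE
  33.935 + 46.403 * income - 73.078 * as.numeric(selfemp)
}

pred_no  <- predict_parallel(3, selfemp = FALSE)   # 173.144
pred_yes <- predict_parallel(3, selfemp = TRUE)    # 100.066

# The gap between the two lines is the same at every income level
pred_yes - pred_no                                               # -73.078
predict_parallel(5, TRUE) - predict_parallel(5, FALSE)           # -73.078
```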
Test the hypothesis of interest:
For income:
\[ H_0: \beta_ {inc} = 0 \]
\[ H_A: \beta_ {inc} \ne 0 \]
For selfemp:
\[ H_0: \beta_{selfemp} = 0 \]
\[ H_A: \beta_{selfemp} \ne 0 \]
According to the regression table, the p-value for income is reported as 0 (that is, below 0.0005), so we reject the null hypothesis and conclude that there is a statistically significant relationship between income and expenditure. The p-value for selfemp is 0.010, which is smaller than 0.05, so we also reject the null hypothesis for selfemp: after controlling for income, there is a statistically significant relationship between monthly credit card expenditure and self-employment status. This differs from the two-sample t-test above, which did not control for income.
CreditCard_share <- CreditCard %>%
mutate(shareInPercent = share*100)
ggplot(CreditCard_share, aes(x = income, y = shareInPercent))+
geom_point(alpha = 0.2, color = "#90EE90")+
geom_smooth(method = "lm", se = F, color = "#006400")+
labs(title = "Expenditure ratio \n& Income", x = "Income", y = "Expenditure ratio")+
theme(text = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(size = 10),
legend.text = element_text(size = 8))-> p5
ggplot(CreditCard_share, aes(x = expenditure, y = shareInPercent))+
geom_point(color = "#90EE90", alpha = 0.2)+
geom_smooth(method = "lm", se = F, color = "#006400")+
labs(title = "Expenditure ratio \n& Expenditure level", x = "Credit card expenditure", y = "Expenditure ratio")+
theme(text = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(size = 10),
legend.text = element_text(size = 8))-> p6
grid.arrange(p5, p6, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
plot <- scatterplot3d(x = CreditCard_share$income,
y= CreditCard_share$expenditure,
z= CreditCard_share$shareInPercent,
xlab = "Income", ylab = "Expenditure", zlab = "Share",
highlight.3d = TRUE, angle = 55,
cex.axis = 0.5,
cex.lab = 0.8, main = "Income, Expenditure & Share", pch = 20)
fit <- lm(shareInPercent ~ income + expenditure,data = CreditCard_share)
plot$plane3d(fit, lty.box = "solid")
Because most of the observations are concentrated in the low-income, low-expenditure corner, most of the data was collected from people with modest incomes and low credit card expenditure. Therefore, implications from this dataset will better represent people with these characteristics than exceptionally high-income or extravagant credit card users.
share_model <- lm(shareInPercent ~ income + expenditure, data = CreditCard_share)
get_regression_table(share_model)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 6.832 | 0.263 | 25.959 | 0 | 6.316 | 7.348 |
| income | -1.761 | 0.073 | -24.229 | 0 | -1.903 | -1.618 |
| expenditure | 0.032 | 0.000 | 71.306 | 0 | 0.031 | 0.033 |
Regression equation:
\[ \widehat{share} = 6.832 - 1.761 \times income + 0.032 \times expenditure \]
Coefficient interpretation:
The intercept of 6.832 represents the average ratio (in %) of monthly credit card expenditure to yearly income when the card holder’s yearly income and credit card expenditure both equal zero.
The slope of -1.761: For every increase in income of 10,000 USD (yearly income is recorded in 10,000 USD), there is an associated decrease in the average ratio (in %) of monthly credit card expenditure to yearly income of 1.761 percentage points, holding expenditure constant.
The slope of 0.032: For every increase in monthly credit card expenditure of 1 USD, there is an associated increase in the average ratio (in %) of monthly credit card expenditure to yearly income of 0.032 percentage points, holding income constant.
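The two slopes pull in opposite directions, which is easiest to see with numbers. The sketch below evaluates the fitted plane using the rounded coefficients from the table; the function name `predict_share` and the example inputs are illustrative assumptions.

```r
# Rounded coefficients from the share regression table above
predict_share <- function(income, expenditure) {
  # income in units of USD 10,000; expenditure in USD per month; result in percent
  6.832 - 1.761 * income + 0.032 * expenditure
}

# Predicted ratio at 30,000 USD yearly income and 185 USD monthly expenditure
predict_share(3, 185)   # 7.469

# Extra monthly expenditure needed to offset a 10,000 USD rise in income
offset_usd <- 1.761 / 0.032
offset_usd              # about 55 USD
```

Because 0.032 is itself rounded to three decimals, the offset figure is only approximate.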
There are three main takeaways from analyzing the
CreditCard dataset:
As people age, they tend to earn higher incomes but spend less by credit card. Additionally, house owners generally have a higher income and higher credit card expenditure, and they tend to be older.
Whilst self-employed credit card users earn more, they tend to have lower credit card expenditure.
Monthly Credit Card Expenditure has a positive correlation with Income. The levels of expenditure are on average lower for self-employed people.
\[ \widehat{expenditure} = 33.935 + 46.403 \times income - 73.078 \times 1_{\mbox{selfemp}}(x) \]
There is a statistically significant relationship between credit card expenditure and house ownership; in the two-sample t-test, however, there is no statistically significant difference in credit card expenditure by self-employment status. Hence, on its own, house ownership can be a better indicator of credit card expenditure than self-employment status.
The average ratio (in %) of monthly credit card expenditure to yearly income has a negative correlation with income levels, but has a positive correlation with expenditure level. One interesting implication from this is that as people’s income increases, the portion of their credit card expenditure with respect to their income decreases.
\[ \widehat{share} = 6.832 - 1.761 \times income + 0.032 \times expenditure \]
Note: