Getting your very first credit card can be difficult. Without any credit history, you look like a greater risk to credit card companies than applicants with an established credit score. If you are not declined outright, you will likely start with an extremely low credit line. This is especially a problem if you apply for your first card later in life, when you need to make large purchases for which a credit card would be useful. Ted Rosen, writer for CreditCards.com and author of “Young adults shouldn’t wait too long to get started with credit,” explained how his younger brother “tried to get one last year at age 25, but was rejected because he lacked a credit history. He was stunned: he paid his rent, utility and other bills on time every month. What did the credit card company mean that he didn’t have a credit history?” I had the same experience during my first year of college, which makes this an interesting topic for my data analysis project.
For this project, I will examine which factors credit card companies use to approve or deny applicants. The factors considered include education, gender, ethnicity, prior default, income, employment, and years employed. The results should help show what steps new applicants can take to increase their chances of being approved.
The data comes from the UC Irvine Machine Learning Repository. It contains 690 rows, which is small enough for my development environment. The attributes are gender, age, debt, married, bank customer, education level, ethnicity, years employed, prior default, employed, credit score, driver’s license, citizen, zip code, income, and whether the application was approved.
References: https://www.creditcards.com/credit-card-news/first-card-dont-wait-too-long.php http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html https://www.datacamp.com/community/tutorials/fftrees-tutorial https://www.statisticssolutions.com/non-parametric-analysis-chi-square/
Data source from: http://archive.ics.uci.edu/ml/datasets/Credit+Approval
From UCI: All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
The attribute names are missing from the dataset. The names used here come from Ryan Khun’s Analysis of Credit Approval Data. The values of some variables, such as education level and ethnicity, remain uninterpretable.
Import and transform the data
library(readr)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ ggplot2 3.0.0 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
credit <- read_csv("/Users/ryanmcmahon/Documents/Saint Martin's/Fall 2018/CSC 530 Data Analysis/Assignments/Data Analysis Project/crx.data.txt", col_names = F, col_types = cols(X1 = col_character(),
X10 = col_logical(),
X12 = col_logical(), X16 = col_character(),
X9 = col_logical(), X11 = col_integer()))
# Assign readable column names (the raw file has none)
colnames(credit) = c("Male", "Age", "Debt", "Married", "BankCustomer", "EducationLevel", "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore", "DriversLicense", "Citizen", "ZipCode", "Income", "Approved")
credit = credit %>%
mutate(Approved = Approved == "+") %>%
mutate(PriorDefault = !PriorDefault)
str(credit)
## Classes 'tbl_df', 'tbl' and 'data.frame': 690 obs. of 16 variables:
## $ Male : chr "b" "a" "a" "b" ...
## $ Age : chr "30.83" "58.67" "24.50" "27.83" ...
## $ Debt : num 0 4.46 0.5 1.54 5.62 ...
## $ Married : chr "u" "u" "u" "u" ...
## $ BankCustomer : chr "g" "g" "g" "g" ...
## $ EducationLevel: chr "w" "q" "q" "w" ...
## $ Ethnicity : chr "v" "h" "h" "v" ...
## $ YearsEmployed : num 1.25 3.04 1.5 3.75 1.71 ...
## $ PriorDefault : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Employed : logi TRUE TRUE FALSE TRUE FALSE FALSE ...
## $ CreditScore : int 1 6 0 5 0 0 0 0 0 0 ...
## $ DriversLicense: logi FALSE FALSE FALSE TRUE FALSE TRUE ...
## $ Citizen : chr "g" "g" "g" "g" ...
## $ ZipCode : chr "00202" "00043" "00280" "00100" ...
## $ Income : int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ Approved : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
The gender column is coded as a and b. Since males generally have a higher income, we can guess which code represents males by checking which group has the higher income.
library(ggplot2)
incomeBar = ggplot(credit, aes(Male, Income, fill = Male)) +
geom_col()
incomeBar
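Based on the graph, b has the larger income, so I treat b as male. Note that geom_col() stacks the income of every row, so each bar shows the group's total income, which also depends on how many applicants are in the group.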
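A per-applicant summary can give a fairer comparison than a stacked total. The following is a minimal sketch on a small made-up data frame; the `toy` values are hypothetical and are not taken from the credit data. The same `aggregate()` call could be applied to `credit` before the Male column is recoded.

```r
# Hypothetical toy data: group "b" has more rows but a lower typical income
toy <- data.frame(
  Male   = c("a", "a", "b", "b", "b", "b"),
  Income = c(100, 300, 50, 50, 100, 300)
)

# Total income per group (what a stacked geom_col() bar shows)
income_total <- aggregate(Income ~ Male, data = toy, FUN = sum)

# Median income per group (a per-applicant comparison)
income_median <- aggregate(Income ~ Male, data = toy, FUN = median)

print(income_total)
print(income_median)
```

In this toy example the total is larger for b while the median is larger for a, which is why the two summaries should not be used interchangeably.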
Mutate the Male column to be a logical data type with b as true.
credit = credit %>%
mutate(Male = Male == "b")
str(credit)
## Classes 'tbl_df', 'tbl' and 'data.frame': 690 obs. of 16 variables:
## $ Male : logi TRUE FALSE FALSE TRUE TRUE TRUE ...
## $ Age : chr "30.83" "58.67" "24.50" "27.83" ...
## $ Debt : num 0 4.46 0.5 1.54 5.62 ...
## $ Married : chr "u" "u" "u" "u" ...
## $ BankCustomer : chr "g" "g" "g" "g" ...
## $ EducationLevel: chr "w" "q" "q" "w" ...
## $ Ethnicity : chr "v" "h" "h" "v" ...
## $ YearsEmployed : num 1.25 3.04 1.5 3.75 1.71 ...
## $ PriorDefault : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Employed : logi TRUE TRUE FALSE TRUE FALSE FALSE ...
## $ CreditScore : int 1 6 0 5 0 0 0 0 0 0 ...
## $ DriversLicense: logi FALSE FALSE FALSE TRUE FALSE TRUE ...
## $ Citizen : chr "g" "g" "g" "g" ...
## $ ZipCode : chr "00202" "00043" "00280" "00100" ...
## $ Income : int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ Approved : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
I created a boxplot to compare the income distribution of approved and denied credit card applications. Income is plotted on a log10 scale, with 0.01 added so that incomes of zero can be transformed.
li = log10(credit$Income + 0.01)
boxplot(li~credit$Approved)
This plot shows that only applicants with a yearly income below approximately $10,000 were denied, although some applicants in that range were still approved. Most approved applicants have an income of around $20,000.
I created a bar plot to show the effect that a prior default has on being approved for a credit card.
ggplot(credit, aes(factor(PriorDefault, labels = c("No prior default", "Prior default")), fill = Approved)) +
geom_bar() +
scale_fill_brewer(palette = "RdYlGn") +
ggtitle("Prior Default - Approved Percentage") +
xlab("Prior Default")
The plot shows that a prior default has a major influence on being denied a credit card. About three quarters of the applications without a prior default were approved, whereas only about 7% of the applications with a prior default were approved.
Here is another bar plot, showing the effect that being employed has on being approved for a credit card.
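Approval rates like these can be computed directly instead of being read off a count plot. This is a minimal sketch using base R's `table()` and `prop.table()` on a hypothetical toy data frame (not the credit data); in ggplot2, `geom_bar(position = "fill")` draws the same proportions.

```r
# Hypothetical toy data (not the credit data set)
toy <- data.frame(
  PriorDefault = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Approved     = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Cross-tabulate, then convert each row to proportions (margin = 1)
counts <- table(PriorDefault = toy$PriorDefault, Approved = toy$Approved)
rates  <- prop.table(counts, margin = 1)

print(round(rates, 2))
```

Each row of `rates` sums to 1, so the "TRUE" column is the approval rate within that prior-default group.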
ggplot(credit, aes(factor(Employed, labels = c("Not employed", "Employed")), fill = Approved)) +
geom_bar() +
ggtitle("Employed - Approved Percentage") +
xlab("Employed")
Three quarters of the unemployed people were denied a credit card, whereas only a third of the employed people were denied. This means being employed has a large influence on being approved for a credit card.
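This bar plot shows the share of approved applications per income range.
creditIncPctA = credit %>%
# Split income into 100 equal-width bins, numbered 1 to 100
mutate(IncomeCategory = cut(Income, 100, labels = FALSE)) %>%
# Group by bin only; also grouping by Approved would force every mean to 0 or 1
group_by(IncomeCategory) %>%
summarize(pctA = mean(Approved))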
ggplot(creditIncPctA, aes(IncomeCategory, pctA)) +
geom_col() +
ggtitle("Income - Approved Percentage") +
xlab("Income") +
ylab("Approved Percentage")
The graph shows that a large percentage of applications with little to no income are approved. This indicates that factors other than income affect approval.
This bar plot checks for an association between ethnicity and being approved for a credit card.
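Because income is heavily skewed, equal-width bins from `cut(Income, 100)` tend to put almost every applicant in the first bin. One alternative, sketched here on hypothetical toy values rather than the credit data, is to bin by quantiles so each bin holds a similar number of applicants.

```r
# Hypothetical, heavily skewed income values (not the credit data set)
income <- c(0, 0, 1, 5, 10, 100, 1000, 50000)

# Equal-width bins: nearly everything lands in bin 1
equal_width <- cut(income, 4, labels = FALSE)

# Quantile-based bins: breaks at the quartiles, so bins are similarly filled
breaks <- unique(quantile(income, probs = seq(0, 1, 0.25)))
quantile_bin <- cut(income, breaks, include.lowest = TRUE, labels = FALSE)

print(table(equal_width))
print(table(quantile_bin))
```

`unique()` guards against duplicate breaks, which `cut()` rejects and which can occur when many incomes are zero.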
ggplot(credit, aes(Ethnicity, fill = Approved)) +
geom_bar()
There are a large number of applications with the v value, but around half of them are denied. From the plot alone, there seems to be no association between ethnicity and being approved for a credit card.
I ran a chi-squared test to check whether there is an association between ethnicity and being approved for a credit card.
eth = table(credit$Ethnicity)
ethAppTbl = table(credit$Ethnicity, credit$Approved)
ethAppTbl = cbind(eth, ethAppTbl)
ethApp = data.frame(ethAppTbl) %>%
select(eth, "TRUE.")
colnames(ethApp) = c("EthFreq", "True")
#View(ethApp)
chisq.test(ethApp)
## Warning in chisq.test(ethApp): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: ethApp
## X-squared = 18.471, df = 9, p-value = 0.03008
The p-value from the chi-squared test (0.03) is less than 0.05, which means the null hypothesis of independence is rejected at the 5% level. Contrary to the impression from the bar plot, the test suggests there may be an association between ethnicity and being approved for a credit card. The warning indicates that some expected counts are small, so this result should be treated with caution.
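The table tested above pairs each level's total count with its approved count. A more conventional test of independence uses the cross-tabulation of denied and approved counts per level. The following is a minimal sketch of that form on hypothetical toy data (the `Group` levels and counts are made up, not taken from the credit data); applied to this analysis it would be `chisq.test(table(credit$Ethnicity, credit$Approved))`, whose result is not shown here.

```r
# Hypothetical toy data (not the credit data set): three groups of 40
toy <- data.frame(
  Group    = rep(c("x", "y", "z"), times = c(40, 40, 40)),
  Approved = c(rep(c(TRUE, FALSE), times = c(30, 10)),   # x: 30 approved
               rep(c(TRUE, FALSE), times = c(20, 20)),   # y: 20 approved
               rep(c(TRUE, FALSE), times = c(10, 30)))   # z: 10 approved
)

# Rows are groups; columns are denied (FALSE) and approved (TRUE) counts
tab <- table(toy$Group, toy$Approved)
result <- chisq.test(tab)
print(tab)
print(result)

# With small expected counts, a Monte Carlo p-value is one alternative
set.seed(1)
result_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(result_sim$p.value)
```

A small p-value here means the approval rate differs between groups by more than chance would explain; a large one means the data are consistent with independence.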
Next I test if there is an association between gender and being approved for a credit card.
male = table(credit$Male)
maleAppTbl = table(credit$Male, credit$Approved)
maleAppTbl = cbind(male, maleAppTbl)
maleApp = data.frame(maleAppTbl)
maleApp = maleApp %>%
select(male, "TRUE.")
colnames(maleApp) = c("MaleFreq", "True")
chisq.test(maleApp)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: maleApp
## X-squared = 0.023271, df = 1, p-value = 0.8788
The p-value from the chi-squared test (0.88) is greater than 0.05, which means the null hypothesis of independence cannot be rejected. The test gives no evidence of an association between gender and being approved for a credit card.
To check the chi-squared result visually, I created a bar plot of the number of approved and denied applications for males and females.
ggplot(credit, aes(Male, fill = Approved)) +
geom_bar()
40% of males are approved and 44% of females are approved. The difference is small, which agrees with the chi-squared p-value being well above 0.05. I believe there is no association between gender and being approved for a credit card application.
I then created a bar plot showing the number of approved credit cards per education level. The values have no interpretable meaning, so we can only look for signs that some values are favored over others.
ggplot(credit, aes(EducationLevel, fill = Approved)) +
geom_bar()
Some education levels have a large number of approved applications and others a large number of denied applications. It is possible that education level has an effect on the chances of being approved for a credit card.
ed = table(credit$EducationLevel)
edAppTbl = table(credit$EducationLevel, credit$Approved)
edAppTbl = cbind(ed, edAppTbl)
edApp = data.frame(edAppTbl)
edApp = edApp %>%
select(ed, "TRUE.")
colnames(edApp) = c("EdFreq", "True")
chisq.test(edApp)
## Warning in chisq.test(edApp): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: edApp
## X-squared = 37.974, df = 14, p-value = 0.0005245
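The p-value (0.0005) is less than 0.05, so the null hypothesis of independence is rejected, which indicates there may be an association between education level and being approved for a credit card. The warning again means that some expected counts are small.
I created a boxplot showing the distribution of years employed for approved and denied credit card applications.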
ggplot(credit, aes(factor(Approved, labels = c("Denied", "Approved")), YearsEmployed)) +
geom_boxplot() +
ggtitle("Years Employed - Approved") +
xlab("Approved") +
ylab("Years Employed")
The boxplot shows that the majority of the denied applicants had been employed for less than two years, with outliers ranging up to more than ten. The majority of the approved applicants had been employed for more than two years. The outliers may indicate that other factors are involved in being denied a credit card.
Lastly, I created a fast-and-frugal tree, which automatically finds the attributes that best predict approval and builds a short decision algorithm from them. In the plot, the triangles are approved credit cards and the circles are denied credit cards.
library(FFTrees)
##
## O
## / \
## F O
## / \
## F Trees 1.4.0
##
## Nathaniel.D.Phillips.is@gmail.com
## FFTrees.guide() opens the guide.
# Cannot have factors with levels greater than 53. Must remove age and zip code
ffTreeCredit = credit %>%
select(c("Male", "Debt", "Married", "BankCustomer", "EducationLevel", "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore", "DriversLicense", "Citizen", "Income", "Approved"))
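set.seed(199)
# Shuffle the row indices so the split is random rather than in file order
rows = sample(nrow(ffTreeCredit))
ffTreeCredit = ffTreeCredit[rows, ]
split = round(nrow(ffTreeCredit) * 0.7)
creditTraining = ffTreeCredit[1:split, ]
creditTest = ffTreeCredit[(split + 1):nrow(ffTreeCredit), ]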
ffTree = FFTrees(formula = Approved ~ .,
data = creditTraining,
data.test = creditTest,
main = "Credit Card Approval",
decision.labels = c("Denied", "Approved"))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
plot(ffTree, data = "test")
The fast-and-frugal tree automatically found that having a prior default, being employed, having a credit score greater than 0, and being employed for more than a year all have an impact on being approved for a credit card. It created a tree/algorithm based on these variables with an 85% accuracy. It first checks whether the applicant had a prior default; if so, it predicts that they will be denied. Next it checks whether they are both employed and have a credit score greater than 0, which results in a prediction of approved. Last, it checks whether they have worked for more than one year and predicts they will be approved if they have. I believe this last condition is fairly inaccurate, since the numbers of correct and incorrect predictions are roughly even.
I believe the major factors in being approved for a credit card are not having a prior default, being employed, and having a good credit score. Other factors, such as the number of years employed, may also have an effect, but they do not have as much influence as those three.
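The decision sequence described above can be written out as a plain function. This is a hand-written paraphrase of that description, not output from FFTrees; the function name `predict_approval` and its argument names are hypothetical, and the thresholds are the ones stated in the paragraph.

```r
# Hand-written paraphrase of the tree described in the text; not FFTrees output.
# Function and argument names are hypothetical.
predict_approval <- function(prior_default, employed, credit_score, years_employed) {
  # Step 1: a prior default leads straight to a denial
  if (prior_default) return(FALSE)
  # Step 2: employed with a credit score above 0 leads to an approval
  if (employed && credit_score > 0) return(TRUE)
  # Step 3: otherwise approve only if employed for more than one year
  years_employed > 1
}

print(predict_approval(TRUE,  TRUE,  5, 10))   # prior default
print(predict_approval(FALSE, TRUE,  3, 0.5))  # employed, score above 0
print(predict_approval(FALSE, FALSE, 0, 2))    # falls through to years employed
print(predict_approval(FALSE, TRUE,  0, 0.5))  # falls through, under one year
```

Writing the rules this way makes the order of the checks explicit: later conditions are only reached by applicants who were not decided by an earlier one.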