library(tidyverse)
library(GGally)
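MyCC_data <- read_csv("application_data.csv")
# Keep the ten variables analysed below (this step is inferred from the printed tibbles)
cc_data <- MyCC_data %>%
  select(CODE_GENDER, FLAG_OWN_REALTY, NAME_EDUCATION_TYPE, NAME_INCOME_TYPE, TARGET, AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, DAYS_BIRTH, DAYS_EMPLOYED)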
This data set comes from a credit card company, and my goal is to find the factors associated with late payment. There are 307,511 observations and 122 variables in total; 16 of the variables are categorical and the rest are numerical. My target variable is “TARGET”. I focus on the following ten variables:
CODE_GENDER : Categorical. Gender of the client.
FLAG_OWN_REALTY : Categorical. Flag if client owns a house or flat.
NAME_INCOME_TYPE : Categorical. Client’s income type (businessman, working, maternity leave, …).
NAME_EDUCATION_TYPE : Categorical. Level of highest education the client achieved
TARGET : Categorical. Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases).
AMT_INCOME_TOTAL : Numerical. Income of the client.
AMT_CREDIT : Numerical. Credit amount of the loan.
AMT_ANNUITY : Numerical. Loan annuity.
DAYS_BIRTH : Numerical. Client’s age in days at the time of application
DAYS_EMPLOYED : Numerical. How many days before the application the person started current employment
table(cc_data$NAME_EDUCATION_TYPE)
##
## Academic degree Higher education
## 164 74863
## Incomplete higher Lower secondary
## 10277 3816
## Secondary / secondary special
## 218391
ggplot(cc_data) + geom_bar(aes(NAME_EDUCATION_TYPE)) + coord_flip()
table(cc_data$TARGET)
##
## 0 1
## 282686 24825
ggplot(cc_data) + geom_bar(aes(TARGET))
table(cc_data$CODE_GENDER)
##
## F M XNA
## 202448 105059 4
ggplot(cc_data) + geom_bar(aes(CODE_GENDER))
table(cc_data$FLAG_OWN_REALTY)
##
## N Y
## 94199 213312
ggplot(cc_data) + geom_bar(aes(FLAG_OWN_REALTY))
summary(cc_data$AMT_INCOME_TOTAL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25650 112500 147150 168798 202500 117000000
ggplot(cc_data) + geom_histogram(aes(AMT_INCOME_TOTAL))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(cc_data$AMT_CREDIT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45000 270000 513531 599026 808650 4050000
ggplot(cc_data) + geom_histogram(aes(AMT_CREDIT))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(cc_data$AMT_ANNUITY)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1616 16524 24903 27109 34596 258026 12
ggplot(cc_data) + geom_histogram(aes(AMT_ANNUITY))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
summary(cc_data$DAYS_BIRTH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25229 -19682 -15750 -16037 -12413 -7489
ggplot(cc_data) + geom_histogram(aes(DAYS_BIRTH))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(cc_data$DAYS_EMPLOYED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -17912 -2760 -1213 63815 -289 365243
ggplot(cc_data) + geom_histogram(aes(DAYS_EMPLOYED))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
About 8% of credit card holders have at least one late payment; about 71% have secondary education; there are about twice as many female holders as male holders; about two thirds own a house or flat; 20% have a commercial associate income type, 20% are government employees or pensioners, and 45% are working class. The average age is about 44 years, the median time in the current job is about 3.3 years, the average income is about $169K, the average credit amount is about $599K, and the average annuity is about $27K.
The column “DAYS_EMPLOYED” looks odd, though. It records how many days before the application the client started the current job, so it should be negative, but some values are positive and far too large to be plausible (the maximum is 365,243 days, roughly 1,000 years). I filter those rows out:
cc_data1 <- cc_data %>%
filter(DAYS_EMPLOYED<0)%>%
print()
## # A tibble: 252,135 × 10
## CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M Y Second… Working 1 202500 4.07e5 24700. -9461
## 2 F N Higher… State … 0 270000 1.29e6 35698. -16765
## 3 M Y Second… Working 0 67500 1.35e5 6750 -19046
## 4 F Y Second… Working 0 135000 3.13e5 29686. -19005
## 5 M Y Second… Working 0 121500 5.13e5 21866. -19932
## 6 M Y Second… State … 0 99000 4.90e5 27518. -16941
## 7 F Y Higher… Commer… 0 171000 1.56e6 41301 -13778
## 8 M Y Higher… State … 0 360000 1.53e6 42075 -18850
## 9 M Y Second… Working 0 135000 4.05e5 20250 -14469
## 10 F Y Higher… Working 0 112500 6.52e5 21177 -10197
## # … with 252,125 more rows, 1 more variable: DAYS_EMPLOYED <dbl>, and
## # abbreviated variable names ¹FLAG_OWN_REALTY, ²NAME_EDUCATION_TYPE,
## # ³NAME_INCOME_TYPE, ⁴AMT_INCOME_TOTAL, ⁵AMT_CREDIT, ⁶AMT_ANNUITY,
## # ⁷DAYS_BIRTH
ggplot(cc_data1) + geom_histogram(aes(DAYS_EMPLOYED))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
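Before dropping the positive values, it can help to see how many there are and whether they all share one value. The following is a minimal sketch in base R, not part of the original analysis; the helper name `summarise_positive_days` is hypothetical and a small made-up vector stands in for the real column.

```r
# Hypothetical helper: summarise the suspicious positive entries of a
# "days before application" column, which should be zero or negative.
summarise_positive_days <- function(days) {
  positive <- days[!is.na(days) & days > 0]
  list(
    n_positive      = length(positive),
    share_positive  = length(positive) / length(days),
    distinct_values = sort(unique(positive))
  )
}

# Toy data for illustration only (not taken from application_data.csv)
toy_days <- c(-637, -1188, 365243, -225, 365243, -3039)
res <- summarise_positive_days(toy_days)
res$n_positive       # number of positive entries
res$distinct_values  # which positive values occur
```

On the real data the call would be `summarise_positive_days(cc_data$DAYS_EMPLOYED)`. If only one distinct positive value shows up, it is probably a placeholder code rather than a real duration, and recoding it to `NA` would be an alternative to removing the rows.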
Then I check whether there are any missing values in the data set.
map(cc_data, ~ sum(is.na(.)))
## $CODE_GENDER
## [1] 0
##
## $FLAG_OWN_REALTY
## [1] 0
##
## $NAME_EDUCATION_TYPE
## [1] 0
##
## $NAME_INCOME_TYPE
## [1] 0
##
## $TARGET
## [1] 0
##
## $AMT_INCOME_TOTAL
## [1] 0
##
## $AMT_CREDIT
## [1] 0
##
## $AMT_ANNUITY
## [1] 12
##
## $DAYS_BIRTH
## [1] 0
##
## $DAYS_EMPLOYED
## [1] 0
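The `map()` call prints one list element per column, which gets long. A more compact alternative, sketched here in base R on a made-up data frame (the values are illustrative only), is to count the missing values with `colSums()` and show only the columns that have any.

```r
# Toy data frame for illustration only; column names mirror the real data
toy <- data.frame(
  TARGET      = c(0, 1, 0, 0),
  AMT_CREDIT  = c(400000, 1300000, 135000, 310000),
  AMT_ANNUITY = c(24700, NA, 6750, NA)
)

# Count missing values per column, then keep only columns with at least one
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]
```

Applied to the real data this would be `colSums(is.na(cc_data))`.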
The data are fairly clean: only AMT_ANNUITY has missing values (12 of them). Next I plot the variables against each other with ggpairs:
ggpairs(cc_data)
## Warning: Removed 12 rows containing non-finite values (`stat_boxplot()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 12 rows containing missing values
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_density()`).
I don’t see any variable with a strong relationship to “TARGET” in these plots, so I continue by testing “TARGET” against each of the other variables I chose: chi-squared tests for the categorical variables and Welch t-tests for the numerical ones.
chisq.test(table(cc_data$TARGET, cc_data$CODE_GENDER))
##
## Pearson's Chi-squared test
##
## data: table(cc_data$TARGET, cc_data$CODE_GENDER)
## X-squared = 920.79, df = 2, p-value < 2.2e-16
chisq.test(table(cc_data$TARGET, cc_data$FLAG_OWN_REALTY))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(cc_data$TARGET, cc_data$FLAG_OWN_REALTY)
## X-squared = 11.576, df = 1, p-value = 0.0006681
chisq.test(table(cc_data$TARGET, cc_data$NAME_EDUCATION_TYPE))
##
## Pearson's Chi-squared test
##
## data: table(cc_data$TARGET, cc_data$NAME_EDUCATION_TYPE)
## X-squared = 1019.2, df = 4, p-value < 2.2e-16
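With more than 300,000 rows, a chi-squared test can give a tiny p-value even when the difference between groups is small, so an effect size is a useful companion. The sketch below computes Cramér's V, sqrt(chi-squared / (n * (min(rows, cols) - 1))), from the statistics printed above. It is an addition for illustration, and the function name `cramers_v` is my own.

```r
# Cramér's V from a chi-squared statistic, sample size and table dimensions
cramers_v <- function(chi_sq, n, n_rows, n_cols) {
  sqrt(chi_sq / (n * (min(n_rows, n_cols) - 1)))
}

n_obs <- 307511  # number of observations in cc_data

# Statistics copied from the chi-squared outputs above; TARGET has 2 levels
v_gender    <- cramers_v(920.79, n_obs, 2, 3)
v_realty    <- cramers_v(11.576, n_obs, 2, 2)  # statistic includes Yates' correction
v_education <- cramers_v(1019.2, n_obs, 2, 5)

round(c(gender = v_gender, realty = v_realty, education = v_education), 3)
```

All three values come out below 0.1, which by common rules of thumb indicates a weak association, so the very small p-values mostly reflect the sample size.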
data1 <- cc_data$AMT_INCOME_TOTAL[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_INCOME_TOTAL[cc_data$TARGET == "1"]
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 0.73067, df = 24920, p-value = 0.465
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5831.714 12763.636
## sample estimates:
## mean of x mean of y
## 169077.7 165611.8
data1 <- cc_data$AMT_CREDIT[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_CREDIT[cc_data$TARGET == "1"]
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 19.273, df = 31161, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 40306.60 49432.91
## sample estimates:
## mean of x mean of y
## 602648.3 557778.5
data1 <- cc_data$AMT_ANNUITY[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_ANNUITY[cc_data$TARGET == "1"]
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 8.1473, df = 31195, p-value = 3.858e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 517.8364 845.9217
## sample estimates:
## mean of x mean of y
## 27163.62 26481.74
data1 <- cc_data$DAYS_BIRTH[cc_data$TARGET == "0"]
data2 <- cc_data$DAYS_BIRTH[cc_data$TARGET == "1"]
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = -45.006, df = 29749, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1307.932 -1198.764
## sample estimates:
## mean of x mean of y
## -16138.18 -14884.83
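# Index with cc_data1's own TARGET so the logical vector matches its 252,135 rows.
# The output below was produced with cc_data$TARGET as the index and needs re-running.
data1 <- cc_data1$DAYS_EMPLOYED[cc_data1$TARGET == "0"]
data2 <- cc_data1$DAYS_EMPLOYED[cc_data1$TARGET == "1"]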
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = -0.68096, df = 24195, p-value = 0.4959
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -45.13094 21.85780
## sample estimates:
## mean of x mean of y
## -2385.132 -2373.496
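The five t-tests above repeat the same three lines with a different column each time. One possible way to avoid the repetition is a small helper that takes the data frame and a column name, sketched below in base R. The helper name `welch_by_target` and the toy data are hypothetical. Because the values and the TARGET index come from the same data frame, they always have the same length.

```r
# Hypothetical helper: Welch t-test of one numeric column between TARGET groups.
# Values and index are taken from the same data frame.
welch_by_target <- function(df, column) {
  x <- df[[column]][df$TARGET == 0]
  y <- df[[column]][df$TARGET == 1]
  test <- t.test(x, y)  # Welch test is the default (var.equal = FALSE)
  data.frame(
    variable  = column,
    mean_0    = mean(x, na.rm = TRUE),
    mean_1    = mean(y, na.rm = TRUE),
    statistic = unname(test$statistic),
    p_value   = test$p.value
  )
}

# Toy data for illustration only
toy <- data.frame(
  TARGET     = c(0, 0, 0, 0, 1, 1, 1, 1),
  AMT_CREDIT = c(10, 12, 11, 13, 7, 9, 8, 6),
  DAYS_BIRTH = c(-50, -40, -45, -55, -30, -35, -25, -20)
)

# One row of results per tested column
results <- do.call(rbind, lapply(c("AMT_CREDIT", "DAYS_BIRTH"), welch_by_target, df = toy))
results
```

On the real data the same call would take `df = cc_data` for most columns and `df = cc_data1` for DAYS_EMPLOYED.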
The t-test shows no significant difference in “AMT_INCOME_TOTAL” between the two groups (p = 0.465). Checking “NAME_INCOME_TYPE”, I found that at least half of the clients are middle class; combined with the high average income (about $169K), it makes sense that income has very little effect on “TARGET”, so I drop it from the rest of the analysis.
ggplot(cc_data) +
geom_bar(mapping = aes(x = as.factor(TARGET), fill = CODE_GENDER), position = "dodge") +
labs(title = "TARGET by Gender",
x = "Target by gender",
y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
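# Age groups from DAYS_BIRTH (negative day counts);
# -9490, -12775 and -16425 days correspond to 26, 35 and 45 years
cc_data2 <- cc_data %>%
  mutate(age_group = case_when(
    DAYS_BIRTH > -9490 ~ "20+",
    between(DAYS_BIRTH, -12775, -9490) ~ "30+",
    between(DAYS_BIRTH, -16425, -12776) ~ "40+",
    DAYS_BIRTH <= -16426 ~ "50+",
    .default = NA)) %>%
  print()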
## # A tibble: 307,511 × 11
## CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M Y Second… Working 1 202500 4.07e5 24700. -9461
## 2 F N Higher… State … 0 270000 1.29e6 35698. -16765
## 3 M Y Second… Working 0 67500 1.35e5 6750 -19046
## 4 F Y Second… Working 0 135000 3.13e5 29686. -19005
## 5 M Y Second… Working 0 121500 5.13e5 21866. -19932
## 6 M Y Second… State … 0 99000 4.90e5 27518. -16941
## 7 F Y Higher… Commer… 0 171000 1.56e6 41301 -13778
## 8 M Y Higher… State … 0 360000 1.53e6 42075 -18850
## 9 F Y Second… Pensio… 0 112500 1.02e6 33826. -20099
## 10 M Y Second… Working 0 135000 4.05e5 20250 -14469
## # … with 307,501 more rows, 2 more variables: DAYS_EMPLOYED <dbl>,
## # age_group <chr>, and abbreviated variable names ¹FLAG_OWN_REALTY,
## # ²NAME_EDUCATION_TYPE, ³NAME_INCOME_TYPE, ⁴AMT_INCOME_TOTAL, ⁵AMT_CREDIT,
## # ⁶AMT_ANNUITY, ⁷DAYS_BIRTH
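The same four age groups can also be built with `cut()`, which states the break points once and cannot overlap. This is a sketch of an alternative, not the code used for the plots below; the function name `age_group_from_days` is my own. Dividing the thresholds by 365 gives 26, 35 and 45 years, so the labels roughly stand for under 26, 26 to 35, 35 to 45 and over 45.

```r
# Same four groups as the case_when() above, built with cut() on day counts.
# Intervals are right-closed:
# (-Inf, -16426], (-16426, -12776], (-12776, -9490], (-9490, Inf)
age_group_from_days <- function(days_birth) {
  cut(days_birth,
      breaks = c(-Inf, -16426, -12776, -9490, Inf),
      labels = c("50+", "40+", "30+", "20+"))
}

# Toy values for illustration: boundary cases plus one value well inside a group
toy_days <- c(-9461, -9490, -12775, -12776, -16425, -16426, -20099)
as.character(age_group_from_days(toy_days))
```

On the real data this would be `mutate(age_group = age_group_from_days(DAYS_BIRTH))`, which returns a factor instead of a character column.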
ggplot(cc_data2) +
geom_bar(mapping = aes(x = as.factor(TARGET), fill = age_group), position = "dodge")+
facet_wrap(~CODE_GENDER , nrow = 2) +
labs(title = "Age & Gender effect on Target",
x = "Target",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
ggplot(cc_data2) +
geom_bar(mapping = aes(x = as.factor(TARGET), fill = NAME_EDUCATION_TYPE), position = "dodge")+
facet_wrap(~CODE_GENDER , nrow = 2) +
labs(title = "Education & Gender effect on Target",
x = "Target",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
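Answer: In absolute counts, women have more late payments than men,
the youngest (20+) and oldest (50+) age groups have more than the other age groups, and
many late payers have secondary education. Because these are counts, they partly reflect how large each group is.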
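The bar charts above show counts, and there are about twice as many women as men in the data, so a larger bar does not necessarily mean a higher risk. A rate per group is easier to compare. The sketch below uses base R `tapply()` on made-up data; the helper name `late_rate_by` is hypothetical.

```r
# Share of late payers (mean of the 0/1 TARGET) within each combination of groups
late_rate_by <- function(target, ...) {
  tapply(target, list(...), mean)
}

# Toy data for illustration only
toy <- data.frame(
  TARGET      = c(1, 0, 0, 0, 1, 1, 0, 0),
  CODE_GENDER = c("F", "F", "F", "F", "M", "M", "M", "M"),
  age_group   = c("20+", "20+", "50+", "50+", "20+", "20+", "50+", "50+")
)

# Rows are genders, columns are age groups, cells are late-payment rates
rates <- late_rate_by(toy$TARGET, toy$CODE_GENDER, toy$age_group)
rates
```

On the real data the call would be `late_rate_by(cc_data2$TARGET, cc_data2$CODE_GENDER, cc_data2$age_group)`.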
ggplot(cc_data1) +
geom_boxplot(mapping = aes(y = as.factor(TARGET), x = AMT_CREDIT)) +
labs(title = "TARGET by Credit",
x = "Credit Amount",
y = "Target") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
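The next plot uses `cc_data3` and its `employee_group` column, whose definition is not shown in this report. The sketch below is one way such a grouping could be built from DAYS_EMPLOYED. The cut points (1, 5 and 10 years), the labels and the helper name `add_employee_group` are assumptions for illustration, not the grouping actually used.

```r
# ASSUMPTION: the cut points (1, 5 and 10 years) and labels are illustrative;
# the grouping actually used for cc_data3 is not shown in this report.
add_employee_group <- function(df) {
  # DAYS_EMPLOYED is negative (days before application), so flip the sign
  years <- -df$DAYS_EMPLOYED / 365
  df$employee_group <- cut(years,
                           breaks = c(0, 1, 5, 10, Inf),
                           labels = c("<1 yr", "1-5 yrs", "5-10 yrs", "10+ yrs"),
                           right = FALSE)  # intervals are [0,1), [1,5), [5,10), [10,Inf)
  df
}

# Toy data for illustration only
toy <- data.frame(DAYS_EMPLOYED = c(-100, -365, -1213, -2760, -17912))
as.character(add_employee_group(toy)$employee_group)
```

With this helper, `cc_data3 <- add_employee_group(cc_data1)` would give a data frame the plot below can use.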
ggplot(cc_data3) +
geom_boxplot(mapping = aes(y = AMT_CREDIT, x =employee_group )) +
facet_wrap(~as.factor(TARGET),nrow = 2) +
labs(title = "Target by credit amount & length of employment",
y = "Credit amount",
x = "Length of employment") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
ggplot(cc_data2) +
geom_boxplot(mapping = aes(x = AMT_CREDIT, y = age_group)) +
facet_wrap(~as.factor(TARGET),nrow = 2) +
labs(title = "Target by credit amount & age",
x = "Credit amount",
y = "Age group") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Answer: customers with a smaller credit amount
tend to have more late payments, and customers who have been working
longer tend to have more late payments.
ggplot(cc_data) +
geom_bar(mapping = aes(x = as.factor(TARGET), fill = FLAG_OWN_REALTY), position = "dodge") +
labs(title = "Target by owning a property",
x = "Target",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
ggplot(cc_data) +
geom_bar(mapping = aes(x = as.factor(TARGET), fill = NAME_INCOME_TYPE), position = "dodge") +
labs(title = "Target by income type",
x = "Target",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
ggplot(cc_data) +
geom_boxplot(mapping = aes(x = AMT_CREDIT, y = NAME_INCOME_TYPE),position = "dodge") +
facet_wrap(~as.factor(TARGET),nrow = 2) +
labs(title = "Target by credit amount & income type",
x = "Credit amount",
y = "Income type") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(0.8)))
cc_data_stu <- cc_data %>%
filter(NAME_INCOME_TYPE=="Student" | NAME_INCOME_TYPE=="Businessman" | NAME_INCOME_TYPE=="Maternity leave") %>%
print()
## # A tibble: 33 × 10
## CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M Y Second… Student 0 180000 5.24e5 22311 -18790
## 2 M N Higher… Student 0 135000 3.82e5 24512. -9115
## 3 M Y Higher… Busine… 0 900000 2.25e6 112500 -20364
## 4 M N Higher… Busine… 0 2250000 1.35e6 67500 -12249
## 5 F N Second… Student 0 90000 3.14e5 16164 -16337
## 6 M N Higher… Student 0 225000 9.59e4 10332 -19180
## 7 M N Higher… Matern… 0 360000 7.65e5 76500 -22166
## 8 M Y Second… Student 0 144000 1.38e6 39712. -14317
## 9 F Y Second… Student 0 112500 2.76e5 19778. -12127
## 10 F N Second… Student 0 225000 6.60e5 60692. -12280
## # … with 23 more rows, 1 more variable: DAYS_EMPLOYED <dbl>, and abbreviated
## # variable names ¹FLAG_OWN_REALTY, ²NAME_EDUCATION_TYPE, ³NAME_INCOME_TYPE,
## # ⁴AMT_INCOME_TOTAL, ⁵AMT_CREDIT, ⁶AMT_ANNUITY, ⁷DAYS_BIRTH
ggplot(cc_data_stu) + geom_bar(aes(as.factor(TARGET),fill=NAME_INCOME_TYPE),position= "dodge")
Answer: owning real property does have an effect
on “TARGET”, and the income groups differ
a lot: students and businessmen have no late payments at all, while
the working class has the most late payments.
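The three income types in `cc_data_stu` together have only 33 rows, so a rate of zero there is much less certain than a rate measured on the whole sample. The sketch below puts an exact binomial confidence interval around a rate, using `binom.test()` from base R. The case of 0 late payers out of 33 is an illustrative input rather than a count taken from the data, and the helper name `late_rate_ci` is hypothetical.

```r
# Exact (Clopper-Pearson) 95% interval for a late-payment rate
late_rate_ci <- function(n_late, n_total) {
  ci <- binom.test(n_late, n_total)$conf.int
  c(rate = n_late / n_total, lower = ci[1], upper = ci[2])
}

# Illustrative case: a group of 33 clients with no late payments at all
small_group <- late_rate_ci(0, 33)

# Whole sample: 24,825 late payers out of 307,511 (from table(cc_data$TARGET))
overall <- late_rate_ci(24825, 307511)

round(rbind(small_group, overall), 3)
```

For a group this small the upper end of the interval lies above the overall late-payment rate, so a count of zero is weak evidence that such clients really are lower risk.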
Since this is, on average, a high-income group, the total credit amount and the annuity payment don’t have a strong effect on late payment, but the income type and the length of the current job do. People with secondary education tend to have more late payments than others, and gender is also a factor for “TARGET”.