library(tidyverse)
library(GGally)

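MyCC_data <- read_csv("application_data.csv")
# Keep the ten variables analysed below as cc_data
cc_data <- MyCC_data %>%
  select(CODE_GENDER, FLAG_OWN_REALTY, NAME_EDUCATION_TYPE, NAME_INCOME_TYPE, TARGET,
         AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, DAYS_BIRTH, DAYS_EMPLOYED)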


About the data

What it's about

This data set comes from a credit card company; my goal is to find out which factors are associated with late payment. There are 307,511 observations and 122 variables in total, of which 16 are categorical and the rest are numerical. My target variable is “TARGET”.

The chosen variables

  • CODE_GENDER : Categorical. Gender of the client.

  • FLAG_OWN_REALTY : Categorical. Flag if client owns a house or flat.

  • NAME_INCOME_TYPE : Categorical. Client's income type (businessman, working, maternity leave, …).

  • NAME_EDUCATION_TYPE : Categorical. Level of highest education the client achieved

  • TARGET : Categorical. Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases).

  • AMT_INCOME_TOTAL : Numerical. Income of the client.

  • AMT_CREDIT : Numerical. Credit amount of the loan.

  • AMT_ANNUITY : Numerical. Loan annuity.

  • DAYS_BIRTH : Numerical. Client’s age in days at the time of application

  • DAYS_EMPLOYED : Numerical. How many days before the application the person started current employment

EDA PROCESS


table(cc_data$NAME_EDUCATION_TYPE)
## 
##               Academic degree              Higher education 
##                           164                         74863 
##             Incomplete higher               Lower secondary 
##                         10277                          3816 
## Secondary / secondary special 
##                        218391
ggplot(cc_data) + geom_bar(aes(NAME_EDUCATION_TYPE)) + coord_flip()

table(cc_data$TARGET)
## 
##      0      1 
## 282686  24825
ggplot(cc_data) + geom_bar(aes(TARGET))

table(cc_data$CODE_GENDER)
## 
##      F      M    XNA 
## 202448 105059      4
ggplot(cc_data) + geom_bar(aes(CODE_GENDER))

table(cc_data$FLAG_OWN_REALTY)
## 
##      N      Y 
##  94199 213312
ggplot(cc_data) + geom_bar(aes(FLAG_OWN_REALTY))

summary(cc_data$AMT_INCOME_TOTAL)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     25650    112500    147150    168798    202500 117000000
ggplot(cc_data) + geom_histogram(aes(AMT_INCOME_TOTAL))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(cc_data$AMT_CREDIT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   45000  270000  513531  599026  808650 4050000
ggplot(cc_data) + geom_histogram(aes(AMT_CREDIT))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(cc_data$AMT_ANNUITY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1616   16524   24903   27109   34596  258026      12
ggplot(cc_data) + geom_histogram(aes(AMT_ANNUITY))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

summary(cc_data$DAYS_BIRTH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -25229  -19682  -15750  -16037  -12413   -7489
ggplot(cc_data) + geom_histogram(aes(DAYS_BIRTH))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(cc_data$DAYS_EMPLOYED)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -17912   -2760   -1213   63815    -289  365243
ggplot(cc_data) + geom_histogram(aes(DAYS_EMPLOYED))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


About 8% of credit card holders have at least one late payment; about 71% have secondary-level education; there are about twice as many female holders as male holders; about two thirds of holders own a house or flat; 20% of holders are commercial associates, 20% are government employees or pensioners, and 45% are working class; the average age of holders is about 44 years; the median time in the current job is about 3.3 years; the average income is about $169K; the average loan credit is $599K; and the average loan annuity is $27K.

But in the column “DAYS_EMPLOYED” I find something odd. This variable records how many days before the application the applicant started the current job, so it is supposed to be a negative number. Yet some positive values appear, and they are far too large to be plausible (the maximum, 365,243 days, is about 1,000 years). So I filter out those rows:


cc_data1 <- cc_data %>%
  filter(DAYS_EMPLOYED<0)%>%
  print()
## # A tibble: 252,135 × 10
##    CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
##    <chr>       <chr>      <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 M           Y          Second… Working      1  202500  4.07e5  24700.   -9461
##  2 F           N          Higher… State …      0  270000  1.29e6  35698.  -16765
##  3 M           Y          Second… Working      0   67500  1.35e5   6750   -19046
##  4 F           Y          Second… Working      0  135000  3.13e5  29686.  -19005
##  5 M           Y          Second… Working      0  121500  5.13e5  21866.  -19932
##  6 M           Y          Second… State …      0   99000  4.90e5  27518.  -16941
##  7 F           Y          Higher… Commer…      0  171000  1.56e6  41301   -13778
##  8 M           Y          Higher… State …      0  360000  1.53e6  42075   -18850
##  9 M           Y          Second… Working      0  135000  4.05e5  20250   -14469
## 10 F           Y          Higher… Working      0  112500  6.52e5  21177   -10197
## # … with 252,125 more rows, 1 more variable: DAYS_EMPLOYED <dbl>, and
## #   abbreviated variable names ¹​FLAG_OWN_REALTY, ²​NAME_EDUCATION_TYPE,
## #   ³​NAME_INCOME_TYPE, ⁴​AMT_INCOME_TOTAL, ⁵​AMT_CREDIT, ⁶​AMT_ANNUITY,
## #   ⁷​DAYS_BIRTH
ggplot(cc_data1) + geom_histogram(aes(DAYS_EMPLOYED))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
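
Before dropping rows it can be worth checking which positive values actually occur, and whether recoding them as missing would be preferable to removing the whole row. The sketch below uses a small made-up tibble (`toy`) as a stand-in for `cc_data`; the values in it are illustrative and are not taken from the real file.

```r
library(dplyr)

# Toy stand-in for cc_data; values are illustrative only.
toy <- tibble(
  TARGET        = c(0, 1, 0, 0, 1),
  DAYS_EMPLOYED = c(-1213, -289, 365243, -2760, 365243)
)

# Which positive values occur, and how often?
positive_values <- toy %>%
  filter(DAYS_EMPLOYED > 0) %>%
  count(DAYS_EMPLOYED)
print(positive_values)

# Alternative to dropping rows: treat positive values as missing,
# so the other columns of those clients stay available.
toy_recode <- toy %>%
  mutate(DAYS_EMPLOYED = if_else(DAYS_EMPLOYED > 0, NA_real_, DAYS_EMPLOYED))
print(toy_recode)
```

If a single value accounts for all positive entries, it is probably a placeholder code rather than a real tenure, which would be an argument for recoding instead of filtering.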


Then I check whether there are any missing values in the data set.


map(cc_data, ~ sum(is.na(.)))
## $CODE_GENDER
## [1] 0
## 
## $FLAG_OWN_REALTY
## [1] 0
## 
## $NAME_EDUCATION_TYPE
## [1] 0
## 
## $NAME_INCOME_TYPE
## [1] 0
## 
## $TARGET
## [1] 0
## 
## $AMT_INCOME_TOTAL
## [1] 0
## 
## $AMT_CREDIT
## [1] 0
## 
## $AMT_ANNUITY
## [1] 12
## 
## $DAYS_BIRTH
## [1] 0
## 
## $DAYS_EMPLOYED
## [1] 0
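
The same check can be written more compactly with base R, which returns one named vector instead of a list. This is only a sketch on a made-up data frame (`toy`); the column names mirror the report but the values are invented.

```r
# Toy stand-in for cc_data; values are illustrative only.
toy <- data.frame(
  AMT_CREDIT  = c(406597, 1293502, 135000),
  AMT_ANNUITY = c(24700, NA, 6750)
)

# Count missing values per column in one named vector.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Show only the columns that actually contain missing values.
print(na_counts[na_counts > 0])
```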


This is pretty clean data: only AMT_ANNUITY has missing values (12). Then I plot ggpairs:


ggpairs(cc_data)
## Warning: Removed 12 rows containing non-finite values (`stat_boxplot()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 12 rows containing missing values
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_density()`).


I don’t see any variable with a strong relationship with “TARGET” in the plots, so I continue with tests between “TARGET” and each of the other variables I chose: chi-squared tests for the categorical variables and Welch t-tests for the numerical ones.


chisq.test(table(cc_data$TARGET, cc_data$CODE_GENDER))
## 
##  Pearson's Chi-squared test
## 
## data:  table(cc_data$TARGET, cc_data$CODE_GENDER)
## X-squared = 920.79, df = 2, p-value < 2.2e-16
chisq.test(table(cc_data$TARGET, cc_data$FLAG_OWN_REALTY))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(cc_data$TARGET, cc_data$FLAG_OWN_REALTY)
## X-squared = 11.576, df = 1, p-value = 0.0006681
chisq.test(table(cc_data$TARGET, cc_data$NAME_EDUCATION_TYPE))
## 
##  Pearson's Chi-squared test
## 
## data:  table(cc_data$TARGET, cc_data$NAME_EDUCATION_TYPE)
## X-squared = 1019.2, df = 4, p-value < 2.2e-16
data1 <- cc_data$AMT_INCOME_TOTAL[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_INCOME_TOTAL[cc_data$TARGET == "1"]
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 0.73067, df = 24920, p-value = 0.465
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5831.714 12763.636
## sample estimates:
## mean of x mean of y 
##  169077.7  165611.8
data1 <- cc_data$AMT_CREDIT[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_CREDIT[cc_data$TARGET == "1"]
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 19.273, df = 31161, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  40306.60 49432.91
## sample estimates:
## mean of x mean of y 
##  602648.3  557778.5
data1 <- cc_data$AMT_ANNUITY[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_ANNUITY[cc_data$TARGET == "1"]
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 8.1473, df = 31195, p-value = 3.858e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  517.8364 845.9217
## sample estimates:
## mean of x mean of y 
##  27163.62  26481.74
data1 <- cc_data$DAYS_BIRTH[cc_data$TARGET == "0"]
data2 <- cc_data$DAYS_BIRTH[cc_data$TARGET == "1"]
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -45.006, df = 29749, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1307.932 -1198.764
## sample estimates:
## mean of x mean of y 
## -16138.18 -14884.83
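# Build the TARGET mask from cc_data1 so it lines up with the filtered rows
# (a mask from cc_data has 307,511 entries, cc_data1 only 252,135);
# the output printed below came from the cc_data mask and should be regenerated.
data1 <- cc_data1$DAYS_EMPLOYED[cc_data1$TARGET == "0"]
data2 <- cc_data1$DAYS_EMPLOYED[cc_data1$TARGET == "1"]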
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -0.68096, df = 24195, p-value = 0.4959
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -45.13094  21.85780
## sample estimates:
## mean of x mean of y 
## -2385.132 -2373.496
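
With more than 300,000 rows, even small differences in means produce tiny p-values, so an effect size is a useful companion to each t-test. The sketch below shows the formula interface of `t.test()`, which takes both groups from the same data frame and so keeps the group mask and the values aligned, followed by a hand-computed Cohen's d. The data frame `toy` is simulated; its means and standard deviation are assumptions chosen for illustration, not estimates from the real data.

```r
set.seed(1)

# Simulated stand-in for cc_data; the distribution parameters are invented.
toy <- data.frame(
  TARGET     = rep(c(0, 1), times = c(60, 40)),
  AMT_CREDIT = c(rnorm(60, mean = 600000, sd = 50000),
                 rnorm(40, mean = 550000, sd = 50000))
)

# Welch two-sample t-test via the formula interface:
# both groups are taken from the same rows of the same data frame.
res <- t.test(AMT_CREDIT ~ TARGET, data = toy)
print(res$estimate)

# Cohen's d: difference in means divided by the pooled standard deviation.
grp <- split(toy$AMT_CREDIT, toy$TARGET)
n0 <- length(grp[["0"]])
n1 <- length(grp[["1"]])
pooled_sd <- sqrt(((n0 - 1) * var(grp[["0"]]) + (n1 - 1) * var(grp[["1"]])) /
                    (n0 + n1 - 2))
cohens_d <- (mean(grp[["0"]]) - mean(grp[["1"]])) / pooled_sd
print(cohens_d)
```

A common rule of thumb reads d around 0.2 as small, 0.5 as medium, and 0.8 as large, which helps separate "statistically significant" from "practically important".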


The tests show that “AMT_INCOME_TOTAL” has a very weak relationship with “TARGET” (p = 0.465). I then check “NAME_INCOME_TYPE” and find that at least half of the clients come from the middle class. Combined with the high average income (about 169K), it makes sense that income has very little effect on “TARGET”, so I drop it.


Q1: Which gender is more likely to have late payments? Which age groups tend to have more late payments? And which education levels?


ggplot(cc_data) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = CODE_GENDER), position = "dodge") + 
  labs(title = "TARGET by Gender", 
       x = "Target by gender", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

cc_data2 <- cc_data %>%
  mutate(age_group = case_when(
    # Cut points in days: 9,490 = 26 years, 12,775 = 35 years, 16,425 = 45 years
    DAYS_BIRTH > -9490 ~ "20+",
    between(DAYS_BIRTH, -12775, -9490) ~ "30+",
    between(DAYS_BIRTH, -16425, -12776) ~ "40+",
    between(DAYS_BIRTH, -Inf, -16426) ~ "50+",
    .default = NA)) %>%
  print()
## # A tibble: 307,511 × 11
##    CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
##    <chr>       <chr>      <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 M           Y          Second… Working      1  202500  4.07e5  24700.   -9461
##  2 F           N          Higher… State …      0  270000  1.29e6  35698.  -16765
##  3 M           Y          Second… Working      0   67500  1.35e5   6750   -19046
##  4 F           Y          Second… Working      0  135000  3.13e5  29686.  -19005
##  5 M           Y          Second… Working      0  121500  5.13e5  21866.  -19932
##  6 M           Y          Second… State …      0   99000  4.90e5  27518.  -16941
##  7 F           Y          Higher… Commer…      0  171000  1.56e6  41301   -13778
##  8 M           Y          Higher… State …      0  360000  1.53e6  42075   -18850
##  9 F           Y          Second… Pensio…      0  112500  1.02e6  33826.  -20099
## 10 M           Y          Second… Working      0  135000  4.05e5  20250   -14469
## # … with 307,501 more rows, 2 more variables: DAYS_EMPLOYED <dbl>,
## #   age_group <chr>, and abbreviated variable names ¹​FLAG_OWN_REALTY,
## #   ²​NAME_EDUCATION_TYPE, ³​NAME_INCOME_TYPE, ⁴​AMT_INCOME_TOTAL, ⁵​AMT_CREDIT,
## #   ⁶​AMT_ANNUITY, ⁷​DAYS_BIRTH
  ggplot(cc_data2) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = age_group), position = "dodge")+ 
  facet_wrap(~CODE_GENDER , nrow = 2) +
  labs(title = "Age & Gender effect on Target", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data2) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = NAME_EDUCATION_TYPE), position = "dodge")+ 
  facet_wrap(~CODE_GENDER , nrow = 2) +
  labs(title = "Education & Gender effect on Target", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))


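Answer: In absolute counts, female clients have more late payments than male clients (there are also about twice as many female clients in the data), both the youngest (20+) and the oldest (50+) age groups have more than the other age groups, and many of the late payers have secondary education.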
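
Because the groups differ a lot in size, raw counts in a bar chart can point the wrong way; the share of late payers within each group is the fairer comparison. The sketch below computes that rate per group on a tiny made-up tibble (`toy`); the same pipeline would apply to `cc_data2` with `CODE_GENDER`, `age_group`, or `NAME_EDUCATION_TYPE` as the grouping column.

```r
library(dplyr)

# Toy stand-in for cc_data2; values are illustrative only.
toy <- tibble(
  CODE_GENDER = c("F", "F", "F", "F", "M", "M"),
  TARGET      = c(0, 0, 0, 1, 0, 1)
)

# Group size, number of late payers, and late-payment rate per group.
rates <- toy %>%
  group_by(CODE_GENDER) %>%
  summarise(
    n         = n(),
    late      = sum(TARGET),
    late_rate = mean(TARGET)
  )
print(rates)
```

In this toy example both groups have one late payer, but the rate is 25% for F and 50% for M, which is the kind of difference a count-based bar chart hides.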

Q3: What is the effect of the total credit amount on “TARGET”? Is the length of employment a factor? What about age?


ggplot(cc_data1) + 
  geom_boxplot(mapping = aes(y = as.factor(TARGET), x = AMT_CREDIT)) + 
  labs(title = "TARGET by Credit", 
       x = "Credit Amount", 
       y = "Target") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))
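
The next plot uses `cc_data3` and its column `employee_group`, which are not defined in the code shown here. One plausible construction, by analogy with `age_group`, is to bin `DAYS_EMPLOYED` from the filtered data `cc_data1`. The sketch below does this on a made-up tibble; the cut points (1, 5, and 10 years) and the group labels are assumptions, not the ones used for the original figure.

```r
library(dplyr)

# Toy stand-in for cc_data1 (rows with DAYS_EMPLOYED < 0); values are illustrative only.
toy1 <- tibble(
  TARGET        = c(0, 1, 0, 0),
  AMT_CREDIT    = c(406597, 135000, 312682, 513000),
  DAYS_EMPLOYED = c(-200, -1213, -2760, -9000)
)

# Hypothetical cut points: 365 days = 1 year, 1825 = 5 years, 3650 = 10 years.
toy3 <- toy1 %>%
  mutate(employee_group = case_when(
    DAYS_EMPLOYED > -365  ~ "<1 yr",
    DAYS_EMPLOYED > -1825 ~ "1-5 yrs",
    DAYS_EMPLOYED > -3650 ~ "5-10 yrs",
    TRUE                  ~ "10+ yrs"
  ))
print(toy3)
```

Applying the same `mutate()` to `cc_data1` and assigning the result to `cc_data3` would make the plot code below runnable, under these assumed bins.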

ggplot(cc_data3) + 
  geom_boxplot(mapping = aes(y = AMT_CREDIT, x =employee_group )) + 
  facet_wrap(~as.factor(TARGET),nrow = 2) + 
  labs(title = "Target by credit amount & length of employment", 
       y = "Credit amount", 
       x = "Length of employment") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data2) + 
  geom_boxplot(mapping = aes(x = AMT_CREDIT, y = age_group)) + 
  facet_wrap(~as.factor(TARGET),nrow = 2) + 
  labs(title = "Target by credit amount & age", 
       x = "Credit amount", 
       y = "Age group") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))


Answer: Customers with a lower credit amount tend to have more late payments, and customers who have been working longer tend to have more late payments.


Q4: What is the effect of owning real property on “TARGET”? How does the income type affect it? Is the credit amount a factor?


ggplot(cc_data) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = FLAG_OWN_REALTY), position = "dodge")  + 
  labs(title = "Target by owning a property", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = NAME_INCOME_TYPE), position = "dodge") + 
  labs(title = "Target by income type", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data) + 
  geom_boxplot(mapping = aes(x = AMT_CREDIT, y = NAME_INCOME_TYPE),position = "dodge") + 
  facet_wrap(~as.factor(TARGET),nrow = 2) + 
  labs(title = "Target by credit amount & income type", 
       x = "Credit amount", 
       y = "Income type") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(0.8)))

cc_data_stu <- cc_data %>%
  filter(NAME_INCOME_TYPE %in% c("Student", "Businessman", "Maternity leave")) %>%
  print()
## # A tibble: 33 × 10
##    CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
##    <chr>       <chr>      <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 M           Y          Second… Student      0  180000  5.24e5  22311   -18790
##  2 M           N          Higher… Student      0  135000  3.82e5  24512.   -9115
##  3 M           Y          Higher… Busine…      0  900000  2.25e6 112500   -20364
##  4 M           N          Higher… Busine…      0 2250000  1.35e6  67500   -12249
##  5 F           N          Second… Student      0   90000  3.14e5  16164   -16337
##  6 M           N          Higher… Student      0  225000  9.59e4  10332   -19180
##  7 M           N          Higher… Matern…      0  360000  7.65e5  76500   -22166
##  8 M           Y          Second… Student      0  144000  1.38e6  39712.  -14317
##  9 F           Y          Second… Student      0  112500  2.76e5  19778.  -12127
## 10 F           N          Second… Student      0  225000  6.60e5  60692.  -12280
## # … with 23 more rows, 1 more variable: DAYS_EMPLOYED <dbl>, and abbreviated
## #   variable names ¹​FLAG_OWN_REALTY, ²​NAME_EDUCATION_TYPE, ³​NAME_INCOME_TYPE,
## #   ⁴​AMT_INCOME_TOTAL, ⁵​AMT_CREDIT, ⁶​AMT_ANNUITY, ⁷​DAYS_BIRTH
ggplot(cc_data_stu) + geom_bar(aes(as.factor(TARGET),fill=NAME_INCOME_TYPE),position= "dodge")


Answer: Owning real property does have an effect on “TARGET”, but the income groups differ a lot: students and businessmen have no late payments at all (though these groups, together with maternity leave, contain only 33 clients), while the working class has the highest number of late payments.

Conclusion:

Since this is a group with a high average income, the total credit amount and the annuity payment do not have a strong effect on late payment, but the income type and the length of the current job do show a strong effect. People with secondary education tend to have more late payments than others. Gender is also a factor for “TARGET”.