Project1

EDA PROCESS

library(tidyverse)  # ggplot2, dplyr, purrr
library(GGally)     # ggpairs()

table(cc_data$NAME_EDUCATION_TYPE)

## 
##               Academic degree              Higher education 
##                           164                         74863 
##             Incomplete higher               Lower secondary 
##                         10277                          3816 
## Secondary / secondary special 
##                        218391

ggplot(cc_data) + geom_bar(aes(NAME_EDUCATION_TYPE)) + coord_flip()

table(cc_data$TARGET)

## 
##      0      1 
## 282686  24825

ggplot(cc_data) + geom_bar(aes(TARGET))
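The class balance is easier to read as proportions than as raw counts. A small sketch using the two counts printed above (the names `target_counts` and `target_share` are illustrative and not part of the original script):

```r
# Counts copied from the table(cc_data$TARGET) output above.
target_counts <- c(`0` = 282686, `1` = 24825)

# Share of each class; prop.table() divides each count by the total.
target_share <- prop.table(target_counts)
round(target_share, 3)
```

Class 1 makes up roughly 8% of the rows, so the two classes are strongly imbalanced.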

table(cc_data$CODE_GENDER)

## 
##      F      M    XNA 
## 202448 105059      4

ggplot(cc_data) + geom_bar(aes(CODE_GENDER))

table(cc_data$FLAG_OWN_REALTY)

## 
##      N      Y 
##  94199 213312

ggplot(cc_data) + geom_bar(aes(FLAG_OWN_REALTY))

summary(cc_data$AMT_INCOME_TOTAL)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     25650    112500    147150    168798    202500 117000000

ggplot(cc_data) + geom_histogram(aes(AMT_INCOME_TOTAL))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
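With a maximum of 117,000,000 against a median of 147,150, a 30-bin histogram on the raw scale puts nearly every applicant into one bar. One possible remedy is to bin on a log scale. The sketch below uses simulated incomes (an assumption, not the real column) and base R `hist()` only to count non-empty bins:

```r
set.seed(1)
# Hypothetical right-skewed incomes standing in for AMT_INCOME_TOTAL.
income <- rlnorm(1000, meanlog = log(150000), sdlog = 0.5)
income[1] <- 117000000  # one extreme value, like the maximum reported above

# Bin on the raw scale and on the log10 scale without drawing anything.
h_raw <- hist(income, breaks = 30, plot = FALSE)
h_log <- hist(log10(income), breaks = 30, plot = FALSE)

# Number of non-empty bins on each scale.
c(raw = sum(h_raw$counts > 0), log = sum(h_log$counts > 0))
```

In the ggplot2 code used in this report, the same idea would correspond to adding `scale_x_log10()` to the histogram.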

summary(cc_data$AMT_CREDIT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   45000  270000  513531  599026  808650 4050000

ggplot(cc_data) + geom_histogram(aes(AMT_CREDIT))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(cc_data$AMT_ANNUITY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1616   16524   24903   27109   34596  258026      12

ggplot(cc_data) + geom_histogram(aes(AMT_ANNUITY))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

summary(cc_data$DAYS_BIRTH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -25229  -19682  -15750  -16037  -12413   -7489

ggplot(cc_data) + geom_histogram(aes(DAYS_BIRTH))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
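DAYS_BIRTH is stored as a negative day count, which is hard to read directly. A short conversion of the summary values above into years, assuming 365.25 days per year (the vector names are illustrative):

```r
# Summary values copied from summary(cc_data$DAYS_BIRTH) above.
days_birth <- c(min = -25229, median = -15750, mean = -16037, max = -7489)

# Negate the day count and divide by 365.25 to get age in years.
age_years <- -days_birth / 365.25
round(age_years, 1)
```

Under that assumption the mean age is about 43.9 years, and ages range from roughly 20.5 to 69.1 years.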

summary(cc_data$DAYS_EMPLOYED)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -17912   -2760   -1213   63815    -289  365243

ggplot(cc_data) + geom_histogram(aes(DAYS_EMPLOYED))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

About 8% of credit card holders have at least one late payment; about 71% have secondary-level education; there are about twice as many female holders as male holders; about two thirds own their home; 20% of holders are commercial associates, 20% are government employees or pensioners, and 45% are working class. The average holder is about 44 years old, and the median time in the current job is about 3.3 years. The average income is $168K, the average credit amount is $599K, and the average annuity payment is $27K.

The column “DAYS_EMPLOYED” looks odd, however. It records how many days before the application the applicant started the current job, so it should be negative, but some values are positive and implausibly large (the maximum, 365243 days, is roughly 1,000 years). I therefore filter those rows out:

cc_data1 <- cc_data %>%
  filter(DAYS_EMPLOYED<0)%>%
  print()

## # A tibble: 252,135 × 10
##    CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
##    <chr>       <chr>      <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 M           Y          Second… Working      1  202500  4.07e5  24700.   -9461
##  2 F           N          Higher… State …      0  270000  1.29e6  35698.  -16765
##  3 M           Y          Second… Working      0   67500  1.35e5   6750   -19046
##  4 F           Y          Second… Working      0  135000  3.13e5  29686.  -19005
##  5 M           Y          Second… Working      0  121500  5.13e5  21866.  -19932
##  6 M           Y          Second… State …      0   99000  4.90e5  27518.  -16941
##  7 F           Y          Higher… Commer…      0  171000  1.56e6  41301   -13778
##  8 M           Y          Higher… State …      0  360000  1.53e6  42075   -18850
##  9 M           Y          Second… Working      0  135000  4.05e5  20250   -14469
## 10 F           Y          Higher… Working      0  112500  6.52e5  21177   -10197
## # … with 252,125 more rows, 1 more variable: DAYS_EMPLOYED <dbl>, and
## #   abbreviated variable names ¹FLAG_OWN_REALTY, ²NAME_EDUCATION_TYPE,
## #   ³NAME_INCOME_TYPE, ⁴AMT_INCOME_TOTAL, ⁵AMT_CREDIT, ⁶AMT_ANNUITY,
## #   ⁷DAYS_BIRTH

ggplot(cc_data1) + geom_histogram(aes(DAYS_EMPLOYED))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
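Dropping the rows is one option; another is to keep them and mark the tenure as missing, so the other columns of those applicants stay available. A minimal sketch on made-up values (the vector `days_employed` is hypothetical; only the value 365243 comes from the summary above):

```r
# Hypothetical DAYS_EMPLOYED values; 365243 mimics the implausible maximum above.
days_employed <- c(-637, -1188, 365243, -3039, 365243, -225)

# 365243 days expressed in years, to show why it cannot be a real tenure.
round(365243 / 365.25)

# Alternative to dropping rows: recode positive values as missing.
days_clean <- ifelse(days_employed > 0, NA_real_, days_employed)
sum(is.na(days_clean))
```

Which option is better depends on whether the positive values mark a real subgroup (for example, applicants without a job); that would need to be checked against NAME_INCOME_TYPE.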

Then I check whether there are any missing values in the data set.

map(cc_data, ~ sum(is.na(.)))

## $CODE_GENDER
## [1] 0
## 
## $FLAG_OWN_REALTY
## [1] 0
## 
## $NAME_EDUCATION_TYPE
## [1] 0
## 
## $NAME_INCOME_TYPE
## [1] 0
## 
## $TARGET
## [1] 0
## 
## $AMT_INCOME_TOTAL
## [1] 0
## 
## $AMT_CREDIT
## [1] 0
## 
## $AMT_ANNUITY
## [1] 12
## 
## $DAYS_BIRTH
## [1] 0
## 
## $DAYS_EMPLOYED
## [1] 0
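The same check can be written more compactly so that it returns one named vector instead of a list. A sketch on a small made-up data frame (`toy` is hypothetical):

```r
# Hypothetical miniature of the data set with one missing annuity.
toy <- data.frame(
  TARGET      = c(0, 1, 0, 0),
  AMT_CREDIT  = c(406597.5, 1293502.5, 135000, 312682.5),
  AMT_ANNUITY = c(24700.5, NA, 6750, 29686.5)
)

# Missing values per column, returned as a named numeric vector.
na_counts <- colSums(is.na(toy))
na_counts

# Share of rows that would be lost by dropping incomplete cases.
mean(!complete.cases(toy))
```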

The data are fairly clean: only AMT_ANNUITY has missing values (12). Next, I plot a pairs plot with ggpairs:

ggpairs(cc_data)

## Warning: Removed 12 rows containing non-finite values (`stat_boxplot()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 12 rows containing missing values

## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

## Warning: Removed 12 rows containing missing values (`geom_point()`).

## Warning: Removed 12 rows containing non-finite values (`stat_density()`).

I don’t see any variable with a strong relationship to “TARGET” in the plots, so I continue with tests between “TARGET” and each of the other variables I chose.

chisq.test(table(cc_data$TARGET, cc_data$CODE_GENDER))

## 
##  Pearson's Chi-squared test
## 
## data:  table(cc_data$TARGET, cc_data$CODE_GENDER)
## X-squared = 920.79, df = 2, p-value < 2.2e-16

chisq.test(table(cc_data$TARGET, cc_data$FLAG_OWN_REALTY))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(cc_data$TARGET, cc_data$FLAG_OWN_REALTY)
## X-squared = 11.576, df = 1, p-value = 0.0006681

chisq.test(table(cc_data$TARGET, cc_data$NAME_EDUCATION_TYPE))

## 
##  Pearson's Chi-squared test
## 
## data:  table(cc_data$TARGET, cc_data$NAME_EDUCATION_TYPE)
## X-squared = 1019.2, df = 4, p-value < 2.2e-16
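With more than 300,000 rows, even a very small association yields a tiny p-value, so an effect size helps to judge the strength of these relationships. A sketch that computes Cramér's V from the statistics printed above (the choice of Cramér's V is an addition here, not part of the original analysis):

```r
# Statistics copied from the three chisq.test() outputs above.
chi_sq <- c(gender = 920.79, realty = 11.576, education = 1019.2)
n <- 282686 + 24825  # total rows, from table(cc_data$TARGET)

# Cramer's V = sqrt(chi^2 / (n * (k - 1))), k = smaller table dimension.
# TARGET has two levels, so k - 1 = 1 for all three tables.
cramers_v <- sqrt(chi_sq / (n * 1))
round(cramers_v, 3)
```

All three values are below 0.06, which is usually read as a weak association despite the very small p-values.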

data1 <- cc_data$AMT_INCOME_TOTAL[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_INCOME_TOTAL[cc_data$TARGET == "1"]
t.test(data1, data2)

## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 0.73067, df = 24920, p-value = 0.465
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5831.714 12763.636
## sample estimates:
## mean of x mean of y 
##  169077.7  165611.8

data1 <- cc_data$AMT_CREDIT[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_CREDIT[cc_data$TARGET == "1"]
t.test(data1, data2)

## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 19.273, df = 31161, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  40306.60 49432.91
## sample estimates:
## mean of x mean of y 
##  602648.3  557778.5

data1 <- cc_data$AMT_ANNUITY[cc_data$TARGET == "0"]
data2 <- cc_data$AMT_ANNUITY[cc_data$TARGET == "1"]
t.test(data1, data2)

## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 8.1473, df = 31195, p-value = 3.858e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  517.8364 845.9217
## sample estimates:
## mean of x mean of y 
##  27163.62  26481.74

data1 <- cc_data$DAYS_BIRTH[cc_data$TARGET == "0"]
data2 <- cc_data$DAYS_BIRTH[cc_data$TARGET == "1"]
t.test(data1, data2)

## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -45.006, df = 29749, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1307.932 -1198.764
## sample estimates:
## mean of x mean of y 
## -16138.18 -14884.83

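# Index with cc_data1's own TARGET column: cc_data has 307,511 rows and
# cc_data1 only 252,135, so cc_data$TARGET would misalign the two groups.
# The output printed below came from the misaligned version and should be re-run.
data1 <- cc_data1$DAYS_EMPLOYED[cc_data1$TARGET == "0"]
data2 <- cc_data1$DAYS_EMPLOYED[cc_data1$TARGET == "1"]
t.test(data1, data2)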

## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -0.68096, df = 24195, p-value = 0.4959
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -45.13094  21.85780
## sample estimates:
## mean of x mean of y 
## -2385.132 -2373.496
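When both the grouping column and the measured column live in the same data frame, the formula interface of `t.test()` keeps them aligned row by row and avoids building `data1` and `data2` by hand. A sketch on simulated data (the data frame `toy` and its values are hypothetical):

```r
set.seed(42)
# Hypothetical data: TARGET and a numeric column kept in the same data frame.
toy <- data.frame(
  TARGET     = rep(c(0, 1), times = c(200, 50)),
  AMT_CREDIT = c(rnorm(200, mean = 600000, sd = 100000),
                 rnorm(50,  mean = 560000, sd = 100000))
)

# The formula interface splits AMT_CREDIT by TARGET within the data frame.
res_formula <- t.test(AMT_CREDIT ~ TARGET, data = toy)

# Equivalent two-vector call, indexing each vector with its own data frame.
res_vectors <- t.test(toy$AMT_CREDIT[toy$TARGET == 0],
                      toy$AMT_CREDIT[toy$TARGET == 1])

# Both calls give the same Welch t statistic.
all.equal(unname(res_formula$statistic), unname(res_vectors$statistic))
```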

The tests show that “AMT_INCOME_TOTAL” has a very weak relationship with “TARGET” (p = 0.465). I then checked “NAME_INCOME_TYPE” and found that at least half of the applicants are middle class. Combined with the high average income (168K), it makes sense that income has very little effect on “TARGET”, so I drop it.

Q1, which gender is more likely to have late payments? Which age groups tend to have more late payments? And what are their education levels?

ggplot(cc_data) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = CODE_GENDER), position = "dodge") + 
  labs(title = "TARGET by Gender", 
       x = "Target by gender", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

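# Thresholds are multiples of 365 days: 9490 = 26, 12775 = 35, 16425 = 45 years.
# case_when() uses the first matching condition, so the groups do not overlap.
cc_data2 <- cc_data %>%
  mutate(age_group = case_when(
    DAYS_BIRTH > -9490 ~ "20+",                  # younger than 26
    between(DAYS_BIRTH, -12775, -9490) ~ "30+",  # 26 to 35
    between(DAYS_BIRTH, -16425, -12776) ~ "40+", # 35 to 45
    between(DAYS_BIRTH, -Inf, -16426) ~ "50+",   # older than 45
    .default = NA)) %>%
  print()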

## # A tibble: 307,511 × 11
##    CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
##    <chr>       <chr>      <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 M           Y          Second… Working      1  202500  4.07e5  24700.   -9461
##  2 F           N          Higher… State …      0  270000  1.29e6  35698.  -16765
##  3 M           Y          Second… Working      0   67500  1.35e5   6750   -19046
##  4 F           Y          Second… Working      0  135000  3.13e5  29686.  -19005
##  5 M           Y          Second… Working      0  121500  5.13e5  21866.  -19932
##  6 M           Y          Second… State …      0   99000  4.90e5  27518.  -16941
##  7 F           Y          Higher… Commer…      0  171000  1.56e6  41301   -13778
##  8 M           Y          Higher… State …      0  360000  1.53e6  42075   -18850
##  9 F           Y          Second… Pensio…      0  112500  1.02e6  33826.  -20099
## 10 M           Y          Second… Working      0  135000  4.05e5  20250   -14469
## # … with 307,501 more rows, 2 more variables: DAYS_EMPLOYED <dbl>,
## #   age_group <chr>, and abbreviated variable names ¹FLAG_OWN_REALTY,
## #   ²NAME_EDUCATION_TYPE, ³NAME_INCOME_TYPE, ⁴AMT_INCOME_TOTAL, ⁵AMT_CREDIT,
## #   ⁶AMT_ANNUITY, ⁷DAYS_BIRTH

ggplot(cc_data2) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = age_group), position = "dodge")+ 
  facet_wrap(~CODE_GENDER , nrow = 2) +
  labs(title = "Age & Gender effect on Target", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data2) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = NAME_EDUCATION_TYPE), position = "dodge")+ 
  facet_wrap(~CODE_GENDER , nrow = 2) +
  labs(title = "Education & Gender effect on Target", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Answer: In absolute counts, female holders have more late payments than male holders; both the youngest (“20+”) and the oldest (“50+”) groups have more than the other age groups; and many of the late payers have secondary education.
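Bar charts of counts mix two things: how large a group is and how often its members pay late. Comparing rates separates the two. The sketch below uses a made-up data frame (`toy` is hypothetical and says nothing about the real data) to show how equal counts can hide different rates:

```r
# Hypothetical applicants; TARGET = 1 marks a late payment.
toy <- data.frame(
  CODE_GENDER = c("F", "F", "F", "F", "F", "F", "M", "M", "M", "M"),
  TARGET      = c(0,   0,   0,   0,   1,   1,   0,   0,   1,   1)
)

# Number of late payments per gender.
late_counts <- tapply(toy$TARGET, toy$CODE_GENDER, sum)

# Late-payment rate per gender: the mean of a 0/1 column is a proportion.
late_rates <- tapply(toy$TARGET, toy$CODE_GENDER, mean)

rbind(count = late_counts, rate = round(late_rates, 2))
```

In this toy example both groups have two late payments, but the smaller group has the higher rate. The same `tapply()` call on `cc_data$TARGET` would give the rates for the real groups.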

Q2, how is “TARGET” related to length of employment? What are those employees’ education levels, and do they own real property?

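# Ranges are contiguous so that every tenure falls in exactly one group.
cc_data3 <- cc_data1 %>%
  mutate(employee_group = case_when(
    DAYS_EMPLOYED > -730 ~ "2-",                  # less than 2 years
    between(DAYS_EMPLOYED, -1648, -730) ~ "5-",   # 2 to about 4.5 years
    between(DAYS_EMPLOYED, -3175, -1649) ~ "9-",  # about 4.5 to 8.7 years
    between(DAYS_EMPLOYED, -Inf, -3176) ~ "10+",  # longer than about 8.7 years
    .default = NA)) %>%
  print()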

## # A tibble: 252,135 × 11
##    CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
##    <chr>       <chr>      <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 M           Y          Second… Working      1  202500  4.07e5  24700.   -9461
##  2 F           N          Higher… State …      0  270000  1.29e6  35698.  -16765
##  3 M           Y          Second… Working      0   67500  1.35e5   6750   -19046
##  4 F           Y          Second… Working      0  135000  3.13e5  29686.  -19005
##  5 M           Y          Second… Working      0  121500  5.13e5  21866.  -19932
##  6 M           Y          Second… State …      0   99000  4.90e5  27518.  -16941
##  7 F           Y          Higher… Commer…      0  171000  1.56e6  41301   -13778
##  8 M           Y          Higher… State …      0  360000  1.53e6  42075   -18850
##  9 M           Y          Second… Working      0  135000  4.05e5  20250   -14469
## 10 F           Y          Higher… Working      0  112500  6.52e5  21177   -10197
## # … with 252,125 more rows, 2 more variables: DAYS_EMPLOYED <dbl>,
## #   employee_group <chr>, and abbreviated variable names ¹FLAG_OWN_REALTY,
## #   ²NAME_EDUCATION_TYPE, ³NAME_INCOME_TYPE, ⁴AMT_INCOME_TOTAL, ⁵AMT_CREDIT,
## #   ⁶AMT_ANNUITY, ⁷DAYS_BIRTH
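Binning with one vector of break points is a way to make sure that no tenure value falls between two groups. A sketch using base R `cut()`, with break points assumed from the thresholds used above and test values chosen to sit on the boundaries (the vector `days` is hypothetical):

```r
# Hypothetical DAYS_EMPLOYED values chosen to sit on the group boundaries.
days <- c(-100, -730, -750, -1648, -1649, -3175, -3176, -5000)

# cut() with right = TRUE builds intervals (lower, upper], so consecutive
# breaks leave no gap and no overlap.
employee_group <- cut(
  days,
  breaks = c(-Inf, -3176, -1649, -730, Inf),
  labels = c("10+", "9-", "5-", "2-"),
  right  = TRUE
)

data.frame(days, employee_group)

# FALSE means every value was assigned to a group.
anyNA(employee_group)
```

Because `cut()` returns an ordered set of factor levels, the groups also appear in a fixed order in plot legends.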

ggplot(cc_data3) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = employee_group), position = "dodge") + 
  labs(title = "TARGET by length of employment", 
       x = "Employment-length group", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data3) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = NAME_EDUCATION_TYPE), position = "dodge")+ 
  facet_grid(employee_group ~ FLAG_OWN_REALTY  ) +
  labs(title = "Education, employment length & property ownership effect on Target", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Answer: The longer the employment, the fewer the late payments. People who own a house tend to have more late payments than those who do not, especially in the group with secondary education.

Q3, what is the effect of the total credit amount on “TARGET”? Is length of employment a factor? What about age?

ggplot(cc_data1) + 
  geom_boxplot(mapping = aes(y = as.factor(TARGET), x = AMT_CREDIT)) + 
  labs(title = "TARGET by Credit ", 
       x = "Credit amount", 
       y = "Target") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data3) + 
  geom_boxplot(mapping = aes(y = AMT_CREDIT, x =employee_group )) + 
  facet_wrap(~as.factor(TARGET),nrow = 2) + 
  labs(title = "Target by credit amount & length of employment", 
       y = "Credit amount", 
       x = "Length of employment") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data2) + 
  geom_boxplot(mapping = aes(x = AMT_CREDIT, y = age_group)) + 
  facet_wrap(~as.factor(TARGET),nrow = 2) + 
  labs(title = "Target by credit amount & age", 
       x = "Credit amount", 
       y = "Age group") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Answer: Customers with smaller credit amounts tend to have more late payments, and customers who have been working longer tend to have more late payments.

Q4, what is the effect of owning real property on “TARGET”? How does the income type affect it? Is the credit amount a factor?

ggplot(cc_data) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = FLAG_OWN_REALTY), position = "dodge")  + 
  labs(title = "Target by owning a property", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data) + 
  geom_bar(mapping = aes(x = as.factor(TARGET), fill = NAME_INCOME_TYPE), position = "dodge") + 
  labs(title = "Target by income type", 
       x = "Target", 
       y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

ggplot(cc_data) + 
  geom_boxplot(mapping = aes(x = AMT_CREDIT, y = NAME_INCOME_TYPE),position = "dodge") + 
  facet_wrap(~as.factor(TARGET),nrow = 2) + 
  labs(title = "Target by credit amount & income type", 
       x = "Credit amount", 
       y = "Income type") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(0.8)))

cc_data_stu <- cc_data %>%
  filter(NAME_INCOME_TYPE %in% c("Student", "Businessman", "Maternity leave")) %>%
  print()

## # A tibble: 33 × 10
##    CODE_GENDER FLAG_OWN…¹ NAME_…² NAME_…³ TARGET AMT_I…⁴ AMT_C…⁵ AMT_A…⁶ DAYS_…⁷
##    <chr>       <chr>      <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 M           Y          Second… Student      0  180000  5.24e5  22311   -18790
##  2 M           N          Higher… Student      0  135000  3.82e5  24512.   -9115
##  3 M           Y          Higher… Busine…      0  900000  2.25e6 112500   -20364
##  4 M           N          Higher… Busine…      0 2250000  1.35e6  67500   -12249
##  5 F           N          Second… Student      0   90000  3.14e5  16164   -16337
##  6 M           N          Higher… Student      0  225000  9.59e4  10332   -19180
##  7 M           N          Higher… Matern…      0  360000  7.65e5  76500   -22166
##  8 M           Y          Second… Student      0  144000  1.38e6  39712.  -14317
##  9 F           Y          Second… Student      0  112500  2.76e5  19778.  -12127
## 10 F           N          Second… Student      0  225000  6.60e5  60692.  -12280
## # … with 23 more rows, 1 more variable: DAYS_EMPLOYED <dbl>, and abbreviated
## #   variable names ¹FLAG_OWN_REALTY, ²NAME_EDUCATION_TYPE, ³NAME_INCOME_TYPE,
## #   ⁴AMT_INCOME_TOTAL, ⁵AMT_CREDIT, ⁶AMT_ANNUITY, ⁷DAYS_BIRTH

ggplot(cc_data_stu) + geom_bar(aes(as.factor(TARGET),fill=NAME_INCOME_TYPE),position= "dodge")

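Answer: Owning real property does have an effect on “TARGET”, but the income groups differ a lot: students and businessmen have no late payments at all, while the working class has the most late payments. Note that students, businessmen and applicants on maternity leave together account for only 33 rows, so the result for these groups rests on very few observations.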
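Zero late payments in a very small group does not mean the underlying rate is zero. An exact binomial confidence interval shows how much uncertainty remains. The group size of 20 below is a hypothetical value, since the per-group counts are not printed above:

```r
# Hypothetical group: 20 applicants, none with a late payment.
n_group <- 20
n_late  <- 0

# Exact (Clopper-Pearson) 95% confidence interval for the late-payment rate.
ci <- binom.test(n_late, n_group)$conf.int
round(ci, 3)
```

For a group of that size the upper bound is roughly 17%, which is well above the overall late-payment rate, so such a group cannot be called safer on this evidence alone.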

Project1_V3

Lin

2023-04-01

About the data

What it's about

The chosen variables

EDA PROCESS

Q1, which gender is more likely to have late payments? Which age groups tend to have more late payments? And what are their education levels?

Q3, what is the effect of the total credit amount on “TARGET”? Is length of employment a factor? What about age?

Q4, what is the effect of owning real property on “TARGET”? How does the income type affect it? Is the credit amount a factor?

Conclusion: