GOAL: create a model to predict which customers are likely to default on their credit card payments next month

EVALUATE THE MODEL: using AUC (area under the ROC curve) for binary classification models on the validation data.
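
As a rough illustration of the metric, AUC can be computed from the ranks of the predicted scores: it is the probability that a randomly chosen defaulter gets a higher score than a randomly chosen non-defaulter. The sketch below is an assumption-laden example, not the assignment's official scoring code; `auc_rank` is a hypothetical helper and the labels and scores are made up.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic).
# auc_rank is a hypothetical helper name, not part of any package.
auc_rank <- function(scores, labels, positive = "Y") {
  is_pos <- labels == positive
  n_pos <- as.numeric(sum(is_pos))
  n_neg <- as.numeric(sum(!is_pos))
  r <- rank(scores)  # ties receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up validation labels and predicted default probabilities
toy_labels <- factor(c("N", "N", "Y", "N", "Y", "Y"))
toy_scores <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70)

auc_toy <- auc_rank(toy_scores, toy_labels)
print(auc_toy)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of defaulters from non-defaulters.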

CRISP-DM report on the business problem, the data, data preparation, insights, details of model training including the assumptions, evaluation methodology, preliminary results, consideration of ethical issues

STATEMENT: outlining contributions of each team member to this assignment

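#Load ggplot2, which the plots below rely on
library(ggplot2)

#Import the data
#stringsAsFactors = TRUE keeps text columns as factors on R 4.0 and later, matching the output below
credit_training <- read.csv('AT2_credit_train_STUDENT.csv', header = TRUE, stringsAsFactors = TRUE)
credit_test <- read.csv('AT2_credit_test_STUDENT.csv', header = TRUE, stringsAsFactors = TRUE)
summary(credit_training)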

##        ID          LIMIT_BAL            SEX          EDUCATION    
##  Min.   :    1   Min.   :    -99   1      : 9244   Min.   :0.000  
##  1st Qu.: 7489   1st Qu.:  50000   2      :13854   1st Qu.:1.000  
##  Median :14987   Median : 140000   cat    :    1   Median :2.000  
##  Mean   :14981   Mean   : 167524   dog    :    1   Mean   :1.853  
##  3rd Qu.:22452   3rd Qu.: 240000   dolphin:    1   3rd Qu.:2.000  
##  Max.   :30000   Max.   :1000000                   Max.   :6.000  
##     MARRIAGE          AGE           PAY_PC1              PAY_PC2        
##  Min.   :0.000   Min.   : 21.0   Min.   :-11.859675   Min.   :-4.42243  
##  1st Qu.:1.000   1st Qu.: 28.0   1st Qu.: -0.393308   1st Qu.:-0.23617  
##  Median :2.000   Median : 34.0   Median : -0.393308   Median : 0.17555  
##  Mean   :1.553   Mean   : 35.7   Mean   : -0.001656   Mean   :-0.00177  
##  3rd Qu.:2.000   3rd Qu.: 41.0   3rd Qu.:  1.360047   3rd Qu.: 0.36112  
##  Max.   :3.000   Max.   :141.0   Max.   :  3.813348   Max.   : 5.44103  
##     PAY_PC3             AMT_PC1            AMT_PC2        
##  Min.   :-3.864638   Min.   :-3.41080   Min.   :-4.71769  
##  1st Qu.:-0.283941   1st Qu.:-1.50827   1st Qu.:-0.42961  
##  Median : 0.004886   Median :-0.86433   Median :-0.20780  
##  Mean   : 0.000652   Mean   : 0.00461   Mean   : 0.00137  
##  3rd Qu.: 0.093942   3rd Qu.: 0.49766   3rd Qu.: 0.09062  
##  Max.   : 3.364030   Max.   :37.49240   Max.   :83.52137  
##     AMT_PC3             AMT_PC4              AMT_PC5         
##  Min.   :-38.46500   Min.   :-21.593416   Min.   :-42.37665  
##  1st Qu.: -0.13710   1st Qu.: -0.068199   1st Qu.: -0.08239  
##  Median : -0.07044   Median :  0.018389   Median : -0.03200  
##  Mean   :  0.00383   Mean   :  0.004618   Mean   :  0.00148  
##  3rd Qu.:  0.00325   3rd Qu.:  0.083236   3rd Qu.:  0.02644  
##  Max.   : 21.98483   Max.   : 21.823749   Max.   : 17.43097  
##     AMT_PC6             AMT_PC7          default  
##  Min.   :-38.88504   Min.   :-41.71546   N:17518  
##  1st Qu.: -0.04241   1st Qu.: -0.09273   Y: 5583  
##  Median : -0.00216   Median : -0.04099            
##  Mean   : -0.00202   Mean   : -0.00409            
##  3rd Qu.:  0.06754   3rd Qu.:  0.03157            
##  Max.   : 20.22670   Max.   : 22.92727

str(credit_training)

## 'data.frame':    23101 obs. of  17 variables:
##  $ ID       : int  1 2 3 4 6 7 8 9 10 11 ...
##  $ LIMIT_BAL: num  20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
##  $ SEX      : Factor w/ 5 levels "1","2","cat",..: 2 2 2 2 1 1 2 2 1 2 ...
##  $ EDUCATION: int  2 2 2 2 1 1 2 3 3 3 ...
##  $ MARRIAGE : int  1 2 2 1 2 2 2 1 2 2 ...
##  $ AGE      : int  24 26 34 37 37 29 23 28 35 34 ...
##  $ PAY_PC1  : num  0.477 -1.462 -0.393 -0.393 -0.393 ...
##  $ PAY_PC2  : num  -3.225 0.854 0.176 0.176 0.176 ...
##  $ PAY_PC3  : num  0.14504 -0.36086 0.00489 0.00489 0.00489 ...
##  $ AMT_PC1  : num  -1.752 -1.663 -1.135 -0.397 -0.393 ...
##  $ AMT_PC2  : num  -0.224 -0.144 -0.177 -0.451 -0.5 ...
##  $ AMT_PC3  : num  -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
##  $ AMT_PC4  : num  0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
##  $ AMT_PC5  : num  -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
##  $ AMT_PC6  : num  0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
##  $ AMT_PC7  : num  -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
##  $ default  : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...

Observations of the data

Customer ID is the first column. It should be treated as an identifier (factor or character) rather than an integer.

Check the “sex” column: why does it have 5 levels? Besides 1 and 2 there are “cat”, “dog” and “dolphin”, which are incorrect data, as these are animal descriptors rather than sex descriptors.

Education has levels from 0 to 6. We assume these refer to differing levels of education status.

Marriage can be 0 to 3, so there are 4 different categories of marital status.

Age presumably describes how old the card holder (ID) is. THIS NEEDS FIXING: there are implausible ages over 100 (the maximum is 141).

We do not yet understand what the PAY_PC columns refer to, nor the AMT_PC columns.

Default is the target variable, as this is what we are trying to predict.
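
The checks behind these observations can be sketched as simple counts. The data frame below is a made-up stand-in for the training data (only the column names and the anomalous values -99, 141 and "cat" come from the summary above), so the counts are illustrative only.

```r
# Made-up rows mimicking the anomalies seen in summary(credit_training)
toy <- data.frame(
  LIMIT_BAL = c(20000, -99, 140000, 50000),
  SEX = factor(c("1", "2", "cat", "2")),
  AGE = c(24L, 141L, 34L, 37L)
)

# Flag each kind of suspicious value
invalid_sex <- !(as.character(toy$SEX) %in% c("1", "2"))
implausible_age <- toy$AGE >= 100
negative_limit <- toy$LIMIT_BAL < 0

anomaly_counts <- c(
  sex = sum(invalid_sex),
  age = sum(implausible_age),
  limit = sum(negative_limit)
)
print(anomaly_counts)
```

Running the same three comparisons on the real columns would show how many rows each cleaning rule affects before any values are changed.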

#Clean a raw credit data set: ID becomes character, invalid SEX codes become 0,
#implausible ages become NA, and the categorical columns become factors
clean_data <- function(dataSet) {

  output <- dataSet

  #IDs are identifiers, so store them as character so they are not counted as numbers
  output$ID <- as.character(output$ID)

  #clean up sex: keep the valid codes 1 and 2 and change the animal classifications to 0
  #comparing labels avoids as.integer() on a factor, which returns level codes
  sex_label <- as.character(output$SEX)
  output$SEX <- as.factor(ifelse(sex_label %in% c("1", "2"), sex_label, "0"))

  #clean age: entries aged 100 or over become NA
  output$AGE <- ifelse(output$AGE >= 100, NA, output$AGE)

  output$EDUCATION <- as.factor(output$EDUCATION)
  output$MARRIAGE <- as.factor(output$MARRIAGE)

  return(output)
}

credit_train_clean <- clean_data(credit_training)
credit_test_clean <- clean_data(credit_test)

str(credit_train_clean)

## 'data.frame':    23101 obs. of  17 variables:
##  $ ID       : chr  "1" "2" "3" "4" ...
##  $ LIMIT_BAL: num  20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
##  $ SEX      : Factor w/ 3 levels "0","1","2": 3 3 3 3 2 2 3 3 2 3 ...
##  $ EDUCATION: Factor w/ 7 levels "0","1","2","3",..: 3 3 3 3 2 2 3 4 4 4 ...
##  $ MARRIAGE : Factor w/ 4 levels "0","1","2","3": 2 3 3 2 3 3 3 2 3 3 ...
##  $ AGE      : int  24 26 34 37 37 29 23 28 35 34 ...
##  $ PAY_PC1  : num  0.477 -1.462 -0.393 -0.393 -0.393 ...
##  $ PAY_PC2  : num  -3.225 0.854 0.176 0.176 0.176 ...
##  $ PAY_PC3  : num  0.14504 -0.36086 0.00489 0.00489 0.00489 ...
##  $ AMT_PC1  : num  -1.752 -1.663 -1.135 -0.397 -0.393 ...
##  $ AMT_PC2  : num  -0.224 -0.144 -0.177 -0.451 -0.5 ...
##  $ AMT_PC3  : num  -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
##  $ AMT_PC4  : num  0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
##  $ AMT_PC5  : num  -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
##  $ AMT_PC6  : num  0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
##  $ AMT_PC7  : num  -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
##  $ default  : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...

str(credit_test_clean)

## 'data.frame':    6899 obs. of  16 variables:
##  $ ID       : chr  "5" "17" "19" "23" ...
##  $ LIMIT_BAL: int  50000 20000 360000 70000 60000 50000 50000 500000 280000 60000 ...
##  $ SEX      : Factor w/ 2 levels "1","2": 1 1 2 2 1 1 1 2 1 2 ...
##  $ EDUCATION: Factor w/ 7 levels "0","1","2","3",..: 3 2 2 3 2 2 3 3 3 3 ...
##  $ MARRIAGE : Factor w/ 4 levels "0","1","2","3": 2 3 2 3 3 3 3 2 2 3 ...
##  $ AGE      : int  57 24 49 26 27 26 33 54 40 22 ...
##  $ PAY_PC1  : num  0.273 -3.294 2.876 -3.211 1.425 ...
##  $ PAY_PC2  : num  0.847 1.835 -1.4 0.831 -0.57 ...
##  $ PAY_PC3  : num  -0.0458 -0.2383 1.297 1.7682 1.1754 ...
##  $ AMT_PC1  : num  -0.793 -1.117 -1.797 -0.148 -1.785 ...
##  $ AMT_PC2  : num  0.864 -0.249 -0.22 -0.431 -0.169 ...
##  $ AMT_PC3  : num  -0.64 -0.0889 -0.0704 -0.1271 -0.0622 ...
##  $ AMT_PC4  : num  0.21823 0.02798 0.01867 0.08418 -0.00615 ...
##  $ AMT_PC5  : num  -0.46541 -0.10028 -0.032 0.04805 -0.00469 ...
##  $ AMT_PC6  : num  0.30533 -0.04551 -0.00997 0.1519 0.01235 ...
##  $ AMT_PC7  : num  -1.0232 0.1004 -0.0433 -0.1224 -0.0813 ...
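
One detail worth keeping in mind when recoding a factor column such as SEX: `as.integer()` applied to a factor returns the internal level codes, not the labels. A minimal sketch with made-up labels (not the real data) shows the difference.

```r
# Made-up factor whose labels do not match their level codes
f <- factor(c("10", "2", "cat"))

# Level codes: positions in levels(f), which are sorted as text
codes <- as.integer(f)

# Label values: convert to character first; non-numeric labels become NA
values <- suppressWarnings(as.integer(as.character(f)))

print(levels(f))
print(codes)
print(values)
```

For the training data the two approaches happen to agree for the labels "1" and "2", because those labels are also the first two levels, but that is a coincidence of the level ordering rather than a guarantee.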

plot(credit_train_clean)

#check test and train are the same, with no imbalances
plot(credit_test_clean)
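
A numeric complement to the two plots is to compare the share of each category in train and test. This is only a sketch: `compare_levels` is a hypothetical helper and the vectors are made up, but the same call could be applied to columns such as SEX, EDUCATION or MARRIAGE.

```r
# Compare the proportion of each level in two categorical vectors.
# compare_levels is a hypothetical helper written for this illustration.
compare_levels <- function(train_col, test_col) {
  lv <- union(levels(factor(train_col)), levels(factor(test_col)))
  train_p <- prop.table(table(factor(train_col, levels = lv)))
  test_p <- prop.table(table(factor(test_col, levels = lv)))
  data.frame(level = lv, train = as.numeric(train_p), test = as.numeric(test_p))
}

# Made-up category codes standing in for one column of each data set
toy_train <- c("1", "2", "2", "2", "0")
toy_test <- c("1", "2", "2", "1")

cmp <- compare_levels(toy_train, toy_test)
print(cmp)
```

A level that appears in only one of the two sets shows up with a proportion of 0 in the other, which is the kind of imbalance the comment above is looking for.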

## Let’s see what we can see and dig in to do some EDA

#check distribution of education and limit balance against defaults
#Education 1 = grad school, 2 = university, 3 = high school, 4 = others, 5 = unknown, 6 = unknown
p <- ggplot(credit_train_clean, aes(EDUCATION, LIMIT_BAL))
p + geom_boxplot(aes(colour = default)) +
  labs(title = "Boxplot: Education vs Credit Card Limit Balance", subtitle = "Defaults in green", x = "0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")

#People in the 0 education class have no defaults at all.

#check distribution of marriage and limit balance against defaults
p <- ggplot(credit_train_clean, aes(MARRIAGE, LIMIT_BAL))
p + geom_boxplot(aes(colour = default)) +
  labs(title = "Boxplot: Marriage Status vs Balance", subtitle = "Defaults in green", x = "0 = unknown, 1=married, 2=single, 3=others")

#check distribution of sex and limit balance against defaults
p <- ggplot(credit_train_clean, aes(SEX, LIMIT_BAL))
p + geom_boxplot(aes(colour = default)) +
  labs(title = "Boxplot: Sex vs Limit Balance", subtitle = "Defaults in green", x = "0=unknown, 1=male, 2=female")

Some observations

People with lower limit balances are more likely to be defaulters.
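
One way to check this observation numerically, rather than by eye, is to bin LIMIT_BAL and compute the share of defaults in each bin. The sketch below uses made-up values and arbitrary band boundaries, so its output says nothing about the real data.

```r
# Made-up limit balances and default flags (not results from the credit data)
toy <- data.frame(
  LIMIT_BAL = c(20000, 30000, 50000, 50000, 140000, 200000, 240000, 500000),
  default = factor(c("Y", "Y", "N", "Y", "Y", "N", "N", "N"))
)

# Assumed band boundaries, chosen only for illustration
toy$limit_band <- cut(
  toy$LIMIT_BAL,
  breaks = c(0, 50000, 200000, Inf),
  labels = c("low", "mid", "high")
)

# Share of defaulters within each band
default_rate <- tapply(toy$default == "Y", toy$limit_band, mean)
print(default_rate)
```

On the real training set, quantile-based breaks (for example from `quantile(LIMIT_BAL)`) may be a more natural choice than fixed boundaries.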

#plotting sex, age and defaults
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AGE, y = SEX, color = default))+
  labs(title = "Scatterplot: Sex, Age and Defaults", y = "0=unknown, 1=male, 2=female")

## Warning: Removed 50 rows containing missing values (geom_point).

#so the animals were all defaults (three entries only)

#plotting education, limit balance and defaults
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = EDUCATION, y = LIMIT_BAL, color = default))+
  labs(title = "Scatterplot: Education, Limit Balance & Defaults", x = "0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")

#plotting marriage, limit balance and defaults
#marital 1 = married, 2 = single, 3 = others but what does 0 mean (not available?)?
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = LIMIT_BAL, y = AGE, color = default))+
  facet_grid(~ MARRIAGE)+
  labs(title = "Facet grid of Marital Status by Age and Defaults", subtitle = "Marital status codes: 0 = unknown, 1=married, 2=single, 3=others")+
  theme(axis.text.x = element_text(angle=90, hjust = 1))

## Warning: Removed 50 rows containing missing values (geom_point).

#plotting education, sex, limit balance and defaults
#education 0 = not specified, 1 = grad school, 2 = university, 3 = high school, 4 = others, 5 and 6 = unknown
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = LIMIT_BAL, y = SEX, color = default))+
  facet_grid(~ EDUCATION)+
  labs(title = "Facet grid of each Education status by Sex, Limit Balance & Defaults", subtitle = "Education codes: 0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")+
  theme(axis.text.x = element_text(angle=90, hjust = 1))

Need to investigate Principal Components

Need to understand how the transformations occurred and what they mean. Principal component analysis (PCA) is a maths procedure that transforms correlated variables into uncorrelated variables called principal components. The FIRST PC accounts for as much variability as possible, and each later PC accounts for as much of the remaining variability as possible. PCA reduces an attribute space with a large number of variables to a smaller number of components, so it is a dimensionality reduction (data compression) method. The goal is dimension reduction, but there is no guarantee the results are interpretable. Hence, take the EDA with a grain of salt: note the mix of numbers and the negatives, which make it hard to see what a component really ‘means’. Any interpretation is based on the original variable having the highest correlation with the principal component.
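
To make the idea concrete, here is a minimal sketch of PCA on made-up data using base R's `prcomp`. The variables `x1`, `x2` and `x3` are invented for illustration, and we do not know which settings were used to produce the PAY_PC and AMT_PC columns, so centring and scaling here are assumptions.

```r
set.seed(42)
n <- 200
base <- rnorm(n)

# Made-up data: x1 and x2 are strongly correlated, x3 is independent noise
toy <- data.frame(
  x1 = base + rnorm(n, sd = 0.1),
  x2 = base + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)

# Centre and scale, then rotate onto the principal components
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component (decreasing order)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Component scores: the transformed, mutually uncorrelated variables
scores <- pca$x

# Correlation of each original variable with each component,
# which is what any interpretation of a component relies on
loadings_cor <- cor(toy, scores)

print(round(var_explained, 3))
print(round(loadings_cor, 2))
```

The scores can be negative and have no natural unit, which is why the PAY_PC and AMT_PC values are hard to read directly.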

#Understanding the PAY_PC1 variable
#these are the first three principal components of repayment status (reduces the first 6 variables to three principal components)
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = PAY_PC1, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).

#Understanding the PAY_PC2 variable
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = PAY_PC2, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).

#Understanding the PAY_PC3 variable
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = PAY_PC3, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).


#Understanding the AMT_PC1 variables
#First 7 principal components of the bill statement amount, and the amount of previous payments from April to September (12 variables reduced to 7 variables while retaining 90% of the variation)

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC1, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).
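
The note above about keeping 7 of 12 components while retaining 90% of the variation corresponds to a cumulative-variance cut-off. A sketch of how such a cut-off can be found is shown below; the 12 columns are simulated, so the number of components it selects is not expected to match the 7 used in the real data.

```r
set.seed(1)
n <- 300

# Simulated stand-in for 12 correlated amount variables:
# 3 hidden factors mixed into 12 columns, plus noise
latent <- matrix(rnorm(n * 3), ncol = 3)
mixing <- matrix(runif(3 * 12, min = -1, max = 1), nrow = 3)
toy_amounts <- latent %*% mixing + matrix(rnorm(n * 12, sd = 0.3), ncol = 12)

pca <- prcomp(toy_amounts, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the 90% threshold
k <- which(cum_var >= 0.90)[1]

print(round(cum_var, 3))
print(k)
```

The first `k` columns of `pca$x` would then play the role that AMT_PC1 to AMT_PC7 play in this data set.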

#Understanding the AMT_PC2 variable

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC2, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).

#Understanding the AMT_PC3 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC3, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).


#Understanding the AMT_PC4 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC4, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).

#Understanding the AMT_PC5 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC5, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).

#Understanding the AMT_PC6 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC6, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).

#Understanding the AMT_PC7 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC7, y = AGE, color = default))

## Warning: Removed 50 rows containing missing values (geom_point).

AT2EDAbyAB

Alex Brooks

29 September 2018
