GOAL: create a model to predict which customers are likely to default on their credit card payments next month

TO DO - clean up age: treat values over 100 as missing and impute them with the median; make age, sex and marriage match between the test and train sets

EVALUATE THE MODEL: using AUC (area under the ROC curve) for binary classification models on the validation data.
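As a minimal sketch of the evaluation metric, AUC can be computed in base R from the rank-sum (Mann-Whitney) identity, without any extra package. The helper name `auc_rank` and the toy labels and scores are illustrative assumptions, not part of the assignment code.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
# `labels` is a vector of "Y"/"N"; `scores` are predicted probabilities of "Y".
auc_rank <- function(labels, scores) {
  pos <- labels == "Y"
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (made-up values, not model output)
toy_labels <- c("N", "N", "Y", "N", "Y", "Y")
toy_scores <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70)
toy_auc <- auc_rank(toy_labels, toy_scores)
print(toy_auc)
```

The value is the share of (defaulter, non-defaulter) pairs in which the defaulter gets the higher score, so 0.5 is no better than chance and 1 is a perfect ranking.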

CRISP-DM report on the business problem, the data, data preparation, insights, details of model training including the assumptions, evaluation methodology, preliminary results, consideration of ethical issues

STATEMENT: outlining contributions of each team member to this assignment

#Load the plotting packages used later, then import the data
library(ggplot2)
library(corrplot)
credit_training <- read.csv('AT2_credit_train_STUDENT.csv', header = TRUE)
credit_test <- read.csv('AT2_credit_test_STUDENT.csv', header = TRUE)
summary(credit_training)

##        ID          LIMIT_BAL            SEX          EDUCATION    
##  Min.   :    1   Min.   :    -99   1      : 9244   Min.   :0.000  
##  1st Qu.: 7489   1st Qu.:  50000   2      :13854   1st Qu.:1.000  
##  Median :14987   Median : 140000   cat    :    1   Median :2.000  
##  Mean   :14981   Mean   : 167524   dog    :    1   Mean   :1.853  
##  3rd Qu.:22452   3rd Qu.: 240000   dolphin:    1   3rd Qu.:2.000  
##  Max.   :30000   Max.   :1000000                   Max.   :6.000  
##     MARRIAGE          AGE           PAY_PC1              PAY_PC2        
##  Min.   :0.000   Min.   : 21.0   Min.   :-11.859675   Min.   :-4.42243  
##  1st Qu.:1.000   1st Qu.: 28.0   1st Qu.: -0.393308   1st Qu.:-0.23617  
##  Median :2.000   Median : 34.0   Median : -0.393308   Median : 0.17555  
##  Mean   :1.553   Mean   : 35.7   Mean   : -0.001656   Mean   :-0.00177  
##  3rd Qu.:2.000   3rd Qu.: 41.0   3rd Qu.:  1.360047   3rd Qu.: 0.36112  
##  Max.   :3.000   Max.   :141.0   Max.   :  3.813348   Max.   : 5.44103  
##     PAY_PC3             AMT_PC1            AMT_PC2        
##  Min.   :-3.864638   Min.   :-3.41080   Min.   :-4.71769  
##  1st Qu.:-0.283941   1st Qu.:-1.50827   1st Qu.:-0.42961  
##  Median : 0.004886   Median :-0.86433   Median :-0.20780  
##  Mean   : 0.000652   Mean   : 0.00461   Mean   : 0.00137  
##  3rd Qu.: 0.093942   3rd Qu.: 0.49766   3rd Qu.: 0.09062  
##  Max.   : 3.364030   Max.   :37.49240   Max.   :83.52137  
##     AMT_PC3             AMT_PC4              AMT_PC5         
##  Min.   :-38.46500   Min.   :-21.593416   Min.   :-42.37665  
##  1st Qu.: -0.13710   1st Qu.: -0.068199   1st Qu.: -0.08239  
##  Median : -0.07044   Median :  0.018389   Median : -0.03200  
##  Mean   :  0.00383   Mean   :  0.004618   Mean   :  0.00148  
##  3rd Qu.:  0.00325   3rd Qu.:  0.083236   3rd Qu.:  0.02644  
##  Max.   : 21.98483   Max.   : 21.823749   Max.   : 17.43097  
##     AMT_PC6             AMT_PC7          default  
##  Min.   :-38.88504   Min.   :-41.71546   N:17518  
##  1st Qu.: -0.04241   1st Qu.: -0.09273   Y: 5583  
##  Median : -0.00216   Median : -0.04099            
##  Mean   : -0.00202   Mean   : -0.00409            
##  3rd Qu.:  0.06754   3rd Qu.:  0.03157            
##  Max.   : 20.22670   Max.   : 22.92727

str(credit_training)

## 'data.frame':    23101 obs. of  17 variables:
##  $ ID       : int  1 2 3 4 6 7 8 9 10 11 ...
##  $ LIMIT_BAL: num  20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
##  $ SEX      : Factor w/ 5 levels "1","2","cat",..: 2 2 2 2 1 1 2 2 1 2 ...
##  $ EDUCATION: int  2 2 2 2 1 1 2 3 3 3 ...
##  $ MARRIAGE : int  1 2 2 1 2 2 2 1 2 2 ...
##  $ AGE      : int  24 26 34 37 37 29 23 28 35 34 ...
##  $ PAY_PC1  : num  0.477 -1.462 -0.393 -0.393 -0.393 ...
##  $ PAY_PC2  : num  -3.225 0.854 0.176 0.176 0.176 ...
##  $ PAY_PC3  : num  0.14504 -0.36086 0.00489 0.00489 0.00489 ...
##  $ AMT_PC1  : num  -1.752 -1.663 -1.135 -0.397 -0.393 ...
##  $ AMT_PC2  : num  -0.224 -0.144 -0.177 -0.451 -0.5 ...
##  $ AMT_PC3  : num  -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
##  $ AMT_PC4  : num  0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
##  $ AMT_PC5  : num  -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
##  $ AMT_PC6  : num  0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
##  $ AMT_PC7  : num  -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
##  $ default  : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...

Observations of the data

Customer ID is the first column. It should be a factor (or character) rather than an integer.

The “sex” column has 5 levels instead of 2: alongside 1 and 2 there are “cat”, “dog” and “dolphin”, which are animal names rather than sex descriptors, so these rows are incorrect data.

Education has levels from 0 to 6, which we assume refer to different levels of education.

Marriage ranges from 0 to 3, so there are 4 different marital-status categories.

Age presumably describes how old the card holder (ID) is. THIS NEEDS FIXING: there are implausible ages over 100 (the maximum is 141).

We do not yet understand what the PAY_PC and AMT_PC columns refer to.

Default is the target variable, as this is what we are trying to predict.

DATA CLEANING - remove the rows with animal names in the sex column, because recoding them to 0 creates a category that does not exist in the test set (training then has 0, 1 and 2 for sex, but the test set has no 0). Some limit balances in the training set are -99, which is nonsensical.
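A quick way to size these problems before cleaning is to count the suspicious entries per column. This is only a sketch on a made-up data frame; `count_anomalies` is a hypothetical helper, and treating negative limit balances as missing is an assumption about how the -99 values could be handled, not something the notebook does.

```r
# Toy data frame standing in for the training set (values are made up).
toy <- data.frame(
  LIMIT_BAL = c(20000, -99, 90000, 50000),
  SEX       = c("1", "2", "cat", "2"),
  AGE       = c(24, 141, 34, 37),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the suspicious entries in each column.
count_anomalies <- function(df) {
  c(
    negative_limit = sum(df$LIMIT_BAL < 0),
    invalid_sex    = sum(!(as.character(df$SEX) %in% c("1", "2"))),
    age_100_plus   = sum(df$AGE >= 100)
  )
}

anomalies <- count_anomalies(toy)
print(anomalies)

# One possible treatment (an assumption): mark negative limits as missing.
toy$LIMIT_BAL[toy$LIMIT_BAL < 0] <- NA
```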

##OLD CLEANING FUNCTION
#need to change ID to be "character", so the IDs are not counted as numbers

## clean_data <- function(dataSet) {
  
  #clean up sex, call it new_sex and change the animal classifications to 0
 # output <- dataSet
  
 # output$ID <- as.character(output$ID)
  
  #output$SEX <- as.integer(output$SEX)
  
  #output$SEX[output$SEX > 2] <- 0
  #clean age to remove aged over 100 entries will become NA
 # output$AGE <- ifelse(output$AGE >=100, NA, output$AGE)
  
 # output$SEX <- as.factor(output$SEX)
  #output$EDUCATION <- as.factor(output$EDUCATION)
  #output$MARRIAGE <- as.factor(output$MARRIAGE)
  
 # return(output)
#}

New cleaning function with more recent clean up observations

#Data cleaning function we can use each time to make sure we share the same dataset  
#Cleaned ID into character, Sex into two values (rows with animals removed), Age of 100 or over into NAs and then those NAs into the rounded mean age of 35, Education into 4 categories, Marriage into 3 categories. See explanation in markdown if you want to know more.
#also removed previously coded factors of marriage, sex and education

clean_data <- function(dataSet) {

  output <- dataSet

  #store ID as character so the IDs are not treated as numbers
  output$ID <- as.character(output$ID)

  #clean up sex: convert via character so factor level codes are not mistaken for values
  #the animal entries become NA here and are dropped, leaving only 1 and 2
  output$SEX <- suppressWarnings(as.numeric(as.character(output$SEX)))
  output <- output[output$SEX %in% c(1, 2), ]

  #clean age: entries of 100 or over become NA first
  output$AGE <- ifelse(output$AGE >= 100, NA, output$AGE)
  #the NAs then become the mean of the remaining ages, which is rounded to 35
  output$AGE[is.na(output$AGE)] <- round(mean(output$AGE, na.rm = TRUE))

  #set all education codes greater than 4 to 4, and also set 0 to 4
  output$EDUCATION[output$EDUCATION > 4] <- 4
  output$EDUCATION[output$EDUCATION == 0] <- 4
  #clean marriage: fold 0 into 3 (others)
  output$MARRIAGE[output$MARRIAGE == 0] <- 3

  return(output)
}
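The age step is easier to follow in isolation. The sketch below uses made-up ages to show the two stages (flag implausible values as NA, then fill them) and how a mean fill can differ from a median fill; the vector and variable names are illustrative only.

```r
# Made-up ages, including two implausible entries
ages <- c(24, 26, 34, 37, 141, 29, 150)

# Stage 1: values of 100 or over are treated as missing
ages[ages >= 100] <- NA

# Stage 2: candidate fill values computed from the plausible ages only
fill_mean   <- round(mean(ages, na.rm = TRUE))
fill_median <- median(ages, na.rm = TRUE)

# Fill with the rounded mean, as the cleaning function does
ages_imputed <- ifelse(is.na(ages), fill_mean, ages)
print(c(mean_fill = fill_mean, median_fill = fill_median))
print(ages_imputed)
```

Computing the fill value after the outliers are set to NA matters: including ages such as 141 would pull the mean upwards.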

Check the data again

#clean both train and test sets using function above


credit_train_clean <- clean_data(credit_training)
credit_test_clean <- clean_data(credit_test)



#cleaned train
str(credit_train_clean)

## 'data.frame':    23098 obs. of  17 variables:
##  $ ID       : chr  "1" "2" "3" "4" ...
##  $ LIMIT_BAL: num  20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
##  $ SEX      : num  2 2 2 2 1 1 2 2 1 2 ...
##  $ EDUCATION: num  2 2 2 2 1 1 2 3 3 3 ...
##  $ MARRIAGE : num  1 2 2 1 2 2 2 1 2 2 ...
##  $ AGE      : num  24 26 34 37 37 29 23 28 35 34 ...
##  $ PAY_PC1  : num  0.477 -1.462 -0.393 -0.393 -0.393 ...
##  $ PAY_PC2  : num  -3.225 0.854 0.176 0.176 0.176 ...
##  $ PAY_PC3  : num  0.14504 -0.36086 0.00489 0.00489 0.00489 ...
##  $ AMT_PC1  : num  -1.752 -1.663 -1.135 -0.397 -0.393 ...
##  $ AMT_PC2  : num  -0.224 -0.144 -0.177 -0.451 -0.5 ...
##  $ AMT_PC3  : num  -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
##  $ AMT_PC4  : num  0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
##  $ AMT_PC5  : num  -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
##  $ AMT_PC6  : num  0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
##  $ AMT_PC7  : num  -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
##  $ default  : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...

#clean test and check for consistency
str(credit_test_clean)

## 'data.frame':    6899 obs. of  16 variables:
##  $ ID       : chr  "5" "17" "19" "23" ...
##  $ LIMIT_BAL: int  50000 20000 360000 70000 60000 50000 50000 500000 280000 60000 ...
##  $ SEX      : num  1 1 2 2 1 1 1 2 1 2 ...
##  $ EDUCATION: num  2 1 1 2 1 1 2 2 2 2 ...
##  $ MARRIAGE : num  1 2 1 2 2 2 2 1 1 2 ...
##  $ AGE      : num  57 24 49 26 27 26 33 54 40 22 ...
##  $ PAY_PC1  : num  0.273 -3.294 2.876 -3.211 1.425 ...
##  $ PAY_PC2  : num  0.847 1.835 -1.4 0.831 -0.57 ...
##  $ PAY_PC3  : num  -0.0458 -0.2383 1.297 1.7682 1.1754 ...
##  $ AMT_PC1  : num  -0.793 -1.117 -1.797 -0.148 -1.785 ...
##  $ AMT_PC2  : num  0.864 -0.249 -0.22 -0.431 -0.169 ...
##  $ AMT_PC3  : num  -0.64 -0.0889 -0.0704 -0.1271 -0.0622 ...
##  $ AMT_PC4  : num  0.21823 0.02798 0.01867 0.08418 -0.00615 ...
##  $ AMT_PC5  : num  -0.46541 -0.10028 -0.032 0.04805 -0.00469 ...
##  $ AMT_PC6  : num  0.30533 -0.04551 -0.00997 0.1519 0.01235 ...
##  $ AMT_PC7  : num  -1.0232 0.1004 -0.0433 -0.1224 -0.0813 ...

View(credit_test_clean)

plot(credit_train_clean)

#check test and train are the same, with no imbalances
plot(credit_test_clean)
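Beyond eyeballing the pair plots, one way to check that test and train line up is to compare the category values present in each set. A minimal sketch on made-up data; `compare_levels` and the toy data frames are hypothetical and not part of the notebook.

```r
# Hypothetical helper: report values of a column that appear in one set only.
compare_levels <- function(train, test, column) {
  train_vals <- sort(unique(train[[column]]))
  test_vals  <- sort(unique(test[[column]]))
  list(
    only_in_train = setdiff(train_vals, test_vals),
    only_in_test  = setdiff(test_vals, train_vals)
  )
}

# Toy stand-ins for the two sets (made-up values)
toy_train <- data.frame(SEX = c(1, 2, 0, 2), MARRIAGE = c(1, 2, 3, 3))
toy_test  <- data.frame(SEX = c(1, 2, 2, 1), MARRIAGE = c(1, 2, 3, 1))

sex_check      <- compare_levels(toy_train, toy_test, "SEX")
marriage_check <- compare_levels(toy_train, toy_test, "MARRIAGE")
print(sex_check)
print(marriage_check)
```

In the toy data the 0 category shows up as train-only for SEX, which is the kind of mismatch the cleaning notes are concerned with.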

#check sex, limit balance and remove animals

Let’s see what we can see and dig in to do some EDA

#check correlation between the numeric variables on train
#cor() needs numeric columns, so drop ID (column 1) and sex, education and marriage (columns 3 to 5)
#this keeps column 2 (limit balance), age and the principal components
removed <- credit_train_clean[,-c(1,3:5)]
#default is now column 13 and is a factor, so drop it as well
removed <- removed[,-c(13)]
corplot <-cor(removed)
corrplot(corplot, method="pie")

#check correlation between variables on test - they seem to match train, so that's good
#drop ID (column 1) and sex, education and marriage (columns 3 to 5), keeping limit balance, age and the principal components
removed <- credit_test_clean[,-c(1,3:5)]
#the test set has no default column, so there is no 13th column to drop
corplot <-cor(removed)
corrplot(corplot, method="pie")

#let's count the number of defaults and non-defaults
train_defaults <- data.frame(
  label = c("Y", "N"),
  value = c(sum(credit_train_clean$default == "Y"),
            sum(credit_train_clean$default == "N"))
)

ggplot(train_defaults, aes(x = label, y = value)) + 
  geom_bar(stat="identity")
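The same counts, plus the class proportions, can be read off with `table()` and `prop.table()`. In the raw summary above, 5,583 of 23,101 training rows are defaults, roughly 24%, so the classes are imbalanced. The sketch below uses a made-up target vector rather than the real data.

```r
# Toy target vector (made up); in the notebook this would be credit_train_clean$default
toy_default <- factor(c("N", "N", "Y", "N", "Y", "N", "N", "N"), levels = c("N", "Y"))

default_counts <- table(toy_default)          # counts per class
default_share  <- prop.table(default_counts)  # proportions per class

print(default_counts)
print(round(default_share, 3))
```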

#check distribution of education and limit balance against defaults on train
#Education after cleaning: 1 = grad school, 2 = university, 3 = high school, 4 = others/unknown (original codes 0, 4, 5 and 6)
#EDUCATION is treated as a factor so there is one box per education level and default status
p <- ggplot(credit_train_clean, aes(factor(EDUCATION), LIMIT_BAL))
p + geom_boxplot(aes(colour=default))+
  labs(title = "Boxplot: Education vs Credit Card Limit Balance on train", subtitle = "Defaults in green", x = "1=grad school, 2=university, 3=high school, 4=others/unknown")

#In the raw data (before 0 was folded into 4), people in the 0 education class have no defaults at all.

#check distribution of education and limit balance on test (the test set has no default column)
#Education after cleaning: 1 = grad school, 2 = university, 3 = high school, 4 = others/unknown (original codes 0, 4, 5 and 6)
ptest <- ggplot(credit_test_clean, aes(factor(EDUCATION), LIMIT_BAL))
ptest + geom_boxplot()+
  labs(title = "Boxplot: Education vs Credit Card Limit Balance on test", x = "1=grad school, 2=university, 3=high school, 4=others/unknown")

#check distribution of marriage and limit balance against defaults
#MARRIAGE is treated as a factor so there is one box per marital status and default status
p <- ggplot(credit_train_clean, aes(factor(MARRIAGE), LIMIT_BAL))
p + geom_boxplot(aes(colour=default))+
  labs(title = "Boxplot: Marriage Status vs Balance on train", subtitle = "Defaults in green", x = "1=married, 2=single, 3=others/unknown")

#check distribution of marriage and limit balance on test set (the test set has no default column)
ptest <- ggplot(credit_test_clean, aes(factor(MARRIAGE), LIMIT_BAL))
ptest + geom_boxplot()+
  labs(title = "Boxplot: Marriage Status vs Balance on test", x = "1=married, 2=single, 3=others/unknown")

#check distribution of sex and limit balance against defaults
#SEX is treated as a factor so there is one box per sex and default status
p <- ggplot(credit_train_clean, aes(factor(SEX), LIMIT_BAL))
p + geom_boxplot(aes(colour=default))+
  labs(title = "Boxplot: Sex vs Limit Balance on train", subtitle = "Defaults in green", x = "1=male, 2=female")

#check distribution of sex and limit balance on test (the test set has no default column)
#THERE IS NO 0 IN THE TEST - NEED TO CLEAN OUT THE ZEROS
p <- ggplot(credit_test_clean, aes(factor(SEX), LIMIT_BAL))
p + geom_boxplot()+
  labs(title = "Boxplot: Sex vs Limit Balance on test", x = "1=male, 2=female")

Some observations

People with lower limit balances are more likely to be defaulters.
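An observation like this can be checked numerically by comparing default rates across limit-balance bands. The sketch below uses a small made-up data frame and a simple split at the median; the band names and values are illustrative assumptions, so the rates say nothing about the real data.

```r
# Made-up limit balances and default flags
toy <- data.frame(
  LIMIT_BAL = c(10000, 20000, 30000, 50000, 80000, 120000, 200000, 300000),
  default   = c("Y", "Y", "N", "Y", "N", "N", "Y", "N")
)

# Split customers into a low and a high band at the median limit balance
toy$band <- cut(
  toy$LIMIT_BAL,
  breaks = c(-Inf, median(toy$LIMIT_BAL), Inf),
  labels = c("low", "high")
)

# Share of defaulters within each band
rate_by_band <- tapply(toy$default == "Y", toy$band, mean)
print(rate_by_band)
```

On the real data, replacing `toy` with `credit_train_clean` would give the equivalent comparison; more bands (for example quartiles) would show whether the relationship is gradual.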

#plotting sex, age and defaults
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AGE, y = SEX, color = default))+
  labs(title = "Scatterplot: Sex, Age and Defaults on train", y = "0=unknown, 1=male, 2=female")

#so the animals were all defaults (three entries only)

#plotting sex, age and defaults on test - we need to remove the zeros
ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = AGE, y = SEX))+
  labs(title = "Scatterplot: Sex, Age on test", y = "0=unknown, 1=male, 2=female")

#the test set has no default column, so this only shows the spread of sex and age
# go back and remove the zeros

#plotting education, limit balance and defaults
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = EDUCATION, y = LIMIT_BAL, color = default))+
  labs(title = "Scatterplot: Education, Limit Balance & Defaults on train", x = "0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")

#plotting education, limit balance on test
ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = EDUCATION, y = LIMIT_BAL))+
  labs(title = "Scatterplot: Education, Limit Balance on test", x = "0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")

#plotting marriage, limit balance and defaults
#marital 1 = married, 2 = single, 3 = others but what does 0 mean (not available?)?
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = LIMIT_BAL, y = AGE, color = default))+
  facet_grid(~ MARRIAGE)+
  labs(title = "Facet grid of Marital Status by Age and Defaults on train", subtitle = "Marital status codes: 0 = unknown, 1=married, 2=single, 3=others")+
  theme(axis.text.x = element_text(angle=90, hjust = 1))

#plotting marriage, limit balance on test
#marital 1 = married, 2 = single, 3 = others but what does 0 mean (not available?)?
ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = LIMIT_BAL, y = AGE))+
  facet_grid(~ MARRIAGE)+
  labs(title = "Facet grid of Marital Status by Age and Limit Balance on test", subtitle = "Marital status codes: 0 = unknown, 1=married, 2=single, 3=others")+
  theme(axis.text.x = element_text(angle=90, hjust = 1))

#plotting education, sex, limit balance and defaults
#one facet per education code, with sex on the y axis
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = LIMIT_BAL, y = SEX, color = default))+
  facet_grid(~ EDUCATION)+
  labs(title = "Facet grid of each Education status by Sex, Limit Balance & Defaults on train", subtitle = "Education codes: 0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")+
  theme(axis.text.x = element_text(angle=90, hjust = 1))

#plotting education, sex and limit balance on test
#one facet per education code, with sex on the y axis
ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = LIMIT_BAL, y = SEX))+
  facet_grid(~ EDUCATION)+
  labs(title = "Facet grid of each Education status by Sex, Limit Balance on Test", subtitle = "Education codes: 0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")+
  theme(axis.text.x = element_text(angle=90, hjust = 1))

##no category 0 in sex, so needs to be cleaned

Need to investigate Principal Components

We need to understand how the transformations occurred and what they mean. Principal component analysis (PCA) is a mathematical procedure that transforms correlated variables into uncorrelated variables called principal components. The first principal component accounts for as much of the variability as possible, and each later component accounts for as much of the remaining variability as possible. PCA reduces a large number of variables to a smaller number of components, so it is a dimensionality-reduction (data compression) method. The goal is dimension reduction, but there is no guarantee the results are interpretable. Hence, take the EDA with a grain of salt: note the mix of numbers and the negatives, which make it hard to see what a component really ‘means’. Where a component is interpreted, the interpretation is usually based on the original variables having the highest correlation with that principal component.
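To make the idea concrete, here is a minimal sketch using base R's `prcomp()` on synthetic data (two strongly correlated columns and one independent column). The data, the variable names and the 90% threshold used to pick components are assumptions for illustration; we do not know how the PAY_PC and AMT_PC columns in the data set were actually produced.

```r
set.seed(42)
n <- 200
base <- rnorm(n)

# Synthetic data: x1 and x2 are strongly correlated, x3 is independent noise
toy <- data.frame(
  x1 = base + rnorm(n, sd = 0.1),
  x2 = base + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)

# Centre and scale the columns, then rotate onto the principal components
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches 90%
n_keep <- which(cum_var >= 0.9)[1]

# The component scores are uncorrelated with one another
score_cor <- cor(pca$x)

print(round(var_explained, 3))
print(n_keep)
print(round(score_cor, 3))
```

The component scores in `pca$x` are centred combinations of the standardised inputs, which is why they mix positive and negative values and have no natural units.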

#Understanding the PAY_PC1 variable on train
#these are the first three principal components of repayment status (reduces the first 6 variables to three principal components)
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = PAY_PC1, y = AGE, color = default))+
  labs(title = "PAY_PC1 and age on train")

#Understanding the PAY_PC1 variable on test
#these are the first three principal components of repayment status (reduces the first 6 variables to three principal components)
ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = PAY_PC1, y = AGE))+
  labs(title = "PAY_PC1 and age on test")

#Understanding the PAY_PC2 variable on train
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = PAY_PC2, y = AGE, color = default))+
  labs(title = "PAY_PC2 and age on train")

#Understanding the PAY_PC2 variable on test
ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = PAY_PC2, y = AGE))+
  labs(title = "PAY_PC2 and age on test")

#Understanding the PAY_PC3 variable on train
ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = PAY_PC3, y = AGE, color = default))+
  labs(title = "PAY_PC3 and age on train")

#Understanding the PAY_PC3 variable on test
ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = PAY_PC3, y = AGE))+
  labs(title = "PAY_PC3 and age on test")

#Understanding the AMT_PC1 variables
#First 7 principal components of the bill statement amount, and the amount of previous payments from April to September (12 variables reduced to 7 variables while retaining 90% of the variation)

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC1, y = AGE, color = default))+
  labs(title = "AMT_PC1 and age on train")

#Understanding the AMT_PC1 variables on test
#First 7 principal components of the bill statement amount, and the amount of previous payments from April to September (12 variables reduced to 7 variables while retaining 90% of the variation)

ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = AMT_PC1, y = AGE))+
  labs(title = "AMT_PC1 and age on test")

#Understanding the AMT_PC2 variable on train

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC2, y = AGE, color = default))+
  labs(title = "AMT_PC2 and age on train")

#Understanding the AMT_PC2 variable on test

ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = AMT_PC2, y = AGE))+
  labs(title = "AMT_PC2 and age on test")

#Understanding the AMT_PC3 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC3, y = AGE, color = default))+
  labs(title = "AMT_PC3 and age on train")

#Understanding the AMT_PC3 variables on test

ggplot(data = credit_test_clean) +
  geom_point(mapping = aes(x = AMT_PC3, y = AGE))+
  labs(title = "AMT_PC3 and age on test")

#Understanding the AMT_PC4 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC4, y = AGE, color = default))

#Understanding the AMT_PC5 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC5, y = AGE, color = default))

#Understanding the AMT_PC6 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC6, y = AGE, color = default))

#Understanding the AMT_PC7 variables

ggplot(data = credit_train_clean) +
  geom_point(mapping = aes(x = AMT_PC7, y = AGE, color = default))

AT2 EDA byAB

Alex Brooks

29 September 2018
