EVALUATE THE MODEL: using AUC (area under the ROC curve) for binary classification models on the validation data.
CRISP-DM report on the business problem, the data, data preparation, insights, details of model training including the assumptions, evaluation methodology, preliminary results, consideration of ethical issues
STATEMENT: outlining contributions of each team member to this assignment
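The evaluation step above names AUC as the metric. As a rough illustration of what that number measures, the sketch below computes AUC in base R from the rank-sum (Mann-Whitney) identity: the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The function name `auc_rank` and the toy scores are assumptions for illustration only; they are not part of the assignment data or of any model in this report.

```r
# AUC via the rank-sum identity; tied scores receive average ranks.
# `auc_rank` is a hypothetical helper, not a function from any package.
auc_rank <- function(scores, labels, positive = "Y") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # default ties.method = "average"
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities of default and true labels
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
toy_labels <- c("N", "N", "Y", "Y")

auc_toy <- auc_rank(toy_scores, toy_labels)
print(auc_toy)  # 0.75: three of the four (Y, N) pairs are ranked correctly
```

On real validation data the same function would take the model's predicted probabilities and the `default` column.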
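#Load ggplot2 for the plots further down, then import the data
library(ggplot2)
#stringsAsFactors = TRUE reproduces the factor columns shown below on R 4.0 and later
credit_training <- read.csv('AT2_credit_train_STUDENT.csv', header = TRUE, stringsAsFactors = TRUE)
credit_test <- read.csv('AT2_credit_test_STUDENT.csv', header = TRUE, stringsAsFactors = TRUE)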
summary(credit_training)
## ID LIMIT_BAL SEX EDUCATION
## Min. : 1 Min. : -99 1 : 9244 Min. :0.000
## 1st Qu.: 7489 1st Qu.: 50000 2 :13854 1st Qu.:1.000
## Median :14987 Median : 140000 cat : 1 Median :2.000
## Mean :14981 Mean : 167524 dog : 1 Mean :1.853
## 3rd Qu.:22452 3rd Qu.: 240000 dolphin: 1 3rd Qu.:2.000
## Max. :30000 Max. :1000000 Max. :6.000
## MARRIAGE AGE PAY_PC1 PAY_PC2
## Min. :0.000 Min. : 21.0 Min. :-11.859675 Min. :-4.42243
## 1st Qu.:1.000 1st Qu.: 28.0 1st Qu.: -0.393308 1st Qu.:-0.23617
## Median :2.000 Median : 34.0 Median : -0.393308 Median : 0.17555
## Mean :1.553 Mean : 35.7 Mean : -0.001656 Mean :-0.00177
## 3rd Qu.:2.000 3rd Qu.: 41.0 3rd Qu.: 1.360047 3rd Qu.: 0.36112
## Max. :3.000 Max. :141.0 Max. : 3.813348 Max. : 5.44103
## PAY_PC3 AMT_PC1 AMT_PC2
## Min. :-3.864638 Min. :-3.41080 Min. :-4.71769
## 1st Qu.:-0.283941 1st Qu.:-1.50827 1st Qu.:-0.42961
## Median : 0.004886 Median :-0.86433 Median :-0.20780
## Mean : 0.000652 Mean : 0.00461 Mean : 0.00137
## 3rd Qu.: 0.093942 3rd Qu.: 0.49766 3rd Qu.: 0.09062
## Max. : 3.364030 Max. :37.49240 Max. :83.52137
## AMT_PC3 AMT_PC4 AMT_PC5
## Min. :-38.46500 Min. :-21.593416 Min. :-42.37665
## 1st Qu.: -0.13710 1st Qu.: -0.068199 1st Qu.: -0.08239
## Median : -0.07044 Median : 0.018389 Median : -0.03200
## Mean : 0.00383 Mean : 0.004618 Mean : 0.00148
## 3rd Qu.: 0.00325 3rd Qu.: 0.083236 3rd Qu.: 0.02644
## Max. : 21.98483 Max. : 21.823749 Max. : 17.43097
## AMT_PC6 AMT_PC7 default
## Min. :-38.88504 Min. :-41.71546 N:17518
## 1st Qu.: -0.04241 1st Qu.: -0.09273 Y: 5583
## Median : -0.00216 Median : -0.04099
## Mean : -0.00202 Mean : -0.00409
## 3rd Qu.: 0.06754 3rd Qu.: 0.03157
## Max. : 20.22670 Max. : 22.92727
str(credit_training)
## 'data.frame': 23101 obs. of 17 variables:
## $ ID : int 1 2 3 4 6 7 8 9 10 11 ...
## $ LIMIT_BAL: num 20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
## $ SEX : Factor w/ 5 levels "1","2","cat",..: 2 2 2 2 1 1 2 2 1 2 ...
## $ EDUCATION: int 2 2 2 2 1 1 2 3 3 3 ...
## $ MARRIAGE : int 1 2 2 1 2 2 2 1 2 2 ...
## $ AGE : int 24 26 34 37 37 29 23 28 35 34 ...
## $ PAY_PC1 : num 0.477 -1.462 -0.393 -0.393 -0.393 ...
## $ PAY_PC2 : num -3.225 0.854 0.176 0.176 0.176 ...
## $ PAY_PC3 : num 0.14504 -0.36086 0.00489 0.00489 0.00489 ...
## $ AMT_PC1 : num -1.752 -1.663 -1.135 -0.397 -0.393 ...
## $ AMT_PC2 : num -0.224 -0.144 -0.177 -0.451 -0.5 ...
## $ AMT_PC3 : num -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
## $ AMT_PC4 : num 0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
## $ AMT_PC5 : num -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
## $ AMT_PC6 : num 0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
## $ AMT_PC7 : num -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
## $ default : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...
Customer ID is the first column. It should be treated as an identifier (a factor or character) rather than an integer.
SEX has 5 levels where 2 are expected. "cat", "dog" and "dolphin" each appear once and are invalid, since they describe animals rather than sex.
EDUCATION runs from 0 to 6. We assume these codes refer to different levels of education.
MARRIAGE runs from 0 to 3, so there are 4 marital status categories.
AGE is presumably the age of the card holder. This needs fixing: the maximum is 141, so some ages are over 100.
We do not yet understand what the PAY_PC and AMT_PC columns refer to.
default is the target variable, as this is what we are trying to predict.
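The observations above can also be turned into explicit counts rather than read off `summary()`. The sketch below does this on a small made-up data frame (`toy`, a hypothetical stand-in for `credit_training`). The negative `LIMIT_BAL` check is an added assumption, prompted by the minimum of -99 in the summary output; the notes above do not discuss it.

```r
# Hypothetical rows that mimic the problems seen in the summary output
toy <- data.frame(
  SEX       = c("1", "2", "cat", "2", "dog"),
  AGE       = c(24, 141, 37, 29, 35),
  LIMIT_BAL = c(20000, -99, 90000, 50000, 500000),
  stringsAsFactors = FALSE
)

# Count the rows that fail each sanity check
suspect <- data.frame(
  check  = c("SEX not 1 or 2", "AGE >= 100", "LIMIT_BAL < 0"),
  n_rows = c(
    sum(!toy$SEX %in% c("1", "2")),
    sum(toy$AGE >= 100, na.rm = TRUE),
    sum(toy$LIMIT_BAL < 0, na.rm = TRUE)
  )
)
print(suspect)
```

Swapping `toy` for `credit_training` would give the real counts. Whether to drop, recode or impute the flagged rows remains a modelling decision.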
#clean the raw data; ID becomes "character" so the IDs are not treated as numbers
clean_data <- function(dataSet) {
  output <- dataSet
  output$ID <- as.character(output$ID)
  #clean up sex: anything other than "1" or "2" (the animal entries) becomes 0
  #compare on the labels, because as.integer() on a factor returns level codes
  sex_label <- as.character(output$SEX)
  output$SEX <- as.factor(ifelse(sex_label %in% c("1", "2"), sex_label, "0"))
  #clean age: entries aged 100 or over become NA
  output$AGE <- ifelse(output$AGE >= 100, NA, output$AGE)
  output$EDUCATION <- as.factor(output$EDUCATION)
  output$MARRIAGE <- as.factor(output$MARRIAGE)
  return(output)
}
credit_train_clean <- clean_data(credit_training)
credit_test_clean <- clean_data(credit_test)
str(credit_train_clean)
## 'data.frame': 23101 obs. of 17 variables:
## $ ID : chr "1" "2" "3" "4" ...
## $ LIMIT_BAL: num 20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
## $ SEX : Factor w/ 3 levels "0","1","2": 3 3 3 3 2 2 3 3 2 3 ...
## $ EDUCATION: Factor w/ 7 levels "0","1","2","3",..: 3 3 3 3 2 2 3 4 4 4 ...
## $ MARRIAGE : Factor w/ 4 levels "0","1","2","3": 2 3 3 2 3 3 3 2 3 3 ...
## $ AGE : int 24 26 34 37 37 29 23 28 35 34 ...
## $ PAY_PC1 : num 0.477 -1.462 -0.393 -0.393 -0.393 ...
## $ PAY_PC2 : num -3.225 0.854 0.176 0.176 0.176 ...
## $ PAY_PC3 : num 0.14504 -0.36086 0.00489 0.00489 0.00489 ...
## $ AMT_PC1 : num -1.752 -1.663 -1.135 -0.397 -0.393 ...
## $ AMT_PC2 : num -0.224 -0.144 -0.177 -0.451 -0.5 ...
## $ AMT_PC3 : num -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
## $ AMT_PC4 : num 0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
## $ AMT_PC5 : num -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
## $ AMT_PC6 : num 0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
## $ AMT_PC7 : num -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
## $ default : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...
str(credit_test_clean)
## 'data.frame': 6899 obs. of 16 variables:
## $ ID : chr "5" "17" "19" "23" ...
## $ LIMIT_BAL: int 50000 20000 360000 70000 60000 50000 50000 500000 280000 60000 ...
## $ SEX : Factor w/ 2 levels "1","2": 1 1 2 2 1 1 1 2 1 2 ...
## $ EDUCATION: Factor w/ 7 levels "0","1","2","3",..: 3 2 2 3 2 2 3 3 3 3 ...
## $ MARRIAGE : Factor w/ 4 levels "0","1","2","3": 2 3 2 3 3 3 3 2 2 3 ...
## $ AGE : int 57 24 49 26 27 26 33 54 40 22 ...
## $ PAY_PC1 : num 0.273 -3.294 2.876 -3.211 1.425 ...
## $ PAY_PC2 : num 0.847 1.835 -1.4 0.831 -0.57 ...
## $ PAY_PC3 : num -0.0458 -0.2383 1.297 1.7682 1.1754 ...
## $ AMT_PC1 : num -0.793 -1.117 -1.797 -0.148 -1.785 ...
## $ AMT_PC2 : num 0.864 -0.249 -0.22 -0.431 -0.169 ...
## $ AMT_PC3 : num -0.64 -0.0889 -0.0704 -0.1271 -0.0622 ...
## $ AMT_PC4 : num 0.21823 0.02798 0.01867 0.08418 -0.00615 ...
## $ AMT_PC5 : num -0.46541 -0.10028 -0.032 0.04805 -0.00469 ...
## $ AMT_PC6 : num 0.30533 -0.04551 -0.00997 0.1519 0.01235 ...
## $ AMT_PC7 : num -1.0232 0.1004 -0.0433 -0.1224 -0.0813 ...
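The two `str()` outputs above show SEX with 3 levels in the training set but only 2 in the test set, because the animal entries occur only in training. Some modelling functions expect new data to carry the same factor levels as the training data, so a check along the following lines may be worth running. The data frames here are toy stand-ins (hypothetical names `toy_train` and `toy_test`), not the real data.

```r
# Toy stand-ins for the cleaned training and test sets
toy_train <- data.frame(SEX = factor(c("0", "1", "2", "2")))
toy_test  <- data.frame(SEX = factor(c("1", "2", "1")))

# Levels that exist in training but are absent from test
missing_in_test <- setdiff(levels(toy_train$SEX), levels(toy_test$SEX))
print(missing_in_test)

# Re-declare the test factor with the training levels; values are unchanged
toy_test$SEX <- factor(toy_test$SEX, levels = levels(toy_train$SEX))
print(levels(toy_test$SEX))
```

The same two steps could be applied to EDUCATION and MARRIAGE, although in the output above their levels already match.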
plot(credit_train_clean)
#check test and train are the same, with no imbalances
plot(credit_test_clean)
## Let’s see what we can see and dig in to do some EDA
#check distribution of education and limit balance against defaults
#Education 1 = grad school, 2 = university, 3 = high school, 4 = others, 5 = unknown, 6 = unknown
p <- ggplot(credit_train_clean, aes(EDUCATION, LIMIT_BAL))
p + geom_boxplot(aes(colour = default)) +
labs(title = "Boxplot: Education vs Credit Card Limit Balance", subtitle = "Defaults in green", x = "0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")
#People in the 0 education class have no defaults at all.
#check distribution of marriage and limit balance against defaults
p <- ggplot(credit_train_clean, aes(MARRIAGE, LIMIT_BAL))
p + geom_boxplot(aes(colour = default)) +
labs(title = "Boxplot: Marriage Status vs Balance", subtitle = "Defaults in green", x = "0 = unknown, 1=married, 2=single, 3=others")
#check distribution of sex and limit balance against defaults
p <- ggplot(credit_train_clean, aes(SEX, LIMIT_BAL))
p + geom_boxplot(aes(colour = default)) +
labs(title = "Boxplot: Sex vs Limit Balance", subtitle = "Defaults in green", x = "0=unknown, 1=male, 2=female")
##Some observations: people with lower limit balances appear more likely to default.
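The observation about limit balances is read off the boxplots. One way to put a number on it is to compute the default rate within bands of `LIMIT_BAL`. The sketch below shows the mechanics on made-up rows: the band edges and the `toy` data frame are assumptions for illustration, and the resulting rates say nothing about the real data.

```r
# Hypothetical rows; on the real data use credit_train_clean instead
toy <- data.frame(
  LIMIT_BAL = c(20000, 30000, 50000, 120000, 200000, 360000, 500000, 40000),
  default   = c("Y", "Y", "N", "N", "N", "N", "Y", "N")
)

# Bin the limit balance; the band edges are arbitrary choices
toy$band <- cut(toy$LIMIT_BAL,
                breaks = c(0, 50000, 200000, Inf),
                labels = c("<=50k", "50k-200k", ">200k"))

# Share of defaulters in each band
rate_by_band <- tapply(toy$default == "Y", toy$band, mean)
print(rate_by_band)
```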
#plotting sex, age and defaults
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AGE, y = SEX, color = default))+
labs(title = "Scatterplot: Sex, Age and Defaults", y = "0=unknown, 1=male, 2=female")
## Warning: Removed 50 rows containing missing values (geom_point).
#so the animals were all defaults (three entries only)
#plotting education, limit balance and defaults
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = EDUCATION, y = LIMIT_BAL, color = default))+
labs(title = "Scatterplot: Education, Limit Balance & Defaults", x = "0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")
#plotting marriage, limit balance and defaults
#marital 1 = married, 2 = single, 3 = others but what does 0 mean (not available?)?
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = LIMIT_BAL, y = AGE, color = default))+
facet_grid(~ MARRIAGE)+
labs(title = "Facet grid of Marital Status by Age and Defaults", subtitle = "Marital status codes: 0 = unknown, 1=married, 2=single, 3=others")+
theme(axis.text.x = element_text(angle=90, hjust = 1))
## Warning: Removed 50 rows containing missing values (geom_point).
#plotting sex, limit balance and defaults, faceted by education
#education 0 = not specified, 1 = grad school, 2 = university, 3 = high school, 4 = others, 5 and 6 = unknown
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = LIMIT_BAL, y = SEX, color = default))+
facet_grid(~ EDUCATION)+
labs(title = "Facet grid of each Education status by Sex, Limit Balance & Defaults", subtitle = "Education codes: 0 = not specified, 1=grad school, 2=university, 3=high school, 4 = others, 5 = unknown, 6 = unknown")+
theme(axis.text.x = element_text(angle=90, hjust = 1))
We need to understand how the PAY_PC and AMT_PC columns were produced and what they mean. They come from principal component analysis (PCA), a mathematical procedure that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each later component accounts for as much of the remaining variability as possible. PCA is therefore a dimensionality reduction (data compression) method: it reduces an attribute space with a large number of variables to a smaller number of components. The goal is dimension reduction, and there is no guarantee that the results are interpretable. Hence, take the EDA below with a grain of salt: note the mix of scales and the negative values, which make it hard to see what a component really 'means'. Where a component is given a meaning, that is usually based on the original variables that have the highest correlation with it.
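To make the description of PCA concrete, here is a minimal sketch using base R's `prcomp()` on simulated data. The variables `x1`, `x2` and `x3` are invented for illustration. The original repayment and bill variables behind PAY_PC and AMT_PC are not available in this dataset, so this does not reproduce how those columns were built.

```r
set.seed(1)
n <- 200
shared <- rnorm(n)

# x1 and x2 are strongly correlated; x3 is independent noise
toy <- data.frame(
  x1 = shared + rnorm(n, sd = 0.2),
  x2 = shared + rnorm(n, sd = 0.2),
  x3 = rnorm(n)
)

# Centre and scale before rotating, so no variable dominates through its units
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component, largest first
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))

# The component scores are uncorrelated with each other
print(round(cor(pca$x), 3))

# Correlation of each original variable with each component, the usual
# basis for attaching a meaning to a component
print(round(cor(toy, pca$x), 2))
```

Keeping only the leading components whose cumulative share reaches a chosen threshold gives the reduced variable set.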
#Understanding the PAY_PC1 variable
#these are the first three principal components of repayment status (reduces the first 6 variables to three principal components)
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = PAY_PC1, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the PAY_PC2 variable
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = PAY_PC2, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the PAY_PC3 variable
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = PAY_PC3, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the AMT_PC1 variables
#First 7 principal components of the bill statement amount, and the amount of previous payments from April to September (12 variables reduced to 7 variables while retaining 90% of the variation)
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC1, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the AMT_PC2 variable
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC2, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the AMT_PC3 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC3, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the AMT_PC4 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC4, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the AMT_PC5 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC5, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the AMT_PC6 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC6, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
#Understanding the AMT_PC7 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC7, y = AGE, color = default))
## Warning: Removed 50 rows containing missing values (geom_point).
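The ten scatterplots above differ only in the x variable, so they could be generated in a loop. The sketch below is one possible way to do that, shown on a small simulated data frame (`toy`, a hypothetical stand-in for `credit_train_clean`). It uses the `.data` pronoun exported by ggplot2 to select a column by name, and passes `na.rm = TRUE` to `geom_point()` so that rows with a missing AGE are dropped without the warning.

```r
library(ggplot2)

# Simulated stand-in with a few of the principal component columns
set.seed(42)
toy <- data.frame(
  AGE     = sample(21:70, 100, replace = TRUE),
  PAY_PC1 = rnorm(100),
  PAY_PC2 = rnorm(100),
  AMT_PC1 = rnorm(100),
  default = factor(sample(c("N", "Y"), 100, replace = TRUE))
)

# Pick up every PAY_PC* and AMT_PC* column by name
pc_cols <- grep("^(PAY|AMT)_PC[0-9]+$", names(toy), value = TRUE)

# Build one scatterplot per principal component
pc_plots <- lapply(pc_cols, function(pc) {
  ggplot(toy, aes(x = .data[[pc]], y = AGE, colour = default)) +
    geom_point(na.rm = TRUE) +
    labs(title = paste("Scatterplot:", pc, "vs Age and Defaults"), x = pc)
})
names(pc_plots) <- pc_cols

# print(pc_plots$PAY_PC1) draws a single plot
```

With `credit_train_clean` in place of `toy`, the same pattern would pick up all ten columns.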