TO DO - clean up AGE: treat values over 100 as errors and impute them with the median; make the AGE, SEX and MARRIAGE coding match between test and train
EVALUATE THE MODEL: using AUC (area under the ROC curve) for binary classification models on the validation data.
CRISP-DM report on the business problem, the data, data preparation, insights, details of model training including the assumptions, evaluation methodology, preliminary results, consideration of ethical issues
STATEMENT: outlining contributions of each team member to this assignment
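For the AUC evaluation listed above, a minimal base-R sketch is given here. It uses the rank-sum identity rather than a package; the helper name `auc_rank` and the toy scores are assumptions for illustration only, not results from this data.

```r
# AUC via the rank-sum (Mann-Whitney) identity, base R only.
# `scores` are predicted probabilities of default, `labels` the true classes.
# auc_rank is a hypothetical helper name, not part of any package.
auc_rank <- function(scores, labels, positive = "Y") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so tied scores count as half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example with made-up numbers
toy_labels <- c("N", "N", "Y", "N", "Y", "Y")
toy_scores <- c(0.10, 0.35, 0.40, 0.60, 0.70, 0.90)
auc_toy <- auc_rank(toy_scores, toy_labels)
print(auc_toy)
```

In the toy example 8 of the 9 defaulter/non-defaulter pairs are ranked correctly, so the AUC is 8/9.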
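#Load the packages used for the plots below
library(ggplot2)
library(corrplot)
#Import the data
#stringsAsFactors = TRUE so text columns are read as factors on any R version (matches the output shown below)
credit_training <- read.csv('AT2_credit_train_STUDENT.csv', header = TRUE, stringsAsFactors = TRUE)
credit_test <- read.csv('AT2_credit_test_STUDENT.csv', header = TRUE, stringsAsFactors = TRUE)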
summary(credit_training)
## ID LIMIT_BAL SEX EDUCATION
## Min. : 1 Min. : -99 1 : 9244 Min. :0.000
## 1st Qu.: 7489 1st Qu.: 50000 2 :13854 1st Qu.:1.000
## Median :14987 Median : 140000 cat : 1 Median :2.000
## Mean :14981 Mean : 167524 dog : 1 Mean :1.853
## 3rd Qu.:22452 3rd Qu.: 240000 dolphin: 1 3rd Qu.:2.000
## Max. :30000 Max. :1000000 Max. :6.000
## MARRIAGE AGE PAY_PC1 PAY_PC2
## Min. :0.000 Min. : 21.0 Min. :-11.859675 Min. :-4.42243
## 1st Qu.:1.000 1st Qu.: 28.0 1st Qu.: -0.393308 1st Qu.:-0.23617
## Median :2.000 Median : 34.0 Median : -0.393308 Median : 0.17555
## Mean :1.553 Mean : 35.7 Mean : -0.001656 Mean :-0.00177
## 3rd Qu.:2.000 3rd Qu.: 41.0 3rd Qu.: 1.360047 3rd Qu.: 0.36112
## Max. :3.000 Max. :141.0 Max. : 3.813348 Max. : 5.44103
## PAY_PC3 AMT_PC1 AMT_PC2
## Min. :-3.864638 Min. :-3.41080 Min. :-4.71769
## 1st Qu.:-0.283941 1st Qu.:-1.50827 1st Qu.:-0.42961
## Median : 0.004886 Median :-0.86433 Median :-0.20780
## Mean : 0.000652 Mean : 0.00461 Mean : 0.00137
## 3rd Qu.: 0.093942 3rd Qu.: 0.49766 3rd Qu.: 0.09062
## Max. : 3.364030 Max. :37.49240 Max. :83.52137
## AMT_PC3 AMT_PC4 AMT_PC5
## Min. :-38.46500 Min. :-21.593416 Min. :-42.37665
## 1st Qu.: -0.13710 1st Qu.: -0.068199 1st Qu.: -0.08239
## Median : -0.07044 Median : 0.018389 Median : -0.03200
## Mean : 0.00383 Mean : 0.004618 Mean : 0.00148
## 3rd Qu.: 0.00325 3rd Qu.: 0.083236 3rd Qu.: 0.02644
## Max. : 21.98483 Max. : 21.823749 Max. : 17.43097
## AMT_PC6 AMT_PC7 default
## Min. :-38.88504 Min. :-41.71546 N:17518
## 1st Qu.: -0.04241 1st Qu.: -0.09273 Y: 5583
## Median : -0.00216 Median : -0.04099
## Mean : -0.00202 Mean : -0.00409
## 3rd Qu.: 0.06754 3rd Qu.: 0.03157
## Max. : 20.22670 Max. : 22.92727
str(credit_training)
## 'data.frame': 23101 obs. of 17 variables:
## $ ID : int 1 2 3 4 6 7 8 9 10 11 ...
## $ LIMIT_BAL: num 20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
## $ SEX : Factor w/ 5 levels "1","2","cat",..: 2 2 2 2 1 1 2 2 1 2 ...
## $ EDUCATION: int 2 2 2 2 1 1 2 3 3 3 ...
## $ MARRIAGE : int 1 2 2 1 2 2 2 1 2 2 ...
## $ AGE : int 24 26 34 37 37 29 23 28 35 34 ...
## $ PAY_PC1 : num 0.477 -1.462 -0.393 -0.393 -0.393 ...
## $ PAY_PC2 : num -3.225 0.854 0.176 0.176 0.176 ...
## $ PAY_PC3 : num 0.14504 -0.36086 0.00489 0.00489 0.00489 ...
## $ AMT_PC1 : num -1.752 -1.663 -1.135 -0.397 -0.393 ...
## $ AMT_PC2 : num -0.224 -0.144 -0.177 -0.451 -0.5 ...
## $ AMT_PC3 : num -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
## $ AMT_PC4 : num 0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
## $ AMT_PC5 : num -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
## $ AMT_PC6 : num 0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
## $ AMT_PC7 : num -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
## $ default : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...
Customer ID is the first column; it is an identifier, so it should be a factor (or character) rather than an integer. Check the “SEX” column: why does it have 5 levels? There are “cat”, “dog” and “dolphin” entries, which are incorrect data as these describe animals, not sex. EDUCATION has levels from 0 to 6; we assume these refer to differing levels of education status. MARRIAGE can be 0 to 3, so there are 4 different categories of marital status. AGE presumably describes how old the card holder (ID) is. THIS NEEDS FIXING: there are implausible ages over 100 (the maximum is 141). We do not yet understand what the PAY_PC or AMT_PC columns refer to. default is the target variable, as this is what we are trying to predict.
DATA CLEANING - remove the animal entries from SEX: they are first recoded to 0, but there is no 0 in the test set, so those rows are dropped. Some limit balances in the training set are -99, which is nonsensical. After recoding, training has 0, 1 and 2 for sex but the test set has no 0.
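The checks described above can be sketched on a toy data frame. The values are invented, and treating the negative limit balances as missing and imputing the median is only one possible assumption; the cleaning function in this notebook does not handle -99.

```r
# Toy stand-in for the training data (values invented for illustration)
toy <- data.frame(
  LIMIT_BAL = c(20000, -99, 120000, 50000, -99, 90000),
  SEX       = c("1", "2", "cat", "2", "1", "dog"),
  stringsAsFactors = FALSE
)

# Which SEX values are not one of the expected codes?
valid_sex <- c("1", "2")
bad_sex <- setdiff(unique(toy$SEX), valid_sex)
print(bad_sex)

# How many limit balances are negative?
n_negative <- sum(toy$LIMIT_BAL < 0)
print(n_negative)

# Assumption: treat negative limits as missing, then impute the median of the valid ones
toy$LIMIT_BAL[toy$LIMIT_BAL < 0] <- NA
toy$LIMIT_BAL[is.na(toy$LIMIT_BAL)] <- median(toy$LIMIT_BAL, na.rm = TRUE)
print(toy$LIMIT_BAL)
```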
##OLD CLEANING FUNCTION
#need to change ID to be "character", so the IDs are not counted as numbers
## clean_data <- function(dataSet) {
#clean up sex, call it new_sex and change the animal classifications to 0
# output <- dataSet
# output$ID <- as.character(output$ID)
#output$SEX <- as.integer(output$SEX)
#output$SEX[output$SEX > 2] <- 0
#clean age to remove aged over 100 entries will become NA
# output$AGE <- ifelse(output$AGE >=100, NA, output$AGE)
# output$SEX <- as.factor(output$SEX)
#output$EDUCATION <- as.factor(output$EDUCATION)
#output$MARRIAGE <- as.factor(output$MARRIAGE)
# return(output)
#}
New cleaning function with more recent clean up observations
#Data cleaning function we can use each time to make sure we share the same dataset
#Cleans ID into character, SEX into two values (dropping the animal rows), AGE so that errored values of 100 or over become NA and are then imputed with the median age, EDUCATION into 4 categories and MARRIAGE into 3 categories. See explanation in markdown if you want to know more.
#also removed the earlier conversion of marriage, sex and education to factors
clean_data <- function(dataSet) {
#clean up sex to remove the animal classifications
output <- dataSet
output$ID <- as.character(output$ID)
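#convert via character so the actual values (not the factor level codes) are used; the animals become NA
output$SEX <- suppressWarnings(as.integer(as.character(output$SEX)))
#anything that is not 1 or 2 is set to 0, then those rows are dropped
output$SEX[is.na(output$SEX) | output$SEX > 2] <- 0
output <- output[output$SEX !=0,]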
#clean age: entries of 100 or over are errors, so make them NA first
output$AGE <- ifelse(output$AGE >=100, NA, output$AGE)
#changing the NAs to now become the median of the remaining ages
output$AGE[is.na(output$AGE)] <- median(output$AGE, na.rm = TRUE)
#all education factors greater than 4, set them to 4, also set 0 to 4
output$EDUCATION[output$EDUCATION > 4] <- 4
output$EDUCATION[output$EDUCATION == 0] <- 4
#clean marriage
output$MARRIAGE[output$MARRIAGE == 0] <- 3
return(output)
}
Check the data again
#clean both train and validation sets using function above
credit_train_clean <- clean_data(credit_training)
credit_test_clean <- clean_data(credit_test)
#cleaned train
str(credit_train_clean)
## 'data.frame': 23098 obs. of 17 variables:
## $ ID : chr "1" "2" "3" "4" ...
## $ LIMIT_BAL: num 20000 120000 90000 50000 50000 500000 100000 140000 20000 200000 ...
## $ SEX : num 2 2 2 2 1 1 2 2 1 2 ...
## $ EDUCATION: num 2 2 2 2 1 1 2 3 3 3 ...
## $ MARRIAGE : num 1 2 2 1 2 2 2 1 2 2 ...
## $ AGE : num 24 26 34 37 37 29 23 28 35 34 ...
## $ PAY_PC1 : num 0.477 -1.462 -0.393 -0.393 -0.393 ...
## $ PAY_PC2 : num -3.225 0.854 0.176 0.176 0.176 ...
## $ PAY_PC3 : num 0.14504 -0.36086 0.00489 0.00489 0.00489 ...
## $ AMT_PC1 : num -1.752 -1.663 -1.135 -0.397 -0.393 ...
## $ AMT_PC2 : num -0.224 -0.144 -0.177 -0.451 -0.5 ...
## $ AMT_PC3 : num -0.0778 -0.0546 0.016 -0.0998 -0.1033 ...
## $ AMT_PC4 : num 0.00696 -0.00285 -0.12907 -0.03534 -0.1179 ...
## $ AMT_PC5 : num -0.0414 0.0439 0.0982 -0.0553 -0.0546 ...
## $ AMT_PC6 : num 0.000887 -0.02619 -0.022383 0.050465 0.112137 ...
## $ AMT_PC7 : num -0.0563 -0.1 -0.069 -0.0282 0.0186 ...
## $ default : Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...
#clean test and check for consistency
str(credit_test_clean)
## 'data.frame': 6899 obs. of 16 variables:
## $ ID : chr "5" "17" "19" "23" ...
## $ LIMIT_BAL: int 50000 20000 360000 70000 60000 50000 50000 500000 280000 60000 ...
## $ SEX : num 1 1 2 2 1 1 1 2 1 2 ...
## $ EDUCATION: num 2 1 1 2 1 1 2 2 2 2 ...
## $ MARRIAGE : num 1 2 1 2 2 2 2 1 1 2 ...
## $ AGE : num 57 24 49 26 27 26 33 54 40 22 ...
## $ PAY_PC1 : num 0.273 -3.294 2.876 -3.211 1.425 ...
## $ PAY_PC2 : num 0.847 1.835 -1.4 0.831 -0.57 ...
## $ PAY_PC3 : num -0.0458 -0.2383 1.297 1.7682 1.1754 ...
## $ AMT_PC1 : num -0.793 -1.117 -1.797 -0.148 -1.785 ...
## $ AMT_PC2 : num 0.864 -0.249 -0.22 -0.431 -0.169 ...
## $ AMT_PC3 : num -0.64 -0.0889 -0.0704 -0.1271 -0.0622 ...
## $ AMT_PC4 : num 0.21823 0.02798 0.01867 0.08418 -0.00615 ...
## $ AMT_PC5 : num -0.46541 -0.10028 -0.032 0.04805 -0.00469 ...
## $ AMT_PC6 : num 0.30533 -0.04551 -0.00997 0.1519 0.01235 ...
## $ AMT_PC7 : num -1.0232 0.1004 -0.0433 -0.1224 -0.0813 ...
View(credit_test_clean)
plot(credit_train_clean)
#check test and train are the same, with no imbalances
plot(credit_test_clean)
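Besides eyeballing the plots, the train/test match can be checked directly by comparing which codes occur in each set. This is a sketch on toy frames; `compare_codes` is a hypothetical helper name and the toy values are invented.

```r
# Hypothetical helper: codes of a column that appear in one set but not the other
compare_codes <- function(train, test, column) {
  list(
    only_in_train = setdiff(unique(train[[column]]), unique(test[[column]])),
    only_in_test  = setdiff(unique(test[[column]]), unique(train[[column]]))
  )
}

# Toy frames standing in for credit_train_clean and credit_test_clean
toy_train <- data.frame(SEX = c(1, 2, 2, 0), MARRIAGE = c(1, 2, 3, 3))
toy_test  <- data.frame(SEX = c(1, 2, 1),    MARRIAGE = c(1, 2, 3))

sex_check <- compare_codes(toy_train, toy_test, "SEX")
marriage_check <- compare_codes(toy_train, toy_test, "MARRIAGE")
print(sex_check)
print(marriage_check)
```

An empty result on both sides means the two sets use the same codes for that column; here the toy train set still has a 0 in SEX that the toy test set lacks.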
#check correlation between the numeric variables on train
#cor() needs numeric columns, so drop ID (1) and the categorical codes SEX, EDUCATION and MARRIAGE (3:5); keep LIMIT_BAL (2) and AGE
#the second step below drops default, which is column 13 once those four are removed
removed <- credit_train_clean[,-c(1,3:5)]
removed <- removed[,-c(13)]
corplot <-cor(removed)
corrplot(corplot, method="pie")
#check correlation between variables on test - they seem to match train, so that's good
#drop ID, SEX, EDUCATION and MARRIAGE again; the test set has no default column, so there is no column 13 to remove
removed <- credit_test_clean[,-c(1,3:5)]
corplot <-cor(removed)
corrplot(corplot, method="pie")
#let's count the number of defaults and non-defaults
train_defaults <- data.frame(label=c("Y", "N"),value=c(length(credit_train_clean$default[credit_train_clean$default == "Y"]),length(credit_train_clean$default[credit_train_clean$default == "N"])))
ggplot(train_defaults, aes(x = label, y = value)) +
geom_bar(stat="identity")
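The class balance can also be read off as proportions. The counts below are the ones printed by summary(credit_training) earlier (before cleaning); the short factor at the end is a made-up example showing the same calculation with table() and prop.table().

```r
# Class counts as reported by summary(credit_training), before cleaning
raw_counts <- c(N = 17518, Y = 5583)
raw_shares <- raw_counts / sum(raw_counts)
print(round(raw_shares, 3))

# On a factor column the same figures come from table() and prop.table()
# (toy vector, invented for illustration)
toy_default <- factor(c("N", "N", "Y", "N"), levels = c("N", "Y"))
toy_shares <- prop.table(table(toy_default))
print(toy_shares)
```

Roughly a quarter of the training rows are defaults, so the classes are imbalanced, which is one reason AUC is a more useful measure here than raw accuracy.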
#check distribution of education and limit balance against defaults on train
#Education after cleaning: 1 = grad school, 2 = university, 3 = high school, 4 = others (the original 0, 5 and 6 are folded into 4)
#EDUCATION is stored as a number, so wrap it in factor() to get one box per category
p <- ggplot(credit_train_clean, aes(factor(EDUCATION), LIMIT_BAL))
p + geom_boxplot(aes(colour=default))+
labs(title = "Boxplot: Education vs Credit Card Limit Balance on train", subtitle = "Coloured by default", x = "1=grad school, 2=university, 3=high school, 4=others")
#Before 0 was folded into 4, people in the 0 education class had no defaults at all.
#check distribution of education and limit balance on test (the test set has no default column)
ptest <- ggplot(credit_test_clean, aes(factor(EDUCATION), LIMIT_BAL))
ptest + geom_boxplot()+
labs(title = "Boxplot: Education vs Credit Card Limit Balance on test", x = "1=grad school, 2=university, 3=high school, 4=others")
#check distribution of marriage and limit balance against defaults
#MARRIAGE is stored as a number, so wrap it in factor() to get one box per category
p <- ggplot(credit_train_clean, aes(factor(MARRIAGE), LIMIT_BAL))
p + geom_boxplot(aes(colour=default))+
labs(title = "Boxplot: Marriage Status vs Balance on train", subtitle = "Coloured by default", x = "1=married, 2=single, 3=others (includes the original 0 = unknown)")
#check distribution of marriage and limit balance on test set
ptest <- ggplot(credit_test_clean, aes(factor(MARRIAGE), LIMIT_BAL))
ptest + geom_boxplot()+
labs(title = "Boxplot: Marriage Status vs Balance on test", x = "1=married, 2=single, 3=others (includes the original 0 = unknown)")
#check distribution of sex and limit balance against defaults
#SEX is stored as a number, so wrap it in factor() to get one box per category
p <- ggplot(credit_train_clean, aes(factor(SEX), LIMIT_BAL))
p + geom_boxplot(aes(colour=default))+
labs(title = "Boxplot: Sex vs Limit Balance on train", subtitle = "Coloured by default", x = "1=male, 2=female")
#check distribution of sex and limit balance on test
#there is no 0 in the test set; clean_data drops the 0 rows from train so the two sets match
p <- ggplot(credit_test_clean, aes(factor(SEX), LIMIT_BAL))
p + geom_boxplot()+
labs(title = "Boxplot: Sex vs Limit Balance on test", x = "1=male, 2=female")
People with lower limit balances are more likely to be defaulters.
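One way to back up this observation with numbers is to compare the median limit balance per default class. The sketch below uses a toy data frame with invented values; on the real data the same call would use credit_train_clean$LIMIT_BAL and credit_train_clean$default.

```r
# Toy data (invented), standing in for credit_train_clean
toy <- data.frame(
  LIMIT_BAL = c(20000, 50000, 30000, 200000, 150000, 90000, 240000, 60000),
  default   = factor(c("Y", "Y", "Y", "N", "N", "N", "N", "Y"))
)

# Median limit balance within each default class
median_limit <- tapply(toy$LIMIT_BAL, toy$default, median)
print(median_limit)
```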
#plotting sex, age and defaults
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AGE, y = SEX, color = default))+
labs(title = "Scatterplot: Sex, Age and Defaults on train", y = "1=male, 2=female")
#before they were removed, the animals (three entries only, coded 0) were all defaults
#plotting sex and age on test (no default column in test; clean_data already drops any 0 rows)
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = AGE, y = SEX))+
labs(title = "Scatterplot: Sex, Age on test", y = "1=male, 2=female")
#plotting education, limit balance and defaults
#after cleaning, education only takes the values 1 to 4 (0, 5 and 6 are folded into 4)
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = EDUCATION, y = LIMIT_BAL, color = default))+
labs(title = "Scatterplot: Education, Limit Balance & Defaults on train", x = "1=grad school, 2=university, 3=high school, 4=others")
#plotting education, limit balance on test
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = EDUCATION, y = LIMIT_BAL))+
labs(title = "Scatterplot: Education, Limit Balance on test", x = "1=grad school, 2=university, 3=high school, 4=others")
#plotting limit balance, age and defaults, faceted by marital status
#marital after cleaning: 1 = married, 2 = single, 3 = others (0, presumably not available, is folded into 3)
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = LIMIT_BAL, y = AGE, color = default))+
facet_grid(~ MARRIAGE)+
labs(title = "Facet grid of Marital Status by Limit Balance, Age and Defaults on train", subtitle = "Marital status codes: 1=married, 2=single, 3=others")+
theme(axis.text.x = element_text(angle=90, hjust = 1))
#plotting limit balance and age on test, faceted by marital status
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = LIMIT_BAL, y = AGE))+
facet_grid(~ MARRIAGE)+
labs(title = "Facet grid of Marital Status by Limit Balance and Age on test", subtitle = "Marital status codes: 1=married, 2=single, 3=others")+
theme(axis.text.x = element_text(angle=90, hjust = 1))
#plotting sex, limit balance and defaults, faceted by education
#education after cleaning: 1 = grad school, 2 = university, 3 = high school, 4 = others
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = LIMIT_BAL, y = SEX, color = default))+
facet_grid(~ EDUCATION)+
labs(title = "Facet grid of each Education status by Sex, Limit Balance & Defaults on train", subtitle = "Education codes: 1=grad school, 2=university, 3=high school, 4=others")+
theme(axis.text.x = element_text(angle=90, hjust = 1))
#plotting sex and limit balance on test, faceted by education
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = LIMIT_BAL, y = SEX))+
facet_grid(~ EDUCATION)+
labs(title = "Facet grid of each Education status by Sex, Limit Balance on Test", subtitle = "Education codes: 1=grad school, 2=university, 3=high school, 4=others")+
theme(axis.text.x = element_text(angle=90, hjust = 1))
##the test set has no category 0 in sex, which is why clean_data drops the 0 rows from train
We need to understand how the transformations occurred and what they mean. PCA (principal component analysis) is a mathematical procedure that transforms correlated variables into a smaller set of uncorrelated variables called principal components. The first PC accounts for as much of the variability as possible, and each later PC accounts for as much of the remaining variability as possible. PCA reduces the attribute space from a large number of variables to a smaller number of components; it is a dimensionality reduction (data compression) method. The goal is dimension reduction, but there is no guarantee the results are interpretable. Hence, take the EDA with a grain of salt: note the mix of numbers and the negatives, which makes it hard to see what each component really ‘means’. A component is usually interpreted based on the original variable having the highest correlation with it.
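To make the idea concrete, here is a small sketch of PCA with base R's prcomp() on invented data. The six toy columns merely stand in for a set of correlated original variables; we do not know the exact settings used to produce the PAY_PC and AMT_PC columns, so centring and scaling are assumptions.

```r
set.seed(42)  # so the toy data is reproducible
n <- 200
base <- rnorm(n)
# Six correlated toy variables: a shared signal plus independent noise (invented data)
toy <- as.data.frame(sapply(1:6, function(i) base + rnorm(n, sd = 0.5)))

# Assumption: centre and scale before PCA
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Keep the first three components as the reduced data set
scores <- pca$x[, 1:3]
# The component scores are uncorrelated with each other
max_off_diag <- max(abs(cor(scores)[upper.tri(diag(3))]))
print(max_off_diag)
```

The scores in `pca$x` are the kind of values seen in the PAY_PC and AMT_PC columns: centred around zero, with negative values, and without a direct meaning in the original units.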
#Understanding the PAY_PC1 variable on train
#these are the first three principal components of repayment status (reduces the first 6 variables to three principal components)
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = PAY_PC1, y = AGE, color = default))+
labs(title = "PAY_PC1 and age on train")
#Understanding the PAY_PC1 variable on test
#these are the first three principal components of repayment status (reduces the first 6 variables to three principal components)
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = PAY_PC1, y = AGE))+
labs(title = "PAY_PC1 and age on test")
#Understanding the PAY_PC2 variable on train
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = PAY_PC2, y = AGE, color = default))+
labs(title = "PAY_PC2 and age on train")
#Understanding the PAY_PC2 variable on test
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = PAY_PC2, y = AGE))+
labs(title = "PAY_PC2 and age on test")
#Understanding the PAY_PC3 variable on train
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = PAY_PC3, y = AGE, color = default))+
labs(title = "PAY_PC3 and age on train")
#Understanding the PAY_PC3 variable on test
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = PAY_PC3, y = AGE))+
labs(title = "PAY_PC3 and age on test")
#Understanding the AMT_PC1 variables
#First 7 principal components of the bill statement amount, and the amount of previous payments from April to September (12 variables reduced to 7 variables while retaining 90% of the variation)
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC1, y = AGE, color = default))+
labs(title = "AMT_PC1 and age on train")
#Understanding the AMT_PC1 variables on test
#First 7 principal components of the bill statement amount, and the amount of previous payments from April to September (12 variables reduced to 7 variables while retaining 90% of the variation)
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = AMT_PC1, y = AGE))+
labs(title = "AMT_PC1 and age on test")
#Understanding the AMT_PC2 variable on train
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC2, y = AGE, color = default))+
labs(title = "AMT_PC2 and age on train")
#Understanding the AMT_PC2 variable on test
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = AMT_PC2, y = AGE))+
labs(title = "AMT_PC2 and age on test")
#Understanding the AMT_PC3 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC3, y = AGE, color = default))+
labs(title = "AMT_PC3 and age on train")
#Understanding the AMT_PC3 variables on test
ggplot(data = credit_test_clean) +
geom_point(mapping = aes(x = AMT_PC3, y = AGE))+
labs(title = "AMT_PC3 and age on test")
#Understanding the AMT_PC4 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC4, y = AGE, color = default))
#Understanding the AMT_PC5 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC5, y = AGE, color = default))
#Understanding the AMT_PC6 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC6, y = AGE, color = default))
#Understanding the AMT_PC7 variables
ggplot(data = credit_train_clean) +
geom_point(mapping = aes(x = AMT_PC7, y = AGE, color = default))