Here is my attempt at using machine-learning algorithms such as linear regression and decision tree to predict credit limit from the user’s information. I also try to utlize cross-valdiation, one hot encoding to try to improve the model and it’s prediction (Spoiler: It does not go well..)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
ddf <- read.csv("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 622\\100 CC Records.csv")
Let’s take a look at what our dataset contains
We have 100 observations and 11 features the credit limit is the dependent variable and the other features are all features of the dataset..
## Exploratory data analysis..
str(ddf)
## 'data.frame': 100 obs. of 11 variables:
## $ Card.Type.Code : chr "DS" "DC" "DC" "AX" ...
## $ Card.Type.Full.Name: chr "Discover" "Diners Club International" "Diners Club International" "American Express" ...
## $ Issuing.Bank : chr "Discover" "Diners Club" "Diners Club" "Bank of America" ...
## $ Card.Number : num 6.48e+15 3.03e+13 3.01e+13 3.45e+14 3.58e+15 ...
## $ Card.Holder.s.Name : chr "Brenda D Peterson" "Dawn U Reese" "Helen P Perry" "Christine E Kim" ...
## $ CVV.CVV2 : int 689 70 709 3819 913 217 3555 241 82 509 ...
## $ Issue.Date : chr "01/2017" "12/2015" "02/2007" "08/2011" ...
## $ Expiry.Date : chr "01/2022" "12/2016" "02/2020" "08/2013" ...
## $ Billing.Date : int 4 11 13 4 22 12 25 15 26 12 ...
## $ Card.PIN : int 1998 3915 2319 9017 66 2867 8493 3907 7665 5766 ...
## $ Credit.Limit : int 22700 12700 81900 77700 38300 158000 68000 41500 47800 93300 ...
summary(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Card.Number
## Length:100 Length:100 Length:100 Min. :3.002e+13
## Class :character Class :character Class :character 1st Qu.:3.466e+14
## Mode :character Mode :character Mode :character Median :4.354e+15
## Mean :3.282e+15
## 3rd Qu.:5.265e+15
## Max. :6.504e+15
## Card.Holder.s.Name CVV.CVV2 Issue.Date Expiry.Date
## Length:100 Min. : 0.0 Length:100 Length:100
## Class :character 1st Qu.: 336.8 Class :character Class :character
## Mode :character Median : 584.0 Mode :character Mode :character
## Mean :1149.3
## 3rd Qu.: 867.5
## Max. :9594.0
## Billing.Date Card.PIN Credit.Limit
## Min. : 1.00 Min. : 23 Min. : 10700
## 1st Qu.: 8.00 1st Qu.:2434 1st Qu.: 38600
## Median :16.00 Median :3911 Median : 82000
## Mean :15.47 Mean :4443 Mean : 92831
## 3rd Qu.:24.00 3rd Qu.:6901 3rd Qu.:146350
## Max. :28.00 Max. :9815 Max. :199000
Each observations contains information about a customer and their credit card information i suspect that their personal information wouldn’t be that much helpful in predicting credit limit.
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Card.Number
## 1 DS Discover Discover 6.480195e+15
## 2 DC Diners Club International Diners Club 3.029520e+13
## 3 DC Diners Club International Diners Club 3.008249e+13
## 4 AX American Express Bank of America 3.448500e+14
## 5 JC Japan Credit Bureau JCB 3.583388e+15
## 6 JC Japan Credit Bureau JCB 3.587385e+15
## Card.Holder.s.Name CVV.CVV2 Issue.Date Expiry.Date Billing.Date Card.PIN
## 1 Brenda D Peterson 689 01/2017 01/2022 4 1998
## 2 Dawn U Reese 70 12/2015 12/2016 11 3915
## 3 Helen P Perry 709 02/2007 02/2020 13 2319
## 4 Christine E Kim 3819 08/2011 08/2013 4 9017
## 5 Elizabeth Z Lopez 913 11/2017 11/2026 22 66
## 6 Nancy S Robles 217 11/2015 11/2029 12 2867
## Credit.Limit
## 1 22700
## 2 12700
## 3 81900
## 4 77700
## 5 38300
## 6 158000
In this data set, we see that there are a lot of Master Card Holders followed by Diners Club International and then American Express.
ddf %>%
select(Card.Type.Full.Name) %>%
table() %>%
prop.table()
## Card.Type.Full.Name
## American Express Diners Club International Discover
## 0.16 0.18 0.11
## Japan Credit Bureau Master Card Visa
## 0.15 0.26 0.14
Chase, JCB and American Express seems to account for the majority of issuing credit card in this dataset..
ddf %>%
select(Issuing.Bank) %>%
table() %>%
prop.table()
## Issuing.Bank
## American Express Cabela\x92s WFB Bank of America Barclays
## 0.14 0.01 0.05 0.05
## Capital One Chase Citibank Diners Club
## 0.05 0.15 0.05 0.18
## Discover GE Capital JCB U.S. Bancorp
## 0.11 0.02 0.15 0.01
## USAA Wells Fargo
## 0.01 0.02
## get rid of the special characters before visualizing the data..
ddf$Issuing.Bank <- iconv(ddf$Issuing.Bank, "ASCII", "UTF-8", sub="")
ggplot(ddf,aes(x=as.factor(Issuing.Bank))) +
geom_bar(stat = "count") + coord_flip() +
geom_text(stat = "count",aes(label = after_stat(count)),vjust = 0.7,size = 4,hjust = -0.3) +
ggtitle("Count of Bank Issuers")
ggplot(ddf,aes(x = Card.Type.Full.Name)) +
geom_bar(stat = "count") + coord_flip()+
geom_text(stat = "count",aes(label = after_stat(count)),vjust = 0.7,size = 4,hjust = -0.3) +
ggtitle("Count of Bank Type")
Finally, I will look at the distribution of the credit limit, the dependent variable of our dataset.., the distribution is not normal but it doesn’t look skewed so we have a variety of values here.
ggplot(ddf,aes(x=Credit.Limit)) +
geom_histogram(bins = 30) +
ggtitle("Credit Limit Distribution") +
xlab("Credit Limit") +
ylab("Count") + theme_minimal()
I used a corrplot to check for correlation between the numeric predictors to see which predictors can predict credit limit.
library(corrplot)
## corrplot 0.92 loaded
# Assuming 'df' is your data frame with numeric predictors
# You can select only the numeric columns from your data frame
# addCoef.col adds the correlation number on your scatterplot matrix..
numeric_predictors <- ddf[, sapply(ddf, is.numeric)]
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_predictors)
# Create a scatterplot matrix with corrplot
corrplot(correlation_matrix, method = "circle",number.cex = 0.7,addCoef.col = "black",type = "lower")
Creating a linear regression model with all of the predictors to see if there are any significant predictors that can help explain the Credit.Limit it seems like the linear model is unable to find such anything. I will attempt to remove any predictors that I feel are unimportant..
## Yikes this is not a good model to use to predict credit limit oh my...
linearmod <- lm(Credit.Limit~.,data = ddf)
summary(linearmod)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = ddf)
##
## Residuals:
## ALL 100 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (179 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 2.026e+05 NaN NaN
## Card.Type.CodeDC -1.348e+05 NaN NaN
## Card.Type.CodeDS 8.293e+05 NaN NaN
## Card.Type.CodeJC 4.386e+05 NaN NaN
## Card.Type.CodeMC 8.778e+05 NaN NaN
## Card.Type.CodeVI 7.308e+05 NaN NaN
## Card.Type.Full.NameDiners Club International NA NA NA
## Card.Type.Full.NameDiscover NA NA NA
## Card.Type.Full.NameJapan Credit Bureau NA NA NA
## Card.Type.Full.NameMaster Card NA NA NA
## Card.Type.Full.NameVisa NA NA NA
## Issuing.BankBank of America -6.722e+04 NaN NaN
## Issuing.BankBarclays -1.198e+05 NaN NaN
## Issuing.BankCabela\022s WFB 1.803e+02 NaN NaN
## Issuing.BankCapital One -1.501e+05 NaN NaN
## Issuing.BankChase -1.151e+05 NaN NaN
## Issuing.BankCitibank -6.950e+04 NaN NaN
## Issuing.BankDiners Club NA NA NA
## Issuing.BankDiscover NA NA NA
## Issuing.BankGE Capital -4.578e+04 NaN NaN
## Issuing.BankJCB NA NA NA
## Issuing.BankU.S. Bancorp -6.150e+04 NaN NaN
## Issuing.BankUSAA -1.341e+05 NaN NaN
## Issuing.BankWells Fargo -1.693e+05 NaN NaN
## Card.Number -1.674e-10 NaN NaN
## Card.Holder.s.NameAlice G Mitchell 5.225e+03 NaN NaN
## Card.Holder.s.NameAlicia S Crawford -2.400e+04 NaN NaN
## Card.Holder.s.NameAllen Rodgers -8.558e+04 NaN NaN
## Card.Holder.s.NameAmanda L Soto 8.725e+04 NaN NaN
## Card.Holder.s.NameAnita O Brennan 2.675e+02 NaN NaN
## Card.Holder.s.NameAnnie R House 6.958e+04 NaN NaN
## Card.Holder.s.NameAnthony C Sanchez -3.670e+04 NaN NaN
## Card.Holder.s.NameArthur Hines 9.956e+04 NaN NaN
## Card.Holder.s.NameBilly O Bell -1.297e+04 NaN NaN
## Card.Holder.s.NameBobby F Walker 1.900e+04 NaN NaN
## Card.Holder.s.NameBrandon B Boone 1.462e+05 NaN NaN
## Card.Holder.s.NameBrenda D Peterson 7.547e+04 NaN NaN
## Card.Holder.s.NameCharles R Brooks -1.820e+04 NaN NaN
## Card.Holder.s.NameCharles R Farrell -6.867e+04 NaN NaN
## Card.Holder.s.NameChristina G Francis -1.326e+05 NaN NaN
## Card.Holder.s.NameChristine E Kim NA NA NA
## Card.Holder.s.NameCraig Gonzalez -1.098e+05 NaN NaN
## Card.Holder.s.NameCraig X Keith 1.316e+05 NaN NaN
## Card.Holder.s.NameCynthia Evans -8.336e+04 NaN NaN
## Card.Holder.s.NameDavid Z Hood -3.368e+02 NaN NaN
## Card.Holder.s.NameDawn U Reese -5.010e+04 NaN NaN
## Card.Holder.s.NameDiane Carroll 1.531e+05 NaN NaN
## Card.Holder.s.NameDiane R Ross 1.144e+05 NaN NaN
## Card.Holder.s.NameDon P Wooten 1.374e+05 NaN NaN
## Card.Holder.s.NameDonna A Rodriguez 1.636e+05 NaN NaN
## Card.Holder.s.NameDorothy H Le 7.290e+04 NaN NaN
## Card.Holder.s.NameEdward Z Owen -6.175e+04 NaN NaN
## Card.Holder.s.NameEleanor Callahan 1.297e+05 NaN NaN
## Card.Holder.s.NameEleanor Crane 2.548e+04 NaN NaN
## Card.Holder.s.NameElizabeth Z Lopez -3.148e+03 NaN NaN
## Card.Holder.s.NameFrances S Henderson -3.508e+04 NaN NaN
## Card.Holder.s.NameFrancis Koch -6.799e+04 NaN NaN
## Card.Holder.s.NameGerald O Lee -1.171e+05 NaN NaN
## Card.Holder.s.NameGrace D Sparks 4.023e+04 NaN NaN
## Card.Holder.s.NameHeather W Acosta -3.259e+04 NaN NaN
## Card.Holder.s.NameHeather Wong NA NA NA
## Card.Holder.s.NameHelen P Perry 1.906e+04 NaN NaN
## Card.Holder.s.NameHelen R Lucas -1.054e+05 NaN NaN
## Card.Holder.s.NameHenry T Miller 4.370e+04 NaN NaN
## Card.Holder.s.NameHenry W Ballard -1.868e+04 NaN NaN
## Card.Holder.s.NameHenry W Hewitt -6.609e+04 NaN NaN
## Card.Holder.s.NameHoward C Dunlap 4.775e+04 NaN NaN
## Card.Holder.s.NameHoward R Bernard 4.861e+04 NaN NaN
## Card.Holder.s.NameHoward X Nelson 3.779e+03 NaN NaN
## Card.Holder.s.NameJack Foster -2.594e+04 NaN NaN
## Card.Holder.s.NameJanet Alexander 7.088e+04 NaN NaN
## Card.Holder.s.NameJeffrey J Barnes 1.149e+05 NaN NaN
## Card.Holder.s.NameJennifer K Potts 2.737e+04 NaN NaN
## Card.Holder.s.NameJennifer T Turner 1.496e+05 NaN NaN
## Card.Holder.s.NameJeremy N White 5.832e+04 NaN NaN
## Card.Holder.s.NameJeremy O Wilson 3.726e+03 NaN NaN
## Card.Holder.s.NameJeremy P Price -1.503e+04 NaN NaN
## Card.Holder.s.NameJesse J Perez -7.258e+04 NaN NaN
## Card.Holder.s.NameJim X Ballard 1.723e+04 NaN NaN
## Card.Holder.s.NameJimmy V Baker 1.110e+05 NaN NaN
## Card.Holder.s.NameJoe N Mccarty 3.049e+04 NaN NaN
## Card.Holder.s.NameJon O Reese NA NA NA
## Card.Holder.s.NameJoyce B Warren 2.053e+05 NaN NaN
## Card.Holder.s.NameJudith I Rollins 2.248e+04 NaN NaN
## Card.Holder.s.NameJudith Y Ross 9.944e+04 NaN NaN
## Card.Holder.s.NameJustin Y Coleman 8.220e+04 NaN NaN
## Card.Holder.s.NameKathy Robinson -1.122e+05 NaN NaN
## Card.Holder.s.NameKeith Z Morris -5.136e+04 NaN NaN
## Card.Holder.s.NameKristen Frank NA NA NA
## Card.Holder.s.NameLaura N Jackson 2.110e+04 NaN NaN
## Card.Holder.s.NameLaura W Richardson 6.391e+04 NaN NaN
## Card.Holder.s.NameLeonard K Browning 1.032e+05 NaN NaN
## Card.Holder.s.NameLillian M Green -4.293e+04 NaN NaN
## Card.Holder.s.NameLinda T Brown 9.176e+04 NaN NaN
## Card.Holder.s.NameLouis Martinez -4.334e+04 NaN NaN
## Card.Holder.s.NameMarcus D Ashley -1.162e+05 NaN NaN
## Card.Holder.s.NameMargaret T Ramirez 9.777e+04 NaN NaN
## Card.Holder.s.NameMarilyn E Coleman 1.626e+05 NaN NaN
## Card.Holder.s.NameMario D Blanchard 6.764e+04 NaN NaN
## Card.Holder.s.NameMark Y Sheppard NA NA NA
## Card.Holder.s.NameNancy C Cox 1.439e+05 NaN NaN
## Card.Holder.s.NameNancy S Robles 1.172e+05 NaN NaN
## Card.Holder.s.NamePatricia Williams -2.982e+04 NaN NaN
## Card.Holder.s.NamePaula E Head 7.239e+04 NaN NaN
## Card.Holder.s.NamePaula R Bryan NA NA NA
## Card.Holder.s.NamePhyllis H Johnson NA NA NA
## Card.Holder.s.NameRachel Turner 4.126e+04 NaN NaN
## Card.Holder.s.NameRachel V Jenkins NA NA NA
## Card.Holder.s.NameRandall Z Lott -1.001e+05 NaN NaN
## Card.Holder.s.NameRebecca Jenkins -7.971e+04 NaN NaN
## Card.Holder.s.NameRobert A Long 4.717e+04 NaN NaN
## Card.Holder.s.NameRonald R Stewart 8.950e+04 NaN NaN
## Card.Holder.s.NameRose Jenkins 1.230e+05 NaN NaN
## Card.Holder.s.NameSandra H Adams NA NA NA
## Card.Holder.s.NameSandra N Parrish 1.734e+04 NaN NaN
## Card.Holder.s.NameSean J Young 2.086e+04 NaN NaN
## Card.Holder.s.NameStanley K Hopper -7.047e+04 NaN NaN
## Card.Holder.s.NameStephen H Ramirez NA NA NA
## Card.Holder.s.NameVictor J Mercado NA NA NA
## Card.Holder.s.NameVincent M Fulton NA NA NA
## Card.Holder.s.NameVincent U Jordan NA NA NA
## Card.Holder.s.NameWalter B Lopez NA NA NA
## Card.Holder.s.NameWilliam N French NA NA NA
## Card.Holder.s.NameWillie O King NA NA NA
## CVV.CVV2 NA NA NA
## Issue.Date01/2009 NA NA NA
## Issue.Date01/2010 NA NA NA
## Issue.Date01/2011 NA NA NA
## Issue.Date01/2013 NA NA NA
## Issue.Date01/2016 NA NA NA
## Issue.Date01/2017 NA NA NA
## Issue.Date02/2007 NA NA NA
## Issue.Date02/2010 NA NA NA
## Issue.Date02/2012 NA NA NA
## Issue.Date02/2013 NA NA NA
## Issue.Date02/2014 NA NA NA
## Issue.Date02/2015 NA NA NA
## Issue.Date02/2016 NA NA NA
## Issue.Date03/2011 NA NA NA
## Issue.Date03/2012 NA NA NA
## Issue.Date03/2013 NA NA NA
## Issue.Date03/2015 NA NA NA
## Issue.Date03/2016 NA NA NA
## Issue.Date04/2007 NA NA NA
## Issue.Date04/2009 NA NA NA
## Issue.Date04/2011 NA NA NA
## Issue.Date04/2013 NA NA NA
## Issue.Date04/2015 NA NA NA
## Issue.Date04/2017 NA NA NA
## Issue.Date05/2010 NA NA NA
## Issue.Date05/2011 NA NA NA
## Issue.Date05/2015 NA NA NA
## Issue.Date05/2016 NA NA NA
## Issue.Date06/2007 NA NA NA
## Issue.Date06/2008 NA NA NA
## Issue.Date06/2011 NA NA NA
## Issue.Date06/2012 NA NA NA
## Issue.Date06/2014 NA NA NA
## Issue.Date06/2015 NA NA NA
## Issue.Date07/2007 NA NA NA
## Issue.Date07/2008 NA NA NA
## Issue.Date07/2009 NA NA NA
## Issue.Date07/2010 NA NA NA
## Issue.Date07/2015 NA NA NA
## Issue.Date07/2017 NA NA NA
## Issue.Date08/2007 NA NA NA
## Issue.Date08/2008 NA NA NA
## Issue.Date08/2010 NA NA NA
## Issue.Date08/2011 NA NA NA
## Issue.Date08/2013 NA NA NA
## Issue.Date08/2014 NA NA NA
## Issue.Date08/2015 NA NA NA
## Issue.Date08/2016 NA NA NA
## Issue.Date08/2017 NA NA NA
## Issue.Date09/2009 NA NA NA
## Issue.Date09/2010 NA NA NA
## Issue.Date09/2011 NA NA NA
## Issue.Date09/2012 NA NA NA
## Issue.Date09/2013 NA NA NA
## Issue.Date09/2017 NA NA NA
## Issue.Date10/2012 NA NA NA
## Issue.Date10/2013 NA NA NA
## Issue.Date10/2014 NA NA NA
## Issue.Date10/2015 NA NA NA
## Issue.Date10/2016 NA NA NA
## Issue.Date11/2009 NA NA NA
## Issue.Date11/2013 NA NA NA
## Issue.Date11/2015 NA NA NA
## Issue.Date11/2016 NA NA NA
## Issue.Date11/2017 NA NA NA
## Issue.Date12/2007 NA NA NA
## Issue.Date12/2015 NA NA NA
## Issue.Date12/2017 NA NA NA
## Expiry.Date01/2015 NA NA NA
## Expiry.Date01/2017 NA NA NA
## Expiry.Date01/2018 NA NA NA
## Expiry.Date01/2020 NA NA NA
## Expiry.Date01/2022 NA NA NA
## Expiry.Date01/2027 NA NA NA
## Expiry.Date01/2033 NA NA NA
## Expiry.Date02/2012 NA NA NA
## Expiry.Date02/2019 NA NA NA
## Expiry.Date02/2020 NA NA NA
## Expiry.Date02/2021 NA NA NA
## Expiry.Date02/2028 NA NA NA
## Expiry.Date02/2030 NA NA NA
## Expiry.Date02/2031 NA NA NA
## Expiry.Date03/2019 NA NA NA
## Expiry.Date03/2021 NA NA NA
## Expiry.Date03/2022 NA NA NA
## Expiry.Date03/2024 NA NA NA
## Expiry.Date03/2028 NA NA NA
## Expiry.Date04/2017 NA NA NA
## Expiry.Date04/2020 NA NA NA
## Expiry.Date04/2021 NA NA NA
## Expiry.Date04/2024 NA NA NA
## Expiry.Date04/2025 NA NA NA
## Expiry.Date04/2027 NA NA NA
## Expiry.Date04/2035 NA NA NA
## Expiry.Date05/2015 NA NA NA
## Expiry.Date05/2016 NA NA NA
## Expiry.Date05/2018 NA NA NA
## Expiry.Date05/2023 NA NA NA
## Expiry.Date05/2024 NA NA NA
## Expiry.Date06/2008 NA NA NA
## Expiry.Date06/2011 NA NA NA
## Expiry.Date06/2016 NA NA NA
## Expiry.Date06/2019 NA NA NA
## Expiry.Date06/2020 NA NA NA
## Expiry.Date06/2021 NA NA NA
## Expiry.Date06/2022 NA NA NA
## Expiry.Date06/2026 NA NA NA
## Expiry.Date06/2029 NA NA NA
## Expiry.Date06/2032 NA NA NA
## Expiry.Date06/2033 NA NA NA
## Expiry.Date07/2009 NA NA NA
## Expiry.Date07/2016 NA NA NA
## Expiry.Date07/2021 NA NA NA
## Expiry.Date07/2026 NA NA NA
## Expiry.Date07/2028 NA NA NA
## Expiry.Date08/2012 NA NA NA
## Expiry.Date08/2013 NA NA NA
## Expiry.Date08/2015 NA NA NA
## Expiry.Date08/2022 NA NA NA
## Expiry.Date08/2023 NA NA NA
## Expiry.Date08/2025 NA NA NA
## Expiry.Date08/2026 NA NA NA
## Expiry.Date08/2028 NA NA NA
## Expiry.Date08/2031 NA NA NA
## Expiry.Date08/2034 NA NA NA
## Expiry.Date09/2012 NA NA NA
## Expiry.Date09/2018 NA NA NA
## Expiry.Date09/2020 NA NA NA
## Expiry.Date09/2023 NA NA NA
## Expiry.Date09/2026 NA NA NA
## Expiry.Date09/2029 NA NA NA
## Expiry.Date10/2017 NA NA NA
## Expiry.Date10/2020 NA NA NA
## Expiry.Date10/2022 NA NA NA
## Expiry.Date10/2023 NA NA NA
## Expiry.Date10/2024 NA NA NA
## Expiry.Date10/2026 NA NA NA
## Expiry.Date10/2027 NA NA NA
## Expiry.Date10/2030 NA NA NA
## Expiry.Date11/2014 NA NA NA
## Expiry.Date11/2016 NA NA NA
## Expiry.Date11/2021 NA NA NA
## Expiry.Date11/2024 NA NA NA
## Expiry.Date11/2025 NA NA NA
## Expiry.Date11/2026 NA NA NA
## Expiry.Date11/2029 NA NA NA
## Expiry.Date11/2037 NA NA NA
## Expiry.Date12/2008 NA NA NA
## Expiry.Date12/2014 NA NA NA
## Expiry.Date12/2016 NA NA NA
## Expiry.Date12/2024 NA NA NA
## Expiry.Date12/2027 NA NA NA
## Billing.Date NA NA NA
## Card.PIN NA NA NA
## Pr(>|t|)
## (Intercept) NaN
## Card.Type.CodeDC NaN
## Card.Type.CodeDS NaN
## Card.Type.CodeJC NaN
## Card.Type.CodeMC NaN
## Card.Type.CodeVI NaN
## Card.Type.Full.NameDiners Club International NA
## Card.Type.Full.NameDiscover NA
## Card.Type.Full.NameJapan Credit Bureau NA
## Card.Type.Full.NameMaster Card NA
## Card.Type.Full.NameVisa NA
## Issuing.BankBank of America NaN
## Issuing.BankBarclays NaN
## Issuing.BankCabela\022s WFB NaN
## Issuing.BankCapital One NaN
## Issuing.BankChase NaN
## Issuing.BankCitibank NaN
## Issuing.BankDiners Club NA
## Issuing.BankDiscover NA
## Issuing.BankGE Capital NaN
## Issuing.BankJCB NA
## Issuing.BankU.S. Bancorp NaN
## Issuing.BankUSAA NaN
## Issuing.BankWells Fargo NaN
## Card.Number NaN
## Card.Holder.s.NameAlice G Mitchell NaN
## Card.Holder.s.NameAlicia S Crawford NaN
## Card.Holder.s.NameAllen Rodgers NaN
## Card.Holder.s.NameAmanda L Soto NaN
## Card.Holder.s.NameAnita O Brennan NaN
## Card.Holder.s.NameAnnie R House NaN
## Card.Holder.s.NameAnthony C Sanchez NaN
## Card.Holder.s.NameArthur Hines NaN
## Card.Holder.s.NameBilly O Bell NaN
## Card.Holder.s.NameBobby F Walker NaN
## Card.Holder.s.NameBrandon B Boone NaN
## Card.Holder.s.NameBrenda D Peterson NaN
## Card.Holder.s.NameCharles R Brooks NaN
## Card.Holder.s.NameCharles R Farrell NaN
## Card.Holder.s.NameChristina G Francis NaN
## Card.Holder.s.NameChristine E Kim NA
## Card.Holder.s.NameCraig Gonzalez NaN
## Card.Holder.s.NameCraig X Keith NaN
## Card.Holder.s.NameCynthia Evans NaN
## Card.Holder.s.NameDavid Z Hood NaN
## Card.Holder.s.NameDawn U Reese NaN
## Card.Holder.s.NameDiane Carroll NaN
## Card.Holder.s.NameDiane R Ross NaN
## Card.Holder.s.NameDon P Wooten NaN
## Card.Holder.s.NameDonna A Rodriguez NaN
## Card.Holder.s.NameDorothy H Le NaN
## Card.Holder.s.NameEdward Z Owen NaN
## Card.Holder.s.NameEleanor Callahan NaN
## Card.Holder.s.NameEleanor Crane NaN
## Card.Holder.s.NameElizabeth Z Lopez NaN
## Card.Holder.s.NameFrances S Henderson NaN
## Card.Holder.s.NameFrancis Koch NaN
## Card.Holder.s.NameGerald O Lee NaN
## Card.Holder.s.NameGrace D Sparks NaN
## Card.Holder.s.NameHeather W Acosta NaN
## Card.Holder.s.NameHeather Wong NA
## Card.Holder.s.NameHelen P Perry NaN
## Card.Holder.s.NameHelen R Lucas NaN
## Card.Holder.s.NameHenry T Miller NaN
## Card.Holder.s.NameHenry W Ballard NaN
## Card.Holder.s.NameHenry W Hewitt NaN
## Card.Holder.s.NameHoward C Dunlap NaN
## Card.Holder.s.NameHoward R Bernard NaN
## Card.Holder.s.NameHoward X Nelson NaN
## Card.Holder.s.NameJack Foster NaN
## Card.Holder.s.NameJanet Alexander NaN
## Card.Holder.s.NameJeffrey J Barnes NaN
## Card.Holder.s.NameJennifer K Potts NaN
## Card.Holder.s.NameJennifer T Turner NaN
## Card.Holder.s.NameJeremy N White NaN
## Card.Holder.s.NameJeremy O Wilson NaN
## Card.Holder.s.NameJeremy P Price NaN
## Card.Holder.s.NameJesse J Perez NaN
## Card.Holder.s.NameJim X Ballard NaN
## Card.Holder.s.NameJimmy V Baker NaN
## Card.Holder.s.NameJoe N Mccarty NaN
## Card.Holder.s.NameJon O Reese NA
## Card.Holder.s.NameJoyce B Warren NaN
## Card.Holder.s.NameJudith I Rollins NaN
## Card.Holder.s.NameJudith Y Ross NaN
## Card.Holder.s.NameJustin Y Coleman NaN
## Card.Holder.s.NameKathy Robinson NaN
## Card.Holder.s.NameKeith Z Morris NaN
## Card.Holder.s.NameKristen Frank NA
## Card.Holder.s.NameLaura N Jackson NaN
## Card.Holder.s.NameLaura W Richardson NaN
## Card.Holder.s.NameLeonard K Browning NaN
## Card.Holder.s.NameLillian M Green NaN
## Card.Holder.s.NameLinda T Brown NaN
## Card.Holder.s.NameLouis Martinez NaN
## Card.Holder.s.NameMarcus D Ashley NaN
## Card.Holder.s.NameMargaret T Ramirez NaN
## Card.Holder.s.NameMarilyn E Coleman NaN
## Card.Holder.s.NameMario D Blanchard NaN
## Card.Holder.s.NameMark Y Sheppard NA
## Card.Holder.s.NameNancy C Cox NaN
## Card.Holder.s.NameNancy S Robles NaN
## Card.Holder.s.NamePatricia Williams NaN
## Card.Holder.s.NamePaula E Head NaN
## Card.Holder.s.NamePaula R Bryan NA
## Card.Holder.s.NamePhyllis H Johnson NA
## Card.Holder.s.NameRachel Turner NaN
## Card.Holder.s.NameRachel V Jenkins NA
## Card.Holder.s.NameRandall Z Lott NaN
## Card.Holder.s.NameRebecca Jenkins NaN
## Card.Holder.s.NameRobert A Long NaN
## Card.Holder.s.NameRonald R Stewart NaN
## Card.Holder.s.NameRose Jenkins NaN
## Card.Holder.s.NameSandra H Adams NA
## Card.Holder.s.NameSandra N Parrish NaN
## Card.Holder.s.NameSean J Young NaN
## Card.Holder.s.NameStanley K Hopper NaN
## Card.Holder.s.NameStephen H Ramirez NA
## Card.Holder.s.NameVictor J Mercado NA
## Card.Holder.s.NameVincent M Fulton NA
## Card.Holder.s.NameVincent U Jordan NA
## Card.Holder.s.NameWalter B Lopez NA
## Card.Holder.s.NameWilliam N French NA
## Card.Holder.s.NameWillie O King NA
## CVV.CVV2 NA
## Issue.Date01/2009 NA
## Issue.Date01/2010 NA
## Issue.Date01/2011 NA
## Issue.Date01/2013 NA
## Issue.Date01/2016 NA
## Issue.Date01/2017 NA
## Issue.Date02/2007 NA
## Issue.Date02/2010 NA
## Issue.Date02/2012 NA
## Issue.Date02/2013 NA
## Issue.Date02/2014 NA
## Issue.Date02/2015 NA
## Issue.Date02/2016 NA
## Issue.Date03/2011 NA
## Issue.Date03/2012 NA
## Issue.Date03/2013 NA
## Issue.Date03/2015 NA
## Issue.Date03/2016 NA
## Issue.Date04/2007 NA
## Issue.Date04/2009 NA
## Issue.Date04/2011 NA
## Issue.Date04/2013 NA
## Issue.Date04/2015 NA
## Issue.Date04/2017 NA
## Issue.Date05/2010 NA
## Issue.Date05/2011 NA
## Issue.Date05/2015 NA
## Issue.Date05/2016 NA
## Issue.Date06/2007 NA
## Issue.Date06/2008 NA
## Issue.Date06/2011 NA
## Issue.Date06/2012 NA
## Issue.Date06/2014 NA
## Issue.Date06/2015 NA
## Issue.Date07/2007 NA
## Issue.Date07/2008 NA
## Issue.Date07/2009 NA
## Issue.Date07/2010 NA
## Issue.Date07/2015 NA
## Issue.Date07/2017 NA
## Issue.Date08/2007 NA
## Issue.Date08/2008 NA
## Issue.Date08/2010 NA
## Issue.Date08/2011 NA
## Issue.Date08/2013 NA
## Issue.Date08/2014 NA
## Issue.Date08/2015 NA
## Issue.Date08/2016 NA
## Issue.Date08/2017 NA
## Issue.Date09/2009 NA
## Issue.Date09/2010 NA
## Issue.Date09/2011 NA
## Issue.Date09/2012 NA
## Issue.Date09/2013 NA
## Issue.Date09/2017 NA
## Issue.Date10/2012 NA
## Issue.Date10/2013 NA
## Issue.Date10/2014 NA
## Issue.Date10/2015 NA
## Issue.Date10/2016 NA
## Issue.Date11/2009 NA
## Issue.Date11/2013 NA
## Issue.Date11/2015 NA
## Issue.Date11/2016 NA
## Issue.Date11/2017 NA
## Issue.Date12/2007 NA
## Issue.Date12/2015 NA
## Issue.Date12/2017 NA
## Expiry.Date01/2015 NA
## Expiry.Date01/2017 NA
## Expiry.Date01/2018 NA
## Expiry.Date01/2020 NA
## Expiry.Date01/2022 NA
## Expiry.Date01/2027 NA
## Expiry.Date01/2033 NA
## Expiry.Date02/2012 NA
## Expiry.Date02/2019 NA
## Expiry.Date02/2020 NA
## Expiry.Date02/2021 NA
## Expiry.Date02/2028 NA
## Expiry.Date02/2030 NA
## Expiry.Date02/2031 NA
## Expiry.Date03/2019 NA
## Expiry.Date03/2021 NA
## Expiry.Date03/2022 NA
## Expiry.Date03/2024 NA
## Expiry.Date03/2028 NA
## Expiry.Date04/2017 NA
## Expiry.Date04/2020 NA
## Expiry.Date04/2021 NA
## Expiry.Date04/2024 NA
## Expiry.Date04/2025 NA
## Expiry.Date04/2027 NA
## Expiry.Date04/2035 NA
## Expiry.Date05/2015 NA
## Expiry.Date05/2016 NA
## Expiry.Date05/2018 NA
## Expiry.Date05/2023 NA
## Expiry.Date05/2024 NA
## Expiry.Date06/2008 NA
## Expiry.Date06/2011 NA
## Expiry.Date06/2016 NA
## Expiry.Date06/2019 NA
## Expiry.Date06/2020 NA
## Expiry.Date06/2021 NA
## Expiry.Date06/2022 NA
## Expiry.Date06/2026 NA
## Expiry.Date06/2029 NA
## Expiry.Date06/2032 NA
## Expiry.Date06/2033 NA
## Expiry.Date07/2009 NA
## Expiry.Date07/2016 NA
## Expiry.Date07/2021 NA
## Expiry.Date07/2026 NA
## Expiry.Date07/2028 NA
## Expiry.Date08/2012 NA
## Expiry.Date08/2013 NA
## Expiry.Date08/2015 NA
## Expiry.Date08/2022 NA
## Expiry.Date08/2023 NA
## Expiry.Date08/2025 NA
## Expiry.Date08/2026 NA
## Expiry.Date08/2028 NA
## Expiry.Date08/2031 NA
## Expiry.Date08/2034 NA
## Expiry.Date09/2012 NA
## Expiry.Date09/2018 NA
## Expiry.Date09/2020 NA
## Expiry.Date09/2023 NA
## Expiry.Date09/2026 NA
## Expiry.Date09/2029 NA
## Expiry.Date10/2017 NA
## Expiry.Date10/2020 NA
## Expiry.Date10/2022 NA
## Expiry.Date10/2023 NA
## Expiry.Date10/2024 NA
## Expiry.Date10/2026 NA
## Expiry.Date10/2027 NA
## Expiry.Date10/2030 NA
## Expiry.Date11/2014 NA
## Expiry.Date11/2016 NA
## Expiry.Date11/2021 NA
## Expiry.Date11/2024 NA
## Expiry.Date11/2025 NA
## Expiry.Date11/2026 NA
## Expiry.Date11/2029 NA
## Expiry.Date11/2037 NA
## Expiry.Date12/2008 NA
## Expiry.Date12/2014 NA
## Expiry.Date12/2016 NA
## Expiry.Date12/2024 NA
## Expiry.Date12/2027 NA
## Billing.Date NA
## Card.PIN NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 99 and 0 DF, p-value: NA
Now that we have look at the data and visualizes some of the columns that I believe are relevant, it is time to remove irrelevant columns and clean the data.
## Let's remove columns that would not be useful for credit limit estimation.. such as card number and card holder's name cvv and card pin since these are security features and doesn't contain any useful info..
ddf <- ddf %>%
select(-Card.Number,-Card.Holder.s.Name,-CVV.CVV2,-Card.PIN)
Create proper date-time columns.. we are going to transform the issue dat and expiry dates into numerical features to help our model runs better. We are going to calculate the card age by subtracting the issue date from the e
library(lubridate)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
ddf$Issue.Date <- as.Date(paste("01/", ddf$Issue.Date, sep=""), format="%d/%m/%Y")
ddf$Expiry.Date <- as.Date(paste("01/", ddf$Expiry.Date, sep=""), format="%d/%m/%Y")
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Issue.Date
## 1 DS Discover Discover 2017-01-01
## 2 DC Diners Club International Diners Club 2015-12-01
## 3 DC Diners Club International Diners Club 2007-02-01
## 4 AX American Express Bank of America 2011-08-01
## 5 JC Japan Credit Bureau JCB 2017-11-01
## 6 JC Japan Credit Bureau JCB 2015-11-01
## Expiry.Date Billing.Date Credit.Limit
## 1 2022-01-01 4 22700
## 2 2016-12-01 11 12700
## 3 2020-02-01 13 81900
## 4 2013-08-01 4 77700
## 5 2026-11-01 22 38300
## 6 2029-11-01 12 158000
## I calculated the card age by subtracting the expiry date from the issue date and counted the numbers of days.. between each card duration
ddf$Card_Age <- as.numeric(difftime(ddf$Expiry.Date, ddf$Issue.Date, units = "days"))
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Issue.Date
## 1 DS Discover Discover 2017-01-01
## 2 DC Diners Club International Diners Club 2015-12-01
## 3 DC Diners Club International Diners Club 2007-02-01
## 4 AX American Express Bank of America 2011-08-01
## 5 JC Japan Credit Bureau JCB 2017-11-01
## 6 JC Japan Credit Bureau JCB 2015-11-01
## Expiry.Date Billing.Date Credit.Limit Card_Age
## 1 2022-01-01 4 22700 1826
## 2 2016-12-01 11 12700 366
## 3 2020-02-01 13 81900 4748
## 4 2013-08-01 4 77700 731
## 5 2026-11-01 22 38300 3287
## 6 2029-11-01 12 158000 5114
## FInally, I will remove the Card Type Code Since it just a code for the card type and The Card Type Full Name since it is just the full written name for the bank card.
Final_df <- ddf %>%
select(-Card.Type.Code,-Card.Type.Full.Name)
head(Final_df)
## Issuing.Bank Issue.Date Expiry.Date Billing.Date Credit.Limit Card_Age
## 1 Discover 2017-01-01 2022-01-01 4 22700 1826
## 2 Diners Club 2015-12-01 2016-12-01 11 12700 366
## 3 Diners Club 2007-02-01 2020-02-01 13 81900 4748
## 4 Bank of America 2011-08-01 2013-08-01 4 77700 731
## 5 JCB 2017-11-01 2026-11-01 22 38300 3287
## 6 JCB 2015-11-01 2029-11-01 12 158000 5114
I will now use linear regression on the cleaned-up data, I will remove the card_age from the dataset since it is not useful at all in creating the linear regression model.
linearmod2 <- lm(Credit.Limit~.,data = Final_df)
summary(linearmod2)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90124 -50042 -7724 42308 122986
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.014e+05 9.682e+04 2.080 0.0406 *
## Issuing.BankBank of America 2.797e+04 3.310e+04 0.845 0.4005
## Issuing.BankBarclays 1.002e+04 3.317e+04 0.302 0.7633
## Issuing.BankCabela\022s WFB 5.306e+04 6.542e+04 0.811 0.4197
## Issuing.BankCapital One 1.227e+03 3.225e+04 0.038 0.9697
## Issuing.BankChase -4.236e+03 2.368e+04 -0.179 0.8585
## Issuing.BankCitibank -1.209e+03 3.221e+04 -0.038 0.9701
## Issuing.BankDiners Club -6.857e+03 2.246e+04 -0.305 0.7609
## Issuing.BankDiscover 1.148e+04 2.501e+04 0.459 0.6476
## Issuing.BankGE Capital 1.558e+04 4.725e+04 0.330 0.7424
## Issuing.BankJCB 1.643e+04 2.304e+04 0.713 0.4779
## Issuing.BankU.S. Bancorp 2.645e+04 6.508e+04 0.406 0.6855
## Issuing.BankUSAA -7.998e+04 6.464e+04 -1.237 0.2195
## Issuing.BankWells Fargo -6.760e+04 4.841e+04 -1.397 0.1663
## Issue.Date -6.814e+00 6.555e+00 -1.040 0.3016
## Expiry.Date -4.092e-01 3.535e+00 -0.116 0.9081
## Billing.Date 2.468e+02 7.757e+02 0.318 0.7512
## Card_Age NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61740 on 83 degrees of freedom
## Multiple R-squared: 0.1, Adjusted R-squared: -0.07349
## F-statistic: 0.5764 on 16 and 83 DF, p-value: 0.8931
## Remove this column from the predictor .
Final_df2 <- Final_df %>%
select(-Card_Age)
This dataset seems to be pretty good atleast we have values for each of the predictors, though the linear regression model is doing a poor job at explaining this data.
linearmod2 <- lm(Credit.Limit~.,data = Final_df2)
summary(linearmod2)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90124 -50042 -7724 42308 122986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.014e+05 9.682e+04 2.080 0.0406 *
## Issuing.BankBank of America 2.797e+04 3.310e+04 0.845 0.4005
## Issuing.BankBarclays 1.002e+04 3.317e+04 0.302 0.7633
## Issuing.BankCabela\022s WFB 5.306e+04 6.542e+04 0.811 0.4197
## Issuing.BankCapital One 1.227e+03 3.225e+04 0.038 0.9697
## Issuing.BankChase -4.236e+03 2.368e+04 -0.179 0.8585
## Issuing.BankCitibank -1.209e+03 3.221e+04 -0.038 0.9701
## Issuing.BankDiners Club -6.857e+03 2.246e+04 -0.305 0.7609
## Issuing.BankDiscover 1.148e+04 2.501e+04 0.459 0.6476
## Issuing.BankGE Capital 1.558e+04 4.725e+04 0.330 0.7424
## Issuing.BankJCB 1.643e+04 2.304e+04 0.713 0.4779
## Issuing.BankU.S. Bancorp 2.645e+04 6.508e+04 0.406 0.6855
## Issuing.BankUSAA -7.998e+04 6.464e+04 -1.237 0.2195
## Issuing.BankWells Fargo -6.760e+04 4.841e+04 -1.397 0.1663
## Issue.Date -6.814e+00 6.555e+00 -1.040 0.3016
## Expiry.Date -4.092e-01 3.535e+00 -0.116 0.9081
## Billing.Date 2.468e+02 7.757e+02 0.318 0.7512
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61740 on 83 degrees of freedom
## Multiple R-squared: 0.1, Adjusted R-squared: -0.07349
## F-statistic: 0.5764 on 16 and 83 DF, p-value: 0.8931
I was curious to see if one-hot encoding would do anything but alas it seems there were NA values so it seems I will stick with the final_df2 and leave it with that..
library(caret)
encoded_data <- model.matrix(~Issuing.Bank - 1,data = Final_df2)
head(encoded_data)
## Issuing.BankAmerican Express Issuing.BankBank of America Issuing.BankBarclays
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 1 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankCabela\022s WFB Issuing.BankCapital One Issuing.BankChase
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankCitibank Issuing.BankDiners Club Issuing.BankDiscover
## 1 0 0 1
## 2 0 1 0
## 3 0 1 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankGE Capital Issuing.BankJCB Issuing.BankU.S. Bancorp
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 1 0
## 6 0 1 0
## Issuing.BankUSAA Issuing.BankWells Fargo
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Final_df3 <- cbind(encoded_data,Final_df2)
linearmod3 <- lm(Credit.Limit~.,data = Final_df3)
summary(linearmod3)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90124 -50042 -7724 42308 122986
##
## Coefficients: (14 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.338e+05 9.511e+04 1.407 0.1632
## `Issuing.BankAmerican Express` 6.760e+04 4.841e+04 1.397 0.1663
## `Issuing.BankBank of America` 9.557e+04 5.223e+04 1.830 0.0709 .
## Issuing.BankBarclays 7.762e+04 5.268e+04 1.474 0.1444
## `Issuing.BankCabela\\022s WFB` 1.207e+05 7.636e+04 1.580 0.1179
## `Issuing.BankCapital One` 6.883e+04 5.326e+04 1.292 0.1998
## Issuing.BankChase 6.337e+04 4.710e+04 1.345 0.1822
## Issuing.BankCitibank 6.639e+04 5.313e+04 1.250 0.2149
## `Issuing.BankDiners Club` 6.075e+04 4.712e+04 1.289 0.2009
## Issuing.BankDiscover 7.908e+04 4.903e+04 1.613 0.1105
## `Issuing.BankGE Capital` 8.318e+04 6.234e+04 1.334 0.1858
## Issuing.BankJCB 8.403e+04 4.823e+04 1.742 0.0852 .
## `Issuing.BankU.S. Bancorp` 9.405e+04 7.562e+04 1.244 0.2171
## Issuing.BankUSAA -1.238e+04 7.597e+04 -0.163 0.8710
## `Issuing.BankWells Fargo` NA NA NA NA
## Issuing.BankBank of America NA NA NA NA
## Issuing.BankBarclays NA NA NA NA
## Issuing.BankCabela\022s WFB NA NA NA NA
## Issuing.BankCapital One NA NA NA NA
## Issuing.BankChase NA NA NA NA
## Issuing.BankCitibank NA NA NA NA
## Issuing.BankDiners Club NA NA NA NA
## Issuing.BankDiscover NA NA NA NA
## Issuing.BankGE Capital NA NA NA NA
## Issuing.BankJCB NA NA NA NA
## Issuing.BankU.S. Bancorp NA NA NA NA
## Issuing.BankUSAA NA NA NA NA
## Issuing.BankWells Fargo NA NA NA NA
## Issue.Date -6.814e+00 6.555e+00 -1.040 0.3016
## Expiry.Date -4.092e-01 3.535e+00 -0.116 0.9081
## Billing.Date 2.468e+02 7.757e+02 0.318 0.7512
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61740 on 83 degrees of freedom
## Multiple R-squared: 0.1, Adjusted R-squared: -0.07349
## F-statistic: 0.5764 on 16 and 83 DF, p-value: 0.8931
I will now utilize cross validation in at attempt to improve my linear regression model and use RMSe as my metric to evaluate my model.
I will make a testing and training dataset.. (I had to stratify the data on Issuing.Bank because the dataset gets split in scenarios where the unknown labels are either in the training dataset or testset)
set.seed(123)
split_index <- createDataPartition(Final_df2$Issuing.Bank,p=0.7,list = FALSE)
## Warning in createDataPartition(Final_df2$Issuing.Bank, p = 0.7, list = FALSE):
## Some classes have a single record ( Cabelas WFB, U.S. Bancorp, USAA ) and these
## will be selected for the sample
training_data <- Final_df2[split_index,]
testing_data <- Final_df2[-split_index,]
## Set a 5 fold cross validation
set.seed(5)
control <- trainControl(method = "cv",number = 5)
metric = "rmse"
## Fit a Linear Algorithm.. with the cross validation..
set.seed(1234)
fit.lm <- train(Credit.Limit~.,data = training_data,method = "lm",trControl = control)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
Here we see the performance result for each iteration we refer to the rsample object of the model
fit.lm$resample %>%
arrange(Resample)
## RMSE Rsquared MAE Resample
## 1 76504.87 3.686655e-02 66719.57 Fold1
## 2 73557.54 9.463288e-02 66697.10 Fold2
## 3 64955.93 1.561662e-02 53366.86 Fold3
## 4 64067.66 1.038633e-06 55772.15 Fold4
## 5 75021.51 9.062165e-03 62802.01 Fold5
The estimated accuracy of the model is the average of the five iterations so average RMSE is around 65877
fit.lm$resample %>%
arrange(Resample) %>%
summarise(AVGRMSE = mean(RMSE))
## AVGRMSE
## 1 70821.5
Meanwhile the average Rsquared of the model is around 3% percent using 5-fold cross validation
fit.lm$resample %>%
arrange(Resample) %>%
summarise(AvgRsquared = mean(Rsquared))
## AvgRsquared
## 1 0.03123585
Here is all the evaluation metric displayed (R-squared,RMSE,MAE) the model is pretty terrible at predicting the credit limit.
print(fit.lm)
## Linear Regression
##
## 76 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 60, 63, 60, 61, 60
## Resampling results:
##
## RMSE Rsquared MAE
## 70821.5 0.03123585 61071.54
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
predictions <- predict(fit.lm,newdata = testing_data)
head(predictions)
## 10 34 38 43 51 52
## 58484.33 118943.20 80597.11 105214.91 86101.50 90847.18
I will now attempt to use decision tree in order to improve my predictions. I will use the Caret package and the classwork from Week # 5 to help guide me with the predictions
## Use the textbook example..
library(rpart)
set.seed(3456)
fit <- rpart(Credit.Limit~.,,method = 'anova',data = training_data)
The default k-fold cross validation is 10 so we are using 10-folds cross-validation..
## plot the cv error curve for the tree.. and visualizes the cross-validation results
plotcp(fit)
printcp(fit)
##
## Regression tree:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] Billing.Date Issuing.Bank
##
## Root node error: 2.7825e+11/76 = 3661182291
##
## n= 76
##
## CP nsplit rel error xerror xstd
## 1 0.104690 0 1.00000 1.0211 0.093379
## 2 0.047020 2 0.79062 1.1893 0.137519
## 3 0.017008 5 0.64956 1.2033 0.160569
## 4 0.010000 6 0.63255 1.1871 0.162612
summary(fit)
## Call:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
## n= 76
##
## CP nsplit rel error xerror xstd
## 1 0.10469046 0 1.0000000 1.021110 0.0933792
## 2 0.04701965 2 0.7906191 1.189330 0.1375194
## 3 0.01700815 5 0.6495601 1.203257 0.1605691
## 4 0.01000000 6 0.6325520 1.187051 0.1626117
##
## Variable importance
## Billing.Date Issuing.Bank Expiry.Date Issue.Date
## 48 33 12 8
##
## Node number 1: 76 observations, complexity param=0.1046905
## mean=92885.53, MSE=3.661182e+09
## left son=2 (52 obs) right son=3 (24 obs)
## Primary splits:
## Issuing.Bank splits as LRRRRLLLRRLRLL, improve=0.07126884, (0 missing)
## Billing.Date < 9.5 to the left, improve=0.05548784, (0 missing)
## Issue.Date < 14579.5 to the right, improve=0.04089661, (0 missing)
## Expiry.Date < 21381.5 to the right, improve=0.01537840, (0 missing)
## Surrogate splits:
## Expiry.Date < 14776 to the right, agree=0.697, adj=0.042, (0 split)
##
## Node number 2: 52 observations, complexity param=0.04701965
## mean=81911.54, MSE=3.540029e+09
## left son=4 (16 obs) right son=5 (36 obs)
## Primary splits:
## Billing.Date < 21 to the right, improve=0.04930842, (0 missing)
## Issuing.Bank splits as R----RLR--R-LL, improve=0.04543175, (0 missing)
## Expiry.Date < 19235.5 to the left, improve=0.04107124, (0 missing)
## Issue.Date < 14518 to the right, improve=0.01963911, (0 missing)
## Surrogate splits:
## Expiry.Date < 22340 to the right, agree=0.75, adj=0.188, (0 split)
##
## Node number 3: 24 observations, complexity param=0.1046905
## mean=116662.5, MSE=3.097409e+09
## left son=6 (15 obs) right son=7 (9 obs)
## Primary splits:
## Billing.Date < 20.5 to the left, improve=0.51695990, (0 missing)
## Issue.Date < 15095.5 to the right, improve=0.09541808, (0 missing)
## Expiry.Date < 18566 to the right, improve=0.06577621, (0 missing)
## Issuing.Bank splits as -RLRL---RL-R--, improve=0.02043071, (0 missing)
## Surrogate splits:
## Expiry.Date < 15735.5 to the right, agree=0.750, adj=0.333, (0 split)
## Issue.Date < 14442 to the right, agree=0.708, adj=0.222, (0 split)
## Issuing.Bank splits as -LLRL---LL-L--, agree=0.667, adj=0.111, (0 split)
##
## Node number 4: 16 observations
## mean=62093.75, MSE=1.663733e+09
##
## Node number 5: 36 observations, complexity param=0.04701965
## mean=90719.44, MSE=4.121805e+09
## left son=10 (7 obs) right son=11 (29 obs)
## Primary splits:
## Issuing.Bank splits as R----RLR--R-LL, improve=0.09826102, (0 missing)
## Billing.Date < 14.5 to the left, improve=0.07934006, (0 missing)
## Expiry.Date < 20499 to the left, improve=0.06797713, (0 missing)
## Issue.Date < 14518 to the right, improve=0.02139059, (0 missing)
##
## Node number 6: 15 observations
## mean=85666.67, MSE=2.207621e+09
##
## Node number 7: 9 observations
## mean=168322.2, MSE=3.104262e+08
##
## Node number 10: 7 observations
## mean=49757.14, MSE=2.097682e+09
##
## Node number 11: 29 observations, complexity param=0.04701965
## mean=100606.9, MSE=4.107612e+09
## left son=22 (22 obs) right son=23 (7 obs)
## Primary splits:
## Billing.Date < 14.5 to the left, improve=0.13089580, (0 missing)
## Issuing.Bank splits as R----L-L--R---, improve=0.02664292, (0 missing)
## Expiry.Date < 20499 to the left, improve=0.02592121, (0 missing)
## Issue.Date < 14518 to the right, improve=0.02488519, (0 missing)
##
## Node number 22: 22 observations, complexity param=0.01700815
## mean=87527.27, MSE=4.376385e+09
## left son=44 (9 obs) right son=45 (13 obs)
## Primary splits:
## Issuing.Bank splits as L----R-L--R---, improve=0.04915344, (0 missing)
## Expiry.Date < 18915.5 to the left, improve=0.02988246, (0 missing)
## Billing.Date < 9.5 to the left, improve=0.02918640, (0 missing)
## Issue.Date < 15066.5 to the right, improve=0.01159310, (0 missing)
## Surrogate splits:
## Issue.Date < 13848 to the left, agree=0.727, adj=0.333, (0 split)
## Billing.Date < 1.5 to the left, agree=0.682, adj=0.222, (0 split)
## Expiry.Date < 18673 to the left, agree=0.636, adj=0.111, (0 split)
##
## Node number 23: 7 observations
## mean=141714.3, MSE=1.035413e+09
##
## Node number 44: 9 observations
## mean=69900, MSE=3.379296e+09
##
## Node number 45: 13 observations
## mean=99730.77, MSE=4.702638e+09
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit)
##
## Regression tree:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] Billing.Date Issuing.Bank
##
## Root node error: 2.7825e+11/76 = 3661182291
##
## n= 76
##
## CP nsplit rel error xerror xstd
## 1 0.104690 0 1.00000 1.0211 0.093379
## 2 0.047020 2 0.79062 1.1893 0.137519
## 3 0.017008 5 0.64956 1.2033 0.160569
## 4 0.010000 6 0.63255 1.1871 0.162612
# create attractive postcript plot of tree
plot(fit, uniform=TRUE,
main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
post(fit, title = "Regression Tree for Credit Limit ")
## Warning in text.default(rightptx, rightpty - 0.52 * cxy[2L],
## rows[right.child[!is.na(right.child)]], : font width unknown for character 0x12
## Warning in text.default(rightptx, rightpty - 0.52 * cxy[2L],
## rows[right.child[!is.na(right.child)]], : font metrics unknown for character
## 0x12
# Create predictions using the decision tree..
predictions1 <- predict(fit,newdata = testing_data)
head(predictions1)
## 10 34 38 43 51 52
## 69900.00 168322.22 99730.77 99730.77 62093.75 62093.75
RMSE(predictions1,testing_data$Credit.Limit)
## [1] 71028.21
I was curious why the tree didn’t choouse the minimum cp since it seems like the obvious choice however after looking at the predictions I can see why.. Interesting enough we get the same predictions everytime since there were no splits being made at the tree.. The RMSE was also reduced from the second predictions appromixately a difference of 20k..
## Finds the best value of cp..
min_cp <- fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
min_cp
## [1] 0.1046905
Using the minimal cost penalty doesn’t help with the predictions..
min_cp_optimal <- prune(fit,cp = min_cp)
predictions3 <- predict(min_cp_optimal,newdata = testing_data)
head(predictions3)
## [1] 92885.53 92885.53 92885.53 92885.53 92885.53 92885.53
RMSE(predictions3,testing_data$Credit.Limit)
## [1] 55259.12
From the way I have cleaned up the data, and attempted to use K-folds cross-validation in order to improve the data the quality of the data was terrible, Regardless of how I cleaned it, it seems to always give terrible predictions no matter how much I tried removing predictors or use techniques like cross-validation. This data set contained a lot of inconsistent data such as personal information of the users and their credit card information which I felt would not be helpful at all for the predictions. It also may be the case that there was simply not a lot of observations of the data.