Here is my attempt at working with a larger variant of the small dataset I obtained from ExcelBi Analytics. I apply the same techniques I used for the small dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
ddf <- read.csv('C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 622\\100000 CC Records.csv')
Let’s take a look at what the dataset contains.
We have 100,000 observations and 11 features; Credit.Limit is the dependent variable and the remaining columns are the predictors.
## Exploratory data analysis..
str(ddf)
## 'data.frame': 100000 obs. of 11 variables:
## $ Card.Type.Code : chr "JC" "DS" "DC" "VI" ...
## $ Card.Type.Full.Name: chr "Japan Credit Bureau" "Discover" "Diners Club International" "Visa" ...
## $ Issuing.Bank : chr "JCB" "Discover" "Diners Club" "Cabela\x92s WFB" ...
## $ Card.Number : num 3.56e+15 6.01e+15 3.09e+13 4.97e+15 4.29e+15 ...
## $ Card.Holder.s.Name : chr "Abel E Graziano" "Shane F Simon" "Johanna Cruickshank" "Yvonne L Garcia" ...
## $ CVV.CVV2 : int 443 29 383 552 998 7019 258 371 243 1601 ...
## $ Issue.Date : chr "06/2009" "02/2012" "01/2012" "06/2009" ...
## $ Expiry.Date : chr "06/2025" "02/2029" "01/2017" "06/2013" ...
## $ Billing.Date : int 10 24 10 25 19 23 28 23 16 13 ...
## $ Card.PIN : int 8289 924 6147 5466 9423 1185 6871 1277 548 4639 ...
## $ Credit.Limit : int 84300 190000 174300 18600 131600 130300 154400 151700 129300 133700 ...
summary(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Card.Number
## Length:100000 Length:100000 Length:100000 Min. :3.000e+13
## Class :character Class :character Class :character 1st Qu.:3.701e+14
## Mode :character Mode :character Mode :character Median :3.590e+15
## Mean :3.313e+15
## 3rd Qu.:5.301e+15
## Max. :6.600e+15
## Card.Holder.s.Name CVV.CVV2 Issue.Date Expiry.Date
## Length:100000 Min. : 0 Length:100000 Length:100000
## Class :character 1st Qu.: 293 Class :character Class :character
## Mode :character Median : 587 Mode :character Mode :character
## Mean :1248
## 3rd Qu.: 881
## Max. :9999
## Billing.Date Card.PIN Credit.Limit
## Min. : 1.00 Min. : 0 Min. : 10000
## 1st Qu.: 8.00 1st Qu.:2506 1st Qu.: 57400
## Median :15.00 Median :4993 Median :104600
## Mean :14.51 Mean :5004 Mean :104788
## 3rd Qu.:21.00 3rd Qu.:7505 3rd Qu.:152300
## Max. :28.00 Max. :9999 Max. :200000
Each observation contains a customer’s credit card information. I suspect the personal details (name, card number, CVV, PIN) will not be very helpful for predicting the credit limit, and I doubt that having more observations alone will improve the accuracy of the models.
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Card.Number
## 1 JC Japan Credit Bureau JCB 3.560752e+15
## 2 DS Discover Discover 6.011213e+15
## 3 DC Diners Club International Diners Club 3.090841e+13
## 4 VI Visa Cabela\x92s WFB 4.968524e+15
## 5 VI Visa Chase 4.287324e+15
## 6 AX American Express U.S. Bancorp 3.747893e+14
## Card.Holder.s.Name CVV.CVV2 Issue.Date Expiry.Date Billing.Date Card.PIN
## 1 Abel E Graziano 443 06/2009 06/2025 10 8289
## 2 Shane F Simon 29 02/2012 02/2029 24 924
## 3 Johanna Cruickshank 383 01/2012 01/2017 10 6147
## 4 Yvonne L Garcia 552 06/2009 06/2013 25 5466
## 5 Man Wojcik 998 02/2015 02/2019 19 9423
## 6 Edward J Downs 7019 04/2015 04/2025 23 1185
## Credit.Limit
## 1 84300
## 2 190000
## 3 174300
## 4 18600
## 5 131600
## 6 130300
In this larger dataset the card types are roughly evenly distributed: each brand makes up approximately 16% of the records.
ddf %>%
select(Card.Type.Full.Name) %>%
table() %>%
prop.table()
## Card.Type.Full.Name
## American Express Diners Club International Discover
## 0.16645 0.16658 0.16500
## Japan Credit Bureau Master Card Visa
## 0.16782 0.16765 0.16650
This time around, the cards come from a wider variety of issuing banks.
ddf %>%
select(Issuing.Bank) %>%
table() %>%
prop.table()
## Issuing.Bank
## Cabela\x92s WFB American Express Bank of America Barclays
## 0.00678 0.11632 0.06639 0.01681
## Capital One Chase Citibank Diners Club
## 0.03269 0.10059 0.05028 0.16658
## Discover First National GE Capital JCB
## 0.16500 0.00508 0.01478 0.16782
## PNC U.S. Bancorp USAA Wells Fargo
## 0.00664 0.03407 0.03350 0.01667
## get rid of the special characters before visualizing the data..
ddf$Issuing.Bank <- iconv(ddf$Issuing.Bank, "ASCII", "UTF-8", sub="")
ggplot(ddf,aes(x=as.factor(Issuing.Bank))) +
geom_bar(stat = "count") + coord_flip() +
geom_text(stat = "count",aes(label = after_stat(count)),vjust = 0.7,size = 2,hjust = -0.3) +
ggtitle("Count of Bank Issuers")
ggplot(ddf,aes(x = Card.Type.Full.Name)) +
geom_bar(stat = "count") + coord_flip()+
geom_text(stat = "count",aes(label = after_stat(count)),vjust = 0.7,size = 2,hjust = -0.3) +
ggtitle("Count of Bank Type")
Finally, I will look at the distribution of the credit limit. The shape is difficult to read; the observations appear spread fairly evenly across the whole range, whereas I would have assumed something closer to a Gaussian distribution given the number of observations in the dataset.
ggplot(ddf,aes(x=Credit.Limit)) +
geom_histogram() +
ggtitle("Credit Limit Distribution") +
xlab("Credit Limit") +
ylab("Count") + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
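The `stat_bin()` message suggests picking a binwidth explicitly. A minimal sketch, assuming a binwidth of 5,000 (my own choice, roughly 40 bins across the 10,000–200,000 range):
## Sketch: an explicit binwidth silences the stat_bin() message; 5000 is an assumed value
ggplot(ddf,aes(x=Credit.Limit)) +
geom_histogram(binwidth = 5000) +
ggtitle("Credit Limit Distribution") +
xlab("Credit Limit") +
ylab("Count") + theme_minimal()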
I used corrplot to check the correlations between the numeric predictors and see which of them might help predict the credit limit. With this many observations, the correlations between the numeric predictors are even weaker than in the small dataset.
library(corrplot)
## corrplot 0.92 loaded
# Assuming 'df' is your data frame with numeric predictors
# You can select only the numeric columns from your data frame
# addCoef.col adds the correlation number on your scatterplot matrix..
numeric_predictors <- ddf[, sapply(ddf, is.numeric)]
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_predictors)
# Create a scatterplot matrix with corrplot
corrplot(correlation_matrix, method = "circle",number.cex = 0.7,addCoef.col = "black",type = "lower")
Next I fit a linear regression model with all of the predictors to see whether any of them help explain Credit.Limit. It turns out the model cannot even be fit: the model matrix is far too large to allocate in memory.
## This model cannot be fit -- the model matrix is too large to allocate..
linearmod <- lm(Credit.Limit~.,data = ddf)
## Error: cannot allocate vector of size 74.0 Gb
summary(linearmod)
## Error in eval(expr, envir, enclos): object 'linearmod' not found
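The likely cause, and this is my own reading of the error, is that lm() expands every character column into one dummy column per unique value, and a column such as Card.Holder.s.Name has an enormous number of unique values. A quick diagnostic sketch:
## Sketch: count unique values per character column -- the high-cardinality ones blow up the model matrix
sapply(ddf[, sapply(ddf, is.character)], function(x) length(unique(x)))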
Here we will clean the data again; the approach is much the same as for the small dataset.
## Let's remove columns that would not be useful for credit limit estimation, such as the card number, card holder's name, CVV, and card PIN -- these are identifiers and security features and don't carry any useful information..
ddf <- ddf %>%
select(-Card.Number,-Card.Holder.s.Name,-CVV.CVV2,-Card.PIN)
Next, create proper date columns. We are going to transform the issue and expiry dates into numeric features so the models run better, and calculate the card age by subtracting the issue date from the expiry date.
library(lubridate)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
ddf$Issue.Date <- as.Date(paste("01/", ddf$Issue.Date, sep=""), format="%d/%m/%Y")
ddf$Expiry.Date <- as.Date(paste("01/", ddf$Expiry.Date, sep=""), format="%d/%m/%Y")
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Issue.Date
## 1 JC Japan Credit Bureau JCB 2009-06-01
## 2 DS Discover Discover 2012-02-01
## 3 DC Diners Club International Diners Club 2012-01-01
## 4 VI Visa Cabela\022s WFB 2009-06-01
## 5 VI Visa Chase 2015-02-01
## 6 AX American Express U.S. Bancorp 2015-04-01
## Expiry.Date Billing.Date Credit.Limit
## 1 2025-06-01 10 84300
## 2 2029-02-01 24 190000
## 3 2017-01-01 10 174300
## 4 2013-06-01 25 18600
## 5 2019-02-01 19 131600
## 6 2025-04-01 23 130300
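As an aside, since lubridate is already loaded, the same "MM/YYYY" strings can be parsed directly with its month-year parser; a one-line sketch (this assumes a reasonably recent lubridate version, and issue_alt is just an illustrative name):
## Sketch: my() reads "month/year" strings and returns the first day of that month
issue_alt <- my("06/2009")
issue_alt ## "2009-06-01"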
## I calculated the card age as the number of days between the issue date and the expiry date (expiry minus issue)..
ddf$Card_Age <- as.numeric(difftime(ddf$Expiry.Date, ddf$Issue.Date, units = "days"))
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Issue.Date
## 1 JC Japan Credit Bureau JCB 2009-06-01
## 2 DS Discover Discover 2012-02-01
## 3 DC Diners Club International Diners Club 2012-01-01
## 4 VI Visa Cabela\022s WFB 2009-06-01
## 5 VI Visa Chase 2015-02-01
## 6 AX American Express U.S. Bancorp 2015-04-01
## Expiry.Date Billing.Date Credit.Limit Card_Age
## 1 2025-06-01 10 84300 5844
## 2 2029-02-01 24 190000 6210
## 3 2017-01-01 10 174300 1827
## 4 2013-06-01 25 18600 1461
## 5 2019-02-01 19 131600 1461
## 6 2025-04-01 23 130300 3653
## Finally, I will remove Card.Type.Code, since it is just a code for the card type, and Card.Type.Full.Name, since it is just the full written name of the card brand.
Final_df <- ddf %>%
select(-Card.Type.Code,-Card.Type.Full.Name)
head(Final_df)
## Issuing.Bank Issue.Date Expiry.Date Billing.Date Credit.Limit Card_Age
## 1 JCB 2009-06-01 2025-06-01 10 84300 5844
## 2 Discover 2012-02-01 2029-02-01 24 190000 6210
## 3 Diners Club 2012-01-01 2017-01-01 10 174300 1827
## 4 Cabela\022s WFB 2009-06-01 2013-06-01 25 18600 1461
## 5 Chase 2015-02-01 2019-02-01 19 131600 1461
## 6 U.S. Bancorp 2015-04-01 2025-04-01 23 130300 3653
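Before dropping them, a quick check (a sketch) that Card.Type.Code and Card.Type.Full.Name carry the same information: the distinct pairs should come back as one row per card brand.
## Sketch: list the distinct code / full-name pairs -- one row per brand means the two columns are redundant with each other
ddf %>%
distinct(Card.Type.Code,Card.Type.Full.Name)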
Fitting the linear regression on the cleaned data, the R-squared is extremely small; the model explains essentially none of the variance in Credit.Limit. Card_Age also comes back as NA, so it is not useful here and will be removed afterwards.
linearmod2 <- lm(Credit.Limit~.,data = Final_df)
summary(linearmod2)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96027 -47294 -218 47483 98216
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.064e+05 2.418e+03 44.007 <2e-16 ***
## Issuing.BankBank of America -1.047e+03 8.443e+02 -1.240 0.215
## Issuing.BankBarclays 7.729e+02 1.432e+03 0.540 0.589
## Issuing.BankCabela\022s WFB -6.975e+02 2.169e+03 -0.322 0.748
## Issuing.BankCapital One 6.324e+02 1.087e+03 0.582 0.561
## Issuing.BankChase 3.290e+02 7.474e+02 0.440 0.660
## Issuing.BankCitibank 8.550e+02 9.265e+02 0.923 0.356
## Issuing.BankDiners Club -7.763e+01 6.633e+02 -0.117 0.907
## Issuing.BankDiscover 7.733e+02 6.646e+02 1.164 0.245
## Issuing.BankFirst National 5.773e+02 2.488e+03 0.232 0.817
## Issuing.BankGE Capital -3.098e+03 1.516e+03 -2.044 0.041 *
## Issuing.BankJCB 8.164e+01 6.623e+02 0.123 0.902
## Issuing.BankPNC 7.297e+02 2.190e+03 0.333 0.739
## Issuing.BankU.S. Bancorp -6.281e+02 1.069e+03 -0.587 0.557
## Issuing.BankUSAA -3.771e+02 1.076e+03 -0.350 0.726
## Issuing.BankWells Fargo -2.350e+03 1.438e+03 -1.634 0.102
## Issue.Date -1.941e-01 1.709e-01 -1.136 0.256
## Expiry.Date 5.606e-02 8.231e-02 0.681 0.496
## Billing.Date 1.883e+01 2.151e+01 0.875 0.381
## Card_Age NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54890 on 99981 degrees of freedom
## Multiple R-squared: 0.0001874, Adjusted R-squared: 7.352e-06
## F-statistic: 1.041 on 18 and 99981 DF, p-value: 0.4083
Once again, Card_Age is useless as a predictor: its coefficient is NA because it is an exact linear combination of the issue and expiry dates, so it will be removed.
## Remove this column from the predictors.
Final_df2 <- Final_df %>%
select(-Card_Age)
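A quick check of that collinearity (a sketch; R stores a Date as the number of days since 1970-01-01, so Card_Age is exactly the numeric difference of the two date columns):
## Sketch: Card_Age equals Expiry.Date minus Issue.Date in days, which is why lm() returned NA for it
all.equal(Final_df$Card_Age,
as.numeric(Final_df$Expiry.Date) - as.numeric(Final_df$Issue.Date))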
The results are essentially the same, but at least there are no NA singularities now. Once again, there are no significant predictors that can help improve the accuracy of the model.
linearmod2 <- lm(Credit.Limit~.,data = Final_df2)
summary(linearmod2)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96027 -47294 -218 47483 98216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.064e+05 2.418e+03 44.007 <2e-16 ***
## Issuing.BankBank of America -1.047e+03 8.443e+02 -1.240 0.215
## Issuing.BankBarclays 7.729e+02 1.432e+03 0.540 0.589
## Issuing.BankCabela\022s WFB -6.975e+02 2.169e+03 -0.322 0.748
## Issuing.BankCapital One 6.324e+02 1.087e+03 0.582 0.561
## Issuing.BankChase 3.290e+02 7.474e+02 0.440 0.660
## Issuing.BankCitibank 8.550e+02 9.265e+02 0.923 0.356
## Issuing.BankDiners Club -7.763e+01 6.633e+02 -0.117 0.907
## Issuing.BankDiscover 7.733e+02 6.646e+02 1.164 0.245
## Issuing.BankFirst National 5.773e+02 2.488e+03 0.232 0.817
## Issuing.BankGE Capital -3.098e+03 1.516e+03 -2.044 0.041 *
## Issuing.BankJCB 8.164e+01 6.623e+02 0.123 0.902
## Issuing.BankPNC 7.297e+02 2.190e+03 0.333 0.739
## Issuing.BankU.S. Bancorp -6.281e+02 1.069e+03 -0.587 0.557
## Issuing.BankUSAA -3.771e+02 1.076e+03 -0.350 0.726
## Issuing.BankWells Fargo -2.350e+03 1.438e+03 -1.634 0.102
## Issue.Date -1.941e-01 1.709e-01 -1.136 0.256
## Expiry.Date 5.606e-02 8.231e-02 0.681 0.496
## Billing.Date 1.883e+01 2.151e+01 0.875 0.381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54890 on 99981 degrees of freedom
## Multiple R-squared: 0.0001874, Adjusted R-squared: 7.352e-06
## F-statistic: 1.041 on 18 and 99981 DF, p-value: 0.4083
Like before, I was also curious to see whether one-hot encoding would change anything, but the encoded model is full of NA coefficients, so I will stick with Final_df2 and leave it at that.
library(caret)
encoded_data <- model.matrix(~Issuing.Bank - 1,data = Final_df2)
head(encoded_data)
## Issuing.BankAmerican Express Issuing.BankBank of America Issuing.BankBarclays
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankCabela\022s WFB Issuing.BankCapital One Issuing.BankChase
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 1
## 6 0 0 0
## Issuing.BankCitibank Issuing.BankDiners Club Issuing.BankDiscover
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankFirst National Issuing.BankGE Capital Issuing.BankJCB
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankPNC Issuing.BankU.S. Bancorp Issuing.BankUSAA
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 1 0
## Issuing.BankWells Fargo
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Final_df3 <- cbind(encoded_data,Final_df2)
linearmod3 <- lm(Credit.Limit~.,data = Final_df3)
summary(linearmod3)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96027 -47294 -218 47483 98216
##
## Coefficients: (16 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.040e+05 2.720e+03 38.244 <2e-16 ***
## `Issuing.BankAmerican Express` 2.350e+03 1.438e+03 1.634 0.1022
## `Issuing.BankBank of America` 1.302e+03 1.504e+03 0.866 0.3865
## Issuing.BankBarclays 3.122e+03 1.897e+03 1.646 0.0998 .
## `Issuing.BankCabela\\022s WFB` 1.652e+03 2.500e+03 0.661 0.5088
## `Issuing.BankCapital One` 2.982e+03 1.652e+03 1.805 0.0711 .
## Issuing.BankChase 2.679e+03 1.452e+03 1.845 0.0650 .
## Issuing.BankCitibank 3.205e+03 1.551e+03 2.066 0.0389 *
## `Issuing.BankDiners Club` 2.272e+03 1.410e+03 1.611 0.1071
## Issuing.BankDiscover 3.123e+03 1.411e+03 2.214 0.0269 *
## `Issuing.BankFirst National` 2.927e+03 2.782e+03 1.052 0.2927
## `Issuing.BankGE Capital` -7.485e+02 1.961e+03 -0.382 0.7027
## Issuing.BankJCB 2.431e+03 1.410e+03 1.725 0.0846 .
## Issuing.BankPNC 3.079e+03 2.519e+03 1.222 0.2216
## `Issuing.BankU.S. Bancorp` 1.721e+03 1.641e+03 1.049 0.2941
## Issuing.BankUSAA 1.972e+03 1.645e+03 1.199 0.2306
## `Issuing.BankWells Fargo` NA NA NA NA
## Issuing.BankBank of America NA NA NA NA
## Issuing.BankBarclays NA NA NA NA
## Issuing.BankCabela\022s WFB NA NA NA NA
## Issuing.BankCapital One NA NA NA NA
## Issuing.BankChase NA NA NA NA
## Issuing.BankCitibank NA NA NA NA
## Issuing.BankDiners Club NA NA NA NA
## Issuing.BankDiscover NA NA NA NA
## Issuing.BankFirst National NA NA NA NA
## Issuing.BankGE Capital NA NA NA NA
## Issuing.BankJCB NA NA NA NA
## Issuing.BankPNC NA NA NA NA
## Issuing.BankU.S. Bancorp NA NA NA NA
## Issuing.BankUSAA NA NA NA NA
## Issuing.BankWells Fargo NA NA NA NA
## Issue.Date -1.941e-01 1.709e-01 -1.136 0.2559
## Expiry.Date 5.606e-02 8.231e-02 0.681 0.4959
## Billing.Date 1.883e+01 2.151e+01 0.875 0.3814
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54890 on 99981 degrees of freedom
## Multiple R-squared: 0.0001874, Adjusted R-squared: 7.352e-06
## F-statistic: 1.041 on 18 and 99981 DF, p-value: 0.4083
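The NA coefficients appear because the dummy matrix was cbind()-ed onto a data frame that still contains the original Issuing.Bank column, so every dummy is duplicated. If I wanted a clean one-hot model, a sketch would be to encode with caret's dummyVars() using fullRank = TRUE and drop the original column; Final_df3b and linearmod3b are illustrative names, and the resulting fit would be equivalent to linearmod2.
## Sketch: full-rank dummy encoding with the original factor column removed, so no coefficients are aliased
dv <- dummyVars(~Issuing.Bank,data = Final_df2,fullRank = TRUE)
bank_dummies <- predict(dv,newdata = Final_df2)
Final_df3b <- cbind(as.data.frame(bank_dummies),Final_df2 %>% select(-Issuing.Bank))
linearmod3b <- lm(Credit.Limit~.,data = Final_df3b)
## summary(linearmod3b) would match linearmod2, just with explicit dummy columns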
I will use cross-validation to evaluate my linear regression model, with RMSE as the metric.
First I will make training and testing datasets. (I stratified the split on Issuing.Bank; otherwise some bank levels could end up only in the training set or only in the test set.)
set.seed(123)
split_index <- createDataPartition(Final_df2$Issuing.Bank,p=0.7,list = FALSE)
training_data <- Final_df2[split_index,]
testing_data <- Final_df2[-split_index,]
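A quick sanity check of the stratified split (a sketch): both calls should return character(0), meaning no bank level is missing from either partition.
## Sketch: confirm every Issuing.Bank level appears in both the training and testing data
setdiff(unique(Final_df2$Issuing.Bank),unique(training_data$Issuing.Bank))
setdiff(unique(Final_df2$Issuing.Bank),unique(testing_data$Issuing.Bank))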
## Set a 5 fold cross validation
set.seed(5)
control <- trainControl(method = "cv",number = 5)
metric <- "RMSE" ## RMSE is caret's default metric for regression models
## Fit a Linear Algorithm.. with the cross validation..
set.seed(1234)
fit.lm <- train(Credit.Limit~.,data = training_data,method = "lm",trControl = control)
Here we see the performance for each fold by referring to the resample object of the fitted model.
fit.lm$resample %>%
arrange(Resample)
## RMSE Rsquared MAE Resample
## 1 54943.29 1.287917e-06 47604.01 Fold1
## 2 54861.51 3.556867e-05 47415.03 Fold2
## 3 54668.85 1.524391e-05 47316.85 Fold3
## 4 54994.35 2.168152e-05 47672.28 Fold4
## 5 54852.01 3.714464e-05 47415.29 Fold5
The estimated performance of the model is the average over the five folds, so the average RMSE is around 54,864.
fit.lm$resample %>%
arrange(Resample) %>%
summarise(AVGRMSE = mean(RMSE))
## AVGRMSE
## 1 54864
Meanwhile, the average R-squared of the model under 5-fold cross-validation is nowhere near even 1 percent.
fit.lm$resample %>%
arrange(Resample) %>%
summarise(AvgRsquared = mean(Rsquared))
## AvgRsquared
## 1 2.218533e-05
Here all of the evaluation metrics are displayed (RMSE, R-squared, MAE); the model is very poor at predicting the credit limit.
print(fit.lm)
## Linear Regression
##
## 70006 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 56004, 56005, 56005, 56006, 56004
## Resampling results:
##
## RMSE Rsquared MAE
## 54864 2.218533e-05 47484.69
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The predictions are all around 105,000, close to the mean credit limit.
predictions <- predict(fit.lm,newdata = testing_data)
head(predictions)
## 1 3 4 15 18 19
## 105118.1 104484.1 102681.4 105518.5 105252.5 103866.0
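For comparison with the decision tree fitted below, the hold-out RMSE of the linear model can be computed the same way (a sketch; output not shown here):
## Sketch: hold-out RMSE of the cross-validated linear model on the test set
RMSE(predictions,testing_data$Credit.Limit)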
Here I will use a decision tree again, following the rpart textbook example and the classwork from Week 5 to guide the predictions.
## Use the textbook example..
library(rpart)
set.seed(3456)
fit <- rpart(Credit.Limit~.,method = 'anova',data = training_data)
rpart’s built-in cross-validation defaults to 10 folds, so we are effectively using 10-fold cross-validation here.
There is no tree at all, just a single root node; the cp plot shows no usable split values (only Inf for the root). Since there are no splits, every prediction will be the same number.
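One follow-up worth sketching, under the assumption that the missing splits are due to rpart's default complexity parameter of cp = 0.01: refitting with a smaller cp shows whether any split reduces the error at all (fit_cp is just an illustrative name).
## Sketch: a lower cp lets rpart keep splits whose improvement falls below the default threshold
fit_cp <- rpart(Credit.Limit~.,method = 'anova',data = training_data,
control = rpart.control(cp = 1e-4))
printcp(fit_cp)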
## plot the cv error curve for the tree and visualize the cross-validation results
plotcp(fit)
printcp(fit)
##
## Regression tree:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
##
## Variables actually used in tree construction:
## character(0)
##
## Root node error: 2.1065e+14/70006 = 3009088678
##
## n= 70006
##
## CP nsplit rel error xerror xstd
## 1 0.00013756 0 1 0 0
summary(fit)
## Call:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
## n= 70006
##
## CP nsplit rel error xerror xstd
## 1 0.0001375629 0 1 0 0
##
## Node number 1: 70006 observations
## mean=104847.4, MSE=3.009089e+09
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit)
##
## Regression tree:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
##
## Variables actually used in tree construction:
## character(0)
##
## Root node error: 2.1065e+14/70006 = 3009088678
##
## n= 70006
##
## CP nsplit rel error xerror xstd
## 1 0.00013756 0 1 0 0
# create an attractive PostScript plot of the tree
plot(fit, uniform=TRUE,
main="Regression Tree for Credit Limit ")
## Error in plot.rpart(fit, uniform = TRUE, main = "Regression Tree for Credit Limit "): fit is not a tree, just a root
text(fit, use.n=TRUE, all=TRUE, cex=.8)
## Error in text.rpart(fit, use.n = TRUE, all = TRUE, cex = 0.8): fit is not a tree, just a root
post(fit, title = "Regression Tree for Credit Limit ")
## Error in plot.rpart(tree, uniform = TRUE, branch = 0.2, compress = TRUE, : fit is not a tree, just a root
The predictions are all just a single value here, the mean of the training data.
# Create predictions using the decision tree..
predictions1 <- predict(fit,newdata = testing_data)
head(predictions1)
## [1] 104847.4 104847.4 104847.4 104847.4 104847.4 104847.4
RMSE(predictions1,testing_data$Credit.Limit)
## [1] 54977.27
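As a reference point (a sketch), the hold-out RMSE is essentially the standard deviation of Credit.Limit in the test set, which is what always predicting the mean would give:
## Sketch: the standard deviation of the target is roughly the RMSE of a mean-only predictor
sd(testing_data$Credit.Limit)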
Like before, this dataset is just a larger variant of the smaller dataset I obtained from ExcelBi Analytics. I had hoped the models would improve with the extra observations, but that does not appear to be the case: the data suffers from multicollinearity, and the additional observations do not make the RMSE or the predictions any better. The decision tree never grows beyond a single root node, so all of its predictions are the same value. This tells me the quality of the predictors in the dataset is poor, since both the model fit and the predictions are terrible. Next time I would try to collect more informative predictors, such as income, the number of credit cards a customer holds, or whether the customer has a loan. This dataset merely contained predictors that are related to each other, which made it very difficult to explain the variance in the data and to predict the credit limit.