Here is my attempt at working with a larger variant of the small dataset I obtained from ExcelBi Analytics. I apply the same techniques I used for the small dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
ddf <- read.csv('C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 622\\100000 CC Records.csv')
Let’s take a look at what the dataset contains.
We have 100,000 observations and 11 features; Credit.Limit is the dependent variable and the remaining columns are the predictors.
## Exploratory data analysis..
str(ddf)
## 'data.frame': 100000 obs. of 11 variables:
## $ Card.Type.Code : chr "JC" "DS" "DC" "VI" ...
## $ Card.Type.Full.Name: chr "Japan Credit Bureau" "Discover" "Diners Club International" "Visa" ...
## $ Issuing.Bank : chr "JCB" "Discover" "Diners Club" "Cabela\x92s WFB" ...
## $ Card.Number : num 3.56e+15 6.01e+15 3.09e+13 4.97e+15 4.29e+15 ...
## $ Card.Holder.s.Name : chr "Abel E Graziano" "Shane F Simon" "Johanna Cruickshank" "Yvonne L Garcia" ...
## $ CVV.CVV2 : int 443 29 383 552 998 7019 258 371 243 1601 ...
## $ Issue.Date : chr "06/2009" "02/2012" "01/2012" "06/2009" ...
## $ Expiry.Date : chr "06/2025" "02/2029" "01/2017" "06/2013" ...
## $ Billing.Date : int 10 24 10 25 19 23 28 23 16 13 ...
## $ Card.PIN : int 8289 924 6147 5466 9423 1185 6871 1277 548 4639 ...
## $ Credit.Limit : int 84300 190000 174300 18600 131600 130300 154400 151700 129300 133700 ...
summary(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Card.Number
## Length:100000 Length:100000 Length:100000 Min. :3.000e+13
## Class :character Class :character Class :character 1st Qu.:3.701e+14
## Mode :character Mode :character Mode :character Median :3.590e+15
## Mean :3.313e+15
## 3rd Qu.:5.301e+15
## Max. :6.600e+15
## Card.Holder.s.Name CVV.CVV2 Issue.Date Expiry.Date
## Length:100000 Min. : 0 Length:100000 Length:100000
## Class :character 1st Qu.: 293 Class :character Class :character
## Mode :character Median : 587 Mode :character Mode :character
## Mean :1248
## 3rd Qu.: 881
## Max. :9999
## Billing.Date Card.PIN Credit.Limit
## Min. : 1.00 Min. : 0 Min. : 10000
## 1st Qu.: 8.00 1st Qu.:2506 1st Qu.: 57400
## Median :15.00 Median :4993 Median :104600
## Mean :14.51 Mean :5004 Mean :104788
## 3rd Qu.:21.00 3rd Qu.:7505 3rd Qu.:152300
## Max. :28.00 Max. :9999 Max. :200000
Each observation contains a customer’s credit card information. I suspect the personal details (name, card number, CVV, PIN) will not be very helpful for predicting the credit limit, and I doubt that having more observations alone will improve the accuracy of the models.
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Card.Number
## 1 JC Japan Credit Bureau JCB 3.560752e+15
## 2 DS Discover Discover 6.011213e+15
## 3 DC Diners Club International Diners Club 3.090841e+13
## 4 VI Visa Cabela\x92s WFB 4.968524e+15
## 5 VI Visa Chase 4.287324e+15
## 6 AX American Express U.S. Bancorp 3.747893e+14
## Card.Holder.s.Name CVV.CVV2 Issue.Date Expiry.Date Billing.Date Card.PIN
## 1 Abel E Graziano 443 06/2009 06/2025 10 8289
## 2 Shane F Simon 29 02/2012 02/2029 24 924
## 3 Johanna Cruickshank 383 01/2012 01/2017 10 6147
## 4 Yvonne L Garcia 552 06/2009 06/2013 25 5466
## 5 Man Wojcik 998 02/2015 02/2019 19 9423
## 6 Edward J Downs 7019 04/2015 04/2025 23 1185
## Credit.Limit
## 1 84300
## 2 190000
## 3 174300
## 4 18600
## 5 131600
## 6 130300
In this larger dataset the card types are roughly evenly distributed: each brand makes up approximately 16% of the records.
ddf %>%
select(Card.Type.Full.Name) %>%
table() %>%
prop.table()
## Card.Type.Full.Name
## American Express Diners Club International Discover
## 0.16645 0.16658 0.16500
## Japan Credit Bureau Master Card Visa
## 0.16782 0.16765 0.16650
This time around, the cards come from a wider variety of issuing banks.
ddf %>%
select(Issuing.Bank) %>%
table() %>%
prop.table()
## Issuing.Bank
## Cabela\x92s WFB American Express Bank of America Barclays
## 0.00678 0.11632 0.06639 0.01681
## Capital One Chase Citibank Diners Club
## 0.03269 0.10059 0.05028 0.16658
## Discover First National GE Capital JCB
## 0.16500 0.00508 0.01478 0.16782
## PNC U.S. Bancorp USAA Wells Fargo
## 0.00664 0.03407 0.03350 0.01667
## get rid of the special characters before visualizing the data..
ddf$Issuing.Bank <- iconv(ddf$Issuing.Bank, "ASCII", "UTF-8", sub="")
ggplot(ddf,aes(x=as.factor(Issuing.Bank))) +
geom_bar(stat = "count") + coord_flip() +
geom_text(stat = "count",aes(label = after_stat(count)),vjust = 0.7,size = 2,hjust = -0.3) +
ggtitle("Count of Bank Issuers")
ggplot(ddf,aes(x = Card.Type.Full.Name)) +
geom_bar(stat = "count") + coord_flip()+
geom_text(stat = "count",aes(label = after_stat(count)),vjust = 0.7,size = 2,hjust = -0.3) +
ggtitle("Count of Bank Type")
Finally, I will look at the distribution of the credit limit. The shape is difficult to read; the observations appear spread fairly evenly across the whole range, whereas I would have assumed something closer to a Gaussian distribution given the number of observations in the dataset.
ggplot(ddf,aes(x=Credit.Limit)) +
geom_histogram() +
ggtitle("Credit Limit Distribution") +
xlab("Credit Limit") +
ylab("Count") + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
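The `stat_bin()` message suggests picking a binwidth explicitly. A minimal sketch, assuming a binwidth of 5,000 (my own choice, roughly 40 bins across the 10,000–200,000 range):
## Sketch: an explicit binwidth silences the stat_bin() message; 5000 is an assumed value
ggplot(ddf,aes(x=Credit.Limit)) +
geom_histogram(binwidth = 5000) +
ggtitle("Credit Limit Distribution") +
xlab("Credit Limit") +
ylab("Count") + theme_minimal()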
I used corrplot to check the correlations between the numeric predictors and see which of them might help predict the credit limit. With this many observations, the correlations between the numeric predictors are even weaker than in the small dataset.
library(corrplot)
## corrplot 0.92 loaded
# Assuming 'df' is your data frame with numeric predictors
# You can select only the numeric columns from your data frame
# addCoef.col adds the correlation number on your scatterplot matrix..
numeric_predictors <- ddf[, sapply(ddf, is.numeric)]
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_predictors)
# Create a scatterplot matrix with corrplot
corrplot(correlation_matrix, method = "circle",number.cex = 0.7,addCoef.col = "black",type = "lower")
Next I fit a linear regression model with all of the predictors to see whether any of them help explain Credit.Limit. It turns out the model cannot even be fit: the model matrix is far too large to allocate in memory.
## This model cannot be fit -- the model matrix is too large to allocate..
linearmod <- lm(Credit.Limit~.,data = ddf)
## Error: cannot allocate vector of size 74.0 Gb
summary(linearmod)
## Error in eval(expr, envir, enclos): object 'linearmod' not found
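The likely cause, and this is my own reading of the error, is that lm() expands every character column into one dummy column per unique value, and a column such as Card.Holder.s.Name has an enormous number of unique values. A quick diagnostic sketch:
## Sketch: count unique values per character column -- the high-cardinality ones blow up the model matrix
sapply(ddf[, sapply(ddf, is.character)], function(x) length(unique(x)))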
Here we will clean the data again; the approach is much the same as for the small dataset.
## Let's remove columns that would not be useful for credit limit estimation, such as the card number, card holder's name, CVV, and card PIN -- these are identifiers and security features and don't carry any useful information..
ddf <- ddf %>%
select(-Card.Number,-Card.Holder.s.Name,-CVV.CVV2,-Card.PIN)
Next, create proper date columns. We are going to transform the issue and expiry dates into numeric features so the models run better, and calculate the card age by subtracting the issue date from the expiry date.
library(lubridate)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
ddf$Issue.Date <- as.Date(paste("01/", ddf$Issue.Date, sep=""), format="%d/%m/%Y")
ddf$Expiry.Date <- as.Date(paste("01/", ddf$Expiry.Date, sep=""), format="%d/%m/%Y")
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Issue.Date
## 1 JC Japan Credit Bureau JCB 2009-06-01
## 2 DS Discover Discover 2012-02-01
## 3 DC Diners Club International Diners Club 2012-01-01
## 4 VI Visa Cabela\022s WFB 2009-06-01
## 5 VI Visa Chase 2015-02-01
## 6 AX American Express U.S. Bancorp 2015-04-01
## Expiry.Date Billing.Date Credit.Limit
## 1 2025-06-01 10 84300
## 2 2029-02-01 24 190000
## 3 2017-01-01 10 174300
## 4 2013-06-01 25 18600
## 5 2019-02-01 19 131600
## 6 2025-04-01 23 130300
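As an aside, since lubridate is already loaded, the same "MM/YYYY" strings can be parsed directly with its month-year parser; a one-line sketch (this assumes a reasonably recent lubridate version, and issue_alt is just an illustrative name):
## Sketch: my() reads "month/year" strings and returns the first day of that month
issue_alt <- my("06/2009")
issue_alt ## "2009-06-01"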
## I calculated the card age as the number of days between the issue date and the expiry date (expiry minus issue)..
ddf$Card_Age <- as.numeric(difftime(ddf$Expiry.Date, ddf$Issue.Date, units = "days"))
head(ddf)
## Card.Type.Code Card.Type.Full.Name Issuing.Bank Issue.Date
## 1 JC Japan Credit Bureau JCB 2009-06-01
## 2 DS Discover Discover 2012-02-01
## 3 DC Diners Club International Diners Club 2012-01-01
## 4 VI Visa Cabela\022s WFB 2009-06-01
## 5 VI Visa Chase 2015-02-01
## 6 AX American Express U.S. Bancorp 2015-04-01
## Expiry.Date Billing.Date Credit.Limit Card_Age
## 1 2025-06-01 10 84300 5844
## 2 2029-02-01 24 190000 6210
## 3 2017-01-01 10 174300 1827
## 4 2013-06-01 25 18600 1461
## 5 2019-02-01 19 131600 1461
## 6 2025-04-01 23 130300 3653
## Finally, I will remove Card.Type.Code, since it is just a code for the card type, and Card.Type.Full.Name, since it is just the full written name of the card brand.
Final_df <- ddf %>%
select(-Card.Type.Code,-Card.Type.Full.Name)
head(Final_df)
## Issuing.Bank Issue.Date Expiry.Date Billing.Date Credit.Limit Card_Age
## 1 JCB 2009-06-01 2025-06-01 10 84300 5844
## 2 Discover 2012-02-01 2029-02-01 24 190000 6210
## 3 Diners Club 2012-01-01 2017-01-01 10 174300 1827
## 4 Cabela\022s WFB 2009-06-01 2013-06-01 25 18600 1461
## 5 Chase 2015-02-01 2019-02-01 19 131600 1461
## 6 U.S. Bancorp 2015-04-01 2025-04-01 23 130300 3653
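Before dropping them, a quick check (a sketch) that Card.Type.Code and Card.Type.Full.Name carry the same information: the distinct pairs should come back as one row per card brand.
## Sketch: list the distinct code / full-name pairs -- one row per brand means the two columns are redundant with each other
ddf %>%
distinct(Card.Type.Code,Card.Type.Full.Name)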
Fitting the linear regression on the cleaned data, the R-squared is extremely small; the model explains essentially none of the variance in Credit.Limit. Card_Age also comes back as NA, so it is not useful here and will be removed afterwards.
linearmod2 <- lm(Credit.Limit~.,data = Final_df)
summary(linearmod2)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96027 -47294 -218 47483 98216
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.064e+05 2.418e+03 44.007 <2e-16 ***
## Issuing.BankBank of America -1.047e+03 8.443e+02 -1.240 0.215
## Issuing.BankBarclays 7.729e+02 1.432e+03 0.540 0.589
## Issuing.BankCabela\022s WFB -6.975e+02 2.169e+03 -0.322 0.748
## Issuing.BankCapital One 6.324e+02 1.087e+03 0.582 0.561
## Issuing.BankChase 3.290e+02 7.474e+02 0.440 0.660
## Issuing.BankCitibank 8.550e+02 9.265e+02 0.923 0.356
## Issuing.BankDiners Club -7.763e+01 6.633e+02 -0.117 0.907
## Issuing.BankDiscover 7.733e+02 6.646e+02 1.164 0.245
## Issuing.BankFirst National 5.773e+02 2.488e+03 0.232 0.817
## Issuing.BankGE Capital -3.098e+03 1.516e+03 -2.044 0.041 *
## Issuing.BankJCB 8.164e+01 6.623e+02 0.123 0.902
## Issuing.BankPNC 7.297e+02 2.190e+03 0.333 0.739
## Issuing.BankU.S. Bancorp -6.281e+02 1.069e+03 -0.587 0.557
## Issuing.BankUSAA -3.771e+02 1.076e+03 -0.350 0.726
## Issuing.BankWells Fargo -2.350e+03 1.438e+03 -1.634 0.102
## Issue.Date -1.941e-01 1.709e-01 -1.136 0.256
## Expiry.Date 5.606e-02 8.231e-02 0.681 0.496
## Billing.Date 1.883e+01 2.151e+01 0.875 0.381
## Card_Age NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54890 on 99981 degrees of freedom
## Multiple R-squared: 0.0001874, Adjusted R-squared: 7.352e-06
## F-statistic: 1.041 on 18 and 99981 DF, p-value: 0.4083
Once again, Card_Age is useless as a predictor: its coefficient is NA because it is an exact linear combination of the issue and expiry dates, so it will be removed.
## Remove this column from the predictors.
Final_df2 <- Final_df %>%
select(-Card_Age)
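A quick check of that collinearity (a sketch; R stores a Date as the number of days since 1970-01-01, so Card_Age is exactly the numeric difference of the two date columns):
## Sketch: Card_Age equals Expiry.Date minus Issue.Date in days, which is why lm() returned NA for it
all.equal(Final_df$Card_Age,
as.numeric(Final_df$Expiry.Date) - as.numeric(Final_df$Issue.Date))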
The results are essentially the same, but at least there are no NA singularities now. Once again, there are no significant predictors that can help improve the accuracy of the model.
linearmod2 <- lm(Credit.Limit~.,data = Final_df2)
summary(linearmod2)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96027 -47294 -218 47483 98216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.064e+05 2.418e+03 44.007 <2e-16 ***
## Issuing.BankBank of America -1.047e+03 8.443e+02 -1.240 0.215
## Issuing.BankBarclays 7.729e+02 1.432e+03 0.540 0.589
## Issuing.BankCabela\022s WFB -6.975e+02 2.169e+03 -0.322 0.748
## Issuing.BankCapital One 6.324e+02 1.087e+03 0.582 0.561
## Issuing.BankChase 3.290e+02 7.474e+02 0.440 0.660
## Issuing.BankCitibank 8.550e+02 9.265e+02 0.923 0.356
## Issuing.BankDiners Club -7.763e+01 6.633e+02 -0.117 0.907
## Issuing.BankDiscover 7.733e+02 6.646e+02 1.164 0.245
## Issuing.BankFirst National 5.773e+02 2.488e+03 0.232 0.817
## Issuing.BankGE Capital -3.098e+03 1.516e+03 -2.044 0.041 *
## Issuing.BankJCB 8.164e+01 6.623e+02 0.123 0.902
## Issuing.BankPNC 7.297e+02 2.190e+03 0.333 0.739
## Issuing.BankU.S. Bancorp -6.281e+02 1.069e+03 -0.587 0.557
## Issuing.BankUSAA -3.771e+02 1.076e+03 -0.350 0.726
## Issuing.BankWells Fargo -2.350e+03 1.438e+03 -1.634 0.102
## Issue.Date -1.941e-01 1.709e-01 -1.136 0.256
## Expiry.Date 5.606e-02 8.231e-02 0.681 0.496
## Billing.Date 1.883e+01 2.151e+01 0.875 0.381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54890 on 99981 degrees of freedom
## Multiple R-squared: 0.0001874, Adjusted R-squared: 7.352e-06
## F-statistic: 1.041 on 18 and 99981 DF, p-value: 0.4083
Like before, I was also curious to see whether one-hot encoding would change anything, but the encoded model is full of NA coefficients, so I will stick with Final_df2 and leave it at that.
library(caret)
encoded_data <- model.matrix(~Issuing.Bank - 1,data = Final_df2)
head(encoded_data)
## Issuing.BankAmerican Express Issuing.BankBank of America Issuing.BankBarclays
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankCabela\022s WFB Issuing.BankCapital One Issuing.BankChase
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 1
## 6 0 0 0
## Issuing.BankCitibank Issuing.BankDiners Club Issuing.BankDiscover
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankFirst National Issuing.BankGE Capital Issuing.BankJCB
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Issuing.BankPNC Issuing.BankU.S. Bancorp Issuing.BankUSAA
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 1 0
## Issuing.BankWells Fargo
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Final_df3 <- cbind(encoded_data,Final_df2)
linearmod3 <- lm(Credit.Limit~.,data = Final_df3)
summary(linearmod3)
##
## Call:
## lm(formula = Credit.Limit ~ ., data = Final_df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96027 -47294 -218 47483 98216
##
## Coefficients: (16 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.040e+05 2.720e+03 38.244 <2e-16 ***
## `Issuing.BankAmerican Express` 2.350e+03 1.438e+03 1.634 0.1022
## `Issuing.BankBank of America` 1.302e+03 1.504e+03 0.866 0.3865
## Issuing.BankBarclays 3.122e+03 1.897e+03 1.646 0.0998 .
## `Issuing.BankCabela\\022s WFB` 1.652e+03 2.500e+03 0.661 0.5088
## `Issuing.BankCapital One` 2.982e+03 1.652e+03 1.805 0.0711 .
## Issuing.BankChase 2.679e+03 1.452e+03 1.845 0.0650 .
## Issuing.BankCitibank 3.205e+03 1.551e+03 2.066 0.0389 *
## `Issuing.BankDiners Club` 2.272e+03 1.410e+03 1.611 0.1071
## Issuing.BankDiscover 3.123e+03 1.411e+03 2.214 0.0269 *
## `Issuing.BankFirst National` 2.927e+03 2.782e+03 1.052 0.2927
## `Issuing.BankGE Capital` -7.485e+02 1.961e+03 -0.382 0.7027
## Issuing.BankJCB 2.431e+03 1.410e+03 1.725 0.0846 .
## Issuing.BankPNC 3.079e+03 2.519e+03 1.222 0.2216
## `Issuing.BankU.S. Bancorp` 1.721e+03 1.641e+03 1.049 0.2941
## Issuing.BankUSAA 1.972e+03 1.645e+03 1.199 0.2306
## `Issuing.BankWells Fargo` NA NA NA NA
## Issuing.BankBank of America NA NA NA NA
## Issuing.BankBarclays NA NA NA NA
## Issuing.BankCabela\022s WFB NA NA NA NA
## Issuing.BankCapital One NA NA NA NA
## Issuing.BankChase NA NA NA NA
## Issuing.BankCitibank NA NA NA NA
## Issuing.BankDiners Club NA NA NA NA
## Issuing.BankDiscover NA NA NA NA
## Issuing.BankFirst National NA NA NA NA
## Issuing.BankGE Capital NA NA NA NA
## Issuing.BankJCB NA NA NA NA
## Issuing.BankPNC NA NA NA NA
## Issuing.BankU.S. Bancorp NA NA NA NA
## Issuing.BankUSAA NA NA NA NA
## Issuing.BankWells Fargo NA NA NA NA
## Issue.Date -1.941e-01 1.709e-01 -1.136 0.2559
## Expiry.Date 5.606e-02 8.231e-02 0.681 0.4959
## Billing.Date 1.883e+01 2.151e+01 0.875 0.3814
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54890 on 99981 degrees of freedom
## Multiple R-squared: 0.0001874, Adjusted R-squared: 7.352e-06
## F-statistic: 1.041 on 18 and 99981 DF, p-value: 0.4083
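The NA coefficients appear because the dummy matrix was cbind()-ed onto a data frame that still contains the original Issuing.Bank column, so every dummy is duplicated. If I wanted a clean one-hot model, a sketch would be to encode with caret's dummyVars() using fullRank = TRUE and drop the original column; Final_df3b and linearmod3b are illustrative names, and the resulting fit would be equivalent to linearmod2.
## Sketch: full-rank dummy encoding with the original factor column removed, so no coefficients are aliased
dv <- dummyVars(~Issuing.Bank,data = Final_df2,fullRank = TRUE)
bank_dummies <- predict(dv,newdata = Final_df2)
Final_df3b <- cbind(as.data.frame(bank_dummies),Final_df2 %>% select(-Issuing.Bank))
linearmod3b <- lm(Credit.Limit~.,data = Final_df3b)
## summary(linearmod3b) would match linearmod2, just with explicit dummy columns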
I will use cross-validation to evaluate my linear regression model, with RMSE as the metric.
First I will make training and testing datasets. (I stratified the split on Issuing.Bank; otherwise some bank levels could end up only in the training set or only in the test set.)
set.seed(123)
split_index <- createDataPartition(Final_df2$Issuing.Bank,p=0.7,list = FALSE)
training_data <- Final_df2[split_index,]
testing_data <- Final_df2[-split_index,]
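A quick sanity check of the stratified split (a sketch): both calls should return character(0), meaning no bank level is missing from either partition.
## Sketch: confirm every Issuing.Bank level appears in both the training and testing data
setdiff(unique(Final_df2$Issuing.Bank),unique(training_data$Issuing.Bank))
setdiff(unique(Final_df2$Issuing.Bank),unique(testing_data$Issuing.Bank))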
## Set a 5 fold cross validation
set.seed(5)
control <- trainControl(method = "cv",number = 5)
metric <- "RMSE" ## RMSE is caret's default metric for regression models
## Fit a Linear Algorithm.. with the cross validation..
set.seed(1234)
fit.lm <- train(Credit.Limit~.,data = training_data,method = "lm",trControl = control)
Here we see the performance for each fold by referring to the resample object of the fitted model.
fit.lm$resample %>%
arrange(Resample)
## RMSE Rsquared MAE Resample
## 1 54943.29 1.287917e-06 47604.01 Fold1
## 2 54861.51 3.556867e-05 47415.03 Fold2
## 3 54668.85 1.524391e-05 47316.85 Fold3
## 4 54994.35 2.168152e-05 47672.28 Fold4
## 5 54852.01 3.714464e-05 47415.29 Fold5
The estimated performance of the model is the average over the five folds, so the average RMSE is around 54,864.
fit.lm$resample %>%
arrange(Resample) %>%
summarise(AVGRMSE = mean(RMSE))
## AVGRMSE
## 1 54864
Meanwhile, the average R-squared of the model under 5-fold cross-validation is nowhere near even 1 percent.
fit.lm$resample %>%
arrange(Resample) %>%
summarise(AvgRsquared = mean(Rsquared))
## AvgRsquared
## 1 2.218533e-05
Here all of the evaluation metrics are displayed (RMSE, R-squared, MAE); the model is very poor at predicting the credit limit.
print(fit.lm)
## Linear Regression
##
## 70006 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 56004, 56005, 56005, 56006, 56004
## Resampling results:
##
## RMSE Rsquared MAE
## 54864 2.218533e-05 47484.69
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The predictions are all around 105,000, close to the mean credit limit.
predictions <- predict(fit.lm,newdata = testing_data)
head(predictions)
## 1 3 4 15 18 19
## 105118.1 104484.1 102681.4 105518.5 105252.5 103866.0
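For comparison with the decision tree fitted below, the hold-out RMSE of the linear model can be computed the same way (a sketch; output not shown here):
## Sketch: hold-out RMSE of the cross-validated linear model on the test set
RMSE(predictions,testing_data$Credit.Limit)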
Here I will use a decision tree again, following the rpart textbook example and the classwork from Week 5 to guide the predictions.
## Use the textbook example..
library(rpart)
set.seed(3456)
fit <- rpart(Credit.Limit~.,method = 'anova',data = training_data)
rpart’s built-in cross-validation defaults to 10 folds, so we are effectively using 10-fold cross-validation here.
There is no tree at all, just a single root node; the cp plot shows no usable split values (only Inf for the root). Since there are no splits, every prediction will be the same number.
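One follow-up worth sketching, under the assumption that the missing splits are due to rpart's default complexity parameter of cp = 0.01: refitting with a smaller cp shows whether any split reduces the error at all (fit_cp is just an illustrative name).
## Sketch: a lower cp lets rpart keep splits whose improvement falls below the default threshold
fit_cp <- rpart(Credit.Limit~.,method = 'anova',data = training_data,
control = rpart.control(cp = 1e-4))
printcp(fit_cp)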
## plot the cv error curve for the tree and visualize the cross-validation results
plotcp(fit)
printcp(fit)
##
## Regression tree:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
##
## Variables actually used in tree construction:
## character(0)
##
## Root node error: 2.1065e+14/70006 = 3009088678
##
## n= 70006
##
## CP nsplit rel error xerror xstd
## 1 0.00013756 0 1 0 0
summary(fit)
## Call:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
## n= 70006
##
## CP nsplit rel error xerror xstd
## 1 0.0001375629 0 1 0 0
##
## Node number 1: 70006 observations
## mean=104847.4, MSE=3.009089e+09
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit)
##
## Regression tree:
## rpart(formula = Credit.Limit ~ ., data = training_data, method = "anova")
##
## Variables actually used in tree construction:
## character(0)
##
## Root node error: 2.1065e+14/70006 = 3009088678
##
## n= 70006
##
## CP nsplit rel error xerror xstd
## 1 0.00013756 0 1 0 0
# create an attractive PostScript plot of the tree
plot(fit, uniform=TRUE,
main="Regression Tree for Credit Limit ")
## Error in plot.rpart(fit, uniform = TRUE, main = "Regression Tree for Credit Limit "): fit is not a tree, just a root
text(fit, use.n=TRUE, all=TRUE, cex=.8)
## Error in text.rpart(fit, use.n = TRUE, all = TRUE, cex = 0.8): fit is not a tree, just a root
post(fit, title = "Regression Tree for Credit Limit ")
## Error in plot.rpart(tree, uniform = TRUE, branch = 0.2, compress = TRUE, : fit is not a tree, just a root
The predictions are all just a single value here, the mean of the training data.
# Create predictions using the decision tree..
predictions1 <- predict(fit,newdata = testing_data)
head(predictions1)
## [1] 104847.4 104847.4 104847.4 104847.4 104847.4 104847.4
RMSE(predictions1,testing_data$Credit.Limit)
## [1] 54977.27
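As a reference point (a sketch), the hold-out RMSE is essentially the standard deviation of Credit.Limit in the test set, which is what always predicting the mean would give:
## Sketch: the standard deviation of the target is roughly the RMSE of a mean-only predictor
sd(testing_data$Credit.Limit)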
Like before, this dataset is just a larger variant of the smaller dataset I obtained from ExcelBi Analytics. I had hoped the models would improve with the extra observations, but that does not appear to be the case: the data suffers from multicollinearity, and the additional observations do not make the RMSE or the predictions any better. The decision tree never grows beyond a single root node, so all of its predictions are the same value. This tells me the quality of the predictors in the dataset is poor, since both the model fit and the predictions are terrible. Next time I would try to collect more informative predictors, such as income, the number of credit cards a customer holds, or whether the customer has a loan. This dataset merely contained predictors that are related to each other, which made it very difficult to explain the variance in the data and to predict the credit limit.