Data Analysis Project 1 Variable Analysis

Step 3

George Fisher george@georgefisher.com

Observations

  1. the reciprocal of Monthly.Income is a better predictor than either the raw data or its log_e or log_10 transformations

  2. Open.CREDIT.Lines does not improve with transforms

  3. the log of Revolving.CREDIT.Balance works better than the raw data or the reciprocal

  4. Longer loans have higher rates, no surprise

  5. Factors without a big visible impact:

  6. FICO.numeric improves slightly when a quadratic term is added (Adjusted R2 rises from 0.5028 to 0.5289)


Model Monthly.Income

In the EDA section it was noted that the Monthly.Income data was highly skewed and that a linear model with Interest.Rate showed no predictive power.

log(Monthly.Income)

The following demonstrates that a log transform makes this variable statistically significant in a model of Interest.Rate (the p-value drops from 0.52 to 0.03), although the variance explained stays very small.

Monthly.Income.Log = log(loansData.complete$Monthly.Income)

model = lm(Interest.Rate ~ Monthly.Income, data = loansData.complete)
log.model = lm(Interest.Rate ~ Monthly.Income.Log, data = loansData.complete)

lm_compare(model, log.model)
## Residual Standard Error 
##       model       0.04178 
##       log.model       0.04175 
##       Decreased:      -3.6e-05 
##               log.model preferred 
## Adjusted R^2 
##       model       -0.0002336 
##       log.model       0.001489 
##       Increased:      0.001723 
##               log.model preferred 
## F Statistic 
##       model       0.4168 
##       log.model       4.724 
##       Increased:      4.307 
##               log.model preferred 
## F Statistic p-value 
##       model       0.5186 
##       log.model       0.02984 
##       Decreased:      -0.4888 
## 
## Coeffcient Statistics 
##   Monthly.Income  abs(t stat) 
##       model       0.6456 
##       log.model       2.173 
##       Increased:      1.528 
##               log.model preferred 
##   Monthly.Income  t stat p-value 
##       model       0.5186 
##       log.model       0.02984 
##       Decreased:      -0.4888
  1. The F statistic p-value goes down (good), from 0.5186 to 0.02984
  2. The Adjusted R2, however, while improved, remains low at 0.0015
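
The lm_compare helper used above is not shown in this report. As a rough sketch of where its headline numbers come from, the following pulls the same four statistics out of summary() for two fits. The data are synthetic and fit_stats is a hypothetical helper, not the author's function; for a one-predictor model the F statistic is the square of the coefficient's t statistic, which is why the two p-values in the output above are identical.

```r
# Synthetic stand-in for loansData.complete (hypothetical data, not the author's)
set.seed(1)
n <- 500
income <- rlnorm(n, meanlog = 8.5, sdlog = 0.6)   # right-skewed, like income
rate <- 0.10 + 0.004 * log(income) + rnorm(n, sd = 0.04)
d <- data.frame(Interest.Rate = rate, Monthly.Income = income)

# Hypothetical helper: headline statistics from one fitted lm
fit_stats <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic                    # named vector: value, numdf, dendf
  c(rse    = s$sigma,                  # residual standard error
    adj_r2 = s$adj.r.squared,
    f_stat = unname(f["value"]),
    f_p    = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)))
}

raw.fit <- lm(Interest.Rate ~ Monthly.Income, data = d)
log.fit <- lm(Interest.Rate ~ log(Monthly.Income), data = d)

# One column per model, one row per statistic
comparison <- cbind(raw = fit_stats(raw.fit), log = fit_stats(log.fit))
print(signif(comparison, 4))
```

Writing the transform inside the formula, as in log(Monthly.Income), is one way to avoid keeping a separate free-standing vector in step with the data frame.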

1/Monthly.Income

Monthly.Income.Recip = 1/loansData.complete$Monthly.Income

recip.model = lm(Interest.Rate ~ Monthly.Income.Recip, data = loansData.complete)
lm_compare(log.model, recip.model)
## Residual Standard Error 
##       log.model       0.04175 
##       recip.model         0.04174 
##       Decreased:      -4.177e-06 
##               recip.model preferred 
## Adjusted R^2 
##       log.model       0.001489 
##       recip.model         0.001689 
##       Increased:      0.0001998 
##               recip.model preferred 
## F Statistic 
##       log.model       4.724 
##       recip.model         5.224 
##       Increased:      0.5005 
##               recip.model preferred 
## F Statistic p-value 
##       log.model       0.02984 
##       recip.model         0.02236 
##       Decreased:      -0.007486 
## 
## Coeffcient Statistics 
##   Monthly.Income.Log  abs(t stat) 
##       log.model       2.173 
##       recip.model         2.286 
##       Increased:      0.1122 
##               recip.model preferred 
##   Monthly.Income.Log  t stat p-value 
##       log.model       0.02984 
##       recip.model         0.02236 
##       Decreased:      -0.007486

A reciprocal transformation is a slight improvement over a log:

  1. The F statistic p-value goes down (good), from 0.02984 to 0.02236
  2. The Adjusted R2 increases (a bit), from 0.001489 to 0.001689
par(mfrow = c(3, 1))

plot(loansData.complete$Monthly.Income, loansData.complete$Interest.Rate, main = "Monthly.Income", 
    xlab = "Monthly.Income", ylab = "Interest.Rate")
abline(model, col = "red")

plot(Monthly.Income.Log, loansData.complete$Interest.Rate, main = "log(Monthly.Income)", 
    xlab = "log(Monthly.Income)", ylab = "Interest.Rate")
abline(log.model, col = "red")

plot(Monthly.Income.Recip, loansData.complete$Interest.Rate, main = "1/Monthly.Income", 
    xlab = "1/Monthly.Income", ylab = "Interest.Rate")
abline(recip.model, col = "red")

plot of chunk plot Monthly.Income models

look at the reciprocal model's assumption data

lm_assumptions_summary(recip.model)
## 
## 
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
## 
## 
## Call:
## lm(formula = Interest.Rate ~ Monthly.Income.Recip, data = loansData.complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07838 -0.03012 -0.00017  0.02709  0.11844 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.13400    0.00167   80.42   <2e-16 ***
## Monthly.Income.Recip -14.15507    6.19305   -2.29    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0417 on 2496 degrees of freedom
## Multiple R-squared:  0.00209,    Adjusted R-squared:  0.00169 
## F-statistic: 5.22 on 1 and 2496 DF,  p-value: 0.0224
## 
## 
## ----------F, F p, Adj R^2------------------------------
##   F_statistic F_statistic_p   adjusted_R2 
##      5.224143      0.022359      0.001689 
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
## 
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
## 
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
## 
## ---------Heteroskedasticity-----------------------
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 22.96, df = 1, p-value = 1.656e-06
## 
## [1] "Breusch-Pagan test indicates possible Heteroskedasticity"
## 
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
## 
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
## 
## --------Mean Zero?-------------------------------
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8800 -0.7220 -0.0042  0.0001  0.6490  2.8400

plot of chunk unnamed-chunk-2 plot of chunk unnamed-chunk-2

scatter.smooth(Monthly.Income.Recip, rstudent(recip.model), col = "blue")
# zero reference line for the studentized residuals
abline(h = 0, col = "red", lty = 2)

plot of chunk unnamed-chunk-2
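
The "studentized Breusch-Pagan test" block in the summary above looks like the output of bptest from the lmtest package, which is among the attached packages listed at the end of this report. As a minimal sketch of what that statistic measures, the studentized form can be computed in base R as n times the R-squared of a regression of the squared residuals on the predictors. The data below are synthetic and built so that the error spread grows with x; none of the names come from the author's helper.

```r
# Synthetic data whose error variance grows with x (hypothetical example)
set.seed(2)
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-square on (number of predictors) degrees of freedom
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP =", round(bp_stat, 2), " df =", bp_df, " p-value =", signif(bp_p, 4), "\n")
```

A small p-value here says the residual variance changes with the predictor, which is the condition the summary flags as "possible Heteroskedasticity".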

look at the log model's assumption data

lm_assumptions_summary(log.model)
## 
## 
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
## 
## 
## Call:
## lm(formula = Interest.Rate ~ Monthly.Income.Log, data = loansData.complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07897 -0.03002  0.00006  0.02723  0.11892 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.10117    0.01362    7.43  1.5e-13 ***
## Monthly.Income.Log  0.00347    0.00160    2.17     0.03 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0417 on 2496 degrees of freedom
## Multiple R-squared:  0.00189,    Adjusted R-squared:  0.00149 
## F-statistic: 4.72 on 1 and 2496 DF,  p-value: 0.0298
## 
## 
## ----------F, F p, Adj R^2------------------------------
##   F_statistic F_statistic_p   adjusted_R2 
##      4.723623      0.029845      0.001489 
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
## 
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
## 
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
## 
## ---------Heteroskedasticity-----------------------
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 44.19, df = 1, p-value = 2.976e-11
## 
## [1] "Breusch-Pagan test indicates possible Heteroskedasticity"
## 
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
## 
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
## 
## --------Mean Zero?-------------------------------
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8900 -0.7190  0.0015  0.0000  0.6520  2.8500

plot of chunk unnamed-chunk-3 plot of chunk unnamed-chunk-3

scatter.smooth(Monthly.Income.Log, rstudent(log.model), col = "blue")
# zero reference line for the studentized residuals
abline(h = 0, col = "red", lty = 2)

plot of chunk unnamed-chunk-3
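
Both summaries above report Cook's Distances below 0.5. A short sketch of that check, using base R's cooks.distance() on synthetic data with one deliberately planted influential point, is given here. The 0.5 cutoff mirrors the message printed by the summary; the data and variable names are assumptions for illustration only.

```r
# Synthetic data: 50 well-behaved points plus one planted influential point
set.seed(3)
x <- c(rnorm(50), 10)                                   # last x is far from the rest
y <- c(1 + 2 * x[1:50] + rnorm(50, sd = 0.5), -20)      # and its y is far off the trend
fit <- lm(y ~ x)

# Cook's distance for every observation; flag those above the 0.5 cutoff
cd <- cooks.distance(fit)
flagged <- which(cd > 0.5)

print(flagged)
```

With the loan data, an empty result from the same which() call would correspond to the "no outlying Y's or Leveraged X's" message.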

Conclusion: the reciprocal of Monthly.Income is a better predictor than the log

  1. All of the observations above about the two models favor it
  2. The Residual, Leverage and Scale plots look better for the reciprocal model
  3. The Normality assumption tips slightly in favor of the log model, but not enough to change the choice
# remove the log model objects so we don't get confused
rm(Monthly.Income.Log, log.model)

Test log of Revolving.CREDIT.Balance

Revolving.CREDIT.Balance.log = log(loansData.complete$Revolving.CREDIT.Balance + 
    1)
rc.model = lm(Interest.Rate ~ Revolving.CREDIT.Balance, data = loansData.complete)
rc.model.log = lm(Interest.Rate ~ Revolving.CREDIT.Balance.log, data = loansData.complete)
lm_compare(rc.model, rc.model.log)
## Residual Standard Error 
##       rc.model        0.04171 
##       rc.model.log        0.04139 
##       Decreased:      -0.000321 
##               rc.model.log preferred 
## Adjusted R^2 
##       rc.model        0.003335 
##       rc.model.log        0.01862 
##       Increased:      0.01528 
##               rc.model.log preferred 
## F Statistic 
##       rc.model        9.356 
##       rc.model.log        48.37 
##       Increased:      39.01 
##               rc.model.log preferred 
## F Statistic p-value 
##       rc.model        0.002246 
##       rc.model.log        4.498e-12 
##       Decreased:      -0.002246 
## 
## Coeffcient Statistics 
##   Revolving.CREDIT.Balance  abs(t stat) 
##       rc.model        3.059 
##       rc.model.log        6.955 
##       Increased:      3.896 
##               rc.model.log preferred 
##   Revolving.CREDIT.Balance  t stat p-value 
##       rc.model        0.002246 
##       rc.model.log        4.498e-12 
##       Decreased:      -0.002246
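
The "+ 1" inside the log above matters because a revolving balance can be zero, and log(0) is -Inf, which lm() cannot use as a predictor value. Base R's log1p() computes the same log(x + 1) quantity directly. The balances below are made-up values chosen to include a zero.

```r
# Hypothetical balances, including a zero
balance <- c(0, 1, 250, 12000, 270000)

plain_log <- log(balance)        # first element is -Inf
shifted_log <- log1p(balance)    # same as log(balance + 1), finite everywhere

print(plain_log)
print(shifted_log)
print(all.equal(shifted_log, log(balance + 1)))
```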

par(mfrow = c(2, 1))

plot(loansData.complete$Revolving.CREDIT.Balance, loansData.complete$Interest.Rate, 
    main = "Revolving.CREDIT.Balance", xlab = "Revolving.CREDIT.Balance", ylab = "Interest.Rate")
abline(rc.model, col = "red")

plot(Revolving.CREDIT.Balance.log, loansData.complete$Interest.Rate, main = "log(Revolving.CREDIT.Balance)", 
    xlab = "log(Revolving.CREDIT.Balance)", ylab = "Interest.Rate")
abline(rc.model.log, col = "red")

plot of chunk unnamed-chunk-5

par(mfrow = c(1, 1))

# test the log assumptions
lm_assumptions_summary(rc.model.log)
## 
## 
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
## 
## 
## Call:
## lm(formula = Interest.Rate ~ Revolving.CREDIT.Balance.log, data = loansData.complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08161 -0.03060 -0.00077  0.02686  0.12302 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.09808    0.00476   20.59  < 2e-16 ***
## Revolving.CREDIT.Balance.log  0.00361    0.00052    6.95  4.5e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0414 on 2496 degrees of freedom
## Multiple R-squared:  0.019,  Adjusted R-squared:  0.0186 
## F-statistic: 48.4 on 1 and 2496 DF,  p-value: 4.5e-12
## 
## 
## ----------F, F p, Adj R^2------------------------------
##   F_statistic F_statistic_p   adjusted_R2 
##     4.837e+01     4.498e-12     1.862e-02 
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
## 
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
## 
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
## 
## ---------Heteroskedasticity-----------------------
## [1] "Breusch-Pagan test for Heteroskedasticity indicates Constant Variance"
## 
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
## 
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
## 
## --------Mean Zero?-------------------------------
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.9700 -0.7400 -0.0185  0.0001  0.6490  3.0000

plot of chunk unnamed-chunk-5 plot of chunk unnamed-chunk-5

scatter.smooth(Revolving.CREDIT.Balance.log, rstudent(rc.model.log), col = "blue")
# zero reference line for the studentized residuals
abline(h = 0, col = "red", lty = 2)

plot of chunk unnamed-chunk-5


# test the raw data assumptions
lm_assumptions_summary(rc.model)
## 
## 
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
## 
## 
## Call:
## lm(formula = Interest.Rate ~ Revolving.CREDIT.Balance, data = loansData.complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.0812 -0.0302  0.0002  0.0274  0.1193 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.29e-01   1.09e-03  118.39   <2e-16 ***
## Revolving.CREDIT.Balance 1.39e-07   4.56e-08    3.06   0.0022 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0417 on 2496 degrees of freedom
## Multiple R-squared:  0.00373,    Adjusted R-squared:  0.00334 
## F-statistic: 9.36 on 1 and 2496 DF,  p-value: 0.00225
## 
## 
## ----------F, F p, Adj R^2------------------------------
##   F_statistic F_statistic_p   adjusted_R2 
##      9.355737      0.002246      0.003335 
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
## 
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
## 
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
## 
## ---------Heteroskedasticity-----------------------
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 10.46, df = 1, p-value = 0.001218
## 
## [1] "Breusch-Pagan test indicates possible Heteroskedasticity"
## 
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
## 
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
## 
## --------Mean Zero?-------------------------------
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.9500 -0.7240  0.0049  0.0000  0.6570  2.8600

plot of chunk unnamed-chunk-5 plot of chunk unnamed-chunk-5

scatter.smooth(loansData.complete$Revolving.CREDIT.Balance, rstudent(rc.model), 
    col = "blue")
# zero reference line for the studentized residuals
abline(h = 0, col = "red", lty = 2)

plot of chunk unnamed-chunk-5


rm(rc.model.log, rc.model)

Does Interest.Rate differ by state?

plot(loansData.complete$Interest.Rate ~ loansData.complete$State)

plot of chunk state boxplots


Does Interest.Rate differ by loan length?

plot(loansData.complete$Interest.Rate ~ loansData.complete$Loan.Length)

plot of chunk length boxplots
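
The boxplot is a visual check only. One possible way to back the "longer loans have higher rates" observation with a number is to compare group means and run the F test from a one-factor linear model. The sketch below uses synthetic data; the group sizes, the two mean rates and the level labels are assumptions chosen for illustration, not values from loansData.complete.

```r
# Synthetic stand-in: two loan lengths with different mean rates (assumed values)
set.seed(4)
len <- factor(rep(c("36 months", "60 months"), times = c(300, 100)))
rate <- ifelse(len == "60 months", 0.165, 0.125) + rnorm(400, sd = 0.04)
d <- data.frame(Interest.Rate = rate, Loan.Length = len)

# Mean rate per loan length
group_means <- tapply(d$Interest.Rate, d$Loan.Length, mean)
print(round(group_means, 4))

# F test for a difference between the groups
fit <- lm(Interest.Rate ~ Loan.Length, data = d)
f_test <- anova(fit)
p_value <- f_test[["Pr(>F)"]][1]
print(p_value)
```

The same pattern would apply to the other categorical variables plotted in this section, with more than two levels handled by the same F test.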


Does Interest.Rate differ by loan purpose?

plot(loansData.complete$Interest.Rate ~ loansData.complete$Loan.Purpose)

plot of chunk purpose boxplots


Does Interest.Rate differ by home ownership?

plot(loansData.complete$Interest.Rate ~ loansData.complete$Home.Ownership)

plot of chunk ownership boxplots


Does Interest.Rate differ by employment length?

plot(loansData.complete$Interest.Rate ~ loansData.complete$Employment.Length)

plot of chunk employment boxplots


Does Interest.Rate differ by inquiries in the last 6 months?

plot(loansData.complete$Interest.Rate ~ loansData.complete$Inquiries.in.the.Last.6.Months)

plot of chunk inquiries boxplots


Does FICO.numeric get better with a quadratic added?

FICO.numeric.alone.model = lm(Interest.Rate ~ FICO.numeric, data = loansData.complete)
# residuals against the predictor, so curvature left by the linear fit shows up
scatter.smooth(loansData.complete$FICO.numeric, rstudent(FICO.numeric.alone.model), 
    col = "blue", main = "FICO.numeric.alone.model")
abline(h = 0, lty = 2, col = "red")

plot of chunk FICO.numeric test quadratic


FICO.numeric2 = loansData.complete$FICO.numeric^2
FICO.numeric.quad.model = lm(Interest.Rate ~ FICO.numeric + FICO.numeric2, data = loansData.complete)
# residuals against the predictor, to see whether the quadratic removed the curvature
scatter.smooth(loansData.complete$FICO.numeric, rstudent(FICO.numeric.quad.model), 
    col = "blue", main = "FICO.numeric.quad.model")
abline(h = 0, lty = 2, col = "red")

plot of chunk FICO.numeric test quadratic


lm_compare(FICO.numeric.alone.model, FICO.numeric.quad.model)
## Residual Standard Error 
##       FICO.numeric.alone.model        0.02946 
##       FICO.numeric.quad.model         0.02868 
##       Decreased:      -0.0007839 
##               FICO.numeric.quad.model preferred 
## Adjusted R^2 
##       FICO.numeric.alone.model        0.5028 
##       FICO.numeric.quad.model         0.5289 
##       Increased:      0.02611 
##               FICO.numeric.quad.model preferred 
## F Statistic 
##       FICO.numeric.alone.model        2526 
##       FICO.numeric.quad.model         1403 
##       Decreased:      -1123 
##               FICO.numeric.alone.model preferred 
## F Statistic p-value 
##       FICO.numeric.alone.model        0 
##       FICO.numeric.quad.model         0 
##       Unchanged:      0 
## 
## Coeffcient Statistics 
##   FICO.numeric  abs(t stat) 
##       FICO.numeric.alone.model        50.26 
##       FICO.numeric.quad.model         23.3 
##       Decreased:      -26.96 
##               FICO.numeric.alone.model preferred 
##   FICO.numeric  t stat p-value 
##       FICO.numeric.alone.model        0 
##       FICO.numeric.quad.model         8.718e-109 
##       Increased:      8.718e-109
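
Two lines of the comparison above prefer the linear model, and both deserve a caveat. The overall F statistic divides the explained variation by the number of predictors, so it can fall when a second term is added even though the fit improves. The t statistic on FICO.numeric shrinks largely because FICO.numeric and its square are almost perfectly correlated. A more direct check is a nested-model F test with anova(). The sketch below uses synthetic data with an assumed amount of curvature; the coefficients are illustrative, not estimates from the loan data.

```r
# Synthetic stand-in with mild curvature in the score (assumed coefficients)
set.seed(5)
n <- 500
fico <- round(runif(n, 640, 830))
rate <- 0.13 - 0.0008 * (fico - 735) + 4e-6 * (fico - 735)^2 + rnorm(n, sd = 0.02)
d <- data.frame(Interest.Rate = rate, FICO.numeric = fico)

lin.fit <- lm(Interest.Rate ~ FICO.numeric, data = d)
quad.fit <- lm(Interest.Rate ~ FICO.numeric + I(FICO.numeric^2), data = d)

# Nested F test: does the squared term add anything beyond the linear term?
nested <- anova(lin.fit, quad.fit)
nested_p <- nested[["Pr(>F)"]][2]
print(nested_p)

# The raw score and its square are nearly collinear
raw_cor <- cor(d$FICO.numeric, d$FICO.numeric^2)
print(raw_cor)

# Orthogonal polynomials give the same fitted values without the collinearity
orth.fit <- lm(Interest.Rate ~ poly(FICO.numeric, 2), data = d)
```

Using I(FICO.numeric^2) or poly() inside the formula also keeps the squared term tied to the data frame rather than to a separate vector.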

Info about the system running this code

print(str(.Platform))
## List of 8
##  $ OS.type   : chr "windows"
##  $ file.sep  : chr "/"
##  $ dynlib.ext: chr ".dll"
##  $ GUI       : chr "RTerm"
##  $ endian    : chr "little"
##  $ pkgType   : chr "win.binary"
##  $ path.sep  : chr ";"
##  $ r_arch    : chr "x64"
## NULL
print(version)
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          0.2                         
## year           2013                        
## month          09                          
## day            25                          
## svn rev        63987                       
## language       R                           
## version.string R version 3.0.2 (2013-09-25)
## nickname       Frisbee Sailing
print(sessionInfo(), locale = FALSE)
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## attached base packages:
## [1] splines   grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] HH_2.3-42           multcomp_1.3-0      survival_2.37-4    
##  [4] mvtnorm_0.9-9996    latticeExtra_0.6-26 RColorBrewer_1.0-5 
##  [7] lattice_0.20-24     randomizeBE_0.3-1   lmtest_0.9-32      
## [10] zoo_1.7-10          knitr_1.5          
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 evaluate_0.5.1   formatR_0.10     leaps_2.9       
##  [5] MASS_7.3-29      reshape_0.8.4    sandwich_2.3-0   stringr_0.6.2   
##  [9] tools_3.0.2      vcd_1.3-1
print(Sys.time())
## [1] "2013-11-07 11:21:46 EST"