The reciprocal of Monthly.Income is a better predictor than the raw data or its log_e or log_10 transformations.
Open.CREDIT.Lines does not improve with transforms.
The log of Revolving.CREDIT.Balance works better than the raw data or the reciprocal.
Longer loans have higher rates, as expected.
Factors without a big visible impact:
FICO.numeric
In the EDA section it was noted that the Monthly.Income data was highly skewed and that a linear model with Interest.Rate showed no predictive power.
The following demonstrates that a log transform makes this variable statistically significant in a model of Interest.Rate (p-value 0.5186 raw versus 0.02984 logged), although the adjusted R^2 remains very small (0.001489).
Monthly.Income.Log = log(loansData.complete$Monthly.Income)
model = lm(Interest.Rate ~ Monthly.Income, data = loansData.complete)
log.model = lm(Interest.Rate ~ Monthly.Income.Log, data = loansData.complete)
lm_compare(model, log.model)
## Residual Standard Error
## model 0.04178
## log.model 0.04175
## Decreased: -3.6e-05
## log.model preferred
## Adjusted R^2
## model -0.0002336
## log.model 0.001489
## Increased: 0.001723
## log.model preferred
## F Statistic
## model 0.4168
## log.model 4.724
## Increased: 4.307
## log.model preferred
## F Statistic p-value
## model 0.5186
## log.model 0.02984
## Decreased: -0.4888
##
## Coeffcient Statistics
## Monthly.Income abs(t stat)
## model 0.6456
## log.model 2.173
## Increased: 1.528
## log.model preferred
## Monthly.Income t stat p-value
## model 0.5186
## log.model 0.02984
## Decreased: -0.4888
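The `lm_compare` helper is not defined in this section. As a rough sketch of where the compared numbers come from, the statistics it reports (residual standard error, adjusted R^2, F statistic and its p-value) can all be read from `summary()` of an `lm` fit. The function `fit_stats` and the simulated variables below are hypothetical and are not the author's code or the loans data.

```r
# Hypothetical helper (not the author's lm_compare): collect the headline
# statistics of a fitted lm object from its summary().
fit_stats <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  c(rse    = s$sigma,
    adj_r2 = s$adj.r.squared,
    f_stat = unname(f["value"]),
    f_p    = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)))
}

# Simulated data for illustration only (NOT the loans data set).
set.seed(1)
income <- rlnorm(500, meanlog = 8.5, sdlog = 0.5)   # right-skewed predictor
rate   <- 0.10 + 0.004 * log(income) + rnorm(500, sd = 0.04)

raw.fit <- lm(rate ~ income)
log.fit <- lm(rate ~ log(income))

# One column per model, one row per statistic.
stats <- cbind(raw = fit_stats(raw.fit), log = fit_stats(log.fit))
print(stats)
```

Putting the two models side by side in one matrix makes the direction of each change easy to read, which appears to be the purpose of the comparison output above.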
Monthly.Income.Recip = 1/loansData.complete$Monthly.Income
recip.model = lm(Interest.Rate ~ Monthly.Income.Recip, data = loansData.complete)
lm_compare(log.model, recip.model)
## Residual Standard Error
## log.model 0.04175
## recip.model 0.04174
## Decreased: -4.177e-06
## recip.model preferred
## Adjusted R^2
## log.model 0.001489
## recip.model 0.001689
## Increased: 0.0001998
## recip.model preferred
## F Statistic
## log.model 4.724
## recip.model 5.224
## Increased: 0.5005
## recip.model preferred
## F Statistic p-value
## log.model 0.02984
## recip.model 0.02236
## Decreased: -0.007486
##
## Coeffcient Statistics
## Monthly.Income.Log abs(t stat)
## log.model 2.173
## recip.model 2.286
## Increased: 0.1122
## recip.model preferred
## Monthly.Income.Log t stat p-value
## log.model 0.02984
## recip.model 0.02236
## Decreased: -0.007486
A reciprocal transformation is a marginal improvement over a log (adjusted R^2 of 0.001689 versus 0.001489):
par(mfrow = c(3, 1))
plot(loansData.complete$Monthly.Income, loansData.complete$Interest.Rate, main = "Monthly.Income",
xlab = "Monthly.Income", ylab = "Interest.Rate")
abline(model, col = "red")
plot(Monthly.Income.Log, loansData.complete$Interest.Rate, main = "log(Monthly.Income)",
xlab = "log(Monthly.Income)", ylab = "Interest.Rate")
abline(log.model, col = "red")
plot(Monthly.Income.Recip, loansData.complete$Interest.Rate, main = "1/Monthly.Income",
xlab = "1/Monthly.Income", ylab = "Interest.Rate")
abline(recip.model, col = "red")
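The summary at the top of this section lists the log_e and log_10 transformations separately. As a general property of linear models (not something tested in the source), changing the base of the logarithm only rescales the slope, so both bases give the same fit. A minimal sketch on simulated data, with all variable names made up for illustration:

```r
# Simulated data for illustration only (NOT the loans data set).
set.seed(2)
income <- rlnorm(300, meanlog = 8.5, sdlog = 0.5)
rate   <- 0.10 + 0.004 * log(income) + rnorm(300, sd = 0.04)

ln.fit    <- lm(rate ~ log(income))     # natural log
log10.fit <- lm(rate ~ log10(income))   # base-10 log

# Same fitted values and the same R^2 ...
same_fitted <- isTRUE(all.equal(unname(fitted(ln.fit)), unname(fitted(log10.fit))))
same_r2     <- isTRUE(all.equal(summary(ln.fit)$r.squared,
                                summary(log10.fit)$r.squared))

# ... only the slope is rescaled, by a factor of log(10).
slope_ratio <- unname(coef(log10.fit)[2] / coef(ln.fit)[2])

print(c(same_fitted = same_fitted, same_r2 = same_r2))
print(slope_ratio)
```

Because of this, only one log base needs to be compared against the raw and reciprocal forms.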
lm_assumptions_summary(recip.model)
##
##
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
##
##
## Call:
## lm(formula = Interest.Rate ~ Monthly.Income.Recip, data = loansData.complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07838 -0.03012 -0.00017 0.02709 0.11844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13400 0.00167 80.42 <2e-16 ***
## Monthly.Income.Recip -14.15507 6.19305 -2.29 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0417 on 2496 degrees of freedom
## Multiple R-squared: 0.00209, Adjusted R-squared: 0.00169
## F-statistic: 5.22 on 1 and 2496 DF, p-value: 0.0224
##
##
## ----------F, F p, Adj R^2------------------------------
## F_statistic F_statistic_p adjusted_R2
## 5.224143 0.022359 0.001689
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
##
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
##
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
##
## ---------Heteroskedasticity-----------------------
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 22.96, df = 1, p-value = 1.656e-06
##
## [1] "Breusch-Pagan test indicates possible Heteroskedasticity"
##
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
##
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
##
## --------Mean Zero?-------------------------------
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8800 -0.7220 -0.0042 0.0001 0.6490 2.8400
scatter.smooth(Monthly.Income.Recip, rstudent(recip.model), col = "blue")
abline(h = 0, col = "red", lty = 2)  # zero reference line for the studentized residuals
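The "studentized Breusch-Pagan test" header in the output above matches what `bptest()` from the lmtest package (listed in the session info) prints, although the source does not show how `lm_assumptions_summary` calls it. To make the statistic less of a black box, here is a base R sketch of the studentized (Koenker) form: regress the squared residuals on the model's predictors and compare n * R^2 with a chi-squared distribution. The function `bp_studentized` and the simulated data are assumptions for illustration.

```r
# Hypothetical helper: Koenker's studentized Breusch-Pagan statistic,
# computed as n * R^2 from regressing the squared residuals on the
# model's own predictors.
bp_studentized <- function(fit) {
  X    <- model.matrix(fit)         # design matrix, intercept included
  e2   <- residuals(fit)^2          # squared residuals
  aux  <- lm.fit(X, e2)             # auxiliary regression
  r2   <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1               # number of non-intercept predictors
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data (NOT the loans data): the noise sd grows with x,
# so the constant-variance assumption is violated by construction.
set.seed(3)
x  <- runif(500, 1, 10)
y  <- 1 + 2 * x + rnorm(500, sd = 0.5 * x)
bp <- bp_studentized(lm(y ~ x))
print(bp)
```

A small p-value, as in the Monthly.Income models above, suggests the residual variance changes with the predictor, so the usual standard errors should be read with caution.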
lm_assumptions_summary(log.model)
##
##
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
##
##
## Call:
## lm(formula = Interest.Rate ~ Monthly.Income.Log, data = loansData.complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07897 -0.03002 0.00006 0.02723 0.11892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10117 0.01362 7.43 1.5e-13 ***
## Monthly.Income.Log 0.00347 0.00160 2.17 0.03 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0417 on 2496 degrees of freedom
## Multiple R-squared: 0.00189, Adjusted R-squared: 0.00149
## F-statistic: 4.72 on 1 and 2496 DF, p-value: 0.0298
##
##
## ----------F, F p, Adj R^2------------------------------
## F_statistic F_statistic_p adjusted_R2
## 4.723623 0.029845 0.001489
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
##
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
##
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
##
## ---------Heteroskedasticity-----------------------
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 44.19, df = 1, p-value = 2.976e-11
##
## [1] "Breusch-Pagan test indicates possible Heteroskedasticity"
##
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
##
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
##
## --------Mean Zero?-------------------------------
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8900 -0.7190 0.0015 0.0000 0.6520 2.8500
scatter.smooth(Monthly.Income.Log, rstudent(log.model), col = "blue")
abline(h = 0, col = "red", lty = 2)  # zero reference line for the studentized residuals
# remove the log model objects so we don't get confused
rm(Monthly.Income.Log, log.model)
# add 1 before taking the log so that any zero balance maps to 0 rather than -Inf
Revolving.CREDIT.Balance.log = log(loansData.complete$Revolving.CREDIT.Balance + 1)
rc.model = lm(Interest.Rate ~ Revolving.CREDIT.Balance, data = loansData.complete)
rc.model.log = lm(Interest.Rate ~ Revolving.CREDIT.Balance.log, data = loansData.complete)
lm_compare(rc.model, rc.model.log)
## Residual Standard Error
## rc.model 0.04171
## rc.model.log 0.04139
## Decreased: -0.000321
## rc.model.log preferred
## Adjusted R^2
## rc.model 0.003335
## rc.model.log 0.01862
## Increased: 0.01528
## rc.model.log preferred
## F Statistic
## rc.model 9.356
## rc.model.log 48.37
## Increased: 39.01
## rc.model.log preferred
## F Statistic p-value
## rc.model 0.002246
## rc.model.log 4.498e-12
## Decreased: -0.002246
##
## Coeffcient Statistics
## Revolving.CREDIT.Balance abs(t stat)
## rc.model 3.059
## rc.model.log 6.955
## Increased: 3.896
## rc.model.log preferred
## Revolving.CREDIT.Balance t stat p-value
## rc.model 0.002246
## rc.model.log 4.498e-12
## Decreased: -0.002246
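The log transform above is applied to the balance plus 1. A short sketch of why the shift matters, using made-up balances rather than the loans data: the log of a zero balance is `-Inf`, which `lm()` cannot use, while the shifted log is finite. Base R's `log1p()` computes the same quantity.

```r
balance <- c(0, 1, 250, 14000)   # made-up balances, including a zero

unshifted <- log(balance)        # first element is -Inf
shifted   <- log(balance + 1)    # the transform used above
stable    <- log1p(balance)      # base R equivalent of log(1 + x)

print(unshifted)
print(all.equal(shifted, stable))
```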
par(mfrow = c(2, 1))
plot(loansData.complete$Revolving.CREDIT.Balance, loansData.complete$Interest.Rate,
main = "Revolving.CREDIT.Balance", xlab = "Revolving.CREDIT.Balance", ylab = "Interest.Rate")
abline(rc.model, col = "red")
plot(Revolving.CREDIT.Balance.log, loansData.complete$Interest.Rate, main = "log(Revolving.CREDIT.Balance)",
xlab = "log(Revolving.CREDIT.Balance)", ylab = "Interest.Rate")
abline(rc.model.log, col = "red")
par(mfrow = c(1, 1))
# test the log assumptions
lm_assumptions_summary(rc.model.log)
##
##
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
##
##
## Call:
## lm(formula = Interest.Rate ~ Revolving.CREDIT.Balance.log, data = loansData.complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08161 -0.03060 -0.00077 0.02686 0.12302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09808 0.00476 20.59 < 2e-16 ***
## Revolving.CREDIT.Balance.log 0.00361 0.00052 6.95 4.5e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0414 on 2496 degrees of freedom
## Multiple R-squared: 0.019, Adjusted R-squared: 0.0186
## F-statistic: 48.4 on 1 and 2496 DF, p-value: 4.5e-12
##
##
## ----------F, F p, Adj R^2------------------------------
## F_statistic F_statistic_p adjusted_R2
## 4.837e+01 4.498e-12 1.862e-02
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
##
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
##
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
##
## ---------Heteroskedasticity-----------------------
## [1] "Breusch-Pagan test for Heteroskedasticity indicates Constant Variance"
##
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
##
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
##
## --------Mean Zero?-------------------------------
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.9700 -0.7400 -0.0185 0.0001 0.6490 3.0000
scatter.smooth(Revolving.CREDIT.Balance.log, rstudent(rc.model.log), col = "blue")
abline(h = 0, col = "red", lty = 2)  # zero reference line for the studentized residuals
# test the raw data assumptions
lm_assumptions_summary(rc.model)
##
##
## NOTE: in addition to this analysis, look at scatter.smooth plots of the residuals vs the main variables individually to see if quadratic transforms may be required
##
##
## Call:
## lm(formula = Interest.Rate ~ Revolving.CREDIT.Balance, data = loansData.complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0812 -0.0302 0.0002 0.0274 0.1193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.29e-01 1.09e-03 118.39 <2e-16 ***
## Revolving.CREDIT.Balance 1.39e-07 4.56e-08 3.06 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0417 on 2496 degrees of freedom
## Multiple R-squared: 0.00373, Adjusted R-squared: 0.00334
## F-statistic: 9.36 on 1 and 2496 DF, p-value: 0.00225
##
##
## ----------F, F p, Adj R^2------------------------------
## F_statistic F_statistic_p adjusted_R2
## 9.355737 0.002246 0.003335
## [1] "F statistic p-value <= 0.05 indicates at least one predictor is predictive"
##
## ---------p-values > 0.05------------------------------
## [1] "Below are listed, in descending order, the individual p-values > 0.05"
##
## ---------Cook's Distance------------------------------
## [1] "Cook's Distances less than 0.5 indicate no outlying Y's or Leveraged X's"
##
## ---------Heteroskedasticity-----------------------
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 10.46, df = 1, p-value = 0.001218
##
## [1] "Breusch-Pagan test indicates possible Heteroskedasticity"
##
## --------Autocorrelation--------------------------
## [1] "Autocorrelation not indicated"
##
## -------Multicollinearity if GT 10---------------
## [1] "Multicollinearity test generated an error"
##
## --------Mean Zero?-------------------------------
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.9500 -0.7240 0.0049 0.0000 0.6570 2.8600
scatter.smooth(loansData.complete$Revolving.CREDIT.Balance, rstudent(rc.model),
col = "blue")
abline(h = 0, col = "red", lty = 2)  # zero reference line for the studentized residuals
rm(rc.model.log, rc.model)
plot(loansData.complete$Interest.Rate ~ loansData.complete$State)
plot(loansData.complete$Interest.Rate ~ loansData.complete$Loan.Length)
plot(loansData.complete$Interest.Rate ~ loansData.complete$Loan.Purpose)
plot(loansData.complete$Interest.Rate ~ loansData.complete$Home.Ownership)
plot(loansData.complete$Interest.Rate ~ loansData.complete$Employment.Length)
plot(loansData.complete$Interest.Rate ~ loansData.complete$Inquiries.in.the.Last.6.Months)
FICO.numeric.alone.model = lm(Interest.Rate ~ FICO.numeric, data = loansData.complete)
# plot residuals against the predictor; residuals always trend with the response itself
scatter.smooth(loansData.complete$FICO.numeric, rstudent(FICO.numeric.alone.model),
col = "blue", main = "FICO.numeric.alone.model")
abline(h = 0, lty = 2, col = "red")
FICO.numeric2 = loansData.complete$FICO.numeric^2
FICO.numeric.quad.model = lm(Interest.Rate ~ FICO.numeric + FICO.numeric2, data = loansData.complete)
# plot residuals against the predictor; residuals always trend with the response itself
scatter.smooth(loansData.complete$FICO.numeric, rstudent(FICO.numeric.quad.model),
col = "blue", main = "FICO.numeric.quad.model")
abline(h = 0, lty = 2, col = "red")
lm_compare(FICO.numeric.alone.model, FICO.numeric.quad.model)
## Residual Standard Error
## FICO.numeric.alone.model 0.02946
## FICO.numeric.quad.model 0.02868
## Decreased: -0.0007839
## FICO.numeric.quad.model preferred
## Adjusted R^2
## FICO.numeric.alone.model 0.5028
## FICO.numeric.quad.model 0.5289
## Increased: 0.02611
## FICO.numeric.quad.model preferred
## F Statistic
## FICO.numeric.alone.model 2526
## FICO.numeric.quad.model 1403
## Decreased: -1123
## FICO.numeric.alone.model preferred
## F Statistic p-value
## FICO.numeric.alone.model 0
## FICO.numeric.quad.model 0
## Unchanged: 0
##
## Coeffcient Statistics
## FICO.numeric abs(t stat)
## FICO.numeric.alone.model 50.26
## FICO.numeric.quad.model 23.3
## Decreased: -26.96
## FICO.numeric.alone.model preferred
## FICO.numeric t stat p-value
## FICO.numeric.alone.model 0
## FICO.numeric.quad.model 8.718e-109
## Increased: 8.718e-109
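In the comparison above the overall F statistic falls when the squared term is added, even though the residual standard error and adjusted R^2 both improve. The overall F tests all slopes jointly, so it is not directly comparable between models with different numbers of predictors. One standard alternative for nested models is the partial F test from `anova()`. The sketch below uses simulated data and made-up variable names, not the loans data.

```r
# Simulated data for illustration only (NOT the loans data set):
# a response with mild curvature in a score-like predictor.
set.seed(4)
score <- runif(400, 640, 830)
z     <- (score - 735) / 50
rate  <- 0.13 - 0.03 * z + 0.01 * z^2 + rnorm(400, sd = 0.02)

linear.fit <- lm(rate ~ score)
quad.fit   <- lm(rate ~ score + I(score^2))  # I() keeps ^ as arithmetic in a formula

# Partial F test: does the squared term explain variance beyond the linear term?
cmp <- anova(linear.fit, quad.fit)
print(cmp)

# Overall F statistics of each model, shown only for contrast.
overall_F <- c(linear = unname(summary(linear.fit)$fstatistic["value"]),
               quad   = unname(summary(quad.fit)$fstatistic["value"]))
print(overall_F)
```

Writing the squared term as `I(score^2)` inside the formula also avoids creating a separate global variable such as `FICO.numeric2`.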
str(.Platform)
## List of 8
## $ OS.type : chr "windows"
## $ file.sep : chr "/"
## $ dynlib.ext: chr ".dll"
## $ GUI : chr "RTerm"
## $ endian : chr "little"
## $ pkgType : chr "win.binary"
## $ path.sep : chr ";"
## $ r_arch : chr "x64"
print(version)
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 3
## minor 0.2
## year 2013
## month 09
## day 25
## svn rev 63987
## language R
## version.string R version 3.0.2 (2013-09-25)
## nickname Frisbee Sailing
print(sessionInfo(), locale = FALSE)
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## attached base packages:
## [1] splines grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] HH_2.3-42 multcomp_1.3-0 survival_2.37-4
## [4] mvtnorm_0.9-9996 latticeExtra_0.6-26 RColorBrewer_1.0-5
## [7] lattice_0.20-24 randomizeBE_0.3-1 lmtest_0.9-32
## [10] zoo_1.7-10 knitr_1.5
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 evaluate_0.5.1 formatR_0.10 leaps_2.9
## [5] MASS_7.3-29 reshape_0.8.4 sandwich_2.3-0 stringr_0.6.2
## [9] tools_3.0.2 vcd_1.3-1
print(Sys.time())
## [1] "2013-11-07 11:21:46 EST"