R Markdown

rm(list=ls())
bank = na.omit(read.csv(choose.files(), header = TRUE))
attach(bank)

1.)

a.)

Create horizontal boxplots of interest rate by purpose. Make at least 2 appropriate comments on the plots. (Use las=1 at the end of your command.)

boxplot(rate ~ purpose, horizontal = TRUE, las=1) 

In the boxplots, the car purpose has the lowest median and a noticeably narrow interquartile range (IQR). This suggests that interest rates for loans with a purpose of car tend to be lower, and less variable, than for the other purposes. Additionally, the boxplot for the purpose of moving contains the only outlier shown in the plots. This indicates that one loan with the purpose of moving has an interest rate considerably higher than is typical for that purpose.

b.)

Create histograms one on top of the other of interest rate and income. Comment on each plot.

par(mfrow=c(2,1))
hist(rate)
hist(inc)

The histogram of rate appears approximately normally distributed, with a center just under 15. The bins from 10 through about 18 each have a frequency of about 60. The histogram of income is strongly right-skewed, with most of the data falling between 0e+00 and 1e+05. The modal bin sits just under 1e+05, peaking at a frequency of around 200 data points.

c.)

Deal with what you found in part (b) appropriately.

par(mfrow = c(1,3))
hist(inc)
loginc = log(inc)
hist(loginc)
sqinc = sqrt(inc)
hist(sqinc)

Of the three histograms above, the log-transformed income (loginc) is the closest to a normal distribution, so loginc is used in place of inc in the regression models that follow.
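The visual comparison above can be backed up with a numeric summary. The sketch below is only an illustration: it uses simulated right-skewed values (`inc_sim`, a hypothetical stand-in, not the bank data) and a small helper `skew()` that is not part of the original analysis. It computes the sample skewness of the raw, square-root, and log versions; values closer to 0 suggest a more symmetric distribution.

```r
# Hypothetical stand-in for the income column: simulated right-skewed data,
# NOT the bank dataset used in this report.
set.seed(1)
inc_sim <- rlnorm(400, meanlog = 11, sdlog = 0.6)

# Sample skewness (mean of cubed z-scores); values near 0 suggest symmetry.
skew <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Compare the raw values with the two candidate transformations.
skews <- c(raw  = skew(inc_sim),
           sqrt = skew(sqrt(inc_sim)),
           log  = skew(log(inc_sim)))
print(round(skews, 2))
```

On the real data the same comparison would be `skew(inc)`, `skew(sqinc)`, and `skew(loginc)`, assuming the attached variables from the code above.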

d.)

Produce a table of correlations of all quantitative variables. Which 2 variables are most highly correlated with the interest rate?

cor(cbind(loan, rate, inc, check, acc, revol, recov), use = "complete.obs")
##              loan       rate       inc       check        acc      revol
## loan   1.00000000 0.35216455 0.2874920 -0.08138747 0.05301899 0.16649508
## rate   0.35216455 1.00000000 0.1258046  0.10554252 0.07893518 0.06455868
## inc    0.28749203 0.12580464 1.0000000  0.12988367 0.18007104 0.29418830
## check -0.08138747 0.10554252 0.1298837  1.00000000 0.10405585 0.02725230
## acc    0.05301899 0.07893518 0.1800710  0.10405585 1.00000000 0.22382055
## revol  0.16649508 0.06455868 0.2941883  0.02725230 0.22382055 1.00000000
## recov  0.36285130 0.19232753 0.1894774  0.03555805 0.02051966 0.09758808
##            recov
## loan  0.36285130
## rate  0.19232753
## inc   0.18947736
## check 0.03555805
## acc   0.02051966
## revol 0.09758808
## recov 1.00000000
The 2 variables most highly correlated with interest rate are loan (r = 0.35216455) and recov (r = 0.19232753).

e.)

Produce a scatterplot matrix. Do the correlations from part (d) correspond to what you see in these plots? Describe how you know from what you see visually with at least 2 examples. Should they correspond? Explain.

pairs(~loan + inc + check + acc + revol + recov + rate)

The correlations from part (d) do correspond to what is shown in these plots. This is evident in the scatterplot of loan against rate, where the points show a weak positive trend, matching the correlation of about 0.35. Additionally, the scatterplot of recov against rate shows a weak positive association, with the points trending up and to the right, matching the correlation of about 0.19. They should correspond, because a correlation coefficient summarizes the strength and direction of the same linear relationship that each scatterplot displays.

f.)

Fit a regression model predicting the interest rate with all of the other quantitative variables. Call this model m1.

m1 = lm(rate ~ loan + loginc + check + acc + revol + recov)
summary(m1)
## 
## Call:
## lm(formula = rate ~ loan + loginc + check + acc + revol + recov)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3611 -2.4167 -0.0143  2.5288  8.0753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.115e+00  3.984e+00   2.037  0.04229 *  
## loan         1.338e-04  2.162e-05   6.188  1.5e-09 ***
## loginc       3.251e-01  3.735e-01   0.871  0.38452    
## check        2.011e-01  7.729e-02   2.601  0.00962 ** 
## acc          2.865e-02  3.353e-02   0.854  0.39339    
## revol       -2.786e-06  6.103e-06  -0.457  0.64822    
## recov        1.104e-04  8.692e-05   1.270  0.20484    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.402 on 405 degrees of freedom
## Multiple R-squared:  0.1496, Adjusted R-squared:  0.137 
## F-statistic: 11.88 on 6 and 405 DF,  p-value: 2.74e-12

g.)

Comment on R^2 and R^2a from m1. Interpret R^2a. Make these clearly separate.

Comment on R^2: R^2 from m1 is small, at only 0.1496. This is not a strong value, and it likely means there are other predictors not in the model that would help in determining rate.

Comment on R^2a: At 0.137, this value is also small, and it indicates that the model is not a strong predictor of rate. It is only slightly below R^2, so the penalty for the number of predictors is minor.

Interpreting R^2a: About 13.7% of the variation in interest rate is explained by the predictors in the model, after adjusting for the number of predictors used.
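As a check on the two values, R^2a can be recovered from R^2 with the standard adjustment formula. Here n = 412 is inferred from the summary output (405 residual degrees of freedom plus 7 estimated coefficients) and p = 6 is the number of predictors:

$$
R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
      = 1 - (1 - 0.1496)\,\frac{411}{405}
      \approx 0.137
$$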

h.)

Use m1 and do a hypothesis test to determine whether the recov variable is a significant predictor of interest rate. Provide hypothesis statements, test statistic, critical value, and a decision. Use alpha = 0.10.

qt(0.05, 405)
## [1] -1.648625
pt(-1.27, 405)*2
## [1] 0.2048138
H0: B6 = 0
Ha: B6 != 0
Test statistic = 1.27
t c.v = +/- 1.648625
Given that the absolute value of the test statistic, 1.27, is less than the critical value, 1.648625, we do not reject the null hypothesis. Equivalently, the p-value of 0.2048 is greater than alpha = 0.10. There is not sufficient evidence to conclude that recov is a significant predictor of interest rate at the alpha level of 0.10.
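The steps of this test can be reproduced from the recov row of the summary(m1) output alone. The following is a minimal sketch; the variable names (`est`, `se`, `t_stat`, and so on) are illustrative and do not appear in the original analysis.

```r
# Values copied from the recov row of summary(m1).
est   <- 1.104e-04   # coefficient estimate
se    <- 8.692e-05   # standard error
df    <- 405         # residual degrees of freedom
alpha <- 0.10

t_stat <- est / se                    # test statistic
t_crit <- qt(1 - alpha / 2, df)       # positive two-sided critical value
p_val  <- 2 * pt(-abs(t_stat), df)    # two-sided p-value

# Reject H0 only when |t| exceeds the critical value.
reject <- abs(t_stat) > t_crit
print(c(t_stat = t_stat, t_crit = t_crit, p_val = p_val))
print(reject)
```

Using `abs()` on the test statistic makes the comparison valid for negative slopes as well.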

i.)

Using m1, is the check variable significant at alpha = 0.05? How do you know?

qt(0.025, 405)
## [1] -1.965839
H0: B3 = 0
Ha: B3 != 0
Test statistic = 2.601
t c.v = +/- 1.965839
Given that the test statistic, 2.601, is greater than the critical value, 1.965839, we reject the null hypothesis. We can conclude that check is a significant predictor of interest rate at an alpha level of 0.05. This agrees with the model summary, where the p-value for check is 0.00962, which is less than 0.05. The two stars to the right of that p-value indicate p < 0.01, so check is also significant at the alpha level of 0.05.

j.)

Using m1, interpret the slope for the acc variable.

Interpret the slope for acc: Given the slope of 2.865e-02, for each one-unit increase in acc, the predicted interest rate increases by 2.865e-02 (about 0.029) rate units on average, with all other variables held constant.

k.)

Starting from m1, try a squared term. Tell me why you chose what you did and whether it turned out significant at alpha = 0.05.

sqacc = acc^2
m2 = lm(rate ~ loan + loginc + check + sqacc + revol + recov)
summary(m2)
## 
## Call:
## lm(formula = rate ~ loan + loginc + check + sqacc + revol + recov)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2482 -2.4094 -0.0155  2.4835  8.1352 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.640e+00  3.985e+00   2.168  0.03073 *  
## loan         1.350e-04  2.159e-05   6.254 1.02e-09 ***
## loginc       2.798e-01  3.712e-01   0.754  0.45150    
## check        1.996e-01  7.698e-02   2.592  0.00988 ** 
## sqacc        2.120e-03  1.291e-03   1.642  0.10146    
## revol       -3.642e-06  6.097e-06  -0.597  0.55060    
## recov        1.120e-04  8.671e-05   1.292  0.19725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.394 on 405 degrees of freedom
## Multiple R-squared:  0.1537, Adjusted R-squared:  0.1412 
## F-statistic: 12.26 on 6 and 405 DF,  p-value: 1.086e-12
qt(0.025, 405)
## [1] -1.965839
I chose to square the acc variable because, in the scatterplots from part (e), the plot of acc against rate shows points that appear to rise steeply at first and then taper off partway through. Squaring acc is intended to make that relationship appear more linear.
H0: B4 = 0
Ha: B4 != 0
Test statistic = 1.642
t c.v = +/- 1.965839
Given that the test statistic, 1.642, is less than the critical value, 1.965839, we do not reject the null hypothesis. Equivalently, the p-value of 0.10146 is greater than alpha = 0.05. There is not sufficient evidence to conclude that sqacc is a significant predictor of interest rate at the alpha level of 0.05.
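The model m2 replaces acc with its square. A common alternative is to keep the linear term and add the squared term alongside it, so that the squared coefficient tests for curvature beyond the linear trend. The sketch below illustrates that form on simulated data only: `acc_sim`, `rate_sim`, and `m_quad` are hypothetical names, and the numbers are not from the bank dataset.

```r
# Simulated stand-in data with a mildly curved relationship;
# NOT the bank dataset used in this report.
set.seed(42)
n        <- 200
acc_sim  <- rpois(n, lambda = 10)
rate_sim <- 10 + 0.3 * acc_sim - 0.01 * acc_sim^2 + rnorm(n, sd = 2)

# Keep the linear term and add the squared term with I(),
# which protects ^ from being read as a formula operator.
m_quad <- lm(rate_sim ~ acc_sim + I(acc_sim^2))

# The I(acc_sim^2) row gives the t test for the curvature term.
print(coef(summary(m_quad)))
```

On the real data the analogous call would be `lm(rate ~ loan + loginc + check + acc + I(acc^2) + revol + recov)`, assuming the same attached variables as m1.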

l.)

Starting from m1, pick a logical variable and find its partial correlation coefficient. What is the interpretation of this coefficient.

6.188/(sqrt(6.188^2 + 405))
## [1] 0.2939041
Interpretation of the coefficient from part (l): The partial correlation between interest rate and the predictor “loan” is 0.2939041, holding the other predictors in m1 constant. This is a weak-to-moderate positive association. Squaring it gives about 0.0864, so loan explains about 8.6% of the variation in interest rate that is not already explained by the other predictors.
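The calculation above uses the identity that links a coefficient's t statistic to its partial correlation, with t = 6.188 for loan and 405 residual degrees of freedom from summary(m1):

$$
r_{\text{partial}} = \frac{t}{\sqrt{t^2 + df}}
                   = \frac{6.188}{\sqrt{6.188^2 + 405}}
                   \approx 0.2939,
\qquad
r_{\text{partial}}^2 = \frac{t^2}{t^2 + df} \approx 0.0864
$$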