Step 1

###(1 point) Create a new RMD file in RStudio and use a code chunk to complete the following tasks: ### 1. Load the following libraries: tidyverse, magrittr, lubridate, and corrplot packages.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.2.1
## Warning: package 'tibble' was built under R version 4.2.1
## Warning: package 'purrr' was built under R version 4.2.1
## Warning: package 'stringr' was built under R version 4.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.2.1
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.2.1
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.1
## corrplot 0.92 loaded
library(huxtable)
## Warning: package 'huxtable' was built under R version 4.2.1
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey

2. Read in the hmdaInterestRate.rds file.

3. Report the structure of the dataframe. (No need to comment on the structure.)

df1<- readRDS("C:/Users/kiram/Dropbox/Kira/imba/Data Analytics/Module 5/hmdaInterestRate.rds")
str(df1)
## 'data.frame':    6509 obs. of  14 variables:
##  $ activity_year            : Factor w/ 2 levels "2019","2018": 2 2 2 2 2 2 2 2 2 2 ...
##  $ state_code               : Factor w/ 1 level "IL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ county_code              : Factor w/ 3 levels "Missing","Coles",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ aus_1                    : Factor w/ 6 levels "Desktop Underwriter (DU)",..: 2 1 3 4 1 4 1 1 1 1 ...
##  $ loan_purpose             : Factor w/ 6 levels "Home purchase",..: 2 3 1 1 3 3 1 1 1 1 ...
##  $ applicant_ethnicity_1    : Factor w/ 8 levels "Not Hispanic or Latino",..: 2 1 1 1 2 1 1 1 1 1 ...
##  $ applicant_sex            : Factor w/ 4 levels "Male","Female",..: 1 2 1 1 3 1 1 1 2 1 ...
##  $ derived_loan_product_type: Factor w/ 6 levels "Conventional:First Lien",..: 1 1 1 2 4 2 1 1 1 1 ...
##  $ interest_rate            : num  3.62 4.99 4.12 4.25 3.99 ...
##  $ loan_amount              : num  185000 105000 255000 255000 95000 205000 235000 105000 275000 75000 ...
##  $ loan_term                : num  180 360 360 360 240 360 360 360 360 360 ...
##  $ property_value           : num  235000 215000 265000 255000 105000 265000 335000 265000 285000 85000 ...
##  $ income                   : num  154000 88000 66000 89000 81000 61000 84000 76000 160000 35000 ...
##  $ applicant_age            : num  50 50 30 50 60 40 30 50 30 30 ...

Step 2

###(1 point) Data preparation:d ### 1. Replace the values in the following columns with the same value divided by 1,000: loan_amount, property_value, and income. ### 2. Create a new column, ltp, that is equal to the values in the loan_amount column divided by the values in the property_value column. ### 3. Filter the data to keep observations for which income is less than 300 (i.e., $300,000). ### 4. Report a summary of all columns. (No need to comment on the summary of the columns.)

df1$loan_amount<-as.numeric(df1$loan_amount/1000)
df1$property_value<-as.numeric(df1$property_value/1000)
df1$income<-as.numeric(df1$income/1000)
df1$ltp<-(df1$loan_amount/df1$property_value)
df1_low<-df1 %>% filter(income < 300)

Step 3

###(1 point) Create a correlation plot of the following columns: interest_rate, ltp, income, applicant_age, property_value, and loan_amount. ### 1. Below the plot, identify what variable has the strongest negative correlation with interest_rate. Comment on what might explain why that correlation is negative.

cor(df1_low[,c('interest_rate', 'ltp', 'income', 'applicant_age', 'property_value', 'loan_amount')])
##                interest_rate        ltp      income applicant_age
## interest_rate     1.00000000 -0.3188587 -0.04683836   0.109516616
## ltp              -0.31885873  1.0000000 -0.13544451  -0.339076509
## income           -0.04683836 -0.1354445  1.00000000   0.045085985
## applicant_age     0.10951662 -0.3390765  0.04508598   1.000000000
## property_value   -0.16659083 -0.1658255  0.58661218   0.003786898
## loan_amount      -0.33150222  0.3903140  0.46424427  -0.168519166
##                property_value loan_amount
## interest_rate    -0.166590831  -0.3315022
## ltp              -0.165825527   0.3903140
## income            0.586612176   0.4642443
## applicant_age     0.003786898  -0.1685192
## property_value    1.000000000   0.8000781
## loan_amount       0.800078124   1.0000000

###Reason for Negative Correlation with loan_amount

###The likely reason interest rate is negatively correlate the highest is because interest rates are a control mechanism for loans. As interest rates go up, borrowing costs go up and the loan amount per $ of income that people can qualify for goes down. As the interest rates go down, borrowing costs go down, and the amount people can qualify for on their loan goes up. Whereas, with the other categories like age, income, property value, etc, there are other factors that affect these categories more heavily, so interest rate plays a much smaller role in thes categories.

Step 4: (1 point) Regress interest_rate on ltp. Display a summary of the regression results. Interpret the coefficient estimate on ltp.

lm1 <- lm(df1_low$interest_rate ~ ltp, data = df1_low)
summary(lm1)
## 
## Call:
## lm(formula = df1_low$interest_rate ~ ltp, data = df1_low)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8933 -0.5055  0.0285  0.4834  2.5221 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.31852    0.03147  168.98   <2e-16 ***
## ltp         -1.05188    0.03950  -26.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7182 on 6265 degrees of freedom
## Multiple R-squared:  0.1017, Adjusted R-squared:  0.1015 
## F-statistic: 709.1 on 1 and 6265 DF,  p-value: < 2.2e-16

###The coefficient for ltp is -1.051. This means that ltp is negatively correlated to the interest rate and the slope -1.05, leaving an equation of y=-1.05x +5.32.

Step 5: (3 points) Regress interest_rate on ltp and loan_amount. Display a summary of the regression results. Comment on the change in the adjusted R-squared, as well as the change in the coefficient on ltp.

library(jtools)
## Warning: package 'jtools' was built under R version 4.2.1
library(ggstance)
## Warning: package 'ggstance' was built under R version 4.2.1
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
library(jtools)
lm1 <- lm(df1_low$interest_rate ~ ltp, data = df1_low)
lm2 <- lm(df1_low$interest_rate ~ loan_amount, data = df1_low)
export_summs(lm1, lm2)
Model 1Model 2
(Intercept)5.32 ***4.93 ***
(0.03)   (0.02)   
ltp-1.05 ***       
(0.04)          
loan_amount       -0.00 ***
       (0.00)   
N6267       6267       
R20.10    0.11    
*** p < 0.001; ** p < 0.01; * p < 0.05.

###For model 1, the coefficient estimate is 5.32 for the intercept and is -1.05 for ltp.

###For model 2, the coefficient estimate is 4.93 for the intercept and is -0.00 for loan_amount.

###The coefficient in Model 1 for ltp has a more negative slope than in model 2. This means that in Model 1 ltp is affected more greater negative correlation than with Model 2.

###R-squared for both are pretty similar. Model 1 has a value of 0.10 and 0.11 for Model 2. This means that the both explain about the same amount of observations, which is a small perent. This means that something else explains the difference in the other observations.

Step 6: (3 points) Regress interest_rate on ltp, loan_amount, and aus_1. Display a summary of the regression results. Interpret the change in adjusted R-squared, as well as the coefficients for each independent variable. (You should have four separate points, one for the change in adjusted R-squared, and one for the change in the coefficient for each independent variable.)

lm1 <- lm(df1_low$interest_rate ~ ltp, data = df1_low)
lm2 <- lm(df1_low$interest_rate ~ loan_amount, data = df1_low)
lm3 <- lm(df1_low$interest_rate ~ aus_1, data = df1_low)
export_summs(lm1, lm2, lm3)
Model 1Model 2Model 3
(Intercept)5.32 ***4.93 ***4.34 ***
(0.03)   (0.02)   (0.01)   
ltp-1.05 ***              
(0.04)                 
loan_amount       -0.00 ***       
       (0.00)          
aus_1Not applicable              1.04 ***
              (0.02)   
aus_1Loan Propspector (LP) or Loan Product Advisor              0.01    
              (0.02)   
aus_1Technology Open to Approved Lenders (TOTAL) Scorecard              0.22 ***
              (0.03)   
aus_1Guaranteed Underwriting System              0.08    
              (0.05)   
aus_1Other              0.38 ***
              (0.09)   
N6267       6267       6267       
R20.10    0.11    0.23    
*** p < 0.001; ** p < 0.01; * p < 0.05.

###For model 1, the coefficient estimate is 5.32 for the intercept and is -1.05 for ltp.

###For model 2, the coefficient estimate is 4.93 for the intercept and is -0.00 for loan_amount.

###For model 3, the coefficient estimate is 0.22 for the total scorecard for aus_1.

###The coefficient in Model 1 afor ltp has a greater correlation with interest_rate than model 2 and 3 and is negatively correlated. Model 2 for loan_amont has a smaller correlation that Model 1 and 3, but is also negatively correlated. Model 3 is a positive correlation to interest_rate, but is a smaller number. This means that the order of magnitude for correlation means Model 1 > Model 2 > Model 3.

###R-squared for both Model 1 and 2 are pretty similar. Model 1 has a value of 0.10 and 0.11 for Model 2. This means that the both explain about the same amount of observations, which is a small perent. Model 3, however, has a significantly larger R-sequared of 0.23. This means it explains the variations in more of the observations than Model 1 and Model 2, and is better for making predictions.