###(1 point) Create a new RMD file in RStudio and use a code chunk to complete the following tasks: ### 1. Load the following libraries: tidyverse, magrittr, lubridate, and corrplot packages.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.2.1
## Warning: package 'tibble' was built under R version 4.2.1
## Warning: package 'purrr' was built under R version 4.2.1
## Warning: package 'stringr' was built under R version 4.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.2.1
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.2.1
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.1
## corrplot 0.92 loaded
library(huxtable)
## Warning: package 'huxtable' was built under R version 4.2.1
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:ggplot2':
##
## theme_grey
df1<- readRDS("C:/Users/kiram/Dropbox/Kira/imba/Data Analytics/Module 5/hmdaInterestRate.rds")
str(df1)
## 'data.frame': 6509 obs. of 14 variables:
## $ activity_year : Factor w/ 2 levels "2019","2018": 2 2 2 2 2 2 2 2 2 2 ...
## $ state_code : Factor w/ 1 level "IL": 1 1 1 1 1 1 1 1 1 1 ...
## $ county_code : Factor w/ 3 levels "Missing","Coles",..: 1 1 1 1 2 1 1 1 1 1 ...
## $ aus_1 : Factor w/ 6 levels "Desktop Underwriter (DU)",..: 2 1 3 4 1 4 1 1 1 1 ...
## $ loan_purpose : Factor w/ 6 levels "Home purchase",..: 2 3 1 1 3 3 1 1 1 1 ...
## $ applicant_ethnicity_1 : Factor w/ 8 levels "Not Hispanic or Latino",..: 2 1 1 1 2 1 1 1 1 1 ...
## $ applicant_sex : Factor w/ 4 levels "Male","Female",..: 1 2 1 1 3 1 1 1 2 1 ...
## $ derived_loan_product_type: Factor w/ 6 levels "Conventional:First Lien",..: 1 1 1 2 4 2 1 1 1 1 ...
## $ interest_rate : num 3.62 4.99 4.12 4.25 3.99 ...
## $ loan_amount : num 185000 105000 255000 255000 95000 205000 235000 105000 275000 75000 ...
## $ loan_term : num 180 360 360 360 240 360 360 360 360 360 ...
## $ property_value : num 235000 215000 265000 255000 105000 265000 335000 265000 285000 85000 ...
## $ income : num 154000 88000 66000 89000 81000 61000 84000 76000 160000 35000 ...
## $ applicant_age : num 50 50 30 50 60 40 30 50 30 30 ...
###(1 point) Data preparation:d ### 1. Replace the values in the following columns with the same value divided by 1,000: loan_amount, property_value, and income. ### 2. Create a new column, ltp, that is equal to the values in the loan_amount column divided by the values in the property_value column. ### 3. Filter the data to keep observations for which income is less than 300 (i.e., $300,000). ### 4. Report a summary of all columns. (No need to comment on the summary of the columns.)
df1$loan_amount<-as.numeric(df1$loan_amount/1000)
df1$property_value<-as.numeric(df1$property_value/1000)
df1$income<-as.numeric(df1$income/1000)
df1$ltp<-(df1$loan_amount/df1$property_value)
df1_low<-df1 %>% filter(income < 300)
###(1 point) Create a correlation plot of the following columns: interest_rate, ltp, income, applicant_age, property_value, and loan_amount. ### 1. Below the plot, identify what variable has the strongest negative correlation with interest_rate. Comment on what might explain why that correlation is negative.
cor(df1_low[,c('interest_rate', 'ltp', 'income', 'applicant_age', 'property_value', 'loan_amount')])
## interest_rate ltp income applicant_age
## interest_rate 1.00000000 -0.3188587 -0.04683836 0.109516616
## ltp -0.31885873 1.0000000 -0.13544451 -0.339076509
## income -0.04683836 -0.1354445 1.00000000 0.045085985
## applicant_age 0.10951662 -0.3390765 0.04508598 1.000000000
## property_value -0.16659083 -0.1658255 0.58661218 0.003786898
## loan_amount -0.33150222 0.3903140 0.46424427 -0.168519166
## property_value loan_amount
## interest_rate -0.166590831 -0.3315022
## ltp -0.165825527 0.3903140
## income 0.586612176 0.4642443
## applicant_age 0.003786898 -0.1685192
## property_value 1.000000000 0.8000781
## loan_amount 0.800078124 1.0000000
###Reason for Negative Correlation with loan_amount
###The likely reason interest rate is negatively correlate the highest is because interest rates are a control mechanism for loans. As interest rates go up, borrowing costs go up and the loan amount per $ of income that people can qualify for goes down. As the interest rates go down, borrowing costs go down, and the amount people can qualify for on their loan goes up. Whereas, with the other categories like age, income, property value, etc, there are other factors that affect these categories more heavily, so interest rate plays a much smaller role in thes categories.
lm1 <- lm(df1_low$interest_rate ~ ltp, data = df1_low)
summary(lm1)
##
## Call:
## lm(formula = df1_low$interest_rate ~ ltp, data = df1_low)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8933 -0.5055 0.0285 0.4834 2.5221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.31852 0.03147 168.98 <2e-16 ***
## ltp -1.05188 0.03950 -26.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7182 on 6265 degrees of freedom
## Multiple R-squared: 0.1017, Adjusted R-squared: 0.1015
## F-statistic: 709.1 on 1 and 6265 DF, p-value: < 2.2e-16
###The coefficient for ltp is -1.051. This means that ltp is negatively correlated to the interest rate and the slope -1.05, leaving an equation of y=-1.05x +5.32.
library(jtools)
## Warning: package 'jtools' was built under R version 4.2.1
library(ggstance)
## Warning: package 'ggstance' was built under R version 4.2.1
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
library(jtools)
lm1 <- lm(df1_low$interest_rate ~ ltp, data = df1_low)
lm2 <- lm(df1_low$interest_rate ~ loan_amount, data = df1_low)
export_summs(lm1, lm2)
| Model 1 | Model 2 | |
|---|---|---|
| (Intercept) | 5.32 *** | 4.93 *** |
| (0.03) | (0.02) | |
| ltp | -1.05 *** | |
| (0.04) | ||
| loan_amount | -0.00 *** | |
| (0.00) | ||
| N | 6267 | 6267 |
| R2 | 0.10 | 0.11 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
###For model 1, the coefficient estimate is 5.32 for the intercept and is -1.05 for ltp.
###For model 2, the coefficient estimate is 4.93 for the intercept and is -0.00 for loan_amount.
###The coefficient in Model 1 for ltp has a more negative slope than in model 2. This means that in Model 1 ltp is affected more greater negative correlation than with Model 2.
###R-squared for both are pretty similar. Model 1 has a value of 0.10 and 0.11 for Model 2. This means that the both explain about the same amount of observations, which is a small perent. This means that something else explains the difference in the other observations.
lm1 <- lm(df1_low$interest_rate ~ ltp, data = df1_low)
lm2 <- lm(df1_low$interest_rate ~ loan_amount, data = df1_low)
lm3 <- lm(df1_low$interest_rate ~ aus_1, data = df1_low)
export_summs(lm1, lm2, lm3)
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| (Intercept) | 5.32 *** | 4.93 *** | 4.34 *** |
| (0.03) | (0.02) | (0.01) | |
| ltp | -1.05 *** | ||
| (0.04) | |||
| loan_amount | -0.00 *** | ||
| (0.00) | |||
| aus_1Not applicable | 1.04 *** | ||
| (0.02) | |||
| aus_1Loan Propspector (LP) or Loan Product Advisor | 0.01 | ||
| (0.02) | |||
| aus_1Technology Open to Approved Lenders (TOTAL) Scorecard | 0.22 *** | ||
| (0.03) | |||
| aus_1Guaranteed Underwriting System | 0.08 | ||
| (0.05) | |||
| aus_1Other | 0.38 *** | ||
| (0.09) | |||
| N | 6267 | 6267 | 6267 |
| R2 | 0.10 | 0.11 | 0.23 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||
###For model 1, the coefficient estimate is 5.32 for the intercept and is -1.05 for ltp.
###For model 2, the coefficient estimate is 4.93 for the intercept and is -0.00 for loan_amount.
###For model 3, the coefficient estimate is 0.22 for the total scorecard for aus_1.
###The coefficient in Model 1 afor ltp has a greater correlation with interest_rate than model 2 and 3 and is negatively correlated. Model 2 for loan_amont has a smaller correlation that Model 1 and 3, but is also negatively correlated. Model 3 is a positive correlation to interest_rate, but is a smaller number. This means that the order of magnitude for correlation means Model 1 > Model 2 > Model 3.
###R-squared for both Model 1 and 2 are pretty similar. Model 1 has a value of 0.10 and 0.11 for Model 2. This means that the both explain about the same amount of observations, which is a small perent. Model 3, however, has a significantly larger R-sequared of 0.23. This means it explains the variations in more of the observations than Model 1 and Model 2, and is better for making predictions.