- Create a new RMD file in RStudio and use a code chunk to complete
the following tasks: Load the following libraries: tidyverse, magrittr,
lubridate, and corrplot packages. (You may need to install those
libraries if you have not already done so.) Read in the
hmdaInterestRate.rds file. Report the structure of the dataframe. (No
need to comment on the structure.)
library(tidyverse)
library(magrittr)
library(lubridate)
library(corrplot)
df <- readRDS('hmdaInterestRate.rds')
str(df)
## 'data.frame': 6509 obs. of 14 variables:
## $ activity_year : Factor w/ 2 levels "2019","2018": 2 2 2 2 2 2 2 2 2 2 ...
## $ state_code : Factor w/ 1 level "IL": 1 1 1 1 1 1 1 1 1 1 ...
## $ county_code : Factor w/ 3 levels "Missing","Coles",..: 1 1 1 1 2 1 1 1 1 1 ...
## $ aus_1 : Factor w/ 6 levels "Desktop Underwriter (DU)",..: 2 1 3 4 1 4 1 1 1 1 ...
## $ loan_purpose : Factor w/ 6 levels "Home purchase",..: 2 3 1 1 3 3 1 1 1 1 ...
## $ applicant_ethnicity_1 : Factor w/ 8 levels "Not Hispanic or Latino",..: 2 1 1 1 2 1 1 1 1 1 ...
## $ applicant_sex : Factor w/ 4 levels "Male","Female",..: 1 2 1 1 3 1 1 1 2 1 ...
## $ derived_loan_product_type: Factor w/ 6 levels "Conventional:First Lien",..: 1 1 1 2 4 2 1 1 1 1 ...
## $ interest_rate : num 3.62 4.99 4.12 4.25 3.99 ...
## $ loan_amount : num 185000 105000 255000 255000 95000 205000 235000 105000 275000 75000 ...
## $ loan_term : num 180 360 360 360 240 360 360 360 360 360 ...
## $ property_value : num 235000 215000 265000 255000 105000 265000 335000 265000 285000 85000 ...
## $ income : num 154000 88000 66000 89000 81000 61000 84000 76000 160000 35000 ...
## $ applicant_age : num 50 50 30 50 60 40 30 50 30 30 ...
- Data preparation: Replace the values in the following columns with
the same value divided by 1,000: loan_amount, property_value, and
income. (This will make it easier to see the impact on the interest
rate.) Create a new column, ltp, that is equal to the values in the
loan_amount column divided by the values in the property_value column.
Filter the data to keep observations for which income is less than 300
(i.e., $300,000). Report a summary of all columns. (No need to comment
on the summary of the columns.)
#Replace values with the same value divided by 1,000
df$loan_amount <- df$loan_amount / 1000
df$property_value <- df$property_value / 1000
df$income <- df$income / 1000
#Create a new column, ltp, equal to loan_amount divided by property_value
df$ltp <- df$loan_amount / df$property_value
#Filter to keep observations with income under 300 (i.e., $300,000)
#dplyr::filter is called explicitly so that stats::filter is not used by mistake
df <- dplyr::filter(df, income < 300)
#Report a summary of all columns
summary(df)
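The three preparation steps can also be chained in a single dplyr pipeline. The sketch below is one possible way to write it. It runs on a small made-up data frame (`toy` and its values are hypothetical, not taken from hmdaInterestRate.rds) so the effect of each step is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data in dollars (not from the HMDA file)
toy <- data.frame(
  loan_amount    = c(185000, 105000, 400000),
  property_value = c(235000, 215000, 500000),
  income         = c(154000, 88000, 350000)
)

prepared <- toy %>%
  # Rescale the dollar columns to thousands of dollars
  mutate(across(c(loan_amount, property_value, income), ~ .x / 1000)) %>%
  # Loan-to-property-value ratio
  mutate(ltp = loan_amount / property_value) %>%
  # Keep only rows with income below 300 (i.e., $300,000)
  filter(income < 300)

summary(prepared)
```

In this toy example the third row (income of 350 after rescaling) is dropped by the filter, and the other two rows are kept.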
- Create a correlation plot of the following columns: interest_rate,
ltp, income, applicant_age, property_value, and loan_amount. Below the
plot, identify what variable has the strongest negative correlation with
interest_rate. Comment on what might explain why that correlation is
negative.
cormat <- cor(df[,c('interest_rate', 'ltp', 'income', 'applicant_age', 'property_value', 'loan_amount')])
corrplot::corrplot(cormat)
cormat
## interest_rate ltp income applicant_age
## interest_rate 1.00000000 -0.3187453 0.018838487 0.10613878
## ltp -0.31874529 1.0000000 -0.156275481 -0.32813270
## income 0.01883849 -0.1562755 1.000000000 -0.01168400
## applicant_age 0.10613878 -0.3281327 -0.011683996 1.00000000
## property_value -0.16424866 -0.1799328 0.002400404 0.01788185
## loan_amount -0.33540169 0.3735469 -0.064664128 -0.14932330
## property_value loan_amount
## interest_rate -0.164248656 -0.33540169
## ltp -0.179932776 0.37354686
## income 0.002400404 -0.06466413
## applicant_age 0.017881851 -0.14932330
## property_value 1.000000000 0.78909790
## loan_amount 0.789097900 1.00000000
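#loan_amount has the strongest negative correlation with interest_rate (-0.335), followed closely by ltp (-0.319); income and applicant_age are weakly positively correlated with interest_rate
#a reason the correlation is negative might be that banks see borrowers who qualify for larger loans as a smaller risk, so larger loans tend to carry lower rates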
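The strongest negative correlation can also be picked out programmatically instead of by reading the matrix. This is a minimal sketch that copies the interest_rate row from the output above into a named vector; the names `r` and `strongest_negative` are introduced here for illustration only.

```r
# Correlations with interest_rate, copied from the matrix above
r <- c(ltp            = -0.3187453,
       income         =  0.018838487,
       applicant_age  =  0.10613878,
       property_value = -0.16424866,
       loan_amount    = -0.33540169)

# The most negative value is the strongest negative correlation
strongest_negative <- names(which.min(r))
strongest_negative

# Sorted from most negative to most positive, for context
sort(r)
```

With the full data frame, the same idea would apply to `cor(...)["interest_rate", ]` after dropping the `interest_rate` entry itself.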
- Regress interest_rate on ltp. Display a summary of the regression
results. Interpret the coefficient estimate on ltp.
#regress with scatterplot (ggplot2 is part of the tidyverse and must be attached first)
library(ggplot2)
ggplot(df, aes(x = ltp, y = interest_rate)) +
  geom_point() +
  stat_smooth(method = 'lm')
#regress with linear model 'lm' function
lm1 <- lm(interest_rate ~ ltp, data = df)
summary(lm1)
##
## Call:
## lm(formula = interest_rate ~ ltp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.87122 -0.50450 0.02457 0.48534 2.52783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.29597 0.03015 175.67 <2e-16 ***
## ltp -1.03131 0.03802 -27.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7187 on 6507 degrees of freedom
## Multiple R-squared: 0.1016, Adjusted R-squared: 0.1015
## F-statistic: 735.9 on 1 and 6507 DF, p-value: < 2.2e-16
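#the p-value [Pr(>|t|)] on ltp is less than 0.05, so the coefficient is statistically significant
#the relationship is negative and the fitted equation is interest_rate = 5.30 - 1.03*ltp
#interpretation: a 0.10 increase in ltp is associated with an interest rate that is about 0.10 percentage points lower, on average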
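To make the coefficient interpretation concrete, the fitted line can be evaluated at two loan-to-property-value ratios. The sketch below hard-codes the estimates from the summary above; the helper `predict_rate` and the example ratios 0.8 and 0.9 are chosen here for illustration.

```r
# Coefficient estimates copied from summary(lm1) above
b0 <- 5.29597   # intercept
b1 <- -1.03131  # slope on ltp

# Fitted interest rate for a given loan-to-property-value ratio
predict_rate <- function(ltp) b0 + b1 * ltp

rate_80 <- predict_rate(0.8)
rate_90 <- predict_rate(0.9)

# Change in the fitted rate when ltp rises by 0.10
rate_change <- rate_90 - rate_80
rate_change
```

The change equals 0.10 times the slope, which is why the slope is read as the expected change in interest_rate per one-unit (here, per 100 percentage point) change in ltp.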
- Regress interest_rate on ltp and loan_amount. Display a summary of
the regression results. Comment on the change in the adjusted R-squared,
as well as the change in the coefficient on ltp.
lm1 <- lm(interest_rate ~ ltp, data = df)
lm2 <- lm(interest_rate ~ ltp + loan_amount, data = df)
#export_summs() comes from the jtools package
jtools::export_summs(lm1, lm2)
#in Q4 (lm1) the adjusted R-squared is 0.1015 and the coefficient on ltp is -1.03, with a very significant p-value
#in Q5 (lm2) loan_amount is added as a second regressor, so the adjusted R-squared and the ltp coefficient of lm2 are compared against those lm1 values
#because loan_amount is correlated with both ltp (0.37) and interest_rate (-0.34), the coefficient on ltp is expected to change once loan_amount is controlled for
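For a regression with two predictors, the standardized coefficients and the R-squared can be derived from the pairwise correlations alone. The sketch below applies those textbook formulas to the correlations reported earlier. It assumes the correlation matrix and the regression are computed on the same rows, it works with standardized (not raw) coefficients, and the function name `two_predictor_fit` is made up for this example. It is a plausibility check, not a replacement for the model summary.

```r
# Standardized two-predictor OLS fit from pairwise correlations.
# r_y1, r_y2: correlations of the outcome with predictors 1 and 2
# r_12: correlation between the two predictors
two_predictor_fit <- function(r_y1, r_y2, r_12) {
  denom <- 1 - r_12^2
  beta1 <- (r_y1 - r_y2 * r_12) / denom
  beta2 <- (r_y2 - r_y1 * r_12) / denom
  r2 <- beta1 * r_y1 + beta2 * r_y2
  c(beta1 = beta1, beta2 = beta2, r_squared = r2)
}

# Correlations copied from the matrix above
fit <- two_predictor_fit(r_y1 = -0.31874529,  # interest_rate vs ltp
                         r_y2 = -0.33540169,  # interest_rate vs loan_amount
                         r_12 =  0.37354686)  # ltp vs loan_amount
round(fit, 3)
```

In a one-predictor model the standardized slope on ltp equals its correlation with interest_rate (about -0.319), so comparing that value with `beta1` shows how much of the ltp effect is shared with loan_amount.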
- Regress interest_rate on ltp, loan_amount, and aus_1. Display a
summary of the regression results. Interpret the change in adjusted
R-squared, as well as the coefficients for each independent variable.
(You should have four separate points, one for the change in adjusted
R-squared, and one for the change in the coefficient for each
independent variable.)
lm1 <- lm(interest_rate ~ ltp, data = df)
lm2 <- lm(interest_rate ~ ltp + loan_amount, data = df)
lm3 <- lm(interest_rate ~ ltp + loan_amount + aus_1, data = df)
jtools::export_summs(lm1, lm2, lm3)
#lm1 is the same as in Q4; the four points to report are (1) the change in adjusted R-squared from lm2 to lm3, and the coefficients on (2) ltp, (3) loan_amount, and (4) aus_1
#aus_1 is a factor, so each aus_1 coefficient is the difference in interest_rate relative to the baseline level "Desktop Underwriter (DU)", holding ltp and loan_amount constant
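When a factor such as aus_1 enters a regression, R's default treatment contrasts expand it into one indicator column per non-baseline level, with the first level as the baseline. The sketch below shows this on a made-up factor; `aus` and its levels "A", "B", "C" are hypothetical stand-ins, not the actual aus_1 levels.

```r
# Hypothetical three-level factor (stand-in for aus_1)
aus <- factor(c("A", "B", "C", "A", "B"), levels = c("A", "B", "C"))

# Design matrix that lm() would build for a model with only this factor
X <- model.matrix(~ aus)
colnames(X)

# The first level has no column of its own; it is the baseline
baseline <- levels(aus)[1]
baseline
```

Each indicator coefficient is then read as the difference from the baseline level, and `relevel()` can be used to choose a different baseline if another comparison is more natural.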