1. Create a new RMD file in RStudio and use a code chunk to complete the following tasks: Load the following libraries: tidyverse, magrittr, lubridate, and corrplot packages. (You may need to install those libraries if you have not already done so.) Read in the hmdaInterestRate.rds file. Report the structure of the dataframe. (No need to comment on the structure.)
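#Load the required libraries (run install.packages() first if any are missing)
suppressPackageStartupMessages(invisible(lapply(c('tidyverse', 'magrittr', 'lubridate', 'corrplot'), library, character.only = TRUE)))
df <- readRDS('hmdaInterestRate.rds')
str(df)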
## 'data.frame':    6509 obs. of  14 variables:
##  $ activity_year            : Factor w/ 2 levels "2019","2018": 2 2 2 2 2 2 2 2 2 2 ...
##  $ state_code               : Factor w/ 1 level "IL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ county_code              : Factor w/ 3 levels "Missing","Coles",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ aus_1                    : Factor w/ 6 levels "Desktop Underwriter (DU)",..: 2 1 3 4 1 4 1 1 1 1 ...
##  $ loan_purpose             : Factor w/ 6 levels "Home purchase",..: 2 3 1 1 3 3 1 1 1 1 ...
##  $ applicant_ethnicity_1    : Factor w/ 8 levels "Not Hispanic or Latino",..: 2 1 1 1 2 1 1 1 1 1 ...
##  $ applicant_sex            : Factor w/ 4 levels "Male","Female",..: 1 2 1 1 3 1 1 1 2 1 ...
##  $ derived_loan_product_type: Factor w/ 6 levels "Conventional:First Lien",..: 1 1 1 2 4 2 1 1 1 1 ...
##  $ interest_rate            : num  3.62 4.99 4.12 4.25 3.99 ...
##  $ loan_amount              : num  185000 105000 255000 255000 95000 205000 235000 105000 275000 75000 ...
##  $ loan_term                : num  180 360 360 360 240 360 360 360 360 360 ...
##  $ property_value           : num  235000 215000 265000 255000 105000 265000 335000 265000 285000 85000 ...
##  $ income                   : num  154000 88000 66000 89000 81000 61000 84000 76000 160000 35000 ...
##  $ applicant_age            : num  50 50 30 50 60 40 30 50 30 30 ...
  2. Data preparation: Replace the values in the following columns with the same value divided by 1,000: loan_amount, property_value, and income. (This will make it easier to see the impact on the interest rate.) Create a new column, ltp, that is equal to the values in the loan_amount column divided by the values in the property_value column. Filter the data to keep observations for which income is less than 300 (i.e., $300,000). Report a summary of all columns. (No need to comment on the summary of the columns.)
#Replace values with the same value divided by 1,000
df$loan_amount <- df$loan_amount / 1000
df$property_value <- df$property_value / 1000
df$income <- df$income / 1000
#Create a new column, ltp, equal to loan_amount divided by property_value
df$ltp <- df$loan_amount / df$property_value
#Keep observations with income under 300 (i.e., $300,000); the dplyr:: prefix avoids calling stats::filter by mistake
df <- dplyr::filter(df, income < 300)
#Report a summary of all columns
summary(df)
  3. Create a correlation plot of the following columns: interest_rate, ltp, income, applicant_age, property_value, and loan_amount. Below the plot, identify what variable has the strongest negative correlation with interest_rate. Comment on what might explain why that correlation is negative.
cor(df[,c('interest_rate', 'ltp', 'income', 'applicant_age', 'property_value', 'loan_amount')])
##                interest_rate        ltp       income applicant_age
## interest_rate     1.00000000 -0.3187453  0.018838487    0.10613878
## ltp              -0.31874529  1.0000000 -0.156275481   -0.32813270
## income            0.01883849 -0.1562755  1.000000000   -0.01168400
## applicant_age     0.10613878 -0.3281327 -0.011683996    1.00000000
## property_value   -0.16424866 -0.1799328  0.002400404    0.01788185
## loan_amount      -0.33540169  0.3735469 -0.064664128   -0.14932330
##                property_value loan_amount
## interest_rate    -0.164248656 -0.33540169
## ltp              -0.179932776  0.37354686
## income            0.002400404 -0.06466413
## applicant_age     0.017881851 -0.14932330
## property_value    1.000000000  0.78909790
## loan_amount       0.789097900  1.00000000
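#correlation plot of the same columns
corrplot::corrplot(cor(df[,c('interest_rate', 'ltp', 'income', 'applicant_age', 'property_value', 'loan_amount')]))
#loan_amount has the strongest negative correlation with interest_rate (-0.335), with ltp close behind (-0.319). property_value is also negatively correlated (-0.164), while income (0.019) and applicant_age (0.106) are weakly positively correlated.
#a reason might be that banks see borrowers taking out smaller loans as a bigger risk, so larger loans tend to carry lower rates.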
  4. Regress interest_rate on ltp. Display a summary of the regression results. Interpret the coefficient estimate on ltp.
#regress with scatterplot
library(ggplot2)
ggplot(df, aes(x = ltp, y = interest_rate)) +
  geom_point() +
  stat_smooth(method = 'lm')
#regress with linear model 'lm' function
lm1 <- lm(interest_rate ~ ltp, data = df)

summary(lm1)
## 
## Call:
## lm(formula = interest_rate ~ ltp, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.87122 -0.50450  0.02457  0.48534  2.52783 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.29597    0.03015  175.67   <2e-16 ***
## ltp         -1.03131    0.03802  -27.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7187 on 6507 degrees of freedom
## Multiple R-squared:  0.1016, Adjusted R-squared:  0.1015 
## F-statistic: 735.9 on 1 and 6507 DF,  p-value: < 2.2e-16
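#the p-value [Pr(>|t|)] on ltp is less than 0.05, so the coefficient is statistically significant
#the relationship is negative: the fitted line is interest_rate = 5.30 - 1.03 * ltp, so a 0.10 increase in ltp (e.g., a loan at 90% rather than 80% of the property value) is associated with an interest rate about 0.10 percentage points lower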
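To make the slope concrete, the sketch below plugs two loan-to-property ratios into the fitted line, using the estimates printed by summary(lm1) above. The helper name `predictRate` and the example ratios 0.80 and 0.90 are illustrative assumptions, not part of the assignment.

```r
# Coefficients copied from the summary(lm1) output above
intercept <- 5.29597
slopeLtp  <- -1.03131

# Hypothetical helper: fitted interest rate (in percent) for a given ltp
predictRate <- function(ltp) intercept + slopeLtp * ltp

rate80 <- predictRate(0.80)
rate90 <- predictRate(0.90)

# A 0.10 increase in ltp changes the fitted rate by 0.10 * slope
rateChange <- rate90 - rate80
print(round(c(rate80 = rate80, rate90 = rate90, change = rateChange), 3))
```

Under these estimates the fitted rate at an ltp of 0.90 is about 0.10 percentage points lower than at 0.80, which is the slope scaled to a realistic change in ltp.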
  5. Regress interest_rate on ltp and loan_amount. Display a summary of the regression results. Comment on the change in the adjusted R-squared, as well as the change in the coefficient on ltp.
lm1 <- lm(interest_rate ~ ltp, data = df)
#one model with both regressors, rather than two separate single-variable models
lm2 <- lm(interest_rate ~ ltp + loan_amount, data = df)
summary(lm2)
#in Q4 the adjusted R-squared was 0.1015 and the ltp coefficient was -1.03 (intercept 5.30), with a very significant p-value
#ltp and loan_amount are positively correlated (0.37) and both are negatively correlated with interest_rate, so adding loan_amount should raise the adjusted R-squared and pull the ltp coefficient toward zero
#compare the adjusted R-squared and the ltp estimate in summary(lm2) against those Q4 values
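As a rough cross-check on what a model with both ltp and loan_amount should show, the standardized slopes and R-squared of a two-predictor regression can be recovered from pairwise correlations alone. The sketch below uses the values printed in the correlation matrix above, the ltp slope from summary(lm1), and the 6,509 observations reported by str(df). It assumes those outputs were all computed on the same rows, and the printed values are rounded, so the results are approximations and not a substitute for running summary() on the fitted model.

```r
# Pairwise correlations copied from the correlation matrix above
rRateLtp  <- -0.3187453   # interest_rate vs ltp
rRateLoan <- -0.3354017   # interest_rate vs loan_amount
rLtpLoan  <-  0.3735469   # ltp vs loan_amount

# Standardized slopes for a regression on two predictors
denom    <- 1 - rLtpLoan^2
betaLtp  <- (rRateLtp  - rRateLoan * rLtpLoan) / denom
betaLoan <- (rRateLoan - rRateLtp  * rLtpLoan) / denom

# R-squared and adjusted R-squared implied by the two-predictor model
n         <- 6509
r2Both    <- betaLtp * rRateLtp + betaLoan * rRateLoan
adjR2Both <- 1 - (1 - r2Both) * (n - 1) / (n - 2 - 1)

# Convert the ltp slope back to original units: the simple-regression slope
# (-1.03131) equals rRateLtp * sd(interest_rate) / sd(ltp)
sdRatio     <- -1.03131 / rRateLtp
slopeLtpAdj <- betaLtp * sdRatio

print(round(c(betaLtp = betaLtp, betaLoan = betaLoan, r2Both = r2Both,
              adjR2Both = adjR2Both, slopeLtpAdj = slopeLtpAdj), 3))
```

On these inputs the implied adjusted R-squared is roughly 0.16 (up from 0.1015) and the implied ltp coefficient is roughly -0.73 (down in magnitude from -1.03), which suggests that part of the ltp effect in the single-variable model was picking up the effect of loan size.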
  6. Regress interest_rate on ltp, loan_amount, and aus_1. Display a summary of the regression results. Interpret the change in adjusted R-squared, as well as the coefficients for each independent variable. (You should have four separate points, one for the change in adjusted R-squared, and one for the change in the coefficient for each independent variable.)
#one model with all three regressors
lm3 <- lm(interest_rate ~ ltp + loan_amount + aus_1, data = df)
summary(lm3)
#the four points to address are:
#(1) the change in adjusted R-squared relative to the Q5 model
#(2) the change in the coefficient on ltp
#(3) the change in the coefficient on loan_amount
#(4) the coefficients on aus_1, which are reported per level relative to the reference level "Desktop Underwriter (DU)"
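Because aus_1 is a factor, lm() does not estimate a single coefficient for it. With R's default treatment contrasts it builds one 0/1 indicator column per level other than the first, and each coefficient is read relative to that first (reference) level. The sketch below illustrates this on a toy factor; the short labels "DU", "LP", and "Other" are hypothetical stand-ins, not the actual aus_1 levels.

```r
# Toy factor with hypothetical short labels standing in for aus_1 levels
aus <- factor(c("DU", "LP", "DU", "Other"), levels = c("DU", "LP", "Other"))

# model.matrix() shows the columns lm() would build from the factor:
# an intercept plus one 0/1 indicator per non-reference level
designMatrix <- model.matrix(~ aus)
print(colnames(designMatrix))
print(designMatrix[, "ausLP"])
```

The first level ("DU" here) gets no column of its own, so its effect is absorbed into the intercept and every other level's coefficient is a difference from it.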