How many distinct values are there for the homeownership variable? Which value is the most common one?
sort(table(loans$homeownership))
##
##               ANY      OWN     RENT MORTGAGE
##        0        0     1353     3858     4789
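The table lists five levels, but two of them (a blank level and ANY) have zero observations, so only 3 distinct values actually occur: OWN, RENT, and MORTGAGE. MORTGAGE is the most common one, with 4789 loans.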
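The difference between the levels a factor declares and the values that actually occur can be checked directly in code. The following is a minimal sketch on a small made-up factor (`home` is a hypothetical stand-in, not the `loans` data) that has the same kind of unused levels as the table output above.

```r
# Toy factor standing in for loans$homeownership; the values are made up.
home <- factor(
  c("RENT", "MORTGAGE", "OWN", "MORTGAGE", "RENT", "MORTGAGE"),
  levels = c("", "ANY", "MORTGAGE", "OWN", "RENT")
)

counts <- table(home)

n_levels    <- nlevels(home)            # levels declared, used or not: 5
n_observed  <- sum(counts > 0)          # values that actually occur: 3
most_common <- names(which.max(counts)) # "MORTGAGE"

n_levels
n_observed
most_common
```

Calling `droplevels(home)` before `table()` would remove the zero-count columns if they are not wanted in the output.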
How many distinct interest rates are there? Which value is the most common one?
length(unique(loans$interest_rate))
## [1] 58
sort(table(loans$interest_rate))
##
## 30.94     6 30.75 30.65 29.69 30.17 30.79  22.9 23.87 25.81 22.91 24.84 28.72
##     1     3     4     5     9     9    11    13    20    26    28    31    31
## 23.88 26.77 24.85 25.82  26.3 21.85 20.39 19.42 17.47    20 18.45 21.45 18.06
##    37    38    42    47    53    90    93   114   124   137   146   172   176
## 14.07  5.31  6.71 10.41 17.09 16.01 19.03 15.04  6.07  7.96 13.58  5.32  7.34
##   183   188   192   194   195   196   197   199   202   211   225   234   243
##  9.92 11.98 12.61  7.97  10.9  6.08  9.43 16.02 15.05 10.91  6.72 14.08  7.35
##   248   255   264   274   275   277   280   284   304   306   312   318   325
## 12.62 10.42 13.59  9.44 11.99  9.93
##   333   346   347   367   376   390
There are 58 distinct interest rates. 9.93% is the most common one, with 390 loans.
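Apply the table function to the annual_income variable. Do you think the result is helpful or not?
Not really helpful. annual_income takes so many distinct values that table() returns a long list of counts that is hard to read and says little about the overall distribution.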
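For a variable with many distinct values, a numeric summary or a table of binned values is usually easier to read than a raw `table()`. The sketch below illustrates this on simulated incomes. The `income` vector, the bracket cut points, and the bracket labels are all assumptions chosen for illustration, not values taken from the `loans` data.

```r
# Simulated, made-up incomes; not the loans data.
set.seed(1)
income <- round(rlnorm(1000, meanlog = 11, sdlog = 0.6))

# Many distinct values, so table(income) would be a very long list.
length(unique(income))

# A five-number summary plus the mean is far more compact.
summary(income)

# Binning into a few brackets makes table() readable again.
brackets <- cut(
  income,
  breaks = c(0, 25000, 50000, 100000, 250000, Inf),
  labels = c("<25k", "25k-50k", "50k-100k", "100k-250k", "250k+"),
  right  = FALSE  # intervals are closed on the left: [0, 25000), ...
)
table(brackets)
```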
Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.
ggplot(data = loans) +
  geom_histogram(mapping = aes(x = loan_amount), binwidth = 2500, boundary = 1000)
Create a histogram of annual_income. What is the issue with your graph?
ggplot(data = loans) +
  geom_histogram(mapping = aes(x = annual_income))
A few very large incomes stretch the x-axis far to the right, so almost all of the data is squeezed into the first few bars on the left. Restricting the x-axis to incomes up to 500,000 gives a more readable graph:
ggplot(data = loans) +
  geom_histogram(mapping = aes(x = annual_income)) +
  xlim(0, 500000)
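One thing to keep in mind is that `xlim()` drops the observations outside the limits before the bins are computed (ggplot2 issues a warning about removed rows), whereas `coord_cartesian()` only zooms the view. The sketch below shows the two side by side on simulated data. The `toy` data frame and the bin width of 10000 are assumptions for illustration.

```r
library(ggplot2)

# Made-up incomes with one extreme value; not the loans data.
set.seed(1)
toy <- data.frame(annual_income = c(rlnorm(500, meanlog = 11, sdlog = 0.5), 2e6))

p <- ggplot(data = toy) +
  geom_histogram(mapping = aes(x = annual_income), binwidth = 10000)

# Option 1: xlim() removes rows outside the range, then bins what is left.
p_xlim <- p + xlim(0, 500000)

# Option 2: coord_cartesian() bins all rows and only zooms the axis.
p_zoom <- p + coord_cartesian(xlim = c(0, 500000))

p_xlim
p_zoom
```

For a histogram the two plots usually look alike. The difference matters more when a layer computes a statistic, such as a smoothed density or a fitted line, that the dropped rows would have influenced.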
ggplot(loans, aes(x = debt_to_income)) +
  geom_histogram(aes(y = after_stat(density)),
                 boundary = 0, colour = "black", fill = "white") +
  xlim(0, 100)
# binwidth is a layer parameter, not an aesthetic, so it goes outside aes()
ggplot(loans, aes(x = debt_to_income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 boundary = 0, colour = "black", fill = "white") +
  xlim(0, 100)
# same histogram with a smoothed density curve drawn on top
ggplot(loans, aes(x = debt_to_income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 boundary = 0, colour = "black", fill = "white") +
  xlim(0, 100) +
  geom_density(adjust = 30/8, linewidth = 1.2)
Question: Can you explain the distribution of debt_to_income?
debt_to_income is unimodal and right-skewed rather than Normal: most values sit at the low end of the range, and a long tail extends to the right. This means that most people have a low to manageable debt-to-income ratio, but a few people have a very high one, which explains the skewness of the distribution.
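A quick numeric check for right skew is to compare the mean with the median, since a long right tail typically pulls the mean above the median. The sketch below does this on simulated right-skewed data. The `dti` vector is made up for illustration and is not the `loans` data.

```r
# Simulated right-skewed values; not the loans data.
set.seed(1)
dti <- rgamma(1000, shape = 2, rate = 0.1)

mean_dti   <- mean(dti)
median_dti <- median(dti)

# TRUE suggests a tail on the right side.
mean_dti > median_dti
```

On the real data the same comparison would be `mean(loans$debt_to_income, na.rm = TRUE)` against `median(loans$debt_to_income, na.rm = TRUE)`. The `na.rm = TRUE` is a precaution in case the column contains missing values.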
For the loans data, create a scatter plot of interest_rate vs debt_to_income with color mapped to grade. What can you learn from the graph?
ggplot(data = loans) +
  geom_point(mapping = aes(x = interest_rate, y = debt_to_income, color = grade))
From this scatter plot, I can tell that the better the grade (A being the best), the lower the interest_rate. Also, most of the debt_to_income values are below 100, and the extremely high debt_to_income values mostly belong to mid-tier grades with mid-range interest rates.
Finish all Lab Exercises
Create a scatter plot of loan_amount vs interest_rate with a color grouping using the term variable (please use factor(term) to convert it into a categorical variable). Save your plot to your local folder.
ggplot(data = loans) +
  geom_point(mapping = aes(x = loan_amount, y = interest_rate, color = factor(term)))
# ggsave() with only a file name saves the most recently displayed plot
ggsave("my-plot.pdf")
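Because `ggsave()` saves the last plot displayed by default, it can be safer to store the plot in a variable and pass it explicitly, along with a fixed size. The sketch below assumes a tiny made-up data frame `toy` and writes to a temporary directory. The file location and the 7 by 5 inch size are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up rows with the same column names as the loans data.
toy <- data.frame(
  loan_amount   = c(5000, 12000, 20000, 30000),
  interest_rate = c(6.1, 9.9, 13.6, 18.5),
  term          = c(36, 36, 60, 60)
)

p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = loan_amount, y = interest_rate, color = factor(term))) +
  labs(color = "term")

# Pass the plot object explicitly; width and height are in inches by default.
out_file <- file.path(tempdir(), "my-plot.pdf")
ggsave(out_file, plot = p, width = 7, height = 5)

file.exists(out_file)
```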