library(openintro)
library(ggplot2)
library(dplyr)

1. Data visualization and exploration tasks with gpa data set

Meaning of each variable:

- gpa: Student’s grade point average (0.0 - 4.0 scale)
- studyweek: Hours spent studying per week
- sleepnight: Hours of sleep per night
- out: Number of nights going out per week
- gender: Student’s gender (male/female)

Visualize the relationship between studyweek and gpa. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x=studyweek, y=gpa))+
  geom_point(position="jitter")+
  geom_smooth()+
  labs(title = "relationship between studyweek and gpa",
       x = "studyweek",
       y = "gpa")

Observation: There appears to be only a weak relationship between studyweek and gpa, so hours studied per week tell us little about a student's GPA.
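To put a number on the visual impression, a minimal sketch could compute the Pearson correlation between the two variables. This assumes the `gpa` data frame from openintro is available and that a linear summary is adequate here; the name `r_study_gpa` is made up for illustration.

```r
library(openintro)

# Pearson correlation between weekly study hours and GPA.
# use = "complete.obs" drops rows with missing values, if any.
r_study_gpa <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")
r_study_gpa
```

A value close to 0 would be consistent with the weak relationship seen in the scatter plot.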

Visualize the relationship between out and gpa. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x=out, y=gpa))+
  geom_point(position="jitter")+
  geom_smooth()+
  labs(title = "relationship between out and gpa",
       x = "out",
       y = "gpa")

Observation: There is too little data to establish a clear relationship between out and gpa, although the smoothed curve dips for students who go out 2 nights per week compared with those who go out 1 or 3.

Visualize the relationship between out and sleepnight. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x=out, y=sleepnight))+
  geom_point(position="jitter")+
  geom_smooth()+
  labs(title = "relationship between out and sleepnight",
       x = "out",
       y = "sleepnight")

Observation: There is an upward trend: students who go out more tend to sleep more.

Visualize the relationship between gender and studyweek. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x=gender, y=studyweek))+
  geom_point(position='jitter')+
  labs(title = "relationship between gender and studyweek",
       x = "gender",
       y = "studyweek")

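Observation: There are more female than male students in the data, and the female points cover a wider range of study time per week. Part of that wider spread may simply reflect the larger group size, so the plot alone does not show that females truly vary more in study time.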
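Group size and spread are separate quantities, so a short summary table can keep them apart. The sketch below assumes the `gpa` data frame from openintro; `study_by_gender` and its column names are illustrative only.

```r
library(openintro)
library(dplyr)

# Group size and spread of weekly study hours for each gender.
study_by_gender <- gpa %>%
  group_by(gender) %>%
  summarise(
    n = n(),                                         # number of students
    sd_studyweek = sd(studyweek, na.rm = TRUE),      # standard deviation
    iqr_studyweek = IQR(studyweek, na.rm = TRUE)     # interquartile range
  )
study_by_gender
```

Comparing `sd_studyweek` or `iqr_studyweek` across rows addresses the variance question directly, independent of how many students are in each group.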

Visualize the relationship between gender and out. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x=gender, y=out))+
  geom_point(position='jitter')+
  labs(title = "relationship between gender and out",
       x = "gender",
       y = "out")

Observation: Although there are more female than male points, for both genders the most common value is going out 2-3 nights per week.

Does more sleep equate to a higher GPA?

ggplot(data=gpa, mapping=aes(x=sleepnight, y=gpa))+
  geom_point(position='jitter')+
  labs(title = "relationship between sleepnight and gpa",
       x = "sleepnight",
       y = "gpa")

Observation: Most students sleep 6-8 hours per night, but sleep does not appear to reliably predict GPA.

2. Data visualization tasks with loans_full_schema data set

Create a histogram of a numeric variable that you select and plot a density curve on top of the histogram. Carefully select bin numbers/sizes/boundaries to make the plot informative. What does this graph indicate?

ggplot(loans_full_schema, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Loan Amounts",
       x = "Loan Amount ($)",
       y = "Density")

Observation: The histogram is right-skewed. Most loans are concentrated at the lower amounts, with the peak around $10,000, and a long tail extends toward larger loans.

Create a graph to study the effect of a categorical/discrete variable on the distributions of a numeric variable. What does this graph indicate?

ggplot(loans_full_schema, aes(x = grade, y = interest_rate, fill = grade)) +
  geom_boxplot() +
  labs(title = "Interest Rate by Loan Grade",
       x = "Loan Grade",
       y = "Interest Rate (%)") +
  theme(legend.position = "none")

Observation: The better the loan grade, the lower the interest rate, which suggests that borrowers with better grades are seen as lower risk and are more trusted to pay back the loan. The boxplot does not show how many loans fall in each grade, so it cannot tell us which grades are the most common.
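A boxplot summarizes the distribution within each grade but hides group sizes. A minimal sketch for checking how loans are spread across grades, assuming `loans_full_schema` from openintro (`grade_counts` and `share` are illustrative names):

```r
library(openintro)
library(dplyr)

# Number of loans in each grade and each grade's share of all loans.
grade_counts <- loans_full_schema %>%
  count(grade) %>%
  mutate(share = n / sum(n))
grade_counts
```

Any statement about which grade is most common should be read off this table rather than off the boxplot.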

Create a bin heatmap (2d density plot) to study the relationship between two numeric variables that you select. Summarize the findings from the graph.

ggplot(loans_full_schema, aes(x = annual_income, y = loan_amount)) +
  geom_bin2d(bins = 30) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  scale_x_continuous(limits = c(0, 200000)) +
  scale_y_continuous(limits = c(0, 40000)) +
  labs(title = "Relationship between Annual Income and Loan Amount",
       x = "Annual Income ($)",
       y = "Loan Amount ($)",
       fill = "Count")

Observation: Most borrowers have an annual income of roughly $20k-$100k and borrow $10k-$15k. The densest region is around $50k in annual income, so this group represents the bulk of the lending business.

Use facet_wrap to create an informative plot. Summarize the findings from the graph.

ggplot(loans_full_schema, aes(y = loan_amount)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  facet_wrap(~ loan_purpose, scales = "free_y") +
  labs(title = "Distribution of Loan Amounts by Purpose",
       x = "Count",
       y = "Loan Amount ($)")

Observation: The graph shows what people take out loans for. Debt consolidation is the most common purpose, followed by credit card, which suggests that most borrowers take out a loan to manage existing debt.

Use facet_grid to create an informative plot. Summarize the findings from the graph.

# Filter to top 5 loan purposes for clarity
top_purposes <- loans_full_schema %>%
  count(loan_purpose, sort = TRUE) %>%
  head(5) %>%
  pull(loan_purpose)

# Filter to grades A-D
loans_filtered <- loans_full_schema %>%
  filter(loan_purpose %in% top_purposes,
         grade %in% c("A", "B", "C", "D"))

ggplot(loans_filtered, aes(x = annual_income, y = loan_amount)) +
  geom_point(alpha = 0.3, size = 0.8) +
  facet_grid(grade ~ loan_purpose) +
  scale_x_continuous(limits = c(0, 150000)) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Loan Amount vs Income by Grade A-D and Top 5 Purposes",
       x = "Annual Income ($)",
       y = "Loan Amount ($)")

Observation: Borrowers with better grades are approved for loans across more purpose categories. At lower grades the risk is higher, so there are fewer points at the higher loan amounts.

Present a question of your own interest related to this data set. Answer your question with analysis or visualization.

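# Plot the mean interest rate per group. geom_col() on the raw rows would
# stack one overlapping bar per loan at each position.
ggplot(data = loans_full_schema, aes(x = grade, y = interest_rate, fill = verified_income)) +
  stat_summary(fun = mean, geom = "bar", position = "dodge") +
  labs(title = "Mean Interest Rate by Grade and Income Verification",
       x = "Loan Grade",
       y = "Mean Interest Rate (%)",
       fill = "Income Verified")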

Observation: Within each grade, borrowers with verified income pay essentially the same interest rate as those whose income is not verified, so the interest rate appears to depend almost entirely on grade.
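The same comparison can be made numerically. The sketch below assumes `loans_full_schema` from openintro and simply tabulates the mean rate for every grade and verification status; `rate_summary` and `mean_rate` are illustrative names.

```r
library(openintro)
library(dplyr)

# Mean interest rate and group size for each grade / verification pair.
rate_summary <- loans_full_schema %>%
  group_by(grade, verified_income) %>%
  summarise(
    mean_rate = mean(interest_rate, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
rate_summary
```

If the `mean_rate` values within a grade are close to each other, that supports reading grade as the main driver of the interest rate.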

3. Data visualization and exploration tasks with ames data set

Write an introductory paragraph to the data set which provides the basic information - what the data set is about; the number of samples and features; the scope that the features cover.

The data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010. It has 2930 observations and 82 variables: 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables, plus 2 additional observation identifiers.

Use a plot to analyze how area correlates with price. Summarize your finding from the graph.

ggplot(ames, mapping = aes(x = area, y = price)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(
    title = "Relationship Between Living Area and Sale Price",
    x = "Above-grade living area (sq ft)",
    y = "Sale price (USD)"
  )

Observation: The scatter plot shows an upward trend: larger living area goes with a higher sale price. Around 5,000 sq ft, however, the price drops suddenly, suggesting that factors other than area also determine the sale price.
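One way to see what drives the drop at the right edge is to list the largest homes directly. This is a minimal sketch assuming the `ames` data frame from openintro; the 4,000 sq ft cutoff and the names `area_cutoff` and `large_homes` are arbitrary choices made for illustration.

```r
library(openintro)
library(dplyr)

# Hypothetical cutoff chosen only to isolate the right edge of the plot.
area_cutoff <- 4000

# Homes above the cutoff, largest first.
large_homes <- ames %>%
  filter(area > area_cutoff) %>%
  select(area, price, Bldg.Type) %>%
  arrange(desc(area))
large_homes
```

With only a few rows in this range, the smoothed curve there rests on very little data and should be read with caution.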

Use a plot to analyze how Bldg.Type correlates with price. Explain the meaning of each label for Bldg.Type and summarize your finding from the graph.

- 1Fam: Single-family detached dwelling.
- 2fmCon: Two-family conversion (originally built as a single-family home, later converted).
- Duplex: Duplex.
- Twnhs: Townhouse inside unit.
- TwnhsE: Townhouse end unit.

ggplot(ames, mapping = aes(x=Bldg.Type, y=price, fill=Bldg.Type))+
  geom_boxplot()+
  scale_y_continuous(labels = scales::dollar_format())+
  labs(title="sale price by building type",
       x="building type",
       y="price ($)")+
  theme(legend.position = "none")

Observation: The boxplot shows that 1Fam homes usually sell for more than Duplex and 2fmCon homes, and townhouse end units appear to sell for more than inside units. 1Fam also has the widest price range of all building types.

Use a plot to analyze how Bldg.Type and area together correlate with price. Summarize your finding from the graph.

ggplot(ames, mapping = aes(x=area, y=price))+
  geom_point()+
  geom_smooth()+
  scale_y_continuous(labels = scales::dollar_format())+
  labs(title='building type and area vs price')+
  facet_wrap(~Bldg.Type)

Observation: In every panel price increases with area, but the slope and the price level reached differ between building types. This suggests that building type and area are both associated with price. 1Fam has by far the most observations, but the other building types show a similar pattern, so area is a reasonably reliable indicator of price.
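The difference in slopes between panels can be sketched numerically by fitting a separate straight line per building type. This assumes the `ames` data frame from openintro and that a simple linear fit is a fair approximation within each group; `slopes` is an illustrative name.

```r
library(openintro)

# Fit price ~ area separately for each building type and keep the slope,
# i.e. the estimated change in price (USD) per additional square foot.
slopes <- sapply(
  split(ames, ames$Bldg.Type),
  function(d) coef(lm(price ~ area, data = d))[["area"]]
)
slopes
```

Positive slopes in every group would match the upward trend seen in each panel, while differences between them reflect how strongly price rises with area for each building type.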

Homework #1

Jerry Chan

2026-02-04
