DV Lab Homework 1

1. Data visualization and exploration tasks with gpa data set

library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

library(ggplot2)
ggplot(gpa, aes(x = studyweek, y = gpa)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, color = "red") + 
  labs(
    title = "2. Relationship Between Weekly Study Hours and GPA",
    x = "Study Hours per Week",
    y = "GPA"
  )

## `geom_smooth()` using formula = 'y ~ x'

The plot shows a weak positive relationship between weekly study hours and GPA: students who study more tend to have a higher GPA, but only slightly. There is also an outlier, a student who studies only about 15 hours a week but still has the highest GPA.
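To put a number on how weak this trend is, the correlation and the slope of the fitted line can be computed directly. This is a minimal sketch, assuming the `gpa` data frame from `openintro` as loaded above; the names `r_study`, `fit_study`, and `slope_study` are only illustrative.

```r
library(openintro)

# Pearson correlation between weekly study hours and GPA
r_study <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")

# Slope of the least-squares line, the same kind of fit that
# geom_smooth(method = "lm") draws on the scatterplot
fit_study <- lm(gpa ~ studyweek, data = gpa)
slope_study <- coef(fit_study)[["studyweek"]]

round(c(correlation = r_study, slope_per_hour = slope_study), 3)
```

A correlation close to 0 and a slope close to 0 would both agree with the shallow red line in the plot.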

ggplot(gpa, aes(x = out, y = gpa)) +
  geom_point() + 
  geom_smooth(method = 'lm', se = FALSE) + 
  labs(title = "3. Relationship Between Going Out and GPA",
       x = "Nights Out per Week",
       y = "GPA")

## `geom_smooth()` using formula = 'y ~ x'

Based on the scatterplot, students with a GPA of 4.0 tend to go out about 1 to 3 nights per week.

ggplot(gpa, aes(x = out, y = sleepnight)) +
    geom_point() + geom_smooth(method = 'lm', se = FALSE) +
    labs(title = "4. Relationship between Going Out and Sleep",
         x = "Nights Out per Week", y = "Hours of Sleep per Night")

## `geom_smooth()` using formula = 'y ~ x'

The points are scattered with no clear pattern, so it is hard to identify any relationship between going out and sleep.

ggplot(gpa, aes(x = gender, y = studyweek, fill = gender)) +
     geom_boxplot() +
     labs(title = "5. Study Hours by Gender",
          x = "Gender", y = "Hours Studied per Week")

Female students tend to spend more time studying than male students, and their median is higher as well.

ggplot(gpa, aes(x = gender, y = out, fill = gender)) +
    geom_boxplot() +
    labs(title = "6. Nights Out by Gender",
         x = "Gender", y = "Nights Out per Week")

This time, the plot shows that male students tend to go out more than female students.
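The two gender comparisons above are read off the boxplot medians, so they can be cross-checked with a small table. This is a sketch only, assuming the `gpa` data frame from `openintro`; `median_by_gender` is an illustrative name.

```r
library(openintro)

# Median weekly study time (studyweek) and median going-out value (out)
# for each gender, matching the thick center lines of the two boxplots
median_by_gender <- aggregate(cbind(studyweek, out) ~ gender,
                              data = gpa, FUN = median)
median_by_gender
```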

ggplot(gpa, aes(x = sleepnight, y = gpa)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "7. Does Sleep Affect GPA?",
         x = "Hours Sleep per Night", y = "GPA")

## `geom_smooth()` using formula = 'y ~ x'

The Trend Line: The blue line is nearly horizontal. It slopes up only very slightly, so the effect is almost negligible. This indicates that, on average, getting more sleep (from 5 to 9 hours) does not strongly correspond to a higher GPA in this specific group of students.
The Scatter: Look at the vertical columns of dots at 6, 7, and 8 hours. At 7 hours of sleep, for example, GPAs range from as low as ~3.1 to as high as 4.0 (and one outlier even higher). This wide spread means that sleep duration is not a good predictor of GPA for these students.
The Outlier: Once again, you can see that single student with the >4.6 GPA appearing between 6 and 7 hours of sleep. This outlier pulls the line up slightly, but even with it included, the relationship remains very weak.
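One way to check how much the outlier pulls the line is to fit the regression twice, once with all students and once without the student whose GPA is above 4.6. This is a rough sketch, assuming the `gpa` data frame from `openintro`; the 4.6 cutoff comes from the description above, and the object names are illustrative.

```r
library(openintro)

# Fit with every student included
fit_all <- lm(gpa ~ sleepnight, data = gpa)

# Fit again after dropping any student with a GPA above 4.6
gpa_trimmed <- gpa[gpa$gpa <= 4.6, ]
fit_trimmed <- lm(gpa ~ sleepnight, data = gpa_trimmed)

# Compare the two slopes (GPA change per extra hour of sleep)
slopes_sleep <- c(
  with_outlier    = coef(fit_all)[["sleepnight"]],
  without_outlier = coef(fit_trimmed)[["sleepnight"]]
)
round(slopes_sleep, 3)
```

If both slopes stay close to zero, the conclusion that sleep is a weak predictor of GPA does not depend on that one student.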

2. Data visualization tasks with loans_full_schema data set

ggplot(loans_full_schema, aes(x = interest_rate)) +
  geom_histogram(aes(y = after_stat(density)), 
                 binwidth = 1, 
                 fill = "skyblue", 
                 color = "white") +
  geom_density(color = "darkblue", linewidth = 1) +
  labs(title = "1. Distribution of Interest Rates",
       x = "Interest Rate (%)",
       y = "Density")

The graph reveals that the interest rate distribution is right-skewed. Most loans are concentrated between about 5% and 14%, so the most common interest rates fall in this lower range. As rates increase beyond 15%, the frequency of loans drops off significantly, though there is a long tail extending toward 30%.
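The skew can also be checked numerically. This is a minimal sketch, assuming `loans_full_schema` from `openintro`; it uses the rule of thumb that the mean usually sits above the median in a right-skewed distribution, and `rate_summary` is an illustrative name.

```r
library(openintro)

rate <- loans_full_schema$interest_rate

# Mean, median, and quartiles of the interest rate
rate_summary <- c(
  mean   = mean(rate, na.rm = TRUE),
  median = median(rate, na.rm = TRUE),
  q1     = unname(quantile(rate, 0.25, na.rm = TRUE)),
  q3     = unname(quantile(rate, 0.75, na.rm = TRUE))
)
round(rate_summary, 2)
```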

ggplot(loans_full_schema, aes(x = grade, y = interest_rate, fill = grade)) +
  geom_boxplot() +
  labs(title = "2. Interest Rates by Loan Grade (Categorical vs. Numeric Variable)",
       x = "Loan Grade",
       y = "Interest Rate (%)")

This boxplot demonstrates a clear, strong positive relationship between loan grade and interest rate. Grade “A” loans have the lowest median interest rate and the smallest variance. As the grade worsens from A to G, the median interest rate increases substantially, and the spread generally becomes wider, indicating that riskier loans (lower grades) come with higher and more variable interest rates.
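The medians and spreads described above can be tabulated per grade. This is a sketch under the assumption that `grade` and `interest_rate` are used as in the plot; `rate_by_grade` is an illustrative name, and the IQR is used here as one possible measure of spread.

```r
library(openintro)

rate  <- loans_full_schema$interest_rate
grade <- loans_full_schema$grade

# Median and interquartile range of the interest rate within each grade
med_rate <- tapply(rate, grade, median)
iqr_rate <- tapply(rate, grade, IQR)

rate_by_grade <- data.frame(
  grade  = names(med_rate),
  median = as.numeric(med_rate),
  iqr    = as.numeric(iqr_rate)
)
rate_by_grade
```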

ggplot(loans_full_schema, aes(x = annual_income, y = loan_amount)) +
  geom_bin2d(bins = 50) +
  scale_x_log10(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_viridis_c() +
  labs(title = "3. Heatmap of Annual Income vs. Loan Amount (two numeric variables)",
       x = "Annual Income (Log Scale)",
       y = "Loan Amount")

## Warning in scale_x_log10(labels = scales::dollar): log-10 transformation
## introduced infinite values.

## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_bin2d()`).

The heatmap reveals a central cluster where the density of data is highest. Most borrowers in this dataset have an annual income between $40,000 and $100,000 and take out loans between $5,000 and $20,000. There is a general positive correlation: as income increases (moving right), the ceiling for loan amounts tends to rise (the heatmap expands upward), suggesting wealthier individuals qualify for or request larger loans.
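The two warnings above appear because a log scale cannot place incomes of zero (or missing incomes) on the axis, so those rows are dropped silently. A possible way to make this explicit is to count and filter them before plotting. This is only a sketch, assuming `loans_full_schema` from `openintro`; `n_dropped`, `loans_pos`, and `p_income` are illustrative names.

```r
library(openintro)
library(ggplot2)

income <- loans_full_schema$annual_income

# Rows that log10() cannot handle: zero, negative, or missing income
n_dropped <- sum(is.na(income) | income <= 0)

# Keep only rows with a positive income so the log scale is well defined
loans_pos <- loans_full_schema[!is.na(income) & income > 0, ]

# Same heatmap as above, built from the filtered data
p_income <- ggplot(loans_pos, aes(x = annual_income, y = loan_amount)) +
  geom_bin2d(bins = 50) +
  scale_x_log10(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_viridis_c()

n_dropped
```

Filtering first keeps the plot the same but makes the number of excluded borrowers visible in the analysis rather than only in a warning.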

ggplot(loans_full_schema, aes(x = loan_amount)) +
  geom_histogram(binwidth = 2500, fill = "coral", color = "white") +
  facet_wrap(~ homeownership) +
  labs(title = "4. Loan Amount Distribution by Homeownership",
       x = "Loan Amount",
       y = "Count")

Using facet_wrap allows us to see that the volume of loans is highest among those with “MORTGAGE” and “RENT” statuses.

Mortgage holders tend to have a flatter distribution with a higher frequency of large loans (visible tails extending to $40k).

Renters show a distribution that is more heavily skewed toward smaller loan amounts (peaking around $10k).

The “OWN” category has significantly fewer observations overall.
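The group sizes and typical loan amounts behind these panels can be listed directly. This is a small sketch, assuming `loans_full_schema` from `openintro`; `home_counts` and `median_amount_by_home` are illustrative names.

```r
library(openintro)

# Number of loans in each homeownership category
home_counts <- table(loans_full_schema$homeownership)

# Median loan amount in each category
median_amount_by_home <- tapply(loans_full_schema$loan_amount,
                                loans_full_schema$homeownership,
                                median)

home_counts
median_amount_by_home
```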

ggplot(loans_full_schema, aes(x = interest_rate)) +
  geom_density(fill = "coral", alpha = 0.5) +
  facet_grid(term ~ application_type) +
  labs(title = "5. Interest Rate Density by Term and Application Type",
       x = "Interest Rate (%)",
       y = "Density")

Term Effect (Rows): The 36-month loans (top row) generally have peaks at lower interest rates compared to 60-month loans (bottom row). The 60-month loans have a flatter distribution spread across higher rates, indicating long-term loans are often associated with higher interest rates.

Application Type Effect (Columns): Individual and Joint applications show relatively similar shapes within their respective terms, though Individual applications (being the vast majority of the data) have smoother density curves.
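Because density curves hide how many loans are in each panel, it may help to count the observations per cell of the facet grid. This is a sketch only, assuming `loans_full_schema` from `openintro`; `cell_counts` is an illustrative name.

```r
library(openintro)

# Number of loans for each combination of term and application type,
# one cell per panel of the facet grid
cell_counts <- with(loans_full_schema,
                    table(term, application_type, useNA = "ifany"))
cell_counts
```

Panels with few loans give rougher density curves, which is one reason the joint panels look less smooth.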

ggplot(loans_full_schema, aes(x = reorder(loan_purpose, loan_amount, median), y = loan_amount)) +
  geom_boxplot(fill = "steelblue", outlier.color = "black", outlier.alpha = 0.5) +
  coord_flip() + 
  labs(title = "6. Loan Amount by Loan Purpose",
       x = "Loan Purpose",
       y = "Loan Amount ($)")

Question: Does the loan purpose affect the amount borrowed? Yes, the loan amount differs noticeably by purpose.

Smallest Loans: “Vacation” and “moving” loans have the lowest median amounts, typically under $10,000.

Largest Loans: “Small business,” “debt consolidation,” and “house” loans have the highest medians.
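The ordering used in the boxplot comes from the median loan amount of each purpose, which can be printed as a sorted list. This is a minimal sketch, assuming `loans_full_schema` from `openintro`; `purpose_median` is an illustrative name.

```r
library(openintro)

# Median loan amount per purpose, sorted from smallest to largest
# (the same statistic that reorder(loan_purpose, loan_amount, median) uses)
purpose_median <- sort(tapply(loans_full_schema$loan_amount,
                              loans_full_schema$loan_purpose,
                              median))
purpose_median
```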

3. Data visualization and exploration tasks with ‘ames’ data set

1. The ames dataset is a record of residential home sales in Ames, Iowa, occurring between 2006 and 2010. It contains 2,930 samples (individual property sales) and 82 features (variables describing each property). The scope of these features is vast, covering physical dimensions (such as lot size and total living area), structural characteristics (including foundation type, roof style, and building type), amenities (like garages, fireplaces, and pools), assessment of condition and quality, as well as the final sale price. This dataset is widely used for predictive modeling and exploratory data analysis in real estate.
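The size and time span stated above can be checked against the data itself. This is a short sketch, assuming the `ames` data frame from `openintro` and its `Yr.Sold` column; `ames_dim` and `sale_years` are illustrative names.

```r
library(openintro)

# Number of rows (sales) and columns (features)
ames_dim <- dim(ames)

# First and last year of sale in the data
sale_years <- range(ames$Yr.Sold)

ames_dim
sale_years
```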

ggplot(ames, aes(x = area, y = price)) +
  geom_point(alpha = 0.4, color = "coral") +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "2. Relationship Between Living Area and Price",
       x = "Above Grade Living Area (sq ft)",
       y = "Sale Price")

The scatter plot indicates a strong positive correlation between living area and sale price. Generally, as the square footage of a home increases, its price also increases. However, the variance in price also increases with larger homes. There are also a few notable outliers: properties with massive living areas (over 4,000 sq ft) that sold for disproportionately low prices, which could indicate agricultural zoning, poor condition, or partial sales.
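The strength of the relationship and the large-area outliers can be inspected directly. This is a sketch only, assuming the `ames` data frame from `openintro`; the 4,000 sq ft cutoff comes from the description above, `Sale.Condition` is included as one column that might help explain the low prices, and the object names are illustrative.

```r
library(openintro)

# Pearson correlation between living area and sale price
r_area_price <- cor(ames$area, ames$price)

# Homes with more than 4,000 sq ft of living area
large_homes <- ames[ames$area > 4000, c("area", "price", "Sale.Condition")]

round(r_area_price, 3)
large_homes
```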

ggplot(ames, aes(x = Bldg.Type, y = price, fill = Bldg.Type)) +
  geom_boxplot(alpha = 0.7) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "3. Sale Price Distribution by Building Type",
       x = "Building Type",
       y = "Sale Price") +
  theme(legend.position = "none")

1Fam: Single-family Detached (a standalone residential structure).

2FmCon: Two-family Conversion (originally a single-family home converted to house two families).

Duplex: Duplex (a single structure divided into two separate residences).

TwnhsE: Townhouse End Unit (a townhouse attached to another on only one side).

Twnhs: Townhouse Inside Unit (a townhouse attached to others on both sides).

The graph reveals that 1Fam homes have the highest median sale price, the widest distribution, and the highest-priced outliers. Townhouse End units (TwnhsE) hold the second-highest median price. Multi-family units like Duplexes and Two-family Conversions (2FmCon) tend to have the lowest median sale prices and a much narrower price range.

ggplot(ames, aes(x = area, y = price, color = Bldg.Type)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ Bldg.Type) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "4. Area vs. Price Separated by Building Type",
       x = "Above Grade Living Area (sq ft)",
       y = "Sale Price") +
  theme(legend.position = "none")

By faceting the plot, we can see that the positive correlation between area and price holds true across almost all building types. However, 1Fam dominates both the high-area and high-price scales. Townhouses (Twnhs and TwnhsE) show a steep positive correlation but hit a strict ceiling in terms of maximum square footage. Duplexes and Conversions show a much weaker relationship, where increased area does not yield a proportionately higher sale price compared to single-family homes.
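To compare how steep the area and price relationship is in each panel, a separate straight-line fit per building type gives one slope per group. This is a rough sketch, assuming the `ames` data frame from `openintro`; a simple linear fit is my own choice here, and `slope_by_type` is an illustrative name.

```r
library(openintro)

# Fit price ~ area separately within each building type and keep
# the slope (price change in dollars per extra square foot)
slope_by_type <- sapply(
  split(ames, factor(ames$Bldg.Type)),
  function(d) coef(lm(price ~ area, data = d))[["area"]]
)
round(slope_by_type, 1)
```

A smaller slope for a building type means that extra area adds less to the sale price in that group.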

ggplot(ames, aes(x = area, y = price, color = Year.Built)) +
  geom_point(alpha = 0.7) +
  scale_color_viridis_c(option = "magma") +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "5. Effect of Area and Year Built on Price",
       x = "Living Area (sq ft)",
       y = "Sale Price",
       color = "Year Built")

This plot suggests that year built matters in addition to area: newer homes (lighter colors) consistently cluster along the upper edge of the price distribution for any given area. Older homes (darker colors) generally form the bottom edge. So while area sets the baseline price, the age of the house appears to act as a strong multiplier. For example, a 2,000 sq ft home built in 2005 tends to command a significantly higher price than a 2,000 sq ft home built in 1950.
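One hedged way to estimate the size of this multiplier is a regression on the log of price, where each coefficient acts as a multiplicative effect. This is only a sketch, assuming the `ames` data frame from `openintro`; the log-linear model is my own simplification and not something the plot itself proves, and the object names are illustrative.

```r
library(openintro)

# Model log(price) so that Year.Built has a multiplicative effect on price
fit_log <- lm(log(price) ~ area + Year.Built, data = ames)
year_coef <- coef(fit_log)[["Year.Built"]]

# Estimated price ratio for two homes of the same area,
# one built in 2005 and one built in 1950
ratio_2005_vs_1950 <- exp(year_coef * (2005 - 1950))
round(ratio_2005_vs_1950, 2)
```

A ratio above 1 would mean the model estimates a higher price for the newer home at the same living area.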

ggplot(ames, aes(x = as.factor(Overall.Qual), y = price)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "6. Sale Price by Overall Quality Rating",
       x = "Overall Quality (1 = Poor, 10 = Excellent)",
       y = "Sale Price")

Question: How does the Overall Quality (Overall.Qual) rating of a house impact its sale price, and what is the relationship between them?

The graph shows a positive relationship that looks roughly exponential. Moving from quality level 3 to 4, or 4 to 5, raises the median price by only a small margin. However, moving from level 8 to 9, or 9 to 10, results in massive jumps in the median price and an explosion in the price ceiling. This indicates that premium quality yields disproportionately higher returns in the real estate market.
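The jumps between quality levels can be measured from the medians. This is a minimal sketch, assuming the `ames` data frame from `openintro`; comparing differences in dollars with differences in log price is one simple way to judge whether growth is closer to exponential, and the object names are illustrative.

```r
library(openintro)

# Median sale price at each overall quality level
median_by_qual <- tapply(ames$price, ames$Overall.Qual, median)

# Change in median price from one quality level to the next, in dollars
step_dollars <- diff(median_by_qual)

# Change on the log scale: roughly constant steps would point to
# exponential growth, while growing steps in dollars alone would not
step_log <- diff(log(median_by_qual))

round(step_dollars)
round(step_log, 2)
```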

Duc Vinh Hoang

2026-02-05
