library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
ggplot(gpa, aes(x = studyweek, y = gpa)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "2. Relationship Between Weekly Study Hours and GPA",
x = "Hours Studied per Week",
y = "GPA"
)
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot shows a weak positive relationship between weekly study
hours and GPA: students who study more tend to have a slightly higher
GPA, but the increase is small. There is also an outlier who studies
only about 15 hours a week but still has the highest GPA.
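To put a rough number on how strong this relationship is, the correlation and the slope of the fitted line can be computed directly. This is a minimal sketch, assuming the `gpa` data from `openintro` as loaded above; the object names (`r_study`, `fit_study`, `slope_study`, `top_student`) are only illustrative.

```r
library(openintro)

# Pearson correlation between weekly study hours and GPA
r_study <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")

# Slope of the least-squares line that geom_smooth(method = "lm") draws:
# the estimated change in GPA per extra study hour per week
fit_study <- lm(gpa ~ studyweek, data = gpa)
slope_study <- coef(fit_study)[["studyweek"]]

# The student with the highest GPA, to check the outlier's study hours
top_student <- gpa[which.max(gpa$gpa), c("gpa", "studyweek")]

print(r_study)
print(slope_study)
print(top_student)
```

A correlation close to 0 and a small slope would support reading the trend as weak.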
ggplot(gpa, aes(x = out, y = gpa)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "3. Relationship Between Out and GPA",
x = "Hours of going out per week",
y = "GPA")
## `geom_smooth()` using formula = 'y ~ x'
Based on the scatterplot, students with a GPA of 4 tend to spend
around 1 to 3 hours going out per week.
ggplot(gpa, aes(x = out, y = sleepnight)) +
geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs(title = "4. Relationship between Going Out and Sleep",
x = "Hours Going Out per Week", y = "Hours Sleep per Night")
## `geom_smooth()` using formula = 'y ~ x'
The points are scattered with no clear pattern, so it is hard to see
any relationship between hours going out and hours of sleep.
ggplot(gpa, aes(x = gender, y = studyweek, fill = gender)) +
geom_boxplot() +
labs(title = "5. Study Hours by Gender",
x = "Gender", y = "Hours Studied per Week")
Female students tend to spend more time studying than male students,
and their median is higher as well.
ggplot(gpa, aes(x = gender, y = out, fill = gender)) +
geom_boxplot() +
labs(title = "6. Going Out Hours by Gender",
x = "Gender", y = "Hours Going Out per Week")
This time, the plot shows that male students tend to spend more time
going out than female students.
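The medians that the two boxplots display can also be read off as numbers. The sketch below assumes the same `gpa` data; `gender_medians` is just an illustrative name.

```r
library(openintro)

# Median weekly study hours and going-out hours for each gender
gender_medians <- aggregate(cbind(studyweek, out) ~ gender,
                            data = gpa, FUN = median)
print(gender_medians)
```

Comparing the two rows shows whether the differences seen in the boxes are large or only marginal.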
ggplot(gpa, aes(x = sleepnight, y = gpa)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "7. Does Sleep Affect GPA?",
x = "Hours Sleep per Night", y = "GPA")
## `geom_smooth()` using formula = 'y ~ x'
The Trend Line: The blue line is nearly horizontal. It slopes up very
slightly, but the slope is almost negligible. This indicates that, on
average, getting more sleep (from 5 to 9 hours) does not strongly
correspond to a higher GPA in this specific group of students.
The Scatter: Look at the vertical columns of dots at 6, 7, and 8 hours.
At 7 hours of sleep, for example, GPAs range from as low as ~3.1 to as
high as 4.0 (and one outlier even higher). This wide spread means that
sleep duration is not a good predictor of GPA for these students.
The Outlier: Once again, the single student with a GPA above 4.6
appears between 6 and 7 hours of sleep. This outlier pulls the line up
slightly, but even with it included, the relationship remains very
weak.
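One way to check how much the outlier influences the line is to fit the regression with and without it. This is only a sketch: the 4.6 cutoff is taken from the description above, and the names `slope_all`, `gpa_trimmed`, and `slope_trimmed` are made up for illustration.

```r
library(openintro)

# Slope of GPA on nightly sleep using all students
slope_all <- coef(lm(gpa ~ sleepnight, data = gpa))[["sleepnight"]]

# Same fit after dropping students with a GPA above 4.6
# (cutoff assumed from the outlier described in the text)
gpa_trimmed <- gpa[gpa$gpa <= 4.6, ]
slope_trimmed <- coef(lm(gpa ~ sleepnight, data = gpa_trimmed))[["sleepnight"]]

print(c(all = slope_all, trimmed = slope_trimmed))
```

If the two slopes are close, the outlier has little effect on the fitted line.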
ggplot(loans_full_schema, aes(x = interest_rate)) +
geom_histogram(aes(y = after_stat(density)),
binwidth = 1,
fill = "skyblue",
color = "white") +
geom_density(color = "darkblue", linewidth = 1) +
labs(title = "1. Distribution of Interest Rates",
x = "Interest Rate (%)",
y = "Density")
The graph reveals that the interest rate distribution is
right-skewed. Most of the density sits between roughly 5% and 14%,
indicating that the most common interest rates fall in this lower range.
As rates increase beyond 15%, the frequency of loans drops off
sharply, though there is a long tail extending toward 30%.
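A quick numeric check of the skew is to compare the mean with the median and look at the quantiles. The sketch below assumes `loans_full_schema` from `openintro`; the variable names are illustrative.

```r
library(openintro)

rates <- loans_full_schema$interest_rate

# In a right-skewed distribution the mean usually exceeds the median
rate_mean <- mean(rates, na.rm = TRUE)
rate_median <- median(rates, na.rm = TRUE)

# Quartiles and extremes show how far the upper tail stretches
rate_quantiles <- quantile(rates, probs = c(0, 0.25, 0.5, 0.75, 1),
                           na.rm = TRUE)

print(c(mean = rate_mean, median = rate_median))
print(rate_quantiles)
```

A mean above the median, together with a maximum far from the upper quartile, would be consistent with the long right tail in the histogram.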
ggplot(loans_full_schema, aes(x = grade, y = interest_rate, fill = grade)) +
geom_boxplot() +
labs(title = "2. Interest Rates by Loan Grade (Categorical vs. Numeric Variable)",
x = "Loan Grade",
y = "Interest Rate (%)")
This boxplot demonstrates a clear, strong positive relationship
between loan grade and interest rate. Grade “A” loans have the lowest
median interest rate and the smallest spread. As the grade worsens
from A to G, the median interest rate increases substantially, and the
spread generally becomes wider, indicating that riskier loans (lower
grades) come with higher and more variable interest rates.
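The center and spread of each box can be summarized as a small table. This is a sketch under the assumption that `grade` and `interest_rate` are the columns used in the plot above; `grade_median` and `grade_iqr` are illustrative names.

```r
library(openintro)

# Median and interquartile range of the interest rate within each grade
grade_median <- tapply(loans_full_schema$interest_rate,
                       loans_full_schema$grade, median, na.rm = TRUE)
grade_iqr <- tapply(loans_full_schema$interest_rate,
                    loans_full_schema$grade, IQR, na.rm = TRUE)

print(round(cbind(median = grade_median, IQR = grade_iqr), 2))
```

Medians that rise from A to G confirm the trend, and the IQR column shows whether the spread really widens with each grade.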
ggplot(loans_full_schema, aes(x = annual_income, y = loan_amount)) +
geom_bin2d(bins = 50) +
scale_x_log10(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_viridis_c() +
labs(title = "3. Heatmap of Annual Income vs. Loan Amount (two numeric variables)",
x = "Annual Income (Log Scale)",
y = "Loan Amount")
## Warning in scale_x_log10(labels = scales::dollar): log-10 transformation
## introduced infinite values.
## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_bin2d()`).
The heatmap reveals a central cluster where the density of data is
highest. Most borrowers in this dataset have an annual income between
$40,000 and $100,000 and take out loans between $5,000 and $20,000.
There is a general positive correlation: as income increases (moving
right), the ceiling for loan amounts tends to rise (the heatmap expands
upward), suggesting wealthier individuals qualify for or request larger
loans.
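The warnings printed after the heatmap code come from the log scale: `log10(0)` is `-Inf`, so rows with an income of zero cannot be placed on the axis. A minimal sketch for counting such rows, assuming zero (or missing) incomes are the cause:

```r
library(openintro)

income <- loans_full_schema$annual_income

# log10() of 0 is -Inf, which a log-scaled axis cannot place
n_nonpositive <- sum(income <= 0, na.rm = TRUE)
n_missing <- sum(is.na(income))

print(c(nonpositive = n_nonpositive, missing = n_missing))
```

If the total matches the number of rows reported as removed, the dropped points are explained and the heatmap describes only borrowers with a positive reported income.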
ggplot(loans_full_schema, aes(x = loan_amount)) +
geom_histogram(binwidth = 2500, fill = "coral", color = "white") +
facet_wrap(~ homeownership) +
labs(title = "4. Loan Amount Distribution by Homeownership",
x = "Loan Amount",
y = "Count")
Using facet_wrap allows us to see that the volume of loans is
highest among those with “MORTGAGE” and “RENT” statuses.
+ Mortgage holders tend to have a flatter distribution with a higher frequency of large loans (visible tails extending to $40k).
+ Renters show a distribution that is more heavily skewed toward smaller loan amounts (peaking around $10k).
+ The “OWN” category has significantly fewer observations overall.
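The panel sizes and typical loan amounts can be checked with a count table and group medians. The sketch assumes the same `loans_full_schema` columns as the plot; `ownership_counts` and `ownership_median` are illustrative names.

```r
library(openintro)

# Number of loans in each homeownership category
ownership_counts <- table(loans_full_schema$homeownership)

# Median loan amount in each category
ownership_median <- tapply(loans_full_schema$loan_amount,
                           loans_full_schema$homeownership,
                           median, na.rm = TRUE)

print(ownership_counts)
print(ownership_median)
```

Because the histograms show raw counts, the count table helps separate "fewer loans overall" from "smaller loans".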
ggplot(loans_full_schema, aes(x = interest_rate)) +
geom_density(fill = "coral", alpha = 0.5) +
facet_grid(term ~ application_type) +
labs(title = "5. Interest Rate Density by Term and Application Type",
x = "Interest Rate (%)",
y = "Density")
Term Effect (Rows): The 36-month loans (top row) generally have
peaks at lower interest rates compared to 60-month loans (bottom row).
The 60-month loans have a flatter distribution spread across higher
rates, indicating long-term loans are often associated with higher
interest rates.
Application Type Effect (Columns): Individual and Joint applications show relatively similar shapes within their respective terms, though Individual applications (being the vast majority of the data) have smoother density curves.
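The same comparison can be made numerically, one value per panel of the grid. This is a sketch assuming `term` and `application_type` are the faceting columns used above; the object names are illustrative.

```r
library(openintro)

# How many loans fall in each panel of the facet grid
panel_counts <- table(term = loans_full_schema$term,
                      application_type = loans_full_schema$application_type)

# Median interest rate per panel
panel_medians <- aggregate(interest_rate ~ term + application_type,
                           data = loans_full_schema, FUN = median)

print(panel_counts)
print(panel_medians)
```

The counts show how much data sits behind each density curve, which matters when comparing how smooth the curves look.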
ggplot(loans_full_schema, aes(x = reorder(loan_purpose, loan_amount, median), y = loan_amount)) +
geom_boxplot(fill = "steelblue", outlier.color = "black", outlier.alpha = 0.5) +
coord_flip() +
labs(title = "6. Loan Amount by Loan Purpose",
x = "Loan Purpose",
y = "Loan Amount ($)")
This plot asks whether the loan purpose affects the amount borrowed,
and it does: typical loan amounts differ noticeably across purposes.
+ Smallest Loans: “Vacation” and “moving” loans have the lowest median amounts, typically under $10,000.
+ Largest Loans: “Small business,” “debt consolidation,” and “house” loans have the highest medians.
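The ordering used in the boxplot can be reproduced as a sorted table of medians. A minimal sketch, with `purpose_medians` as an illustrative name:

```r
library(openintro)

# Median loan amount per purpose, sorted from smallest to largest
# (the same ordering that reorder(loan_purpose, loan_amount, median)
# applies to the plot)
purpose_medians <- sort(tapply(loans_full_schema$loan_amount,
                               loans_full_schema$loan_purpose,
                               median, na.rm = TRUE))
print(purpose_medians)
```

The first and last few entries correspond to the smallest and largest typical loans.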
1. The ames dataset is a record of residential home sales in
Ames, Iowa, occurring between 2006 and 2010. It contains 2,930 samples
(individual property sales) and 82 features (variables describing each
property). The scope of these features is vast, covering physical
dimensions (such as lot size and total living area), structural
characteristics (including foundation type, roof style, and building
type), amenities (like garages, fireplaces, and pools), assessment of
condition and quality, as well as the final sale price. This dataset is
widely used for predictive modeling and exploratory data analysis in
real estate.
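The size of the dataset and the presence of the variables used in the plots can be confirmed before plotting. This sketch assumes the `ames` data shipped with `openintro`; `plot_vars` is an illustrative name.

```r
library(openintro)

# Rows are individual sales, columns are variables
print(dim(ames))

# Confirm that the variables used in the plots are present
plot_vars <- c("area", "price", "Bldg.Type", "Year.Built", "Overall.Qual")
print(plot_vars %in% names(ames))
```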
ggplot(ames, aes(x = area, y = price)) +
geom_point(alpha = 0.4, color = "coral") +
scale_y_continuous(labels = scales::dollar) +
labs(title = "2. Relationship Between Living Area and Price",
x = "Above Grade Living Area (sq ft)",
y = "Sale Price")
The scatter plot indicates a strong positive correlation between
living area and sale price. Generally, as the square footage of a home
increases, its price also increases. However, the variance in price also
increases with larger homes. There are also a few notable outliers:
properties with massive living areas (over 4,000 sq ft) that sold for
disproportionately low prices, which could indicate agricultural zoning,
poor condition, or partial sales.
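The strength of the correlation and the large-area outliers can both be inspected directly. The sketch below uses the 4,000 sq ft cutoff mentioned above; `r_area_price` and `large_homes` are illustrative names.

```r
library(openintro)

# Strength of the linear association between living area and sale price
r_area_price <- cor(ames$area, ames$price, use = "complete.obs")

# Homes above 4,000 sq ft, the group the text singles out as unusual
large_homes <- ames[ames$area > 4000, c("area", "price")]

print(r_area_price)
print(large_homes)
```

Listing the large homes makes it possible to look up their other recorded characteristics before guessing why some sold for so little.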
ggplot(ames, aes(x = Bldg.Type, y = price, fill = Bldg.Type)) +
geom_boxplot(alpha = 0.7) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "3. Sale Price Distribution by Building Type",
x = "Building Type",
y = "Sale Price") +
theme(legend.position = "none")
1Fam: Single-family Detached (a standalone residential structure).
2FmCon: Two-family Conversion (originally a single-family home converted to house two families).
Duplex: Duplex (a single structure divided into two separate residences).
TwnhsE: Townhouse End Unit (a townhouse attached to another on only one side).
Twnhs: Townhouse Inside Unit (a townhouse attached to others on both sides).
The graph reveals that 1Fam homes have the highest median sale price, the widest distribution, and the highest-priced outliers. Townhouse End units (TwnhsE) hold the second-highest median price. Multi-family units like Duplexes and Two-family Conversions (2FmCon) tend to have the lowest median sale prices and a much narrower price range.
ggplot(ames, aes(x = area, y = price, color = Bldg.Type)) +
geom_point(alpha = 0.5) +
facet_wrap(~ Bldg.Type) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "4. Area vs. Price Separated by Building Type",
x = "Above Grade Living Area (sq ft)",
y = "Sale Price") +
theme(legend.position = "none")
By faceting the plot, we can see that the positive correlation
between area and price holds true across almost all building types.
However, 1Fam dominates both the high-area and high-price scales.
Townhouses (Twnhs and TwnhsE) show a steep positive correlation but hit
a strict ceiling in terms of maximum square footage. Duplexes and
Conversions show a much weaker relationship, where increased area does
not yield a proportionately higher sale price compared to single-family
homes.
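To compare how steeply price rises with area in each panel, a separate least-squares line can be fitted per building type. This is a rough sketch, not a full model; `slopes_by_type` is an illustrative name.

```r
library(openintro)

# Fit price ~ area separately within each building type and keep the slope
# (estimated dollars per additional square foot)
slopes_by_type <- sapply(split(ames, ames$Bldg.Type), function(d) {
  coef(lm(price ~ area, data = d))[["area"]]
})
print(round(slopes_by_type, 1))
```

A smaller slope for Duplex and 2FmCon than for 1Fam would support the weaker relationship described above.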
ggplot(ames, aes(x = area, y = price, color = Year.Built)) +
geom_point(alpha = 0.7) +
scale_color_viridis_c(option = "magma") +
scale_y_continuous(labels = scales::dollar) +
labs(title = "5. Effect of Area and Year Built on Price",
x = "Living Area (sq ft)",
y = "Sale Price",
color = "Year Built")
This plot suggests that build year matters on top of area: newer homes
(lighter colors) tend to cluster along the upper edge of the price
distribution for any given living area, while older homes (darker
colors) generally form the bottom edge. Area appears to set the baseline
price, and the age of the house shifts it substantially. For example, a
2,000 sq ft home built in 2005 tends to sell for noticeably more than a
2,000 sq ft home built in 1950.
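One way to probe whether build year changes how price rises with area is a linear model with an interaction term. The sketch below is an assumption about how this could be checked, not part of the original analysis; `fit_interaction` is an illustrative name.

```r
library(openintro)

# price ~ area * Year.Built expands to
# area + Year.Built + area:Year.Built;
# the last term lets the price-per-square-foot slope change with build year
fit_interaction <- lm(price ~ area * Year.Built, data = ames)
print(coef(summary(fit_interaction)))
```

A positive `area:Year.Built` coefficient would mean each extra square foot adds more to the price of a newer home than of an older one.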
ggplot(ames, aes(x = as.factor(Overall.Qual), y = price)) +
geom_boxplot(fill = "steelblue", alpha = 0.7) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "6. Sale Price by Overall Quality Rating",
x = "Overall Quality (1 = Poor, 10 = Excellent)",
y = "Sale Price")
Question: How does the Overall Quality (Overall.Qual) rating of a
house impact its sale price, and what is the relationship between
them?
The graph shows a positive relationship that steepens at higher quality levels, roughly exponential in shape. Moving from quality level 3 to 4, or 4 to 5, raises the median price by only a small margin. Moving from level 8 to 9, or 9 to 10, produces much larger jumps in the median price and a far higher price ceiling. This suggests that premium quality is associated with disproportionately higher sale prices.
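The size of each step can be checked from the medians themselves. In this sketch the names `qual_medians`, `qual_steps`, and `qual_ratios` are illustrative.

```r
library(openintro)

# Median sale price at each quality level
qual_medians <- tapply(ames$price, ames$Overall.Qual, median, na.rm = TRUE)

# Dollar change in the median from one level to the next
qual_steps <- diff(qual_medians)

# Ratio of each median to the previous one; roughly constant ratios
# would indicate exponential-like growth
qual_ratios <- qual_medians[-1] / qual_medians[-length(qual_medians)]

print(qual_medians)
print(qual_steps)
print(round(qual_ratios, 2))
```

Growing dollar steps show the steepening, while the ratios indicate whether "exponential" is a fair description of the shape.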