Loading Libraries:
knitr::opts_chunk$set(fig.align = "center", out.width = "80%", message = FALSE,
warning = FALSE, cache = TRUE)
library(openintro)
library(tidyverse)
library(ggridges)
library(nycflights13)
Global Theme Set Up:
my_theme <- theme(
plot.title = element_text(size = rel(1.8), hjust = 0.5),
axis.title = element_text(size = rel(1.5), hjust = 0.5,
margin = margin(15,15,15,15)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.1)))
Answer: gpa (grade point average) is a numerical summary of a
student’s academic performance. studyweek is the number of hours a
student studies per week, while sleepnight is the number of hours they
sleep each night. out is the number of nights a student goes out per
week. Lastly, gender specifies whether the student is male or female.
Code:
ggplot(data = gpa, mapping = aes(x = studyweek, y = gpa)) +
geom_point(size = rel(1.2)) +
geom_smooth(linewidth = 1.2, color = "skyblue3", linetype = "dashed") +
labs(title = "The Relation Between GPA and Study Hours",
x = "Study hours per week",
y = "GPA") +
ylim(2.5, 5) +
theme(plot.title = element_text(color = "blue4"),
axis.title = element_text(color = "purple4")) +
my_theme
Answer: The data points appear widely scattered,
suggesting no strong correlation between study hours and GPA. While some
students with fewer study hours achieve high GPAs, others who study more
still have varied GPA outcomes. The distribution indicates that other
factors beyond study hours may significantly influence GPA.
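The visual impression of a weak relationship can be put into a single number with a correlation coefficient. The following is a minimal sketch, assuming the gpa data frame from openintro is available as in the chunks above; the name study_gpa_cor is introduced here only for illustration.

```r
library(openintro)  # provides the gpa data frame

# Pearson correlation between weekly study hours and GPA.
# use = "complete.obs" drops rows with missing values, if there are any.
study_gpa_cor <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")

# A value near 0 would agree with the scattered pattern in the plot.
study_gpa_cor
```

Pearson's coefficient only measures linear association, so it complements the plot and does not replace it.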
Code:
ggplot(data = gpa, mapping = aes(x = out, y = gpa)) +
geom_point(size = rel(1.2)) +
geom_smooth(linewidth = 1.2, color = "skyblue3", linetype = "dashed") +
labs(title = "The Relation Between GPA and Happy Nights",
x = "Going-out nights (per week)",
y = "GPA") +
theme(plot.title = element_text(color = "blue4"),
axis.title = element_text(color = "purple4")) +
my_theme
Answer: The data points are widely dispersed in the
plot, indicating no strong correlation between going-out frequency and
GPA. The plot shows that students with high GPAs are found across all
levels of social activity, suggesting that moderate socializing does not
necessarily harm academic performance. However, individual variations
imply that other factors may contribute to GPA outcomes.
Code:
ggplot(data = gpa, mapping = aes(x = out, y = sleepnight)) +
geom_point(size = rel(1.2)) +
geom_smooth(linewidth = 1.2, color = "skyblue3", linetype = "dashed") +
labs(title = "The Relation Between Happy Nights and Sleep Hours",
x = "Going-out nights (per week)",
y = "Hours of sleep (per night)") +
theme(plot.title = element_text(color = "blue4"),
axis.title = element_text(color = "purple4")) +
my_theme
Answer: The plot suggests a positive correlation between the number of
going-out nights per week and hours of sleep per night, indicating that
students who get more sleep at night tend to go out more. Further
statistical analysis is recommended to validate this trend.
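One possible form of that follow-up analysis is a correlation test. The sketch below uses Spearman's rank correlation, chosen here as an assumption because out takes only a few distinct values; it is not necessarily the analysis the assignment intends. The name out_sleep_test is hypothetical.

```r
library(openintro)  # provides the gpa data frame

# Spearman rank correlation test between nights out and hours of sleep.
# exact = FALSE requests the approximate p-value, since tied ranks make
# the exact computation unavailable.
out_sleep_test <- cor.test(gpa$out, gpa$sleepnight,
                           method = "spearman", exact = FALSE)

out_sleep_test$estimate  # rank correlation (rho)
out_sleep_test$p.value   # p-value for the null hypothesis of no association
```

A small p-value together with a positive estimate would support the trend seen in the plot; otherwise the apparent pattern may be noise.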
Code:
ggplot(data = gpa, mapping = aes(x = gender, y = studyweek)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Which Gender Studies More?",
x = "Gender",
y = "Study hours (per week)") +
theme(plot.title = element_text(color = "blue4"),
axis.title = element_text(color = "purple4")) +
my_theme
Answer: The plot shows that the median weekly study
hours for male and female students are approximately the same. However,
female students have a wider range of study hours, ranging from around 4
to 50 hours per week, whereas male students range from 2 to 30 hours.
Additionally, there is an outlier among male students who studied over
40 hours per week.
Code:
ggplot(data = gpa, mapping = aes(x = gender, y = out)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Which Gender Goes Out More?",
x = "Gender",
y = "Going-out nights (per week)") +
theme(plot.title = element_text(color = "blue4"),
axis.title = element_text(color = "purple4")) +
my_theme
Answer: The box plot compares the number of going-out
nights per week between males and females, showing similar
distributions. Females have a median of around 2 nights, with an
interquartile range (IQR) from 1 to 3 and a total range of 0 to 4. Males
have a slightly higher median of about 2.5 nights, with an IQR between 2
and 3 and a range from 1 to 4. While males typically go out slightly
more, the overall variability and distribution are similar between
the two groups.
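The medians, quartiles, and ranges read off the box plot can be cross-checked numerically. This is a sketch assuming dplyr (loaded with the tidyverse above) and the gpa data frame; gender_summary and its column names are made up for illustration.

```r
library(openintro)  # provides the gpa data frame
library(dplyr)

# Five-number style summary of going-out nights, one row per gender.
gender_summary <- gpa %>%
  group_by(gender) %>%
  summarise(
    n          = n(),
    min_out    = min(out, na.rm = TRUE),
    q1_out     = quantile(out, 0.25, na.rm = TRUE, names = FALSE),
    median_out = median(out, na.rm = TRUE),
    q3_out     = quantile(out, 0.75, na.rm = TRUE, names = FALSE),
    max_out    = max(out, na.rm = TRUE)
  )

gender_summary
```

Replacing out with studyweek gives the same check for the study-hours comparison.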
Code:
ggplot(data = loans_full_schema, mapping = aes(x = total_credit_limit)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, boundary = 0,
na.rm = TRUE, fill ="orange2", color = "orange3") +
geom_density(linewidth = 1.2, adjust = 30/20, na.rm = TRUE, color = "green4") +
scale_x_continuous(labels = scales::dollar, limits = c(0, 1000000)) +
labs(title = "Total Available Credit of Lending Club Borrowers",
x = "Total available credit (USD)",
y = "Density") +
theme(plot.title = element_text(color = "green4"),
axis.title = element_text(color = "orange4")) +
my_theme
Answer: The graph presents a right-skewed distribution
of total available credit from credit cards, excluding mortgages. A few
individuals have exceptionally high credit limits, ranging between $2
million and $3 million. To enhance clarity and detail, the upper limit
of the plotting range is set at $1 million.
Code:
ggplot(data = loans_full_schema, mapping = aes(x = grade, y = paid_total)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 10000), labels = scales::dollar) +
labs(title = "The Relation Between Grade and Total Paid",
x = "Grade level for loan",
y = "Total loan paid (USD)") +
theme(plot.title = element_text(color = "green4"),
axis.title = element_text(color = "orange4")) +
my_theme
Answer: The box plot depicts the relationship between
loan grade levels (A to G) and the total amount paid on loans in USD.
The data indicate that lower-grade loans (F and G) generally have
higher median total payments than higher-grade loans (A to E).
Additionally, the plot reveals greater variability and more outliers
in higher-grade loans. This suggests that riskier loans (lower grades)
tend to have higher repayment amounts, possibly due to higher interest
rates or longer repayment terms.
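One caveat when reading medians off a box plot with a restricted axis: limits set through scale_y_continuous() remove out-of-range observations before the box plot statistics are computed, whereas coord_cartesian() only zooms the view. The sketch below shows the difference on a small made-up data frame (toy, p_scale, and p_coord are illustrative names, not part of the assignment data).

```r
library(ggplot2)

# Toy data: four small values and one large value.
toy <- data.frame(group = "a", value = c(1, 2, 3, 4, 100))

# Scale limits: the value 100 is dropped before the median is computed.
p_scale <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot(na.rm = TRUE) +
  scale_y_continuous(limits = c(0, 10))

# Coordinate limits: all five values are used, the view is only zoomed.
p_coord <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 10))

# layer_data() returns the statistics computed for the first layer;
# the middle column holds the median.
median_scale <- layer_data(p_scale)$middle  # 2.5
median_coord <- layer_data(p_coord)$middle  # 3

c(scale_limits = median_scale, coord_limits = median_coord)
```

If the medians in the loan plot are meant to describe all loans, swapping the limits argument for coord_cartesian(ylim = c(0, 10000)) would be one way to keep every observation in the calculation.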
Code:
ggplot(data = loans_full_schema, mapping = aes(x = balance, y = paid_total)) +
geom_bin_2d() +
geom_smooth(color = "gold2", linewidth = 1.5) +
labs(title = "The Relation Between Balance and Total Paid",
x = "Loan balance (USD)",
y = "Total loan paid (USD)") +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
theme(plot.title = element_text(color = "green4"),
axis.title = element_text(color = "orange4")) +
my_theme
Answer: The graph illustrates a weak positive
correlation between loan balance and total loan paid. It suggests that
individuals with lower loan balances exhibit a wider range of total
payments and generally pay less. In contrast, those with higher loan
balances tend to have higher total loan payments.
Code:
ggplot(data = loans_full_schema) +
geom_point(mapping = aes(x = installment, y = annual_income)) +
facet_wrap(~ homeownership) +
scale_y_continuous(limits = c(0, 1500000), labels = scales::dollar) +
labs(title = "The Relation of Installment, Annual Income, and Homeownership",
x = "Monthly payment for loan",
y = "Annual income (USD)") +
theme(plot.title = element_text(color = "green4"),
axis.title = element_text(color = "orange4")) +
my_theme
Answer: From the plot, there is no obvious
difference between the three categories: mortgage, own, and rent. While
some outliers exist, the majority of individuals, regardless of
homeownership type, pay similar installment amounts within the $0 to
$1,500 range. Moreover, higher income does not necessarily correspond to
higher installment payments.
Code:
ggplot(data = loans_full_schema) +
geom_point(mapping = aes(x = total_debit_limit, y = num_open_cc_accounts)) +
facet_grid(grade ~ homeownership) +
scale_x_continuous(limits = c(0, 3e+5), labels = scales::dollar) +
labs(title = "Total Limits & Card Accounts vs Grade & Homeownership",
x = "Total limits of all bank cards",
y = "Number of open credit card accounts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.8), color = "green4",
margin = margin(5,10,5,5)),
axis.title = element_text(hjust = 0.5, size = rel(1.4), color = "orange4"),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,5,10,5)),
axis.text = element_text(size = rel(0.9)),
panel.spacing = unit(1, "lines"),
plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"),
axis.text.x = element_text(size = rel(0.9)))
Answer: The graph shows that the higher credit grades
(A-C) are associated with higher credit limits and more credit card
accounts, regardless of homeownership type, while lower credit grades
(D-G) tend to have lower limits and fewer accounts. Interestingly,
outliers exist in all categories, indicating that some individuals have
exceptionally high credit limits or numerous accounts, even in the lower
credit grades.
Answer: The dataset contains property assessment
data from the Ames Assessor’s Office, detailing the assessed values of
residential properties sold in Ames, IA, between 2006 and 2010. It
comprises 2,930 rows and 82 variables covering property features,
building characteristics, size, utilities, rooms, exterior, location,
and sales information, with multiple attributes in each group.
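The row and column counts quoted above can be checked directly. A minimal sketch, assuming the ames data frame from openintro; ames_dims is a name introduced here for illustration.

```r
library(openintro)  # provides the ames data frame

# Number of rows (houses) and columns (variables).
ames_dims <- dim(ames)
ames_dims

# Peek at the first few variable names.
head(names(ames), 10)
```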
Code:
ggplot(data = ames, mapping = aes(x = area, y = price)) +
geom_point(color = "purple4", fill = "pink3", shape = 21, size = rel(2)) +
geom_smooth(color = "purple3", linetype = "solid", linewidth = 1.5) +
labs(title = "Area vs Price of Houses",
x = "Area (sq ft)",
y = "Price (USD)") +
scale_x_continuous(limits = c(0, 4000)) +
scale_y_continuous(labels = scales::dollar, limits = c(0, 5e+05)) +
theme(plot.title = element_text(color = "brown4"),
axis.title = element_text(color = "pink4")) +
my_theme
Answer: The scatter plot shows a clear positive correlation
between house area (in square feet) and price (in USD): as the house
area increases, the price tends to rise. The smoothed trend line
further supports this pattern. However, some outliers are present,
where certain houses deviate significantly from the general pattern.
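With default settings, geom_smooth() draws a flexible smoother, not a straight-line fit. If a linear regression summary of the same relationship is wanted, something like the following could be used. This is a sketch under the assumption that a simple linear model of price on area is an acceptable description; price_area_fit is a hypothetical name.

```r
library(openintro)  # provides the ames data frame

# Simple linear regression of sale price on living area.
price_area_fit <- lm(price ~ area, data = ames)

# Intercept and slope; the slope is the estimated change in price (USD)
# per additional square foot.
coef(price_area_fit)

# Share of price variation explained by area alone.
summary(price_area_fit)$r.squared
```

Passing method = "lm" to geom_smooth() would draw this fitted line on the scatter plot itself.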
Code:
ggplot(data = ames, mapping = aes(x = Bldg.Type, y = price)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
scale_y_continuous(labels = scales::dollar, limits = c(0, 6e+05)) +
labs(title = "Building Type vs Price of Houses",
x = "Type of building",
y = "Price (USD)") +
theme(plot.title = element_text(color = "brown4"),
axis.title = element_text(color = "pink4")) +
my_theme
Answer: The graph shows that single-family homes (1Fam)
and end-unit townhouses (TwnhsE) tend to have higher median prices and
more variation in prices, including several high-value outliers. In
contrast, two-family conversions (2fmCon), duplexes, and regular
townhouses (Twnhs) generally have lower median prices and less price
variation. The numerous outliers suggest that while most homes fall
within a typical price range, some are priced significantly
higher.
Code:
ggplot(data = ames, mapping = aes(x = area, y = price, color = Bldg.Type)) +
geom_point() +
geom_smooth(color = "red3") +
labs(title = "Building Type, Price, and Area of Houses",
x = "Area (sq ft)",
y = "Price (USD)") +
scale_x_continuous(limits = c(0, 4000)) +
scale_y_continuous(labels = scales::dollar, limits = c(0, 6e+05)) +
theme(plot.title = element_text(color = "brown4"),
axis.title = element_text(color = "pink4")) +
my_theme
Answer: The plot illustrates a positive correlation
between area and price, indicating that larger homes generally cost
more. Single-family homes (1Fam) are the most common in the dataset and
tend to have higher prices. Other building types, such as two-family
conversions (2fmCon), duplexes, and townhouses (Twnhs), are less
prevalent and generally fall within the lower price range. Townhouses
with end units (TwnhsE) are also less common but are priced slightly
higher than two-family conversions, duplexes, and standard townhouses.
The spread of data points suggests that even within the same building
type, prices can vary significantly based on size and other
factors.
Code:
ggplot(data = ames, aes(x = area, y = price, color = Year.Built)) +
geom_point() +
geom_smooth(color = "red3") +
scale_x_continuous(limits = c(0, 4000)) +
scale_y_continuous(labels = scales::dollar, limits = c(0, 6e+05)) +
labs(title = "Area, Price, and Year Built of Houses",
x = "Area of the house (sq ft)",
y = "Price (USD)") +
theme(plot.title = element_text(color = "brown4"),
axis.title = element_text(color = "pink4")) +
my_theme
Answer: The scatter plot shows a positive correlation
between house area and price, with larger houses tending to have higher
prices. However, some houses deviate significantly from this trend,
suggesting the influence of other factors. Additionally, the year built
appears to play a role, with older houses (darker points) more widely
dispersed across mid-to-low price levels, while newer houses (lighter
blue) tend to be priced higher, even for the same area size.