library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
The gpa data set is available through the openintro package in R. Answer the following questions with an appropriate graph, and summarize your findings in plain text for each graph.
gpa: The student's grade point average, measured on a standard scale.
studyweek: The number of hours per week the student studies outside of class.
sleepnight: The average number of hours the student sleeps per night.
out: The number of nights per week the student goes out.
gender: The student's gender, a factor with levels female and male.
ggplot(data=gpa, aes(x = studyweek, y = gpa)) +
geom_point(alpha = 0.7) +
geom_smooth() +
labs(
title = "Relationship between Weekly Study Hours and GPA",
x = "Hours Studied per Week",
y = "GPA"
) +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Conclusion: The graph indicates a weak positive relationship between weekly study hours and GPA. As the number of hours studied per week increases, GPA tends to increase slightly, but the trend is not strong.
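The "weak positive" reading above can be checked numerically with a correlation coefficient. The following is a minimal sketch, not part of the original analysis: `describe_cor` is a hypothetical helper, and the 0.3 and 0.7 cutoffs are a common rule of thumb assumed here rather than a fixed standard.

```r
# Hypothetical helper: Pearson correlation plus a rough verbal label.
# The 0.3 / 0.7 thresholds are an assumed rule of thumb.
describe_cor <- function(x, y) {
  r <- cor(x, y, use = "complete.obs")  # drop pairs with missing values
  strength <- if (abs(r) < 0.3) {
    "weak"
  } else if (abs(r) < 0.7) {
    "moderate"
  } else {
    "strong"
  }
  list(r = r, strength = strength)
}

# Apply to the gpa data only when openintro is installed.
if (requireNamespace("openintro", quietly = TRUE)) {
  print(with(openintro::gpa, describe_cor(studyweek, gpa)))
}
```

A correlation only measures linear association, so it complements the smoothed curve in the plot rather than replacing it.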
ggplot(gpa, aes(x = out, y = gpa)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm") +
labs(
title = "Relationship between Nights Out per Week and GPA",
x = "Nights Out per Week",
y = "GPA"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Conclusion: The graph indicates a very weak positive relationship between the number of nights out per week and GPA. Although GPA appears to increase slightly as the number of nights out increases, the wide spread of the data points suggests that this relationship is not strong.
ggplot(gpa, aes(x = as.factor(out), y = sleepnight)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Hours of Sleep per Night by Number of Nights Out per Week",
x = "Nights Out per Week",
y = "Hours of Sleep per Night"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The plot shows no clear relationship between how many nights students go out per week and how many hours they sleep per night. Students with different numbers of nights out have similar sleep hours, so going out more does not clearly mean sleeping less or more.
ggplot(gpa, aes(x = gender, y = studyweek)) +
geom_boxplot(fill = "lightgreen") +
labs(
title = "Weekly Study Hours by Gender",
x = "Gender",
y = "Hours Studied per Week"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The boxplot shows that male and female students have similar weekly study hours. Although there are small differences in the medians, the distributions overlap a lot. This suggests that gender is not strongly related to how much students study each week.
ggplot(gpa, aes(x = gender, y = out)) +
geom_boxplot(fill = "yellow") +
labs(
title = "Nights Out per Week by Gender",
x = "Gender",
y = "Nights Out per Week"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The boxplot shows that male students tend to go out slightly more nights per week than female students. However, the difference is small and the distributions overlap, so gender is not strongly associated with how often students go out.
Finish the following data visualization tasks using the full loans_full_schema data set (55 columns) in the openintro package. For each task, summarize what you learn from the graph accurately and concisely.
ggplot(data = loans_full_schema, aes(x = loan_amount)) +
geom_histogram(aes(y = after_stat(density)),
boundary = 0, bins = 30, color = "black", fill = "white") +
geom_density(linewidth = 1.3) +
labs(
title = "Histogram and Density of Loan Amount",
x = "Loan Amount",
y = "Density"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The histogram with the density curve shows that loan amounts are right-skewed. Most loans are for smaller amounts, while a small number of loans have very large amounts. This indicates that large loans are less common.
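One quick way to support the right-skew reading is to compare the mean with the median, since a long right tail pulls the mean above the median. This is a rough sketch under that assumption; `skew_direction` is a hypothetical helper written for illustration.

```r
# Hypothetical helper: compare mean and median as a rough skew indicator.
skew_direction <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  med <- median(x)
  if (m > med) {
    "right-skewed"
  } else if (m < med) {
    "left-skewed"
  } else {
    "roughly symmetric"
  }
}

# Apply to loan_amount only when openintro is installed.
if (requireNamespace("openintro", quietly = TRUE)) {
  loans <- openintro::loans_full_schema
  print(summary(loans$loan_amount))        # quartiles and mean
  print(skew_direction(loans$loan_amount))
}
```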
ggplot(loans_full_schema, aes(x = homeownership, y = loan_amount)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Loan Amount by Homeownership Status",
x = "Homeownership Status",
y = "Loan Amount"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The boxplot shows that loan amounts differ across homeownership groups. Borrowers who own a home tend to have slightly higher loan amounts, while renters generally have lower loan amounts. However, the distributions overlap, so homeownership alone does not fully explain loan size.
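The group comparison read off the boxplot can be verified by computing the median loan amount for each homeownership group. The sketch below uses base R; `median_by_group` is a hypothetical helper, and no particular ordering of the groups is assumed in advance.

```r
# Hypothetical helper: group medians, sorted from highest to lowest.
median_by_group <- function(values, groups) {
  meds <- tapply(values, factor(groups), median, na.rm = TRUE)
  meds[order(meds, decreasing = TRUE)]
}

# Apply to the loans data only when openintro is installed.
if (requireNamespace("openintro", quietly = TRUE)) {
  loans <- openintro::loans_full_schema
  print(median_by_group(loans$loan_amount, loans$homeownership))
}
```

Wrapping the grouping variable in `factor()` drops any levels that have no rows, so empty groups do not appear as `NA` in the result.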
ggplot(loans_full_schema, aes(x = annual_income, y = loan_amount)) +
geom_bin2d(bins = 30) +
labs(
title = "2D Density Plot of Annual Income and Loan Amount",
x = "Annual Income",
y = "Loan Amount",
fill = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The heatmap shows that most loans are concentrated among borrowers with lower to moderate incomes and smaller loan amounts. As income increases, loan amounts also tend to increase, but large loans are much less common.
ggplot(loans_full_schema, aes(x = interest_rate)) +
geom_histogram(aes(y = after_stat(density)),
bins = 30, boundary = 0,
color = "black", fill = "white") +
geom_density(linewidth = 1) +
facet_wrap(~ loan_purpose) +
labs(
title = "Interest Rate Distribution by Loan Purpose",
x = "Interest Rate",
y = "Density"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The plot shows that interest rates vary by loan purpose. Some purposes have distributions centered at higher interest rates, while others are centered lower. Overall, the shapes are similar but the typical interest rate level differs across purposes.
ggplot(loans_full_schema, aes(x = annual_income, y = loan_amount)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
facet_grid(grade ~ homeownership) +
labs(
title = "Loan Amount vs Annual Income by Grade and Homeownership",
x = "Annual Income",
y = "Loan Amount"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusion: The plot suggests that the relationship between income and loan amount can differ across grade and homeownership groups. In many panels, higher income is associated with higher loan amounts, but the strength of the trend varies.
The ames data set is available through the openintro package in R.
help(ames)
## starting httpd help server ... done
Conclusion: The Ames data set contains information about houses sold in Ames, Iowa from 2006 to 2010. It includes data on 2,930 houses and many variables that describe house size, location, quality, and sale price. The data set is used to study housing prices and how different house features are related to price.
ggplot(ames, aes(x = area, y = price)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Relationship between Living Area and House Price",
x = "Living Area (square feet)",
y = "Sale Price (USD)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusion: The graph shows a positive relationship between living area and house price. Houses with larger living areas tend to have higher prices, although there is some variability. This suggests that house size is an important factor related to price.
ggplot(ames, aes(x = Bldg.Type, y = price)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "House Price by Building Type",
x = "Building Type",
y = "Sale Price (USD)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Conclusion: The boxplot shows that house prices vary by building type. Single-family homes (1Fam) generally have higher prices and a wider range of values, including many high-priced homes. Duplexes and two-family homes tend to have lower prices, while townhouses fall in between. Overall, building type is related to house price, but prices overlap across groups.
ggplot(ames, aes(x = area, y = price, color = Bldg.Type)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(
title = "House Price vs Living Area by Building Type",
x = "Living Area (square feet)",
y = "Sale Price (USD)",
color = "Building Type"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusion: The plot shows a clear positive relationship between living area and house price for all building types. As living area increases, house price generally increases. Single-family homes tend to have higher prices than other building types at similar sizes, while duplexes and townhouses usually sell for less. This suggests that both living area and building type are important factors related to house price.
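The claim that the relationship is positive for every building type can be checked by fitting a separate simple regression per group and inspecting the slopes. This is a minimal sketch; `slope_by_group` is a hypothetical helper, and the column names `area`, `price`, and `Bldg.Type` are taken from the plot code above.

```r
# Hypothetical helper: slope of y ~ x fitted separately within each group.
slope_by_group <- function(x, y, groups) {
  pieces <- split(data.frame(x = x, y = y), groups, drop = TRUE)
  sapply(pieces, function(d) coef(lm(y ~ x, data = d))[["x"]])
}

# Apply to the ames data only when openintro is installed.
if (requireNamespace("openintro", quietly = TRUE)) {
  ames <- openintro::ames
  print(slope_by_group(ames$area, ames$price, ames$Bldg.Type))
}
```

Each slope estimates the change in sale price per additional square foot within that building type, so a positive value for every group would be consistent with the conclusion.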
ggplot(ames, aes(x = area, y = price, color = Year.Built)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_color_gradient(low = "blue", high = "red") +
labs(
title = "House Price vs Living Area Colored by Year Built",
x = "Living Area (square feet)",
y = "Sale Price (USD)",
color = "Year Built"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusion: The graph shows that house price increases as living area increases. In addition, newer houses tend to have higher prices than older houses at similar sizes. This suggests that both living area and year built are positively related to house price, with newer and larger homes generally selling for more.
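A color gradient makes it hard to judge the effect of year built at a fixed living area, so a two-predictor linear model is one way to check the conclusion. The sketch below is an assumption-laden illustration rather than part of the original analysis: `fit_area_year` is a hypothetical helper, and a purely additive linear model may be too simple for these data.

```r
# Hypothetical helper: additive linear model of price on area and year built.
fit_area_year <- function(data) {
  lm(price ~ area + Year.Built, data = data)
}

# Apply to the ames data only when openintro is installed.
if (requireNamespace("openintro", quietly = TRUE)) {
  fit <- fit_area_year(openintro::ames)
  print(coef(fit))  # positive coefficients would support the conclusion
}
```

The coefficient on `Year.Built` describes the association with price while holding living area fixed, which is the comparison the conclusion makes.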