Homework 1

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

Problem 1

The gpa data set is available through the openintro package in R. Answer the following questions with an appropriate graph. Summarize your findings in plain text for each graph to answer the question.

1). By doing your own research, give the precise meaning of each variable.

gpa: The student's grade point average, measured on a standard scale.
studyweek: The number of hours per week the student studies outside of class.
sleepnight: The average number of hours the student sleeps per night.
out: The number of nights per week the student goes out.
gender: The student's gender, a factor with levels female and male.

2). Visualize the relationship between studyweek and gpa. What does your graph indicate?

ggplot(data=gpa, aes(x = studyweek, y = gpa)) +
  geom_point(alpha = 0.7) + 
  geom_smooth() +
  labs(
    title = "Relationship between Weekly Study Hours and GPA",
    x = "Hours Studied per Week",
    y = "GPA"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Conclusion: The graph indicates a weak positive relationship between weekly study hours and GPA. As the number of hours studied per week increases, GPA tends to increase slightly, but the trend is not strong.
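As a rough numeric check on the "weak positive" reading of the scatterplot, the sketch below computes the Pearson correlation between the two variables. The name r_study is introduced here for illustration only, and the correlation measures linear association only, so it complements the loess curve rather than replacing it.

```r
library(openintro)  # provides the gpa data set

# Pearson correlation between weekly study hours and GPA.
# Values near 0 suggest a weak linear association; values near 1 a strong one.
r_study <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")
round(r_study, 2)
```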

3). Visualize the relationship between out and gpa. What does your graph indicate?

ggplot(gpa, aes(x = out, y = gpa)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  labs(
    title = "Relationship between Nights Out per Week and GPA",
    x = "Nights Out per Week",
    y = "GPA"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) 
## `geom_smooth()` using formula = 'y ~ x'

Conclusion: The graph indicates a very weak positive relationship between the number of nights out per week and GPA. Although GPA appears to increase slightly as the number of nights out increases, the wide spread of the data points suggests that this relationship is not strong.
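Because out takes only a small number of distinct values, a rank-based measure may be a reasonable companion to the fitted line. The sketch below runs a Spearman correlation test; the object name ct_out is hypothetical, and choosing Spearman here is an assumption of this example, not something the assignment requires.

```r
library(openintro)

# Spearman rank correlation between nights out and GPA.
# exact = FALSE uses the large-sample approximation, which avoids the
# warning R gives when an exact p-value cannot be computed with ties.
ct_out <- cor.test(gpa$out, gpa$gpa, method = "spearman", exact = FALSE)
ct_out$estimate  # rank correlation (rho)
ct_out$p.value   # p-value for the null hypothesis of no monotone association
```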

4). Visualize the relationship between out and sleepnight. What does your graph indicate?

ggplot(gpa, aes(x = as.factor(out), y = sleepnight)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Hours of Sleep per Night by Number of Nights Out per Week",
    x = "Nights Out per Week",
    y = "Hours of Sleep per Night"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Conclusion: The plot shows that there is no clear relationship between how many nights students go out per week and how many hours they sleep per night. Students with different numbers of nights out have similar sleep hours, so going out more does not clearly mean sleeping less or more.
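Boxplots do not show how many students fall in each group, and some values of out may have very few observations. A minimal sketch of a per-group summary that makes the group sizes and medians explicit (sleep_by_out is a name chosen for this example):

```r
library(dplyr)
library(openintro)

# Group size and median sleep for each value of nights out per week.
sleep_by_out <- gpa %>%
  group_by(out) %>%
  summarise(
    n = n(),
    median_sleep = median(sleepnight),
    .groups = "drop"
  )
sleep_by_out
```

Groups with only one or two students would produce boxes that are not very informative, so the n column helps decide how much weight to give each box.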

5). Visualize the relationship between gender and studyweek. What does your graph indicate?

ggplot(gpa, aes(x = gender, y = studyweek)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Weekly Study Hours by Gender",
    x = "Gender",
    y = "Hours Studied per Week"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Conclusion: The boxplot shows that male and female students have similar weekly study hours. Although there are small differences in the medians, the distributions overlap a lot. This suggests that gender is not strongly related to how much students study each week.

6). Visualize the relationship between gender and out. What does your graph indicate?

ggplot(gpa, aes(x = gender, y = out)) +
  geom_boxplot(fill = "yellow") +
  labs(
    title = "Nights Out per Week by Gender",
    x = "Gender",
    y = "Nights Out per Week"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Conclusion: The boxplot shows that male students tend to go out slightly more nights per week than female students. However, the difference is small and the distributions overlap, so gender is not strongly associated with how often students go out.
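The two gender comparisons (study hours and nights out) can be backed by the same kind of numeric summary. The sketch below uses dplyr's across() to compute the median and interquartile range of both variables by gender; by_gender and the suffixes median and iqr are names chosen for this example.

```r
library(dplyr)
library(openintro)

# Median and IQR of study hours and nights out, split by gender.
# across() applies each function in the list to each selected column and
# names the results <column>_<function>, e.g. studyweek_median.
by_gender <- gpa %>%
  group_by(gender) %>%
  summarise(
    n = n(),
    across(c(studyweek, out), list(median = median, iqr = IQR)),
    .groups = "drop"
  )
by_gender
```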

Problem 2

Finish the following data visualization tasks using the full loans_full_schema data set (55 columns) in openintro library. For each task, you need to summarize what you learn from the graph accurately and concisely.

1). Create a histogram of a numeric variable that you select and plot a density curve on top of the histogram. Carefully select bin numbers/sizes/boundaries to make the plot informative. What does this graph indicate?

ggplot(data = loans_full_schema, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)),
                 boundary = 0, bins = 30, color = "black", fill = "white") +
  geom_density(linewidth = 1.3) + 
  labs(
    title = "Histogram and Density of Loan Amount",
    x = "Loan Amount",
    y = "Density"
  ) + 
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Conclusion: The histogram with the density curve shows that loan amounts are right-skewed. Most loans are for smaller amounts, while a small number of loans have very large amounts. This indicates that large loans are less common.
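One simple way to support the right-skew reading is to compare the mean and the median: in a right-skewed distribution the mean is typically pulled above the median by the large values. This is a quick heuristic check, not a formal skewness test, and skew_check is a name introduced for this sketch.

```r
library(openintro)

amt <- loans_full_schema$loan_amount

# Compare the two measures of center; a mean noticeably above the median
# is consistent with a long right tail.
skew_check <- c(mean = mean(amt, na.rm = TRUE),
                median = median(amt, na.rm = TRUE))
skew_check
```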

2). Create a graph to study the effect of a categorical/discrete variable on the distributions of a numeric variable. What does this graph indicate?

ggplot(loans_full_schema, aes(x = homeownership, y = loan_amount)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Loan Amount by Homeownership Status",
    x = "Homeownership Status",
    y = "Loan Amount"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Conclusion: The boxplot shows that loan amounts differ across homeownership groups. Borrowers who own a home tend to have slightly higher loan amounts, while renters generally have lower loan amounts. However, the distributions overlap, so homeownership alone does not fully explain loan size.

3). Create a bin heatmap (2d density plot) to study the relationship between two numeric variables that you select. Summarize the findings from the graph.

ggplot(loans_full_schema, aes(x = annual_income, y = loan_amount)) +
  geom_bin2d(bins = 30) +
  labs(
    title = "2D Density Plot of Annual Income and Loan Amount",
    x = "Annual Income",
    y = "Loan Amount",
    fill = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Conclusion: The heatmap shows that most loans are concentrated among borrowers with lower to moderate incomes and smaller loan amounts. As income increases, loan amounts also tend to increase, but large loans are much less common.
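Income variables are often strongly right-skewed, in which case most bins pile up at the left edge of the heatmap. A possible variation, assuming that is the case here, puts annual income on a log scale. Rows with zero income are dropped first because log10(0) is undefined; loans_pos and p_log are hypothetical names used only in this sketch.

```r
library(ggplot2)
library(openintro)

# Keep only strictly positive incomes so the log scale is defined.
loans_pos <- subset(loans_full_schema, annual_income > 0)

p_log <- ggplot(loans_pos, aes(x = annual_income, y = loan_amount)) +
  geom_bin2d(bins = 30) +
  scale_x_log10() +  # spreads out the crowded low-income region
  labs(
    title = "Annual Income (log scale) and Loan Amount",
    x = "Annual Income (log10 scale)",
    y = "Loan Amount",
    fill = "Count"
  ) +
  theme_minimal()
p_log
```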

4). Use facet_wrap to create an informative plot. Summarize the findings from the graph.

ggplot(loans_full_schema, aes(x = interest_rate)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, boundary = 0,
                 color = "black", fill = "white") +
  geom_density(linewidth = 1) +
  facet_wrap(~ loan_purpose) +
  labs(
    title = "Interest Rate Distribution by Loan Purpose",
    x = "Interest Rate",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Conclusion: The plot shows that interest rates vary by loan purpose. Some purposes have distributions centered at higher interest rates, while others are centered lower. Overall, the shapes are similar but the typical interest rate level differs across purposes.
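By default the panels appear in the factor's existing level order, which makes it hard to see which purposes sit higher or lower. A sketch of one option is to reorder the facets by median interest rate with forcats::fct_reorder(); the names purpose_ordered, loans_ordered, and p_purpose are introduced here for illustration.

```r
library(ggplot2)
library(forcats)
library(openintro)

# Reorder loan purposes from lowest to highest median interest rate.
purpose_ordered <- fct_reorder(loans_full_schema$loan_purpose,
                               loans_full_schema$interest_rate,
                               .fun = median)

loans_ordered <- loans_full_schema
loans_ordered$purpose_ordered <- purpose_ordered

p_purpose <- ggplot(loans_ordered, aes(x = interest_rate)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, boundary = 0,
                 color = "black", fill = "white") +
  facet_wrap(~ purpose_ordered) +  # panels now read low to high
  labs(x = "Interest Rate", y = "Density") +
  theme_minimal()
p_purpose
```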

5). Use facet_grid to create an informative plot. Summarize the findings from the graph.

ggplot(loans_full_schema, aes(x = annual_income, y = loan_amount)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_grid(grade ~ homeownership) +
  labs(
    title = "Loan Amount vs Annual Income by Grade and Homeownership",
    x = "Annual Income",
    y = "Loan Amount"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
## `geom_smooth()` using formula = 'y ~ x'

Conclusion: The plot suggests that the relationship between income and loan amount can differ across grade and homeownership groups. In many panels, higher income is associated with higher loan amounts, but the strength of the trend varies.
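To put numbers on "the strength of the trend varies", the correlation can be computed separately for each panel. The sketch below is one way to do that; panel_cor, n_loans, and r are names chosen for this example, and the cutoff of more than two loans per panel is an arbitrary guard against panels too small for a meaningful correlation.

```r
library(dplyr)
library(openintro)

# One row per facet panel: number of loans and income/loan correlation.
panel_cor <- loans_full_schema %>%
  group_by(grade, homeownership) %>%
  summarise(
    n_loans = n(),
    r = if (n() > 2) cor(annual_income, loan_amount) else NA_real_,
    .groups = "drop"
  )
panel_cor
```

Panels with few loans will have unstable correlations, so r is best read together with n_loans.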

Problem 3

The ames data set is available through the openintro package in R.

1). Write an introductory paragraph to the data set which provides the basic information - what the data set is about; the number of samples and features; the scope that the features cover.

help(ames)
## starting httpd help server ... done

Conclusion: The Ames data set contains information about houses sold in Ames, Iowa from 2006 to 2010. It includes 2,930 houses (rows) and 82 variables (columns) that describe house size, location, quality, and sale price. The data set is used to study housing prices and how different house features are related to price.
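The number of samples and features can be read directly from the data rather than from the help page. A minimal sketch follows; ames_dim is a name introduced here, and the Yr.Sold column name is assumed from the data set's documentation.

```r
library(openintro)

# Rows are houses (samples); columns are variables (features).
ames_dim <- dim(ames)
ames_dim

# Range of sale years, assuming the sale year is stored in Yr.Sold.
range(ames$Yr.Sold)
```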

2). Use a plot to analyze how area correlates with price. Summarize your finding from the graph.

ggplot(ames, aes(x = area, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Relationship between Living Area and House Price",
    x = "Living Area (square feet)",
    y = "Sale Price (USD)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
## `geom_smooth()` using formula = 'y ~ x'

Conclusion: The graph shows a positive relationship between living area and house price. Houses with larger living areas tend to have higher prices, although there is some variability. This suggests that house size is an important factor related to price.
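The line drawn by geom_smooth(method = "lm") corresponds to a simple linear regression of price on area. Fitting that model directly, as sketched below, gives the slope (estimated price change per additional square foot) and the share of price variation explained by area alone. The name fit_area is hypothetical.

```r
library(openintro)

# Simple linear regression: the same line that geom_smooth(method = "lm") draws.
fit_area <- lm(price ~ area, data = ames)

coef(fit_area)["area"]       # estimated change in price per extra square foot
summary(fit_area)$r.squared  # proportion of price variance explained by area
```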

3). Use a plot to analyze how Bldg.Type correlates with price. Explain the meaning of each label for Bldg.Type and summarize your finding from the graph.

ggplot(ames, aes(x = Bldg.Type, y = price)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "House Price by Building Type",
    x = "Building Type",
    y = "Sale Price (USD)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

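Conclusion: The Bldg.Type labels are 1Fam (single-family detached house), 2fmCon (two-family conversion, originally built as a one-family dwelling), Duplex (duplex), Twnhs (townhouse inside unit), and TwnhsE (townhouse end unit). The boxplot shows that house prices vary by building type. Single-family homes (1Fam) generally have higher prices and a wider range of values, including many high-priced homes. Duplexes and two-family conversions tend to have lower prices, while townhouses fall in between. Overall, building type is related to house price, but prices overlap across groups.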

4). Use a plot to analyze how Bldg.Type and area together correlate with price. Summarize your finding from the graph.

ggplot(ames, aes(x = area, y = price, color = Bldg.Type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(
    title = "House Price vs Living Area by Building Type",
    x = "Living Area (square feet)",
    y = "Sale Price (USD)",
    color = "Building Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
## `geom_smooth()` using formula = 'y ~ x'

Conclusion: The plot shows a clear positive relationship between living area and house price for all building types. As living area increases, house price generally increases. Single-family homes tend to have higher prices than other building types at similar sizes, while duplexes and townhouses usually sell for less. This suggests that both living area and building type are important factors related to house price.
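Drawing one fitted line per building type amounts to letting both the intercept and the slope differ by type. One way to check whether the separate slopes matter is to compare a model with a common slope against a model with an area-by-type interaction, as in the sketch below. The names fit_add and fit_int are introduced for illustration, and the F-test relies on the usual linear-model assumptions.

```r
library(openintro)

# Common slope for area, with a different intercept per building type.
fit_add <- lm(price ~ area + Bldg.Type, data = ames)

# Interaction model: a separate area slope for each building type,
# matching the per-color lines in the plot.
fit_int <- lm(price ~ area * Bldg.Type, data = ames)

# F-test comparing the two nested models.
anova(fit_add, fit_int)
```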

5). (Bonus - 5 Points) You may need to self-study to fulfill this task: use a plot to study how area and Year.Built together correlate with price. Summarize your finding from the graph.

ggplot(ames, aes(x = area, y = price, color = Year.Built)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "House Price vs Living Area Colored by Year Built",
    x = "Living Area (square feet)",
    y = "Sale Price (USD)",
    color = "Year Built"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
## `geom_smooth()` using formula = 'y ~ x'

Conclusion: The graph shows that house price increases as living area increases. In addition, newer houses tend to have higher prices than older houses at similar sizes. This suggests that both living area and year built are positively related to house price, with newer and larger homes generally selling for more.
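A continuous color gradient can be hard to read when points overlap. An alternative sketch, offered as one option and not as the required approach, splits Year.Built into three groups of roughly equal size with ggplot2::cut_number() and shows one panel per group. The names ames_era, era, and p_era are hypothetical, and three groups is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)
library(openintro)

# Bin construction year into three groups with about the same number of houses.
ames_era <- ames %>%
  mutate(era = cut_number(Year.Built, n = 3))

p_era <- ggplot(ames_era, aes(x = area, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ era) +  # one panel per construction period
  labs(
    x = "Living Area (square feet)",
    y = "Sale Price (USD)"
  ) +
  theme_minimal()
p_era
```

If newer houses sell for more at similar sizes, the fitted line should sit higher in the panels for later construction periods.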