Loading Libraries:

knitr::opts_chunk$set(fig.align = "center", out.width = "80%", message = FALSE, 
                      warning = FALSE, cache = TRUE)
library(openintro)
library(tidyverse)
library(ggridges)
library(nycflights13)

Global Theme Set Up:

my_theme <- theme(
  plot.title = element_text(size = rel(1.8), hjust = 0.5),
  axis.title = element_text(size = rel(1.5), hjust = 0.5,
                            margin = margin(15,15,15,15)),
  axis.title.x = element_text(margin = margin(10,5,5,5)),
  axis.title.y = element_text(margin = margin(5,10,5,5)),
  axis.text = element_text(size = rel(1.1)))

Part I: Data Visualization and Exploration Tasks with GPA Data Set


1. By doing your own research, give the precise meaning of each variable.


Answer: GPA (Grade Point Average) is a numerical summary of a student’s academic performance. Studyweek is the number of hours a student studies per week, while sleepnight is the number of hours they sleep each night. Out is the number of nights a student goes out per week. Lastly, gender specifies whether the student is male or female.

2. Visualize the relationship between studyweek and gpa. What does your graph indicate?


Code:

ggplot(data = gpa, mapping = aes(x = studyweek, y = gpa)) +
  geom_point(size = rel(1.2)) +
  geom_smooth(linewidth = 1.2, color = "skyblue3", linetype = "dashed") +
  labs(title = "The Relation Between GPA and Study Hours",
       x = "Study hours per week",
       y = "GPA") +
  ylim(2.5, 5) +
  theme(plot.title = element_text(color = "blue4"),
        axis.title = element_text(color = "purple4")) +
  my_theme

Answer: The data points appear widely scattered, suggesting no strong correlation between study hours and GPA. While some students with fewer study hours achieve high GPAs, others who study more still have varied GPA outcomes. The distribution indicates that other factors beyond study hours may significantly influence GPA.
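
As a rough numerical check on the visual impression, the correlation between the two variables can be computed directly. This is a minimal sketch that assumes the `gpa` data frame from `openintro`; the object names `r_study_gpa` and `rho_study_gpa` are illustrative only.

```r
library(openintro)

# Pearson correlation between weekly study hours and GPA;
# use = "complete.obs" drops rows where either value is missing.
r_study_gpa <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")

# Spearman rank correlation is less sensitive to a few very large
# study-hour values than Pearson's coefficient.
rho_study_gpa <- cor(gpa$studyweek, gpa$gpa,
                     use = "complete.obs", method = "spearman")

round(c(pearson = r_study_gpa, spearman = rho_study_gpa), 3)
```

Values near 0 would be consistent with the widely scattered points in the plot.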

3. Visualize the relationship between out and gpa. What does your graph indicate?


Code:

ggplot(data = gpa, mapping = aes(x = out, y = gpa)) +
  geom_point(size = rel(1.2)) +
  geom_smooth(linewidth = 1.2, color = "skyblue3", linetype = "dashed") +
  labs(title = "The Relation Between GPA and Happy Nights",
       x = "Going-out nights (per week)",
       y = "GPA") +
  theme(plot.title = element_text(color = "blue4"),
        axis.title = element_text(color = "purple4")) +
  my_theme

Answer: The data points are widely dispersed in the plot, indicating no strong correlation between going-out frequency and GPA. The plot shows that students with high GPAs are found across all levels of social activity, suggesting that moderate socializing does not necessarily harm academic performance. However, individual variations imply that other factors may contribute to GPA outcomes.

4. Visualize the relationship between out and sleepnight. What does your graph indicate?


Code:

ggplot(data = gpa, mapping = aes(x = out, y = sleepnight)) +
  geom_point(size = rel(1.2)) +
  geom_smooth(linewidth = 1.2, color = "skyblue3", linetype = "dashed") +
  labs(title = "Going-Out Nights vs. Hours of Sleep",
       x = "Going-out nights (per week)",
       y = "Hours of sleep (per night)") +
  theme(plot.title = element_text(color = "blue4"),
        axis.title = element_text(color = "purple4")) +
  my_theme

Answer: The plot suggests a positive association between the two variables: students who go out more nights per week tend to sleep more hours per night. Further statistical analysis, such as a correlation test, is recommended to validate this trend.
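
One possible form of that further analysis is a correlation test. The sketch below assumes the `gpa` data frame from `openintro`; choosing Kendall's tau is a judgment call made here because `out` takes only a few distinct values, and the name `out_sleep_test` is illustrative.

```r
library(openintro)

# Test whether the association between nights out and nightly sleep
# differs from zero. exact = FALSE requests the normal approximation,
# which is needed when the data contain ties.
out_sleep_test <- cor.test(gpa$out, gpa$sleepnight,
                           method = "kendall", exact = FALSE)

out_sleep_test$estimate  # tau: sign and strength of the association
out_sleep_test$p.value   # small values are evidence against "no association"
```

A positive tau with a small p-value would support the trend read from the plot; a p-value that is not small would mean the apparent trend could be noise.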

5. Visualize the relationship between gender and studyweek. What does your graph indicate?


Code:

ggplot(data = gpa, mapping = aes(x = gender, y = studyweek)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Which Gender Studies More?",
       x = "Gender",
       y = "Study hours (per week)") +
  theme(plot.title = element_text(color = "blue4"),
        axis.title = element_text(color = "purple4")) +
  my_theme

Answer: The plot shows that the median weekly study hours for male and female students are approximately the same. However, female students have a wider range of study hours, ranging from around 4 to 50 hours per week, whereas male students range from 2 to 30 hours. Additionally, there is an outlier among male students who studied over 40 hours per week.
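
The points drawn as outliers follow the usual 1.5 × IQR rule. A small sketch of that rule, assuming the `gpa` data frame from `openintro` (the helper `flag_outliers()` is hypothetical and written only for this illustration):

```r
library(openintro)

# Return the values lying more than 1.5 * IQR beyond the quartiles,
# which is the default rule geom_boxplot() uses to draw outlier points.
flag_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Weekly study hours flagged as outliers, separately for each gender
study_outliers <- tapply(gpa$studyweek, gpa$gender, flag_outliers)
study_outliers
```

The result lists, per gender, the study-hour values that the box plot marks as individual points.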

6. Visualize the relationship between gender and out. What does your graph indicate?


Code:

ggplot(data = gpa, mapping = aes(x = gender, y = out)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Which Gender Goes Out More?",
       x = "Gender",
       y = "Going-out nights (per week)") +
  theme(plot.title = element_text(color = "blue4"),
        axis.title = element_text(color = "purple4")) +
  my_theme

Answer: The box plot compares the number of going-out nights per week between males and females, showing similar distributions. Females have a median of around 2 nights, with an interquartile range (IQR) from 1 to 3 and a total range of 0 to 4. Males have a slightly higher median of about 2.5 nights, with an IQR between 2 and 3 and a range from 1 to 4. While males tend to go out slightly more on average, the overall variability and distribution are similar between the two groups.
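
The medians and quartiles quoted above are read off the plot; they can also be computed directly. A minimal sketch, assuming the `gpa` data frame from `openintro` (the name `out_quartiles` is illustrative):

```r
library(openintro)

# Minimum, quartiles, and maximum of nights out per week for each gender.
# quantile() with its default settings matches the quartile definition
# that geom_boxplot() uses for the box.
out_quartiles <- tapply(gpa$out, gpa$gender, quantile,
                        probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
out_quartiles
```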

Part II: Data Visualization Tasks with Loans_full_schema Data Set


1. Create a histogram of a numeric variable that you select and plot a density curve on top of the histogram. Carefully select bin numbers, sizes, and boundaries to make the plot informative. What does this graph indicate?


Code:

ggplot(data = loans_full_schema, mapping = aes(x = total_credit_limit)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, boundary = 0,
                 na.rm = TRUE, fill = "orange2", color = "orange3") +
  geom_density(linewidth = 1.2, adjust = 30/20, na.rm = TRUE, color = "green4") +
  scale_x_continuous(labels = scales::dollar, limits = c(0, 1000000)) +
  labs(title = "Total Available Credit of Lending Club Borrowers",
       x = "Total available credit (USD)",
       y = "Density") +
  theme(plot.title = element_text(color = "green4"),
        axis.title = element_text(color = "orange4")) +
  my_theme

Answer: The graph presents a right-skewed distribution of total available credit from credit cards, excluding mortgages. A few individuals have exceptionally high credit limits, ranging between $2 million and $3 million. To enhance clarity and detail, the upper limit of the plotting range is set at $1 million.
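
One caveat about setting the upper limit this way: `limits` on a scale removes out-of-range observations before the histogram and density are computed, whereas `coord_cartesian()` only zooms the view. The sketch below contrasts the two, assuming `loans_full_schema` from `openintro`; the object names and the `binwidth = 5e4` choice are illustrative, not taken from the plot above.

```r
library(openintro)
library(ggplot2)

base_hist <- ggplot(loans_full_schema, aes(x = total_credit_limit)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5e4, boundary = 0)

# Scale limits: observations above $1M are dropped before the
# density is computed, so the bars are rescaled to the remaining data.
p_scale <- base_hist +
  scale_x_continuous(labels = scales::dollar, limits = c(0, 1e6))

# coord_cartesian(): every observation is kept; only the view is zoomed.
p_zoom <- base_hist +
  scale_x_continuous(labels = scales::dollar) +
  coord_cartesian(xlim = c(0, 1e6))

# Number of observations that the scale limits discard
n_dropped <- sum(loans_full_schema$total_credit_limit > 1e6, na.rm = TRUE)
n_dropped
```

If `n_dropped` is small relative to the sample, the two plots look nearly identical, and the scale-limit approach is a reasonable simplification.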

2. Create a graph to study the effect of a categorical/discrete variable on the distributions of a numeric variable. What does this graph indicate?


Code:

ggplot(data = loans_full_schema, mapping = aes(x = grade, y = paid_total)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 10000), labels = scales::dollar) +
  labs(title = "The Relation Between Grade and Total Paid",
       x = "Grade level for loan",
       y = "Total loan paid (USD)") +
  theme(plot.title = element_text(color = "green4"),
        axis.title = element_text(color = "orange4")) +
  my_theme

Answer: The box plot depicts the relationship between loan grade levels (A to G) and the total amount paid on loans in USD. The data indicates that lower-grade loans (F and G) generally have higher median total payments compared to higher-grade loans (A to E). Additionally, the plot reveals greater variability and a higher number of outliers in higher-grade loans. To my knowledge, this suggests that riskier loans (lower grades) tend to have higher repayment amounts, possibly due to higher interest rates or longer repayment terms.
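
The comparison of medians can be checked against the full data. This is a sketch assuming `loans_full_schema` from `openintro`, with the illustrative name `paid_by_grade`. Note that the plot sets `limits = c(0, 10000)` on the y scale, which drops loans with larger totals before the box statistics are computed, so the medians in the figure may differ from these.

```r
library(openintro)

# Median total amount paid for each loan grade, using every loan
# (no truncation at $10,000).
paid_by_grade <- aggregate(paid_total ~ grade,
                           data = loans_full_schema, FUN = median)
paid_by_grade

# Share of loans whose total paid exceeds the plotted upper limit
mean(loans_full_schema$paid_total > 10000, na.rm = TRUE)
```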

3. Create a bin heatmap (2d density plot) to study the relationship between two numeric variables that you select. Summarize the findings from the graph.


Code:

ggplot(data = loans_full_schema, mapping = aes(x = balance, y = paid_total)) +
  geom_bin_2d() +
  geom_smooth(color = "gold2", linewidth = 1.5) +
  labs(title = "The Relation Between Balance and Total Paid",
       x = "Loan balance (USD)",
       y = "Total loan paid (USD)") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  theme(plot.title = element_text(color = "green4"),
        axis.title = element_text(color = "orange4")) +
  my_theme

Answer: The graph illustrates a weak positive correlation between loan balance and total loan paid. It suggests that individuals with lower loan balances exhibit a wider range of total payments and generally pay less. In contrast, those with higher loan balances tend to have higher total loan payments.

4. Use facet_wrap to create an informative plot. Summarize the findings from the graph.


Code:

ggplot(data = loans_full_schema) +
  geom_point(mapping = aes(x = installment, y = annual_income)) +
  facet_wrap(~ homeownership) +
  scale_y_continuous(limits = c(0, 1500000), labels = scales::dollar) +
  labs(title = "The Relation of Installment, Annual Income, and Homeownership",
       x = "Monthly payment for loan",
       y = "Annual income (USD)") +
  theme(plot.title = element_text(color = "green4"),
        axis.title = element_text(color = "orange4")) +
  my_theme

Answer: The plot shows no notable difference between the three homeownership categories: mortgage, own, and rent. While some outliers exist, most individuals, regardless of homeownership type, pay similar installment amounts within the $0 to $1,500 range. Moreover, higher income does not necessarily correspond to higher installment payments.

5. Use facet_grid to create an informative plot. Summarize the findings from the graph.


Code:

ggplot(data = loans_full_schema) +
  geom_point(mapping = aes(x = total_debit_limit, y = num_open_cc_accounts)) +
  facet_grid(grade ~ homeownership) +
  scale_x_continuous(limits = c(0, 3e+5), labels = scales::dollar) +
  labs(title = "Total Limits & Card Accounts vs Grade & Homeownership",
       x = "Total limit of all bank cards (USD)",
       y = "Number of open credit card accounts") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.8), color = "green4",
                                  margin = margin(5,10,5,5)),
        axis.title = element_text(hjust = 0.5, size = rel(1.4), color = "orange4"),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(0.9)),
        panel.spacing = unit(1, "lines"),
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"),
        axis.text.x = element_text(size = rel(0.9)))

Answer: The graph shows that the higher credit grades (A-C) are associated with higher credit limits and more credit card accounts, regardless of homeownership type, while lower credit grades (D-G) tend to have lower limits and fewer accounts. Interestingly, outliers exist in all categories, indicating that some individuals have exceptionally high credit limits or numerous accounts, even in the lower credit grades.

Part III: Data Visualization and Exploration Tasks with AMES Data Set



1. Write an introductory paragraph to the data set which provides the basic information - what the data set is about; the number of samples and features; the scope that the features cover.


Answer: The dataset contains property assessment data from the Ames Assessor’s Office, describing residential properties sold in Ames, IA, between 2006 and 2010. It comprises 2,930 rows and 82 variables. The variables cover property features, building characteristics, size, utilities, rooms, exterior, location, and sale information, with several attributes in each of these groups.


2. Use a plot to analyze how area correlates with price. Summarize your finding from the graph.


Code:

ggplot(data = ames, mapping = aes(x = area, y = price)) +
  geom_point(color = "purple4", fill = "pink3", shape = 21, size = rel(2)) +
  geom_smooth(color = "purple3", linetype = "solid", linewidth = 1.5) +
  labs(title = "Area vs Price of Houses",
       x = "Area (sq ft)",
       y = "Price (USD)") +
  scale_x_continuous(limits = c(0, 4000)) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 5e+05)) +
  theme(plot.title = element_text(color = "brown4"),
        axis.title = element_text(color = "pink4")) +
  my_theme

Answer: The scatter plot shows a clear positive correlation between house area (in square feet) and price (in USD): as the house area increases, the price tends to rise. The fitted smooth curve further supports this trend. However, some outliers are present, where certain houses deviate significantly from the general pattern.
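
A simple linear fit puts a number on this trend. The sketch below assumes the `ames` data frame from `openintro`; a straight line is a simplifying assumption made here (the curve in the plot is the default smoother of `geom_smooth()`), and `fit_area` is an illustrative name.

```r
library(openintro)

# Ordinary least squares fit of sale price on living area
fit_area <- lm(price ~ area, data = ames)

coef(fit_area)               # intercept, and USD per additional square foot
summary(fit_area)$r.squared  # share of price variance explained by area alone
```

The slope gives the average price change associated with one extra square foot, and the R-squared indicates how much of the spread in prices is left to other factors.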

3. Use a plot to analyze how Bldg.Type correlates with price. Explain the meaning of each label for Bldg.Type and summarize your finding from the graph.


Code:

ggplot(data = ames, mapping = aes(x = Bldg.Type, y = price)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 6e+05)) +
  labs(title = "Building Type vs Price of Houses",
       x = "Type of building",
       y = "Price (USD)") +
  theme(plot.title = element_text(color = "brown4"),
        axis.title = element_text(color = "pink4")) + 
  my_theme 

Answer: The labels are 1Fam (single-family detached home), 2fmCon (two-family conversion, originally built as a one-family dwelling), Duplex (duplex), Twnhs (townhouse inside unit), and TwnhsE (townhouse end unit). The graph shows that single-family homes (1Fam) and end-unit townhouses (TwnhsE) tend to have higher median prices and more variation in prices, including several high-value outliers. In contrast, two-family conversions (2fmCon), duplexes, and inside-unit townhouses (Twnhs) generally have lower median prices and less price variation. The numerous outliers suggest that while most homes fall within a typical price range, some are priced significantly higher.

4. Use a plot to analyze how Bldg.Type and area altogether correlates with price. Summarize your finding from the graph.


Code:

ggplot(data = ames, mapping = aes(x = area, y = price, color = Bldg.Type)) +
  geom_point() +
  geom_smooth(color = "red3") +
  labs(title = "Building Type, Price, and Area of Houses",
       x = "Area (sq ft)",
       y = "Price (USD)") +
  scale_x_continuous(limits = c(0, 4000)) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 6e+05)) +
  theme(plot.title = element_text(color = "brown4"),
        axis.title = element_text(color = "pink4")) +
  my_theme

Answer: The plot illustrates a positive correlation between area and price, indicating that larger homes generally cost more. Single-family homes (1Fam) are the most common in the dataset and tend to have higher prices. Other building types, such as two-family conversions (2fmCon), duplexes, and townhouses (Twnhs), are less prevalent and generally fall within the lower price range. Townhouses with end units (TwnhsE) are also less common but are priced slightly higher than two-family conversions, duplexes, and standard townhouses. The spread of data points suggests that even within the same building type, prices can vary significantly based on size and other factors.
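
The statements about how common and how expensive each building type is can be backed by simple counts and medians. A minimal sketch, assuming the `ames` data frame from `openintro` (object names are illustrative):

```r
library(openintro)

# Number of sales for each building type
bldg_counts <- table(ames$Bldg.Type)
sort(bldg_counts, decreasing = TRUE)

# Median sale price for each building type
bldg_median_price <- tapply(ames$price, ames$Bldg.Type, median)
sort(bldg_median_price, decreasing = TRUE)
```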

5. You may need to self-study to fulfill this task: use a plot to study how area and Year.Built together correlates with price. Summarize your finding from the graph. You are allowed to ask AI to give you hints about the plot types that you can use. But you are not allowed to ask AI to generate codes or give function names directly.


Code:

ggplot(data = ames, aes(x = area, y = price, color = Year.Built)) +
  geom_point() +
  geom_smooth(color = "red3") +
  scale_x_continuous(limits = c(0, 4000)) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 6e+05)) +
  labs(title = "Area, Price, and Year Built of Houses",
       x = "Area of the house (sq ft)",
       y = "Price (USD)") +
  theme(plot.title = element_text(color = "brown4"),
        axis.title = element_text(color = "pink4")) +
  my_theme

Answer: The scatter plot shows a positive correlation between house area and price, with larger houses tending to have higher prices. However, some houses deviate significantly from this trend, suggesting the influence of other factors. Additionally, the year built appears to play a role, with older houses (darker points) more widely dispersed across mid-to-low price levels, while newer houses (lighter blue) tend to be priced higher, even for the same area size.
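
To check whether newer houses are priced higher at the same area, both predictors can be placed in one model. This is a rough sketch under a linear, additive assumption, using the `ames` data frame from `openintro`; `fit_area_year` is an illustrative name.

```r
library(openintro)

# Sale price modeled on living area and construction year together
fit_area_year <- lm(price ~ area + Year.Built, data = ames)

# The Year.Built coefficient estimates the price difference associated
# with one additional year of construction date, holding area fixed.
coef(fit_area_year)
summary(fit_area_year)$r.squared
```

A positive `Year.Built` coefficient would agree with the color pattern in the plot, where lighter (newer) points sit higher than darker (older) points at a similar area.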