library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

QUESTION 1

Objective

To determine the number of variables and observations in the College dataset and to count its missing values.

# Load the dataset

d <- read.csv("college.csv")
# 1. Variables and Observations
variables <- ncol(d)
observations <- nrow(d)
missing_values <- sum(is.na(d))

cat("1. Variables:", variables, "\n")
## 1. Variables: 17
cat("   Observations:", observations, "\n")
##    Observations: 1269
cat("   Missing values:", missing_values, "\n")
##    Missing values: 2

Interpretation

The College dataset contains 17 variables and 1269 observations. There are 2 missing values present in the dataset.
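
The total above does not say where the missing values sit. A minimal sketch, using a small made-up data frame in place of `college.csv` (the values, and the choice of which cells are missing, are illustrative assumptions), shows how per-column counts and the affected rows could be obtained:

```r
# Toy stand-in for the College data; values are made up for illustration.
toy <- data.frame(
  name              = c("A", "B", "C", "D"),
  median_debt       = c(12000, NA, 18000, 15000),
  loan_default_rate = c(0.05, NA, 0.08, 0.03)
)

# Count missing cells in each column
na_by_column <- colSums(is.na(toy))
print(na_by_column)

# Rows that contain at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)
```

Running the same two calls on `d` would show which variables and which schools account for the missing values.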


QUESTION 2

Objective

To visualize the average admission rate for each state using a barplot.

# Calculate average admission rate for each state

d2 <- d %>%
  group_by(state) %>%
  mutate(admission_rate_mean = mean(admission_rate))

ggplot(d2) +
  geom_bar(aes(x = state, fill = admission_rate_mean))

Interpretation

With geom_bar(), the height of each bar is the number of colleges in the state, and the average admission rate appears only as the fill color, so it is hard to tell which states have high or low admission rates. One remedy is to reorder the x-axis so that states with higher admission rates sit towards one end and states with lower rates towards the other. Another is to use a graph with admission rate on the y-axis, which shows the variation across states more directly.
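
The alternative described above can be sketched as follows. The data frame is a made-up stand-in (the state codes and rates are assumptions, not values from `college.csv`). The idea is to reduce the data to one row per state and put the mean admission rate on the y-axis, with states ordered by that mean:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the College data; values are made up for illustration.
toy <- data.frame(
  state          = c("CA", "CA", "NY", "NY", "TX", "TX"),
  admission_rate = c(0.40, 0.60, 0.30, 0.50, 0.70, 0.90)
)

# One row per state with its mean admission rate
state_means <- toy %>%
  group_by(state) %>%
  summarise(admission_rate_mean = mean(admission_rate, na.rm = TRUE))

# Bar height is the mean; reorder() sorts states from lowest to highest mean
ordered_barplot <- ggplot(state_means,
                          aes(x = reorder(state, admission_rate_mean),
                              y = admission_rate_mean)) +
  geom_col() +
  labs(x = "State", y = "Mean admission rate")

print(ordered_barplot)
```

Using `summarise()` rather than `mutate()` is the design choice that matters here: it leaves one value per state, so bar height can carry the mean instead of a count of colleges.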


QUESTION 3

Objective

To visualize the relationship between median debt and loan default rate using a scatter plot with a nonparametric trend.

# Scatter plot with nonparametric trend

scatterplot <- ggplot(d, aes(x = median_debt, y = loan_default_rate)) +
  geom_point() +
  geom_smooth(method = "gam") +
  labs(title = "Relationship between Median Debt and Loan Default Rate",
       x = "Median Debt", y = "Loan Default Rate")

print(scatterplot)
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Interpretation

Overall, the trend seems reasonable, indicating that as debt increases, the likelihood of default also tends to rise. The only aspect that appears puzzling is the slight dip observed in the range between $25,000 and $26,000 of median debt.


QUESTION 4

Objective

To visualize the relationship between median debt and loan default rate using a bubble plot with a weighted nonparametric trend.

# Bubble plot with weighted nonparametric trend

# Map size only in geom_point() so the smoother does not inherit it
bubbleplot <- ggplot(d, aes(x = median_debt, y = loan_default_rate)) +
  geom_point(aes(size = undergrads), alpha = 0.6) +
  geom_smooth(method = "gam", aes(weight = undergrads), se = FALSE) +
  labs(title = "Relationship between Median Debt, Loan Default Rate, and Undergrads",
       x = "Median Debt", y = "Loan Default Rate", size = "Undergrads")

print(bubbleplot)
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Interpretation

The weighted trend makes sense because it uses the number of undergraduates as weights for the trend line. Observations with larger bubbles, that is, schools with more undergraduates, pull the trend line more strongly than small schools do. The fitted curve therefore describes the relationship between median debt and loan default rate as experienced by the typical student rather than the typical school, and it is less sensitive to small schools with unusual values.
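
The effect of weighting can be illustrated with a small sketch. To keep it self-contained, a straight-line `lm()` fit stands in for the GAM smoother used in the plot, and the four schools below are made up (all values are assumptions chosen so that one small school has an unusually high default rate):

```r
# Toy data: three large schools and one small outlying school (made-up values)
schools <- data.frame(
  median_debt  = c(10, 15, 20, 25),            # in thousands of dollars
  default_rate = c(0.04, 0.06, 0.05, 0.12),
  undergrads   = c(20000, 15000, 10000, 500)
)

# Every school counts equally
unweighted_fit <- lm(default_rate ~ median_debt, data = schools)

# Each school counts in proportion to its number of undergraduates
weighted_fit <- lm(default_rate ~ median_debt, data = schools,
                   weights = undergrads)

slope_unweighted <- coef(unweighted_fit)[["median_debt"]]
slope_weighted   <- coef(weighted_fit)[["median_debt"]]

print(c(unweighted = slope_unweighted, weighted = slope_weighted))
```

In this toy example the small school with the high default rate steepens the unweighted slope, while the weighted fit is dominated by the three large schools and is flatter.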


QUESTION 5

Objective

To compare the trends observed in questions 3 and 4 and determine which trend is better.

Interpretation

Both trends provide valuable insights into the relationship between median debt and loan default rate. However, the trend from question 4 (bubble plot with weighted trend line) may be considered better for the following reasons:

Incorporation of Weights: The trend in question 4 uses the number of undergraduates as weights, giving more importance to schools with larger undergraduate populations. This makes the trend more representative of the overall student population, whereas the unweighted trend in question 3 treats every school equally regardless of size.

Enhanced Visualization: The bubble plot visually represents the number of undergraduates using bubble size, providing additional information about the distribution of institutions based on undergraduate population.

Based on these factors, the trend from question 4 is considered better as it provides a more comprehensive and representative analysis of the relationship between median debt and loan default rate.


QUESTION 6

Objective

To compare the distributions of undergraduates for different types of schools offering different highest degrees using overlapping density plots.

# Overlapping density plot
density_plot <- ggplot(d, aes(x = log10(undergrads), fill = as.factor(highest_degree))) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Undergrads by Highest Degree Type",
       x = "Log10(Undergrads)", y = "Density") +
  scale_fill_discrete(name = "Highest Degree Type")

print(density_plot)

Interpretation

The log transformation enhances readability by spreading out the data and reducing skewness, making it easier to distinguish density distributions across different school types based on undergrad counts.
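
A related option, sketched here on made-up data, is to leave the variable untransformed and log-scale the axis with `scale_x_log10()`. The density is still estimated on the log10 scale, but the tick labels stay in student counts. The enrollment values and degree labels below are illustrative assumptions:

```r
library(ggplot2)

set.seed(1)
# Made-up, right-skewed enrollment counts for two hypothetical degree types
toy <- data.frame(
  undergrads     = round(rlnorm(200, meanlog = 8, sdlog = 1)) + 1,
  highest_degree = rep(c("Associate", "Graduate"), each = 100)
)

density_plot_log_axis <- ggplot(toy, aes(x = undergrads, fill = highest_degree)) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +  # log10 spacing, tick labels in original units
  labs(x = "Undergrads (log scale)", y = "Density",
       fill = "Highest Degree Type")

print(density_plot_log_axis)
```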


QUESTION 7

Objective

To show the composition of the distribution of undergraduates for different types of schools offering different highest degrees using stacked density plots.

# Stacked density plot

stacked_density <- ggplot(d, aes(x = log10(undergrads), fill = as.factor(highest_degree))) +
  geom_density(alpha = 0.5, position = "stack") +
  labs(title = "Stacked Density of Undergrads by Highest Degree Type",
       x = "Log10(Undergrads)", y = "Density") +
  scale_fill_discrete(name = "Highest Degree Type")

print(stacked_density)

Interpretation

Schools with smaller student populations typically offer Associate degrees as their highest degree, while those with larger student populations tend to offer graduate programs.


QUESTION 8

Objective

To show the number of schools for each highest degree type using a barplot.

# Number of schools by highest degree type
degree_count <- d %>%
  group_by(highest_degree) %>%
  summarise(count = n())

# Barplot
barplot_degree_count <- ggplot(degree_count, aes(x = highest_degree, y = count)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Number of Schools by Highest Degree Type", x = "Highest Degree Type", y = "Count")

print(barplot_degree_count)

Interpretation

Counting the schools directly shows that more schools offer graduate degrees as their highest degree than any other type, which was not apparent from the density plots in questions 6 and 7, where each density has the same total area regardless of how many schools it represents.


QUESTION 9

Objective

To improve upon the stacked density plot from question 7 by making the area of each density proportional to the number of schools for each highest degree type.

# Scale each density by the number of schools in its group.
# A constant weight within a group would be normalized away by the density
# estimate, so after_stat(count) (density times group size) is used instead.

# Stacked density plot with proportional area
improved_stacked_density <- ggplot(d, aes(x = log10(undergrads), fill = as.factor(highest_degree))) +
  geom_density(aes(y = after_stat(count)), alpha = 0.5, position = "stack") +
  labs(title = "Proportional Stacked Density of Undergrads by Highest Degree Type",
       x = "Log10(Undergrads)", y = "Density x Number of Schools") +
  scale_fill_discrete(name = "Highest Degree Type")

print(improved_stacked_density)

Interpretation

The improved stacked density plot adjusts the area of each density curve to be proportional to the number of schools for each highest degree type. This provides a more accurate representation of the distribution of undergraduates across different types of schools.
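
The idea of proportional areas can be checked numerically. The sketch below uses base R's `density()` on two made-up groups (the group sizes and values are assumptions) to show that a plain density has area close to 1 whatever the group size, and that multiplying by the number of schools makes the area track the group size:

```r
set.seed(42)
# Made-up log10 enrollment values for two hypothetical degree types
small_group <- rnorm(50,  mean = 2.8, sd = 0.3)   # 50 schools
large_group <- rnorm(200, mean = 3.6, sd = 0.4)   # 200 schools

# Area under a curve given as (x, y) points on an evenly spaced grid
curve_area <- function(x, y) sum(y) * (x[2] - x[1])

d_small <- density(small_group)
d_large <- density(large_group)

# A plain density integrates to about 1 regardless of group size
area_small <- curve_area(d_small$x, d_small$y)
area_large <- curve_area(d_large$x, d_large$y)

# Scaling by the number of schools makes the area follow the group size
scaled_small <- area_small * length(small_group)
scaled_large <- area_large * length(large_group)

print(c(area_small = area_small, area_large = area_large))
print(scaled_small / scaled_large)  # close to 50 / 200
```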


QUESTION 10

Objective

To visualize the association between a pair of variables conditional on a few other variables, ensuring meaningful patterns in the visualization.

I will be visualizing the association between SAT scores and admission rates, conditional on the type of school (public vs. private) and region.

# Load necessary libraries
library(ggplot2)


# Create a scatter plot colored by school type and faceted by region
conditional_plot <- ggplot(d, aes(x = sat_avg, y = admission_rate, color = control)) +
  geom_point(alpha = 0.7) +
  facet_grid(. ~ region) +  # One panel per region
  labs(title = "Association between SAT Scores and Admission Rates",
       x = "Average SAT Score", y = "Admission Rate", color = "Control") +
  theme_minimal()

# Print the conditional plot
print(conditional_plot)

Interpretation

Each panel shows one region, with average SAT score on the x-axis, admission rate on the y-axis, and color distinguishing public from private schools. This layout makes it possible to compare the association between SAT scores and admission rates across regions and between school types. The pattern to look for is whether schools with higher average SAT scores tend to have lower admission rates, and whether that association holds for both public and private schools in every region.


QUESTION 11

Objective

To propose two questions about the dataset and answer them using visualization.

Question 1: How does the faculty salary vary across different regions and highest degree offered by schools?

# Create a grouped bar plot of mean salary (one bar per region and degree type)
salary_region_degree_barplot <- ggplot(d, aes(x = region, y = faculty_salary_avg, fill = highest_degree)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge") +
  labs(title = "Average Faculty Salary by Region and Highest Degree Offered",
       x = "Region", y = "Average Faculty Salary", fill = "Highest Degree Offered") +
  theme_minimal()

# Print the plot
print(salary_region_degree_barplot)

Interpretation

The grouped bar plot suggests that, across all regions, schools offering graduate degrees as their highest degree have the highest average faculty salary. The West stands out: its graduate-level schools appear to have the highest average faculty salary of any region. This pattern is consistent with graduate programs employing higher-paid faculty, with compensation particularly high in the West.
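
Because this reading rests on group averages, it can help to compute them explicitly and compare them with the bar heights. A minimal sketch on made-up data (the regions, degree labels, and salaries are illustrative assumptions, not values from `college.csv`):

```r
library(dplyr)

# Toy stand-in for the College data; values are made up for illustration.
toy <- data.frame(
  region             = c("West", "West", "West", "South", "South", "South"),
  highest_degree     = c("Graduate", "Graduate", "Bachelor",
                         "Graduate", "Bachelor", "Bachelor"),
  faculty_salary_avg = c(9000, 11000, 7000, 8000, 6000, 6500)
)

# Mean faculty salary for each region and degree combination
salary_means <- toy %>%
  group_by(region, highest_degree) %>%
  summarise(mean_salary = mean(faculty_salary_avg, na.rm = TRUE),
            .groups = "drop")

print(salary_means)
```

Applying the same `group_by()` and `summarise()` to `d` would give the table of averages that the bar plot is meant to display.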


Question 2: What is the relationship between undergraduate enrollment and faculty salary, considering the type of control (public vs. private) and the highest degree offered by schools?

# Create a scatter plot
enrollment_salary_scatter <- ggplot(d, aes(x = undergrads, y = faculty_salary_avg, color = control, shape = highest_degree)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Relationship between Undergraduate Enrollment and Faculty Salary",
       x = "Undergraduate Enrollment", y = "Faculty Salary", color = "Control", shape = "Highest Degree Offered") +
  theme_minimal()

# Print the plot
print(enrollment_salary_scatter)

Interpretation

Each point in the scatter plot is a college, with undergraduate enrollment on the x-axis and average faculty salary on the y-axis. Color indicates the control type (public vs. private) and shape indicates the highest degree offered. Reading the plot by color and shape shows how the relationship between enrollment and faculty salary differs across types of schools.