Data Exploration Project

Author

Chloe Lee

# Load packages and import the cleaned data
library(tidyverse)
library(lubridate)
library(broom)
library(ggplot2)

final_cleaned_data <- readRDS("final_cleaned_data.rds")

# Create variables

final_cleaned_data <- final_cleaned_data %>%
  mutate(
    # 1 for months on or after the September 2015 Scorecard release
    post_scorecard_binary = if_else(month_date >= as.Date("2015-09-01"), 1, 0),
    median_earnings = as.numeric(`md_earn_wne_p10-REPORTED-EARNINGS`),
    # 1 if earnings are at or above the median across all school-month rows
    high_earnings = if_else(median_earnings >= median(median_earnings, na.rm = TRUE), 1, 0)
  )

# Number the distinct months 1, 2, 3, ... in date order
final_cleaned_data <- final_cleaned_data %>%
  arrange(month_date) %>%
  mutate(time_index = dense_rank(month_date))


model_data <- final_cleaned_data %>%
  filter(!is.na(avg_std_index), !is.na(high_earnings), !is.na(time_index))

model_quadratic <- lm(avg_std_index ~ post_scorecard_binary * high_earnings +
                        time_index + I(time_index^2),
                      data = model_data)

model_data <- model_data %>%
  mutate(fitted_values = fitted(model_quadratic))

fitted_trends <- model_data %>%
  group_by(month_date, high_earnings) %>%
  summarize(predicted_index = mean(fitted_values, na.rm = TRUE), .groups = "drop")

ggplot(fitted_trends, aes(x = month_date, y = predicted_index, color = factor(high_earnings))) +
  geom_line(linewidth = 1.2) +  # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
  geom_vline(xintercept = as.Date("2015-09-01"), linetype = "dashed", color = "blue") +
  labs(
    title = "Model-Predicted Search Interest by Earnings Group (Quadratic Time Trend)",
    x = "Month",
    y = "Predicted Search Index",
    color = "High Earnings\n(1 = Yes)"
  ) +
  theme_minimal()

Graph 1: This graph shows the model-predicted search interest for high- and low-earnings colleges over time, using a regression that includes both linear and quadratic time trends. These fitted lines reflect the expected search patterns after accounting for overall trends in the data.

From 2013 through 2015, predicted search interest declines steadily for both groups. Around the time of the College Scorecard release in September 2015 (the dashed vertical line), there is a brief increase in interest, followed by another decline. Importantly, the high-earnings colleges (group 1 in the legend) do not show a larger increase in predicted search interest after the release. If anything, they decline slightly more than low-earnings colleges. Rather than shifting attention toward high-earnings schools, the Scorecard release was followed by a general drop in search interest, and the drop was somewhat larger for high-earnings schools.
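As a sketch, the specification behind Graph 1 (the `lm()` call with the quadratic time trend) can be written out as follows. The symbols are labels introduced here for illustration and do not appear in the code: $i$ indexes schools, $t$ is the month index, and $\text{Index}_{it}$ is the standardized search index.

```latex
\[
\text{Index}_{it} = \beta_0 + \beta_1\,\text{Post}_t + \beta_2\,\text{High}_i
  + \beta_3\,(\text{Post}_t \times \text{High}_i)
  + \gamma_1 t + \gamma_2 t^2 + \varepsilon_{it}
\]
```

Under this specification the time trend is shared by both groups, so the predicted gap between high- and low-earnings colleges is $\beta_2$ before the release and $\beta_2 + \beta_3$ after it. The two fitted lines are therefore parallel within each period and differ only by a level shift at September 2015.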

final_cleaned_data <- final_cleaned_data %>%
  mutate(period_label = if_else(post_scorecard_binary == 1, "After", "Before"))

avg_by_group <- final_cleaned_data %>%
  filter(!is.na(high_earnings), !is.na(avg_std_index)) %>%
  group_by(period_label, high_earnings) %>%
  summarize(mean_hits = mean(avg_std_index, na.rm = TRUE), .groups = "drop")

ggplot(avg_by_group, aes(x = period_label, y = mean_hits, fill = factor(high_earnings))) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Search Interest Before vs. After Scorecard Release",
    x = "Period",
    y = "Average Search Index",
    fill = "High Earnings (1 = Yes)"
  ) +
  theme_minimal()

Graph 2: This graph shows the average (mean) standardized search interest for high-earnings and low-earnings colleges before and after the release of the College Scorecard. Before the Scorecard, high-earnings schools received slightly more attention than low-earnings ones. After the release, search interest declined for both groups, and the drop was larger for high-earnings colleges.

This pattern matches Graph 1, where both groups trend downward after September 2015 with no clear post-release advantage for high-earnings institutions. Taken together, the two graphs suggest that the Scorecard did not shift student attention toward high-earnings colleges. Instead, search interest declined overall.

# Baseline difference-in-differences model (no time trends)
model <- lm(avg_std_index ~ post_scorecard_binary * high_earnings, data = final_cleaned_data)
summary(model)


Call:
lm(formula = avg_std_index ~ post_scorecard_binary * high_earnings, 
    data = final_cleaned_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6590 -0.3893 -0.0254  0.3524  6.7898 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          0.045456   0.004020  11.307  < 2e-16 ***
post_scorecard_binary               -0.215191   0.009202 -23.384  < 2e-16 ***
high_earnings                        0.018719   0.005657   3.309 0.000936 ***
post_scorecard_binary:high_earnings -0.069907   0.012957  -5.395 6.87e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5778 on 51561 degrees of freedom
  (5900 observations deleted due to missingness)
Multiple R-squared:  0.02873,   Adjusted R-squared:  0.02867 
F-statistic: 508.3 on 3 and 51561 DF,  p-value: < 2.2e-16

Regression Model Interpretation:

I used a difference-in-differences regression that compares search interest between high-earnings and low-earnings colleges, before and after the Scorecard release. Google search interest, standardized at the school-month level, serves as a proxy for student attention. This baseline specification does not include the time trends used for Graph 1.

The regression results confirm the pattern in Graphs 1 and 2. The coefficient on the post-Scorecard indicator (-0.215) shows that search interest for low-earnings colleges fell after the release. The interaction term, which captures the additional change for high-earnings schools, is negative and statistically significant (-0.070, standard error 0.013, p < 0.001). Rather than driving attention toward higher-paying colleges, the Scorecard was followed by a general decline in interest, with a larger decline among high-earnings schools. This contradicts the hypothesis that students would respond to the earnings data by redirecting their attention to colleges with better economic outcomes. The model explains only about 3% of the variation in search interest (R-squared = 0.029), so most of the variation comes from factors outside the model.
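To make the interaction term concrete, the sketch below rebuilds the four group means implied by the coefficients in the regression output above and computes the difference-in-differences by hand. It uses only the reported estimates. The `cell()` helper and the variable names are illustrative, not part of the original analysis.

```r
# Coefficients copied from the regression output above.
b_intercept <- 0.045456   # low earnings, before the release
b_post      <- -0.215191  # change for low-earnings colleges after the release
b_high      <- 0.018719   # high-vs-low gap before the release
b_interact  <- -0.069907  # extra change for high-earnings colleges

# Group means implied by the model (one parameter per group-period cell).
cell_means <- data.frame(
  period        = c("Before", "Before", "After", "After"),
  high_earnings = c(0, 1, 0, 1),
  mean_index    = c(
    b_intercept,
    b_intercept + b_high,
    b_intercept + b_post,
    b_intercept + b_post + b_high + b_interact
  )
)

# Illustrative helper: look up the mean for one period and group.
cell <- function(p, h) {
  cell_means$mean_index[cell_means$period == p & cell_means$high_earnings == h]
}

change_low  <- cell("After", 0) - cell("Before", 0)
change_high <- cell("After", 1) - cell("Before", 1)
did         <- change_high - change_low

print(cell_means)
print(round(c(change_low = change_low, change_high = change_high, did = did), 3))
```

The change is about -0.215 for low-earnings colleges and about -0.285 for high-earnings colleges, and the difference between the two changes equals the interaction coefficient, -0.070.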

Conclusion:

This analysis set out to test whether the release of the College Scorecard in 2015 caused students to shift more of their attention toward colleges with higher post-graduate earnings. To measure student interest, I used standardized Google search data and compared trends before and after the Scorecard across high- and low-earning institutions.

I defined “high-earning” colleges as those whose graduate earnings 10 years after entry, taken from the Scorecard, are at or above the sample median. This median split allowed for a clear and roughly balanced comparison between the two groups. I aggregated the data to the school-month level to reduce noise and make trends more interpretable over time.
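One detail of the median split is worth checking: the cutoff in the code is computed over school-month rows, so a school that appears in more months carries more weight. The toy example below (made-up numbers, with a hypothetical `school_id` column) shows how a row-level median and a school-level median can differ when schools contribute different numbers of rows. If every school has the same number of months, the two cutoffs coincide.

```r
# Toy data (made up): school A appears in four months, B and C in one each.
toy <- data.frame(
  school_id = c("A", "A", "A", "A", "B", "C"),
  earnings  = c(30000, 30000, 30000, 30000, 40000, 50000)
)

# Median over school-month rows: school A counts four times.
row_median <- median(toy$earnings)

# Median over schools: each school counts once.
schools <- unique(toy[, c("school_id", "earnings")])
school_median <- median(schools$earnings)

# Classify each school under both cutoffs.
schools$high_row_split    <- as.integer(schools$earnings >= row_median)
schools$high_school_split <- as.integer(schools$earnings >= school_median)

print(c(row_median = row_median, school_median = school_median))
print(schools)
```

In this toy case the row-level cutoff is 30000 and labels all three schools as high-earnings, while the school-level cutoff is 40000 and labels only B and C that way.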

To answer the question, I used a difference-in-differences regression that included a post-Scorecard indicator, a group indicator (high vs. low earnings), and an interaction term that captures the differential change for high-earnings colleges. Because search interest follows a curved pattern over time, the model behind Graph 1 also includes linear and quadratic time trends to control for nonlinearity. The coefficient estimates reported above come from the baseline specification without those trends.
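As a rough check on how this kind of specification behaves, the sketch below simulates a small panel with a curved time trend and a known interaction effect, then fits the same formula structure used for Graph 1. Everything here is simulated: the panel size, release month, trend shape, noise level, and the "true" effect of -0.07 are assumptions chosen for illustration, not values from the project data.

```r
set.seed(42)

n_schools <- 400
n_months  <- 40
release   <- 30      # first post-release month in the simulated panel
true_did  <- -0.07   # interaction effect built into the simulation

# One row per school-month.
sim <- expand.grid(school = seq_len(n_schools), time_index = seq_len(n_months))
sim$high_earnings <- as.integer(sim$school <= n_schools / 2)
sim$post <- as.integer(sim$time_index >= release)

# Outcome: curved common trend + group gap + post shift + interaction + noise.
sim$index <- 0.5 - 0.03 * sim$time_index + 0.0004 * sim$time_index^2 +
  0.02 * sim$high_earnings - 0.05 * sim$post +
  true_did * sim$post * sim$high_earnings +
  rnorm(nrow(sim), sd = 0.2)

# Same structure as the quadratic-trend model: DiD terms plus t and t^2.
fit <- lm(index ~ post * high_earnings + time_index + I(time_index^2), data = sim)
did_hat <- unname(coef(fit)["post:high_earnings"])

print(round(c(true = true_did, estimated = did_hat), 3))
```

Because the time trend is common to both groups, it mainly affects the post-release coefficient. The interaction estimate should land close to the simulated effect whether or not the trend terms are included.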

The results show that, after the introduction of the College Scorecard, Google Trends search interest for colleges with high-earning graduates fell by 0.070 standard deviations more than it did for colleges with low-earning graduates (standard error 0.013). Both the time-series plot (Graph 1) and the before/after comparison (Graph 2) reinforce this finding: search interest dropped across the board after the Scorecard release, and it dropped even more for high-earnings colleges.
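For a sense of the precision of this estimate, a 95% confidence interval can be computed directly from the reported coefficient, standard error, and residual degrees of freedom. This is a minimal sketch that assumes the conventional OLS standard errors shown in the output. If observations within the same school are correlated, the true interval may be wider.

```r
estimate  <- -0.069907  # interaction coefficient from the regression output
std_error <- 0.012957   # its reported standard error
resid_df  <- 51561      # residual degrees of freedom

# Two-sided 95% interval using the t distribution.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(ci, 3))
```

The interval runs from roughly -0.095 to -0.045, so it excludes zero, consistent with the small p-value in the output.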

These results suggest that the College Scorecard did not shift student attention toward high-earning colleges as the policy intended. While the goal was to help students make more informed choices based on future earnings, search interest declined after the Scorecard’s release, particularly for high-earning institutions. This may reflect low awareness of the tool, difficulty interpreting the data, or the possibility that students prioritize other factors when researching colleges. In real-world terms, simply making earnings data public does not appear to have been enough to change how students search for colleges.