Data_Exploration

I just loaded in all the libraries I think I’ll need for this assignment. Below, I’m loading in my cleaned data. When I cleaned the data there were a number of choices that I made which I believed would help me with my analysis later on. The biggest choice I made was to split my cleaned data into a high earning colleges file and a low earning colleges file. I chose high earning as 75,000 dollar or higher since the mean household income in the United States is roughly 69,000. I figure that an individual making more than the mean household income is would be considered to be a high earner. For the low earning schools, it becomes very relative. The figure I chose for low earning schools is 50,000 or under. My thought process was that a 25,000 dollar, 20% drop in earnings would constitute low earnings relative to high earning schools.

Another thing I did to aid in my analysis was to create a binary variable called afterSC (after Scorecard) which if true, means the row is after the scorecards has been released. If false, its before September 2015. This variable was helpful in my regression of the data as we are looking at change over time.

Finally, I deleted most columns that I felt were unnecessary to help with the file size and code run time. I was left with around 15,500 observations and 20 variables in my high earnings data frame, and 616,000 observations and 20 variables in my low earnings data frame.

head(h_earn)

## # A tibble: 6 x 19
##   UNITID OPEID.x CITY   STABBR INSTURL  earnings gt_25k_p6 schname schid keyword
##    <dbl>   <dbl> <chr>  <chr>  <chr>       <dbl>     <dbl> <chr>   <dbl> <chr>  
## 1 111188  113400 Valle… CA     www.csu…    81100     0.766 califo…    30 csum.e…
## 2 111188  113400 Valle… CA     www.csu…    81100     0.766 califo…    31 csum   
## 3 111188  113400 Valle… CA     www.csu…    81100     0.766 califo…    29 califo…
## 4 111188  113400 Valle… CA     www.csu…    81100     0.766 califo…    31 csum   
## 5 111188  113400 Valle… CA     www.csu…    81100     0.766 califo…    29 califo…
## 6 111188  113400 Valle… CA     www.csu…    81100     0.766 califo…    31 csum   
## # … with 9 more variables: keynum <dbl>, monthorweek <chr>, index <dbl>,
## #   OPEID.y <dbl>, N <dbl>, index_std <dbl>, date <date>, beforeSC <lgl>,
## #   afterSC <lgl>

head(l_earn)

## # A tibble: 6 x 19
##   UNITID OPEID.x CITY   STABBR INSTURL  earnings gt_25k_p6 schname schid keyword
##    <dbl>   <dbl> <chr>  <chr>  <chr>       <dbl>     <dbl> <chr>   <dbl> <chr>  
## 1 100654  100200 Normal AL     www.aam…    31400     0.462 alabam…    70 aamu   
## 2 100654  100200 Normal AL     www.aam…    31400     0.462 alabam…    67 alabam…
## 3 100654  100200 Normal AL     www.aam…    31400     0.462 alabam…    67 alabam…
## 4 100654  100200 Normal AL     www.aam…    31400     0.462 alabam…    70 aamu   
## 5 100654  100200 Normal AL     www.aam…    31400     0.462 alabam…    67 alabam…
## 6 100654  100200 Normal AL     www.aam…    31400     0.462 alabam…    70 aamu   
## # … with 9 more variables: keynum <dbl>, monthorweek <chr>, index <dbl>,
## #   OPEID.y <dbl>, N <dbl>, index_std <dbl>, date <date>, beforeSC <lgl>,
## #   afterSC <lgl>

Below I’m scaling the index values so they are comparible between different colleges. This should make my regression far easier and allow me to have a fairly simple model which will be easier to interpret.

h_earn$index_scaled <-scale(h_earn$index)
l_earn$index_scaled <-scale(l_earn$index)

Because I have two data frames, I had to run two separate regressions and then compare them as model 1 and model 2. I’m regressing afterSC on index_scaled. In other words, I’m trying to find the relationship between the change before or after September 2015 on the scaled index values of college searches on Google.

In model 1 which is the scorecard data for low earning colleges, we see a decline of searches of .1146 from a baseline of .0216.We see an almost identical relationship with the high earning schools with a drop of .1159 from a .0218 base. This indicates that there is a 1.12% difference between the two after the scorecards have been released.

h_scorecard <- lm(index_scaled ~ afterSC, data = h_earn)
l_scorecard <- lm(index_scaled ~ afterSC, data = l_earn)
export_summs(l_scorecard, h_scorecard, digits = 4)

	Model 1	Model 2
(Intercept)	0.0216 ***	0.0218 *
	(0.0014)	(0.0089)
afterSCTRUE	-0.1146 ***	-0.1159 ***
	(0.0033)	(0.0206)
N	613293	15461
R2	0.0020	0.0021
* p < 0.001; p < 0.01; * p < 0.05.

After the first regression, I thought about any variables which might need to be controlled for that I had access to and I decided that controlling for the state in which the school is located in would have a sizable impact. I found that the intercept saw a massive change, however the difference between the before and after Scorecards shrunk to an almost unnoticeable size with a difference of .087%. Interestingly, this time the high earning schools showed a lower drop off, but since it’s by such a small margin, I don’t believe that its significant.

h_scorecardC <- lm(index_scaled ~ afterSC + STABBR, data = h_earn)
l_scorecardC <- lm(index_scaled ~ afterSC + STABBR, data = l_earn)
export_summs(l_scorecardC, h_scorecardC, digits = 4)

	Model 1	Model 2
(Intercept)	0.5993 ***	-0.1864 ***
	(0.0229)	(0.0226)
afterSCTRUE	-0.1147 ***	-0.1146 ***
	(0.0032)	(0.0190)
STABBRAL	-0.6974 ***
	(0.0246)
STABBRAR	-0.7180 ***
	(0.0255)
STABBRAZ	-0.6023 ***
	(0.0270)
STABBRCA	-0.4014 ***
	(0.0236)
STABBRCO	-0.3462 ***
	(0.0250)
STABBRCT	-0.4939 ***
	(0.0261)
STABBRDC	-0.4521 ***	0.8514 ***
	(0.0350)	(0.0433)
STABBRDE	-0.6034 ***
	(0.0312)
STABBRFL	-0.6012 ***
	(0.0239)
STABBRGA	-0.5464 ***
	(0.0240)
STABBRHI	-0.3176 ***
	(0.0281)
STABBRIA	-0.6900 ***
	(0.0245)
STABBRID	-0.5454 ***
	(0.0271)
STABBRIL	-0.4867 ***
	(0.0238)
STABBRIN	-0.4934 ***	0.0252
	(0.0240)	(0.0400)
STABBRKS	-0.5727 ***
	(0.0249)
STABBRKY	-0.9271 ***
	(0.0247)
STABBRLA	-0.7216 ***
	(0.0252)
STABBRMA	-0.5623 ***	-0.0661 *
	(0.0240)	(0.0280)
STABBRMD	-0.6054 ***	0.4934 ***
	(0.0252)	(0.0358)
STABBRME	-0.2750 ***
	(0.0281)
STABBRMI	-0.6071 ***
	(0.0242)
STABBRMN	-0.4484 ***
	(0.0242)
STABBRMO	-0.6446 ***
	(0.0242)
STABBRMS	-0.7503 ***
	(0.0265)
STABBRMT	-0.5537 ***
	(0.0264)
STABBRNC	-0.6982 ***	-0.7677 ***
	(0.0240)	(0.0433)
STABBRND	-0.6389 ***
	(0.0272)
STABBRNE	-0.3346 ***
	(0.0254)
STABBRNH	-0.5437 ***
	(0.0269)
STABBRNJ	-0.5312 ***	0.8433 ***
	(0.0252)	(0.0344)
STABBRNM	-0.3352 ***
	(0.0296)
STABBRNV	-0.4916 ***
	(0.0331)
STABBRNY	-0.5567 ***	-0.0223
	(0.0235)	(0.0309)
STABBROH	-0.5481 ***
	(0.0237)
STABBROK	-0.7711 ***
	(0.0244)
STABBROR	-0.5035 ***
	(0.0249)
STABBRPA	-0.5126 ***	0.5487 ***
	(0.0236)	(0.0290)
STABBRRI	-0.7851 ***
	(0.0283)
STABBRSC	-0.8261 ***
	(0.0247)
STABBRSD	-0.7042 ***
	(0.0264)
STABBRTN	-0.7502 ***
	(0.0241)
STABBRTX	-0.6234 ***	0.2990 ***
	(0.0235)	(0.0344)
STABBRUT	-0.5876 ***
	(0.0277)
STABBRVA	-0.6423 ***	0.0551
	(0.0242)	(0.0400)
STABBRVI	-0.3453 ***
	(0.0422)
STABBRVT	-0.4590 ***
	(0.0266)
STABBRWA	-0.4584 ***
	(0.0252)
STABBRWI	-0.5455 ***
	(0.0240)
STABBRWV	-0.7069 ***
	(0.0257)
STABBRWY	-0.4886 ***
	(0.0422)
N	613293	15461
R2	0.0176	0.1435
* p < 0.001; p < 0.01; * p < 0.05.

Here I double checked my first regression to help me understand the relationship between before and after the scorecard. We can see that just as in the first regression, both the intercept and the change after the scorecard was introduced was negligible.

t.test(index_scaled ~ afterSC, data = h_earn, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  index_scaled by afterSC
## t = 5.6368, df = 15459, p-value = 1.763e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.07557074 0.15614817
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##          0.02180655         -0.09405291

t.test(index_scaled ~ afterSC, data = l_earn, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  index_scaled by afterSC
## t = 35.134, df = 613291, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.108205 0.120991
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##          0.02159860         -0.09299942

h_earn$date = substr(h_earn$date,1,nchar(h_earn$date)-1)
l_earn$date = substr(l_earn$date,1,nchar(l_earn$date)-1)

h_earn$date <- as.integer(h_earn$date)
l_earn$date <- as.integer(l_earn$date)

h_grouped <- h_earn %>%
  group_by(date)%>%
  summarise_at(vars(index_scaled),funs(mean(.,na.rm=TRUE)))

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

l_grouped <- l_earn %>%
  group_by(date)%>%
  summarise_at(vars(index_scaled),funs(mean(.,na.rm=TRUE)))

Above, I prepped my data to graph it.

Below, I’m plotting the the scaled index against grouped years. We can clearly see a negative correlation of searches over the time the data is collected and at 2015, we can see a sharper drop off of the high earning colleges than low earning colleges, indicative of the first regression with the 1.12% difference between the two.

graph <- ggplot(NULL, aes(date, index_scaled)) +
  geom_line(data = h_grouped, col = "red") +
  geom_line(data = l_grouped, col = "blue") + 
  geom_vline(xintercept = 2015, size = 1)
graph

## Warning: Removed 1 row(s) containing missing values (geom_path).

## Warning: Removed 1 row(s) containing missing values (geom_path).

Explain what we should conclude, in real world terms, based on your results

I believe the analysis on this page addresses the research question because it describes the relationship between the interest in low earning and high earning colleges before and after the scorecard was released. It also is only looking at colleges which predominantly are four year institutes or in other words, colleges which predominantly give out bachelors degrees.

From this analysis, I found that there seemed to be little to no correlation between the release of the scorecard and the the interest in high earning colleges versus low earning colleges. As I mentioned above, when controls were implemented, I only saw a .087% advantage for high earning colleges.

Interestingly, I noticed that there has been a negative trend of interest in college interest as proxied by Google trends data, but when the scorecard came out, the decline seems to have flattened a bit as you can see in the graph above.

Data_Exploration

Gabriel Marken

2/14/2022