⇒ /Users/sambamamba/Desktop/Stat_133/Stat_133 Stuff/Economist_Assignment_5.Rmd

The Economist is a well-regarded weekly news magazine. The graphic in the assignment page accompanied their article about the release of the “College Scorecard” data in Sept. 2015.

Goal: To reproduce this graph from the College Scorecard data, and perhaps enhance it.

Downloaded data from website:

download.file("http://tiny.cc/dcf/ScorecardSmall.Rda",
destfile = "ScorecardSmall.Rda")
load("ScorecardSmall.Rda")

There are 54 variables, 7804 institutions in the original 2013 Scorecard file. The ScorecardSmall data set denotes the average net cost per year at US colleges, by income quintile.

What’s the case? The case in the Scorecard data is an institution. In the Economist graphic, however, the case is a level of family income (as in NPT4) at an institution. That is, from the perspective of the graphic, the Scorecard data is in wide form. You’ll have to convert it to narrow form to make the graph.

  1. Select just the variables you need from the Scorecard data.

Based on the graphic in the assignment prompt, we need the following variables:

  1. Use gather() to convert from wide to narrow format.

First, we want tot eliminate unnecessary variables and labels. We want to have ADM_RATE to be less than 100%, exclude schools that have fewer than 1000 students with a CCSIZNET filter, and eliminate non-public and non-public schools, so we exclude CONTROL = 3.

Econ1 <- ScorecardSmall %>%
  filter(CONTROL != 3, !(CCSIZSET %in% c(1,6,7,8)), ADM_RATE<1) %>%
  select(ADM_RATE, NPT41_PUB, NPT43_PUB, NPT45_PUB, NPT41_PRIV, NPT43_PRIV, NPT45_PRIV, CCSIZSET, CONTROL) %>%
  mutate(CONTROL = ifelse(CONTROL==1, "Public", "Private")) %>% #Labels the public and private variables on the plot
  gather(key = long_name, value = Cost, NPT41_PUB, NPT43_PUB, NPT45_PUB, NPT41_PRIV, NPT43_PRIV, NPT45_PRIV) %>% #generates a narrow data frame from a wide data frame by collecting all the quintiles of students from both private and public schools under one column variable, namely, long_name, and their corresponding values under the new column variable Cost
  mutate(Cost = Cost/1000, ADM_RATE = ADM_RATE*100)  #rescales the values such that we have the tick-marks to be in the 10s place, not in the thousands
  1. After (2) you will have a variable with levels like NPT43_PUB, NPT45_PRIV, etc. You will want to translate these to Q3, Q5, etc. For your convenience, the file http://tiny.cc/dcf/NPT4-names.csv contains a table with the appropriate translations. You can use a join of the narrow-format Scorecard data with this table to perform the translations.
Nickname <- read.csv("http://tiny.cc/dcf/NPT4-names.csv") #loads data frame containing new labels translating

Econ2 <- Econ1 %>%
  left_join(Nickname, by=c("long_name"="long_name"))
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector

We only want to find the 1st, 3rd, and 5th poorest quintiles.

Econ3 <- Econ2 %>%
  filter(short_name == "Q1"|short_name ==  "Q3"| short_name == "Q5") 
nrow(Econ3)
## [1] 8964

Now all we need to do is plot the College Score Card data, using visualization arguments in the ggplot2 package to generate an understandable plot.

posn <- position_jitter(width = 0.5)

ggplot(Econ3, aes(x = ADM_RATE, y = Cost)) + 
  geom_point(col = "blue", alpha = 0.6, na.rm = TRUE, position = posn) + 
  geom_smooth(col = "red") + 
  facet_grid(CONTROL ~ short_name, drop = TRUE) + 
  labs(title = "Score Card", x = "Admission rate, %", y = "Net Cost Per Year, $,000") + 
  ylim(0,50) + 
  theme( plot.background = element_rect(fill = "grey90"), panel.background = element_rect(fill = "white"), panel.grid.major = element_line(colour = "grey"), axis.title.x = element_text(size = 20), axis.title.y = element_text(size = 20), plot.title = element_text(size = 30)) 
## Warning: Removed 4734 rows containing non-finite values (stat_smooth).