⇒ /Users/sambamamba/Desktop/Stat_133/Stat_133 Stuff/Economist_Assignment_5.Rmd
The Economist is a well-regarded weekly news magazine. The graphic in the assignment page accompanied their article about the release of the “College Scorecard” data in Sept. 2015.
Goal: To reproduce this graph from the College Scorecard data, and perhaps enhance it.
Downloaded data from website:
download.file("http://tiny.cc/dcf/ScorecardSmall.Rda",
destfile = "ScorecardSmall.Rda")
load("ScorecardSmall.Rda")
There are 54 variables, 7804 institutions in the original 2013 Scorecard file. The ScorecardSmall data set denotes the average net cost per year at US colleges, by income quintile.
What’s the case? The case in the Scorecard data is an institution. In the Economist graphic, however, the case is a level of family income (as in NPT4) at an institution. That is, from the perspective of the graphic, the Scorecard data is in wide form. You’ll have to convert it to narrow form to make the graph.
Based on the graphic in the assignment prompt, we need the following variables:
ADM_RATE: admissions rate in percent
NPT4_PUB: average net cost for students in public institutions
NPT4_PRIV: average net cost for students in private institutions
CCSIZSET: Carnegie size classification for the institutions. Values 1, 6, 7, and 8 correspond to schools with fewer than 1,000 students.
CONTROL: Dichotomous variable denoting whether an institution is public, or (1), or private, or (2). We will discard CONTROL = 3 later.
NPT41_PUB,NPT43_PUB, NPT45_PUB: Average net cost for students at public institutions whose families are in the lowest, third lowest, and richest of the five economic groups respectively.
NPT41_PRIV,NPT43_PRIV, NPT45_PRIV: Average net cost for students at private institutions whose families are in the lowest, third lowest, and richest of the five economic groups respectively.
gather() to convert from wide to narrow format.First, we want tot eliminate unnecessary variables and labels. We want to have ADM_RATE to be less than 100%, exclude schools that have fewer than 1000 students with a CCSIZNET filter, and eliminate non-public and non-public schools, so we exclude CONTROL = 3.
Econ1 <- ScorecardSmall %>%
filter(CONTROL != 3, !(CCSIZSET %in% c(1,6,7,8)), ADM_RATE<1) %>%
select(ADM_RATE, NPT41_PUB, NPT43_PUB, NPT45_PUB, NPT41_PRIV, NPT43_PRIV, NPT45_PRIV, CCSIZSET, CONTROL) %>%
mutate(CONTROL = ifelse(CONTROL==1, "Public", "Private")) %>% #Labels the public and private variables on the plot
gather(key = long_name, value = Cost, NPT41_PUB, NPT43_PUB, NPT45_PUB, NPT41_PRIV, NPT43_PRIV, NPT45_PRIV) %>% #generates a narrow data frame from a wide data frame by collecting all the quintiles of students from both private and public schools under one column variable, namely, long_name, and their corresponding values under the new column variable Cost
mutate(Cost = Cost/1000, ADM_RATE = ADM_RATE*100) #rescales the values such that we have the tick-marks to be in the 10s place, not in the thousands
NPT43_PUB, NPT45_PRIV, etc. You will want to translate these to Q3, Q5, etc. For your convenience, the file http://tiny.cc/dcf/NPT4-names.csv contains a table with the appropriate translations. You can use a join of the narrow-format Scorecard data with this table to perform the translations.Nickname <- read.csv("http://tiny.cc/dcf/NPT4-names.csv") #loads data frame containing new labels translating
Econ2 <- Econ1 %>%
left_join(Nickname, by=c("long_name"="long_name"))
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector
We only want to find the 1st, 3rd, and 5th poorest quintiles.
Econ3 <- Econ2 %>%
filter(short_name == "Q1"|short_name == "Q3"| short_name == "Q5")
nrow(Econ3)
## [1] 8964
Now all we need to do is plot the College Score Card data, using visualization arguments in the ggplot2 package to generate an understandable plot.
posn <- position_jitter(width = 0.5)
ggplot(Econ3, aes(x = ADM_RATE, y = Cost)) +
geom_point(col = "blue", alpha = 0.6, na.rm = TRUE, position = posn) +
geom_smooth(col = "red") +
facet_grid(CONTROL ~ short_name, drop = TRUE) +
labs(title = "Score Card", x = "Admission rate, %", y = "Net Cost Per Year, $,000") +
ylim(0,50) +
theme( plot.background = element_rect(fill = "grey90"), panel.background = element_rect(fill = "white"), panel.grid.major = element_line(colour = "grey"), axis.title.x = element_text(size = 20), axis.title.y = element_text(size = 20), plot.title = element_text(size = 30))
## Warning: Removed 4734 rows containing non-finite values (stat_smooth).