INTRODUCTION

1. BACKGROUND

The American debate over whether a college education is worth it began when the colonists arrived from Europe and founded “New College” (later renamed Harvard University) in 1636. With 20.2 million US college students in 2015 and average student debt at over $28,950, the debate continues today[1]. People who argue that college is worth it contend that college graduates have higher employment rates, bigger salaries, and more work benefits than high school graduates. People who argue that college is not worth it contend that the debt from college loans is too high and delays graduates from saving for retirement, buying a house, or getting married.

2. Methodology

In this project, we mainly look at sereral aspects that might affect people’s decision. First, what is your investment (time, tuition)? Second, what is your return (employment, salary)? Third, is there a major or place where you can get the best ROI? To answer these questions, mutiple data sets were obtained from open sources. College tuition by state provides us information on cost in different state, this helps people to decide if they want to go to another state to pursue “better” education or they want to go with in-state tuition. Employment and salary are also important factors. They are studied across differnt states and majors. Useful data will be extracted from the original data sets and put on maps to visually show the locations with various levels of cost and return.

3. APPROACH

Considering the consumer of the report would be mostly students planning for their future or parents of students. This report is going to contain a lot of visualization of the data. Maps will be used to show numbers at different locations. Time series graphs will be used to understand the trend of cost. Also, boxplots will be used to compare mean values and identify outliers.

4. HOW IT HELPS

The purpose of this project is not to give an exact answer to the question “If I should go to college or not”, while it serves as a tool that people can reference to, based on their circumstances. They are many factors to consider when making a decision. This report gives you straighforward visualization based on data obtained through survey or census by authorities.

PACKAGES REQUIRED

The following packages are required to complete the project.

library(tidyr)          # provides a set of functions that help with getting to tidy data.
library(XML)            # contains readHtmlTABLE function
library(plotly)         # it will be used to create a map in the final project
library(gdata)          # reading xls data from a website with read.xls
                        # supports "http://", "https://", and "ftp://" URLS
library(RCurl)          # provides functions to allow one to compose general HTTP requests
library(stringr)        # in order to use str_trim to trim leading white spaces
library(dplyr)          # It provides simple "verbs", functions that correspond to the most common data                             manipulation tasks.
library(ggplot2)        # for better visulization, ggplot()function
library(purrr)          # to use map functions
library(maps)           # use state maps data

DATA PREPARATION

DATA SETS

Several data sets were obtained from open sources to achieve the purpose of this project. The first data set was accessible on Trends in Higher Education’s website. The data set describes the tuition and fees at Public Two-Year and Four-Year Instituations by State from 2004-05 to 2017-18. The data were compiled from different sources, including College Board, Annual Survey of Colleges; NCES, IPEDS Fall 2015 Enrollment data. This data set is mainly used to provide information on the cost of college education in different states across the US.

The second data set was obtained from Wall Street Journal’s website. This data set is the result of a year-long survey of 1.2 million people with only a Bachelor’s degree by PayScale Inc. The data sets provide salary information of different majors. The variables include starting median salary, mid-career median salary, percentage change from starting to mid-career slary, etc., which is very useful in terms of providing guidance on major selection.

The third data set provides mean earnings of workers 18 Years and over. Some other factors that might affect earnings are also included in this data set. In this project, educational attainment is the main factor to be studied. However, hypotheses testing will be conducted to look at other factors as well. This data set is availabe on the website United States Census Bureu.

The purpose of using the fourth data set is to study if there is a relationship between the educational attainment and college tuition of a certain state. It is also availbe on United State Census’ webiste

Getting Data into R

##############################
#       Obtaining Data       #
##############################

############# First dataset--college tuition by state ############
library(gdata)
url1 <- "https://trends.collegeboard.org/sites/default/files/cp-2017-table-5.xls"
url1 <- gsub("https","http",url1)
TuibyState <- read.xls(url1,perl = "C:/Perl64/bin/perl.exe", skip = 1, stringsAsFactors = F) #getting data

########## Second dataset--salary of different majors ############
library(XML)
library(RCurl)
url2 <- "http://online.wsj.com/public/resources/documents/info-Degrees_that_Pay_you_Back-sort.html"
salary <- readHTMLTable(url2, header = T, skip = 6, stringsAsFactors = F)

##########Third dataset--Mean Earnings by educational attainment, 2016
url3 <- "https://www2.census.gov/programs-surveys/demo/tables/educational-attainment/2016/cps-detailed-tables/histtab-A-03.xlsx"
url3 <- gsub("https","http", url3)
MeanEarnings <- read.xls(url3, perl = "C:/Perl64/bin/perl.exe", stringsAsFactors = F)

Data Cleaning and Preparation

First DataSet– College Tuition by State

The datasets are reletively clean, the work to get the data sets ready for further analysis mainly involved getting rid of empty columns and rows, renaming column names, removing meaningless strings, etc. Since not all the columns and rows in the data sets are needed, therefore a lot of subsetting was conducted. For example, in the first data set, it originally contains tuitions and fees in both “current dollars” and “2017 dollars”. In this project, we used the data “in 2017 dollars”, therefore the data are already adjusted with inflation in consideration.

######################     
#   Cleaning Data    #
######################

## First dataset

Public2year <- TuibyState[2:53, 1:17]
Public4year <- TuibyState[2:53, c(1,19:34)]
col.names1 <- c("State","2004-05","2005-06","2006-07","2007-08","2008-09","2009-10",
                "2010-11","2011-12","2012-13","2013-14","2014-15","2015-16","2016-17",
                "2017-18","1-Year % Change", "5-Year % Change")
colnames(Public2year) <- col.names1     #change column names
colnames(Public4year) <- col.names1     
        ### We will be mainly focused on the Public4year dataset

Second Dataset–Salary by Major

Although many parents indoctrinate their young children to study hard and go to college so they will have financial security in life, the lesson would be more accurate if nuanced a little more.

Although the average recent college graduates earned $33,000 annually, salaries varied from $27,000 to $50,000 depending on the major. In the second dataset which was obtained from Wall Street Journal’s website, the startig median salary and midcareer median salary for 49 different majors were listed. It provides us with a general idea of how much each majors pays after graduation.

## Second dataset
salary <- salary$mySortableTable
salary <- lapply(salary, function(col_) {gsub( "Â","",col_)} ) 
        # removing the weird character at the end of strings
#Some Transformations of the data
salary <- as.data.frame(salary,stringsAsFactors = FALSE)
salary[,2:ncol(salary)] <- lapply(salary[,2:ncol(salary)], function(col_) {as.numeric(gsub('[$,]', '', col_))})
    #change original datatype (character) to numberic

Third Dataset–Mean Earnings by Educational Attainment and Beyond

The dataset contains the mean earnings based on educational attainment from 1975 to 2015. Also, it seperated the dataset into different races and genders.

Note: for Non-Hispanic White, the data are from 1998 to 2015; for Black Female, the data are from 1979; for Asian, the data are from 2002-2017

## Third dataset
MeanEarnings <- MeanEarnings[3:796, 1:22]
#### Total Mean Earnings without considering gender or race 
MeanEarningsT <- MeanEarnings[5:45, c(3,5:22)]
    # This is a subset of MeanEarnings, Race = Total, Gender = Both
col.names3 <- c("Year", "MeanT", "NumberwithEarningsT", "StdErrT",
                "Mean1", "NumberwithEarnings1", "StdErr1",
                "Mean2", "NumberwithEarnings2", "StdErr2",
                "Mean3", "NumberwithEarnings3", "StdErr3",
                "Mean4", "NumberwithEarnings4", "StdErr4",
                "Mean5", "NumberwithEarnings5", "StdErr5")
     # T: Total 
     # 1: Not a High School Graduate
     # 2: High School Graduate
     # 3: Some College/Associate's Degree
     # 4: Bachelor's Degree
     # 5: Advanced Degree
colnames(MeanEarningsT) <- col.names3

Preview of Data Sets

College Tuition By State

Public4year

Salary By Major

salary

Mean Earnings By Educational Attainment

MeanEarningsT

EXPLORATORY DATA ANALYSIS

What You Pay

College Tuition by State

In this section, we mainly look at the tuition cost of public 4-year universities. A variety of factors go into determining how much college will cost a state resident, including state wealth, state grant programs, state taxes and revenue, and demographic trends that are contributing to robust enrollment growth in some parts of the country. Some states are far more generous than others in supporting higher education. In Figure 1, public 4-year universities tuitions are mapped onto the states, which helps us to get a better understanding of where the reletively more expensive or cheaper colleges are. The original dataset has the tuitions data from 2004-05 to 2017-2018. Insteading of plotting the tuition for each year, the average tuition from year 2004-05 to 2017-18 was first calculated and then mapped across each state.

### plot the public4year dataset on the state map
Public4year[,2:15] <-lapply(Public4year[,2:15], function(col_) {as.numeric(gsub('\\$|,', '', col_))}) 
            # Change the data type from character to numeric to calculate mean values
Public4year[,18]<- rowMeans(Public4year[,2:15])   # add the mean value across different years to the data set
colnames(Public4year)[18] <- "MeanTuiOverYears"

x <- data.frame(region <- tolower(Public4year$State), 
                MTuition <- Public4year$MeanTuiOverYears,
                stringsAsFactors = F)
StatesMap <- map_data("state")
#in ggplot2 package
library(ggplot2)

ggplot() +
    geom_map(data=StatesMap, map=StatesMap,
             aes(x=long,y=lat,map_id=region),
             fill="#ffffff", color="#ffffff", size=0.15)+
    geom_map(data=Public4year, map=StatesMap,
             aes(fill=MeanTuiOverYears,map_id=region),
             color = "#ffffff", size=0.15)+
    ggtitle("Figure 1:Higher Education Tuition by State")+
    labs(x=NULL, y=NULL)+
    coord_map("albers", lat0=39,lat1=45)+
    expand_limits(x = StatesMap$long, y = StatesMap$lat)+
    theme(panel.border = element_blank(),
          panel.background = element_blank(),
          axis.ticks = element_blank(),
          axis.text = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 15, margin = margin(b = 10)))
## Warning: Ignoring unknown aesthetics: x, y

To be more specific, the ten highest and lowest states are given in the folowing table.

Table 1 The 10 Higher-Tuition and Lower-Tuition States

ExpCheapTable

As we all know, college enrollment has been increasing since 1980. Growing college enrollment contributes to increased costs in two ways. It necessitates the hiring of more administrative staff, which can be costly. It also normally means that the expenditures, from state and federal government, per student is lower, again placing the burden of tuition and fees on students and families. How has college tuition been increasing in each state? In Figure 2 and 3, 1-year % change and 5-year % change for each state are plotted, respectively.

## 1-Year % Change in Tuition for Each State
Public4year[,16:17] <- lapply(Public4year[,16:17], function(col_) {as.numeric(sub("%","",col_))/100})
            ## Change the two % columns to numeric
ggplotly(
   ggplot() + 
  geom_bar(data = Public4year, aes(x= State, y=`1-Year % Change`), stat = "identity") +
  scale_fill_brewer(type = "seq")+
    ggtitle("Figure 2: 1-Year % Change for all States") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(size = 15, margin = margin(b = 10))),
width=900, height=500)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

As we can see in Figure 1,college is increasing in every state, except for Florida, Hawaii, and Pureto Rico.

## 5-Year % Change in Tuition for Each State
ggplotly(
   ggplot() + 
  geom_bar(data = Public4year, aes(x=State, y=`5-Year % Change`), stat = "identity") +
  scale_fill_brewer(type = "seq")+
    ggtitle("Figure 3: 5-Year % Change for all States") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(size = 15, margin = margin(b = 10))),
width=900, height=500)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

As shown in Figure 3, most of the states have a 5-year % change approximately 20%. Louisana is almost as high as 50%.

What You Earn

Major Matters

There are many things that must be considered in choosing a college major including interests, skills, desired lifestyle, and future earnings. Yet despite this broad array of factors to consider, recently a survey found that, “Being very well off financially” was a very important or essential personal objective of 73.6% of college freshmen suggesting that for many students a top-priority is in fact earnings. Before you spend the time and money to acquire a bachelor’s degree, it’s best to take a look at salaries. Students who strategically choose their major have a good chance of landing a good-paying job. Starting median salaies of 49 different majors are displayed in Figure 4. The figure clearly pointed to the advantages of studying a STEM field in terms of salary.

##Plot of starting median salary
ggplotly(  
salary %>%
  ggplot(aes(reorder(Undergraduate.Major, Starting.Median.Salary),Starting.Median.Salary)) +
  labs(x = 'Undergraduate Major') +
  geom_boxplot(color="red") +
  scale_y_continuous() +
  ggtitle("Figure 4: Starting Median Salary for Different Majors") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 15, margin = margin(b = 10)))+
      coord_flip(),
width = 800, height=900)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

In Figure 5, the midcareer salaries of different majors are plotted.

##Plot of midcareer median salary
ggplotly(  
salary %>%
  ggplot(aes(reorder(Undergraduate.Major, Mid.Career.Median.Salary),Mid.Career.Median.Salary)) +
  labs(x = 'Undergraduate Major', y = 'Midcareer Median Salary') +
  geom_boxplot(color="red") +
  scale_y_continuous() +
  ggtitle("Figure 5: Midcareer Median Salary for Different Majors") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 15, margin = margin(b = 10)))+
      coord_flip(),
width = 800, height=900)

Meaning earnings based on educational attainment are plotted versus time from 1975 to 2015. The gap between earnings seem to get bigger and bigger along the years.

## Mean Earnings by Educational Attainment
MeanEarningsT[,c(2, 5, 8, 11, 14, 17)] <- lapply(MeanEarningsT[,c(2, 5, 8, 11, 14, 17)], function(col_) {as.numeric(gsub('[,]', '', col_))})
ggplotly(
ggplot(MeanEarningsT, aes(Year)) +
labs(x="Year", y="Mean Earnings per Year")+
  geom_line(aes(y = MeanT, colour = "Mean_Total"), size=1) + 
  geom_line(aes(y = Mean1, colour = "Mean_Not a High School Graduate"), size=1)+
  geom_line(aes(y = Mean2, colour = "Mean_High School Graduate"), size=1)+
  geom_line(aes(y = Mean3, colour = "Some High School/Associate's Degree"), size=1)+
  geom_line(aes(y = Mean4, colour = "Mean_Bachelor's Degree"), size=1)+
  geom_line(aes(y = Mean5, colour = "Mean_Advanced Degree"), size=1)+
ggtitle("Figure 6: The Change of Mean Earnings by Education Level from 1975 to 2015")+
      theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 15, margin = margin(b = 5))),
 width = 1000, height=800)

Is It Fair?

Earnings for Different Genders

Besides educational attainment, what are other factors that play a role in a person’s earnings? First, we would like to look at what the mean earnings are for different genders, as shown in Figure 7.

#Plotting#
ggplotly(
compareG %>%
    ggplot(aes(x = reorder(EdAttain, EarningsValues),y=EarningsValues, fill=Gender))+
    geom_bar(stat="identity",position = position_dodge())+
    labs(x = "Educational Attainment", y="Annual Salary Average over 1975 to 2015")+
    ggtitle("Figure 7: The Change of Mean Earnings by Education Level for Different Genders from 1975 to 2015")+
      theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 12, margin = margin(b = 10))),
 width = 700, height=600
)

The data suggests that a glass ceiling is still firmly in place in America’s workforce. Across all industries, on average, women have to attain a Advanced degree to earn more than men who have only attained a bachelor’s degree. Similarly, women with bachelor’s degrees earn a lot less tham men with associate’s degrees.

Earnings Across Ethinicity

As Figure 8 shows, there is a lot of variation of the relative earnings of people of diff erent races/ethnicities relative to Whites. Among Black, for example, mean earnings are a lot than Whites. Hispanic have a similar pattern, but the earnings gap is generally in a lower range.

ggplotly(
compareR %>%
    ggplot(aes(x = reorder(EdAttain, EarningsValues),y=EarningsValues, fill=Race))+
    geom_bar(stat="identity",position = position_dodge())+
    labs(x = "Educational Attainment", y="Annual Salary Average over 1975 to 2015")+
    ggtitle("Figure 8: The Change of Mean Earnings by Education Level for Different Races from 1975 to 2015")+
      theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 12, margin = margin(b = 10))),
 width = 900, height=800
)

##### Does Gender Difference Exist in Each Race?**

#Plotting#
ggplotly(
compare %>%
    ggplot(aes(x = reorder(EdAttain, EarningsValues),y=EarningsValues, fill=Gender))+
    geom_bar(stat="identity",position = position_dodge())+
    facet_grid(Race~.)+
    labs(x = "Educational Attainment", y="Annual Salary Average over 1975 to 2015")+
    ggtitle("Figure 9: Mean Earnings by Educational Attainment")+
      theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 12, margin = margin(b = 10))),
 width = 700, height=1000
)

Lifetime Earnings

Lifetime earnings have important implications for retirement outcomes, including the level of Social Security benefits.Although a large literature examines earnings differences by educational attainment, research on lifetime earnings is limited.Using survey data matched to administrative records that track the earnings of individuals for over three decades allows us to generate a more robust estimate of lifetime earnings by educational attainment. Lifetime earnings are total accumulated earnings over 50 years from age 20 to age 69. The mean earnings for different educational attainment are multipliled by a certain factor (described in detail below), considering that people with lower educational attainment may start working at an earlier age. On the contrary, pursuing higer education delays the career. Applying a 4 percent annual real discount rate to a Bachelor’s degree and a 16 percent annual real discount to an advanced degree, the adjusted lifetime estimate is calulated as follows.

compareG1 <- compareG%>%
    filter(EdAttain=="Bachelor's Degree")%>%
    mutate(Adjusted=EarningsValues*0.96*50)
compareG2 <- compareG%>%
    filter(EdAttain=="Advanced Degree")%>%
    mutate(Adjusted=EarningsValues*0.84*50)
compareG3 <- compareG%>%
    filter(EdAttain!="Advanced Degree"|EdAttain!="Bachelor's Degree")%>%
    mutate(Adjusted=EarningsValues*50)
compareGA <- rbind(compareG1,compareG2,compareG3)

##ploting
ggplotly(
compareGA %>%
    ggplot(aes(x = reorder(EdAttain, Adjusted),y=Adjusted, fill=Gender))+
    geom_bar(stat="identity",position = position_dodge())+
    labs(x = "Educational Attainment", y="Annual Salary Average over 1975 to 2015")+
    ggtitle("Figure 10: The Change of Mean Earnings by Education Level for Different Genders from 1975 to 2015")+
      theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 12, margin = margin(b = 10))),
 width = 900, height=800
)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

As we can see from Figure 10, for males, having some postsecondary education, even without earning a degree, adds nearly 0.4 million dollars to lifetime earnings. Getting an Associate’s degree adds another bump of nearly 0.1million in lifetime earnings. These numbers demonstrate conclusively the advantage of non-baccalaureate postsecondary education.

Getting a Bachelor’s degree adds another large increase in lifetime earnings, $2.3 million over a lifetime, Bachelor’s degree holders earn 35 percent more than workers with an Associate’s degree and 78 percent more than those with just a high school diploma. Further, obtaining a Bachelora’s is also the gateway to entering and completing graduate education. About one-third of Bachelor’s degree holders obtain a graduate degree.

All graduate degree holders can expect lifetime earnings at least double that of those with only a high school diploma.

Moreover, earnings rise substantially for those with Advanced degrees: Master’s Degree or Doctoral degree holders have lifetime earnings of $2.9 million. This is a 61 percent increase (1.4 million) over Bachelora’s degree holders.

SUMMARY

Most Expensive States to Go to College

By looking at the tuition by state for Public 4-Year Universities, we found the 10 states that have the highest tuition cost, which are as follows:

The college tuitions have been increasing in almost every state, Louisiana has had 5-Year % Change of nearly 50%. From 2005-05 to 2017-18, tuition has increased in real dollars at public 4-Year Universities, after taking inflation into account.

What Major to Choose

Almost all of the top 10 highest-paying majors are in STEM fields. While, majors such as Spanish, Religion, Education are the lowest-paying majors. Also, there is a higer potential for salary increase for majors in STEM fileds as well. The majors that rank the lowest in starting salary are still at the bottom of the midcareer ranking.

After all, earning potential is an important factor to consider, there are dozens of factors to consider when choosing a college major. Does it align with your strengths? Is it something you’re passionate about?

Educational Attainment

The data are clear: More education leads to better prospects for earning. Earnings increase as educational attainment rises. The difference of earnings has been getting bigger and bigger along the years. Also, we are interested in comparing how earnings get affected by gender and ethnicity. As a women, it is sad to see that there is a big gap between women’s and men’s earnings, given the same educational attainment level. The higher the educational attainment level, the bigger the difference is. At the advanced degree level, men’s earnings are 74% higher than women’s; at the advanced degree level, men’s earnings are 70% higher than women’s.

Also, there is a lot of variation of the relative earnings of people of diff erent races/ethnicities relative to Whites. Among Black, for example, mean earnings are a lot than Whites. Hispanic have a similar pattern, but the earnings gap is generally in a lower range.

Lifelong Estimate

for males, having some postsecondary education, even without earning a degree, adds nearly 0.4 million dollars to lifetime earnings. Getting an Associate’s degree adds another bump of nearly 0.1million in lifetime earnings. These numbers demonstrate conclusively the advantage of non-baccalaureate postsecondary education.

Getting a Bachelor’s degree adds another large increase in lifetime earnings, $2.3 million over a lifetime, Bachelor’s degree holders earn 35 percent more than workers with an Associate’s degree and 78 percent more than those with just a high school diploma. Further, obtaining a Bachelora’s is also the gateway to entering and completing graduate education. About one-third of Bachelor’s degree holders obtain a graduate degree.

All graduate degree holders can expect lifetime earnings at least double that of those with only a high school diploma.

Moreover, earnings rise substantially for those with Advanced degrees: Master’s Degree or Doctoral degree holders have lifetime earnings of $2.9 million. This is a 61 percent increase (1.4 million) over Bachelora’s degree holders.