According to debt.org, the average borrower for the college class of 2017 will carry over $38,000 in student debt. They go on to state that since 1980, the tuition costs for public institutions have increased by 344% and 241% for private universities. However, a college education is an entry requirement for many of the highest-earning professions and graduates can expect to make more than 1.3 million more in their lifetime than high school graduates (debt.org). As of July 2018, student debt is now the second highest consumer debt category with the total student loan debt is over 1.5 trillion and there are 44.2 million American borrowers with a student loan debt, with a whopping 10.7% either delinquency or in default (Friedman, 2018).
The U.S. Bureau of Labor Statistics (BLS) that, “the more you learn, the more you earn” (Torpey, 2018). This statement highlights that their data shows that those with the higher levels of educational attainment had more than triple the median weekly earnings than those with less than a high school diploma in 2017. Unemployment rates also drop as the level of education increases (Torpey, 2018).
Georgetown University’s Center on Education and the Workforce, as of 2015, reported the economic value of a college major plays an immense factor with top-paying college majors earning $3.4 million more over a lifetime than the lowest-paying majors. STEM, health, and business majors are the highest paying with STEM and business also being the most popular and accounting for 46 percent of college graduates (Georgetown.Edu).
As the only MBA student in my Computer Science Data Analysis elective, I thought it would be interesting to compare the two for my research project.
IPUMS USA provided microdata from the American Community Surveys and was created by the UNIVERSITY OF MINNESOTA (IPUMS-USA, University of Minnesota, www.ipums.org). When you use this source to create a data set, it also creates a user document of the extract for all items that were included, which was extremely useful.
iplums_ACS2017 <- read_csv("C:/Users/Dan Ward/Desktop/CSC530/ResearchProject/ResearchProject/DATA/usa_00001.csv", col_types = cols(GQ = col_skip(), HHWT = col_skip()))
We’ll get started by glimpsing the data we’ve just imported.
It took a few minutes to import this data set with its 3,190,040 observations and 18 variables.
## Observations: 3,190,040
## Variables: 18
## $ YEAR <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ...
## $ DATANUM <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ SERIAL <dbl> 1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, ...
## $ CBSERIAL <dbl> 2.017e+12, 2.017e+12, 2.017e+12, 2.017e+12, 2.01...
## $ US2017A_ST <chr> "01", "01", "01", "01", "01", "01", "01", "01", ...
## $ US2017A_HINCP <chr> "0010000", "0038500", "0090700", "0090700", "009...
## $ PERNUM <dbl> 1, 1, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, ...
## $ PERWT <dbl> 206, 45, 136, 121, 111, 18, 13, 23, 21, 27, 21, ...
## $ US2017A_AGEP <chr> "73", "31", "41", "48", "16", "37", "18", "17", ...
## $ US2017A_COW <chr> "B", "1", "1", "B", "B", "4", "B", "B", "B", "B"...
## $ US2017A_MIL <chr> "4", "4", "4", "4", "B", "4", "4", "4", "B", "B"...
## $ US2017A_SCH <chr> "1", "1", "1", "1", "2", "1", "2", "2", "2", "1"...
## $ US2017A_SCHL <chr> "10", "21", "17", "16", "13", "16", "14", "13", ...
## $ US2017A_ESR <chr> "6", "1", "1", "6", "6", "1", "6", "6", "B", "B"...
## $ US2017A_FOD1P <chr> "BBBB", "6100", "BBBB", "BBBB", "BBBB", "BBBB", ...
## $ US2017A_FOD2P <chr> "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", ...
## $ US2017A_PERNP <chr> "0000000", "0038500", "0072000", "0000000", "000...
## $ US2017A_SOCP <chr> "BBBBBB", "119111", "472061", "BBBBBB", "BBBBBB"...
To make the data easier to use and read:
* Removed unnecessary variables
* Renamed variables to user-friendly names
* Remove potential N/A values
* Modified some variables to show as numeric instead of characters
* Filter ages to between 18 and 60 years old
I’m also going to a precaution by putting this data into a new dataframe. This might useful later, avoiding to import this large dataset if a mistake is made throughout our analysis.
iplums_ACS2017 %>%
select(US2017A_ST, US2017A_AGEP, US2017A_COW, US2017A_SCHL, US2017A_ESR, US2017A_FOD1P, US2017A_FOD2P, US2017A_PERNP, US2017A_SOCP, US2017A_SCH) %>%
# Make the columns names user-friendly
rename(State_code = US2017A_ST, # State Code
Age = US2017A_AGEP, # Age
Class = US2017A_COW, # Class of worker
School_Enrollment = US2017A_SCH, # School Enrollment
Education_Att = US2017A_SCHL, # Educational Attainment
Employment_Status = US2017A_ESR, # Employment Status
Degree1 = US2017A_FOD1P, # Recorded field of 1st degree
Degree2 = US2017A_FOD2P, # Recorded field of 2nd degree
Earnings = US2017A_PERNP, # Total Person's Earnings
SOC = US2017A_SOCP) %>% # SOC Code for 2012 and later
# Remove unreported stats before converting to numeric
filter(Class != 'B', # Remove N/A (less than 16 years old)
School_Enrollment != 'B', # Remove N/A (less than 3 years old)
Education_Att != 'B', # Remove N/A (less than 3 years old)
Employment_Status != 'B', # Remove N/A (less than 16 years old)
Earnings != 'BBBBBBB') %>% # Remove unreported earnings
# Convert Age & Earnings to numeric
mutate(Age = as.numeric(Age),
Education_Att = as.numeric(Education_Att),
Earnings = as.numeric(Earnings),
Class = as.numeric(Class),
School_Enrollment = as.numeric(School_Enrollment),
Employment_Status = as.numeric(Employment_Status)) %>%
# Set Ages to between 18 and 60
filter(Age >=18 & Age <= 60)-> cleaned_ACS2017
# View summary
summary(cleaned_ACS2017)
## State_code Age Class Education_Att
## Length:1512703 Min. :18.00 Min. :1.000 Min. : 1.00
## Class :character 1st Qu.:29.00 1st Qu.:1.000 1st Qu.:16.00
## Mode :character Median :40.00 Median :1.000 Median :19.00
## Mean :39.62 Mean :2.042 Mean :18.54
## 3rd Qu.:51.00 3rd Qu.:2.000 3rd Qu.:21.00
## Max. :60.00 Max. :9.000 Max. :24.00
## Employment_Status Degree1 Degree2 Earnings
## Min. :1.000 Length:1512703 Length:1512703 Min. : -9000
## 1st Qu.:1.000 Class :character Class :character 1st Qu.: 12000
## Median :1.000 Mode :character Mode :character Median : 31500
## Mean :1.703 Mean : 46695
## 3rd Qu.:1.000 3rd Qu.: 60000
## Max. :6.000 Max. :1130000
## SOC School_Enrollment
## Length:1512703 Min. :1.000
## Class :character 1st Qu.:1.000
## Mode :character Median :1.000
## Mean :1.161
## 3rd Qu.:1.000
## Max. :3.000
We’re at 1,512,703 observations. We should also filter Educational Attainment to a Bachelor’s degree (21) or higher and see what the effects this has.
# Filter Educational Attainment
cleaned_ACS2017 %>%
filter(Education_Att >= 21) -> cleaned_ACS2017
glimpse(cleaned_ACS2017)
## Observations: 515,009
## Variables: 10
## $ State_code <chr> "01", "01", "01", "01", "01", "01", "01", "0...
## $ Age <dbl> 31, 32, 41, 58, 30, 26, 54, 45, 44, 50, 43, ...
## $ Class <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 4, 1, 3, 1, 3, 1,...
## $ Education_Att <dbl> 21, 22, 22, 21, 21, 21, 21, 22, 22, 24, 23, ...
## $ Employment_Status <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1,...
## $ Degree1 <chr> "6100", "3401", "5200", "2102", "6203", "400...
## $ Degree2 <chr> "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", "BBB...
## $ Earnings <dbl> 38500, 43500, 25700, 344000, 120000, 50000, ...
## $ SOC <chr> "119111", "254021", "533030", "411012", "411...
## $ School_Enrollment <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
The observations decreased from 1,512,703 to 515,009 but the average earnings increased from 46,695 to $75,909.
I was shocked to see that about 34% of the Americans surveyed between the ages of 19-60 had at least a Bachelor’s degree and that increased the average earnings by $29,214.
It might be easier for us to refer to the states by name, rather than using the state code provided in the survey. We can use the stateTable data set to add State Names by state code.
# Load State Table
load("C:/Users/Dan Ward/Desktop/CSC530/Assignments/Source Data/stateTable.RDA")
stateTable %>%
select(FIPS, state_name) %>%
rename(State_code = FIPS, State = state_name) -> stateTable
# Merge stateTable and ACS2017 dataset
cleaned_ACS2017 <- merge(cleaned_ACS2017, stateTable, by = "State_code")
# Glimpse updated dataset
glimpse(cleaned_ACS2017)
## Observations: 515,009
## Variables: 11
## $ State_code <chr> "01", "01", "01", "01", "01", "01", "01", "0...
## $ Age <dbl> 31, 32, 41, 58, 30, 26, 54, 45, 44, 50, 43, ...
## $ Class <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 4, 1, 3, 1, 3, 1,...
## $ Education_Att <dbl> 21, 22, 22, 21, 21, 21, 21, 22, 22, 24, 23, ...
## $ Employment_Status <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1,...
## $ Degree1 <chr> "6100", "3401", "5200", "2102", "6203", "400...
## $ Degree2 <chr> "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", "BBB...
## $ Earnings <dbl> 38500, 43500, 25700, 344000, 120000, 50000, ...
## $ SOC <chr> "119111", "254021", "533030", "411012", "411...
## $ School_Enrollment <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
I also thought it would be useful to include the Job Descriptions for each of the SOC Codes, which I was able to download from: https://www.census.gov/topics/employment/industry-occupation/guidance/code-lists.html
Before we can use the SOC Job Descriptions, we’ll have to remove the ‘-’ in the SOC Codes used in the CSV file.
#Clean SOC_codes data by renaming columns and filter N/A values
SOC_codes %>%
rename(Description = `Occupation 2010 Description`,
SOC = `2010 SOC Code`) %>%
filter(!is.na(Description)) -> SOC_codes
# Remove '-' characters from SOC_codes
SOC_codes$SOC <- gsub('-', '', SOC_codes$SOC)
Now, we can add the Job Descriptions by the SOC Codes.
## Observations: 451,230
## Variables: 12
## $ SOC <chr> "111021", "111021", "111021", "111021", "111...
## $ State_code <chr> "21", "36", "47", "34", "48", "50", "38", "0...
## $ Age <dbl> 37, 49, 37, 49, 59, 49, 57, 42, 45, 35, 27, ...
## $ Class <dbl> 1, 2, 1, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Education_Att <dbl> 21, 22, 22, 21, 21, 21, 21, 21, 21, 21, 22, ...
## $ Employment_Status <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1,...
## $ Degree1 <chr> "3301", "5200", "5200", "5506", "6200", "400...
## $ Degree2 <chr> "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", "360...
## $ Earnings <dbl> 38000, 55000, 55000, 595000, 92000, 60000, 1...
## $ School_Enrollment <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ State <chr> "Kentucky", "New York", "Tennessee", "New Je...
## $ Description <chr> "General and operations managers", "General ...
You may have noticed that we’ve gone down from We’re down to 515,009 observations to 451,230 observations. Looks like we lost some due to bad SOC codes, I guess it happens. But let’s move on to the next step.
To begin, we’ll narrow down the type of degrees. There are hundreds of different degrees available in the American Community Survey. Let’s start with Business Management and Administration (6203) and Computer Science (2102) degrees.
Then we’ll compare it to other potential combinations
Business
* 6200 GENERAL BUSINESS
* 6201 ACCOUNTING
* 6203 BUSINESS MANAGEMENT AND ADMINISTRATION
* 6206 MARKETING AND MARKETING RESEARCH
* 6207 FINANCE
* 6209 HUMAN RESOURCES AND PERSONNEL MANAGEMENT
* 6210 INTERNATIONAL BUSINESS
* 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
Computer Science
* 2100 COMPUTER AND INFORMATION SYSTEMS
* 2101 COMPUTER PROGRAMMING AND DATA PROCESSING
* 2102 COMPUTER SCIENCE
* 2105 INFORMATION SCIENCES
* 2106 COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY
* 2107 COMPUTER NETWORKING AND TELECOMMUNICATIONS
* 2407 COMPUTER ENGINEERING
* 4005 MATHEMATICS AND COMPUTER SCIENCE
I wouldn’t want to type that more than once, so let’s work smarter and not harder by putting this into a vector.
Business_Degrees <- c("6200", "6201", "6203", "6206", "6207", "6209", "6210", "6212")
Computer_Degrees <- c("2100", "2101", "2102", "2105", "2106", "2107", "2407", "4005")
We’ll start by determining individuals with Business (6203) or Computer Science (2102) Degrees. This dataframe produces 41,428 observations, with a mean earnings of 79,857.
# Start with new dataframe for either Business or Computer Science degrees
cleaned_ACS2017 %>%
filter((Degree1 == "6203" | Degree1 == "2102") |
(Degree2 == "6203" | Degree2 == "2102")) -> Combined_Gen
summarise(Combined_Gen, AvgEarnings = mean(Earnings), Count = n())
Let’s use a bar chart to visualize the comparison of individuals with Business and Computer Science reported as their field of study.
Significantly more Business majors and very few with both a Business and Computer Science degree.
However, this boxplot tells a different story. Individuals with a Computer Science or Both a Business and Computer Science degree earned more on average.
Then create a second dataframe for any of the 16 degrees identified above. However, when we added the additional degrees, you’ll notice we have 114,909 observations and mean earnings increased slightly to 81,607.
cleaned_ACS2017 %>%
filter((Degree1 %in% Business_Degrees | Degree2 %in% Business_Degrees) |
(Degree1 %in% Computer_Degrees | Degree2 %in% Computer_Degrees)) -> Combined_Add
summarise(Combined_Add, AvgEarnings = mean(Earnings), Count = n())
This bar chart looks similar to the previous chart for only general Business and Computer Science degrees.
Here, it seems that those with both degrees seemed to drop slightly. While Business and Computer Science had many outliers, an individual with Both stuck out well above the rest.
Next, we’ll take a side-by-side comparison of the General and Additional degree datasets.
You probably noticed the scales changed, as Business Majors earned more while Both Majors and Computer Science Majors made slightly less.
I also thought it would be interesting to see how the different degrees we identified earlier might affect Median Earnings. Without wanting to look at a list of the many variations, let’s visualize this with a Cleveland Plot.
Looks like 4005 MATHEMATICS AND COMPUTER SCIENCE degree graduates earn much more than any other degree.
It looked like the minimums and maximums were very close to the same. We’ll use the second dataframe going forward because the additional observations could prove beneficial. Do you agree with the additional degrees?
Before moving on, I don’t want to try remembering all the Degree Field Codes or have to keep looking them up, so let’s add them to our dataframe. This will also help limit the amount of retyping code, like in the previous graphics.
While we’re here, let’s also define a new variable called Major so we’ll quickly know if it’s a Business, Computer Science, or Both. We can also update Educational Attainment.
Combined_Add %>%
# Use the type of degree to define a new variable 'Major' as Business, Computer, or Both
mutate(Major = case_when(Degree1 %in% Business_Degrees & Degree2 %in% Computer_Degrees ~ "Both",
Degree1 %in% Computer_Degrees & Degree2 %in% Business_Degrees ~ "Both",
Degree1 %in% Business_Degrees ~ "Business",
Degree2 %in% Business_Degrees ~ "Business",
Degree1 %in% Computer_Degrees ~ "Computer",
Degree2 %in% Computer_Degrees ~ "Computer")) %>%
# Replace the Education Attainment code with the level of education
mutate(Education_Att = case_when(Education_Att == 21 ~ "Bachelor",
Education_Att == 22 ~ "Master",
Education_Att == 23 ~ "Professional",
Education_Att == 24 ~ "Doctorate",
TRUE ~ "")) -> Combined_Add
We can also look at the other variables to determine if any other modifications are needed.
As one would image, it appears Employment Status has an impact on earnings. Overall, unemployment looks low for those for this segment of individuals. Out of the 114,909, only 2,754 reported being unemployed, which is roughly 2.4%. Those that reported employed and at work reported average earnings of $88,040.
To help visualize this, let show a bar chart of the counts and show average earnings.
NOTE: to display the values as currency, I used the dollar_format from the scales package.
I think we might want to remove 3-6 but first let’s look at these in Washington State to see if there are any potential issues we should know about.
Combined_Add %>%
mutate(Employment_Status = case_when(Employment_Status == "1" ~ "Employed",
Employment_Status == "2" ~ "Employed, not at work",
Employment_Status == "3" ~ "Unemployed",
Employment_Status == "4" ~ "Armed forces",
Employment_Status == "5" ~ "Armed forces, not at work",
Employment_Status == "6" ~ "Not in labor force",
TRUE ~ "")) %>%
filter(State == "Washington") %>%
group_by(Employment_Status) %>%
summarise(AvgEarnings = mean(Earnings), Count = n())
I don’t see anything unusual so I’m going to get rid of 3-6.
Combined_Add %>%
filter(Employment_Status <= 2) -> Combined_Add
I didn’t see any Class 9, which is good because we just removed unemployed from our data. I not working without pay in a family business or farm, so we’ll also drop 8.
Combined_Add %>%
filter(Employment_Status <= 7) -> Combined_Add
Individuals that reported as currently attending school earned less than those that have not attended school in the last three months. I was a little surprised to see the average earnings from those attending public or private school. We’ll assume their attending college based on the filtering we’ve done so far. Overall this is interesting, but I don’t see any issues so we’ll leave it and move on.
Let’s see what this looks like now. We dropped from 114,909 observations to 104,951.
## Observations: 104,951
## Variables: 13
## $ SOC <chr> "111021", "111021", "111021", "111021", "111...
## $ State_code <chr> "48", "38", "08", "37", "06", "26", "36", "1...
## $ Age <dbl> 59, 57, 47, 34, 39, 51, 47, 44, 43, 51, 39, ...
## $ Class <dbl> 7, 7, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,...
## $ Education_Att <chr> "Bachelor", "Bachelor", "Master", "Bachelor"...
## $ Employment_Status <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Degree1 <chr> "6200", "6203", "6200", "6200", "6200", "620...
## $ Degree2 <chr> "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", "BBB...
## $ Earnings <dbl> 92000, 120000, 52000, 63000, 275000, 150000,...
## $ School_Enrollment <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ State <chr> "Texas", "North Dakota", "Colorado", "North ...
## $ Description <chr> "General and operations managers", "General ...
## $ Major <chr> "Business", "Business", "Business", "Busines...
After removing those additional items in the previous steps, let’s see what the counts and average earnings are for Business, Computer Science, and Both Degrees.
We could have combined these into one statement…
Combined_Add %>%
group_by(Major) %>%
summarise(AvgEarnings = mean(Earnings), Counts = n())
I think that looks a lot cleaner.
# Filter for WA State
Combined_Add %>%
filter(State == "Washington") -> Combined_WA
# Group By Major and show Average Earnings & Counts
Combined_WA %>%
group_by(Major) %>%
summarise(AvgEarnings = mean(Earnings), Counts = n())
While individuals with either a Business or Computer Science degree earned more in Washington than the national average, individuals with Both earned less.
A density plot should provide an effective way to view the distribution of the Earnings variable. Let’s do a side-by-side comparison of the national versus Washington State.
We had 10 jobs overlap from our two TOP 20 Jobs lists.
Nothing surprising here.
Let’s look at another Cleveland dotplot to show the average Earnings by State. Plus, it’s Dr. Nelson’s favorite.
Seemed like the State make a pretty large difference for Earnings.
Next, let’s create a boxplot to look at how Education Attainment affects income in Washington.
Seems like the statement, “the more you learn, the more you earn” is an accurate statement.
Now, let’s see if we can find a better to visualize how the level of educational attainment affects earnings.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
It was interesting to see how age affects earnings, though makes sense when you take experience into consideration. There was a pretty significant bump for individuals with a Professional degree, just before 50 years old.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can see these bumps better in this graphic. Those with Both degrees have a larger variation. I thought it was interesting to see those with Both degrees and those with Computer Science degrees seemed to start dropping off in their mid-50s. I wonder if it has something to do with early retirement or the pace of change in technology.
I also thought it might be beneficial to see earnings on a map of the states.
First, we’ll use TidyCensus to get the Median Income.
## Get ACS data for Median Income for both Bachelor and Master/Professional
WA_State <- get_acs(geography = "state",
variables = "B20004_001",
geometry = TRUE,
shift_geo = TRUE)
Next, we’ll create a new dataframe to merge our datasets. After merging the datasets, we’ll create a new variable to take the Average Earnings for Business, Computer Science, & Both groups by State, and then subtract it from each States Median Income.
Now we’re ready to map the results.
Friedman, Z. (2018, June 13). Student Loan Debt Statistics In 2018: A $1.5 Trillion Crisis. Retrieved from https://www.forbes.com/sites/zackfriedman/2018/06/13/student-loan-debt-statistics-2018/#124d2cea7310
NACE Staff (2017, January 25). COMPUTER SCIENCE PROJECTED AS TOP-PAID MASTER’S GRADS. Retrieved from http://www.naceweb.org/job-market/compensation/computer-science-projected-as-top-paid-masters-grads/
Occupation Definitions: https://www2.census.gov/programs-surveys/demo/guidance/industry-occupation/soc_2018_definitions.pdf
The Economic Value of College Majors. (2018, June 12). Retrieved from https://cew.georgetown.edu/cew-reports/valueofcollegemajors/
Torpey, E. (2018, April). Measuring the value of education: Career Outlook. Retrieved from https://www.bls.gov/careeroutlook/2018/data-on-display/education-pays.htm
The Labor Market for Recent College Graduates. (2018, January 25). Retrieved from https://www.newyorkfed.org/research/college-labor-market/index.html
Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS USA: Version 8.0 [dataset]. Minneapolis, MN: IPUMS, 2018. https://doi.org/10.18128/D010.V8.0
Student Loan Resources: Financial Aid & Loan Debt Management. (n.d.). Retrieved October 20, 2018, from https://www.debt.org/students/