Final Project: Equality and Tertiary Education

“Until we get equality in education, we won’t have an equal society.”
Topic Background
Tidywork: Making Sense of our Data
Question 1: How do racial and ethnic demographics vary between private and public institutions in our dataset?
Question 2: How do racial and ethnic demographics vary between not-for-profit institutions and for-profit colleges in our dataset?
Question 3: What is the relationship between tuition rate and salary potential?
Question 4: Are For-Profit Tuition Costs Higher than Non-Profit Tuition Costs?
Question 5: How do diversity, tuition rates and salary rates at the top 20 schools compare to other schools?
Data Analysis Summaries
Future Research
References

“Until we get equality in education, we won’t have an equal society.”

~ Supreme Court Justice Sonia Sotomayor

Topic Background

Initial Presentation: Racial Demographics at For-Profit and Not-For-Profit Educational Institutions

Each of our team members originally selected a few datasets that we compared as a group. Our group decided to explore a dataset Ana found related to the topic of racial and ethnic diversity at two-year and four-year colleges and universities in the United States. We initially thought to compare diversity at two-year community colleges versus 4-year schools and also were curious about the variations in diversity between the two that could be found comparatively across the country.

As we were exploring the dataset we identified a number of named colleges and universities that are known as “for-profit” institutions. Some of these institutions have been the subject of recent lawsuits regarding predatory (Halperin, 2020; Legal Services Center, 2020; Redman, 2020). The high number of current lawsuits regarding for-profit colleges’ predatory behaviors are a result of the actions of current Education Secretary Betsy DeVos. In 2019, DeVos “repealed an Obama-era regulation that sought to crack down on for-profit colleges and universities that produced graduates with no meaningful job prospects and mountains of student debt they could not hope to repay” (Green, 2019. para. 1). Another action that DeVos took was to deny debt forgiveness to students who had been prey to for-profit predatory educational institutions (Lobosco, 2019; Turner, 2019). We were aware of past and recent news articles describing the marketing strategies of these colleges that were aimed at students of color, low-income students, immigrant communities and students who are first in their family to go to college (Bonadies et al. 2018; Conti, 2019; Voorhees, 2019). Previous studies have found that loan debt is higher for students who have attended For-Profit institutions, which disproportionately affects students of color.

library(git2r)
library(usethis)

Race and For-Profits

(Body, 2019)

Loan Default Rates at For-Profits

(Body, 2019)

Despite their high cost, For-profit institutions have a lower graduation rate and employment rate than non-profit institutions.

For-Profit Graduation & Employment Rates

(Lopez, 2015)

We have examined our dataset and explored possible consistencies and/or inconsistencies with these previous reports.

Advancements in Analysis Since the Initial Presentation

1. Incorporation of Census Demographic Data

After the initial presentation of our analysis, questions from our colleagues and professor resulted in discussions regarding the comparative proportions of racial and ethnic groups at schools compared to the overall proportion of racial and ethnic backgrounds in the United States. We therefore gathered demographic data from the U.S. Census Bureau and incorporated that data into our analysis.

2. Incorporation of Tuition Cost and Salary Potential Data

Additionally, we received questions regarding economic inequalities such as differences in tuition, particularly at highly-ranked private, non-profit institutions such as Harvard. We therefore merged two additional datasets for tuition and salary potential into our previously compiled dataset regarding diversity at for-profit and non-profit schools. This dataset is titled “monsterfinal”. From this dataset we were able too begin a visual exploration of the data regarding tuition costs and salary potential at early and mid-career markers. This analysis was further developed by exploring the differences in in-state tuition versus out-of-state tuition and salary potential at public non-profit schools.

3. Analysis of the Data for the Top Ranked U.S. Tertiary Education Institutions

Lastly, we initiated a visual comparative analysis of cost of tuition of the top ranked of schools. We completed a second webscraping to compile a list of the top ranked tertiary education programs in the U.S. We utilized this list to create visualizations on the cost, salary potential and demographics at these schools.

Initial Questions

How do racial and ethnic demographics vary between private and public institutions in our dataset?
How do racial and ethnic demographics vary between not-for-profit institutions and for-profit colleges in our dataset?

New Additional Questions

What is the relationship between tuition rate and salary potential?
Are For-Profit Tuition Costs Higher than Non-Profit Tuition Costs?
How do diversity, tuition rates and salary rates at the top 20 schools compare to other schools?

Definitions:

For-Profit Institutions: For-profit institutions are defined by the way that “revenue earned by the school is invested”. For-profit colleges have investors who want to make a profit. Their operations management is determined in part on maximizing the return profit for investors. “Money earned by the school may be used to pay out investors and award bonuses to executives, as well as sustain the operation’s profitability through aggressive marketing and recruitment strategies” (TBS Staff, 2019, para. 8)

Not-for Profit Institutions: Non-profit colleges can be either public or private. Regardless of whether the school is public or private, non-profit colleges must “reinvest the money earned through enrollment into the educational mission” (TBS Staff, 2019, para. 8).

Tidywork: Making Sense of our Data

Webscraping and Randomly Selecting Sample of 35 For-Profit Colleges

Our dataset did not include information on whether colleges were for-profit.

In order to determine which of the schools on our list were for-profit institutions, we decided to find a website with a list of for-profit institutions, scrape the data from that website, to create a second dataset with a list of for profit schools.

Following Web-Scraping Tutorial to Scrape Wikipedia Website

Loading Libraries

#Loading the rvest package
library('rvest')

## Loading required package: xml2

library(tidyverse)

## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x purrr::is_empty()       masks git2r::is_empty()
## x dplyr::lag()            masks stats::lag()
## x purrr::pluck()          masks rvest::pluck()
## x dplyr::pull()           masks git2r::pull()
## x purrr::when()           masks git2r::when()

library(dplyr)

Creating a data set with all of our variables

diversity <- read_csv("diversity2.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   total_enrollment = col_double(),
##   state = col_character(),
##   category = col_character(),
##   enrollment = col_double()
## )

salary <- read_csv("salary_potential.csv")

## Parsed with column specification:
## cols(
##   rank = col_double(),
##   name = col_character(),
##   state_name = col_character(),
##   early_career_pay = col_double(),
##   mid_career_pay = col_double(),
##   make_world_better_percent = col_double(),
##   stem_percent = col_double()
## )

salary_added <- left_join(diversity,salary,by="name")
salary_added_final <- na.omit(salary_added)

tuition <- read_csv("tuition_cost.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   state = col_character(),
##   state_code = col_character(),
##   type = col_character(),
##   degree_length = col_character(),
##   room_and_board = col_double(),
##   in_state_tuition = col_double(),
##   in_state_total = col_double(),
##   out_of_state_tuition = col_double(),
##   out_of_state_total = col_double()
## )

tuition_added <- left_join(tuition,salary_added_final,by="name")
tuition_added_final <- na.omit(tuition_added)

monsterfinal <- tuition_added_final

Cleaning the Data

monsterfinal <- monsterfinal %>% filter(category != "Women" & category != "Two Or More Races" & category != "Non-Resident Foreign" & category != "Unknown" & category != "Total Minority")

monsterfinal <- monsterfinal %>%  mutate(enrollment_percentage = enrollment/ total_enrollment*100)

monsterfinal %>% 
  mutate_if(is.numeric, round, digits = 2)

## # A tibble: 3,588 x 21
##    name  state.x state_code type  degree_length room_and_board in_state_tuition
##    <chr> <chr>   <chr>      <chr> <chr>                  <dbl>            <dbl>
##  1 Adam~ Colora~ CO         Publ~ 4 Year                  8782             9440
##  2 Adam~ Colora~ CO         Publ~ 4 Year                  8782             9440
##  3 Adam~ Colora~ CO         Publ~ 4 Year                  8782             9440
##  4 Adam~ Colora~ CO         Publ~ 4 Year                  8782             9440
##  5 Adam~ Colora~ CO         Publ~ 4 Year                  8782             9440
##  6 Adam~ Colora~ CO         Publ~ 4 Year                  8782             9440
##  7 Agne~ Georgia GA         Priv~ 4 Year                 12330            41160
##  8 Agne~ Georgia GA         Priv~ 4 Year                 12330            41160
##  9 Agne~ Georgia GA         Priv~ 4 Year                 12330            41160
## 10 Agne~ Georgia GA         Priv~ 4 Year                 12330            41160
## # ... with 3,578 more rows, and 14 more variables: in_state_total <dbl>,
## #   out_of_state_tuition <dbl>, out_of_state_total <dbl>,
## #   total_enrollment <dbl>, state.y <chr>, category <chr>, enrollment <dbl>,
## #   rank <dbl>, state_name <chr>, early_career_pay <dbl>, mid_career_pay <dbl>,
## #   make_world_better_percent <dbl>, stem_percent <dbl>,
## #   enrollment_percentage <dbl>

Question 1: How do racial and ethnic demographics vary between private and public institutions in our dataset?

Plot of Enrollment Percentages based on Race & Ethnicity: Combined For-profit and Non-profit Data, Without Census Data

summary(monsterfinal)

##      name             state.x           state_code            type          
##  Length:3588        Length:3588        Length:3588        Length:3588       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  degree_length      room_and_board  in_state_tuition in_state_total 
##  Length:3588        Min.   : 4275   Min.   : 4220    Min.   :11816  
##  Class :character   1st Qu.: 9040   1st Qu.:10370    1st Qu.:20264  
##  Mode  :character   Median :10584   Median :28312    Median :37830  
##                     Mean   :11030   Mean   :27561    Mean   :38590  
##                     3rd Qu.:13020   3rd Qu.:41642    3rd Qu.:54042  
##                     Max.   :18156   Max.   :58230    Max.   :75003  
##  out_of_state_tuition out_of_state_total total_enrollment   state.y         
##  Min.   : 6570        Min.   :13674      Min.   :  457    Length:3588       
##  1st Qu.:21683        1st Qu.:31578      1st Qu.: 1960    Class :character  
##  Median :30096        Median :40478      Median : 3788    Mode  :character  
##  Mean   :31863        Mean   :42893      Mean   : 7775                      
##  3rd Qu.:41760        3rd Qu.:54450      3rd Qu.:10646                      
##  Max.   :58230        Max.   :75003      Max.   :60767                      
##    category           enrollment           rank        state_name       
##  Length:3588        Min.   :    0.0   Min.   : 1.00   Length:3588       
##  Class :character   1st Qu.:   15.0   1st Qu.: 5.00   Class :character  
##  Mode  :character   Median :  102.0   Median :11.00   Mode  :character  
##                     Mean   : 1096.7   Mean   :11.33                     
##                     3rd Qu.:  796.2   3rd Qu.:17.00                     
##                     Max.   :33293.0   Max.   :25.00                     
##  early_career_pay mid_career_pay   make_world_better_percent  stem_percent  
##  Min.   :32500    Min.   : 61900   Min.   :34.00             Min.   : 0.00  
##  1st Qu.:45500    1st Qu.: 81200   1st Qu.:48.00             1st Qu.: 8.00  
##  Median :49400    Median : 89100   Median :52.00             Median :14.00  
##  Mean   :51055    Mean   : 92510   Mean   :53.06             Mean   :17.36  
##  3rd Qu.:55000    3rd Qu.:100500   3rd Qu.:58.00             3rd Qu.:22.00  
##  Max.   :88800    Max.   :158200   Max.   :86.00             Max.   :97.00  
##  enrollment_percentage
##  Min.   : 0.0000      
##  1st Qu.: 0.3656      
##  Median : 2.9349      
##  Mean   :14.3040      
##  3rd Qu.:10.5827      
##  Max.   :98.8543

ggplot(data=monsterfinal, aes(x=category, y=enrollment_percentage, fill=type)) +
geom_bar(stat="identity", position=position_dodge()) +
  ggtitle("Public & Private Racial & Ethnic Percent Enrollment Comparisons") +
  labs(y = "Enrollment Percentage", x = "Race/Ethnicity")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  theme_minimal()+
   scale_fill_manual(values=c("#9999CC", "#66CC99"))

This plot suggests that there are higher proportions of Black and Native Hawaiian/Pacific Islanders at private schools than public schools and a higher proportion of White students at non-profit schools than for-profit schools. Further computational analysis is required to determine significance. One critique raised of this plot is that population proportion information was not provided, which we incorporate below.

library(dplyr)
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:git2r':
## 
##     config

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

monsterfinal %>%
  mutate_if(is.numeric, round, digits = 2)  %>%
  ggplot(aes(x=category, y=enrollment_percentage, fill=type, text = paste("Percent:", enrollment_percentage, '</br>', '</br>School Name:', name))) +
geom_bar(stat="identity", position=position_dodge()) +
  ggtitle("For-Profit/Non-Profit Racial & Ethnic Comparisons") +
  labs(y = "Enrollment Percentage", x = "Race/Ethnicity")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  theme_minimal()

ggplotly(tooltip = "text")

This plot presents the same information as above, however instead of summary information, the interactivity of this plot allows readers to examine the enrollment percentages of each race/ethnicity by school name.

library(dplyr)
sum(tuition_added$type == "Public")

## [1] 3994

sum(tuition_added$type == "Private")

## [1] 5101

sum(tuition_added$type == "For Profit")

## [1] 107

library(RColorBrewer)

ggplot(monsterfinal, aes(reorder(type, enrollment_percentage), enrollment_percentage, fill = type)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~ category)+
  ggtitle("For-Profit/Non-Profit Racial & Ethnic Comparisons") +
  labs(y = "Enrollment Percentage", x = "Race/Ethnicity")+
  theme_bw()+
  scale_fill_brewer()

This facet wrap plot is consist with the barplot shown above, however the scale is not quite correct. The scale problem was present for a number of our facet-wrap plots with enrollment percentage and will be discussed further below.

Examination of 2-year and 4-year Program Racial and Ethnic Demographics

tuition4plot <- tuition_added %>% 
filter(category != "Women" & category != "Unknown" & category != "Two Or More Races" & category != "Total Minority" & category != "Non-Resident Foreign")

ggplot(data=tuition4plot, aes(x=category, y=enrollment, fill=degree_length)) +
geom_bar(stat="identity", position=position_dodge()) +
  ggtitle("2 Year & 4 Year Racial & Ethnic Enrollment Comparisons") +
  labs(y = "Enrollment Totals", x = "Race/Ethnicity")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  theme_minimal()+
   scale_fill_manual(values=c("#CC6666", "#9999CC", "#66CC99"))

This graph provides total enrollment numbers and suggests that White students and Latinx students are represented in much greater numbers than other racial and ethnic groups at 4 year schools.

Question 2: How do racial and ethnic demographics vary between not-for-profit institutions and for-profit colleges in our dataset?

Comparing For-Profit and Non-Profit Data with Census Data

Creating a Dataset with US Census information

According to the US census population statistics website the following is the breakdown of the racial demographics in the United States.

White: 72%
Black or African American alone - 12.7%
American Indian and Alaska Native alone - 0.9%
Asian alone - 5.6%
Native Hawaiian and Other Pacific Islander alone - 0.2%
Hispanic - 18%
Some other race alone - 5.0%
Two or more races - 3.4%

census <- data.frame("category" = c("White", "Black", "American Indian/Alaska Native", "Asian", "Native Hawaiian / Pacific Islander", "Hispanic"), "Percent_Pop" = c(.72, .127, .09, .056, .002, .18))

census

##                             category Percent_Pop
## 1                              White       0.720
## 2                              Black       0.127
## 3      American Indian/Alaska Native       0.090
## 4                              Asian       0.056
## 5 Native Hawaiian / Pacific Islander       0.002
## 6                           Hispanic       0.180

pcensus <- ggplot(census, aes(reorder(x = category, Percent_Pop), y = Percent_Pop, fill = category))+
  geom_bar(stat = "identity")+ 
  geom_text(aes(label = scales::percent(Percent_Pop), 
                  y = Percent_Pop, 
                  group = category, vjust = -.2))+
  ggtitle("Percentage of U.S. Population by Race/Ethnicity") +
  labs(y = "Percent of Population", x = "Race/Ethnicity")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  theme_minimal()+
   scale_fill_brewer()

pcensus

Creating Sample Datasets of For-Profit and Non-Profit Institutions to Merge with the Census Information

Viewing data

head(diversity)

## # A tibble: 6 x 5
##   name                total_enrollment state  category                enrollment
##   <chr>                          <dbl> <chr>  <chr>                        <dbl>
## 1 University of Phoe~           195059 Arizo~ Women                       134722
## 2 University of Phoe~           195059 Arizo~ American Indian / Alas~        876
## 3 University of Phoe~           195059 Arizo~ Asian                         1959
## 4 University of Phoe~           195059 Arizo~ Black                        31455
## 5 University of Phoe~           195059 Arizo~ Hispanic                     13984
## 6 University of Phoe~           195059 Arizo~ Native Hawaiian / Paci~       1019

diversity <- read_csv("diversity2.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   total_enrollment = col_double(),
##   state = col_character(),
##   category = col_character(),
##   enrollment = col_double()
## )

Cleaning the Data

There were numerous “tidying” exercises that we undertook to clean our data and make the webscraping dataset and the diversity dataset comparable. Our tidying exercises included, but were not limited to, creating the following: 1. a new column calculating the percent of students in attendance based on race and ethnicity 2. a new binary column identifying the selected for-profit and not-for-profit schools represented by “1” and “0” respectively.
3. two new datasets: one for our randomly selected for-profit schools and one for our randomly selected not-for-profit schools.

Creating new column for percent enrollment

diversity_new <- diversity %>%  mutate(enrollment_percentage = enrollment/ total_enrollment*100)

head(diversity_new)

## # A tibble: 6 x 6
##   name        total_enrollment state  category      enrollment enrollment_perce~
##   <chr>                  <dbl> <chr>  <chr>              <dbl>             <dbl>
## 1 University~           195059 Arizo~ Women             134722            69.1  
## 2 University~           195059 Arizo~ American Ind~        876             0.449
## 3 University~           195059 Arizo~ Asian               1959             1.00 
## 4 University~           195059 Arizo~ Black              31455            16.1  
## 5 University~           195059 Arizo~ Hispanic           13984             7.17 
## 6 University~           195059 Arizo~ Native Hawai~       1019             0.522

Removing Categories: Women, Non-Resident Foreign and Two or More Races

We removed these categories to eliminate categories where there could be duplicitous counts.

diversity_new5 <- diversity_new %>% filter(category != "Women" & category != "Two Or More Races" & category != "Non-Resident Foreign" & category != "Unknown" & category != "Total Minority")

This function was used to assess how many separate schools were included in the dataset. There were over 4000 schools listed.

# unique(diversity_new5$name)

These functions were used to attempt to find individual schools. It was successful.

We checked the structure of our new dataset.

str(diversity_new5)

## tibble [27,630 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name                 : chr [1:27630] "University of Phoenix-Arizona" "University of Phoenix-Arizona" "University of Phoenix-Arizona" "University of Phoenix-Arizona" ...
##  $ total_enrollment     : num [1:27630] 195059 195059 195059 195059 195059 ...
##  $ state                : chr [1:27630] "Arizona" "Arizona" "Arizona" "Arizona" ...
##  $ category             : chr [1:27630] "American Indian / Alaska Native" "Asian" "Black" "Hispanic" ...
##  $ enrollment           : num [1:27630] 876 1959 31455 13984 1019 ...
##  $ enrollment_percentage: num [1:27630] 0.449 1.004 16.126 7.169 0.522 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   total_enrollment = col_double(),
##   ..   state = col_character(),
##   ..   category = col_character(),
##   ..   enrollment = col_double()
##   .. )

Changing Names of Schools with Multiple Locations into One Name (separate rows remain)

This initial approach went one-by-one. For schools with multiple locations, some with 20 or more, we needed a faster and more efficient option.

diversity_new5$name[diversity_new5$name == "Spencerian College at Louisville (Ky.)"] <- "Spencerian College"
diversity_new5$name[diversity_new5$name == "Spencerian College at Lexington (Ky.)"] <- "Spencerian College"
diversity_new5$name[diversity_new5$name == "NewSchool of Architecture and Design"] <- "NewSchool Arch. Design"
diversity_new5$name[diversity_new5$name == "Western International University"] <- "Western Intl Univ."
diversity_new5$name[diversity_new5$name == "Schiller International University"] <- "Schiller Intl Univ."
diversity_new5$name[diversity_new5$name == "National Paralegal College"] <- "Natl Paralegal College"
diversity_new5$name[diversity_new5$name == "WestCoast University"] <- "West Coast Univ."

#Failed Attempt
# ME <- diversity_new %>% 
# select(contains('Mildred Elley'))
# ME

Name change for the entire group containing a portion of the name

We discovered the “grepl” function which allowed us to change entire groups of rows that contained a similar word or set of words. All of the different locations could be changed in one set of code instead of multiple sets.

diversity_new5$name[grepl('Mildred Elley', diversity_new5$name)] <- 'Mildred Elley'

Showing the Multiple Location Indicies to Verify Name Change Success After

This code identified all of the entries that included “West Coast University” in the name.

which(grepl("West Coast University", diversity_new5$name))

##  [1] 12781 12782 12783 12784 12785 12786 13459 13460 13461 13462 13463 13464
## [13] 15265 15266 15267 15268 15269 15270 16747 16748 16749 16750 16751 16752
## [25] 20113 20114 20115 20116 20117 20118 27145 27146 27147 27148 27149 27150

These indices were used to show the current names in the dataset.

diversity_new5$name[14911]

## [1] "Salus University"

diversity_new5$name[15707]

## [1] "Concorde Career College at Garden Grove (Calif.)"

diversity_new5$name[17813]

## [1] "Argosy University Inland Empire"

diversity_new5$name[19541]

## [1] "Fortis College at Cincinnati"

diversity_new5$name[23466]

## [1] "Beacon College"

diversity_new5$name[31675]

## [1] NA

Single Line of Code to Change All Location Names to One Name

diversity_new5$name[grepl('West Coast University', diversity_new5$name)] <- 'West Coast Univ.'

Verifying that the group name change worked for each group

We used the same indices to ensure that the names of all groups changed to the single name.

diversity_new5$name[14911]

## [1] "Salus University"

diversity_new5$name[15707]

## [1] "Concorde Career College at Garden Grove (Calif.)"

diversity_new5$name[17813]

## [1] "Argosy University Inland Empire"

diversity_new5$name[19541]

## [1] "Fortis College at Cincinnati"

diversity_new5$name[23466]

## [1] "Beacon College"

diversity_new5$name[31675]

## [1] NA

One More Name Change Test

which(grepl("Brookline College", diversity_new5$name))

##  [1] 13495 13496 13497 13498 13499 13500 19135 19136 19137 19138 19139 19140
## [13] 20281 20282 20283 20284 20285 20286 21247 21248 21249 21250 21251 21252

diversity_new5$name[15746]

## [1] "Concorde Career College at San Diego"

diversity_new5$name[24791]

## [1] "Gupton Jones College of Funeral Service"

diversity_new5$name[grepl("Brookline College", diversity_new5$name)] <- "Brookline College"

diversity_new5$name[15746]

## [1] "Concorde Career College at San Diego"

diversity_new5$name[24791]

## [1] "Gupton Jones College of Funeral Service"

Simplifying Names for all Schools in our Samples

We changed all of the names of our schools to simplified versions which was particularly important for the schools with multiple locations.

Jennifer - For-Profit Name Changes

diversity_new5$name[grepl('Clover Park Technical College', diversity_new5$name)] <- 'Clover Park Tech College'
diversity_new5$name[grepl('San Diego City College', diversity_new5$name)] <- 'San Diego City College'
diversity_new5$name[grepl('Aspen University', diversity_new5$name)] <- 'Aspen Univ.'
diversity_new5$name[grepl('Grand Canyon University', diversity_new5$name)] <- 'Grand Canyon Univ.'
diversity_new5$name[grepl('American Public University', diversity_new5$name)] <- 'American Public Univ.'

Ana - For-Profit Name Changes

diversity_new5$name[grepl('Blue Cliff College', diversity_new5$name)] <- 'Blue Cliff College' 
diversity_new5$name[grepl('University of Phoenix', diversity_new5$name)] <- 'University of Phoenix' 
diversity_new5$name[grepl('Stevens-Henager College', diversity_new5$name)] <- 'Stevens-Henager College' 
diversity_new5$name[grepl('DeVry University', diversity_new5$name)] <- 'DeVry Univ.' 
diversity_new5$name[grepl('Pioneer Pacific College', diversity_new5$name)] <- 'Pioneer Pacific College' 
diversity_new5$name[grepl('National College at', diversity_new5$name)] <- 'National College'

Tiffany - For-Profit Name Changes

diversity_new5$name[grepl('Strayer University',diversity_new5$name)]<- 'Strayer Univ.'
diversity_new5$name[grepl('Lincoln Tech',diversity_new5$name)]<- 'Lincoln Tech' 
diversity_new5$name[grepl('Fashion Institute of Design and Merchandising', diversity_new5$name)]<- 'Fashion Institute'
diversity_new5$name[grepl('Centura College',diversity_new5$name)]<- 'Centura College'
diversity_new5$name[grepl('Rasmussen College',diversity_new5$name)]<- 'Rasmussen College' 
diversity_new5$name[grepl('Fortis College',diversity_new5$name)]<- 'Fortis College'

One Name that Refused to Change

For some reason Southeastern Community College (Iowa) would not change names. It also would not appear if we searched it using the which function. We were only able to filter this school by the total_enrollment column. Perhaps the difficulties with the name are a result of the parentheses. Eventually we found the gsub() function to remove the parentheses from the name.

library(stringr)
#str_replace(diversity_new5$name, "\\(.*\\)", "")

#diversity_new5$name <- as.character(diversity_new5$name)
#unlist(strsplit(diversity_new5$name, " \\(.*\\)"))

#library(plyr)
#diversity_new5$name <- as.character(diversity_new5$name)
#c<-strsplit(diversity_new5$name, "\\(")
#ldply(c)

# str_replace_all(diversity_new5$name, "[^[:alnum:]]", " ")

# str_replace_all(diversity_new5$name, "[[:punct:]]", " ")

diversity_new5$name <- gsub("Southeastern Community College (Iowa)", "Southeastern Community College Iowa", diversity_new5$name, fixed=TRUE)

which(grepl("Southeastern Community College Iowa", diversity_new5$name))

## [1] 9217 9218 9219 9220 9221 9222

which(grepl("Southeastern Community College (Iowa)", diversity_new5$name))

## [1] 9217 9218 9219 9220 9221 9222

diversity_new5$name[grepl("Southeastern Community College (Iowa)", diversity_new5$name)] <- "Southeastern Community College Iowa"

which(grepl("2987", diversity_new5$total_enrollment))

## [1] 9217 9218 9219 9220 9221 9222

diversity_new5$name[9217:9222]

## [1] "Southeastern Community College Iowa" "Southeastern Community College Iowa"
## [3] "Southeastern Community College Iowa" "Southeastern Community College Iowa"
## [5] "Southeastern Community College Iowa" "Southeastern Community College Iowa"

Merging and Summarizing the Data for Schools with Multiple Locations

All of the schools with multiple locations have a single name now, however, they still are separated by rows. We would like to combine all of the statistic for each “category” for all of the locations into one summarized row. For example in the following image you can see four locations for the University of Houston.

We were able to combine each of the four into one set of 7 lines with the averages calculated for each category’s enrollment and enrollment percentage.

The resolution appears in the creation of dataset “c3a” below.

35 Randomly selected not-for-profit educational institutions

For consistency of methodology we decided to select 35 schools from the not-for-profit educational institutions to compare to the 35 randomly selected for-profit schools. The final list of randomly selected not-for-profit schools was also geographically diverse.

sample_n(diversity_new,35)

## # A tibble: 35 x 6
##    name         total_enrollment state   category    enrollment enrollment_perc~
##    <chr>                   <dbl> <chr>   <chr>            <dbl>            <dbl>
##  1 Carteret Co~             1659 North ~ Asian               12            0.723
##  2 University ~             2575 Texas   Hispanic           950           36.9  
##  3 California ~             6631 Califo~ Hispanic          2639           39.8  
##  4 St. Thomas ~             1942 New Yo~ Unknown            133            6.85 
##  5 <NA>                      217 <NA>    Unknown             35           16.1  
##  6 ITT Technic~              336 Michig~ Two Or Mor~         12            3.57 
##  7 Southwest V~             2546 Virgin~ Unknown              4            0.157
##  8 Haskell Ind~              808 Kansas  Two Or Mor~          0            0    
##  9 Everest Uni~              634 Florida Non-Reside~          2            0.315
## 10 <NA>                      131 <NA>    Native Haw~          0            0    
## # ... with 25 more rows

Jennifer

University of Idaho
Southeastern Community College (Iowa)
Clover Park Technical College
Clark College
City University of New York Hunter College
University of Houston
- University of Houston
- University of Houston-Downtown
- University of Houston-Clear Lake
- University of Houston-Victoria
University of Colorado
- University of Colorado at Colorado Springs
- University of Colorado at Boulder
- University of Colorado at Denver
University of Massachusetts
- University of Massachusetts at Dartmouth
- University of Massachusetts at Worcester
- University of Massachusetts at Amherst
- University of Massachusetts at Lowell
- University of Massachusetts at Boston
Arizona College
- Arizona College at Glendale
- Arizona College at Mesa
San Diego City College

Ana

State University of New York College at Purchase
Eastern Shore Community College
Southern Oregon University
Santa Monica College
Adirondack Community College
East Georgia State College
Smith College
East Tennessee State University
Austin Community College
Tompkins Cortland Community College

Tiffany

University of Hawaii Hawaii Community College
Burlington College
Pennsylvania State University - Harrisburg
University of Minnesota
University of Central Oklahoma
Southern Virginia University
Pennsylvania Highlands Community College
Arizona State University
ITT Technical Institute at Knoxville
Colgate University
University of Pittsburg

Jennifer- Non-Profit Name Changes

diversity_new5$name[grepl('University of Houston', diversity_new5$name)] <- 'Univ. of Houston'
diversity_new5$name[grepl('University of Colorado', diversity_new5$name)] <- 'Univ. of Colorado'
diversity_new5$name[grepl('University of Massachusetts', diversity_new5$name)] <- 'Univ. of Massachusetts'
diversity_new5$name[grepl('Arizona College', diversity_new5$name)] <- 'Arizona College'
diversity_new5$name[grepl('University of Idaho', diversity_new5$name)] <- 'Univ. of Idaho'
diversity_new5$name[grepl('Clark College', diversity_new5$name)] <- 'Clark College'
diversity_new5$name[grepl('City University of New York Hunter College', diversity_new5$name)] <- 'City Univ. New York Hunter College'

Ana - Non-Profit Name Changes

diversity_new5$name[grepl('State University of New York', diversity_new5$name)] <- 'State Univ. of New York'

Tiffany - Non-Profit Name Changes

diversity_new5$name[grepl('Pennsylvania State University',diversity_new5$name)]<- 'Penn State Univ.'
diversity_new5$name[grepl('University of Minnesota',diversity_new5$name)]<- 'Univ. of Minn'
diversity_new5$name[grepl('Arizona State University',diversity_new5$name)]<- 'Arizona State'
diversity_new5$name[grepl('ITT Technical Institute',diversity_new5$name)]<- 'ITT Tech'

Create a new column to identify the randomly selected for-profit colleges sample.

We were initially going to use the column to create our plots and complete out analysis, however we realized it would be faster to create separate datasets with the randomly selected schools. The below coding successfully create a new column with “1” for the identified for-profit colleges and “0” for the rest.

Produces 1 for For-Profit Random Selections and 0 for the rest

diversity_new3 <- diversity_new %>% mutate(diversity_new,forProfit = ifelse(name=="Spencerian_College", 1,
     ifelse(name=="Aspen University", 1, 
        ifelse(name=="American Public University system", 1, 0))))

diversity_new3$forProfit[47834] # Check on Spencerian_College entry

## [1] 0

diversity_new3$forProfit[24533] # Check on Aspen University

## [1] 1

diversity_new3$forProfit[123] # Check on American Public University system

## [1] 1

Creating a shorter Combined Dataset

Since the final combined dataset had nearly 4000 rows we decided to make a new dataset with simplified data. We combined all of the school entries with multiple locations into one entry with the rows populated by the averages for each variable

Creating a two new datasets to compare of for-profit and not-for-profit institutions

for_profit2 <- diversity_new5 %>% 
  filter(name == "Spencerian College" |
           name == "Mildred Elley" | 
           name == "Brookline College" | 
           name == "Grand Canyon Univ." | 
           name == "Aspen Univ." | 
           name == "American Public Univ." | 
           name == "Western Intl Univ." | 
           name == "NewSchool Arch. Design" | 
           name == "Schiller Intl Univ." | 
           name == "Natl Paralegal College" | 
           name == "West Coast Univ." | 
           name == "Blue Cliff College" | 
           name == "Walden University" |
           name == "Neumont University" |
           name == "University of Phoenix" | 
           name == "Stevens-Henager College" | 
           name == "DeVry Univ." | 
           name == "Pioneer Pacific College" | 
           name == "Stratford University" | 
           name == "Capella University" | 
           name == "Grantham University" | 
           name == "Redstone College" | 
           name == "National College" | 
           name == "Strayer Univ." | 
           name == "Lincoln Tech" | 
           name == "Fashion Institute" | 
           name == "Centura College" |  
           name == "Rasmussen College" | 
           name == "Fortis College" | 
           name == "Full Sail University" | 
           name == "Rocky Mountain College of Art & Design" | 
           name == "Minneapolis Business College" | 
           name == "Paier College of Art" | 
           name == "Vista College" | 
           name == "Bay State College" )

str(for_profit2)

## tibble [1,014 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name                 : chr [1:1014] "University of Phoenix" "University of Phoenix" "University of Phoenix" "University of Phoenix" ...
##  $ total_enrollment     : num [1:1014] 195059 195059 195059 195059 195059 ...
##  $ state                : chr [1:1014] "Arizona" "Arizona" "Arizona" "Arizona" ...
##  $ category             : chr [1:1014] "American Indian / Alaska Native" "Asian" "Black" "Hispanic" ...
##  $ enrollment           : num [1:1014] 876 1959 31455 13984 1019 ...
##  $ enrollment_percentage: num [1:1014] 0.449 1.004 16.126 7.169 0.522 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   total_enrollment = col_double(),
##   ..   state = col_character(),
##   ..   category = col_character(),
##   ..   enrollment = col_double()
##   .. )

head(for_profit2)

## # A tibble: 6 x 6
##   name       total_enrollment state  category       enrollment enrollment_perce~
##   <chr>                 <dbl> <chr>  <chr>               <dbl>             <dbl>
## 1 Universit~           195059 Arizo~ American Indi~        876             0.449
## 2 Universit~           195059 Arizo~ Asian                1959             1.00 
## 3 Universit~           195059 Arizo~ Black               31455            16.1  
## 4 Universit~           195059 Arizo~ Hispanic            13984             7.17 
## 5 Universit~           195059 Arizo~ Native Hawaii~       1019             0.522
## 6 Universit~           195059 Arizo~ White               58209            29.8

not_profit <- diversity_new5 %>% 
  filter(name == "Univ. of Idaho" |
           name == "Southeastern Community College Iowa"|
           name == "Clover Park Technical College"|
           name == "Clark College"|
           name == "City Univ. New York Hunter College"|
           name == "Univ. of Houston" | 
           name == "Univ. of Colorado"|
           name == "Univ. of Massachusetts" |
           name == "Arizona College" |
           name == "San Diego City College" | 
           name == "State Univ. of New York" | 
           name == "Eastern Shore Community College" | 
           name == "Southern Oregon University" | 
           name == "Santa Monica College" | 
           name == "Adirondack Community College" | 
           name == "East Georgia State College" | 
           name == "Smith College" | 
           name == "East Tennessee State University" | 
           name == "Austin Community College" | 
           name == "Tompkins Cortland Community College" | 
           name == "Penn State Univ." | 
           name == "Univ. of Minn" | 
           name == "Arizona State" | 
           name == "ITT Tech" | 
           name == "University of Hawaii Hawaii Community College"|
           name == "Burlington College" | 
           name == "University of Central Oklahoma" | 
           name == "Southern Virginia University" | 
           name == "Pennsylvania Highlands Community College" | 
           name == "Colgate University"|
           name == "University of Pittsburg")

For some reason the name filter for “Southeastern Community College (Iowa)” would not select properly. The same happened for the attempted name change above.

**New columns of for-profit and non-profit prior to joining the datasets

Adding for-profit column

for_profit3 <- cbind(for_profit2, profit_status = "for-profit")

Adding non-profit column

not_profit2 <- cbind(not_profit, profit_status = "non-profit")

Shortening the Enrollment percentage decimal points

Combine Random Sample For-Profit and Non-Profit Data Sets

combo_set <- rbind(for_profit3, not_profit2)

combo_set2 <- combo_set %>% 
  mutate_if(is.numeric, round, digits = 2)

# Removing the "state" category so we can make a new dataset with means for each column without an error due to categorical differences that cannot be merged with a mean calculation.
c3 <- combo_set2[,-3]

#creating a new dataset with all locations in one entry for each school.
c4 <- c3 %>%
    group_by(name, category, profit_status) %>% 
    summarise_each(funs(mean))

## Warning: `summarise_each_()` is deprecated as of dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

This list was based off of unmerged school locations so we merged all locations into one prior to the final dataset merge.

# For-profit short combined list
c3a <- c3 %>% 
  filter(profit_status == "for-profit") %>% 
  group_by(name, category, profit_status) %>% 
    summarise_each(funs(mean))
#str(c3a)

# not-profit short combined list
c3b <- c3 %>% 
  filter(profit_status == "non-profit") %>% 
  group_by(name, category, profit_status) %>% 
    summarise_each(funs(mean))
#str(c3b)

Creating a category of census to bind with for profit and not-for-profit groups

# creating census short combined list
c3c <- c3b
c3c$profit_status[c3c$profit_status == "non-profit"] <- "census"
#str(c3c)

Changing the percentages in the census list to

# census percentages

c3c$enrollment_percentage[c3c$category == 'White'] <- as.numeric(72)

c3c$enrollment_percentage[c3c$category == 'Black'] <- as.numeric(12.7)

c3c$enrollment_percentage[c3c$category == 'American Indian / Alaska Native'] <- as.numeric(0.9)

c3c$enrollment_percentage[c3c$category == 'Asian'] <- as.numeric(5.6)

c3c$enrollment_percentage[c3c$category == 'Native Hawaiian / Pacific Islander'] <- as.numeric(0.2)

c3c$enrollment_percentage[c3c$category == 'Hispanic'] <- as.numeric(18)

# population totals

#census4bind$enrollment[census4bind$category == 'White'] <- '235560556'

#census4bind$enrollment[census4bind$category == 'Black'] <- '41550265'

#census4bind$enrollment[census4bind$category == 'American Indian / Alaska Native'] <- '2944507'

#census4bind$enrollment[census4bind$category == 'Asian'] <- '18321377'

#census4bind$enrollment[census4bind$category == 'Native Hawaiian / Pacific Islander'] <- '654334.9'

#census4bind$enrollment[census4bind$category == 'Hispanic'] <- '58890139'

c3c$enrollment_percentage <- as.numeric(as.character(c3c$enrollment_percentage))

#str(c3c)

Merge of shortened datasets

c3bind <-rbind(c3c, c3b, c3a )
#str(c3bind)

Data Visualizations with US Census Data

plot1 <- for_profit2 %>%
  ggplot() +
  geom_bar(aes(x=name, y= enrollment_percentage , 
              fill = category),
              position = "fill", 
              stat = "identity" ) +
  coord_flip() +
  theme_minimal() +
  ggtitle("Diversity % of Enrollment in For Profit US Schools", ) +
  theme (plot.title = element_text(hjust = .01, size=15)) +
  labs(fill = "Race") +
  theme(legend.justification = -20, 
        legend.position="bottom", 
        legend.text = element_text(size=6) ,
        )  +
  xlab("School Name") +
  ylab ("Percent of Enrollment") 
plot1

Racial & Ethnic Composition at Non-Profit Schools

plot2 <- not_profit %>%
  ggplot() +
  geom_bar(aes(x=name, y= enrollment_percentage , 
              fill = category),
              position = "fill", 
              stat = "identity" ) +
  coord_flip() +
  theme_minimal() +
  ggtitle("Diversity % of Enrollment in Not for Profit US Schools", ) +
  theme (plot.title = element_text(hjust = .01, size=12)) +
  labs(fill = "Race") +
  theme( legend.position="bottom", 
         legend.justification = "left",
        legend.text = element_text(size=6) ,
        )  +
  xlab("School Name") +
  ylab ("Percent of Enrollment") 
plot2

Comparing this chart to the chart above suggests that there is a higher proportion of white students at non-profit schools than for-profit schools and that there is a higher percentage of students of color at for-profit schools than non-profit schools. Computational analysis must be completed.

Visualization of Percent Enrollment with Census, For-Profit and Non-Profit data

ggplot(data=c3bind, aes(x=category, y=enrollment_percentage, fill=profit_status)) +
geom_bar(stat="identity", position=position_dodge()) +
  ggtitle("For-Profit/Non-Profit Racial & Ethnic Percent Enrollment Comparisons") +
  labs(y = "Enrollment Percentage", x = "Race/Ethnicity")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  theme_minimal()+
   scale_fill_manual(values=c("#CC6666","#9999CC", "#66CC99"))

This graph incorporates new census data to compare the population proportions to the enrollment percentage proportions. The most striking information in this graphic is the high percentage of for-profit enrollment of most students of color and the higher percentage of white students at non-profit institutions.

c3bind <- c3bind %>% 
  mutate(enrollment_percentage = enrollment_percentage/100) %>% 
  mutate_if(is.numeric, round, digits = 2)

## `mutate_if()` ignored the following grouping variables:
## Columns `name`, `category`

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

ggplot(c3bind, aes(reorder(profit_status, enrollment_percentage), enrollment_percentage, fill = profit_status)) +
  geom_bar(stat = 'identity') +
  labs(x = "Isolated Race/Ethnic Categories", y = "Enrollment Percentage", fill = "Profit Status")+
  #scale_y_continuous(labels = function(x) paste0((x*5), "%")) + # Multiply by 5 & add %  
  facet_wrap(~ category)+
  theme_bw()+
  scale_fill_brewer()

The above facet-wrap graph is consistent with the barplot above. However, one issue that this graph presents is the percentage scale. This is an issue that impacted a few of the other percentage-based visualizations we completed, particularly with the facet-wrap feature and will be discussed further.

Comparisons of People of Color and White People at For-Profit and Non-Profit Institutions

With these individual breakdowns we thought it would be interesting to see what the overall breakdown would be between aggregated people of color as a category and white people as a category, particularly because of the vastly higher number of white people in the U.S. population compared with each individual group of people of color.

# creating a set with only categories for people of color 
c5a <- c4 %>% 
  filter(category != "White")

# removing individual categories
c5b <- c5a[,-2]

# creating a new dataset with percentages summed for overall people of color
c5c <- c5b %>%
    group_by(name, profit_status) %>% 
    summarise_each(funs(mean))

# adding a column to distinguish POC
c5c <- cbind(c5c, category = "People of Color")

c5cmean <- mean(c5c$enrollment_percentage)
c5cmean

## [1] 7.334774

# creating a set of all white category
c5d <- c4 %>% 
  filter(category == "White")

c5dmean <- mean(c5d$enrollment_percentage)
c5dmean

## [1] 48.29188

# joining the overall people of color dataset with the white people dataset for comparison
c6 <- rbind(c5c, c5d)

# white census category
c3c_white <- c3c %>% 
  filter(category == "White")

mean(c3c_white$enrollment_percentage)

## [1] 72

# people of color census category
c3c_poc <- c3c %>% 
  filter(category != "White")

# adding a column to distinguish POC
c3c_poc <- c3c_poc[,-2]

mean(c3c_poc$enrollment_percentage)

## [1] 7.48

c3c_poc <- cbind(c3c_poc, category = "People of Color")

Binding Census data with Dataset for People of Color and White People

c6census <- rbind(c6, c3c_white, c3c_poc)

p1 <- c6 %>%  mutate(enrollment_percentage = enrollment_percentage/100) 

ggplot(c6, aes(reorder(category, enrollment_percentage), enrollment_percentage, fill = category)) +
  geom_bar(stat = 'identity') +
   ggtitle("For-Profit/Non-Profit Racial & Ethnic Percent Enrollment Comparisons") +
  labs(y = "Enrollment Percentage", x = "Race/Ethnicity") +
  scale_fill_manual(values=c("#9999CC", "#66CC99")) +
  theme_light()+
  facet_wrap(~ profit_status)

This plot makes clear that there are more people of color enrolled in for-profit colleges than non-profit colleges and more White people enrolled in non-profit colleges than for-profit colleges. Two problems, we do not have census proportions incorporated and our scale is incorrect. Let us now incorporate the census information and then address the scale issue.

ggplot(c6census, aes(reorder(category, enrollment_percentage), enrollment_percentage, fill = category)) +
  geom_bar(stat = 'identity') +
   ggtitle("For-Profit/Non-Profit Racial & Ethnic Percent Enrollment Comparisons") +
  labs(y = "Enrollment Percentage", x = "Race/Ethnicity") +
  scale_fill_manual(values=c("#9999CC", "#66CC99")) +
  scale_y_continuous(labels = function(x) paste(((x/40) + 20), "%")) + # Multiply by 5 & add % 
  theme_light()+
  facet_wrap(~ profit_status)

This graph incorporates the census information and supports the alternative hypothesis that there are proportionally unequal representations of people of color and white people at for-profit and non-profit colleges. It also makes clear that there are proportionally more white people enrolling in institutions of higher education than people of color.

Regarding the scale issue, we were able to manually manipulate the y-axis scale based off of our knowledge that white people account for 72% of the population. However, would this be ethical? We do not think so. Although we are basing the scale off of accurate information, it opens a slippery slope of visual adjustment of data. It is a good lesson for us to see how easy it is to manipulate the data presentation. We want to learn the correct way by investigating why our scale is inaccurate and determine how to fix the problem in the formulated plot without aesthetic manipulation.

Question 3: What is the relationship between tuition rate and salary potential?

In-State Tuition and Early-Career Salary

ptc1 <- monsterfinal %>%
  ggplot()+
  geom_point(aes(x=in_state_total, y=early_career_pay, fill = type, text = paste("Name:", name, '</br>', '</br>In-State Tuition:', in_state_total, '</br>Early Career Pay:', early_career_pay)))+
  stat_smooth(aes(x=in_state_total, y=early_career_pay), method = 'lm', se = FALSE, lwd = 0.5, col = "black")+
  ggtitle("In-State Tuition and Early-Career Salary")+
  labs(x="In-State Tuition (per year)", y="Early-Career Salary")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Ignoring unknown aesthetics: text

ggplotly(ptc1, tooltip = "text")

## `geom_smooth()` using formula 'y ~ x'

This plot suggests a few interpretations. First, private schools cost much more than in-state public schools. Second, in general, there appears to be a positively correlated relationship between tuition cost and early-career salary, with some exceptions. Third, there is a wider range of tuition costs in private schools than the public school tuition range. This graph is interactive to allow the reader to explore costs and salary potential at individual schools. There was not much aesthetic manipulation of this graph, further graphs explore utilizing different shapes and colors. Now let us examine if the visual is supported by the math.

fitInStateEarlyCareer<- lm(monsterfinal$early_career_pay ~ monsterfinal$in_state_total) 

summary(fitInStateEarlyCareer)

## 
## Call:
## lm(formula = monsterfinal$early_career_pay ~ monsterfinal$in_state_total)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16883.4  -4543.3   -858.2   3383.9  28405.5 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.116e+04  2.623e+02  156.91   <2e-16 ***
## monsterfinal$in_state_total 2.565e-01  6.138e-03   41.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6747 on 3586 degrees of freedom
## Multiple R-squared:  0.3275, Adjusted R-squared:  0.3273 
## F-statistic:  1746 on 1 and 3586 DF,  p-value: < 2.2e-16

The In-State tuition rates do have a low p-value suggesting that we should reject the null hypothesis that there is no relationship between in-state tuition and early career pay. However the adjusted R-squared value is only 32.73%, as is demonstrated by the wide spread of datapoints on our plot, telling us that in-state tuition alone does not predict early career pay, rather there is a strong chance that other factors contribute to the observed outcome.

In-State Tuition and Mid-Career Salary

ptc2 <- monsterfinal %>%
  ggplot()+
  geom_point(aes(x=in_state_total, y=mid_career_pay, fill = type, text = paste("Name:", name, '</br>', '</br>In-State Tuition:', in_state_total, '</br>Mid Career Pay:', mid_career_pay)))+
  stat_smooth(aes(x=in_state_total, y=mid_career_pay), method = 'lm', se = FALSE, lwd = 0.5, col = "black")+
  ggtitle("In-State Tuition and Mid-Career Salary")+
  labs(x="In-State Tuition (per year)", y="Mid-Career Salary")+
  theme_minimal()+
  scale_fill_viridis_d()+
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Ignoring unknown aesthetics: text

ggplotly(ptc2, tooltip = "text")

## `geom_smooth()` using formula 'y ~ x'

The results of this plot are very similar to the previous plot of early-career salary. A few notable points are the expanded range of salary at mid-career versus early-career. There are divergences both upward and downward further down the career track. However the general positive correlation still appears to hold. Aesthetically we varied the color scheme to introduce further contrast. Let us examine the numbers.

fitInStateMidCareer<- lm(monsterfinal$mid_career_pay ~ monsterfinal$in_state_total) 

summary(fitInStateMidCareer)

## 
## Call:
## lm(formula = monsterfinal$mid_career_pay ~ monsterfinal$in_state_total)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32182  -8723  -1893   6109  50414 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 7.265e+04  4.956e+02  146.59   <2e-16 ***
## monsterfinal$in_state_total 5.147e-01  1.160e-02   44.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12750 on 3586 degrees of freedom
## Multiple R-squared:  0.3545, Adjusted R-squared:  0.3543 
## F-statistic:  1970 on 1 and 3586 DF,  p-value: < 2.2e-16

The In-State tuition rates do have a low p-value suggesting that we should reject the null hypothesis that there is no relationship between in-state tuition and mid-career pay. However the adjusted R-squared value is only 35.43%, as is demonstrated by the wide spread of datapoints on our plot, telling us that in-state tuition alone does not predict mid-career pay, rather there is a strong chance that other factors contribute to the observed outcome.

Out-of-State Tuition and Early-Career Salary

ptc3 <- monsterfinal %>%
  ggplot()+
  geom_point(aes(x=out_of_state_total, y=early_career_pay, fill = type, shape = type, text = paste("Name:", name, '</br>', '</br>Out-of-State Tuition:', out_of_state_total, '</br>Early Career Pay:', early_career_pay)))+
  stat_smooth(aes(x=out_of_state_total, y=early_career_pay), method = 'lm', se = FALSE, lwd = 0.5, col = "black")+
  ggtitle("Out-of-State Tuition and Early-Career Salary")+
  labs(x="Out-of-State Tuition (per year)", y="Early-Career Salary")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Ignoring unknown aesthetics: text

ggplotly(ptc3, tooltip = "text")

## `geom_smooth()` using formula 'y ~ x'

This plot shows us that there is much more overlap in tuition costs with out-of-state public school tuition and private school tuition. The positively correlated relationship appears to be stronger in this plot. Aesthetically we changed the shape of the public school datapoints, however we find that a color contrast is more effective for this plot due to the high number of datapoints. If we had fewer datapoints the different shapes could be increased in size and would be more effective. Let us now examine the math.

fitOutStateEarlyCareer<- lm(monsterfinal$early_career_pay ~ monsterfinal$out_of_state_total) 

summary(fitOutStateEarlyCareer)

## 
## Call:
## lm(formula = monsterfinal$early_career_pay ~ monsterfinal$out_of_state_total)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16254.6  -3955.1   -913.2   2777.8  28770.0 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.496e+04  3.053e+02  114.52   <2e-16 ***
## monsterfinal$out_of_state_total 3.753e-01  6.720e-03   55.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6017 on 3586 degrees of freedom
## Multiple R-squared:  0.4651, Adjusted R-squared:  0.465 
## F-statistic:  3119 on 1 and 3586 DF,  p-value: < 2.2e-16

The out-of-state tuition rates do have a low p-value suggesting that we should reject the null hypothesis that there is no relationship between in-state tuition and early career pay. The adjusted R-squared value is 46.5%, which is noticeably greater than in-state tuition. However, it is still not close to 100%, which tells us that out-of-state tuition alone does not predict early career pay, rather there is a strong chance that other factors contribute to the observed outcome.

Out-of-State Tuition and Mid-Career Salary

ptc4 <- monsterfinal %>%
  ggplot()+
  geom_point(aes(x=out_of_state_total, y=mid_career_pay, fill = type, text = paste("Name:", name, '</br>', '</br>Out-of-State Tuition:', out_of_state_total, '</br>Mid-Career Pay:', mid_career_pay)))+
  stat_smooth(aes(x=out_of_state_total, y=mid_career_pay), method = 'lm', se = FALSE, lwd = 0.5, col = "black")+
  ggtitle("Out-of-State Tuition and Mid-Career Salary")+
  labs(x="Out-of-State Tuition (per year)", y="Mid-Career Salary")+
  theme_minimal()+
   scale_fill_viridis_d()+
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Ignoring unknown aesthetics: text

ggplotly(ptc4, tooltip = "text")

## `geom_smooth()` using formula 'y ~ x'

Similarly to the in-state tuition plots, this plot largely mimics the early-career plot, however there is a wider apparent range in salaries. The positive correlation still appears to hold.

fitOutStateMidCareer<- lm(monsterfinal$mid_career_pay ~ monsterfinal$out_of_state_total) 

summary(fitOutStateMidCareer)

## 
## Call:
## lm(formula = monsterfinal$mid_career_pay ~ monsterfinal$out_of_state_total)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30896  -7692  -1775   5496  49950 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.047e+04  5.717e+02  105.76   <2e-16 ***
## monsterfinal$out_of_state_total 7.471e-01  1.259e-02   59.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11270 on 3586 degrees of freedom
## Multiple R-squared:  0.4956, Adjusted R-squared:  0.4954 
## F-statistic:  3523 on 1 and 3586 DF,  p-value: < 2.2e-16

The out-of-state tuition rates do have a low p-value suggesting that we should reject the null hypothesis that there is no relationship between out-of-state tuition and mid-career pay. The adjusted R-squared value is 49.54%, which is noticeably greater than in-state tuition. However, it is still not close to 100%, which tells us that out-of-state tuition alone does not predict mid-career pay, rather there is a strong chance that other factors contribute to the observed outcome.

Question 4: Are For-Profit Tuition Costs Higher than Non-Profit Tuition Costs?

library('rvest')
library(tidyverse)
library(dplyr)

Inserting the CSV file (Tuition Cost)

#read.csv("/Users/tiffanyking/Documents/Data101/tuition_cost.csv")

Setting working directory

#setwd("/Users/tiffanyking/Documents/Data101")
#tuition <- read_csv("tuition_cost.csv")

View data

head(tuition)

## # A tibble: 6 x 10
##   name  state state_code type  degree_length room_and_board in_state_tuition
##   <chr> <chr> <chr>      <chr> <chr>                  <dbl>            <dbl>
## 1 Aani~ Mont~ MT         Publ~ 2 Year                    NA             2380
## 2 Abil~ Texas TX         Priv~ 4 Year                 10350            34850
## 3 Abra~ Geor~ GA         Publ~ 2 Year                  8474             4128
## 4 Acad~ Minn~ MN         For ~ 2 Year                    NA            17661
## 5 Acad~ Cali~ CA         For ~ 4 Year                 16648            27810
## 6 Adam~ Colo~ CO         Publ~ 4 Year                  8782             9440
## # ... with 3 more variables: in_state_total <dbl>, out_of_state_tuition <dbl>,
## #   out_of_state_total <dbl>

Simplifying Names for all Schools in our Samples

We changed all of the names of our schools to simplified versions which was particularly important for the schools with multiple locations.

For-Profit

Jennifer - For-Profit Name Changes

tuition$name[grepl('Clover Park Technical College', tuition$name)] <- 'Clover Park Tech College'
tuition$name[grepl('San Diego City College', tuition$name)] <- 'San Diego City College'
tuition$name[grepl('Aspen University', tuition$name)] <- 'Aspen Univ.'
tuition$name[grepl('Grand Canyon University', tuition$name)] <- 'Grand Canyon Univ.'
tuition$name[grepl('American Public University', tuition$name)] <- 'American Public Univ.'
tuition$name[grepl("Southeastern Community College (Iowa)", tuition$name)] <- "Southeastern Community College Iowa"

Ana - For-Profit Name Changes

tuition$name[grepl('Blue Cliff College', tuition$name)] <- 'Blue Cliff College' 
tuition$name[grepl('University of Phoenix', tuition$name)] <- 'University of Phoenix' 
tuition$name[grepl('Stevens-Henager College', tuition$name)] <- 'Stevens-Henager College' 
tuition$name[grepl('DeVry University', tuition$name)] <- 'DeVry Univ.' 
tuition$name[grepl('Pioneer Pacific College', tuition$name)] <- 'Pioneer Pacific College' 
tuition$name[grepl('National College at', tuition$name)] <- 'National College'

Tiffany - For-Profit Name Changes

tuition$name[grepl('Strayer University', tuition$name)]<- 'Strayer Univ.'
tuition$name[grepl('Lincoln Tech',tuition$name)]<- 'Lincoln Tech' 
tuition$name[grepl('Fashion Institute of Design and Merchandising', tuition$name)]<- 'Fashion Institute'
tuition$name[grepl('Centura College', tuition$name)]<- 'Centura College'
tuition$name[grepl('Rasmussen College', tuition$name)]<- 'Rasmussen College' 
tuition$name[grepl('Fortis College',tuition$name)]<- 'Fortis College'

Non-Profit

Jennifer- Non-Profit Name Changes

tuition$name[grepl('University of Houston', tuition$name)] <- 'Univ. of Houston'
tuition$name[grepl('University of Colorado', tuition$name)] <- 'Univ. of Colorado'
tuition$name[grepl('University of Massachusetts', tuition$name)] <- 'Univ. of Massachusetts'
tuition$name[grepl('Arizona College', tuition$name)] <- 'Arizona College'
tuition$name[grepl('University of Idaho', tuition$name)] <- 'Univ. of Idaho'
tuition$name[grepl('Clark College', tuition$name)] <- 'Clark College'
tuition$name[grepl('City University of New York Hunter College', tuition$name)] <- 'City Univ. New York Hunter College'

Ana - Non-Profit Name Changes

tuition$name[grepl('State University of New York', tuition$name)] <- 'State Univ. of New York'

Tiffany - Non-Profit Name Changes

tuition$name[grepl('Pennsylvania State University',tuition$name)]<- 'Penn State Univ.'

tuition$name[grepl('University of Minnesota',tuition$name)]<- 'Univ. of Minn'

tuition$name[grepl('Arizona State University',tuition$name)]<- 'Arizona State'

tuition$name[grepl('ITT Technical Institute',tuition$name)]<- 'ITT Tech'

New Columns

Creating a two new datasets to compare of for-profit and not-for-profit institutions

For-Profit

for_profit2 <- tuition %>% 
  filter(name == "Spencerian College" |
           name == "Mildred Elley" | 
           name == "Brookline College" | 
           name == "Grand Canyon Univ." | 
           name == "Aspen Univ." | 
           name == "American Public Univ." | 
           name == "Western Intl Univ." | 
           name == "NewSchool Arch. Design" | 
           name == "Schiller Intl Univ." | 
           name == "Natl Paralegal College" | 
           name == "West Coast Univ." | 
           name == "Blue Cliff College" | 
           name == "Walden University" |
           name == "Neumont University" |
           name == "University of Phoenix" | 
           name == "Stevens-Henager College" | 
           name == "DeVry Univ." | 
           name == "Pioneer Pacific College" | 
           name == "Stratford University" | 
           name == "Capella University" | 
           name == "Grantham University" | 
           name == "Redstone College" | 
           name == "National College" | 
           name == "Strayer Univ." | 
           name == "Lincoln Tech" | 
           name == "Fashion Institute" | 
           name == "Centura College" |  
           name == "Rasmussen College" | 
           name == "Fortis College" | 
           name == "Full Sail University" | 
           name == "Rocky Mountain College of Art & Design" | 
           name == "Minneapolis Business College" | 
           name == "Paier College of Art" | 
           name == "Vista College" | 
           name == "Bay State College" )

str(for_profit2)

## tibble [6 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name                : chr [1:6] "Bay State College" "Capella University" "Grand Canyon Univ." "Grantham University" ...
##  $ state               : chr [1:6] "Massachusetts" "Minnesota" "Arizona" "Kansas" ...
##  $ state_code          : chr [1:6] "MA" "MN" "AZ" "KS" ...
##  $ type                : chr [1:6] "For Profit" "For Profit" "For Profit" "For Profit" ...
##  $ degree_length       : chr [1:6] "2 Year" "4 Year" "4 Year" "4 Year" ...
##  $ room_and_board      : num [1:6] 13300 NA 10100 NA NA NA
##  $ in_state_tuition    : num [1:6] 27750 13788 27090 6540 18040 ...
##  $ in_state_total      : num [1:6] 41050 13788 37190 6540 18040 ...
##  $ out_of_state_tuition: num [1:6] 27750 13788 27090 6540 18040 ...
##  $ out_of_state_total  : num [1:6] 41050 13788 37190 6540 18040 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   state = col_character(),
##   ..   state_code = col_character(),
##   ..   type = col_character(),
##   ..   degree_length = col_character(),
##   ..   room_and_board = col_double(),
##   ..   in_state_tuition = col_double(),
##   ..   in_state_total = col_double(),
##   ..   out_of_state_tuition = col_double(),
##   ..   out_of_state_total = col_double()
##   .. )

head(for_profit2)

## # A tibble: 6 x 10
##   name  state state_code type  degree_length room_and_board in_state_tuition
##   <chr> <chr> <chr>      <chr> <chr>                  <dbl>            <dbl>
## 1 Bay ~ Mass~ MA         For ~ 2 Year                 13300            27750
## 2 Cape~ Minn~ MN         For ~ 4 Year                    NA            13788
## 3 Gran~ Ariz~ AZ         For ~ 4 Year                 10100            27090
## 4 Gran~ Kans~ KS         For ~ 4 Year                    NA             6540
## 5 Paie~ Conn~ CT         For ~ 4 Year                    NA            18040
## 6 Wald~ Minn~ MN         For ~ 4 Year                    NA            15045
## # ... with 3 more variables: in_state_total <dbl>, out_of_state_tuition <dbl>,
## #   out_of_state_total <dbl>

Non-Profit

not_profit <- tuition %>% 
  filter(name == "Univ. of Idaho" |
           name == "Clover Park Technical College"|
           name == "Clark College"|
           name == "City Univ. New York Hunter College"|
           name == "Univ. of Houston" | 
           name == "Univ. of Colorado"|
           name == "Univ. of Massachusetts" |
           name == "Arizona College" |
           name == "San Diego City College" | 
           name == "State Univ. of New York" | 
           name == "Eastern Shore Community College" | 
           name == "Southern Oregon University" | 
           name == "Santa Monica College" | 
           name == "Adirondack Community College" | 
           name == "East Georgia State College" | 
           name == "Smith College" | 
           name == "East Tennessee State University" | 
           name == "Austin Community College" | 
           name == "Tompkins Cortland Community College" | 
           name == "Penn State Univ." | 
           name == "Univ. of Minn" | 
           name == "Arizona State" | 
           name == "ITT Tech" | 
           name == "University of Hawaii Hawaii Community College"|
           name == "Burlington College" | 
           name == "University of Central Oklahoma" | 
           name == "Southern Virginia University" | 
           name == "Pennsylvania Highlands Community College" | 
           name == "Colgate University"|
           name == "University of Pittsburg")

diversity <- read_csv("diversity2.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   total_enrollment = col_double(),
##   state = col_character(),
##   category = col_character(),
##   enrollment = col_double()
## )

salary <- read_csv("salary_potential.csv")

## Parsed with column specification:
## cols(
##   rank = col_double(),
##   name = col_character(),
##   state_name = col_character(),
##   early_career_pay = col_double(),
##   mid_career_pay = col_double(),
##   make_world_better_percent = col_double(),
##   stem_percent = col_double()
## )

salary_added <- left_join(diversity,salary,by="name") 
salary_added_final <- na.omit(salary_added)

tuition <- read_csv("tuition_cost.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   state = col_character(),
##   state_code = col_character(),
##   type = col_character(),
##   degree_length = col_character(),
##   room_and_board = col_double(),
##   in_state_tuition = col_double(),
##   in_state_total = col_double(),
##   out_of_state_tuition = col_double(),
##   out_of_state_total = col_double()
## )

tuition_added <- left_join(tuition,salary_added_final,by="name")
monsterfinal <- na.omit(tuition_added)

library(ggplot2)

Creating Graphs/Visualizations

Find the Average of For- & Not-Profit

mean(not_profit$in_state_tuition)

## [1] 13346.43

mean(for_profit2$in_state_tuition)

## [1] 18042.17

mean(not_profit$out_of_state_tuition)

## [1] 23281.38

mean(for_profit2$out_of_state_tuition)

## [1] 18042.17

Creating a New Data Frame for Means/ Average Tuition

test <- data.frame( "Name" = c("Not Profit In State","Not Profit Out State","For Profit In State","For Profit Out State"),"Mean" = c(13346.43,23281.38,18042.17,18042.17))

str(not_profit)

## tibble [37 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name                : chr [1:37] "Adirondack Community College" "Arizona State" "Austin Community College" "Arizona College" ...
##  $ state               : chr [1:37] "New York" "Arizona" "Texas" "Arizona" ...
##  $ state_code          : chr [1:37] "NY" "AZ" "TX" "AZ" ...
##  $ type                : chr [1:37] "Public" "Public" "Public" "Public" ...
##  $ degree_length       : chr [1:37] "2 Year" "4 Year" "2 Year" "2 Year" ...
##  $ room_and_board      : num [1:37] 11660 12648 NA 7700 NA ...
##  $ in_state_tuition    : num [1:37] 5375 10822 2550 2580 4287 ...
##  $ in_state_total      : num [1:37] 17035 23470 2550 10280 4287 ...
##  $ out_of_state_tuition: num [1:37] 9935 28336 13020 11100 9718 ...
##  $ out_of_state_total  : num [1:37] 21595 40984 13020 18800 9718 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   state = col_character(),
##   ..   state_code = col_character(),
##   ..   type = col_character(),
##   ..   degree_length = col_character(),
##   ..   room_and_board = col_double(),
##   ..   in_state_tuition = col_double(),
##   ..   in_state_total = col_double(),
##   ..   out_of_state_tuition = col_double(),
##   ..   out_of_state_total = col_double()
##   .. )

Visualization of the Tution Rates at For-Profit and Non-Profit Schools

install.packages("RColorBrewer")

## Warning: package 'RColorBrewer' is in use and will not be installed

library(RColorBrewer)


plot1 <- test %>%
  ggplot(aes(x=Name, y=Mean,fill=Name)) +
  geom_bar(stat="identity")+
  coord_flip() +
  theme_minimal() +
  ggtitle("Average Tuition Rate For & Not Profit US Schools", ) +
  theme (plot.title = element_text(hjust = .01, size=15)) +
  theme(legend.justification = -20, 
        legend.position="bottom", 
        legend.text = element_text(size=6) ,
        )  +
  xlab("School Type") +
  ylab ("Cost") +
  scale_color_brewer()

plot1

This plot, created from a randomly selected dataset, suggests that the out of state tuition costs outweigh the costs at for profit institutions. Non-profit in-state institutions were the least expensive in our dataset. The dataset did include community colleges, which would be one factor to control for in further analysis.

Creating a New Data Set for Not Profits (In State Tuition)

test2 <- data.frame ("Name" = c("Univ. of Idaho", "Clover Park Technical College", "Clark College","City Univ. New York Hunter College", "Univ. of Houston","Univ. of Colorado", "Univ. of Massachusetts", "Arizona College","San Diego City College","State Univ. of New York", "Eastern Shore Community College","Southern Oregon University", "Santa Monica College","Adirondack Community College","East Georgia State College","Smith College","East Tennessee State University","Austin Community College", "Tompkins Cortland Community College","Penn State Univ.","Univ. of Minn","Arizona State","ITT Tech","University of Hawaii Hawaii Community College","Burlington College","University of Central Oklahoma", "Southern Virginia University","Pennsylvania Highlands Community College","Colgate University","University of Pittsburg") , "Tuition"=c(7864, 4271, 4287, 7180, 12506, 11463, 14835, 2640, 1418, 8540, 4800,  9654, 1444,5375,3666, 52404,9277,2550,6046,18454,11921,10822, 45000, 3915, 22500, 7488, 16495, 5820, 55870, 19080))

Graph of the Test 2 Data Set(Not Profits-In State Tuition)

plot1 <- test2%>%
  ggplot(aes(x=Name, y=Tuition ,fill=Tuition)) +
  geom_bar(stat="identity")+
  coord_flip() +
  theme_minimal() +
  ggtitle("In-State Tuition Rate for Not Profit US Schools", ) +
  theme (plot.title = element_text(hjust = .01, size=15)) +
  theme(legend.justification = 1, 
        legend.position="bottom", 
        legend.text = element_text(size=6) ,
        )  +
  xlab("School name") +
  ylab ("Tuition Cost") +
  scale_color_brewer()

plot1

This plot identifies the individual costs of tuition within our non-profit school dataset. The private colleges were much more expensive than the in-state tuition at public schools.

Create data set for For-Profit (In-state)

test3 <- data.frame("Name"=c("Spencerian College", "Mildred Elley", "Brookline College","Grand Canyon Univ.","Aspen Univ.", "American Public Univ.", "Western Intl Univ.", "NewSchool Arch. Design", "Schiller Intl Univ.", "Natl Paralegal College", "West Coast Univ.", "Blue Cliff College", "Walden University", "Neumont University", "University of Phoenix", "Stevens-Henager College", "DeVry Univ.", "Pioneer Pacific College", "Stratford University", "Capella University","Grantham University", "Redstone College",  "National College","Strayer Univ.","Lincoln Tech", "Fashion Institute","Centura College","Rasmussen College","Fortis College", "Full Sail University", "Rocky Mountain College of Art & Design", "Minneapolis Business College", "Paier College of Art", "Vista College","Bay State College"), "Tuition3"=c(16860,14250,13500,27090,14492,7324,11040,28386,17700,7995,17030,15182,15045,24750,9,552,14585,15188,2310,16750,13788,6540,15530,13320,27401,5740,15887,5200,15511,19929,28512,15120,18040,2720,27750))

##Graph the Test 3 Data Set(For-Profit-In-state)

plot1 <- test3%>%
  ggplot(aes(x=Name, y=Tuition3 ,fill=Tuition3)) +
  geom_bar(stat="identity")+
  coord_flip() +
  theme_minimal() +
  ggtitle("In-State Tuition Rate for For Profit US Schools", ) +
  theme (plot.title = element_text(hjust = .01, size=15)) +
  theme(legend.justification = 1, 
        legend.position="bottom", 
        legend.text = element_text(size=6) ,
        )  +
  xlab("School name") +
  ylab ("Tuition Cost") +
  scale_color_brewer()

plot1

Consistent with our previous plots, the in-state tuition rates at for-profit schools were all more expensive than the in-state tuition at non-profit schools.

Question 5: How do diversity, tuition rates and salary rates at the top 20 schools compare to other schools?

Loading Libraries

#install.packages("RColorBrewer")
#Loading the rvest package
library('rvest')
library(tidyverse)
library(dplyr)

Scraping the names of top US schools from wikipedia

#Specifying the url for desired website to be scraped
url <- 'https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States'

#Reading the HTML code from the website
top <- read_html(url)

#Using CSS selectors to scrape the rankings section
top_html <- html_nodes(top,'td a')
#Converting the ranking data to text
topschools <- html_text(top_html)
#topschools
topschools_new <- topschools[11:30]
#df <-data_frame(topschools_new)
#df
topschools_new

##  [1] "Harvard University"                   
##  [2] "Stanford University"                  
##  [3] "Yale University"                      
##  [4] "Massachusetts Institute of Technology"
##  [5] "Princeton University"                 
##  [6] "University of Pennsylvania"           
##  [7] "Brown University"                     
##  [8] "California Institute of Technology"   
##  [9] "Duke University"                      
## [10] "Dartmouth College"                    
## [11] "Cornell University"                   
## [12] "Pomona College"                       
## [13] "University of California, Berkeley"   
## [14] "Columbia University"                  
## [15] "Georgetown University"                
## [16] "University of Chicago"                
## [17] "Northwestern University"              
## [18] "University of Notre Dame"             
## [19] "Williams College"                     
## [20] "University of Michigan"

Changing name of school with multiple campuses, and adding new column of enrollement percentage

diversity$name[grepl('University of Michigan', diversity$name)] <- 'University of Michigan' 
diversity <- diversity %>%  mutate(enrollment_percentage = enrollment/ total_enrollment*100)

Creating new dataframe of top schools

topschools_diversity <- diversity  %>% 
  filter(name == "Harvard University"   |
         name == "Stanford University"  |
         name == "Yale University"  |
         name == "Massachusetts Institute of Technology"  |
         name == "Princeton University"  |                  
         name == "University of Pennsylvania"  |           
         name == "Brown University"  |                      
         name == "California Institute of Technology"  |   
         name == "Duke University"  |                       
         name == "Dartmouth College"  |                    
         name == "Cornell University"  |                    
         name == "Pomona College"  |                       
         name == "University of California at Berkeley"  |    
         name == "Columbia University"  |                  
         #name == "Georgetown University"  |                 
         name == "University of Chicago"  |                
         name == "Northwestern University"  |               
         name == "University of Notre Dame"  |             
         name == "Rensselaer Polytechnic Institute"   |     
         name == "University of Michigan"    
  )

Checking to see the schools in new data frame. For some reason, Georgetown is not included in this dataset.

unique(topschools_diversity$name)

##  [1] "University of Michigan"               
##  [2] "University of California at Berkeley" 
##  [3] "Harvard University"                   
##  [4] "Columbia University"                  
##  [5] "University of Pennsylvania"           
##  [6] "Cornell University"                   
##  [7] "Northwestern University"              
##  [8] "Stanford University"                  
##  [9] "Duke University"                      
## [10] "University of Chicago"                
## [11] "Yale University"                      
## [12] "University of Notre Dame"             
## [13] "Massachusetts Institute of Technology"
## [14] "Brown University"                     
## [15] "Princeton University"                 
## [16] "Rensselaer Polytechnic Institute"     
## [17] "Dartmouth College"                    
## [18] "California Institute of Technology"   
## [19] "Pomona College"

Removing the unwated race categories

topschools_diversity <- topschools_diversity %>% filter(category != "Women" & category != "Two Or More Races" & category != "Non-Resident Foreign" & category != "Unknown" & category != "Total Minority")

Plotting the diversity of the top 19 schools in the US

plot1 <- topschools_diversity %>%
  ggplot() +
  geom_bar(aes(x=name, y= enrollment_percentage , 
              fill = category),
              position = "fill", 
              stat = "identity" ) +
  coord_flip() +
  theme_minimal() +
  ggtitle("Diversity Percent of Enrollment for Top 19 Collegs in US", ) +
  theme (plot.title = element_text(hjust = .01, size=13)) +
  labs(fill = "Race") +
  theme(legend.justification = -5, 
        legend.position="bottom", 
        legend.text = element_text(size=7) ,
        )  +
  xlab("School Name") +
  ylab ("Percent of Enrollment") 
plot1

The graph shows that white students are overwhelmingly present in the top 19 colleges in the US. The 2nd largest present population are Asian students. Hispanic and Black students follow, but are very underwhelmed in numbers. The population of American Indian and Native Hawaiian students are barely visible and present on the graph. This graph fuels the preconception that top US schools (including IVY leagues) are reserved for students with access to resources and privilege (white and asian).

Now creating a master dataset to include tuition & salary pay for top schools

tuition <- read_csv("tuition_cost.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   state = col_character(),
##   state_code = col_character(),
##   type = col_character(),
##   degree_length = col_character(),
##   room_and_board = col_double(),
##   in_state_tuition = col_double(),
##   in_state_total = col_double(),
##   out_of_state_tuition = col_double(),
##   out_of_state_total = col_double()
## )

salary <- read_csv("salary_potential.csv")

## Parsed with column specification:
## cols(
##   rank = col_double(),
##   name = col_character(),
##   state_name = col_character(),
##   early_career_pay = col_double(),
##   mid_career_pay = col_double(),
##   make_world_better_percent = col_double(),
##   stem_percent = col_double()
## )

salary_added <- merge(x = diversity, y = salary, by = "name", all = TRUE)

Where did Harvard go? We can add the missing “NA” data before removing “NAs From the data. We obtained the tuition information from another source”https://www.collegesimply.com/colleges/massachusetts/harvard-university/price/"

all_added$in_state_tuition[all_added$name == "Harvard University"] <- 46340
all_added$in_state_total[all_added$name == "Harvard University"] <- 67580
all_added$out_of_state_tuition[all_added$name == "Harvard University"] <- 46340
all_added$out_of_state_total[all_added$name == "Harvard University"] <- 67580
all_added$out_of_state_tuition[all_added$name == "Harvard University"] <- 46340
all_added$room_and_board[all_added$name == "Harvard University"] <- 17160
all_added$state.y [all_added$name == "Harvard University"] <- "Massachusetts"
all_added$state_code [all_added$name == "Harvard University"] <- "MA"
all_added$type[all_added$name == "Harvard University"] <- "Private"
all_added$degree_length[all_added$name == "Harvard University"] <- "4 Year"

Now that Harvard is in there, we can continue and remove all the NAs from the data

monsterfinal <- na.omit(all_added)

Creating a new subset of the top 20 schools in the US into a new data frame

topschools_monster <- monsterfinal  %>% 
  filter(name == "Harvard University"   |
         name == "Stanford University"  |
         name == "Yale University"  |
         name == "Massachusetts Institute of Technology"  |
         name == "Princeton University"  |                  
         name == "University of Pennsylvania"  |           
         name == "Brown University"  |                      
         name == "California Institute of Technology"  |   
         name == "Duke University"  |                       
         name == "Dartmouth College"  |                    
         name == "Cornell University"  |                    
         name == "Pomona College"  |                       
         name == "University of California at Berkeley"  |    
         name == "Columbia University"  |                  
         name == "Georgetown University"  |                 
         name == "University of Chicago"  |                
         name == "Northwestern University"  |               
         name == "University of Notre Dame"  |             
         name == "Rensselaer Polytechnic Institute"   |     
         name == "University of Michigan"    
  )

Seeing the schools that are in our new dataframe, we see that some school disappeared due to lack of info (Georgetown - not in any dataset), U Michigan - missing salary info, Columbia - missing tutition and salary, Uhiversity of Cali - missing tutition and salary )

unique(topschools_monster$name)

##  [1] "Brown University"                     
##  [2] "California Institute of Technology"   
##  [3] "Cornell University"                   
##  [4] "Dartmouth College"                    
##  [5] "Duke University"                      
##  [6] "Harvard University"                   
##  [7] "Massachusetts Institute of Technology"
##  [8] "Northwestern University"              
##  [9] "Pomona College"                       
## [10] "Princeton University"                 
## [11] "Rensselaer Polytechnic Institute"     
## [12] "Stanford University"                  
## [13] "University of Chicago"                
## [14] "University of Notre Dame"             
## [15] "University of Pennsylvania"           
## [16] "Yale University"

Graphing the intersection between early career pay and in state tuition

library(ggplot2)
library(plotly)
plot_top <- ggplot(topschools_monster, 
      aes(x=early_career_pay, y=in_state_total, color=name)) + 
      geom_point(stat = 'identity') + 
      geom_smooth(method=lm, colour="grey") +
      ggtitle("Early Career Pay and Instate tutition of Top Schools") +
       xlab("Early Career Pay Salary") +
  ylab ("In-state Tuition Total") +
  labs(color = "School Name") 
plot_top

## `geom_smooth()` using formula 'y ~ x'

Adding plotly for interactivity

ggplotly(plot_top)

## `geom_smooth()` using formula 'y ~ x'

The graph shows an intersection between the tuition costs and early career salary outcomes of students who attended top US schools. One would assume that the more expensive a school would be, is correlated to the “prestige” and be reflected in a higher salary coming straight of out undergrad. However, this graph suggest the opposite, that the lesser inexpensive “top schools” are yielding relatively higher early career salary than their more expensive counterparts.

Graphing Diversity at Top Schools

monsterfinaltop <- topschools_monster %>% filter(category != "Women" & category != "Two Or More Races" & category != "Non-Resident Foreign" & category != "Unknown" & category != "Total Minority") 

mft2 <- monsterfinaltop %>%  mutate_if(is.numeric, round, digits = 2)

library(ggplot2)
library(plotly)
plot_top2 <- ggplot(mft2, 
      aes(x=category, y=enrollment_percentage, color=name, text = paste("Percent:", enrollment_percentage, '</br>', '</br>School Name:', name))) +
      geom_point(stat = 'identity') + 
      geom_smooth(method=lm, colour="grey") +
      ggtitle("Racial and Ethnic Demographics at Top Schools") +
       xlab("Race/Ethnicity") +
  ylab ("Enrollment Percentage") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  labs(color = "School Name")
  


ggplotly(plot_top2tooltip = "text")

## `geom_smooth()` using formula 'y ~ x'

This graph makes clear the disproportionate representation of White people at top schools.

Data Analysis Summaries

Diversity

As the graphs above illustrate, our data verifies that indeed there are a higher proportion of students of color at for-profit colleges than at non-profit colleges. White Students have much greater attendance at non-profit colleges. These findings provide support for the argument that there are higher numbers of enrollment of students of color at for-profit colleges. One possible explanation could be targeted marketing campaigns by for-profit colleges to students of color. Further research is required.

Given the well-documented high cost of for-profit education, the exorbitant student debt incurred at these schools, the lower rates of employment upon graduation and the low levels of satisfaction with educational quality from previous students, all potential students should think twice before enrolling in for-profit colleges. This is particularly true for students of color who may be identified by for-profit colleges as susceptible to predatory exploitative practices that will provide additional profits to the college’s shareholders without making good on their promise to provide a marketable, quality education. The predatory actions of for-profit colleges contribute to our nation’s growing economic divide between the haves and the have nots and perpetuates the unequal education system we have today. Fortunately, there are many people and organizations working to end these predatory practices, but in the meantime while they continue to exist, we must inform one another about the importance of obtaining a quality education at not-for-profit institutions, like Montgomery College.

It is also clear that whites are disproportionately represented at the top 20 schools as well, which provides continued advantages for White people in earning potential throughout their lives, as the salary potential data on these schools portrayed.

Costs of Tuition, Salary Potential

In general, our data visualizations suggest the higher the tuition paid, the higher the salary received. However, although positive correlations were found to hold some significance, the R-squared values made it clear that there are many other possible explanatory variables for that outcome, including social capital, particularly at high-cost, high-salary expectation schools.

Tuition and Salary Potential at the Top Ranked Schools

This may be related to the interesting finding we had that at the top 20 schools, cost of tuition and future salary potential were negatively correlated. Perhaps there is a tipping point. At this point though, we just do not know and future computational analysis and research is required to answer these questions and explain these findings comprehensively.

Future Research

The following is a list of more detailed analysis that could be done with this dataset:

Comparing diversity of for-profit and not-for-profit schools in the same geographic regions utilizing local racial demographics.
Comparing diversity in community colleges, four-year-institutions, and for-profit institutions.
Comparing the tuition rates, salary potential and overall profits received from community colleges, four-year institutions and for-profit institutions.
A longitudinal study could be conducted to see if there have been changes in the demographic makeup of for-profit colleges in the time before the Obama-era regulations, the lack of regulation during the DeVos era and the patterns that emerge after the predatory lending lawsuits and any resulting legislation.

References

Body, D. (2019, Mar. 19). Worse Off Than When They Enrolled: The Consequence of For-Profit Colleges for People of Color. The Aspen Institute. https://www.aspeninstitute.org/blog-posts/worse-off-than-when-they-enrolled-the-consequence-of-for-profit-colleges-for-people-of-color/

Bonadies, G.G., Rovenger, J., Connor, E., Shum, B. & Merrill, T. (2018, Jul. 30). For-Profit Schools’ Predatory Practices and Students of Color: A Mission to Enroll Rather than Educate, Harvard Law Review Blog. https://blog.harvardlawreview.org/for-profit-schools-predatory-practices-and-students-of-color-a-mission-to-enroll-rather-than-educate/

Conti, A. (2019, Sep. 10). How For-Profit Colleges Have Targeted and Taken Advantage of Black Students. Vice. https://www.vice.com/en_us/article/bjwj3d/how-for-profit-colleges-have-targeted-and-taken-advantage-of-black-students

Green, E.L. (2019, Jun. 28). DeVos Repeals Obama-Era Rule Cracking down on For-Profit Colleges, New York Times. https://www.nytimes.com/2019/06/28/us/politics/betsy-devos-for-profit-colleges.html

Halperin, D. 22 States Sue DeVos to Overturn Anti-Student Rule. Republic Report. https://www.republicreport.org/2020/22-states-sue-devos-to-overturn-anti-student-rule/

Legal Services Center(2020), Project on Predatory Student Lending: Cases, Harvard Law School. https://predatorystudentlending.org/cases/

Lobosco, K. (2019, Jul. 23). For-profit college students are waiting 958 days for loan relief, CNN. https://www.cnn.com/2019/07/23/politics/betsy-devos-loan-forgiveness-for-profit-college-students/index.html

Lopez, M. (2015, Feb. 12). BEWARE: For-Profit Colleges. The Patriot Post. https://bcchspatriotpost.com/2391/news/beware-for-profit-colleges/

Redman, H. (2020, Jun. 27). AG Sues Department of Education Over For-Profit College Rules. Urban Milwaukee. https://urbanmilwaukee.com/2020/06/27/ag-sues-department-of-education-over-for-profit-college-rules/

TBS Staff (2019, Jul. 29). For-Profit Colleges vs. Non-Profit Colleges - What’s The Difference? The Best Schools Magazine. https://thebestschools.org/magazine/for-profit-vs-non-profit

Turner, C. (2019, Nov. 14). Devos Refuses to Forgive student Debt For Those DeFrauded by For-Profit Colleges, All Things Considered, NPR. https://www.npr.org/2019/11/14/779465130/devos-refuses-to-forgive-student-debt-for-those-defrauded-by-for-profit-colleges

Voorhees, K. (2019, Oct. 17). Civil Rights Groups: For-Profit Colleges Exploit Black and Latino Students. The Leadership Conference Education Fund. https://civilrights.org/edfund/2019/10/17/civil-rights-groups-for-profit-colleges-exploit-black-and-latino-students/

Final Project: Equality and Tertiary Education

An Analysis of U.S. Tertiary Education Institutions: Demographics, Tuition Costs and Salary Potential

Jennifer Johnson, Tiffany King, Ana Ortez-Rivera

8/16/2020

“Until we get equality in education, we won’t have an equal society.”

~ Supreme Court Justice Sonia Sotomayor

Topic Background

Initial Presentation: Racial Demographics at For-Profit and Not-For-Profit Educational Institutions

Advancements in Analysis Since the Initial Presentation

Initial Questions

New Additional Questions

Definitions:

Tidywork: Making Sense of our Data

Webscraping and Randomly Selecting Sample of 35 For-Profit Colleges

Following Web-Scraping Tutorial to Scrape Wikipedia Website

Loading Libraries

Creating a data set with all of our variables

Cleaning the Data

Question 1: How do racial and ethnic demographics vary between private and public institutions in our dataset?

Plot of Enrollment Percentages based on Race & Ethnicity: Combined For-profit and Non-profit Data, Without Census Data

Examination of 2-year and 4-year Program Racial and Ethnic Demographics

Question 2: How do racial and ethnic demographics vary between not-for-profit institutions and for-profit colleges in our dataset?

Comparing For-Profit and Non-Profit Data with Census Data

Creating a Dataset with US Census information

Creating Sample Datasets of For-Profit and Non-Profit Institutions to Merge with the Census Information

Viewing data

Cleaning the Data

Creating new column for percent enrollment

Removing Categories: Women, Non-Resident Foreign and Two or More Races

Changing Names of Schools with Multiple Locations into One Name (separate rows remain)

Name change for the entire group containing a portion of the name

Showing the Multiple Location Indicies to Verify Name Change Success After

Single Line of Code to Change All Location Names to One Name

Verifying that the group name change worked for each group

One More Name Change Test

Simplifying Names for all Schools in our Samples

Jennifer - For-Profit Name Changes

Ana - For-Profit Name Changes

Tiffany - For-Profit Name Changes

One Name that Refused to Change

Merging and Summarizing the Data for Schools with Multiple Locations

35 Randomly selected not-for-profit educational institutions

Jennifer

Ana

Tiffany

Jennifer- Non-Profit Name Changes

Ana - Non-Profit Name Changes

Tiffany - Non-Profit Name Changes

Create a new column to identify the randomly selected for-profit colleges sample.

Produces 1 for For-Profit Random Selections and 0 for the rest

Creating a shorter Combined Dataset

Creating a two new datasets to compare of for-profit and not-for-profit institutions

**New columns of for-profit and non-profit prior to joining the datasets

Adding for-profit column

Adding non-profit column

Shortening the Enrollment percentage decimal points

Combine Random Sample For-Profit and Non-Profit Data Sets

Creating a category of census to bind with for profit and not-for-profit groups

Changing the percentages in the census list to

Merge of shortened datasets

Data Visualizations with US Census Data

Racial & Ethnic Composition at Non-Profit Schools

Visualization of Percent Enrollment with Census, For-Profit and Non-Profit data

Comparisons of People of Color and White People at For-Profit and Non-Profit Institutions

Binding Census data with Dataset for People of Color and White People

Question 3: What is the relationship between tuition rate and salary potential?

In-State Tuition and Early-Career Salary

In-State Tuition and Mid-Career Salary

Out-of-State Tuition and Early-Career Salary

Out-of-State Tuition and Mid-Career Salary

Question 4: Are For-Profit Tuition Costs Higher than Non-Profit Tuition Costs?

Inserting the CSV file (Tuition Cost)

Setting working directory

View data

Simplifying Names for all Schools in our Samples

For-Profit

Jennifer - For-Profit Name Changes

Ana - For-Profit Name Changes

Tiffany - For-Profit Name Changes

Non-Profit