Let’s explore the variables in the College Scorecard data that relate to student demographics and get a better understanding of them.

Finding the Demographic Variables

Let’s open up the version of the College Scorecard data Jacob made, reading the placeholder strings "NULL" and "PrivacySuppressed" as NA.

library(readr)
colleges <- read_csv("data/scorecard_reduced_bachelors.csv", na = c("NULL", "PrivacySuppressed"))
dim(colleges)
## [1] 2560 1730
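
To see what the na argument is doing, here is a minimal sketch on a tiny inline CSV. The toy_csv text and its values are made up for illustration (only the two placeholder strings come from the call above), and the column names VAR_A and VAR_B are hypothetical.

```r
library(readr)

# Toy CSV text standing in for the real file; all values are invented.
toy_csv <- I("INSTNM,VAR_A,VAR_B\nCollege A,1200,PrivacySuppressed\nCollege B,NULL,41000\n")

# Same na argument as the call above: both placeholder strings become NA.
toy <- read_csv(toy_csv, na = c("NULL", "PrivacySuppressed"))

# With the placeholders read as NA, the remaining values can parse as numbers.
is.na(toy$VAR_A)
is.numeric(toy$VAR_B)
```

Without the na argument, the placeholder strings would be kept as text, so any column containing them would most likely be read as character rather than numeric.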

Now let’s open up the data dictionary.

dictionary <- read_csv("data/CollegeScorecardDataDictionary-09-08-2015.csv")
dim(dictionary)
## [1] 1953    9

The data dictionary will tell us which variables are the demographic variables. Let’s check out what we have in the data dictionary.

names(dictionary)
## [1] "NAME OF DATA ELEMENT"    "dev-category"            "developer-friendly name"
## [4] "API data type"           "VARIABLE NAME"           "VALUE"                  
## [7] "LABEL"                   "SOURCE"                  "NOTES"

It is dev-category that tells us which variables belong to which category. What are the categories?

levels(factor(dictionary$`dev-category`))
##  [1] "academics"  "admissions" "aid"        "completion" "cost"       "earnings"   "repayment" 
##  [8] "root"       "school"     "student"
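
If we also want to know how many dictionary rows fall into each category, one option is count() from dplyr. This is a sketch on a toy stand-in for the dictionary: the rows are made up, and VAR_X and VAR_Y are placeholder names, not real Scorecard variables.

```r
library(dplyr)

# Toy stand-in for the data dictionary; the category assignments are invented.
toy_dictionary <- tibble(
    `VARIABLE NAME` = c("UGDS", "UG25abv", "VAR_X", "VAR_Y"),
    `dev-category`  = c("student", "student", "aid", NA)
)

# One row per category with the number of dictionary rows in it,
# largest category first.
category_counts <- toy_dictionary %>% 
    count(`dev-category`, sort = TRUE)

category_counts
```

Unlike levels(factor(...)), count() keeps a row for NA, so dictionary rows without a category show up in the tally too.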

Which variables belong to student?

library(dplyr)
student_names <- dictionary %>% 
    filter(`dev-category` %in% c("student")) %>% 
    select(`VARIABLE NAME`, `NAME OF DATA ELEMENT`)

student_names
## # A tibble: 94 × 2
##    `VARIABLE NAME`
##              <chr>
## 1             UGDS
## 2               UG
## 3       UGDS_WHITE
## 4       UGDS_BLACK
## 5        UGDS_HISP
## 6       UGDS_ASIAN
## 7        UGDS_AIAN
## 8        UGDS_NHPI
## 9        UGDS_2MOR
## 10        UGDS_NRA
##                                                                                         `NAME OF DATA ELEMENT`
##                                                                                                          <chr>
## 1                                                          Enrollment of undergraduate degree-seeking students
## 2                                                                     Enrollment of all undergraduate students
## 3                             Total share of enrollment of undergraduate degree-seeking students who are white
## 4                             Total share of enrollment of undergraduate degree-seeking students who are black
## 5                          Total share of enrollment of undergraduate degree-seeking students who are Hispanic
## 6                             Total share of enrollment of undergraduate degree-seeking students who are Asian
## 7     Total share of enrollment of undergraduate degree-seeking students who are American Indian/Alaska Native
## 8  Total share of enrollment of undergraduate degree-seeking students who are Native Hawaiian/Pacific Islander
## 9                 Total share of enrollment of undergraduate degree-seeking students who are two or more races
## 10              Total share of enrollment of undergraduate degree-seeking students who are non-resident aliens
## # ... with 84 more rows

Understanding the Student Data

What are these 94 student variables?

student_names$`NAME OF DATA ELEMENT`
##  [1] "Enrollment of undergraduate degree-seeking students"                                                             
##  [2] "Enrollment of all undergraduate students"                                                                        
##  [3] "Total share of enrollment of undergraduate degree-seeking students who are white"                                
##  [4] "Total share of enrollment of undergraduate degree-seeking students who are black"                                
##  [5] "Total share of enrollment of undergraduate degree-seeking students who are Hispanic"                             
##  [6] "Total share of enrollment of undergraduate degree-seeking students who are Asian"                                
##  [7] "Total share of enrollment of undergraduate degree-seeking students who are American Indian/Alaska Native"        
##  [8] "Total share of enrollment of undergraduate degree-seeking students who are Native Hawaiian/Pacific Islander"     
##  [9] "Total share of enrollment of undergraduate degree-seeking students who are two or more races"                    
## [10] "Total share of enrollment of undergraduate degree-seeking students who are non-resident aliens"                  
## [11] "Total share of enrollment of undergraduate degree-seeking students whose race is unknown"                        
## [12] "Total share of enrollment of undergraduate degree-seeking students who are white non-Hispanic"                   
## [13] "Total share of enrollment of undergraduate degree-seeking students who are black non-Hispanic"                   
## [14] "Total share of enrollment of undergraduate degree-seeking students who are Asian/Pacific Islander"               
## [15] "Total share of enrollment of undergraduate degree-seeking students who are American Indian/Alaska Native"        
## [16] "Total share of enrollment of undergraduate degree-seeking students who are Hispanic"                             
## [17] "Total share of enrollment of undergraduate students who are non-resident aliens"                                 
## [18] "Total share of enrollment of undergraduate students whose race is unknown"                                       
## [19] "Total share of enrollment of undergraduate students who are white non-Hispanic"                                  
## [20] "Total share of enrollment of undergraduate students who are black non-Hispanic"                                  
## [21] "Total share of enrollment of undergraduate students who are Asian/Pacific Islander"                              
## [22] "Total share of enrollment of undergraduate students who are American Indian/Alaska Native"                       
## [23] "Total share of enrollment of undergraduate students who are Hispanic"                                            
## [24] "Share of undergraduate, degree-/certificate-seeking students who are part-time"                                  
## [25] "Share of undergraduate, degree-/certificate-seeking students who are part-time"                                  
## [26] "Share of undergraduate students who are first-time, full-time degree-/certificate-seeking undergraduate students"
## [27] "First-time, full-time student retention rate at four-year institutions"                                          
## [28] "First-time, full-time student retention rate at less-than-four-year institutions"                                
## [29] "First-time, part-time student retention rate at four-year institutions"                                          
## [30] "First-time, part-time student retention rate at less-than-four-year institutions"                                
## [31] "Percentage of undergraduates aged 25 and above"                                                                  
## [32] "Percentage of aided students whose family income is between $0-$30,000"                                          
## [33] "Percentage of students who are financially independent"                                                          
## [34] "Percentage of students who are financially independent and have family incomes between $0-30,000"                
## [35] "Percentage of students who are financially dependent and have family incomes between $0-30,000"                  
## [36] "Percentage first-generation students"                                                                            
## [37] "Aided students with family incomes between $30,001-$48,000 in nominal dollars"                                   
## [38] "Aided students with family incomes between $48,001-$75,000 in nominal dollars"                                   
## [39] "Aided students with family incomes between $75,001-$110,000 in nominal dollars"                                  
## [40] "Aided students with family incomes between $110,001+ in nominal dollars"                                         
## [41] "Dependent students with family incomes between $30,001-$48,000 in nominal dollars"                               
## [42] "Dependent students with family incomes between $48,001-$75,000 in nominal dollars"                               
## [43] "Dependent students with family incomes between $75,001-$110,000 in nominal dollars"                              
## [44] "Dependent students with family incomes between $110,001+ in nominal dollars"                                     
## [45] "Independent students with family incomes between $30,001-$48,000 in nominal dollars"                             
## [46] "Independent students with family incomes between $48,001-$75,000 in nominal dollars"                             
## [47] "Independent students with family incomes between $75,001-$110,000 in nominal dollars"                            
## [48] "Independent students with family incomes between $110,001+ in nominal dollars"                                   
## [49] "Percent of students whose parents' highest educational level is middle school"                                   
## [50] "Percent of students whose parents' highest educational level is high school"                                     
## [51] "Percent of students whose parents' highest educational level was is some form of postsecondary education"        
## [52] "Number of applications is greater than or equal to 2"                                                            
## [53] "Number of applications is greater than or equal to 3"                                                            
## [54] "Number of applications is greater than or equal to 4"                                                            
## [55] "Number of applications is greater than or equal to 5"                                                            
## [56] "Average family income of dependent students in real 2014 dollars."                                               
## [57] "Average family income of independent students in real 2014 dollars."                                             
## [58] "Number of students in the family income cohort"                                                                  
## [59] "Number of students in the family income dependent students cohort"                                               
## [60] "Number of students in the family income independent students cohort"                                             
## [61] "Number of students in the disaggregation with valid dependency status"                                           
## [62] "Number of students in the parents' education level cohort"                                                       
## [63] "Number of students in the FAFSA applications cohort"                                                             
## [64] "Share of students who received a Pell Grant while in school"                                                     
## [65] "Average age of entry, via SSA data"                                                                              
## [66] "Average of the age of entry squared"                                                                             
## [67] "Percent of students over 23 at entry"                                                                            
## [68] "Share of female students, via SSA data"                                                                          
## [69] "Share of married students"                                                                                       
## [70] "Share of dependent students"                                                                                     
## [71] "Share of veteran students"                                                                                       
## [72] "Share of first-generation students"                                                                              
## [73] "Average family income"                                                                                           
## [74] "Median family income"                                                                                            
## [75] "Average family income for independent students"                                                                  
## [76] "Average of the log of family income"                                                                             
## [77] "Average of the log of family income for independent students"                                                    
## [78] "Percent of the population from students' zip codes that is White, via Census data"                               
## [79] "Percent of the population from students' zip codes that is Black, via Census data"                               
## [80] "Percent of the population from students' zip codes that is Asian, via Census data"                               
## [81] "Percent of the population from students' zip codes that is Hispanic, via Census data"                            
## [82] "Percent of the population from students' zip codes with a bachelor's degree over the age 25, via Census data"    
## [83] "Percent of the population from students' zip codes over 25 with a professional degree, via Census data"          
## [84] "Percent of the population from students' zip codes that was born in the US, via Census data"                     
## [85] "Median household income"                                                                                         
## [86] "Poverty rate, via Census data"                                                                                   
## [87] "Unemployment rate, via Census data"                                                                              
## [88] "Log of the median household income"                                                                              
## [89] "Number of students who sent their FAFSA reports to at least one college"                                         
## [90] "Share of students who submitted FAFSAs to only one college"                                                      
## [91] "Share of students who submitted FAFSAs to two colleges"                                                          
## [92] "Share of students who submitted FAFSAs to three colleges"                                                        
## [93] "Share of students who submitted FAFSAs to four colleges"                                                         
## [94] "Share of students who submitted FAFSAs to at least five colleges"
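
With this many descriptions, it can be easier to search for a keyword than to read the whole list. Here is a minimal sketch using grepl() on a toy version of student_names. The descriptions are copied from the list above, but only the first two variable names come from this post; VAR_72 and VAR_64 are placeholders.

```r
library(dplyr)

# Toy version of student_names; VAR_72 and VAR_64 are placeholder names.
toy_names <- tibble(
    `VARIABLE NAME` = c("UG25abv", "PAR_ED_PCT_1STGEN", "VAR_72", "VAR_64"),
    `NAME OF DATA ELEMENT` = c(
        "Percentage of undergraduates aged 25 and above",
        "Percentage first-generation students",
        "Share of first-generation students",
        "Share of students who received a Pell Grant while in school"
    )
)

# Keep only the rows whose description mentions the keyword.
# fixed = TRUE treats the keyword as plain text, not a regular expression.
first_gen <- toy_names %>% 
    filter(grepl("first-generation", `NAME OF DATA ELEMENT`, fixed = TRUE))

first_gen
```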

Variables for Diversity

Let’s look at the total share of enrollment of undergraduate degree-seeking students who identify with certain racial/ethnic groups. How many schools are missing data?

colleges %>% 
    summarize(`NA white` = mean(is.na(UGDS_WHITE)),
              `NA black` = mean(is.na(UGDS_BLACK)),
              `NA Hispanic` = mean(is.na(UGDS_HISP)),
              `NA Asian` = mean(is.na(UGDS_ASIAN)))
## # A tibble: 1 × 4
##   `NA white` `NA black` `NA Hispanic` `NA Asian`
##        <dbl>      <dbl>         <dbl>      <dbl>
## 1  0.1453125  0.1453125     0.1453125  0.1453125
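
Equal proportions suggest that the same rows are missing in every column, but they do not guarantee it. A quick way to check is to compare the missingness patterns row by row. This sketch uses a made-up toy_colleges table in which UGDS_HISP has the same share of NA values as the other columns but in different rows.

```r
library(dplyr)

# Toy data: every column is missing in half the rows,
# but UGDS_HISP is missing in a different set of rows.
toy_colleges <- tibble(
    UGDS_WHITE = c(0.5, NA, 0.3, NA),
    UGDS_BLACK = c(0.2, NA, 0.4, NA),
    UGDS_HISP  = c(0.1, NA, NA, 0.2)
)

# TRUE only if the two columns are missing in exactly the same rows.
missing_pattern <- toy_colleges %>% 
    summarize(same_white_black = all(is.na(UGDS_WHITE) == is.na(UGDS_BLACK)),
              same_white_hisp  = all(is.na(UGDS_WHITE) == is.na(UGDS_HISP)))

missing_pattern
```

Running the same comparison on colleges would confirm whether the matching proportions really do come from the same schools.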

Since these proportions match exactly, these are almost certainly the same schools. What kind of schools are these?

colleges %>% 
    filter(sch_deg == 3 & is.na(UGDS_WHITE)) %>% 
    select(INSTNM)
## # A tibble: 372 × 1
##                                              INSTNM
##                                               <chr>
## 1    Academy of Chinese Culture and Health Sciences
## 2             American Baptist Seminary of the West
## 3              American Film Institute Conservatory
## 4                       Phillips Graduate Institute
## 5  University of California-Hastings College of Law
## 6            University of California-San Francisco
## 7                  California Western School of Law
## 8             Church Divinity School of the Pacific
## 9                     Claremont Graduate University
## 10            Western University of Health Sciences
## # ... with 362 more rows

More Student Demographics

Let’s look at student characteristics like the percentage of undergraduates aged 25 and above and the percentage of first-generation students.

colleges %>% 
    summarize(`NA over 25` = mean(is.na(UG25abv)),
              `NA 1st gen` = mean(is.na(PAR_ED_PCT_1STGEN)))
## # A tibble: 1 × 2
##   `NA over 25` `NA 1st gen`
##          <dbl>        <dbl>
## 1    0.1441406    0.1140625

The variables for students over 25 and for first-generation students both look usable, with roughly 14% and 11% of schools missing values, respectively.
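
When checking more than a couple of columns, writing one mean(is.na(...)) per variable gets repetitive. One alternative, assuming dplyr 1.0 or later, is across(). This sketch uses a made-up toy_colleges table with the same two column names as above.

```r
library(dplyr)

# Toy data with invented values; only the column names come from the real data.
toy_colleges <- tibble(
    UG25abv           = c(0.10, NA, 0.35, 0.20),
    PAR_ED_PCT_1STGEN = c(0.30, 0.45, NA, NA)
)

# Apply the same "share of NA" calculation to every selected column.
na_shares <- toy_colleges %>% 
    summarize(across(c(UG25abv, PAR_ED_PCT_1STGEN), ~ mean(is.na(.x))))

na_shares
```

The result keeps the original column names, so it may be worth renaming the columns afterwards if the table is meant for display.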

Summary