Part 1 - Introduction


Historically, women have been among the many underrepresented populations in higher education. Recently this trend has been changing, and headlines indicating that female college graduates are surpassing their male counterparts have appeared. In this project, trends involving gender among PhD recipients are explored.


Research Question
Is the gender gap decreasing in PhD programs that have a historically underrepresented female population?


According to UNESCO, 17 women have won a Nobel Prize in physics, chemistry or medicine since Marie Curie in 1903, compared to 572 men, and only 28% of all of the world’s researchers are women. These global numbers are a direct result of discrimination, bias, and societal norms influencing the quality of education that women receive and the career choices they make. The goal for this project is to analyze data from the National Science Foundation on PhD recipients in the United States and confirm that these same disparities in gender are present in the US, despite efforts to decrease the gender gap. According to the 2030 Agenda for Sustainable Development adopted by the United Nations, innovations in science, engineering and technology are essential to address key global issues such as climate change, food security, healthcare, natural resources and conserving biodiversity for Earth’s ecosystems. Gender equality is essential to developing sustainable solutions to global issues because these threats to humanity require the collective intelligence of humanity, including the untapped potential of underrepresented groups.

Part 2 - The Data


Data was taken from tables from the Survey of Earned Doctorates - National Science Foundation and uploaded into this GitHub Repository. The Survey of Earned Doctorates “is an annual census conducted since 1957 of all individuals receiving a research doctorate from an accredited U.S. institution in a given academic year.” Upon initial analysis, a subset of data was selected from categories where females were initially underrepresented. The categories identified were Physical Science, Engineering and Math/CS.

Cases

Each case represents the number of male and female PhD recipients for a specific state in a specific year between 2009 and 2018 (excluding states that were omitted from the data). There is one case for every state in each of these years, except for Math/CS, which has cases only from 2015 to 2018.

Variables of Interest

The response variable is the proportional difference between the number of males and females in each category (Math/CS, Engineering and Physical Science). It is numerical (between 0 and 1).

The explanatory variables:

  • Discipline - Category of Discipline that granted PhD (categorical/nominal variable)
  • Year - Year when PhD was granted (categorical/ordinal variable)
  • State - State where PhD was granted (categorical/nominal variable)
  • Sex - Male or Female
  • Proportion - Two proportions were calculated and analyzed as a function of year: the proportion of males or females to the total number of PhDs granted in the country, and the proportion of males or females to the total number of PhDs granted in a specific state
  • Difference in Proportion - the difference between the male and female proportions was calculated and analyzed as a function of year
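As a rough illustration of how the two proportions and their difference relate, the sketch below uses the 2018 Engineering counts for California together with the California and national 2018 totals reported in the tables in Part 3. The helper name `prop_diff` is hypothetical and is not part of the original analysis; base R is used so the snippet runs on its own.

```r
# Hypothetical helper: difference between the male and female proportions
# for a given denominator (state total or national total).
prop_diff <- function(male, female, total) {
  (male / total) - (female / total)
}

# 2018 Engineering PhDs in California (from the 2018 table in Part 3)
ca_eng_male   <- 889
ca_eng_female <- 300

# Denominators: all PhDs granted in California and in the USA in 2018
ca_total <- 2647 + 3427    # TotalFemale + TotalMale, California
us_total <- 25368 + 29798  # TotalFemale + TotalMale, United States

state_diff   <- prop_diff(ca_eng_male, ca_eng_female, ca_total)
country_diff <- prop_diff(ca_eng_male, ca_eng_female, us_total)

print(round(c(state = state_diff, country = country_diff), 4))
```

The same counts give a much smaller difference against the national denominator than against the state denominator, which is why the two proportions are treated separately.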


Assumptions and Caveats

In the original data sets, certain values were suppressed for confidentiality reasons. For example:
2009 States Suppressed for each category of interest:
  • Physical Science (7) - Wyoming, Maine, South Dakota, North Dakota, Vermont
  • Engineering (10) - South Dakota, Nevada, Vermont, New Hampshire, Alaska, Hawaii, Idaho, Maine, Mississippi, Montana
  • Math and CS was not included as a category until 2015. For 2015 (13) - Alaska, Arkansas, District of Columbia, Hawaii, Idaho, Maine, Montana, Nevada, North Dakota, Puerto Rico, Vermont, West Virginia, Wyoming
2015 States Suppressed for each category of interest:
  • Physical Science (9) - West Virginia, Wyoming, Puerto Rico, South Dakota, Vermont, Rhode Island, Mississippi, Arkansas, Alaska
  • Math/CS (12) - Arkansas, Alaska, District of Columbia, Hawaii, Idaho, Montana, Maine, Nevada, North Dakota, Oklahoma, Puerto Rico, Vermont, West Virginia, Wyoming
  • Engineering (9) - Alaska, Hawaii, Idaho, Maine, Montana, Nevada, South Dakota, Vermont, Wyoming
2018 States Suppressed for each category of interest:
  • Physical Science (6) - Idaho, Wyoming, Maine, Puerto Rico, South Dakota, Vermont
  • Math/CS (14) - Arkansas, Delaware, Hawaii, Idaho, Montana, Nevada, New Hampshire, New Mexico, North Dakota, Oklahoma, Puerto Rico, Wyoming, South Dakota, Vermont
  • Engineering (6) - Alaska, Hawaii, Maine, Montana, Puerto Rico, Vermont

As a result, the sample of data included in this analysis excludes some states for a given category and year.

Type of Study

This is an observational study that attempts to use linear regression to model trends between male and female PhD recipients in the USA.

Scope of Inference:

The population of interest is all PhD recipients in the USA. The findings in this study can be generalized to this population because the number of cases exceeds 30, and the sample is assumed to be a simple random sample. A potential source of bias here is that some states were omitted from the dataset due to confidentiality reasons that are not specified on the NSF website. This data cannot be used to establish a causal link between variables of interest because there are a plethora of socio-economic factors that are involved in gender inequality in higher education. See this source for more information.

Part 3 - Exploratory Data Analysis


Import Data for EDA

The RCurl library was used to read the csv files into dataframes, and dplyr was used to transform and tidy the data into dataframes where each row is an observation (state) and each column corresponds to the number of males or females who earned a PhD in each category. Column names were also changed to reflect the variables of interest (major and gender).
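The cleaning idea described above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the values and state names are invented, the helper `clean_counts` is hypothetical, and base R is used instead of the dplyr/tidyr pipeline so the snippet is self-contained. In the NSF tables, "D" marks a suppressed cell and large counts carry thousands separators.

```r
# Toy stand-in for one NSF table (values are made up).
toy <- data.frame(
  State       = c("Alpha", "Beta", "Gamma"),
  TotalMale   = c("1,234", "56", "D"),
  TotalFemale = c("987", "D", "12"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn a character column of counts into numbers,
# mapping suppressed cells ("D") to NA and stripping commas.
clean_counts <- function(x) {
  x[x == "D"] <- NA
  as.numeric(gsub(",", "", x))
}

toy$TotalMale   <- clean_counts(toy$TotalMale)
toy$TotalFemale <- clean_counts(toy$TotalFemale)
print(toy)
```

The end result matches what the report's pipeline produces: suppressed cells end up as NA and every count column is numeric.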

2009

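#Load libraries used throughout this report (assumed; the original setup chunk is not shown)
#Stack() is assumed to come from the Stack package
library(RCurl)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(Stack)

#Load url from Github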
url<- "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2009Doc.csv"

raw <- getURL(url)

#Select and rename columns to include sex and field, omit unnecessary rows, pivot into long format and remove suppressed ("D") values
dataFrame09 <- read.csv(text = raw) %>% 
  data.frame() %>%
  select(TABLE.6...Doctorates.awarded.by.state.location..by.broad.field.of.study.and.sex.of.doctorate.recipients..2009, X, X.1, X.6, X.7, X.12, X.13) %>% rename(State= TABLE.6...Doctorates.awarded.by.state.location..by.broad.field.of.study.and.sex.of.doctorate.recipients..2009, TotalMale = X, TotalFemale = X.1, PhysciMale = X.6, PhysciFemale = X.7, EngMale = X.12, EngFemale = X.13) %>%
  slice(3:62) %>%
  pivot_longer(-State, names_to = "Cat") %>% 
  filter(value != "D") 

#Change all values in the value column - delete "," and cast values as numeric types 
dataFrame09$value <- 
  as.numeric(unlist(str_remove_all(dataFrame09$value, ',')))

#Drop NA Values (extra rows) and place data back in wide format
dataFrame09 <- dataFrame09 %>% 
  drop_na() %>% 
  spread(Cat,value)

#Display 
kable(dataFrame09)%>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
State EngFemale EngMale PhysciFemale PhysciMale TotalFemale TotalMale
Alabama 20 74 31 60 299 331
Alaska NA NA 6 6 16 21
Arizona 34 130 51 117 505 543
Arkansas 10 19 6 11 97 111
California 257 825 318 871 2708 3258
Colorado 24 89 54 127 390 431
Connecticut 16 31 34 70 320 331
Delaware 11 37 21 40 135 138
District of Columbia 10 41 20 36 310 277
Florida 69 263 97 226 955 1071
Georgia 71 261 64 153 670 759
Hawaii NA NA 9 14 92 92
Idaho NA NA 5 15 50 72
Illinois 85 281 131 298 1093 1229
Indiana 45 208 64 171 574 757
Iowa 29 95 39 61 294 377
Kansas 7 42 16 48 205 228
Kentucky 11 42 14 27 221 215
Louisiana 16 48 22 62 212 294
Maine NA NA NA NA 34 24
Maryland 48 161 52 164 588 611
Massachusetts 88 365 153 354 1146 1472
Michigan 54 313 92 200 785 988
Minnesota 12 66 30 66 510 449
Mississippi NA NA 11 32 212 208
Missouri 23 68 46 71 415 453
Montana NA NA NA NA 59 49
Nebraska 5 22 12 22 154 150
Nevada NA NA NA NA 109 106
New Hampshire NA NA 13 26 68 75
New Jersey 42 130 55 163 465 607
New Mexico 10 38 13 55 116 164
New York 105 318 171 462 1883 1967
North Carolina 53 153 88 160 714 708
North Dakota NA 10 NA NA 54 67
Ohio 56 269 87 184 874 986
Oklahoma 13 40 20 41 186 233
Oregon 8 41 26 65 204 249
Pennsylvania 104 331 116 334 1184 1385
Puerto Rico NA NA 10 13 124 76
Rhode Island 6 26 30 50 117 144
South Carolina 13 49 22 66 223 244
South Dakota NA NA NA NA 54 47
Tennessee 30 93 30 78 398 439
Texas 106 437 153 371 1562 1801
United Statese 1623 6006 2450 5868 23190 26338
Utah 10 84 29 59 201 282
Vermont NA NA NA 6 26 22
Virginia 44 185 67 116 598 694
Washington 25 77 56 77 433 421
West Virginia NA 22 7 13 80 83
Wisconsin 27 112 38 135 439 561
Wyoming NA NA NA NA 29 38


2015

# Load data from Github

#Load url
url<- "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2015Doc.csv"
raw <- getURL(url)

#Select and rename columns to include sex and field, omit unnecessary rows, pivot into long format and remove suppressed ("D") values
dataFrame15 <- read.csv(text = raw) %>% 
  data.frame() %>%
  select(State= Table.6..Doctorates.awarded..by.state.or.location..broad.field.of.study..and.sex.of.doctorate.recipients..2015, TotalMale = X, TotalFemale = X.1, PhysciMale = X.4, PhysciFemale = X.5, MathMale = X.6,MathFemale = X.7, EngMale = X.10, EngFemale = X.11) %>%
  slice(5:57) %>%
  pivot_longer(-State, names_to = "Cat") %>% 
  filter(value != "D")

#Change all values in the value column - delete "," and cast values as numeric types 
dataFrame15$value <- 
  as.numeric(unlist(str_remove_all(dataFrame15$value, ',')))

#Drop NA Values (extra rows) and place data back in wide format
dataFrame15 <- dataFrame15 %>% 
  drop_na() %>% 
  spread(Cat,value)

#Display 
kable(dataFrame15)%>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
State EngFemale EngMale MathFemale MathMale PhysciFemale PhysciMale TotalFemale TotalMale
Alabama 28 105 9 37 15 23 309 386
Alaska NA NA NA NA NA NA 14 27
Arizona 29 120 12 51 45 93 449 541
Arkansas 12 46 NA NA NA NA 95 130
California 265 893 83 384 245 510 2686 3386
Colorado 61 164 19 45 63 104 439 567
Connecticut 28 63 13 36 30 67 387 392
Delaware 16 43 9 17 14 24 100 127
District of Columbia 18 49 NA NA 7 23 311 265
Florida 83 303 33 132 84 189 1127 1237
Georgia 90 269 37 89 43 69 696 785
Hawaii NA NA NA NA 6 20 118 122
Idaho NA NA NA NA 6 12 47 68
Illinois 104 360 36 134 82 170 1066 1413
Indiana 80 264 44 94 65 106 711 870
Iowa 17 96 12 34 33 48 293 393
Kansas 19 51 7 23 23 32 275 301
Kentucky 17 62 7 26 6 21 199 304
Louisiana 22 63 6 35 24 39 299 333
Maine NA NA NA NA 6 7 39 33
Maryland 59 179 32 87 39 98 660 745
Massachusetts 142 367 42 158 133 238 1266 1570
Michigan 82 346 32 100 92 134 889 1103
Minnesota 30 103 12 44 17 51 714 594
Mississippi 11 33 10 13 NA NA 232 214
Missouri 37 118 15 47 33 65 464 522
Montana NA NA NA NA 7 21 58 70
Nebraska 7 39 7 10 11 18 176 196
Nevada NA NA NA NA 13 19 102 109
New Hampshire 6 18 7 7 9 17 80 89
New Jersey 54 158 17 91 30 114 468 660
New Mexico 20 51 5 18 17 41 169 176
New York 138 437 84 204 138 289 1996 2090
North Carolina 83 212 37 110 58 89 846 857
North Dakota 5 29 NA NA 6 13 88 87
Ohio 93 323 26 84 68 126 924 1069
Oklahoma 21 71 6 22 13 39 219 296
Oregon 15 48 9 21 25 50 237 252
Pennsylvania 154 464 72 143 81 171 1186 1442
Puerto Rico 5 8 NA NA NA NA 129 65
Rhode Island 12 23 9 27 NA NA 160 160
South Carolina 27 117 15 29 13 38 266 320
South Dakota NA NA 0 7 NA NA 46 64
Tennessee 34 118 9 26 23 46 436 464
Texas 189 636 77 192 136 281 1842 2224
United Statesd 2301 7596 943 2880 1988 3935 25403 29596
Utah 26 113 7 34 16 49 211 365
Vermont NA NA NA NA NA NA 34 42
Virginia 52 253 23 92 54 91 709 826
Washington 41 112 23 61 37 69 456 502
West Virginia 5 32 NA NA NA NA 101 116
Wisconsin 43 135 17 50 47 88 546 575
Wyoming NA NA NA NA NA NA 33 52


2018

# Load data from Github
url<- "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2018Doc.csv"
raw <- getURL(url)

#Omit unnecessary rows, select and rename columns to include sex and field, pivot into long format and remove suppressed ("D") values
dataFrame18 <- read.csv(text = raw) %>% 
  data.frame() %>%
  slice(5:57) %>%
  select(State= Table.6, TotalMale = X, TotalFemale = X.1, PhysciMale = X.4, PhysciFemale = X.5, MathMale = X.6, MathFemale = X.7, EngMale = X.10, EngFemale = X.11)%>%
  pivot_longer(-State, names_to = "Cat") %>% 
  filter(value != "D")

#Change all values in the value column - delete "," and cast values as numeric types 
dataFrame18$value <- 
  as.numeric(unlist(str_remove_all(dataFrame18$value, ',')))

#Place data back in wide format
dataFrame18 <- spread(dataFrame18, Cat,value)

#Display 
kable(dataFrame18)%>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
State EngFemale EngMale MathFemale MathMale PhysciFemale PhysciMale TotalFemale TotalMale
Alabama 25 103 12 23 11 31 329 338
Alaska NA NA 0 0 8 7 27 29
Arizona 43 100 10 42 30 65 364 399
Arkansas 10 44 NA NA 9 14 104 162
California 300 889 108 377 286 597 2647 3427
Colorado 60 160 17 45 59 122 478 574
Connecticut 33 79 10 40 37 79 363 422
Delaware 24 59 NA NA 21 27 98 140
District of Columbia 14 52 5 28 20 18 296 292
Florida 87 327 59 142 96 180 1094 1252
Georgia 68 268 23 97 51 91 684 827
Hawaii NA NA NA NA 12 15 104 96
Idaho 6 19 NA NA NA NA 41 56
Illinois 97 350 70 156 98 199 1107 1408
Indiana 78 255 34 108 64 116 708 923
Iowa 31 121 16 44 21 54 334 409
Kansas 11 57 11 21 28 34 250 284
Kentucky 10 45 10 16 11 27 223 271
Louisiana 21 66 13 27 25 34 281 295
Maine NA NA 0 0 NA NA 28 22
Maryland 64 182 31 99 54 78 665 699
Massachusetts 178 428 38 161 139 258 1330 1616
Michigan 97 325 34 115 76 147 863 1090
Minnesota 41 114 10 32 27 51 795 642
Mississippi 6 37 7 13 14 29 226 245
Missouri 43 159 15 40 21 70 430 546
Montana NA NA NA NA 6 10 56 56
Nebraska 7 27 10 18 8 25 164 177
Nevada 9 25 NA NA 17 24 125 115
New Hampshire 15 25 NA NA 12 21 73 92
New Jersey 57 153 19 82 38 89 529 595
New Mexico 15 42 NA NA 15 40 161 161
New York 153 447 72 266 150 301 2051 2207
North Carolina 74 235 44 113 67 111 843 890
North Dakota 8 18 NA NA 7 14 104 89
Ohio 92 299 27 98 84 178 957 1094
Oklahoma 15 67 NA NA 11 36 235 269
Oregon 20 70 8 32 28 62 237 300
Pennsylvania 144 456 53 172 70 165 1165 1457
Puerto Rico NA NA NA NA NA NA 88 59
Rhode Island 9 30 9 26 20 33 140 186
South Carolina 36 103 11 25 16 40 262 306
South Dakota 8 16 NA NA NA NA 51 63
Tennessee 58 134 11 39 17 57 467 488
Texas 193 702 76 213 132 325 1771 2297
United Statesd 2453 7726 983 3043 2118 4214 25368 29798
Utah 18 92 15 27 23 38 200 311
Vermont NA NA NA NA NA NA 32 31
Virginia 79 236 26 92 48 103 687 826
Washington 31 104 21 51 50 68 463 501
West Virginia 6 32 6 5 6 14 96 123
Wisconsin 39 115 13 57 51 74 506 575
Wyoming 6 16 NA NA NA NA 36 66


USA Totals

First, the total number of PhD recipients was analyzed per year (2009, 2015 & 2018). Note: PhD recipients from all disciplines are included in the USA totals.
Findings:
  • The total number of people who earned PhD degrees increased from 2009 - 2018
  • The number of males who earned PhDs increased by 13.14% from 2009 - 2018
  • The number of females who earned PhDs increased by 9.39% from 2009 - 2018
  • Fewer females earned PhDs than males in 2009, 2015 and 2018
  • The gap between the male and female shares of PhD recipients widened from 6.4 percentage points in 2009 to 7.6 in 2015 and finally to 8.0 in 2018
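The percentages in these findings can be reproduced directly from the national rows of the three tables above. The sketch below is a minimal base R check rather than the report's own pipeline; the data frame `us` and the helper `pct_growth` are hypothetical names introduced for illustration.

```r
# National totals for all disciplines, copied from the "United States"
# rows of the 2009, 2015 and 2018 tables above.
us <- data.frame(
  year   = c(2009, 2015, 2018),
  male   = c(26338, 29596, 29798),
  female = c(23190, 25403, 25368)
)

# Percent growth from the first year (2009) to the last year (2018)
pct_growth <- function(x) 100 * (x[length(x)] - x[1]) / x[1]
male_growth   <- pct_growth(us$male)
female_growth <- pct_growth(us$female)

# Gap between the male and female shares of all PhDs, in percentage points
us$gap_pct <- 100 * (us$male - us$female) / (us$male + us$female)

print(round(c(male = male_growth, female = female_growth), 2))
print(round(us$gap_pct, 1))
```

The gap is expressed relative to the combined male and female total for each year, which is the same denominator (`gradSum`) used in the plotting code that follows.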


#Select only total and gather sex and graduates into long format
Total09 <- select(dataFrame09, State, Male = TotalMale, Female = TotalFemale) %>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = 2009)

#Isolate totals for country (for bar chart 1)
TotalUS09<- filter(Total09, State == "United Statese") %>%
  mutate(gradSum = sum(graduates))

#Select only total and gather sex and graduates into long format
Total15 <- select(dataFrame15, State, Male = TotalMale, Female = TotalFemale) %>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = 2015)

#Isolate totals for country (for bar chart 1)
TotalUS15<- filter(Total15, State == "United Statesd") %>%
  mutate(gradSum = sum(graduates))

#Select only total and gather sex and graduates into long format
Total18 <- select(dataFrame18, State, Male = TotalMale, Female = TotalFemale)%>% 
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = 2018)

#Isolate totals for country (for bar chart 1)
TotalUS18<- filter(Total18, State == "United Statesd") %>%
  mutate(gradSum = sum(graduates))

# Stack them
TotalUS <- Stack(TotalUS09, TotalUS15)
TotalUS <- Stack(TotalUS, TotalUS18)
TotalUS$year <- as.factor(TotalUS$year)

#Calculate proportion
TotalUS <- TotalUS %>%
  mutate(propGrad = graduates/gradSum) 

#Bar Plot
ggplot(TotalUS) +
  geom_col(aes(x = year, y = propGrad, fill = sex), position = "dodge") +
  scale_fill_manual(values=c("yellow","lightgreen"))+
  labs(y = "Proportion of Graduates", x ="Year/Sex", title = "Proportion of PhDs Granted in the USA") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top")

#Calculate differences between male and female proportions
TotalUS <- TotalUS %>%
  select(year, propGrad, sex) %>%
  spread(sex, propGrad) %>% 
  mutate(difference = Male - Female)

#Plot proportional differences
ggplot(TotalUS) +
  geom_col(aes(x = year, y = difference, fill = year), position = "dodge") +
  scale_fill_manual(values=c("darkblue","blue", "lightblue")) +
  labs(y = "Difference (Male - Female)", x ="Year", title = "Proportion Difference Between Male & Female PhD Grads") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top") +
  geom_text(aes(x = year, y = difference + 0.01, label = round(difference, 3)))

By State

The total number of PhD recipients was analyzed per year (2009, 2015 & 2018) by state (only states with included data). Note: PhD recipients from all disciplines are included in the state totals.

Findings:
  • The distribution of PhDs granted in each state has a slight right skew in 2009 and 2015. The skew becomes more apparent in 2018 for female PhD recipients due to an increase in spread
  • The median for males is greater than the median for females in all three years
  • Median PhD recipients increased from 2009 - 2015 but decreased from 2015 - 2018 for both genders.
#Isolate totals for other states and drop all NA values (for boxplot 1)
TotalStates09 <- filter(Total09, State != "United Statese") %>% drop_na()

TotalStates15 <- filter(Total15, State != "United Statesd") %>% drop_na()

#Isolate totals for other states and drop all NA values (for boxplot 1)
TotalStates18 <- filter(Total18, State != "United Statesd") %>% drop_na()

TotalStates <- Stack(TotalStates09, TotalStates15)
TotalStates <- Stack(TotalStates, TotalStates18)
TotalStates$year <- as.factor(TotalStates$year)
TotalStates$State <- as.factor(TotalStates$State)

#Box Plot of distribution of PhD's in each state
ggplot(TotalStates, aes(x = year, y = graduates, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values= c("yellow", "lightgreen")) +
  labs(y = "Graduates", x ="Year", title = "Distribution for PhDs Granted in States") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top") 

#Print Medians for each year
#2009
TotalStates09 <- TotalStates09 %>%
  spread(sex, graduates)

male09 <- median(TotalStates09$Male)
female09 <- median(TotalStates09$Female)

#2015
TotalStates15 <- TotalStates15 %>%
  spread(sex, graduates)

male15 <- median(TotalStates15$Male)
female15 <- median(TotalStates15$Female)

#2018
TotalStates18 <- TotalStates18 %>%
  spread(sex, graduates)
male18 <- median(TotalStates18$Male)
female18 <- median(TotalStates18$Female)

#Create dataframe with medians for plotting
mediansDf <- data.frame(Year = as.factor(c(2009, 2015, 2018)), 
                        Male = c(male09, male15, male18), 
                        Female = c(female09, female15, female18)) %>%
  gather(sex, median, Male:Female)

#Plot medians
ggplot(mediansDf) +
  geom_col(aes(x = Year, y = median, fill = sex), position = "dodge") +
  scale_fill_manual(values=c("yellow", "lightgreen")) +
  labs(y = "Median", x ="Year", title = "Median PhDs Granted in States") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top") 


Totals for USA by Discipline

Preliminary data exploration uncovered that the number of PhDs granted to females is greater than the number granted to males in the following fields: Social Sciences (including Psychology), Education, “Other” (non-STEM), Humanities and Life Sciences.

In the following fields the number of male PhD recipients was greater than the number of female recipients: Engineering, Physical Science and Math/CS, as well as in the total across all categories.

Thus I decided to focus on the fields where women were underrepresented, because it is these fields that are tipping the scale in favor of males earning more PhDs than females overall in the United States.

First, the total number of PhD recipients was analyzed per discipline for each year.

Engineering Degrees

Findings:
  • About 3.701 times as many males as females earned PhDs in Engineering in 2009 (6006 vs. 1623 in the table above), 3.301 times as many in 2015, and 3.15 times as many in 2018
  • The difference in the proportion of males and females increased from 2009 to 2015 and slightly decreased from 2015 - 2018
  • All distributions have a right skew, more so for females than males.
  • The variation is also drastically different between males and females. The spread for males is a lot larger than for females in all three years.


#Subset Engineering Degrees data '09
engData09 <- select(dataFrame09, State, Male = EngMale, Female = EngFemale)%>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2009))

#Isolate totals for country
TotalEng09<- filter(engData09, State == "United Statese") %>%
  mutate(prop = graduates/TotalUS09$gradSum[1])

#Isolate totals for other states and drop all NA values (for boxplot 1)
TotalEngStates09 <- filter(engData09, State != "United Statese") %>% drop_na()

#Subset Engineering Degrees data '15
engData15 <- select(dataFrame15, State, Male = EngMale, Female = EngFemale)%>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2015))

#Isolate totals for country
TotalEng15<- filter(engData15, State == "United Statesd") %>%
  mutate(prop = graduates/TotalUS15$gradSum[1])

#Isolate totals for other states and drop all NA values
TotalEngStates15 <- filter(engData15, State != "United Statesd") %>% drop_na()

#Subset Engineering Degrees data '18
engData18 <- select(dataFrame18, State, Male = EngMale, Female = EngFemale)%>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2018))

#Isolate totals for country
TotalEng18<- filter(engData18, State == "United Statesd") %>%
  mutate(prop = graduates/TotalUS18$gradSum[1])

#Isolate totals for other states and drop all NA values
TotalEngStates18 <- filter(engData18, State != "United Statesd") %>% drop_na()

#Totals 
TotalEng <- Stack(TotalEng09, TotalEng15)
TotalEng <- Stack(TotalEng, TotalEng18)

#Bar Plot of proportions
ggplot(TotalEng) +
  geom_col(aes(x = year, y = prop, fill = sex), position = "dodge") +
  scale_fill_manual(values=c("yellow", "lightgreen"))+
  labs(y = "Proportion of Total Graduates", x ="Year", title = "Proportion of Engineering PhDs in USA") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top")

#Calculate Difference
TotalEng <- TotalEng %>%
  select(sex, year, prop) %>%
  spread(sex, prop) %>%
  mutate(propDiff = Male - Female)

#Plot proportional differences
ggplot(TotalEng) +
  geom_col(aes(x = year, y = propDiff, fill = year), position = "dodge") +
  scale_fill_manual(values=c("darkblue","blue", "lightblue")) +
  labs(y = "Difference (Male - Female)", x ="Year", title = "Proportion Difference Between Male & Female Engineering PhD Grads") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top") +
  geom_text(aes(x = year, y = propDiff + 0.01, label = round(propDiff, 4)))

#Consolidate state dataframes
TotalEngStates <- Stack(TotalEngStates09, TotalEngStates15)
TotalEngStates <- Stack(TotalEngStates, TotalEngStates18)

#Box Plot of distribution of PhD's in each state
ggplot(TotalEngStates, aes(x = year, y = graduates, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values= c("yellow", "lightgreen")) +
  labs(y = "Graduates", x ="Year", title = "Distribution for PhDs in Engineering in States") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top") 


Physical Science

Findings:
  • About 2.395 times as many males as females earned PhDs in a Physical Science in 2009 (5868 vs. 2450 in the table above), 1.979 times as many in 2015, and 1.99 times as many in 2018
  • The difference in the proportion of males and females decreased from 2009 to 2015 but increased from 2015 - 2018
  • Similar to the Engineering data, all distributions have a right skew, more so for females than males, and the variation differs markedly between males and females. The spread for males is a lot larger than for females in all three years.
#Subset Physical Science Data
Physci09 <- select(dataFrame09, State, Male = PhysciMale, Female = PhysciFemale) %>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2009))

#Isolate totals for country (for bar chart 1)
TotalPhysci09<- filter(Physci09, State == "United Statese")%>%
  mutate(prop = graduates/TotalUS09$gradSum[1])

#Isolate totals for other states and drop all NA values (for boxplot 1)
TotalPhysciStates09 <- filter(Physci09, State != "United Statese") %>% drop_na()

#Subset Physical Science data '15
Physci15 <- select(dataFrame15, State, Male = PhysciMale, Female = PhysciFemale)%>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2015))

#Isolate totals for country
TotalPhysci15<- filter(Physci15, State == "United Statesd") %>%
  mutate(prop = graduates/TotalUS15$gradSum[1])

#Isolate totals for other states and drop all NA values
TotalPhysciStates15 <- filter(Physci15, State != "United Statesd") %>% drop_na()

#Subset Physical Science data '18
Physci18 <- select(dataFrame18, State, Male = PhysciMale, Female = PhysciFemale)%>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2018))

#Isolate totals for country
TotalPhysci18<- filter(Physci18, State == "United Statesd") %>%
  mutate(prop = graduates/TotalUS18$gradSum[1])

#Isolate totals for other states and drop all NA values
TotalPhysciStates18 <- filter(Physci18, State != "United Statesd") %>% drop_na()

#Totals 
TotalPhysci <- Stack(TotalPhysci09, TotalPhysci15)
TotalPhysci <- Stack(TotalPhysci, TotalPhysci18)

#Bar Plot of proportions
ggplot(TotalPhysci) +
  geom_col(aes(x = year, y = prop, fill = sex), position = "dodge") +
  scale_fill_manual(values=c("yellow", "lightgreen"))+
  labs(y = "Proportion of Total Graduates", x ="Year", title = "Proportion of Physical Science PhDs in USA") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top")

#Calculate Difference
TotalPhysci <- TotalPhysci%>%
  select(sex, year, prop) %>%
  spread(sex, prop) %>%
  mutate(propDiff = Male - Female)

#Plot proportional differences
ggplot(TotalPhysci) +
  geom_col(aes(x = year, y = propDiff, fill = year), position = "dodge") +
  scale_fill_manual(values=c("darkblue","blue", "lightblue")) +
  labs(y = "Difference (Male - Female)", x ="Year", title = "Proportion Difference Between Male & Female Physical Science PhD Grads") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top") +
  geom_text(aes(x = year, y = propDiff + 0.01, label = round(propDiff, 4)))

#Consolidate state dataframes
TotalPhysciStates <- Stack(TotalPhysciStates09, TotalPhysciStates15)
TotalPhysciStates <- Stack(TotalPhysciStates, TotalPhysciStates18)

#Box Plot of distribution of PhD's in each state
ggplot(TotalPhysciStates, aes(x = year, y = graduates, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values= c("yellow", "lightgreen")) +
  labs(y = "Graduates", x ="Year", title = "Distribution for PhDs in Physical Science in States") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top")



Math and Computer Science

Note: Math and CS was not categorized as a discipline in this data set before 2015.

Findings:
  • About 3.05 times as many males as females earned PhDs in Math & Computer Science in 2015, and 3.096 times as many in 2018.
  • The difference in the proportion of males and females increased from 2015 - 2018
  • Similar to the Engineering and Physical Science data, all distributions have a right skew, more so for females than males, and the variation differs markedly between males and females. The spread for males is a lot larger than for females in both years.
  • Out of all the disciplines reviewed in this project, Math and Computer Science seems to contain the largest gender gap.
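The male-to-female ratios quoted in these findings follow from the national Math/CS counts in the 2015 and 2018 tables. The snippet below is a minimal base R sketch of that arithmetic; `mf_ratio` and `math_cs` are hypothetical names used only for illustration.

```r
# Hypothetical helper: male-to-female ratio of PhD recipients
mf_ratio <- function(male, female) male / female

# National Math/CS counts from the "United States" rows of the
# 2015 and 2018 tables above
math_cs <- data.frame(
  year   = c(2015, 2018),
  male   = c(2880, 3043),
  female = c(943, 983)
)

math_cs$ratio <- mf_ratio(math_cs$male, math_cs$female)
print(round(math_cs$ratio, 3))
```

The same helper can be applied to the Engineering and Physical Science rows to compare disciplines on an equal footing.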
#Subset Math/CS data '15
MathCS15 <- select(dataFrame15, State, Male = MathMale, Female = MathFemale)%>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2015))

#Isolate totals for country
TotalMathCS15<- filter(MathCS15, State == "United Statesd") %>%
  mutate(prop = graduates/TotalUS15$gradSum[1])

#Isolate totals for other states and drop all NA values
TotalMathCSStates15 <- filter(MathCS15, State != "United Statesd") %>% drop_na()

#Subset Math/CS data '18
MathCS18 <- select(dataFrame18, State,  Male = MathMale, Female = MathFemale)%>%
  gather(sex, graduates, Male:Female) %>% 
  mutate(year = as.factor(2018))

#Isolate totals for country
TotalMathCS18<- filter(MathCS18, State == "United Statesd") %>%
  mutate(prop = graduates/TotalUS18$gradSum[1])

#Isolate totals for other states and drop all NA values
TotalMathCSStates18 <- filter(MathCS18, State != "United Statesd") %>% drop_na()

#Totals 
TotalMathCS <- Stack(TotalMathCS15, TotalMathCS18)

#Bar Plot of proportions
ggplot(TotalMathCS) +
  geom_col(aes(x = year, y = prop, fill = sex), position = "dodge") +
  scale_fill_manual(values=c("yellow", "lightgreen"))+
  labs(y = "Proportion of Total Graduates", x ="Year", title = "Proportion of Math/CS PhDs in USA") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top")

#Calculate Difference
TotalMathCS <- TotalMathCS %>%
  select(sex, year, prop) %>%
  spread(sex, prop) %>%
  mutate(propDiff = Male - Female)

#Plot proportional differences
ggplot(TotalMathCS) +
  geom_col(aes(x = year, y = propDiff, fill = year), position = "dodge") +
  scale_fill_manual(values=c("darkblue","blue", "lightblue")) +
  labs(y = "Difference (Male - Female)", x ="Year", title = "Proportion Difference Between Male & Female Math/CS PhD Grads") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top") +
  geom_text(aes(x = year, y = propDiff + 0.003, label = round(propDiff, 4)))

#Consolidate state dataframes
TotalMathCSStates <- Stack(TotalMathCSStates15, TotalMathCSStates18)


#Box Plot of distribution of PhD's in each state
ggplot(TotalMathCSStates, aes(x = year, y = graduates, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values= c("yellow", "lightgreen")) +
  labs(y = "Graduates", x ="Year", title = "Distribution for PhDs in Math/CS in States") +
   theme(plot.title = element_text(hjust = 0.5), legend.position = "top")


Part 4 - Inference


Import Data

Data for all years from 2009 to 2018 were loaded into one large data frame. The RCurl library was used to read the CSV files from a GitHub repository, and dplyr and tidyr were used to transform and tidy the data so that each row is an observation (a state in a given year) and each column holds the number of males or females who earned a PhD in a given category. Column names were also changed to reflect the variables of interest (major and gender).

This was done in two sections because the introduction of Math and CS as a separate discipline changed the column layout of the source tables from 2015 onward, which affected the automated importing process. The CSV files were uploaded to GitHub and their URLs were saved in a vector. A for loop iterated through the vector; each file was cleaned and stacked into one data frame.
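
The cleaning step described above (removing "D" values and thousands separators before converting to numbers) can be sketched as a small base R helper. This is an illustration only, not the code used in the analysis: parse_count is a hypothetical name, and it turns "D" cells into NA directly where the original code filters those rows out and lets the later reshape fill the gaps with NA.

```r
# Hypothetical helper (not part of the original analysis): converts raw
# table cells such as "1,623" or "D" into numbers.
parse_count <- function(x) {
  x <- trimws(as.character(x))
  x[x %in% "D"] <- NA                      # treat the "D" marker as missing
  x <- gsub(",", "", x, fixed = TRUE)      # drop thousands separators
  suppressWarnings(as.numeric(x))          # anything non-numeric becomes NA
}

parse_count(c("1,623", "D", "20", ""))
```

Keeping the conversion in one function means the first file and the files read inside the loop are cleaned in exactly the same way.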

2009 - 2014

#Load first dataframe

#Load url from Github
data <- c(
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2009Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2010Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2011Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2012Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2013Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2014Doc.csv"
)

#Get data 
raw<- getURL(data[1])

#Select and rename columns to include sex & field, omit unnecessary rows, pivot into long format and remove "D" values
GIANTdf<- read.csv(text = raw) %>% 
  data.frame() %>%
  select(State = starts_with("TABLE"), TotalMale = X, TotalFemale = X.1, PhysciMale = X.6, PhysciFemale = X.7, EngMale = X.12, EngFemale = X.13) %>%
  slice(3:62) %>%
  pivot_longer(-State, names_to = "Cat") %>% 
  filter(value != "D") 

#Data as numeric removing commas
GIANTdf$value <- as.numeric(unlist(str_remove_all(GIANTdf$value, ',')))

#Drop NA Values (extra rows) and place data back in wide format
GIANTdf <- GIANTdf %>% 
  drop_na() %>% 
  spread(Cat,value) %>%
  mutate(year= as.factor(2009))
  
#same as above for each csv url
for(x in 2:length(data)){
  
  raw<- getURL(data[x])
   
  dataFrame <- read.csv(text = raw) %>% 
     data.frame() %>%
     select(State = starts_with("TABLE"), TotalMale = X, TotalFemale = X.1, PhysciMale = X.6, PhysciFemale = X.7, EngMale = X.12, EngFemale = X.13) %>%
     slice(3:62) %>%
     pivot_longer(-State, names_to = "Cat") %>% 
     filter(value != "D") 
  
  dataFrame$value <- as.numeric(unlist(str_remove_all(dataFrame$value, ',')))
  
  calYear <- 2009+x-1
  dataFrame <-dataFrame %>% 
    drop_na() %>% 
    spread(Cat,value)%>%
    mutate(year= as.factor(calYear))
  
  GIANTdf <- Stack(GIANTdf, dataFrame)
}

kable(GIANTdf)%>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
State EngFemale EngMale PhysciFemale PhysciMale TotalFemale TotalMale year
Alabama 20 74 31 60 299 331 2009
Alaska NA NA 6 6 16 21 2009
Arizona 34 130 51 117 505 543 2009
Arkansas 10 19 6 11 97 111 2009
California 257 825 318 871 2708 3258 2009
Colorado 24 89 54 127 390 431 2009
Connecticut 16 31 34 70 320 331 2009
Delaware 11 37 21 40 135 138 2009
District of Columbia 10 41 20 36 310 277 2009
Florida 69 263 97 226 955 1071 2009
Georgia 71 261 64 153 670 759 2009
Hawaii NA NA 9 14 92 92 2009
Idaho NA NA 5 15 50 72 2009
Illinois 85 281 131 298 1093 1229 2009
Indiana 45 208 64 171 574 757 2009
Iowa 29 95 39 61 294 377 2009
Kansas 7 42 16 48 205 228 2009
Kentucky 11 42 14 27 221 215 2009
Louisiana 16 48 22 62 212 294 2009
Maine NA NA NA NA 34 24 2009
Maryland 48 161 52 164 588 611 2009
Massachusetts 88 365 153 354 1146 1472 2009
Michigan 54 313 92 200 785 988 2009
Minnesota 12 66 30 66 510 449 2009
Mississippi NA NA 11 32 212 208 2009
Missouri 23 68 46 71 415 453 2009
Montana NA NA NA NA 59 49 2009
Nebraska 5 22 12 22 154 150 2009
Nevada NA NA NA NA 109 106 2009
New Hampshire NA NA 13 26 68 75 2009
New Jersey 42 130 55 163 465 607 2009
New Mexico 10 38 13 55 116 164 2009
New York 105 318 171 462 1883 1967 2009
North Carolina 53 153 88 160 714 708 2009
North Dakota NA 10 NA NA 54 67 2009
Ohio 56 269 87 184 874 986 2009
Oklahoma 13 40 20 41 186 233 2009
Oregon 8 41 26 65 204 249 2009
Pennsylvania 104 331 116 334 1184 1385 2009
Puerto Rico NA NA 10 13 124 76 2009
Rhode Island 6 26 30 50 117 144 2009
South Carolina 13 49 22 66 223 244 2009
South Dakota NA NA NA NA 54 47 2009
Tennessee 30 93 30 78 398 439 2009
Texas 106 437 153 371 1562 1801 2009
United Statese 1623 6006 2450 5868 23190 26338 2009
Utah 10 84 29 59 201 282 2009
Vermont NA NA NA 6 26 22 2009
Virginia 44 185 67 116 598 694 2009
Washington 25 77 56 77 433 421 2009
West Virginia NA 22 7 13 80 83 2009
Wisconsin 27 112 38 135 439 561 2009
Wyoming NA NA NA NA 29 38 2009
Alabama 18 62 26 57 265 308 2010
Alaska NA NA 5 8 20 25 2010
Arizona 33 105 52 111 430 467 2010
Arkansas 5 21 5 22 71 108 2010
California 243 755 312 829 2619 3153 2010
Colorado 30 97 45 115 372 428 2010
Connecticut 16 36 33 94 315 346 2010
Delaware NA NA 19 27 92 113 2010
District of Columbia 5 32 23 26 312 236 2010
Florida 85 296 120 233 1011 1126 2010
Georgia 61 196 64 139 612 636 2010
Hawaii NA NA 14 19 98 86 2010
Idaho NA NA NA NA 46 55 2010
Illinois 79 305 98 294 1023 1267 2010
Indiana 55 189 70 175 555 745 2010
Iowa 26 85 36 87 310 365 2010
Kansas 12 28 15 55 204 222 2010
Kentucky 5 31 16 37 210 252 2010
Louisiana 14 59 25 78 277 318 2010
Maine NA NA 5 8 24 29 2010
Maryland 52 126 65 169 610 651 2010
Massachusetts 129 325 135 343 1134 1366 2010
Michigan 68 259 90 204 754 937 2010
Minnesota 21 86 25 89 560 492 2010
Mississippi 12 36 19 36 246 200 2010
Missouri 24 77 48 85 369 378 2010
Montana NA NA 10 18 42 58 2010
Nebraska 8 26 15 27 160 159 2010
Nevada NA NA 8 26 80 111 2010
New Hampshire NA NA 15 28 66 88 2010
New Jersey 63 111 63 130 448 516 2010
New Mexico 10 32 17 48 119 158 2010
New York 101 353 189 438 1905 1955 2010
North Carolina 40 169 91 186 717 775 2010
North Dakota 5 7 6 14 76 56 2010
Ohio 70 271 94 210 856 997 2010
Oklahoma 12 53 18 41 238 235 2010
Oregon 8 37 16 69 190 222 2010
Pennsylvania 121 346 108 293 1063 1281 2010
Puerto Rico 0 0 0 0 47 25 2010
Rhode Island 7 23 23 50 137 158 2010
South Carolina 21 50 31 59 212 231 2010
South Dakota NA NA 5 5 23 30 2010
Tennessee 24 90 30 68 391 408 2010
Texas 110 449 159 396 1468 1780 2010
United Statesf 1743 5806 2458 5860 22505 25548 2010
Utah 11 64 28 63 172 268 2010
Vermont 0 5 NA NA 36 26 2010
Virginia 55 194 51 109 569 643 2010
Washington 26 82 45 93 399 425 2010
West Virginia 6 21 8 16 93 107 2010
Wisconsin 20 99 51 111 436 498 2010
Wyoming NA NA NA NA 23 29 2010
Alabama 17 84 30 53 271 304 2011
Alaska NA NA NA NA 24 22 2011
Arizona 30 107 48 110 403 450 2011
Arkansas NA NA 9 19 89 88 2011
California 247 792 322 875 2662 3174 2011
Colorado 27 113 46 133 338 430 2011
Connecticut 18 53 23 89 300 347 2011
Delaware 18 41 18 36 101 119 2011
District of Columbia 11 37 20 30 312 247 2011
Florida 69 332 110 289 1033 1187 2011
Georgia 56 224 71 142 585 697 2011
Hawaii NA NA NA NA 92 120 2011
Idaho 0 8 6 15 43 51 2011
Illinois 73 308 107 303 1031 1276 2011
Indiana 50 223 79 175 563 761 2011
Iowa 29 90 26 102 333 403 2011
Kansas 12 39 23 41 235 239 2011
Kentucky 9 30 16 36 223 219 2011
Louisiana 15 43 30 56 225 271 2011
Maine NA NA 6 14 20 33 2011
Maryland 53 146 56 161 561 621 2011
Massachusetts 111 340 126 363 1074 1454 2011
Michigan 74 251 91 192 816 885 2011
Minnesota 19 109 24 109 527 566 2011
Mississippi 11 46 20 27 208 216 2011
Missouri 14 97 31 87 346 434 2011
Montana NA NA 7 17 44 54 2011
Nebraska 8 27 19 28 153 162 2011
Nevada 6 18 15 23 100 105 2011
New Hampshire NA NA 10 33 64 87 2011
New Jersey 54 127 66 156 478 578 2011
New Mexico 6 34 11 51 121 152 2011
New York 106 377 182 459 1917 2091 2011
North Carolina 64 160 83 184 700 741 2011
North Dakota NA NA 9 9 86 53 2011
Ohio 67 251 78 230 861 991 2011
Oklahoma 8 38 12 46 186 230 2011
Oregon 13 35 26 58 189 233 2011
Pennsylvania 135 358 137 322 1220 1304 2011
Puerto Rico NA NA 20 29 153 100 2011
Rhode Island 10 31 24 51 130 173 2011
South Carolina 23 75 20 53 217 241 2011
South Dakota NA NA NA NA 26 35 2011
Tennessee 27 76 36 78 387 416 2011
Texas 128 479 147 372 1445 1746 2011
United Statesf 1778 6218 2479 6193 22751 26237 2011
Utah 17 79 23 73 193 302 2011
Vermont NA NA 0 6 20 24 2011
Virginia 66 207 74 176 628 721 2011
Washington 28 83 62 109 444 422 2011
West Virginia NA NA NA NA 101 111 2011
Wisconsin 24 121 60 111 448 519 2011
Wyoming NA NA 6 10 25 32 2011
Alabama 23 72 27 69 320 328 2012
Alaska NA NA 5 10 21 29 2012
Arizona 25 133 48 112 397 491 2012
Arkansas NA NA 10 22 84 110 2012
California 272 906 317 882 2612 3423 2012
Colorado 38 107 63 116 382 427 2012
Connecticut 23 56 27 83 333 368 2012
Delaware 18 48 16 27 97 116 2012
District of Columbia 12 42 19 40 294 273 2012
Florida 72 283 110 301 972 1180 2012
Georgia 64 238 62 141 659 715 2012
Hawaii NA NA NA NA 103 91 2012
Idaho 0 9 6 13 36 63 2012
Illinois 87 287 133 349 1087 1309 2012
Indiana 57 202 83 174 572 772 2012
Iowa 29 102 25 108 320 441 2012
Kansas 14 43 18 54 235 235 2012
Kentucky 12 47 21 43 246 273 2012
Louisiana 13 49 41 81 319 343 2012
Maine NA NA NA NA 23 36 2012
Maryland 44 168 45 160 658 624 2012
Massachusetts 125 356 147 369 1195 1463 2012
Michigan 83 314 107 203 808 989 2012
Minnesota 31 89 30 74 610 527 2012
Mississippi 6 33 12 38 231 225 2012
Missouri 30 91 34 99 394 450 2012
Montana NA NA NA NA 40 52 2012
Nebraska 5 28 12 28 133 147 2012
Nevada 6 13 9 31 110 94 2012
New Hampshire 7 13 14 20 63 78 2012
New Jersey 54 122 51 150 453 550 2012
New Mexico 6 41 15 51 149 153 2012
New York 113 358 195 481 1993 2024 2012
North Carolina 52 173 97 195 773 782 2012
North Dakota NA NA 9 17 70 67 2012
Ohio 62 270 93 192 830 965 2012
Oklahoma 11 58 16 54 202 267 2012
Oregon 8 39 22 80 206 266 2012
Pennsylvania 108 352 125 318 1144 1369 2012
Puerto Rico NA NA 20 18 140 81 2012
Rhode Island 8 29 22 51 151 174 2012
South Carolina 27 82 37 53 230 259 2012
South Dakota NA NA NA NA 30 46 2012
Tennessee 30 96 28 85 395 415 2012
Texas 149 551 183 442 1576 2062 2012
United Statesf 1883 6527 2551 6393 23562 27390 2012
Utah 19 89 20 70 178 314 2012
Vermont NA NA NA NA 24 34 2012
Virginia 63 223 60 173 659 753 2012
Washington 28 92 50 98 369 428 2012
West Virginia 7 19 5 14 100 104 2012
Wisconsin 28 111 43 121 520 559 2012
Wyoming 0 13 NA NA 16 46 2012
Alabama 28 82 31 70 299 345 2013
Alaska NA NA NA NA 28 24 2013
Arizona 42 118 49 117 437 467 2013
Arkansas 8 30 NA NA 81 141 2013
California 294 904 358 976 2800 3488 2013
Colorado 38 116 70 143 423 484 2013
Connecticut 23 57 51 95 337 384 2013
Delaware 7 37 19 32 87 99 2013
District of Columbia 13 44 13 24 316 269 2013
Florida 67 300 130 298 985 1182 2013
Georgia 60 247 60 154 632 726 2013
Hawaii NA NA 12 14 123 107 2013
Idaho NA 17 11 19 60 80 2013
Illinois 89 303 110 332 1142 1398 2013
Indiana 58 225 76 207 602 826 2013
Iowa 27 94 37 104 357 423 2013
Kansas 10 42 27 56 224 287 2013
Kentucky 14 37 12 32 204 263 2013
Louisiana 17 78 33 84 264 388 2013
Maine NA NA NA NA 27 22 2013
Maryland 64 162 57 162 714 682 2013
Massachusetts 124 394 143 410 1171 1586 2013
Michigan 76 283 106 199 839 998 2013
Minnesota 32 102 38 101 646 578 2013
Mississippi 8 29 26 34 248 209 2013
Missouri 28 104 41 83 397 464 2013
Montana NA 7 7 13 54 46 2013
Nebraska 15 36 14 31 177 186 2013
Nevada NA NA 7 30 89 122 2013
New Hampshire 10 20 11 29 86 78 2013
New Jersey 51 115 59 156 465 570 2013
New Mexico 9 45 15 50 148 177 2013
New York 127 397 186 496 2092 2113 2013
North Carolina 70 209 92 190 816 865 2013
North Dakota NA 21 NA NA 62 79 2013
Ohio 79 281 78 220 863 976 2013
Oklahoma 16 77 16 44 212 277 2013
Oregon 18 54 30 58 221 238 2013
Pennsylvania 113 373 123 313 1135 1390 2013
Puerto Rico NA NA 14 12 102 47 2013
Rhode Island 8 24 NA NA 137 171 2013
South Carolina 23 83 22 46 235 256 2013
South Dakota NA NA 5 9 32 44 2013
Tennessee 35 95 37 73 416 413 2013
Texas 159 605 183 438 1615 2010 2013
United Statesf 2051 6910 2698 6589 24396 28353 2013
Utah 18 93 42 87 194 326 2013
Vermont NA NA NA NA 41 32 2013
Virginia 67 249 74 177 709 855 2013
Washington 45 118 49 99 450 480 2013
West Virginia NA NA 13 11 94 103 2013
Wisconsin 27 105 65 130 486 535 2013
Wyoming NA 7 NA NA 22 44 2013
Alabama 26 88 28 68 328 342 2014
Alaska NA NA 7 9 32 17 2014
Arizona 41 113 43 110 411 478 2014
Arkansas 5 14 16 27 94 114 2014
California 253 859 369 967 2717 3454 2014
Colorado 39 144 64 170 409 529 2014
Connecticut 28 62 37 94 339 393 2014
Delaware 25 38 18 22 95 100 2014
District of Columbia 7 42 24 33 320 275 2014
Florida 98 311 136 306 1056 1218 2014
Georgia 74 289 73 181 643 807 2014
Hawaii NA NA 11 17 108 88 2014
Idaho NA NA 9 18 47 86 2014
Illinois 98 313 127 343 1061 1331 2014
Indiana 61 287 66 223 581 868 2014
Iowa 33 96 49 90 329 400 2014
Kansas 21 49 22 44 228 256 2014
Kentucky 13 69 19 42 244 279 2014
Louisiana 18 62 39 84 280 327 2014
Maine NA NA NA NA 40 36 2014
Maryland 55 186 63 163 632 650 2014
Massachusetts 143 425 153 381 1253 1563 2014
Michigan 85 355 99 231 885 1071 2014
Minnesota 21 108 27 100 710 631 2014
Mississippi 9 44 NA NA 202 213 2014
Missouri 26 95 41 107 413 475 2014
Montana NA NA 8 13 49 58 2014
Nebraska 10 29 15 34 199 166 2014
Nevada NA NA 11 17 91 107 2014
New Hampshire 7 14 15 36 84 95 2014
New Jersey 50 132 62 191 516 637 2014
New Mexico 16 49 20 46 160 177 2014
New York 135 432 216 517 2094 2218 2014
North Carolina 69 230 104 213 839 852 2014
North Dakota NA NA NA NA 66 93 2014
Ohio 73 294 114 243 872 1054 2014
Oklahoma 19 64 29 59 238 278 2014
Oregon 10 54 25 66 219 220 2014
Pennsylvania 125 324 119 370 1184 1410 2014
Puerto Rico NA 5 11 17 127 61 2014
Rhode Island NA NA 37 57 184 149 2014
South Carolina 35 109 30 56 235 305 2014
South Dakota 5 14 5 17 41 59 2014
Tennessee 28 146 24 79 395 496 2014
Texas 188 652 198 495 1778 2180 2014
United Statesf 2179 7349 2820 7004 24857 29049 2014
Utah 18 76 20 85 187 311 2014
Vermont NA NA NA NA 38 35 2014
Virginia 79 260 70 164 697 860 2014
Washington 47 116 59 139 428 502 2014
West Virginia NA NA 5 25 95 97 2014
Wisconsin 39 129 57 128 544 566 2014
Wyoming 5 24 NA NA 40 62 2014
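
The loops in this part label each file's year by adding the loop index to a starting year, which relies on the URLs being listed in consecutive order. As an alternative sketch, the year can be read from the file name itself. year_from_url is a hypothetical helper, and it assumes every file follows the "<year>Doc.csv" naming pattern seen in the URLs above.

```r
# Hypothetical helper: pull the four-digit year out of a file name
# such as ".../2009Doc.csv" instead of computing it from the loop index.
year_from_url <- function(url) {
  as.integer(sub("Doc\\.csv$", "", basename(url)))
}

urls <- c(
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2009Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2014Doc.csv"
)
year_from_url(urls)
```

With this approach a skipped or reordered file would still receive the correct year label.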


2015 - 2018

# Load data from Github

#Load url
data <- c(
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2015Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2016Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2017Doc.csv",
  "https://raw.githubusercontent.com/MsQCompSci/606Project/master/2018Doc.csv"
)

#same as above for each csv url
for(x in 1:length(data)){
  
  raw<- getURL(data[x])
  
  dataFrame <- read.csv(text = raw) %>% 
     data.frame() %>%
    select(State = contains("Table"), TotalMale = X, TotalFemale = X.1, PhysciMale = X.4, PhysciFemale = X.5, MathMale = X.6,MathFemale = X.7, EngMale = X.10, EngFemale = X.11) %>%
    slice(5:57) %>%
    pivot_longer(-State, names_to = "Cat") %>% 
    filter(value != "D")
  
  dataFrame$value <- as.numeric(unlist(str_remove_all(dataFrame$value, ',')))
  
  calYear <- 2015+x-1
  
  dataFrame <-dataFrame %>% 
    drop_na() %>% 
    spread(Cat,value)%>%
    mutate(year= as.factor(calYear))
  
  GIANTdf <- Stack(GIANTdf, dataFrame)
}

#Display Giant Table
kable(GIANTdf)%>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
State EngFemale EngMale PhysciFemale PhysciMale TotalFemale TotalMale year MathFemale MathMale
Alabama 20 74 31 60 299 331 2009 NA NA
Alaska NA NA 6 6 16 21 2009 NA NA
Arizona 34 130 51 117 505 543 2009 NA NA
Arkansas 10 19 6 11 97 111 2009 NA NA
California 257 825 318 871 2708 3258 2009 NA NA
Colorado 24 89 54 127 390 431 2009 NA NA
Connecticut 16 31 34 70 320 331 2009 NA NA
Delaware 11 37 21 40 135 138 2009 NA NA
District of Columbia 10 41 20 36 310 277 2009 NA NA
Florida 69 263 97 226 955 1071 2009 NA NA
Georgia 71 261 64 153 670 759 2009 NA NA
Hawaii NA NA 9 14 92 92 2009 NA NA
Idaho NA NA 5 15 50 72 2009 NA NA
Illinois 85 281 131 298 1093 1229 2009 NA NA
Indiana 45 208 64 171 574 757 2009 NA NA
Iowa 29 95 39 61 294 377 2009 NA NA
Kansas 7 42 16 48 205 228 2009 NA NA
Kentucky 11 42 14 27 221 215 2009 NA NA
Louisiana 16 48 22 62 212 294 2009 NA NA
Maine NA NA NA NA 34 24 2009 NA NA
Maryland 48 161 52 164 588 611 2009 NA NA
Massachusetts 88 365 153 354 1146 1472 2009 NA NA
Michigan 54 313 92 200 785 988 2009 NA NA
Minnesota 12 66 30 66 510 449 2009 NA NA
Mississippi NA NA 11 32 212 208 2009 NA NA
Missouri 23 68 46 71 415 453 2009 NA NA
Montana NA NA NA NA 59 49 2009 NA NA
Nebraska 5 22 12 22 154 150 2009 NA NA
Nevada NA NA NA NA 109 106 2009 NA NA
New Hampshire NA NA 13 26 68 75 2009 NA NA
New Jersey 42 130 55 163 465 607 2009 NA NA
New Mexico 10 38 13 55 116 164 2009 NA NA
New York 105 318 171 462 1883 1967 2009 NA NA
North Carolina 53 153 88 160 714 708 2009 NA NA
North Dakota NA 10 NA NA 54 67 2009 NA NA
Ohio 56 269 87 184 874 986 2009 NA NA
Oklahoma 13 40 20 41 186 233 2009 NA NA
Oregon 8 41 26 65 204 249 2009 NA NA
Pennsylvania 104 331 116 334 1184 1385 2009 NA NA
Puerto Rico NA NA 10 13 124 76 2009 NA NA
Rhode Island 6 26 30 50 117 144 2009 NA NA
South Carolina 13 49 22 66 223 244 2009 NA NA
South Dakota NA NA NA NA 54 47 2009 NA NA
Tennessee 30 93 30 78 398 439 2009 NA NA
Texas 106 437 153 371 1562 1801 2009 NA NA
United Statese 1623 6006 2450 5868 23190 26338 2009 NA NA
Utah 10 84 29 59 201 282 2009 NA NA
Vermont NA NA NA 6 26 22 2009 NA NA
Virginia 44 185 67 116 598 694 2009 NA NA
Washington 25 77 56 77 433 421 2009 NA NA
West Virginia NA 22 7 13 80 83 2009 NA NA
Wisconsin 27 112 38 135 439 561 2009 NA NA
Wyoming NA NA NA NA 29 38 2009 NA NA
Alabama 18 62 26 57 265 308 2010 NA NA
Alaska NA NA 5 8 20 25 2010 NA NA
Arizona 33 105 52 111 430 467 2010 NA NA
Arkansas 5 21 5 22 71 108 2010 NA NA
California 243 755 312 829 2619 3153 2010 NA NA
Colorado 30 97 45 115 372 428 2010 NA NA
Connecticut 16 36 33 94 315 346 2010 NA NA
Delaware NA NA 19 27 92 113 2010 NA NA
District of Columbia 5 32 23 26 312 236 2010 NA NA
Florida 85 296 120 233 1011 1126 2010 NA NA
Georgia 61 196 64 139 612 636 2010 NA NA
Hawaii NA NA 14 19 98 86 2010 NA NA
Idaho NA NA NA NA 46 55 2010 NA NA
Illinois 79 305 98 294 1023 1267 2010 NA NA
Indiana 55 189 70 175 555 745 2010 NA NA
Iowa 26 85 36 87 310 365 2010 NA NA
Kansas 12 28 15 55 204 222 2010 NA NA
Kentucky 5 31 16 37 210 252 2010 NA NA
Louisiana 14 59 25 78 277 318 2010 NA NA
Maine NA NA 5 8 24 29 2010 NA NA
Maryland 52 126 65 169 610 651 2010 NA NA
Massachusetts 129 325 135 343 1134 1366 2010 NA NA
Michigan 68 259 90 204 754 937 2010 NA NA
Minnesota 21 86 25 89 560 492 2010 NA NA
Mississippi 12 36 19 36 246 200 2010 NA NA
Missouri 24 77 48 85 369 378 2010 NA NA
Montana NA NA 10 18 42 58 2010 NA NA
Nebraska 8 26 15 27 160 159 2010 NA NA
Nevada NA NA 8 26 80 111 2010 NA NA
New Hampshire NA NA 15 28 66 88 2010 NA NA
New Jersey 63 111 63 130 448 516 2010 NA NA
New Mexico 10 32 17 48 119 158 2010 NA NA
New York 101 353 189 438 1905 1955 2010 NA NA
North Carolina 40 169 91 186 717 775 2010 NA NA
North Dakota 5 7 6 14 76 56 2010 NA NA
Ohio 70 271 94 210 856 997 2010 NA NA
Oklahoma 12 53 18 41 238 235 2010 NA NA
Oregon 8 37 16 69 190 222 2010 NA NA
Pennsylvania 121 346 108 293 1063 1281 2010 NA NA
Puerto Rico 0 0 0 0 47 25 2010 NA NA
Rhode Island 7 23 23 50 137 158 2010 NA NA
South Carolina 21 50 31 59 212 231 2010 NA NA
South Dakota NA NA 5 5 23 30 2010 NA NA
Tennessee 24 90 30 68 391 408 2010 NA NA
Texas 110 449 159 396 1468 1780 2010 NA NA
United Statesf 1743 5806 2458 5860 22505 25548 2010 NA NA
Utah 11 64 28 63 172 268 2010 NA NA
Vermont 0 5 NA NA 36 26 2010 NA NA
Virginia 55 194 51 109 569 643 2010 NA NA
Washington 26 82 45 93 399 425 2010 NA NA
West Virginia 6 21 8 16 93 107 2010 NA NA
Wisconsin 20 99 51 111 436 498 2010 NA NA
Wyoming NA NA NA NA 23 29 2010 NA NA
Alabama 17 84 30 53 271 304 2011 NA NA
Alaska NA NA NA NA 24 22 2011 NA NA
Arizona 30 107 48 110 403 450 2011 NA NA
Arkansas NA NA 9 19 89 88 2011 NA NA
California 247 792 322 875 2662 3174 2011 NA NA
Colorado 27 113 46 133 338 430 2011 NA NA
Connecticut 18 53 23 89 300 347 2011 NA NA
Delaware 18 41 18 36 101 119 2011 NA NA
District of Columbia 11 37 20 30 312 247 2011 NA NA
Florida 69 332 110 289 1033 1187 2011 NA NA
Georgia 56 224 71 142 585 697 2011 NA NA
Hawaii NA NA NA NA 92 120 2011 NA NA
Idaho 0 8 6 15 43 51 2011 NA NA
Illinois 73 308 107 303 1031 1276 2011 NA NA
Indiana 50 223 79 175 563 761 2011 NA NA
Iowa 29 90 26 102 333 403 2011 NA NA
Kansas 12 39 23 41 235 239 2011 NA NA
Kentucky 9 30 16 36 223 219 2011 NA NA
Louisiana 15 43 30 56 225 271 2011 NA NA
Maine NA NA 6 14 20 33 2011 NA NA
Maryland 53 146 56 161 561 621 2011 NA NA
Massachusetts 111 340 126 363 1074 1454 2011 NA NA
Michigan 74 251 91 192 816 885 2011 NA NA
Minnesota 19 109 24 109 527 566 2011 NA NA
Mississippi 11 46 20 27 208 216 2011 NA NA
Missouri 14 97 31 87 346 434 2011 NA NA
Montana NA NA 7 17 44 54 2011 NA NA
Nebraska 8 27 19 28 153 162 2011 NA NA
Nevada 6 18 15 23 100 105 2011 NA NA
New Hampshire NA NA 10 33 64 87 2011 NA NA
New Jersey 54 127 66 156 478 578 2011 NA NA
New Mexico 6 34 11 51 121 152 2011 NA NA
New York 106 377 182 459 1917 2091 2011 NA NA
North Carolina 64 160 83 184 700 741 2011 NA NA
North Dakota NA NA 9 9 86 53 2011 NA NA
Ohio 67 251 78 230 861 991 2011 NA NA
Oklahoma 8 38 12 46 186 230 2011 NA NA
Oregon 13 35 26 58 189 233 2011 NA NA
Pennsylvania 135 358 137 322 1220 1304 2011 NA NA
Puerto Rico NA NA 20 29 153 100 2011 NA NA
Rhode Island 10 31 24 51 130 173 2011 NA NA
South Carolina 23 75 20 53 217 241 2011 NA NA
South Dakota NA NA NA NA 26 35 2011 NA NA
Tennessee 27 76 36 78 387 416 2011 NA NA
Texas 128 479 147 372 1445 1746 2011 NA NA
United Statesf 1778 6218 2479 6193 22751 26237 2011 NA NA
Utah 17 79 23 73 193 302 2011 NA NA
Vermont NA NA 0 6 20 24 2011 NA NA
Virginia 66 207 74 176 628 721 2011 NA NA
Washington 28 83 62 109 444 422 2011 NA NA
West Virginia NA NA NA NA 101 111 2011 NA NA
Wisconsin 24 121 60 111 448 519 2011 NA NA
Wyoming NA NA 6 10 25 32 2011 NA NA
Alabama 23 72 27 69 320 328 2012 NA NA
Alaska NA NA 5 10 21 29 2012 NA NA
Arizona 25 133 48 112 397 491 2012 NA NA
Arkansas NA NA 10 22 84 110 2012 NA NA
California 272 906 317 882 2612 3423 2012 NA NA
Colorado 38 107 63 116 382 427 2012 NA NA
Connecticut 23 56 27 83 333 368 2012 NA NA
Delaware 18 48 16 27 97 116 2012 NA NA
District of Columbia 12 42 19 40 294 273 2012 NA NA
Florida 72 283 110 301 972 1180 2012 NA NA
Georgia 64 238 62 141 659 715 2012 NA NA
Hawaii NA NA NA NA 103 91 2012 NA NA
Idaho 0 9 6 13 36 63 2012 NA NA
Illinois 87 287 133 349 1087 1309 2012 NA NA
Indiana 57 202 83 174 572 772 2012 NA NA
Iowa 29 102 25 108 320 441 2012 NA NA
Kansas 14 43 18 54 235 235 2012 NA NA
Kentucky 12 47 21 43 246 273 2012 NA NA
Louisiana 13 49 41 81 319 343 2012 NA NA
Maine NA NA NA NA 23 36 2012 NA NA
Maryland 44 168 45 160 658 624 2012 NA NA
Massachusetts 125 356 147 369 1195 1463 2012 NA NA
Michigan 83 314 107 203 808 989 2012 NA NA
Minnesota 31 89 30 74 610 527 2012 NA NA
Mississippi 6 33 12 38 231 225 2012 NA NA
Missouri 30 91 34 99 394 450 2012 NA NA
Montana NA NA NA NA 40 52 2012 NA NA
Nebraska 5 28 12 28 133 147 2012 NA NA
Nevada 6 13 9 31 110 94 2012 NA NA
New Hampshire 7 13 14 20 63 78 2012 NA NA
New Jersey 54 122 51 150 453 550 2012 NA NA
New Mexico 6 41 15 51 149 153 2012 NA NA
New York 113 358 195 481 1993 2024 2012 NA NA
North Carolina 52 173 97 195 773 782 2012 NA NA
North Dakota NA NA 9 17 70 67 2012 NA NA
Ohio 62 270 93 192 830 965 2012 NA NA
Oklahoma 11 58 16 54 202 267 2012 NA NA
Oregon 8 39 22 80 206 266 2012 NA NA
Pennsylvania 108 352 125 318 1144 1369 2012 NA NA
Puerto Rico NA NA 20 18 140 81 2012 NA NA
Rhode Island 8 29 22 51 151 174 2012 NA NA
South Carolina 27 82 37 53 230 259 2012 NA NA
South Dakota NA NA NA NA 30 46 2012 NA NA
Tennessee 30 96 28 85 395 415 2012 NA NA
Texas 149 551 183 442 1576 2062 2012 NA NA
United Statesf 1883 6527 2551 6393 23562 27390 2012 NA NA
Utah 19 89 20 70 178 314 2012 NA NA
Vermont NA NA NA NA 24 34 2012 NA NA
Virginia 63 223 60 173 659 753 2012 NA NA
Washington 28 92 50 98 369 428 2012 NA NA
West Virginia 7 19 5 14 100 104 2012 NA NA
Wisconsin 28 111 43 121 520 559 2012 NA NA
Wyoming 0 13 NA NA 16 46 2012 NA NA
Alabama 28 82 31 70 299 345 2013 NA NA
Alaska NA NA NA NA 28 24 2013 NA NA
Arizona 42 118 49 117 437 467 2013 NA NA
Arkansas 8 30 NA NA 81 141 2013 NA NA
California 294 904 358 976 2800 3488 2013 NA NA
Colorado 38 116 70 143 423 484 2013 NA NA
Connecticut 23 57 51 95 337 384 2013 NA NA
Delaware 7 37 19 32 87 99 2013 NA NA
District of Columbia 13 44 13 24 316 269 2013 NA NA
Florida 67 300 130 298 985 1182 2013 NA NA
Georgia 60 247 60 154 632 726 2013 NA NA
Hawaii NA NA 12 14 123 107 2013 NA NA
Idaho NA 17 11 19 60 80 2013 NA NA
Illinois 89 303 110 332 1142 1398 2013 NA NA
Indiana 58 225 76 207 602 826 2013 NA NA
Iowa 27 94 37 104 357 423 2013 NA NA
Kansas 10 42 27 56 224 287 2013 NA NA
Kentucky 14 37 12 32 204 263 2013 NA NA
Louisiana 17 78 33 84 264 388 2013 NA NA
Maine NA NA NA NA 27 22 2013 NA NA
Maryland 64 162 57 162 714 682 2013 NA NA
Massachusetts 124 394 143 410 1171 1586 2013 NA NA
Michigan 76 283 106 199 839 998 2013 NA NA
Minnesota 32 102 38 101 646 578 2013 NA NA
Mississippi 8 29 26 34 248 209 2013 NA NA
Missouri 28 104 41 83 397 464 2013 NA NA
Montana NA 7 7 13 54 46 2013 NA NA
Nebraska 15 36 14 31 177 186 2013 NA NA
Nevada NA NA 7 30 89 122 2013 NA NA
New Hampshire 10 20 11 29 86 78 2013 NA NA
New Jersey 51 115 59 156 465 570 2013 NA NA
New Mexico 9 45 15 50 148 177 2013 NA NA
New York 127 397 186 496 2092 2113 2013 NA NA
North Carolina 70 209 92 190 816 865 2013 NA NA
North Dakota NA 21 NA NA 62 79 2013 NA NA
Ohio 79 281 78 220 863 976 2013 NA NA
Oklahoma 16 77 16 44 212 277 2013 NA NA
Oregon 18 54 30 58 221 238 2013 NA NA
Pennsylvania 113 373 123 313 1135 1390 2013 NA NA
Puerto Rico NA NA 14 12 102 47 2013 NA NA
Rhode Island 8 24 NA NA 137 171 2013 NA NA
South Carolina 23 83 22 46 235 256 2013 NA NA
South Dakota NA NA 5 9 32 44 2013 NA NA
Tennessee 35 95 37 73 416 413 2013 NA NA
Texas 159 605 183 438 1615 2010 2013 NA NA
United Statesf 2051 6910 2698 6589 24396 28353 2013 NA NA
Utah 18 93 42 87 194 326 2013 NA NA
Vermont NA NA NA NA 41 32 2013 NA NA
Virginia 67 249 74 177 709 855 2013 NA NA
Washington 45 118 49 99 450 480 2013 NA NA
West Virginia NA NA 13 11 94 103 2013 NA NA
Wisconsin 27 105 65 130 486 535 2013 NA NA
Wyoming NA 7 NA NA 22 44 2013 NA NA
Alabama 26 88 28 68 328 342 2014 NA NA
Alaska NA NA 7 9 32 17 2014 NA NA
Arizona 41 113 43 110 411 478 2014 NA NA
Arkansas 5 14 16 27 94 114 2014 NA NA
California 253 859 369 967 2717 3454 2014 NA NA
Colorado 39 144 64 170 409 529 2014 NA NA
Connecticut 28 62 37 94 339 393 2014 NA NA
Delaware 25 38 18 22 95 100 2014 NA NA
District of Columbia 7 42 24 33 320 275 2014 NA NA
Florida 98 311 136 306 1056 1218 2014 NA NA
Georgia 74 289 73 181 643 807 2014 NA NA
Hawaii NA NA 11 17 108 88 2014 NA NA
Idaho NA NA 9 18 47 86 2014 NA NA
Illinois 98 313 127 343 1061 1331 2014 NA NA
Indiana 61 287 66 223 581 868 2014 NA NA
Iowa 33 96 49 90 329 400 2014 NA NA
Kansas 21 49 22 44 228 256 2014 NA NA
Kentucky 13 69 19 42 244 279 2014 NA NA
Louisiana 18 62 39 84 280 327 2014 NA NA
Maine NA NA NA NA 40 36 2014 NA NA
Maryland 55 186 63 163 632 650 2014 NA NA
Massachusetts 143 425 153 381 1253 1563 2014 NA NA
Michigan 85 355 99 231 885 1071 2014 NA NA
Minnesota 21 108 27 100 710 631 2014 NA NA
Mississippi 9 44 NA NA 202 213 2014 NA NA
Missouri 26 95 41 107 413 475 2014 NA NA
Montana NA NA 8 13 49 58 2014 NA NA
Nebraska 10 29 15 34 199 166 2014 NA NA
Nevada NA NA 11 17 91 107 2014 NA NA
New Hampshire 7 14 15 36 84 95 2014 NA NA
New Jersey 50 132 62 191 516 637 2014 NA NA
New Mexico 16 49 20 46 160 177 2014 NA NA
New York 135 432 216 517 2094 2218 2014 NA NA
North Carolina 69 230 104 213 839 852 2014 NA NA
North Dakota NA NA NA NA 66 93 2014 NA NA
Ohio 73 294 114 243 872 1054 2014 NA NA
Oklahoma 19 64 29 59 238 278 2014 NA NA
Oregon 10 54 25 66 219 220 2014 NA NA
Pennsylvania 125 324 119 370 1184 1410 2014 NA NA
Puerto Rico NA 5 11 17 127 61 2014 NA NA
Rhode Island NA NA 37 57 184 149 2014 NA NA
South Carolina 35 109 30 56 235 305 2014 NA NA
South Dakota 5 14 5 17 41 59 2014 NA NA
Tennessee 28 146 24 79 395 496 2014 NA NA
Texas 188 652 198 495 1778 2180 2014 NA NA
United Statesf 2179 7349 2820 7004 24857 29049 2014 NA NA
Utah 18 76 20 85 187 311 2014 NA NA
Vermont NA NA NA NA 38 35 2014 NA NA
Virginia 79 260 70 164 697 860 2014 NA NA
Washington 47 116 59 139 428 502 2014 NA NA
West Virginia NA NA 5 25 95 97 2014 NA NA
Wisconsin 39 129 57 128 544 566 2014 NA NA
Wyoming 5 24 NA NA 40 62 2014 NA NA
Alabama 28 105 15 23 309 386 2015 9 37
Alaska NA NA NA NA 14 27 2015 NA NA
Arizona 29 120 45 93 449 541 2015 12 51
Arkansas 12 46 NA NA 95 130 2015 NA NA
California 265 893 245 510 2686 3386 2015 83 384
Colorado 61 164 63 104 439 567 2015 19 45
Connecticut 28 63 30 67 387 392 2015 13 36
Delaware 16 43 14 24 100 127 2015 9 17
District of Columbia 18 49 7 23 311 265 2015 NA NA
Florida 83 303 84 189 1127 1237 2015 33 132
Georgia 90 269 43 69 696 785 2015 37 89
Hawaii NA NA 6 20 118 122 2015 NA NA
Idaho NA NA 6 12 47 68 2015 NA NA
Illinois 104 360 82 170 1066 1413 2015 36 134
Indiana 80 264 65 106 711 870 2015 44 94
Iowa 17 96 33 48 293 393 2015 12 34
Kansas 19 51 23 32 275 301 2015 7 23
Kentucky 17 62 6 21 199 304 2015 7 26
Louisiana 22 63 24 39 299 333 2015 6 35
Maine NA NA 6 7 39 33 2015 NA NA
Maryland 59 179 39 98 660 745 2015 32 87
Massachusetts 142 367 133 238 1266 1570 2015 42 158
Michigan 82 346 92 134 889 1103 2015 32 100
Minnesota 30 103 17 51 714 594 2015 12 44
Mississippi 11 33 NA NA 232 214 2015 10 13
Missouri 37 118 33 65 464 522 2015 15 47
Montana NA NA 7 21 58 70 2015 NA NA
Nebraska 7 39 11 18 176 196 2015 7 10
Nevada NA NA 13 19 102 109 2015 NA NA
New Hampshire 6 18 9 17 80 89 2015 7 7
New Jersey 54 158 30 114 468 660 2015 17 91
New Mexico 20 51 17 41 169 176 2015 5 18
New York 138 437 138 289 1996 2090 2015 84 204
North Carolina 83 212 58 89 846 857 2015 37 110
North Dakota 5 29 6 13 88 87 2015 NA NA
Ohio 93 323 68 126 924 1069 2015 26 84
Oklahoma 21 71 13 39 219 296 2015 6 22
Oregon 15 48 25 50 237 252 2015 9 21
Pennsylvania 154 464 81 171 1186 1442 2015 72 143
Puerto Rico 5 8 NA NA 129 65 2015 NA NA
Rhode Island 12 23 NA NA 160 160 2015 9 27
South Carolina 27 117 13 38 266 320 2015 15 29
South Dakota NA NA NA NA 46 64 2015 0 7
Tennessee 34 118 23 46 436 464 2015 9 26
Texas 189 636 136 281 1842 2224 2015 77 192
United Statesd 2301 7596 1988 3935 25403 29596 2015 943 2880
Utah 26 113 16 49 211 365 2015 7 34
Vermont NA NA NA NA 34 42 2015 NA NA
Virginia 52 253 54 91 709 826 2015 23 92
Washington 41 112 37 69 456 502 2015 23 61
West Virginia 5 32 NA NA 101 116 2015 NA NA
Wisconsin 43 135 47 88 546 575 2015 17 50
Wyoming NA NA NA NA 33 52 2015 NA NA
Alabama 27 104 17 36 334 387 2016 12 34
Alaska NA NA 5 9 24 24 2016 NA NA
Arizona 38 133 31 89 407 482 2016 11 36
Arkansas NA NA 8 11 124 126 2016 5 10
California 269 877 253 563 2741 3376 2016 103 372
Colorado 50 183 37 130 439 631 2016 17 56
Connecticut 20 64 27 64 325 434 2016 13 37
Delaware 20 64 17 32 114 166 2016 NA NA
District of Columbia 17 55 14 18 317 292 2016 5 26
Florida 89 286 97 200 1072 1215 2016 58 122
Georgia 65 268 45 96 669 793 2016 29 87
Hawaii NA NA 11 16 110 91 2016 NA NA
Idaho NA NA 6 12 44 65 2016 NA NA
Illinois 98 286 83 201 1089 1337 2016 30 127
Indiana 63 265 53 126 641 877 2016 24 104
Iowa 31 96 25 37 315 397 2016 18 33
Kansas NA NA 18 22 252 262 2016 NA NA
Kentucky 18 41 7 15 219 261 2016 7 26
Louisiana 15 71 28 42 293 365 2016 16 28
Maine NA NA NA NA 37 38 2016 NA NA
Maryland 49 177 47 89 597 677 2016 29 103
Massachusetts 149 426 120 287 1257 1640 2016 46 148
Michigan 69 288 76 146 862 1045 2016 33 102
Minnesota 39 106 23 56 859 606 2016 15 40
Mississippi 7 23 15 23 240 208 2016 5 9
Missouri 32 112 29 65 419 498 2016 16 48
Montana NA NA 6 16 52 66 2016 NA NA
Nebraska 12 38 6 19 197 191 2016 10 15
Nevada 14 22 10 22 120 109 2016 0 0
New Hampshire 5 17 11 24 69 94 2016 NA NA
New Jersey 54 105 49 100 480 584 2016 28 81
New Mexico NA NA 13 30 133 172 2016 5 18
New York 127 397 134 312 2030 2165 2016 79 259
North Carolina 79 221 56 110 841 967 2016 48 121
North Dakota 5 22 6 12 92 93 2016 8 15
Ohio 85 322 62 195 895 1148 2016 20 106
Oklahoma 22 78 19 29 239 299 2016 12 10
Oregon 10 45 26 50 208 247 2016 NA NA
Pennsylvania 137 444 91 169 1241 1480 2016 70 175
Puerto Rico 6 5 NA NA 142 74 2016 0 5
Rhode Island 13 24 19 38 157 168 2016 8 17
South Carolina 31 91 23 34 272 272 2016 8 26
South Dakota NA NA 8 12 49 64 2016 NA NA
Tennessee 37 113 30 60 421 503 2016 6 38
Texas 196 623 139 320 1793 2181 2016 73 216
United Statesd 2192 7277 1963 4286 25278 29616 2016 959 2998
Utah 17 102 13 46 201 336 2016 5 27
Vermont 0 8 NA NA 36 39 2016 0 0
Virginia 62 219 51 82 723 802 2016 31 92
Washington 39 114 40 78 461 466 2016 17 40
West Virginia 5 35 9 19 112 131 2016 NA NA
Wisconsin 27 122 37 86 482 623 2016 18 68
Wyoming NA NA NA NA 32 49 2016 NA NA
Alabama 31 100 21 38 342 365 2017 13 35
Alaska NA NA 11 9 33 19 2017 0 0
Arizona 32 109 32 78 381 420 2017 13 38
Arkansas NA NA 11 6 104 98 2017 NA NA
California 334 817 289 551 2817 3283 2017 124 371
Colorado 45 161 41 114 462 543 2017 16 51
Connecticut 28 69 38 66 351 397 2017 16 44
Delaware 23 46 7 17 111 127 2017 NA NA
District of Columbia 17 61 19 18 328 295 2017 7 17
Florida 86 333 90 169 1078 1258 2017 36 109
Georgia 77 283 48 81 697 795 2017 22 77
Hawaii NA NA 14 15 111 78 2017 NA NA
Idaho NA NA NA NA 46 57 2017 NA NA
Illinois 119 314 68 199 1173 1357 2017 54 139
Indiana 55 274 67 128 641 928 2017 30 88
Iowa 37 115 22 57 305 410 2017 22 50
Kansas 18 56 15 29 249 281 2017 5 18
Kentucky 10 38 10 12 241 256 2017 11 24
Louisiana 11 63 25 43 265 342 2017 9 39
Maine NA NA NA NA 27 29 2017 NA NA
Maryland 63 162 46 95 650 645 2017 37 81
Massachusetts 181 416 120 252 1313 1565 2017 51 126
Michigan 98 296 71 135 850 1056 2017 29 89
Minnesota 25 103 25 41 775 599 2017 10 43
Mississippi 12 39 18 30 230 229 2017 8 12
Missouri 44 127 23 84 470 551 2017 14 41
Montana NA NA 7 8 63 56 2017 NA NA
Nebraska 11 33 14 18 177 186 2017 8 19
Nevada 6 28 6 14 90 110 2017 NA NA
New Hampshire NA NA 14 18 80 78 2017 NA NA
New Jersey 53 141 43 102 487 628 2017 20 80
New Mexico 14 60 10 34 120 179 2017 NA NA
New York 158 409 138 273 1972 2092 2017 59 243
North Carolina 87 229 67 127 883 949 2017 52 103
North Dakota 7 23 NA NA 100 81 2017 NA NA
Ohio 85 325 72 188 900 1128 2017 27 86
Oklahoma 18 52 17 29 232 292 2017 NA NA
Oregon 15 56 26 69 250 322 2017 8 44
Pennsylvania 138 407 85 163 1218 1408 2017 58 148
Puerto Rico NA NA NA NA 54 29 2017 NA NA
Rhode Island 8 31 16 31 153 169 2017 NA NA
South Carolina 49 92 27 32 240 266 2017 8 24
South Dakota NA NA NA NA 34 76 2017 NA NA
Tennessee 48 137 28 59 508 524 2017 10 42
Texas 217 660 142 282 1879 2186 2017 87 209
United Statesd 2448 7389 2011 4068 25495 29146 2017 976 2866
Utah 17 82 14 40 193 315 2017 7 42
Vermont NA NA NA NA 32 29 2017 NA NA
Virginia 69 222 49 96 714 799 2017 30 73
Washington 44 127 38 72 444 470 2017 21 47
West Virginia 6 26 NA NA 87 96 2017 NA NA
Wisconsin 29 146 36 85 496 633 2017 13 62
Wyoming NA NA 7 11 39 62 2017 NA NA
Alabama 25 103 11 31 329 338 2018 12 23
Alaska NA NA 8 7 27 29 2018 0 0
Arizona 43 100 30 65 364 399 2018 10 42
Arkansas 10 44 9 14 104 162 2018 NA NA
California 300 889 286 597 2647 3427 2018 108 377
Colorado 60 160 59 122 478 574 2018 17 45
Connecticut 33 79 37 79 363 422 2018 10 40
Delaware 24 59 21 27 98 140 2018 NA NA
District of Columbia 14 52 20 18 296 292 2018 5 28
Florida 87 327 96 180 1094 1252 2018 59 142
Georgia 68 268 51 91 684 827 2018 23 97
Hawaii NA NA 12 15 104 96 2018 NA NA
Idaho 6 19 NA NA 41 56 2018 NA NA
Illinois 97 350 98 199 1107 1408 2018 70 156
Indiana 78 255 64 116 708 923 2018 34 108
Iowa 31 121 21 54 334 409 2018 16 44
Kansas 11 57 28 34 250 284 2018 11 21
Kentucky 10 45 11 27 223 271 2018 10 16
Louisiana 21 66 25 34 281 295 2018 13 27
Maine NA NA NA NA 28 22 2018 0 0
Maryland 64 182 54 78 665 699 2018 31 99
Massachusetts 178 428 139 258 1330 1616 2018 38 161
Michigan 97 325 76 147 863 1090 2018 34 115
Minnesota 41 114 27 51 795 642 2018 10 32
Mississippi 6 37 14 29 226 245 2018 7 13
Missouri 43 159 21 70 430 546 2018 15 40
Montana NA NA 6 10 56 56 2018 NA NA
Nebraska 7 27 8 25 164 177 2018 10 18
Nevada 9 25 17 24 125 115 2018 NA NA
New Hampshire 15 25 12 21 73 92 2018 NA NA
New Jersey 57 153 38 89 529 595 2018 19 82
New Mexico 15 42 15 40 161 161 2018 NA NA
New York 153 447 150 301 2051 2207 2018 72 266
North Carolina 74 235 67 111 843 890 2018 44 113
North Dakota 8 18 7 14 104 89 2018 NA NA
Ohio 92 299 84 178 957 1094 2018 27 98
Oklahoma 15 67 11 36 235 269 2018 NA NA
Oregon 20 70 28 62 237 300 2018 8 32
Pennsylvania 144 456 70 165 1165 1457 2018 53 172
Puerto Rico NA NA NA NA 88 59 2018 NA NA
Rhode Island 9 30 20 33 140 186 2018 9 26
South Carolina 36 103 16 40 262 306 2018 11 25
South Dakota 8 16 NA NA 51 63 2018 NA NA
Tennessee 58 134 17 57 467 488 2018 11 39
Texas 193 702 132 325 1771 2297 2018 76 213
United Statesd 2453 7726 2118 4214 25368 29798 2018 983 3043
Utah 18 92 23 38 200 311 2018 15 27
Vermont NA NA NA NA 32 31 2018 NA NA
Virginia 79 236 48 103 687 826 2018 26 92
Washington 31 104 50 68 463 501 2018 21 51
West Virginia 6 32 6 14 96 123 2018 6 5
Wisconsin 39 115 51 74 506 575 2018 13 57
Wyoming 6 16 NA NA 36 66 2018 NA NA


Methodology - Linear Regression Models

For each discipline, the numbers of male and female PhD recipients were first expressed as proportions of the national total (proportion = Grads in Gender Category / Total Grads for US that year). With this normalization, differences and trends were not statistically significant (p-value > 0.05) for Engineering or Math & Computer Science. Proportions were then recalculated with respect to the total number of PhD grads in each state (Grads in Gender Category / Total Grads in State that Year). For each discipline, a linear regression and a Pearson correlation against year were calculated for each gender proportion and for the difference between the gender proportions. Based on these models, inferences were made for each category about the state of the “gender gap” in 2030, the year tied to the goals set by the United Nations.
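As a rough illustration of the two normalizations, the sketch below computes both proportions for the 2018 Texas and Utah rows of the table above. It assumes that the first two numeric columns of the table are female and male Engineering PhDs and that the fifth and sixth are the gender totals; that column order is inferred from the values, not stated in this section. The data frame name `rows` is hypothetical.

```r
# 2018 rows for Texas and Utah, copied from the table above.
# ASSUMPTION: columns 1-2 are EngFemale/EngMale, columns 5-6 are the gender totals.
rows <- data.frame(
  State       = c("Texas", "Utah"),
  EngFemale   = c(193, 18),
  EngMale     = c(702, 92),
  TotalFemale = c(1771, 200),
  TotalMale   = c(2297, 311)
)

# National total for 2018, from the "United States" row (same column assumption)
usTotal <- 25368 + 29798

# State total, then the two candidate proportions for female Engineering PhDs
rows$StateTotal      <- rows$TotalFemale + rows$TotalMale
rows$natPropFemale   <- rows$EngFemale / usTotal         # share of all US grads
rows$statePropFemale <- rows$EngFemale / rows$StateTotal # share of that state's grads

print(rows[, c("State", "natPropFemale", "statePropFemale")])
```

With the national denominator, the value mostly reflects how large a state is; with the state denominator, a large and a small state can be compared on the same scale.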

Engineering


Female and Male

Findings:
  • There is a statistically significant positive correlation between year and the proportion of Engineering graduates in each state (Females: correlation coefficient = 0.28; p-value = 0.000000004; Males: correlation coefficient = 0.219; p-value = 0.000006). This means that the proportions of male and female Engineering grads with respect to the total grads in each state are both increasing.
  • We can say with 95% confidence that the actual correlation coefficient for Males falls between 0.126 and 0.31, and for Females between 0.19 and 0.37.
  • According to the model, if this trend continues to 2030, the proportion of female PhD graduates will increase to about 6.1%, and the proportion of male graduates to about 17%.
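The 2030 figures can be checked by hand from the fitted coefficients. The sketch below uses the intercepts and slopes printed by `summary()` in this section. It assumes that 2030 corresponds to a year index of 22 in the model's coding of `year`; that index is not stated in the report, but it is the value that reproduces the quoted 6.1% and 17%.

```r
# Coefficients copied from the summary() output of the two Engineering models
female <- c(intercept = 0.0295246, slope = 0.0014155)
male   <- c(intercept = 0.1101551, slope = 0.0027420)

# ASSUMPTION: a year index of 22 stands for 2030 in the model's coding of `year`
year2030 <- 22

# Linear extrapolation: intercept + slope * year
predFemale <- female[["intercept"]] + female[["slope"]] * year2030
predMale   <- male[["intercept"]]   + male[["slope"]]   * year2030

round(c(female = predFemale, male = predMale), 3)
```

Because the latest data are from 2018, this is an extrapolation well beyond the observed range and should be read as a projection of the current trend, not a forecast.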
#Select Engineering and Total Data
EngStates <- select(GIANTdf, State, EngFemale, EngMale, year, TotalMale, TotalFemale) %>%
  mutate(diff = EngMale - EngFemale, year = as.numeric(year), StateTotal = TotalMale + TotalFemale) %>% #Calculate differences in male and female graduates and state totals 
  mutate(statePropDiff = diff/StateTotal, statePropMale = EngMale/StateTotal, statePropFemale = EngFemale/StateTotal) %>% # Calculate Proportions
  filter(!str_detect(State, "United")) #Filter out USA Totals

#Female
#Test the pearson correlation
cor.test(EngStates$year, EngStates$statePropFemale)
## 
##  Pearson's product-moment correlation
## 
## data:  EngStates$year and EngStates$statePropFemale
## t = 6.0092, df = 414, p-value = 4.091e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1923334 0.3693292
## sample estimates:
##       cor 
## 0.2832413
#Generate Linear Regression
EngFemale <- lm(statePropFemale ~ year, data = EngStates)

#Print
summary(EngFemale)
## 
## Call:
## lm(formula = statePropFemale ~ year, data = EngStates)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.040848 -0.008001 -0.000170  0.007226  0.090188 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0295246  0.0014810  19.935  < 2e-16 ***
## year        0.0014155  0.0002355   6.009 4.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01382 on 414 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.08023,    Adjusted R-squared:  0.078 
## F-statistic: 36.11 on 1 and 414 DF,  p-value: 4.091e-09
#Male
#Test the pearson correlation
cor.test(EngStates$year, EngStates$statePropMale)
## 
##  Pearson's product-moment correlation
## 
## data:  EngStates$year and EngStates$statePropMale
## t = 4.5958, df = 421, p-value = 5.707e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1258448 0.3075058
## sample estimates:
##       cor 
## 0.2185682
#Generate Linear Regression
EngMale <- lm(statePropMale ~ year, data = EngStates)

#Print
summary(EngMale)
## 
## Call:
## lm(formula = statePropMale ~ year, data = EngStates)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115639 -0.026127  0.000466  0.024087  0.110324 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1101551  0.0037358  29.487  < 2e-16 ***
## year        0.0027420  0.0005966   4.596 5.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03522 on 421 degrees of freedom
##   (97 observations deleted due to missingness)
## Multiple R-squared:  0.04777,    Adjusted R-squared:  0.04551 
## F-statistic: 21.12 on 1 and 421 DF,  p-value: 5.707e-06
EngPlot <- select(EngStates, year, State,statePropMale, statePropFemale ) %>%
  gather("sex","prop",-year, -State) %>%
  drop_na()

#Plot points and regression lines
ggplot(EngPlot, aes(x = year, y = prop, color = sex)) +
  geom_jitter() +
  geom_smooth(method='lm')+
  labs(y = "Proportion", x ="Year", title = "Proportion Engineering PhD by State") +
  theme(plot.title = element_text(hjust = 0.5))


Residual Plots
#Residual Plot - Check that residuals scatter evenly around zero
#(.resid is the residual column ggplot2 adds when it is given an lm object)
ggplot(EngMale, aes(y = .resid, x = year))+
  geom_jitter(color = "green") +
  geom_hline(yintercept = 0, color = "red") +
  labs(y = "Residuals", x ="Year", title = "Male Proportion Engineering PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Histogram of residuals - Check for normal dist
ggplot(EngMale, aes(x = .resid))+
  geom_histogram(fill = "green", color = "black") +
  labs(x = "Residuals", y = "Count", title = "Male Proportion Engineering PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Residual Plot - Check that residuals scatter evenly around zero
ggplot(EngFemale, aes(y = .resid, x = year))+
  geom_jitter(color = "yellow") +
  geom_hline(yintercept = 0, color = "red") +
  labs(y = "Residuals", x ="Year", title = "Female Proportion Engineering PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Histogram of residuals - Check for normal dist
ggplot(EngFemale, aes(x = .resid))+
  geom_histogram(fill = "yellow", color = "black") +
  labs(x = "Residuals", y = "Count", title = "Female Proportion Engineering PhDs") +
  theme(plot.title = element_text(hjust = 0.5))


Proportional Differences

Findings:
  • There is a statistically significant positive correlation between year and the difference in proportions of male and female graduates in each state (correlation coefficient = 0.131; p-value = 0.0075). This means that the difference between male and female grads with respect to the total grads in that state is increasing.
  • We can say with 95% confidence that the actual correlation coefficient falls between 0.035 and 0.224.
  • According to the model, if this trend continues to 2030, the average proportional difference between male and female PhD graduates will increase to about 11%, which suggests that the USA will not be able to meet the goal set by the UN and is in fact demonstrating characteristics of an INCREASING gender gap!
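The 95% confidence interval for a Pearson correlation can be reproduced by hand. The sketch below applies the Fisher z-transform, which is the method `cor.test()` documents for its interval, to the correlation and degrees of freedom reported for the Engineering difference in this section. The variable names are illustrative.

```r
# Values copied from the cor.test() output for statePropDiff
r <- 0.1308504   # sample correlation
n <- 414 + 2     # cor.test reports df = n - 2

# Fisher z-transform: atanh(r) is approximately normal with SE 1 / sqrt(n - 3)
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# Build the interval on the z scale, then transform back to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 4)
```

The interval excludes zero, which is consistent with the p-value below 0.05, but its lower end is close to zero, so the upward trend in the difference is weak.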
#Pearson Correlation and Confidence Test
cor.test(EngStates$year, EngStates$statePropDiff)
## 
##  Pearson's product-moment correlation
## 
## data:  EngStates$year and EngStates$statePropDiff
## t = 2.6855, df = 414, p-value = 0.007533
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03514697 0.22417573
## sample estimates:
##       cor 
## 0.1308504
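#Calculate Model (this reuses the name EngStates, so from here on EngStates is the fitted model, not the data frame)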
EngStates <- lm(statePropDiff ~ year, data = EngStates)

#Display
summary(EngStates)
## 
## Call:
## lm(formula = statePropDiff ~ year, data = EngStates)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.096232 -0.020792 -0.001438  0.018808  0.123351 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0810502  0.0030883  26.244  < 2e-16 ***
## year        0.0013191  0.0004912   2.686  0.00753 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02881 on 414 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.01712,    Adjusted R-squared:  0.01475 
## F-statistic: 7.212 on 1 and 414 DF,  p-value: 0.007533
#Difference in Proportion
ggplot(EngStates, aes(x = year, y = statePropDiff)) +
  geom_jitter(color = "blue") +
  geom_smooth(method='lm', color = "red")+
  labs(y = "Difference in Proportion", x ="Year", title = "Proportional Difference Engineering PhD") +
  theme(plot.title = element_text(hjust = 0.5))


Residual Plots
#Residual Plot - Check that residuals scatter evenly around zero
#(EngStates is the fitted model here; .resid is added by ggplot2 for lm objects)
ggplot(EngStates, aes(y = .resid, x = year))+
  geom_jitter(color = "green") +
  geom_hline(yintercept = 0, color = "red") +
  labs(y = "Residuals", x ="Year", title = "Proportional Difference Engineering PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Histogram of residuals
ggplot(EngStates, aes(x = .resid))+
  geom_histogram(fill = "green", color = "black") +
  labs(x = "Residuals", y = "Count", title = "Proportional Difference Engineering PhDs") +
  theme(plot.title = element_text(hjust = 0.5))


Physical Science


Female and Male

Findings:
  • There is a statistically significant negative correlation between year and the proportion of Physical Science graduates in each state (Females: correlation coefficient = -0.24; p-value = 0.00000016; Males: correlation coefficient = -0.47; p-value < 2.2e-16). This means that the proportions of male and female Physical Science grads with respect to the total grads in each state are both decreasing.
  • We can say with 95% confidence that the actual correlation coefficient for Males falls between -0.54 and -0.40, and for Females between -0.33 and -0.15.
  • According to the model, if this trend continues to 2030, the proportion of female PhD graduates will decrease to about 2%, and the proportion of male graduates to about 0.3% (a fitted proportion of 0.003).
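The t statistic, p-value and R-squared reported for these models all follow from the correlation coefficient and the degrees of freedom. As a minimal sketch, the code below recomputes them for the female Physical Science proportion from the values printed by `cor.test()` in this section. The variable names are illustrative.

```r
# Values copied from the cor.test() output for statePropFemale
r  <- -0.2409759
df <- 460

# t statistic for testing whether the true correlation is zero
tStat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
pValue <- 2 * pt(-abs(tStat), df)

# For a one-predictor regression, R-squared is the squared correlation
rSquared <- r^2

c(t = tStat, p = pValue, R2 = rSquared)
```

The R-squared of roughly 0.06 is a reminder that the trend is statistically significant but that year alone explains only a small share of the variation between states.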


#Select Physical Science and Total Data
PhysciStates <- select(GIANTdf, State, PhysciFemale,PhysciMale, year, TotalMale, TotalFemale) %>%
  mutate(diff = PhysciMale - PhysciFemale, year = as.numeric(year), StateTotal = TotalMale + TotalFemale) %>% #Calculate differences in male and female graduates and state totals 
  mutate(statePropDiff = diff/StateTotal, statePropMale = PhysciMale/StateTotal, statePropFemale = PhysciFemale/StateTotal) %>% # Calculate Proportions
  filter(!str_detect(State, "United")) #Filter out USA Totals

#Female
#Test the pearson correlation
cor.test(PhysciStates$year, PhysciStates$statePropFemale)
## 
##  Pearson's product-moment correlation
## 
## data:  PhysciStates$year and PhysciStates$statePropFemale
## t = -5.3253, df = 460, p-value = 1.58e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3250587 -0.1531130
## sample estimates:
##        cor 
## -0.2409759
#Generate Linear Regression
PhysciFemale <- lm(statePropFemale ~ year, data = PhysciStates)

#Print
summary(PhysciFemale)
## 
## Call:
## lm(formula = statePropFemale ~ year, data = PhysciStates)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.055027 -0.011646 -0.003017  0.007216  0.168859 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0585547  0.0020474  28.600  < 2e-16 ***
## year        -0.0017640  0.0003312  -5.325 1.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02047 on 460 degrees of freedom
##   (58 observations deleted due to missingness)
## Multiple R-squared:  0.05807,    Adjusted R-squared:  0.05602 
## F-statistic: 28.36 on 1 and 460 DF,  p-value: 1.58e-07
#Male
#Test the pearson correlation
cor.test(PhysciStates$year, PhysciStates$statePropMale)
## 
##  Pearson's product-moment correlation
## 
## data:  PhysciStates$year and PhysciStates$statePropMale
## t = -11.527, df = 461, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5408220 -0.3990757
## sample estimates:
##        cor 
## -0.4730039
#Generate Linear Regression
PhysciMale <- lm(statePropMale ~ year, data = PhysciStates)

#Print
summary(PhysciMale)
## 
## Call:
## lm(formula = statePropMale ~ year, data = PhysciStates)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.125640 -0.020717 -0.001753  0.019082  0.144635 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1378888  0.0032804   42.03   <2e-16 ***
## year        -0.0061242  0.0005313  -11.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03291 on 461 degrees of freedom
##   (57 observations deleted due to missingness)
## Multiple R-squared:  0.2237, Adjusted R-squared:  0.222 
## F-statistic: 132.9 on 1 and 461 DF,  p-value: < 2.2e-16
PhysciPlot <- select(PhysciStates, year, State,statePropMale, statePropFemale ) %>%
  gather("sex","prop",-year, -State) %>%
  drop_na()

#Plot points and regression lines
ggplot(PhysciPlot, aes(x = year, y = prop, color = sex)) +
  geom_jitter() +
  geom_smooth(method='lm')+
  labs(y = "Proportion", x ="Year", title = "Proportion Physical Science PhD by State") +
  theme(plot.title = element_text(hjust = 0.5))


Residual Plots
#Residual Plot - Check that residuals scatter evenly around zero
#(.resid is the residual column ggplot2 adds when it is given an lm object)
ggplot(PhysciMale, aes(y = .resid, x = year))+
  geom_jitter(color = "green") +
  geom_hline(yintercept = 0, color = "red") +
  labs(y = "Residuals", x ="Year", title = "Male Proportion Physical Science PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Histogram of residuals - Check for normal dist
ggplot(PhysciMale, aes(x = .resid))+
  geom_histogram(fill = "green", color = "black") +
  labs(x = "Residuals", y = "Count", title = "Male Proportion Physical Science PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Residual Plot - Check that residuals scatter evenly around zero
ggplot(PhysciFemale, aes(y = .resid, x = year))+
  geom_jitter(color = "yellow") +
  geom_hline(yintercept = 0, color = "red") +
  labs(y = "Residuals", x ="Year", title = "Female Proportion Physical Science PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Histogram of residuals - Check for normal dist
ggplot(PhysciFemale, aes(x = .resid))+
  geom_histogram(fill = "yellow", color = "black") +
  labs(x = "Residuals", y = "Count", title = "Female Proportion Physical Science PhDs") +
  theme(plot.title = element_text(hjust = 0.5))


Proportional Differences

Findings:
  • There is a statistically significant negative correlation between year and the difference in proportions of male and female graduates in each state (correlation coefficient = -0.43; p-value < 2.2e-16). This means that the difference between male and female grads with respect to the total grads in that state is decreasing.
  • We can say with 95% confidence that the actual correlation coefficient falls between -0.50 and -0.352.
  • According to the model, if this trend continues to 2030, the average proportional difference between male and female PhD graduates will decrease to about -1.67% (a negative value means more female than male graduates), which suggests that the USA will be able to meet the goal set by the UN and is in fact demonstrating characteristics of a DECREASING gender gap!
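The fitted line can also be used to ask where the gap would reach zero. The sketch below takes the intercept and slope from the `summary()` output for the Physical Science difference model. It assumes that a year index of 22 corresponds to 2030; that index is not stated in the report, but it reproduces the -1.67% quoted above. Converting the zero-crossing index to a calendar year depends on the same coding of `year`.

```r
# Coefficients copied from the summary() output for statePropDiff
intercept <- 0.0793922
slope     <- -0.0043682

# ASSUMPTION: a year index of 22 stands for 2030 in the model's coding of `year`
year2030 <- 22

# Projected male-minus-female proportional difference at that index
gap2030 <- intercept + slope * year2030

# Year index at which the fitted line crosses zero (gap closed)
zeroIndex <- -intercept / slope

round(c(gap2030 = gap2030, zeroIndex = zeroIndex), 4)
```

Under this assumption the fitted line crosses zero a few index units before 2030, which is why the projected difference for 2030 is slightly negative.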
#Pearson Correlation and Confidence Test
cor.test(PhysciStates$year, PhysciStates$statePropDiff)
## 
##  Pearson's product-moment correlation
## 
## data:  PhysciStates$year and PhysciStates$statePropDiff
## t = -10.193, df = 460, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5008705 -0.3518039
## sample estimates:
##       cor 
## -0.429256
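#Calculate Model (this reuses the name PhysciStates, so from here on PhysciStates is the fitted model, not the data frame)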
PhysciStates <- lm(statePropDiff ~ year, data = PhysciStates)

#Display
summary(PhysciStates)
## 
## Call:
## lm(formula = statePropDiff ~ year, data = PhysciStates)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.078540 -0.014753 -0.000071  0.015054  0.086030 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0793922  0.0026487   29.97   <2e-16 ***
## year        -0.0043682  0.0004285  -10.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02648 on 460 degrees of freedom
##   (58 observations deleted due to missingness)
## Multiple R-squared:  0.1843, Adjusted R-squared:  0.1825 
## F-statistic: 103.9 on 1 and 460 DF,  p-value: < 2.2e-16
#Difference in Proportion
ggplot(PhysciStates, aes(x = year, y = statePropDiff)) +
  geom_jitter(color = "blue") +
  geom_smooth(method='lm', color = "red")+
  labs(y = "Difference in Proportion", x ="Year", title = "Proportional Difference Physical Science PhD") +
  theme(plot.title = element_text(hjust = 0.5))


Residual Plots
#Residual Plot - Check that residuals scatter evenly around zero
#(PhysciStates is the fitted model here; .resid is added by ggplot2 for lm objects)
ggplot(PhysciStates, aes(y = .resid, x = year))+
  geom_jitter(color = "green") +
  geom_hline(yintercept = 0, color = "red") +
  labs(y = "Residuals", x ="Year", title = "Proportional Difference Physical Science PhDs") +
  theme(plot.title = element_text(hjust = 0.5))

#Histogram of residuals
ggplot(PhysciStates, aes(x = .resid))+
  geom_histogram(fill = "green", color = "black") +
  labs(x = "Residuals", y = "Count", title = "Proportional Difference Physical Science PhDs") +
  theme(plot.title = element_text(hjust = 0.5))



Math and Computer Science



Female and Male

Findings:
  • The correlations between year and the proportion of female PhD graduates, and between year and the proportion of male PhD graduates, are both estimated to be negative, but neither is statistically significant (p-value > 0.05 for both).
  • It is speculated that this is due to the small number of years available (Math and CS data was gathered beginning in 2015).


#Select Math & CS and Total Data, Calculate differences in male and female graduates and state totals, Calculate Proportions
MathStates <- select(GIANTdf, State, MathFemale,MathMale, year, TotalMale, TotalFemale) %>% 
  drop_na() %>%
  mutate(diff = MathMale - MathFemale, StateTotal = TotalMale + TotalFemale) %>%
  mutate(statePropDiff = diff/StateTotal, statePropMale = MathMale/StateTotal, statePropFemale =MathFemale/StateTotal) %>%
  filter(!str_detect(State, "United"))%>%
  mutate(year = as.numeric(year))

#Female
#Test the pearson correlation
cor.test(MathStates$year, MathStates$statePropFemale)
## 
##  Pearson's product-moment correlation
## 
## data:  MathStates$year and MathStates$statePropFemale
## t = -0.14182, df = 150, p-value = 0.8874
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1704650  0.1478948
## sample estimates:
##         cor 
## -0.01157847
#Male
#Test the pearson correlation
cor.test(MathStates$year, MathStates$statePropMale)
## 
##  Pearson's product-moment correlation
## 
## data:  MathStates$year and MathStates$statePropMale
## t = -0.52297, df = 150, p-value = 0.6018
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2005003  0.1173363
## sample estimates:
##         cor 
## -0.04266132
MathPlot <- select(MathStates, year, State,statePropMale, statePropFemale ) %>%
  gather("sex","prop",-year, -State) %>%
  drop_na()

#Plot points and regression lines
ggplot(MathPlot, aes(x = year, y = prop, color = sex)) +
  geom_jitter() +
  geom_smooth(method='lm')+
  labs(y = "Proportion", x ="Year", title = "Proportion Math & CS PhD by State") +
  theme(plot.title = element_text(hjust = 0.5))


Proportional Differences

Findings:
  • There is a negative correlation between year and the difference in proportions of graduates in each state, but it is not statistically significant (p-value = 0.59). The data therefore do not show that the difference between male and female grads with respect to the total grads in that state is changing.
  • More data is needed to determine if the gender gap is increasing or decreasing for Math and Computer Science.
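To give a sense of how far these results are from significance, the sketch below computes the smallest absolute correlation that would reach p < 0.05 with the 150 degrees of freedom reported by `cor.test()` for the Math & CS data, and compares it with the three observed correlations. The variable names are illustrative.

```r
# Degrees of freedom reported by cor.test() for the Math & CS tests
df <- 150

# Two-sided 5% critical value of t, converted to a critical correlation
tCrit <- qt(0.975, df)
rCrit <- tCrit / sqrt(tCrit^2 + df)

# Observed correlations copied from the cor.test() output
observed <- c(female = -0.01157847, male = -0.04266132, diff = -0.04356413)

round(rCrit, 3)
abs(observed) < rCrit   # TRUE means the correlation is too small to be significant
```

All three observed correlations are several times smaller in magnitude than the threshold, so a clear trend would need more years of data or a stronger effect to be detected.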
#Pearson Correlation and Confidence Test
cor.test(MathStates$year, MathStates$statePropDiff)
## 
##  Pearson's product-moment correlation
## 
## data:  MathStates$year and MathStates$statePropDiff
## t = -0.53406, df = 150, p-value = 0.5941
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2013683  0.1164442
## sample estimates:
##         cor 
## -0.04356413
#Difference in Proportion
ggplot(MathStates, aes(x = year, y = statePropDiff)) +
  geom_jitter(color = "blue") +
  geom_smooth(method='lm', color = "red")+
  labs(y = "Difference in Proportion", x ="Year", title = "Proportional Difference Math & CS PhD") +
  theme(plot.title = element_text(hjust = 0.5))



Conclusion


As seen in the visualisations above, there still exists a significant gender gap in disciplines associated with STEM fields such as Engineering, Physical Science and Math & Computer Science. The proportions of PhD recipients in Engineering are increasing for both male and female groups; however, they are not increasing at the same rate. Thus, despite the increase in popularity of PhDs in Engineering, the gap between males and females is increasing. This may lead to further underrepresentation of females and other marginalized groups in Engineering. On the other hand, we see that the proportion of people earning PhDs in Physical Science is decreasing. Since the proportion of males is decreasing faster than the proportion of females, the model suggests that the gender gap will be closed by the goal year set by the United Nations: 2030. Last but not least, although the data for the total number of people earning PhDs in Math suggest that the number of recipients in both groups is increasing, there is not enough data in this data set to support a statistically significant linear regression model.

Reflection on Learning


There were many versions of this project that I worked on, which made it difficult for me to decide the best approach for using the data to make meaningful inferences. First, I had to really get to know the data and how male and female numbers compared for every discipline and every state. Once I was able to confirm that females were underrepresented in STEM fields, I began analyzing the relationship between different proportions. First I thought that it would make sense to somehow normalize the numbers by state and by year. I initially analyzed the data with respect to the total number of PhD recipients in that year; however, the variation in numbers between states made the p-values very large (which I didn’t realize until hours into working with the data). Then I tried to normalize by comparing proportions of people to the total number of people who earned PhDs in that state. This yielded p-values that were much lower, which allowed me to rely on my model for extrapolation. I am really passionate about equity in education, so the process of doing this project really helped me feel that I can contribute somehow to helping others understand the story that the data is telling. In the future I hope to read more literature about this topic and somehow explain the patterns in the data that I found. For example, what happened in 2015 that Math and CS was suddenly included in the data set? Does this somehow reflect the values outlined by the United Nations in 2015? How can we measure the gender gap in higher education for Math & Computer Science? We see a general increase in both males and females receiving PhDs; how can we accelerate the growth in the number of females who graduate to reduce the gap? Why are some states omitted from the data while others are not?

Further Research


In the future I would like to explore the questions outlined in my reflection above. I would also like to research other factors that may contribute to the number of males and females who earn PhDs in STEM-related fields. I would like to see if another model can be developed to predict the number of PhD recipients of each gender (for example, multiple regression?). I would also like to search for more data regarding Math and Computer Science graduate degrees to see if a statistically significant model can be achieved. Lastly, I am curious to see data about the number of females who apply and get accepted into PhD programs in the United States. How many males and females are being denied admission? How many males and females who apply and get accepted to PhD programs in STEM actually complete their degree and earn a PhD? This is truly a fascinating topic! Thank you for reading!