Final Project

Question of Interest

For this project, we wanted to see whether there is a glass ceiling at UVA, specifically whether there is a significant difference between male and female salaries. While this is a hotly contested question, the key concept to keep in mind is whether the faculty members being compared are indeed “doing the same job.” We use each faculty member’s department and tenure status to judge whether the faculty members are indeed “doing the same job” and how this affects their salary.

To explore this, we posed the following questions:

Is gender a top determining factor within the top N% of faculty at UVA?

We decided to use a KNN model to try to answer this question. However, as we will explain later, we ran into a barrier that prevented us from answering the question using KNN. Instead, we reworded our question as:

Can we predict an employee’s gender based on their salary? What about if we add in tenure and/or department?

We decided to use a Decision Tree to explore which of these variables would be most important in determining gender.

Can we predict an employee’s tenure based on their gender? What about if we add in department and/or salary?

We decided to use a Decision Tree to explore which of these variables would be most important in determining tenure.

Can we predict the tenure based on salary (as well as other variables)?

For the same purposes as question two we decided to use a Decision Tree model to explain what role tenure plays in UVA faculty positions.

Data

The data we used for our initial background research was the Cavalier Daily’s 2020 Faculty Salaries. The dataset used in our evaluation was a smaller, more specific dataset on UVA’s School of Engineering & Applied Science faculty. However, this dataset only included each faculty member’s name, ID, department, and tenure (Tenured vs. On Track vs. AGF).

The Cavalier Daily’s dataset was used to merge the salary into the dataset based on name. However, some names were not matched up because some faculty members use nicknames or had a change in last name likely due to getting married recently. Faculty members who were not found in the Cavalier Daily were manually double checked and their salaries were added if the department, role, and first name or last name matched up.
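The name-based merge described above can be sketched roughly as follows. This is only an illustration, not the script we used: the helper names (`name_key`, `merge_salaries`), the "Last, First" layout of the salary table, and the "Doe" entries are assumptions made for the example.

```python
def name_key(full_name):
    """Reduce 'Last, First Middle' to a lowercase (last, first) key."""
    last, _, rest = full_name.partition(",")
    parts = rest.split()
    first = parts[0] if parts else ""
    return (last.strip().lower(), first.lower())


def merge_salaries(faculty_names, salary_by_name):
    """Return (matched, unmatched); unmatched names are left for manual review."""
    lookup = {name_key(name): salary for name, salary in salary_by_name.items()}
    matched, unmatched = {}, []
    for name in faculty_names:
        key = name_key(name)
        if key in lookup:
            matched[name] = lookup[key]
        else:
            unmatched.append(name)
    return matched, unmatched


# Hypothetical salary table: two entries reuse names and salaries from this
# report's data preview; "Doe, Jonathan" is invented to show a nickname miss.
salary_table = {
    "Chen, David": 120967.0,
    "Barker, Shannon": 91000.0,
    "Doe, Jonathan": 100000.0,
}
faculty = ["Chen, David", "Barker, Shannon D", "Doe, Jon"]
matched, unmatched = merge_salaries(faculty, salary_table)
```

Matching on last name plus first given name tolerates a middle initial, but a nickname such as "Jon" still falls through to the manual check.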

Multiple online name and gender dictionaries were used in conjunction with a Python program to determine the gender of the faculty member based on their first name since this information was not readily available. If any of the dictionaries labeled the name as indeterminately female/male (aka a gender-neutral name like Alex), the faculty member was given a ? for gender even if another gender dictionary gave it a determinate female or male labeling. If the gender was not found in any of the datasets, it was given a NA for gender.
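A minimal sketch of this labeling rule is shown below. The two small dictionaries are made up for the example, the function name `label_gender` is hypothetical, and treating an F/M disagreement between dictionaries as "?" is our assumption, since the rule above only covers names a dictionary explicitly marks as indeterminate.

```python
def label_gender(first_name, dictionaries):
    """Combine several name -> label dictionaries into 'F', 'M', '?', or 'NA'.

    Each dictionary maps a lowercase first name to 'F', 'M', or '?'.
    """
    key = first_name.lower()
    labels = {d[key] for d in dictionaries if key in d}
    if not labels:
        return "NA"  # name missing from every dictionary
    if "?" in labels or len(labels) > 1:
        # Any indeterminate vote wins; an F/M conflict is also treated as
        # indeterminate (assumption, not stated in the report).
        return "?"
    return labels.pop()


# Made-up dictionaries for illustration only.
dict_a = {"david": "M", "shannon": "F", "alex": "?"}
dict_b = {"david": "M", "alex": "M"}

labels = {name: label_gender(name, [dict_a, dict_b])
          for name in ["David", "Shannon", "Alex", "Zyx"]}
```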

In our evaluation, we will only be looking at faculty members who have a determinate female or male name.

Reading in the Data

The data has previously been merged and cleaned using Python, but we are going to make sure all of the rows with incomplete data are filtered out. Specifically, we will be filtering out employees whose first name the Python program could not find in any existing gender-name dictionary or whose name was deemed gender-neutral, like Alex.

##   Department            Position   Tenure              Full.Name    ID Dpt
## 1        BME            Lecturer      AGF            Chen, David dc9rk SOM
## 2        BME Assistant Professor      AGF      Barker, Shannon D sb3xk BME
## 3        BME Assistant Professor On Track          Civelek, Mete mc2wq SOM
## 4        BME Assistant Professor On Track           Griffin, Don dg2gf BME
## 5        BME Assistant Professor On Track Highley, Christopher H ch2qm BME
## 6        BME Assistant Professor On Track            Zunder, Eli  ez8v BME
##    First.Name Last.Name Gender       Salary
## 1       David      Chen      M $120,967.00 
## 2     Shannon    Barker      F  $91,000.00 
## 3        Mete   Civelek      M $110,800.00 
## 4         Don   Griffin      M $128,000.00 
## 5 Christopher   Highley      M $127,900.00 
## 6         Eli    Zunder      M $129,000.00
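The Salary column above is stored as text such as "$120,967.00 ", so it has to be converted to a number before any modeling. The sketch below shows one way to do that conversion together with the gender filter; the helper names are hypothetical and the third row is invented to show a dropped record.

```python
def parse_salary(text):
    """Convert a string such as '$120,967.00 ' to a float."""
    return float(text.strip().replace("$", "").replace(",", ""))


def keep_determinate(rows):
    """Keep rows whose Gender is 'F' or 'M'; drop '?' and missing values."""
    return [row for row in rows if row.get("Gender") in ("F", "M")]


rows = [
    {"Full.Name": "Chen, David", "Gender": "M", "Salary": "$120,967.00 "},
    {"Full.Name": "Barker, Shannon D", "Gender": "F", "Salary": "$91,000.00 "},
    # Invented row with a gender-neutral first name, to show the filter.
    {"Full.Name": "Hypothetical, Alex", "Gender": "?", "Salary": "$100,000.00 "},
]
clean = [dict(row, Salary=parse_salary(row["Salary"]))
         for row in keep_determinate(rows)]
```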

Data Walkthrough

Gender Distribution

To determine the gender split, we computed the ratio of female entries to total entries; the male share follows as the complement. To do this we created the function ad_split, which prints the ratio of female to total entries, and ran it on the clean salary dataset. Within the School of Engineering, the split is 22 percent female to 78 percent male.
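As a rough illustration of the calculation, here is a small Python stand-in. It is not the ad_split function itself: the name `gender_split` is hypothetical, it returns the ratio instead of printing it, and the sample list is made up to mirror the reported 22/78 split.

```python
def gender_split(genders, positive="F"):
    """Return the share of entries equal to `positive`."""
    if not genders:
        raise ValueError("cannot compute a split on an empty list")
    return sum(g == positive for g in genders) / len(genders)


# Made-up sample chosen to mirror the reported proportions.
sample = ["F"] * 22 + ["M"] * 78
female_share = gender_split(sample)
male_share = 1 - female_share
```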

Average Salary by Gender Between all UVA schools

Average Salary by Gender Across All SEAS Departments

Average Salary in SEAS by Factor

Graph of Average Salary in SEAS by Factor

Cleaning the Data

To filter out the variables we did not need, we created a list of their column names (vars) and saved the dataset over itself without the columns listed in vars.

## Department   Position     Tenure     Gender     Salary 
##  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"
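The output above shows every column stored as a number, which means the categorical columns were converted to integer codes. A sketch of such an encoding is given below, assuming codes are assigned in sorted order starting at 1 (the default behavior of R factors); the function name `encode_factor` is hypothetical.

```python
def encode_factor(values):
    """Map each distinct value to a 1-based integer code, in sorted order."""
    levels = sorted(set(values))
    code_of = {level: i + 1 for i, level in enumerate(levels)}
    return [code_of[v] for v in values], levels


tenure = ["AGF", "AGF", "On Track", "Tenured", "On Track"]
codes, levels = encode_factor(tenure)
```

These codes are labels, not quantities: the numeric distance between two departments or two tenure categories carries no meaning.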

Approach #1: KNN

Before entering this process, it is important to note that KNN was the first method we wanted to use to build an ML model. However, it does not work for this data, even if we are trying to predict the salary of the professors, because there are simply not enough scalar variables to use a KNN model. Everything in this dataset is either a factor or the salary, so a KNN model is not feasible.

Question #1: Predicting Salary

We first ran the function cor() to discover how the columns were related to each other. We wanted to check whether any variables were highly correlated and remove them so as not to skew our KNN. We used a threshold of 0.7 to decide whether two variables were too highly correlated.
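The thresholding step can be sketched as below. This is a plain Python illustration of a Pearson correlation screen, not the cor() call itself; the function names and the toy columns are assumptions for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def highly_correlated(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Toy columns, not the faculty data: y is a multiple of x, z is unrelated.
toy = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, -1, -1, 1]}
flagged = highly_correlated(toy)
```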

Correlation Between Variables

None of the variables are too highly correlated, so we will leave them all in. We note that Tenure and Salary have a correlation of 0.6, so we will keep an eye on that pair.

The good thing we can see here, though, is that Gender and Salary only have a correlation of 0.15, which is low and indicates that there isn’t a glass ceiling we should be worried about if we are looking only at gender. However, more goes into determining whether there is indeed a glass ceiling, specifically whether the employees we are comparing are indeed doing “the same job”.

Generate Training and Testing Sets

Data Sets

Training Data

##     Department Position Tenure Gender Salary
## 15           1        5      3      1 195500
## 112          5        1      2      1 135400
## 186          8        5      3      2 185200
## 161          6        5      3      2 221400
## 60           3        1      2      1 152800
## 111          5        1      1      2  86700

Testing Data

##    Department Position Tenure Gender Salary
## 1           1        4      1      2 120967
## 2           1        1      1      1  91000
## 4           1        1      2      2 128000
## 10          1        2      3      2 150700
## 19          1        5      3      2 166400
## 24          1        5      3      2 157311

Choosing the Best K

Unfortunately, we see that there is no good value of k if we use KNN, so instead we can look at which variable is most important using a decision tree!

Approach #2: Decision Trees

Question #2: Predicting Gender

We’re going to switch over to using a decision tree to see if we can accurately predict an employee’s gender based on a combination of 1-3 factors from their salary, department, and tenure.

From the first approach, we can see that certain departments have high gender disparities, which we attribute more to employees’ choice of department than to UVA being unfair and having a glass ceiling.

Cleaning the Data

To filter out the variables we did not need, we created a list of their column names (vars) and saved the dataset over itself without the columns listed in vars.

Gender: Male is 1, Female is 0

Splitting

We will start off our decision tree by splitting the data into a training and a testing set.
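A seeded random split can be sketched as follows. The 80/20 proportion, the seed, and the function name are assumptions for illustration; the proportions actually used are not stated in this report.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and split into (train, test)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(100))  # stand-in for 100 data rows
train, test = train_test_split(rows)
```

Fixing the seed makes the split reproducible, which matters when comparing trees built from different seeds.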

Building the Model

We build our model with the rpart function using its default settings. Gender is the response variable in the model formula. We fit the model on our previously split training dataset with a cp of 0.01, which is the rpart default.

Decision Tree

Visually

First, we can take a look to see the order of importance of the employee’s other factors that we can use to guess gender with.

Most important is the department that the employee is in.

If they are in E+S (APMA) or MAE, then salary is the second most important factor. If their salary is below 77k, the decision tree assigns them to be female. If it is over 77k, they are assigned to be male.

If they are in any other department, they are determined to be male as amongst these departments, there are 113 males to 19 females.
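Written out as code, the tree described above reduces to two nested rules. This is a hand transcription of the description, not output from the fitted model; the department labels are taken from the text, and sending a salary of exactly 77k to the male branch is an assumption.

```python
SPLIT_DEPARTMENTS = {"E+S (APMA)", "MAE"}
SALARY_CUTOFF = 77_000


def predict_gender(department, salary):
    """Apply the two splits of the gender tree as described in the text."""
    if department in SPLIT_DEPARTMENTS:
        # Only these departments reach the salary split.
        return "F" if salary < SALARY_CUTOFF else "M"
    # Every other department is a single leaf predicting male.
    return "M"


examples = [
    predict_gender("MAE", 70_000),
    predict_gender("MAE", 90_000),
    predict_gender("BME", 50_000),
]
```

Note that outside the two listed departments the tree never predicts female, regardless of salary.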

Variable Importance

## Department     Salary   Position     Tenure 
##  6.3894759  2.9246765  1.2006388  0.3812613

In order of decreasing importance, department, salary, position, then tenure are the most important factors within the School of Engineering that can be used to accurately determine the faculty member’s gender.
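To make the raw importance scores easier to compare, they can be rescaled to shares of the total. The sketch below does this for the four values printed above; the rescaling is added here as an illustration and was not part of the original output.

```python
importance = {
    "Department": 6.3894759,
    "Salary": 2.9246765,
    "Position": 1.2006388,
    "Tenure": 0.3812613,
}

total = sum(importance.values())
# Share of total importance per variable, in the original (descending) order.
shares = {name: value / total for name, value in importance.items()}

for name, share in shares.items():
    print(f"{name:<10} {share:.1%}")
```

On this scale department accounts for more than half of the total importance.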

Ideally, the order would have been department, position, tenure, then salary. Department would be at the top of the list because it is, for the most part, chosen by the employee. Position would be next because it is also driven by the employee and their background coming into UVA. Third is tenure, which is similar to position but has less to do with the employee’s resume coming into UVA and more to do with how UVA treats the employee throughout their time at UVA. If this factor were ranked higher, it would indicate that there is likely bias in UVA’s own process for choosing professors for tenure. However, there is still a level of employee drive and initiative required to even apply to be on track for tenure vs. staying as an AGF. Last is salary, because salary is more reactive to the employee’s current responsibilities and role at UVA. Tenure is similarly important, but it can take 5-20 years for the university to make a binary choice, whereas the university reevaluates the professor’s salary each year.

Next, we use the predict function to predict the target variable.

Confusion Matrix

Hit and Detection Rate

The detection rate is 66.7% with a prevalence of 73.8%. Our model currently has an accuracy of 69.1%.

The hit rate (true positive rate) can be derived from the confusion matrix by dividing the number of true positives (28) by the total number of actual positive cases (28 true positives + 3 false negatives = 31), for a hit rate of 90.3%.

The error rate (misclassified cases over all cases) is 31.0%.

The detection rate, the share of all 42 test cases that are true positives (28/42), is 66.7%. This is distinct from the true positive rate of 90.3%.

The true positive rate could still be improved: ideally the confusion matrix would have a value of 0 in cell (1,2), which would mean a hit rate of 100%. Overall, this is a pretty good rate for our decision tree. The main concern is the moderate number of false positives (10 cases out of 42).
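The rates in this section can be recomputed from the four cells of the confusion matrix. In the sketch below, the true positive, false negative, and false positive counts come from the text, while the single true negative is inferred from the total of 42 test cases (42 - 28 - 3 - 10 = 1); the function name is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Summary rates for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "true_positive_rate": tp / (tp + fn),
    }


# tp, fn, fp are reported in the text; tn = 1 is inferred from the total of 42.
metrics = binary_metrics(tp=28, fn=3, fp=10, tn=1)

for name, value in metrics.items():
    print(f"{name:<20} {value:.3f}")
```

Under these counts the figures agree with the reported values to within rounding. They also show that only one actual negative case is classified correctly, so the high hit rate largely reflects the tree predicting the majority class.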

Question #3: Predicting Tenure

We are going to use a similar decision tree to see if we can accurately predict an employee’s tenure based on a combination of 1-3 factors from their salary, department, and gender.

Splitting

We will start off our decision tree by splitting the data into a training and a testing set, as we did for the previous tree.

Building the Model

We build our model with the rpart function using its default settings. Instead of gender, this time Tenure is the response variable in the model formula. We fit the model on our previously split training dataset with a cp of 0.01, which is the rpart default.

Decision Tree

Visually

First, we can take a look to see the order of importance of the employee’s other factors that we can use to guess tenure with.

The most influential variable in our decision tree is the position that a given faculty member occupies. If they are an assistant professor, research associate, associate professor, or professor, then the model immediately classifies them as tenured without considering any other variable. If a faculty member does not hold one of these positions, the next most important classifier is salary. If the faculty member’s salary is greater than 125k, the model classifies them as on track for tenure; otherwise they are classified as AGF.
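The same description can be transcribed into a short function. As with any hand transcription, this is a sketch of the rules as worded above, not an export of the fitted tree; the case-insensitive position matching is an assumption.

```python
TENURED_POSITIONS = {
    "assistant professor",
    "research associate",
    "associate professor",
    "professor",
}
SALARY_CUTOFF = 125_000


def predict_tenure(position, salary):
    """Apply the position split, then the salary split, as described in the text."""
    if position.strip().lower() in TENURED_POSITIONS:
        return "Tenured"
    return "On Track" if salary > SALARY_CUTOFF else "AGF"


examples = [
    predict_tenure("Assistant Professor", 110_800),
    predict_tenure("Lecturer", 120_967),
    predict_tenure("Lecturer", 130_000),
]
```

Applied to the rows shown under Reading in the Data, these rules label the assistant professors whose actual status is On Track as Tenured, which is one way to see why the test accuracy for this tree is weak.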

Variable Importance

##   Position     Salary Department     Gender 
## 49.3846883 30.6207915 13.0733133  0.8071358

In order of decreasing importance, position, salary, department, and then gender are the most important factors within the School of Engineering that can be used to accurately determine the faculty member’s tenure. This is the ideal order as it pertains to gender because it is not only last in ranking of importance, but it also is so unimportant that the decision tree doesn’t make any of its decisions on it at all.

To be honest, my perfect order

Prediction

Next, we use the predict function to predict the target variable in the testing set.

Confusion Matrix

Hit and Detection Rate

The detection rates, prevalence, and sensitivity for this model are all pretty abysmal across the three categories; the highest value is the sensitivity for the tenured class at 0.77, and the overall accuracy for this model is under 50%.

The error rate is 52.4%, which corresponds to an overall accuracy of 47.6%.

Conclusion

While this decision tree can guess the tenure status of a professor better than random, it is by no means a good model. Regardless of the validity of the model, we can still infer some answers to our question from the variable importance and from how this model was formed. Gender has very little to do with tenure status. When I changed the seed multiple times, there was not a single model that used the gender category to predict the tenure status of the faculty. Kudos to UVA for that (otherwise there would be a problem).

Fairness Assessment

Because of how we used gender dictionaries to assign genders to the employees rather than obtaining that data from the employee, there were difficulties determining genders for gender-neutral names or names that did not appear in these dictionaries. While gender-neutral named groups are misrepresented, it is likely that these are well-spread between genders, departments, and tenure.

The more concerning underrepresented group is the employees whose names do not appear in the gender dictionaries we used. These are more likely to be foreign employees, since the gender dictionaries we used were written in English. Thus, it is likely that departments with larger groups of foreign employees, like the computer science department, are misrepresented in our analysis.

Future Analysis

In the future, we would ideally get more information like tenure and position across more UVA schools. The data on all UVA employees we got was unstandardized in position name and lacked either tenure status or tenure in regards to how long the employee had been at UVA making it difficult to work with.

Furthermore, our analysis would be more accurate if we could obtain each employee’s gender directly rather than using the gender dictionaries to try to determine it (which also resulted in the aforementioned misrepresentations).

Overall, we would want to be able to rerun this analysis on a larger dataset with gender being given to us rather than using the Python tool.

Once we had this larger sample, it would also be interesting to compare the effects of race, credits being taught, and tenure in respect to how long the faculty member has been at UVA. This further expands our exploration of whether or not UVA is indeed paying faculty members fairly for “doing the same job”.

UVA SEAS Salary and Gender

Megan Lin, James Powell, Eva Mustafic

5/5/2021
