In this exercise you will do quick descriptive and predictive analytics to evaluate whether the Salaries data set (with professor salaries) supports the gender pay gap hypothesis.
Download the R Quarto template for this exercise, Ex2_Foundations_YourLastName.qmd, and save it to your class project folder. Rename the file, replacing YourLastName with your own last name.
Open the file in RStudio and complete the coding exercises and answer the interpretation questions. Run the code to ensure everything is working fine.
I will always display the exercise code output in the instructions on Canvas, so that you can compare your results against the solution. Technical Note: Your quantitative and visual outputs should look identical to the outputs shown in the homework. Now is a good time to compare your results and ask me questions if they differ.
When done, knit your R Quarto file into a Word document. On Canvas, please submit both the knitted Word document AND the .qmd file. If you have trouble knitting a Word file, you can knit to a PDF file, or knit to an HTML file and then save it as a PDF.
Knitting
Preparing analytic reports using the {knitr} package is an important learning objective of this course. Your R Quarto file MUST have the attribute echo=T in the {r global options} so that we can see and grade your R code. The knitted document should adhere to proper business-like formatting and appearance. This means all interpretations should be in markdown sections, NOT code comments.
Inadequate or missing knitting will therefore carry deductions of up to 30 points.
1. Descriptive Analytics
1.1 Examine the data
Is there a gender pay gap? Let’s take a look.
Load the library {car}, which contains the Salaries data set. Then, list the first few records with head(Salaries). Then display the summary() for this dataset, which will show frequencies for the categorical variables.
Then, load the library {psych} which contains the describe() function and use this function to list the descriptive statistics for the dataset.
Then display the mean salary grouped by gender using the aggregate() function (feed it the formula, the dataset and the aggregate function, i.e., salary ~ sex, Salaries, mean).
library(car)
head(Salaries)
rank discipline yrs.since.phd yrs.service sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 AsstProf B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
6 AssocProf B 6 6 Male 97000
summary(Salaries)
        rank     discipline yrs.since.phd    yrs.service        sex          salary
 AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39   Min.   : 57800
 AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358   1st Qu.: 91000
 Prof     :266              Median :21.00   Median :16.00                Median :107300
                            Mean   :22.31   Mean   :17.61                Mean   :113706
                            3rd Qu.:32.00   3rd Qu.:27.00                3rd Qu.:134185
                            Max.   :56.00   Max.   :60.00                Max.   :231545
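The describe() and aggregate() steps requested above have no code or output shown here. A minimal sketch of how they might look, assuming the {car} and {psych} packages are installed (mean_by_sex is an illustrative name, not part of the template):

```r
library(car)    # attaches carData, which supplies the Salaries data frame
library(psych)  # supplies describe()

# Descriptive statistics for every column; psych marks factor columns with *
describe(Salaries)

# Mean salary for each level of sex, using the formula interface
mean_by_sex <- aggregate(salary ~ sex, data = Salaries, FUN = mean)
print(mean_by_sex)
```

Swapping FUN = mean for FUN = median would give the group medians instead.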
Load the library GGally and run the ggpairs() function on the salary, sex and yrs.since.phd variables (only) in the Salaries data set to display some basic descriptive statistics and correlations visually (notice that the dataset Salaries is capitalized, whereas the variable salary is not). Please label your variables appropriately (see graph below).
Tips: ggpairs() requires a data frame. So you need to use the data.frame() function to bind the necessary column vectors into a data frame (e.g., ggpairs(data.frame("Salary"=Salaries$salary, etc.). Notice the difference in the quality of the graphics and how categorical variables are labeled. Also, add the attribute upper=list(combo='box') at the end to get labels for the boxplot.
Finally, conduct an ANOVA test to evaluate if there is a significant difference between mean salaries for male and female faculty. Feed Salaries$salary ~ Salaries$sex into the aov() function. Embed the aov() function inside the summary() function to see the statistical test results.
require(GGally)
attach(Salaries)
D_sal = data.frame("Salary"=Salaries$salary, "Gender"=Salaries$sex, "Yrs Since PhD"=Salaries$yrs.since.phd)
ggpairs(D_sal, upper=list(combo='box'))
summary(aov(Salaries$salary~Salaries$sex))
Df Sum Sq Mean Sq F value Pr(>F)
Salaries$sex 1 6.980e+09 6.980e+09 7.738 0.00567 **
Residuals 395 3.563e+11 9.021e+08
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
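The same test can also be written with the data = argument, which avoids repeating Salaries$ in the formula. The sketch below is one way to do that and to pull the F statistic and p-value out of the table; aov_fit, aov_tab, f_stat and p_val are illustrative names. The row label in the printed table becomes sex rather than Salaries$sex, but the statistics are the same.

```r
library(car)  # attaches carData, which supplies Salaries

# One-way ANOVA of salary on sex
aov_fit <- aov(salary ~ sex, data = Salaries)
aov_tab <- summary(aov_fit)[[1]]  # the ANOVA table as a data frame
print(aov_tab)

# F statistic and p-value for the sex term (first row of the table)
f_stat <- aov_tab[["F value"]][1]
p_val  <- aov_tab[["Pr(>F)"]][1]
```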
1.3 Preliminary Interpretation
Based on the output above, does there appear to be a gender pay gap? Why or why not? In your answer, please refer to as much of the data above as possible to support your answer.
Response: The data above do indicate a pay gap, with higher salaries for males than for females. In the upper triangle of the ggpairs grid, the boxplot of salary by gender shows a higher interquartile range for males than for females, and the ANOVA confirms that the difference in mean salaries is statistically significant (F = 7.738, p = 0.00567). The scatter plot in the lower triangle shows that salaries increase with years since PhD, and the boxplot of years since PhD by gender shows that males tend to have more years since completing their PhD. Seniority may therefore account for part of the observed gap.
2. Basic Predictive Modeling
2.1 Salary Gender Gap: Simple OLS Regression
Suppose that you hypothesize that there is a gender pay gap. Fit a linear model with the lm() function to test this hypothesis, predicting salary using only sex as a predictor. Store the results in an object called lm.fit.1, then inspect the results using the summary() function. Do these results support the gender pay gap hypothesis? Briefly explain why.
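A minimal sketch of the fit described above. The Call line in the output that follows shows no data argument, which suggests the solution relied on the earlier attach(Salaries); passing data = Salaries is an assumption made here for clarity and changes only the Call line.

```r
library(car)  # attaches carData, which supplies Salaries

# Regress salary on sex alone; Female is the reference level of the factor
lm.fit.1 <- lm(salary ~ sex, data = Salaries)
summary(lm.fit.1)
```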
Call:
lm(formula = salary ~ sex)
Residuals:
Min 1Q Median 3Q Max
-57290 -23502 -6828 19710 116455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101002 4809 21.001 < 2e-16 ***
sexMale 14088 5065 2.782 0.00567 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30030 on 395 degrees of freedom
Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
Response: This model supports the gender pay gap hypothesis. R dummy-codes sex as 0 for Female and 1 for Male, so the intercept (101,002) is the mean salary for females and the sexMale coefficient is the difference between the male and female means. The sexMale coefficient is positive and statistically significant (p = 0.00567), indicating that males earn USD 14,088 more than females on average.
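The dummy-coding interpretation can be checked directly against the group means. The sketch below is one way to do that; fit, group_means, female_ok and male_ok are illustrative names.

```r
library(car)  # attaches carData, which supplies Salaries

fit <- lm(salary ~ sex, data = Salaries)
group_means <- tapply(Salaries$salary, Salaries$sex, mean)

# The intercept should equal the mean salary of the reference level (Female)
female_ok <- isTRUE(all.equal(coef(fit)[["(Intercept)"]], group_means[["Female"]]))

# The intercept plus the sexMale coefficient should equal the mean salary for Male
male_ok <- isTRUE(all.equal(coef(fit)[["(Intercept)"]] + coef(fit)[["sexMale"]],
                            group_means[["Male"]]))

print(c(female = female_ok, male = male_ok))
```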
2.2 Multivariate OLS Regression
Now fit a linear model with sex and yrs.since.phd as predictors and save it in an object named lm.fit.2. Then inspect the results using the summary() function. Do these results support the salary gender gap hypothesis? Briefly explain why.
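A minimal sketch of the two-predictor fit. As with the first model, data = Salaries is an assumption; the output that follows appears to come from a call that relied on attach(Salaries).

```r
library(car)  # attaches carData, which supplies Salaries

# Add yrs.since.phd so the sex coefficient is estimated holding seniority constant
lm.fit.2 <- lm(salary ~ sex + yrs.since.phd, data = Salaries)
summary(lm.fit.2)
```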
Call:
lm(formula = salary ~ sex + yrs.since.phd)
Residuals:
Min 1Q Median 3Q Max
-84167 -19735 -2551 15427 102033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85181.8 4748.3 17.939 <2e-16 ***
sexMale 7923.6 4684.1 1.692 0.0915 .
yrs.since.phd 958.1 108.3 8.845 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27470 on 394 degrees of freedom
Multiple R-squared: 0.1817, Adjusted R-squared: 0.1775
F-statistic: 43.74 on 2 and 394 DF, p-value: < 2.2e-16
2.3 Comparing Models with ANOVA F-Test
Run an ANOVA test using the anova() function to compare lm.fit.1 to lm.fit.2.
anova(lm.fit.1, lm.fit.2)
Analysis of Variance Table
Model 1: salary ~ sex
Model 2: salary ~ sex + yrs.since.phd
Res.Df RSS Df Sum of Sq F Pr(>F)
1 395 3.5632e+11
2 394 2.9729e+11 1 5.9031e+10 78.234 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
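Because the two models differ by a single predictor, the F statistic in this comparison equals the square of the t value for yrs.since.phd in the larger model (8.845 squared is about 78.23). A sketch that checks this, refitting both models so it runs on its own; cmp, f_stat and t_val are illustrative names:

```r
library(car)  # attaches carData, which supplies Salaries

lm.fit.1 <- lm(salary ~ sex, data = Salaries)
lm.fit.2 <- lm(salary ~ sex + yrs.since.phd, data = Salaries)

# Nested-model F test: does adding yrs.since.phd reduce the residual sum of squares?
cmp <- anova(lm.fit.1, lm.fit.2)
f_stat <- cmp$F[2]

# t value of the added predictor in the larger model
t_val <- summary(lm.fit.2)$coefficients["yrs.since.phd", "t value"]

print(c(F = f_stat, t_squared = t_val^2))
```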
2.4 Interpretation
Provide your brief conclusions (in no more than 3 lines) about whether you think there is a gender pay gap based on this analysis (you will expand this analysis much further in HW2). First, which lm() model is better and why? Then, compare the best predictive model of the two against the descriptive analytics results you obtained in section 1 above. If the null hypothesis is that there is no gender pay gap, is this hypothesis supported? Why or why not?
Response: The ANOVA comparison of lm.fit.1 and lm.fit.2 shows that lm.fit.2 fits significantly better (F = 78.234, p < 2.2e-16), so adding yrs.since.phd improves on sex alone. The descriptive analysis suggested that males earn more than females, but in lm.fit.2 the sexMale coefficient falls to 7,923.6 and is significant only at the 10% level (p = 0.0915), which indicates that much of the raw gap reflects males being more senior than females. The null hypothesis of no gender pay gap is therefore rejected by the raw comparison, but not at the 5% level once seniority is controlled for.