ITEC 621 Exercise 2 - Foundations

Descriptive and Predictive Analytics

Author

Your Name

General Instructions

In this exercise you will do quick descriptive and predictive analytics to evaluate if the Salaries data set (with professor salaries) supports the gender pay gap hypothesis.

Download the R Quarto template for this exercise Ex2_Foundations_YourLastName.Qmd and save it to your class project folder. Rename the file to replace YourLastName with your own last name.
Open the file in RStudio and complete the coding exercises and answer the interpretation questions. Run the code to ensure everything is working fine.
I will always display the exercise code output in the instructions on Canvas, so that you can compare your results against the solution. Technical Note: Your quantitative and visual outputs should look identical to the outputs shown in the homework. Now is a good time to compare your results and ask me questions if they differ.
When done, knit your R Quarto file into a Word document. On Canvas, please submit both the knitted Word document AND the .qmd file. If you have trouble knitting a Word file, you can knit to a PDF file, or to an HTML file and then save it as a PDF.

Knitting

Preparing analytic reports using the {knitr} package is an important learning objective of this course. Your R Quarto file MUST have the attribute echo=T in the {r global options} so that we can see and grade your R code. The knitted document should adhere to proper business-like formatting and appearance. This means all interpretations should be in markdown sections, NOT code comments.

As such, inadequate or no knitting will carry point deductions of up to 30 points.

1. Descriptive Analytics

1.1 Examine the data

Is there a gender pay gap? Let’s take a look.

Load the library {car}, which contains the Salaries data set. Then, list the first few records with head(Salaries). Then display the summary() for this dataset, which shows frequencies for the categorical variables and quartiles for the numeric ones.

Then, load the library {psych} which contains the describe() function and use this function to list the descriptive statistics for the dataset.

Then display the mean salary grouped by gender using the aggregate() function (feed it the grouping formula, the dataset and the aggregate function, i.e., salary ~ sex, Salaries, mean).

library(car)    # provides the Salaries data set
head(Salaries)  # first six records

       rank discipline yrs.since.phd yrs.service  sex salary
1      Prof          B            19          18 Male 139750
2      Prof          B            20          16 Male 173200
3  AsstProf          B             4           3 Male  79750
4      Prof          B            45          39 Male 115000
5      Prof          B            40          41 Male 141500
6 AssocProf          B             6           6 Male  97000

summary(Salaries)

        rank     discipline yrs.since.phd    yrs.service        sex     
 AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
 AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
 Prof     :266              Median :21.00   Median :16.00               
                            Mean   :22.31   Mean   :17.61               
                            3rd Qu.:32.00   3rd Qu.:27.00               
                            Max.   :56.00   Max.   :60.00               
     salary      
 Min.   : 57800  
 1st Qu.: 91000  
 Median :107300  
 Mean   :113706  
 3rd Qu.:134185  
 Max.   :231545

library(psych)      # provides describe()
describe(Salaries)  # factor columns are marked with * and summarized by their integer codes

              vars   n      mean       sd median   trimmed      mad   min
rank*            1 397      2.50     0.77      3      2.62     0.00     1
discipline*      2 397      1.54     0.50      2      1.55     0.00     1
yrs.since.phd    3 397     22.31    12.89     21     21.83    14.83     1
yrs.service      4 397     17.61    13.01     16     16.51    14.83     0
sex*             5 397      1.90     0.30      2      2.00     0.00     1
salary           6 397 113706.46 30289.04 107300 111401.61 29355.48 57800
                 max  range  skew kurtosis      se
rank*              3      2 -1.12    -0.38    0.04
discipline*        2      1 -0.18    -1.97    0.03
yrs.since.phd     56     55  0.30    -0.81    0.65
yrs.service       60     60  0.65    -0.34    0.65
sex*               2      1 -2.69     5.25    0.01
salary        231545 173745  0.71     0.18 1520.16

aggregate(salary~sex, Salaries, mean)

     sex   salary
1 Female 101002.4
2   Male 115090.4
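The formula interface of aggregate() used above takes the outcome on the left of ~ and the grouping variable on the right, followed by the data set and the summary function. The following is a minimal sketch on a small made-up data frame (the object toy and its values are hypothetical, not taken from Salaries) showing that swapping mean for median is the only change needed to get a different group summary.

```r
# Hypothetical toy data that reuses the Salaries column names
toy <- data.frame(
  sex    = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  salary = c(80000, 90000, 130000, 85000, 100000, 145000)
)

# Formula interface: outcome ~ grouping variable, data set, summary function
mean_by_sex   <- aggregate(salary ~ sex, data = toy, FUN = mean)
median_by_sex <- aggregate(salary ~ sex, data = toy, FUN = median)

print(mean_by_sex)    # one row per level of sex, with the group mean
print(median_by_sex)  # same layout, with the group median
```

With a right-skewed variable such as salary, the group means and medians can differ noticeably, so it is worth being explicit about which one is reported.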

1.2 Correlation, Boxplots and ANOVA

Load the library GGally and run the ggpairs() function on the salary, sex and yrs.since.phd variables (only) in the Salaries data set to display some basic descriptive statistics and correlations visually (notice that the dataset name Salaries is capitalized, whereas the variable salary is not). Please label your variables appropriately (see graph below).

Tips: ggpairs() requires a data frame. So you need to use the data.frame() function to bind the necessary column vectors into a data frame (e.g., ggpairs(data.frame("Salary"=Salaries$salary, etc.). Notice the difference in the quality of the graphics and how categorical variables are labeled. Also, add the attribute upper=list(combo='box') at the end to get labels for the boxplot.
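One detail about the labels in the tip above: by default, data.frame() converts column names into syntactically valid R names, so a label containing spaces is altered. The sketch below uses short made-up vectors (salary and years are hypothetical stand-ins for the Salaries columns) to show the default behavior and the check.names argument of data.frame().

```r
# Hypothetical toy vectors standing in for Salaries$salary and Salaries$yrs.since.phd
salary <- c(80000, 95000, 120000)
years  <- c(3, 10, 25)

# Default: names are passed through make.names(), so spaces become dots
d_default <- data.frame("Salary" = salary, "Yrs Since PhD" = years)
print(names(d_default))   # "Salary" "Yrs.Since.PhD"

# check.names = FALSE keeps the labels exactly as typed
d_spaces <- data.frame("Salary" = salary, "Yrs Since PhD" = years,
                       check.names = FALSE)
print(names(d_spaces))    # "Salary" "Yrs Since PhD"
```

Either form can be passed to a plotting function; the only difference is how the label is spelled.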

Finally, conduct an ANOVA test to evaluate if there is a significant difference between mean salaries for male and female faculty. Feed Salaries$salary ~ Salaries$sex into the aov() function. Embed the aov() function inside the summary() function to see the statistical test results.

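library(GGally)
attach(Salaries)  # makes the columns (salary, sex, ...) available by name in later calls
# Bind the three columns into a data frame with readable labels
D_sal = data.frame("Salary"=Salaries$salary, "Gender"=Salaries$sex, "Yrs Since PhD"=Salaries$yrs.since.phd)
ggpairs(D_sal, upper=list(combo='box'))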

summary(aov(Salaries$salary~Salaries$sex))

              Df    Sum Sq   Mean Sq F value  Pr(>F)   
Salaries$sex   1 6.980e+09 6.980e+09   7.738 0.00567 **
Residuals    395 3.563e+11 9.021e+08                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
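With only two groups, a one-way ANOVA is equivalent to testing the slope of a regression on a two-level factor: the ANOVA F value equals the square of the regression t value, and the p-values coincide. The sketch below checks this on a small made-up data frame (toy and its values are hypothetical).

```r
# Hypothetical toy data that reuses the Salaries column names
toy <- data.frame(
  sex    = factor(c("Female", "Female", "Female", "Female",
                    "Male", "Male", "Male", "Male")),
  salary = c(80000, 90000, 95000, 130000, 85000, 100000, 120000, 145000)
)

# One-way ANOVA: extract the F value and p-value for the sex term
aov_tab <- summary(aov(salary ~ sex, data = toy))[[1]]
f_value <- aov_tab[["F value"]][1]
p_aov   <- aov_tab[["Pr(>F)"]][1]

# Simple regression on the same factor: extract the t value and p-value
lm_coef <- summary(lm(salary ~ sex, data = toy))$coefficients
t_value <- lm_coef["sexMale", "t value"]
p_lm    <- lm_coef["sexMale", "Pr(>|t|)"]

print(all.equal(f_value, t_value^2))  # TRUE: F is the squared t statistic
print(all.equal(p_aov, p_lm))         # TRUE: both tests give the same p-value
```

The same relationship appears in the outputs of this exercise: the ANOVA F value of 7.738 matches the squared t value of sexMale in the simple regression of section 2.1 (2.782 squared, up to rounding), and both report p = 0.00567.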

1.3 Preliminary Interpretation

Based on the output above, does there appear to be a gender pay gap? Why or why not? In your answer, please refer to as much of the data above as possible to support your answer.

Response: The data above does indicate a pay gap, with higher salaries for males than for females. The aggregate() output shows a mean salary of about USD 115,090 for males versus USD 101,002 for females, and the ANOVA indicates that this difference in means is statistically significant (F = 7.738, p = 0.00567). In the upper triangle of the ggpairs() grid, the boxplot of salary by gender shows a higher interquartile range for males than for females, which is consistent with a gender pay gap. The scatter plot in the lower triangle indicates that salaries increase with years since PhD, and the boxplot of years since PhD by gender shows that males have more years since completing their PhD. This seniority difference helps explain why males have higher salaries, and thus part of the observed gender pay gap.

2. Basic Predictive Modeling

2.1 Salary Gender Gap: Simple OLS Regression

Suppose that you hypothesize that there is a gender pay gap. Fit a linear model with the lm() function to test this hypothesis by predicting salary using only sex as a predictor. Store the results in an object called lm.fit.1, then inspect the results using the summary() function. Do these results support the gender pay gap hypothesis? Briefly explain why.

lm.fit.1 = lm(salary~sex)
lm.fit.1


Call:
lm(formula = salary ~ sex)

Coefficients:
(Intercept)      sexMale  
     101002        14088

summary(lm.fit.1)


Call:
lm(formula = salary ~ sex)

Residuals:
   Min     1Q Median     3Q    Max 
-57290 -23502  -6828  19710 116455 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   101002       4809  21.001  < 2e-16 ***
sexMale        14088       5065   2.782  0.00567 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30030 on 395 degrees of freedom
Multiple R-squared:  0.01921,   Adjusted R-squared:  0.01673 
F-statistic: 7.738 on 1 and 395 DF,  p-value: 0.005667

Response: This model supports the gender pay gap hypothesis. Because sex is a factor, lm() codes it as a dummy variable with 0 for female (the reference level) and 1 for male. The intercept (101,002) is therefore the mean salary for females, and the sexMale coefficient is the difference between the male and female means. This coefficient is positive and statistically significant (p = 0.00567), showing that males earn USD 14,088 more than females on average. Note, however, that sex alone explains only about 1.9% of the variance in salary (R-squared = 0.01921).
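The dummy-coding interpretation can be checked directly. The sketch below uses a small made-up data frame (toy and its values are hypothetical) to show that, for a regression on a single two-level factor, the intercept equals the mean of the reference level and the sexMale coefficient equals the difference in group means.

```r
# Hypothetical toy data that reuses the Salaries column names
toy <- data.frame(
  sex    = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  salary = c(80000, 90000, 130000, 85000, 100000, 145000)
)

# The first factor level is the reference level (coded 0)
print(levels(toy$sex))

# The design matrix shows the 0/1 dummy column that lm() builds for sex
print(model.matrix(~ sex, data = toy))

fit <- lm(salary ~ sex, data = toy)

# Group means computed directly from the data
female_mean <- mean(toy$salary[toy$sex == "Female"])
male_mean   <- mean(toy$salary[toy$sex == "Male"])

# Intercept = reference-group mean; slope = difference between group means
print(all.equal(coef(fit)[["(Intercept)"]], female_mean))
print(all.equal(coef(fit)[["sexMale"]], male_mean - female_mean))
```

In the exercise output the same pattern holds: the intercept of 101,002 matches the female mean from aggregate(), and adding the sexMale coefficient of 14,088 gives the male mean of about 115,090.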

2.2 Multivariate OLS Regression

Now fit a linear model with sex and yrs.since.phd as predictors and save it in an object named lm.fit.2. Then inspect the results using the summary() function. Do these results support the salary gender gap hypothesis? Briefly explain why.

lm.fit.2 = lm(salary~sex+yrs.since.phd)
summary(lm.fit.2)


Call:
lm(formula = salary ~ sex + yrs.since.phd)

Residuals:
   Min     1Q Median     3Q    Max 
-84167 -19735  -2551  15427 102033 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    85181.8     4748.3  17.939   <2e-16 ***
sexMale         7923.6     4684.1   1.692   0.0915 .  
yrs.since.phd    958.1      108.3   8.845   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27470 on 394 degrees of freedom
Multiple R-squared:  0.1817,    Adjusted R-squared:  0.1775 
F-statistic: 43.74 on 2 and 394 DF,  p-value: < 2.2e-16

2.3 Comparing Models with ANOVA F-Test

Run an ANOVA test using the anova() function to compare lm.fit.1 to lm.fit.2.

anova(lm.fit.1, lm.fit.2)

Analysis of Variance Table

Model 1: salary ~ sex
Model 2: salary ~ sex + yrs.since.phd
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1    395 3.5632e+11                                   
2    394 2.9729e+11  1 5.9031e+10 78.234 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
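The F statistic in this table can be reproduced by hand from the residual sums of squares (RSS) and residual degrees of freedom of the two nested models. The sketch below plugs in the rounded values printed in the table above; the variable names are illustrative only.

```r
# Values copied from the ANOVA table above (rounded as printed)
rss_1 <- 3.5632e11   # Model 1: salary ~ sex
rss_2 <- 2.9729e11   # Model 2: salary ~ sex + yrs.since.phd
df_1  <- 395         # residual degrees of freedom, Model 1
df_2  <- 394         # residual degrees of freedom, Model 2

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_1 - rss_2) / (df_1 - df_2)) / (rss_2 / df_2)

# Upper-tail probability of the F distribution with (1, 394) degrees of freedom
p_val <- pf(f_stat, df_1 - df_2, df_2, lower.tail = FALSE)

print(round(f_stat, 1))  # close to the 78.234 reported by anova()
print(p_val)             # far below 0.05
```

Because the inputs are rounded, the hand-computed F agrees with the reported 78.234 only to about two decimal places. A large F means that adding yrs.since.phd reduces the unexplained variation by much more than would be expected by chance.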

2.4 Interpretation

Provide your brief conclusions (in no more than 3 lines) about whether you think there is a gender pay gap based on this analysis (you will expand this analysis much further in HW2). First, which lm() model is better and why? Then, compare the best predictive model of the two against the descriptive analytics results you obtained in section 1 above. If the null hypothesis is that there is no gender pay gap, is this hypothesis supported? Why or why not?

Response: The ANOVA comparing lm.fit.1 with the second model shows that adding yrs.since.phd significantly improves the fit (F = 78.234, p < 2.2e-16), so the model with sex and yrs.since.phd is the better one; its adjusted R-squared is 0.1775 versus 0.01673 for gender alone. The descriptive analysis suggested that males earn more than females, but the better model indicates that much of this gap reflects seniority, since males have more years since their PhD. After controlling for yrs.since.phd, the sexMale coefficient falls from 14,088 to about 7,924 and is significant only at the 10% level (p = 0.0915), so the null hypothesis of no gender pay gap is rejected by the descriptive results but cannot be rejected at the 5% level in the better model.