Final Exam Task 1

December 14, 2015

An Analysis of Gender Discrimination in University Professors' Salaries

1.) To begin, I have created a new project in R and have saved the data file from Piazza into my project folder.

I also loaded a few packages that I anticipated needing, including dplyr, magrittr, and ggvis.

Visualization

a.) Using ggvis, I constructed a boxplot of Academic Salary by Sex:
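A minimal sketch of how such a boxplot could be built with ggvis. The data frame here is synthetic and only stands in for the course data; the column names sl (salary) and sx (sex) are taken from the regression call later in this report, and the 0/1 coding of sx from the t-test output. The group sizes and salary values are assumptions made up for illustration.

```r
# Synthetic stand-in for the salary data (all values are made up).
set.seed(1)
new_data <- data.frame(
  sx = rep(c(0, 1), times = c(30, 22)),          # hypothetical group sizes
  sl = round(rnorm(52, mean = 23000, sd = 5000)) # hypothetical salaries
)

# Build the plot only if ggvis is installed; printing `p` renders it.
if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  p <- new_data %>%
    ggvis(~factor(sx), ~sl) %>%  # factor() makes sx a categorical axis
    layer_boxplots()
}
```

Replacing sx with dg in the same call would give the boxplot by highest degree.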

Visualization

a.) I constructed a boxplot for Academic Salary by Highest Degree:

Visualization

b.) I constructed a scatterplot with a smoothed line through the points, showing Academic Salary by Years Since Degree Was Earned:

Visualization

b.) I constructed a scatterplot with a smoothed line through the points, showing Academic Salary by Number of Years in Current Rank:

Visualization

c.) I constructed a scatterplot of points, with a linear model and 95% confidence interval, for salary by number of years since degree was earned:
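A hedged sketch of a scatterplot with a fitted line and confidence band in ggvis. The data are synthetic; yd (years since degree) and sl (salary) follow the column names used in the regression call later in this report. As far as I know, the band drawn by se = TRUE defaults to the 95% level.

```r
# Synthetic stand-in data (values are made up for illustration).
set.seed(2)
new_data <- data.frame(yd = sample(1:35, 52, replace = TRUE))
new_data$sl <- 16000 + 400 * new_data$yd + rnorm(52, sd = 3000)

# Build the plot only if ggvis is installed; printing `p` renders it.
if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  p <- new_data %>%
    ggvis(~yd, ~sl) %>%
    layer_points() %>%
    layer_model_predictions(model = "lm", se = TRUE)  # linear fit plus band
}
```

Swapping layer_model_predictions(...) for layer_smooths() would give the smoothed curve used in part b.).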

Visualization

d.) I constructed a scatterplot of points showing salary by number of years in current rank, and grouped by rank:
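One way the grouping by rank could be expressed in ggvis is to map rank to the fill colour of the points. This is a sketch on synthetic data; the rank codes 1 to 3 are an assumption about how rk is stored.

```r
# Synthetic stand-in data (values and rank codes are made up).
set.seed(3)
new_data <- data.frame(
  yr = sample(0:25, 52, replace = TRUE),  # years in current rank
  rk = sample(1:3, 52, replace = TRUE)    # hypothetical rank codes
)
new_data$sl <- 14000 + 350 * new_data$yr + 3500 * new_data$rk +
  rnorm(52, sd = 2000)

# Build the plot only if ggvis is installed; printing `p` renders it.
if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  p <- new_data %>%
    ggvis(~yr, ~sl, fill = ~factor(rk)) %>%  # one colour per rank
    layer_points()
}
```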

Linear Regression

Using Salary as the dependent variable and Sex, Years in Rank, Highest Degree, Years since Degree was Awarded and Rank as independent variables, I tested null hypotheses regarding relationships between the variables:

a.) Null Hypothesis: There is no relationship between Salary and any of Sex, Years in Current Rank, Highest Degree, Years since Degree was Awarded, and Current Rank.

Alternative Hypothesis: There is a relationship between Salary and at least one of Sex, Years in Current Rank, Highest Degree, Years since Degree was Awarded, and Current Rank.

For this analysis, Current Rank has been recoded into only two categories, full professors and assistant + associate professors. The counts in each category appear in the table below.

For this test, I set alpha to .05.

## rk2
##  1  2 
## 32 20
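The recoding step could look something like the sketch below. It assumes the original rank variable rk is coded 1 = assistant, 2 = associate, 3 = full professor, which this report does not state; the short vector is hypothetical and only demonstrates the mechanics.

```r
# Hypothetical rank codes: 1 = assistant, 2 = associate, 3 = full (assumed).
rk <- c(1, 2, 3, 3, 1, 2, 3)

# Collapse to two categories: 2 = full professor, 1 = assistant or associate.
rk2 <- ifelse(rk == 3, 2, 1)

# Tabulate the recoded variable, as in the output above.
table(rk2)
```

Under this assumed coding, category 2 would be the full professors, which is consistent with the positive rk2 coefficient reported in the regression.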

Testing the Null Hypothesis

Testing the hypothesis:

## 
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rk2, data = new_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6066.3 -1719.5  -452.5   957.8  9826.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10905.38    1363.05   8.001 2.95e-10 ***
## sx           -547.47    1018.44  -0.538  0.59347    
## yr            356.25     109.64   3.249  0.00216 ** 
## dg           -559.33    1204.37  -0.464  0.64454    
## yd             77.37      76.84   1.007  0.31930    
## rk2          6856.45    1186.70   5.778 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2880 on 46 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.763 
## F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14
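For reference, a sketch of how a summary like the one above is produced. The formula and the name new_data are taken from the Call line; the data frame itself is synthetic here, so the numbers it yields will not match the reported output.

```r
# Synthetic stand-in with the same column names as the Call line above.
set.seed(4)
n <- 52
new_data <- data.frame(
  sx  = rbinom(n, 1, 0.3),
  yr  = sample(0:25, n, replace = TRUE),
  dg  = rbinom(n, 1, 0.6),
  yd  = sample(1:35, n, replace = TRUE),
  rk2 = sample(1:2, n, replace = TRUE)
)
# Made-up salary model, used only to generate plausible values.
new_data$sl <- 11000 + 350 * new_data$yr + 6800 * new_data$rk2 +
  rnorm(n, sd = 2900)

# Fit the multiple regression and print the coefficient table.
fit <- lm(sl ~ sx + yr + dg + yd + rk2, data = new_data)
summary(fit)
```

With 52 observations and six estimated coefficients, the residual degrees of freedom are 46, matching the output above.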

Calculating Confidence Interval

Calculating 95% confidence interval:

##                   2.5 %     97.5 %
## (Intercept)  8161.70176 13649.0490
## sx          -2597.47771  1502.5290
## yr            135.56889   576.9402
## dg          -2983.60125  1864.9356
## yd            -77.31372   232.0466
## rk2          4467.75405  9245.1439
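These intervals can be checked by hand as estimate ± t × standard error, with t taken from the t distribution on 46 residual degrees of freedom. The sketch below uses the rounded estimates and standard errors from the coefficient table, so it should agree with the intervals above only up to rounding.

```r
# Estimates and standard errors copied from the regression summary.
est <- c(`(Intercept)` = 10905.38, sx = -547.47, yr = 356.25,
         dg = -559.33, yd = 77.37, rk2 = 6856.45)
se  <- c(1363.05, 1018.44, 109.64, 1204.37, 76.84, 1186.70)

# Critical value for a two-sided 95% interval on 46 degrees of freedom.
t_crit <- qt(0.975, df = 46)

# Lower and upper limits for each coefficient.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 2)
```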

Reporting Results

b.) Having run a multiple least-squares regression, I read from the findings that:

Multiple R-squared: 0.7863, Adjusted R-squared: 0.763

F-statistic: 33.84 on 5 and 46 DF, p-value: 2.461e-14

Having set alpha equal to .05, the p-value (2.461e-14) is less than alpha and I reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
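As a consistency check, the overall F-statistic can be recovered from the reported R-squared, and its p-value from the F distribution. This is a sketch using the rounded values printed above, so small rounding differences are expected.

```r
r2       <- 0.7863  # Multiple R-squared from the summary
k        <- 5       # number of predictors
df_resid <- 46      # residual degrees of freedom

# F = (R^2 / k) / ((1 - R^2) / residual df)
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the overall p-value.
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

f_stat
p_value
p_value < 0.05  # decision at alpha = .05
```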

Reporting Results

Having rejected the null hypothesis that there is no relationship between the dependent and the entire set of independent variables, I examined the relationship between salary and sex.

The p-value for the sx coefficient (0.59347) is greater than alpha, so I fail to reject the null hypothesis that, holding the other variables constant, there is no relationship between salary and sex.

Population Coefficient:

     Estimate Std. Error t value Pr(>|t|)
sx    -547.47    1018.44  -0.538  0.59347

c.) Interval Estimate:

          2.5 %    97.5 %
sx  -2597.47771 1502.5290

Regression Equation

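I computed a simple regression relating salary and sex. As the call shows, sx is on the left-hand side, so the model as fitted regresses sex on salary rather than salary on sex. With a single predictor, the slope's t-statistic and p-value are the same in either direction: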

## 
## Call:
## lm(formula = new_data$sx ~ new_data$sl)
## 
## Coefficients:
## (Intercept)  new_data$sl  
##   7.246e-01   -1.913e-05

## 
## Call:
## lm(formula = new_data$sx ~ new_data$sl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4166 -0.2806 -0.1986  0.5641  1.0034 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  7.246e-01  2.538e-01   2.855  0.00626 **
## new_data$sl -1.913e-05  1.036e-05  -1.847  0.07060 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4377 on 50 degrees of freedom
## Multiple R-squared:  0.0639, Adjusted R-squared:  0.04518 
## F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

T-test

Here is a t-test for comparison:

## 
##  Two Sample t-test
## 
## data:  sl by sx
## t = 1.8474, df = 50, p-value = 0.0706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -291.257 6970.550
## sample estimates:
## mean in group 0 mean in group 1 
##        24696.79        21357.14

Reporting Results

The p-value (.07) was the same for both the regression analysis and the t-test. Because the p-value was greater than alpha (.05), I failed to reject the null hypothesis of no difference in mean salary between the sexes in both instances.
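The agreement between the two p-values is not a coincidence: a simple regression on a two-level predictor is equivalent to the equal-variance two-sample t-test. The sketch below illustrates this on synthetic data (group sizes and salaries are made up), and also shows that regressing sx on sl gives the same p-value as regressing sl on sx.

```r
# Synthetic data: a 0/1 group indicator and a numeric outcome.
set.seed(5)
d <- data.frame(sx = rep(c(0, 1), times = c(30, 22)))  # hypothetical sizes
d$sl <- 24000 - 3000 * d$sx + rnorm(52, sd = 6000)

# Slope p-value with salary as the response.
p_lm_sl <- summary(lm(sl ~ sx, data = d))$coefficients["sx", "Pr(>|t|)"]

# Slope p-value with the roles reversed, as in the call shown earlier.
p_lm_sx <- summary(lm(sx ~ sl, data = d))$coefficients["sl", "Pr(>|t|)"]

# Equal-variance two-sample t-test (not the Welch default).
p_t <- t.test(sl ~ sx, data = d, var.equal = TRUE)$p.value

all.equal(p_lm_sl, p_t)
all.equal(p_lm_sx, p_t)
```

Note that var.equal = TRUE is needed for the match; R's default Welch test would generally give a slightly different p-value.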

Thank you

Now please enjoy Task Two