December 3, 2015

Task 1ai: Boxplot for sl by sx

I note the significant outlier for females at around $38,000.

This will be important for later analysis.

Task 1aii: Boxplot for sl by dg

Task 1bi: Scatterplot for sl by yd

Task 1bii: Scatterplot for sl by yr

Task 1c: Scatterplot for sl by yd

This plot also includes a linear model with a 95% CI.

Task 1d: Scatterplot for sl by yr

This plot is grouped by rk, the current academic rank of faculty.

However, by also including linear models for each group, the plot becomes easier to read and more useful…

Task 1d cont.: Scatterplot for sl by yr

This plot includes a linear model for each group.

Task 2a: Null Hypothesis Test

The Null Hypothesis would state that there is no relationship between the variable sl (Salary) and the entire set of variables sx, yr, dg, yd, and rk (recoded).

In other words, in a multiple linear regression of sl on this set of variables, every regression coefficient (apart from the intercept) would be equal to 0.

I have set my acceptable level of type 1 error to 0.05. (\(\alpha\) = 0.05)

My alternative hypothesis would state that there is some relationship between the variable sl (Salary) and the entire set of variables sx, yr, dg, yd, and rk (recoded).
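Written out for the five predictors listed above, the two hypotheses can be stated as follows. The \(\beta\) symbols are notation introduced here for illustration, with \(\beta_j\) the regression coefficient of the j-th predictor:

\[
H_0:\ \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0
\qquad\text{versus}\qquad
H_a:\ \beta_j \neq 0 \ \text{for at least one } j
\]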

Task 2a cont.: Results of test

I read from my calculations that:

Multiple R-squared: 0.7863, Adjusted R-squared: 0.763

F-statistic: 33.84 on 5 and 46 DF, p-value: 2.461e-14

The F-statistic (33.84) tests whether all of the regression coefficients are equal to 0. With \(\alpha\) = 0.05, the p-value, 2.461e-14, is less than \(\alpha\). Therefore, I reject the null hypothesis that there is no relationship between sl and the entire set of independent variables; at least one coefficient is not equal to 0.
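As a quick consistency check, the F-statistic and the adjusted R-squared can both be recomputed from the reported R-squared and the degrees of freedom. This is only a sketch using the rounded figures above, so small rounding differences are expected; the variable names are illustrative.

```python
# Values copied from the reported model summary (rounded as printed).
r_squared = 0.7863
df_model = 5    # numerator degrees of freedom
df_resid = 46   # denominator (residual) degrees of freedom

# Overall F-statistic: explained variation per model df
# divided by unexplained variation per residual df.
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Number of observations implied by the degrees of freedom.
n_obs = df_model + df_resid + 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"F = {f_stat:.2f}, adjusted R-squared = {adj_r_squared:.3f}, n = {n_obs}")
```

The recomputed values should agree with the reported 33.84 and 0.763 up to rounding of the inputs.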

This also allows me to test each independent variable as it relates to sl.

Task 2b: Independence of variable sx in relation to sl

The regression coefficient for the independent variable sx is:

Estimate: -547.47, Std. Error: 1018.44, t-value: -0.538, Pr(>|t|): 0.59347

Again, with \(\alpha\) = 0.05, our p-value of 0.59347 is greater than \(\alpha\). In this instance I fail to reject the null hypothesis that there is no relationship between salary (sl) and sex (sx) once the other variables are in the model.

Task 2c: 95% CI for sx regression coefficient

The reported regression coefficient for sex is -547.47. The 95% confidence interval for this coefficient leads to the same conclusion as the p-value, a failure to reject the null hypothesis for this variable. It shows

2.5% CI = -2597.48 (rounded)

97.5% CI = 1502.53 (rounded)

and since this interval contains 0, I fail to reject the null hypothesis using the current data.
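The t-value and the interval bounds can be reproduced by hand from the reported estimate and standard error. The sketch below assumes a critical value of about 2.0129 (the 97.5th percentile of a t distribution with 46 degrees of freedom, taken from a t-table rather than from the original output).

```python
# Reported coefficient and standard error for sx.
estimate = -547.47
std_error = 1018.44

# Assumption: t critical value for a 95% interval with 46 residual df,
# looked up from a t-table; it does not appear in the original output.
t_crit = 2.0129

# The t-value is the estimate measured in standard errors.
t_value = estimate / std_error

# 95% CI: estimate plus or minus the critical value times the standard error.
margin = t_crit * std_error
ci_lower = estimate - margin
ci_upper = estimate + margin

print(f"t = {t_value:.3f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
print("interval contains 0:", ci_lower < 0 < ci_upper)
```

Because the inputs are rounded, the bounds may differ from the reported ones in the second decimal place.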

Task 3: Further analysis

A regression with sx as the sole independent variable and sl as the dependent variable shows the following:

Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518

F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706

Again, with \(\alpha\) = 0.05, our p-value of 0.0706 is still greater than \(\alpha\). In this instance I again fail to reject the null hypothesis that there is no relationship between salary (sl) and sex (sx).

Task 3 cont.: t-test of difference in mean sl by sx

The results of this test are similar to those of the regression analysis. The 95% confidence interval runs from -291.257 to 6970.550 (the regression analysis showed the same bounds with the signs reversed).

It also showed the same p-value found in our regression analysis, 0.0706, which is greater than \(\alpha\). Therefore, the t-test also fails to reject the null hypothesis that there is no difference in the mean salary for men and women.
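The matching p-values are expected: a two-sample t-test with pooled (equal) variance and a regression on a single 0/1 indicator give the same t statistic on the same degrees of freedom. The sketch below illustrates this with made-up salaries, not the assignment data; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean

# Hypothetical salaries for illustration only (NOT the assignment data).
men = [24000.0, 27000.0, 21000.0, 30000.0, 25000.0]
women = [20000.0, 23000.0, 19000.0, 22000.0]

def pooled_t(a, b):
    """Pooled-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    ss_a = sum((v - mean(a)) ** 2 for v in a)
    ss_b = sum((v - mean(b)) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))

def slope_t(x, y):
    """t statistic of the slope in a one-predictor least-squares fit."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope / sqrt(sse / (n - 2) / sxx)

# Code sex as an indicator: 0 = male, 1 = female.
x = [0] * len(men) + [1] * len(women)
y = men + women

t_test = pooled_t(men, women)
t_reg = slope_t(x, y)
print(t_test, t_reg)  # the two statistics agree up to floating-point error
```

A Welch (unequal-variance) t-test would generally not match the regression exactly, so the agreement reported above suggests the pooled form was used.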

However, I also wondered if the outlier I noted previously in the assignment was affecting the analysis.

Task 3 cont.: Outlier effect

I created a histogram of the relevant data for the entire population.

This showed the outlier, but it didn't seem significant for the entire set of data.

Task 3 cont.: Significant for women

However, a similar histogram for just women did show that it was significant.

Task 3 cont.: No significance for men

A histogram did not show the same issue for men's salary.

Task 3 cont.: Cleaned data

Removing this outlier from the data set gave me a more representative sample for my analysis.
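I identified the outlier visually from the plots. A common rule-based alternative, swapped in here purely for illustration, is the 1.5 x IQR rule that boxplots typically use to flag points. The sketch below applies it to made-up salaries, not the assignment data.

```python
from statistics import quantiles

# Hypothetical salaries for one group (NOT the assignment data).
salaries = [16000, 17000, 18000, 19000, 20000,
            21000, 22000, 23000, 24000, 38000]

# Quartiles; "inclusive" treats the data as the full set of values of interest.
q1, _median, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in salaries if s < lower_fence or s > upper_fence]
cleaned = [s for s in salaries if lower_fence <= s <= upper_fence]

print("flagged:", outliers)
print("kept:", len(cleaned), "of", len(salaries))
```

Whichever rule is used, it is worth reporting results both with and without the removed point, as done here.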

Task 3 cont.: Rerun t-test

The cleaned data did allow me to reject my null hypothesis that there is no difference in the mean salary for men and women. The mean salaries for men and women, and the p-value of the test, are:

Men = $24,696.79

Women = $20,073.46

p-value = 0.009018

And since our p-value is less than \(\alpha\), I reject the null hypothesis and accept my alternative hypothesis that sex is related to a difference in average salary.

Task 3 cont.: Regression analysis

The results of a regression analysis with this data also show that we can reject the original null hypothesis that there is no relationship between salary and sex.

Multiple R-squared: 0.1311, Adjusted R-squared: 0.1134

F-statistic: 7.396 on 1 and 49 DF, p-value: 0.009018

This shows that my regression coefficient is not equal to 0. With \(\alpha\) = 0.05, the p-value, 0.009018, is less than \(\alpha\). Therefore, I reject the null hypothesis that there is no relationship between sl and sx.

Task 3 cont.: Interpretation

Further, the regression analysis shows that the estimate of the coefficient is -4,623.30. This means that there is a negative relationship between sl and sx: women earn less than men. Another way of saying this is that, on average, women in this sample earn $4,623.30 less per year than men.
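With a single two-level predictor, the regression coefficient is simply the difference between the two group means, so it can be checked against the means reported for the rerun t-test. A small sketch using the rounded figures from this analysis:

```python
# Group means reported by the rerun t-test on the cleaned data.
mean_men = 24696.79
mean_women = 20073.46

# Coefficient for sx reported by the regression on the cleaned data.
reported_coefficient = -4623.30

# The slope on a two-level indicator equals the difference in group means.
difference = mean_women - mean_men

print(f"difference in means = {difference:.2f}")
print(f"gap vs. reported coefficient = {abs(difference - reported_coefficient):.2f}")
```

The few cents of disagreement come from rounding in the reported figures.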

The 95% CI shows:

2.5% CI = -8039.66 (rounded)

97.5% CI = -1206.995 (rounded)

Our best estimate is that during a year, women earn $4,623.30 less than men. However, we are 95% confident that this difference is between $1,206.99 and $8,039.66.