1.) To begin, I have created a new project in R and have saved the data file from Piazza into my project folder.
I also loaded a few packages that I anticipated needing, including dplyr, magrittr and ggvis.
December 14, 2015
a.) Using ggvis, I constructed a boxplot of Academic Salary by Sex:
a.) I constructed a boxplot for Academic Salary by Highest Degree:
b.) I constructed a scatterplot with a smoothed line through the points, showing Academic Salary by Years Since Degree Was Earned:
b.) I constructed a scatterplot with a smoothed line through the points, showing Academic Salary by Number of Years in Current Rank:
c.) I constructed a scatterplot of points, with a linear model and 95% confidence interval, for salary by number of years since degree was earned:
d.) I constructed a scatterplot of points showing salary by number of years in current rank, and grouped by rank:
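The plots listed above could be produced along the following lines. This is only a sketch: the course data file is not reproduced in this report, so `salary_sim` is a made-up stand-in that reuses the column names seen in the regression output (sl, sx, yr, dg, yd) plus an assumed rank column `rk` coded 1 to 3. Colouring the points by rank is one possible reading of "grouped by rank". The ggvis calls are guarded because the package may not be installed.

```r
# Made-up stand-in data; NOT the course data set.
set.seed(1)
n <- 52
salary_sim <- data.frame(
  sx = rbinom(n, 1, 0.3),                 # sex, coded 0/1
  dg = rbinom(n, 1, 0.6),                 # highest degree, coded 0/1
  yr = sample(0:25, n, replace = TRUE),   # years in current rank
  rk = sample(1:3, n, replace = TRUE)     # rank, assumed coded 1 to 3
)
salary_sim$yd <- salary_sim$yr + sample(0:10, n, replace = TRUE)  # years since degree
salary_sim$sl <- 15000 + 400 * salary_sim$yr + 3000 * salary_sim$rk +
  rnorm(n, sd = 2500)                                             # salary

if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)  # also provides the %>% pipe

  # a.) Boxplots of salary by sex and by highest degree
  p_box_sex <- salary_sim %>% ggvis(~factor(sx), ~sl) %>% layer_boxplots()
  p_box_deg <- salary_sim %>% ggvis(~factor(dg), ~sl) %>% layer_boxplots()

  # b.) Scatterplots with a smoothed line
  p_smooth_yd <- salary_sim %>% ggvis(~yd, ~sl) %>%
    layer_points() %>% layer_smooths()
  p_smooth_yr <- salary_sim %>% ggvis(~yr, ~sl) %>%
    layer_points() %>% layer_smooths()

  # c.) Scatterplot with a linear fit and its confidence band
  p_lm_yd <- salary_sim %>% ggvis(~yd, ~sl) %>%
    layer_points() %>%
    layer_model_predictions(model = "lm", se = TRUE)

  # d.) Scatterplot with points coloured by rank
  p_rank <- salary_sim %>% ggvis(~yr, ~sl) %>%
    layer_points(fill = ~factor(rk))
}
```

Each plot is stored in a variable rather than printed; typing the variable name in an interactive session renders it.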
Using Salary as the dependent variable and Sex, Years in Rank, Highest Degree, Years since Degree was Awarded and Rank as independent variables, I tested null hypotheses regarding relationships between the variables:
a.) Null Hypothesis: There is no relationship between Salary and the set of independent variables (Sex, Years in Current Rank, Highest Degree, Years Since Degree Was Awarded and Current Rank).
Alternative Hypothesis: There is a relationship between Salary and at least one of these independent variables.
For this test, I set alpha to .05.
For this analysis, Current Rank has been recoded into only two categories, full professors and assistant + associate professors:
## rk2
##  1  2
## 32 20
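The recoding step might look like the sketch below. It assumes the original rank variable is called `rk` and is coded 1 = assistant, 2 = associate, 3 = full professor; neither the name nor the coding is shown in this report, and the values are a toy vector rather than the course data.

```r
# Toy rank vector; the name rk and the 1/2/3 coding are assumptions.
rk <- c(3, 1, 2, 3, 1, 1, 2, 3)

# Collapse to two categories: 1 = assistant or associate, 2 = full professor.
rk2 <- ifelse(rk == 3, 2, 1)

# Tabulate the recoded variable to check the group sizes.
table(rk2)
```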
Testing the hypothesis:
##
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rk2, data = new_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6066.3 -1719.5  -452.5   957.8  9826.7
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10905.38    1363.05   8.001 2.95e-10 ***
## sx           -547.47    1018.44  -0.538  0.59347
## yr            356.25     109.64   3.249  0.00216 **
## dg           -559.33    1204.37  -0.464  0.64454
## yd             77.37      76.84   1.007  0.31930
## rk2          6856.45    1186.70   5.778 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2880 on 46 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.763
## F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14
Calculating 95% confidence interval:
##                   2.5 %     97.5 %
## (Intercept)  8161.70176 13649.0490
## sx          -2597.47771  1502.5290
## yr            135.56889   576.9402
## dg          -2983.60125  1864.9356
## yd            -77.31372   232.0466
## rk2          4467.75405  9245.1439
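For reference, output of this shape is what `summary()` and `confint()` return for a fitted `lm` object. The sketch below reuses the model formula from the call printed above, but `new_data` here is simulated (the values are made up), so the numbers it produces will not match the report.

```r
# Simulated stand-in for new_data; the values are made up.
set.seed(42)
n <- 52
new_data <- data.frame(
  sx  = rbinom(n, 1, 0.3),                # sex, coded 0/1
  yr  = sample(0:25, n, replace = TRUE),  # years in current rank
  dg  = rbinom(n, 1, 0.6),                # highest degree, coded 0/1
  rk2 = sample(1:2, n, replace = TRUE)    # recoded rank, two categories
)
new_data$yd <- new_data$yr + sample(0:10, n, replace = TRUE)  # years since degree
new_data$sl <- 11000 + 350 * new_data$yr + 6800 * new_data$rk2 +
  rnorm(n, sd = 2900)                                         # salary

# Fit the multiple regression with salary as the dependent variable.
fit <- lm(sl ~ sx + yr + dg + yd + rk2, data = new_data)

summary(fit)                      # coefficient table, R-squared, overall F-test
ci <- confint(fit, level = 0.95)  # 95% interval for each coefficient
ci
```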
b.) Having run a multiple least-squares regression, I read from the findings that:
Multiple R-squared: 0.7863, Adjusted R-squared: 0.763
F-statistic: 33.84 on 5 and 46 DF, p-value: 2.461e-14
Having set alpha equal to .05, the p-value (2.461e-14) is less than alpha and I reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
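As a cross-check, the adjusted R-squared, the F-statistic and its p-value can be recomputed from the reported Multiple R-squared and degrees of freedom. The sketch below uses only figures copied from the summary output; small differences from the printed values are expected because the inputs are rounded.

```r
# Values copied from the summary output in this report.
r2 <- 0.7863   # Multiple R-squared
n  <- 52       # observations (46 residual df + 6 estimated coefficients)
k  <- 5        # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic and its upper-tail p-value on (k, n - k - 1) df.
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_val  <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(adj_r2 = adj_r2, f_stat = f_stat, p_val = p_val)
p_val < 0.05   # TRUE means the overall null hypothesis is rejected at alpha = .05
```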
Having rejected the overall null hypothesis, I examined the relationship between salary and sex, holding the other independent variables constant.
The p-value for the sx coefficient (0.59347) is greater than alpha, so I failed to reject the null hypothesis that the population coefficient for sex is zero.
Estimate of the population coefficient for sx (Estimate, Std. Error, t value, Pr(>|t|)):
sx -547.47 1018.44 -0.538 0.59347
c.) Interval Estimate:
2.5 % 97.5 %
sx -2597.47771 1502.5290
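The interval estimate for sx can be reproduced by hand from the reported estimate, standard error and residual degrees of freedom, which also shows why it agrees with the test: an interval that contains zero corresponds to a p-value above alpha. This sketch uses only numbers copied from the output in this report.

```r
# Values copied from the coefficient table in this report.
est <- -547.47   # Estimate for sx
se  <- 1018.44   # Std. Error for sx
df  <- 46        # residual degrees of freedom

t_val  <- est / se                  # t statistic for H0: coefficient = 0
p_val  <- 2 * pt(-abs(t_val), df)   # two-sided p-value
t_crit <- qt(0.975, df)             # critical value for a 95% interval
ci     <- est + c(-1, 1) * t_crit * se

c(t_val = t_val, p_val = p_val, lower = ci[1], upper = ci[2])
```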
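I computed a simple regression with salary and sex as the only two variables. Note that the call shown below places sx on the left-hand side (sx ~ sl) rather than sl; with a single predictor, the t-test of the slope gives the same p-value whichever variable is treated as dependent: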
##
## Call:
## lm(formula = new_data$sx ~ new_data$sl)
##
## Coefficients:
## (Intercept)  new_data$sl
##   7.246e-01   -1.913e-05

##
## Call:
## lm(formula = new_data$sx ~ new_data$sl)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.4166 -0.2806 -0.1986  0.5641  1.0034
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.246e-01  2.538e-01   2.855  0.00626 **
## new_data$sl -1.913e-05  1.036e-05  -1.847  0.07060 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4377 on 50 degrees of freedom
## Multiple R-squared:  0.0639, Adjusted R-squared:  0.04518
## F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706
Here is a t-test for comparison:
##
##  Two Sample t-test
##
## data:  sl by sx
## t = 1.8474, df = 50, p-value = 0.0706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -291.257 6970.550
## sample estimates:
## mean in group 0 mean in group 1
##        24696.79        21357.14
The p-value (.07) was the same for both the regression analysis and the t-test. Because the p-value was greater than alpha (.05), I failed to reject the null hypothesis in both instances.
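The agreement between the two p-values is not a coincidence: with a single two-level predictor, the slope test in a simple regression is the same test as the equal-variance two-sample t-test, and it does not matter which variable is placed on the left-hand side. The sketch below illustrates this on simulated data; the group sizes and values are made up, so only the equality of the three p-values carries over.

```r
# Simulated data; group sizes and salaries are made up for illustration.
set.seed(7)
d <- data.frame(sx = rep(c(0, 1), times = c(35, 17)))
d$sl <- 24000 - 3000 * d$sx + rnorm(nrow(d), sd = 5500)

# Pooled-variance two-sample t-test of salary by sex.
p_ttest <- t.test(sl ~ sx, data = d, var.equal = TRUE)$p.value

# Slope test with salary as the dependent variable.
p_sl_on_sx <- coef(summary(lm(sl ~ sx, data = d)))["sx", "Pr(>|t|)"]

# Slope test with sex as the dependent variable.
p_sx_on_sl <- coef(summary(lm(sx ~ sl, data = d)))["sl", "Pr(>|t|)"]

c(t_test = p_ttest, sl_on_sx = p_sl_on_sx, sx_on_sl = p_sx_on_sl)
```

Note that `var.equal = TRUE` matters here: the default Welch test in `t.test()` does not pool the variances and would generally give a slightly different p-value.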
Now please enjoy Task Two