1.) Using ggvis, construct a.) boxplots, i.) for sl by sx
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character
December 9, 2015
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
##
## Loading required package: ggvis
## Loading required package: magrittr
## Loading required package: knitr
## Loading required package: rmarkdown
##    sx rk yr dg yd    sl
## 1   0  3 25  1 35 36350
## 2   0  3 13  1 22 35350
## 3   0  3 10  1 23 28200
## 4   1  3  7  1 27 26775
## 5   0  3 19  0 30 33696
## 6   0  3 16  1 21 28516
## 7   1  3  0  0 32 24900
## 8   0  3 16  1 18 31909
## 9   0  3 13  0 30 31850
## 10  0  3 13  0 31 32850
## 11  0  3 12  1 22 27025
## 12  0  2 15  1 19 24750
## 13  0  3  9  1 17 28200
## 14  0  2  9  0 27 23712
## 15  0  3  9  1 24 25748
## 16  0  3  7  1 15 29342
## 17  0  3 13  1 20 31114
## 18  0  2 11  0 14 24742
## 19  0  2 10  0 15 22906
## 20  0  3  6  0 21 24450
## 21  0  1 16  0 23 19175
## 22  0  2  8  0 31 20525
## 23  0  3  7  1 13 27959
## 24  1  3  8  1 24 38045
## 25  0  2  9  1 12 24832
## 26  0  3  5  1 18 25400
## 27  0  2 11  1 14 24800
## 28  1  3  5  1 16 25500
## 29  0  2  3  0  7 26182
## 30  0  2  3  0 17 23725
## 31  1  1 10  0 15 21600
## 32  0  2 11  0 31 23300
## 33  0  1  9  0 14 23713
## 34  1  2  4  0 33 20690
## 35  1  2  6  0 29 22450
## 36  0  2  1  1  9 20850
## 37  1  1  8  1 14 18304
## 38  0  1  4  1  4 17095
## 39  0  1  4  1  5 16700
## 40  0  1  4  1  4 17600
## 41  0  1  3  1  4 18075
## 42  0  1  3  0 11 18000
## 43  0  2  0  1  7 20999
## 44  1  1  3  1  3 17250
## 45  0  1  2  1  3 16500
## 46  0  1  2  1  1 16094
## 47  1  1  2  1  6 16150
## 48  1  1  2  1  2 15350
## 49  0  1  1  1  1 16244
## 50  1  1  1  1  1 16686
## 51  1  1  1  1  1 15000
## 52  1  1  0  1  2 20300
## Guessing formula = yd ~ sl
2.) Compute a simple linear regression with sl as the dependent variable and sx, yr, dg, yd, and a recoded rk variable as independent variables. The recode of the rk variable should result in a new categorical variable that allows a mean salary comparison of full professors with another group composed of both assistant professors and associate professors. From this regression analysis:
Note:
The assignment says "simple" linear regression, but a simple linear regression has exactly one independent variable. Because this model has five (sx, yr, dg, yd, and the recoded rk), I am fitting a multiple linear regression for this task.
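The recode described in the task can be sketched as follows. This is only an illustration: `recode_rank` is a hypothetical helper name, and it assumes that `rk == 3` codes full professors, which matches the recoded listing in this report (every 3 becomes 1, every 1 or 2 becomes 0).

```r
# Hypothetical helper (not from the original analysis): collapse the three
# ranks into full professor (1) vs. assistant/associate professor (0).
# Assumption: rk == 3 codes full professors.
recode_rank <- function(rk) {
  as.integer(rk == 3)
}

# Illustrative values only, not the full data set.
rk_original <- c(3, 3, 2, 1, 2, 3)
rk_recoded  <- recode_rank(rk_original)
print(rk_recoded)
```

With a 0/1 variable like this, the rk coefficient in the regression is the estimated salary difference between full professors and the combined assistant/associate group, holding the other variables constant.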
##    sx rk yr dg yd    sl
## 1   0  1 25  1 35 36350
## 2   0  1 13  1 22 35350
## 3   0  1 10  1 23 28200
## 4   1  1  7  1 27 26775
## 5   0  1 19  0 30 33696
## 6   0  1 16  1 21 28516
## 7   1  1  0  0 32 24900
## 8   0  1 16  1 18 31909
## 9   0  1 13  0 30 31850
## 10  0  1 13  0 31 32850
## 11  0  1 12  1 22 27025
## 12  0  0 15  1 19 24750
## 13  0  1  9  1 17 28200
## 14  0  0  9  0 27 23712
## 15  0  1  9  1 24 25748
## 16  0  1  7  1 15 29342
## 17  0  1 13  1 20 31114
## 18  0  0 11  0 14 24742
## 19  0  0 10  0 15 22906
## 20  0  1  6  0 21 24450
## 21  0  0 16  0 23 19175
## 22  0  0  8  0 31 20525
## 23  0  1  7  1 13 27959
## 24  1  1  8  1 24 38045
## 25  0  0  9  1 12 24832
## 26  0  1  5  1 18 25400
## 27  0  0 11  1 14 24800
## 28  1  1  5  1 16 25500
## 29  0  0  3  0  7 26182
## 30  0  0  3  0 17 23725
## 31  1  0 10  0 15 21600
## 32  0  0 11  0 31 23300
## 33  0  0  9  0 14 23713
## 34  1  0  4  0 33 20690
## 35  1  0  6  0 29 22450
## 36  0  0  1  1  9 20850
## 37  1  0  8  1 14 18304
## 38  0  0  4  1  4 17095
## 39  0  0  4  1  5 16700
## 40  0  0  4  1  4 17600
## 41  0  0  3  1  4 18075
## 42  0  0  3  0 11 18000
## 43  0  0  0  1  7 20999
## 44  1  0  3  1  3 17250
## 45  0  0  2  1  3 16500
## 46  0  0  2  1  1 16094
## 47  1  0  2  1  6 16150
## 48  1  0  2  1  2 15350
## 49  0  0  1  1  1 16244
## 50  1  0  1  1  1 16686
## 51  1  0  1  1  1 15000
## 52  1  0  0  1  2 20300
##
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rk, data = gndrdscrmntn)
##
## Coefficients:
## (Intercept)           sx           yr           dg           yd
##    17761.82      -547.47       356.25      -559.33        77.37
##          rk
##     6856.45

##
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rk, data = gndrdscrmntn)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6066.3 -1719.5  -452.5   957.8  9826.7
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17761.82    1429.16  12.428 2.62e-16 ***
## sx           -547.47    1018.44  -0.538  0.59347
## yr            356.25     109.64   3.249  0.00216 **
## dg           -559.33    1204.37  -0.464  0.64454
## yd             77.37      76.84   1.007  0.31930
## rk           6856.45    1186.70   5.778 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2880 on 46 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.763
## F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14

##                   2.5 %     97.5 %
## (Intercept) 14885.07648 20638.5722
## sx          -2597.47771  1502.5290
## yr            135.56889   576.9402
## dg          -2983.60125  1864.9356
## yd            -77.31372   232.0466
## rk           4467.75405  9245.1439
a.) Report a test of the null hypothesis that sl, the dependent variable, is not related to the entire set of independent variables (sx, yr, dg, yd, and the recoded rk variable).
To test the null hypothesis that sl (the dependent variable) is not related to the entire set of independent variables (sx, yr, dg, yd, and the recoded rk), I use the overall F-test from the regression summary: F = 33.84 on 5 and 46 degrees of freedom, with a p-value of 2.461e-14. With alpha = .05, the p-value is far less than alpha. Therefore, we reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
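As a rough check on the decision above, the p-value can be recomputed from the F-statistic and its degrees of freedom as printed in the regression summary. This is a sketch using the rounded values from the output, so the result will agree with the reported p-value only approximately.

```r
# Values copied from the regression summary (rounded as printed).
f_stat   <- 33.84
df_model <- 5     # number of independent variables
df_resid <- 46    # residual degrees of freedom

# Upper-tail probability of the F distribution is the p-value of the overall test.
p_overall <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

alpha <- 0.05
reject_null <- p_overall < alpha

print(p_overall)
print(reject_null)
```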
b.) Report a test of the null hypothesis that sl is not related to sx.
This question can only be addressed after rejecting the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables, which was done in the previous task (a.).
So now let's examine the regression coefficient for sx in order to test the null hypothesis that there is no relationship between sl and sx. The coefficient is -547.47 with a t value of -0.538 and a p-value of 0.59347. With alpha = .05, the p-value is greater than alpha.
Therefore, we fail to reject the null hypothesis that there is no relationship between sl (the dependent variable) and sx (the independent variable) once yr, dg, yd, and the recoded rk are held constant.
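The t value and p-value for sx can be reproduced from the estimate and standard error in the summary table. This is a minimal sketch using the rounded figures as printed, so small differences in the last digits are expected.

```r
# Values copied from the sx row of the regression summary (rounded as printed).
estimate <- -547.47
std_err  <- 1018.44
df_resid <- 46

# t statistic: estimate divided by its standard error.
t_value <- estimate / std_err

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df_resid)

print(t_value)
print(p_value)
print(p_value > 0.05)  # TRUE means we fail to reject the null hypothesis
```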
c.) Report and interpret a 95% confidence interval around the regression coefficient for sx.
How accurate is the estimate of the relationship between Academic year salary (sl) and Sex/Gender (sx)? If we examine the 95% confidence interval for "sx":
         2.5 %     97.5 %
sx -2597.47771  1502.5290
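My best estimate is that, holding yr, dg, yd, and the recoded rk constant, females (sx = 1) earn $547.47 less than comparable males. However, we are 95% confident that the true difference lies between -$2,597.48 (females earn less) and $1,502.53 (females earn more). Because this interval includes zero, it agrees with the test in (b.): we cannot conclude that salary differs by sex once the other variables are accounted for.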
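The interval of roughly -2597.48 to 1502.53 can be rebuilt by hand as the estimate plus or minus a critical t value times the standard error. The sketch below uses the rounded estimate and standard error from the summary output, so it matches the reported interval only to within rounding.

```r
# Values copied from the sx row of the regression summary (rounded as printed).
estimate <- -547.47
std_err  <- 1018.44
df_resid <- 46

# Critical value for a two-sided 95% interval.
t_crit <- qt(0.975, df_resid)

ci_lower <- estimate - t_crit * std_err
ci_upper <- estimate + t_crit * std_err

print(c(lower = ci_lower, upper = ci_upper))

# The interval contains zero, which is consistent with failing to reject
# the null hypothesis for sx at alpha = .05.
print(ci_lower < 0 && ci_upper > 0)
```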
3.) Compute and report a new regression equation with sl as the dependent variable and sx as the sole independent variable. Then, compute a t–test of the difference in mean sl by sx. Describe whether and how the results about the relationship between sl and sx from the regression analysis and from the t–test are similar.
##
## Call:
## lm(formula = sl ~ sx, data = gndrdscrmntn)
##
## Coefficients:
## (Intercept)           sx
##       24697        -3340

##
## Call:
## lm(formula = sl ~ sx, data = gndrdscrmntn)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8602.8 -4296.6  -100.8  3513.1 16687.9
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    24697        938  26.330   <2e-16 ***
## sx             -3340       1808  -1.847   0.0706 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5782 on 50 degrees of freedom
## Multiple R-squared:  0.0639, Adjusted R-squared:  0.04518
## F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

##                2.5 %    97.5 %
## (Intercept) 22812.81 26580.773
## sx          -6970.55   291.257

##
##  Two Sample t-test
##
## data:  sl by sx
## t = 1.8474, df = 50, p-value = 0.0706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -291.257 6970.550
## sample estimates:
## mean in group 0 mean in group 1
##        24696.79        21357.14
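The fitted regression equation is: predicted sl = 24697 - 3340 * sx. The intercept, 24697, is the mean salary in group 0 (24696.79 in the t-test output), and the intercept plus the slope, 24697 - 3340 = 21357, is the mean salary in group 1 (21357.14).
The regression analysis and the t-test give the same result. Both p-values are 0.0706, so at alpha = .05 both fail to reject the null hypothesis that there is no relationship between sl and sx. The t value and the 95% confidence interval also have the same magnitudes in both analyses, but with opposite signs: the t value is -1.8474 in the regression and 1.8474 in the t-test, and the confidence interval is -6970.55 to 291.257 in the regression and -291.257 to 6970.550 in the t-test. The signs flip only because of the direction of the comparison. The regression coefficient measures group 1 minus group 0, while the t-test reports group 0 minus group 1. The two agree exactly because this is the equal-variance two-sample t-test (df = 50), which is the same model as a regression on a single 0/1 variable.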
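The equivalence described above can be checked directly. The sketch below re-enters the sl and sx columns from the data listing in this report as plain vectors (rather than using the original `gndrdscrmntn` data frame) and assumes the equal-variance form of the t-test, which is what the "Two Sample t-test" output with df = 50 indicates.

```r
# sl and sx copied from the 52-row data listing in this report.
sl <- c(36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909, 31850, 32850,
        27025, 24750, 28200, 23712, 25748, 29342, 31114, 24742, 22906, 24450,
        19175, 20525, 27959, 38045, 24832, 25400, 24800, 25500, 26182, 23725,
        21600, 23300, 23713, 20690, 22450, 20850, 18304, 17095, 16700, 17600,
        18075, 18000, 20999, 17250, 16500, 16094, 16150, 15350, 16244, 16686,
        15000, 20300)
sx <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
        1, 1)

# Regression of salary on sex alone.
fit   <- lm(sl ~ sx)
t_reg <- summary(fit)$coefficients["sx", "t value"]
p_reg <- summary(fit)$coefficients["sx", "Pr(>|t|)"]

# Equal-variance two-sample t-test (group 0 minus group 1).
tt   <- t.test(sl ~ sx, var.equal = TRUE)
t_tt <- unname(tt$statistic)
p_tt <- tt$p.value

# Same magnitude, opposite sign; identical p-values.
print(c(regression = t_reg, t_test = t_tt))
print(c(regression = p_reg, t_test = p_tt))

# The slope equals the difference in group means (group 1 minus group 0).
print(unname(coef(fit)["sx"]))
print(mean(sl[sx == 1]) - mean(sl[sx == 0]))
```

Dropping `var.equal = TRUE` would run the Welch test instead, which does not pool the variances and in general would no longer match the regression exactly.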