December 9, 2015

## Loading required package: dplyr
## Loading required package: ggvis
## Loading required package: magrittr
## Loading required package: knitr
## Loading required package: rmarkdown
##    sx rk yr dg yd    sl
## 1   0  3 25  1 35 36350
## 2   0  3 13  1 22 35350
## 3   0  3 10  1 23 28200
## 4   1  3  7  1 27 26775
## 5   0  3 19  0 30 33696
## 6   0  3 16  1 21 28516
## 7   1  3  0  0 32 24900
## 8   0  3 16  1 18 31909
## 9   0  3 13  0 30 31850
## 10  0  3 13  0 31 32850
## 11  0  3 12  1 22 27025
## 12  0  2 15  1 19 24750
## 13  0  3  9  1 17 28200
## 14  0  2  9  0 27 23712
## 15  0  3  9  1 24 25748
## 16  0  3  7  1 15 29342
## 17  0  3 13  1 20 31114
## 18  0  2 11  0 14 24742
## 19  0  2 10  0 15 22906
## 20  0  3  6  0 21 24450
## 21  0  1 16  0 23 19175
## 22  0  2  8  0 31 20525
## 23  0  3  7  1 13 27959
## 24  1  3  8  1 24 38045
## 25  0  2  9  1 12 24832
## 26  0  3  5  1 18 25400
## 27  0  2 11  1 14 24800
## 28  1  3  5  1 16 25500
## 29  0  2  3  0  7 26182
## 30  0  2  3  0 17 23725
## 31  1  1 10  0 15 21600
## 32  0  2 11  0 31 23300
## 33  0  1  9  0 14 23713
## 34  1  2  4  0 33 20690
## 35  1  2  6  0 29 22450
## 36  0  2  1  1  9 20850
## 37  1  1  8  1 14 18304
## 38  0  1  4  1  4 17095
## 39  0  1  4  1  5 16700
## 40  0  1  4  1  4 17600
## 41  0  1  3  1  4 18075
## 42  0  1  3  0 11 18000
## 43  0  2  0  1  7 20999
## 44  1  1  3  1  3 17250
## 45  0  1  2  1  3 16500
## 46  0  1  2  1  1 16094
## 47  1  1  2  1  6 16150
## 48  1  1  2  1  2 15350
## 49  0  1  1  1  1 16244
## 50  1  1  1  1  1 16686
## 51  1  1  1  1  1 15000
## 52  1  1  0  1  2 20300

1.) Using ggvis, construct a.) boxplots, i.) for sl by sx

## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character

ii.) for sl by dg

## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character

b.) scatterplots of points, with a smooth line among points, i.) for sl by yd

ii.) for sl by yr

c.) a scatterplot of points, plotted with a linear model, and 95% confidence interval for the model, for sl by yd.

## Guessing formula = yd ~ sl
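The "Guessing formula = yd ~ sl" message indicates that the model layer was fit with yd as the response, which suggests sl was mapped to the x-axis and yd to the y-axis. For a plot of sl by yd, one would usually put yd on x and sl on y. A minimal sketch, assuming the ggvis package is installed and using only the first eight rows of the printed data (the names `salary_sample` and `p_model` are hypothetical):

```r
library(ggvis)

# First eight rows of the printed data set (yd and sl columns only).
salary_sample <- data.frame(
  yd = c(35, 22, 23, 27, 30, 21, 32, 18),
  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909)
)

# yd on x, sl on y. The formula is passed explicitly so ggvis does not
# have to guess it; se = TRUE adds the confidence band around the fit.
p_model <- salary_sample %>%
  ggvis(~yd, ~sl) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", formula = sl ~ yd, se = TRUE)

# Print p_model in an interactive session to render the plot.
```

To my knowledge the band drawn by `se = TRUE` defaults to the 95% level in ggvis, which is what part c.) asks for.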

d.) a scatterplot of points of sl by yr grouped by rk.

Task 2

2.) Compute a simple linear regression with sl as the dependent variable and sx, yr, dg, yd, and a recoded rk variable as independent variables. The recode of the rk variable should result in a new categorical variable that allows a mean salary comparison of full professors with another group composed of both assistant professors and associate professors. From this regression analysis:

Note:

I think you mean a multiple linear regression rather than a simple one: a simple linear regression has exactly one independent variable, and this model has five. I will therefore perform a multiple linear regression for this task.

Task 2 Cont'd

##    sx rk yr dg yd    sl
## 1   0  1 25  1 35 36350
## 2   0  1 13  1 22 35350
## 3   0  1 10  1 23 28200
## 4   1  1  7  1 27 26775
## 5   0  1 19  0 30 33696
## 6   0  1 16  1 21 28516
## 7   1  1  0  0 32 24900
## 8   0  1 16  1 18 31909
## 9   0  1 13  0 30 31850
## 10  0  1 13  0 31 32850
## 11  0  1 12  1 22 27025
## 12  0  0 15  1 19 24750
## 13  0  1  9  1 17 28200
## 14  0  0  9  0 27 23712
## 15  0  1  9  1 24 25748
## 16  0  1  7  1 15 29342
## 17  0  1 13  1 20 31114
## 18  0  0 11  0 14 24742
## 19  0  0 10  0 15 22906
## 20  0  1  6  0 21 24450
## 21  0  0 16  0 23 19175
## 22  0  0  8  0 31 20525
## 23  0  1  7  1 13 27959
## 24  1  1  8  1 24 38045
## 25  0  0  9  1 12 24832
## 26  0  1  5  1 18 25400
## 27  0  0 11  1 14 24800
## 28  1  1  5  1 16 25500
## 29  0  0  3  0  7 26182
## 30  0  0  3  0 17 23725
## 31  1  0 10  0 15 21600
## 32  0  0 11  0 31 23300
## 33  0  0  9  0 14 23713
## 34  1  0  4  0 33 20690
## 35  1  0  6  0 29 22450
## 36  0  0  1  1  9 20850
## 37  1  0  8  1 14 18304
## 38  0  0  4  1  4 17095
## 39  0  0  4  1  5 16700
## 40  0  0  4  1  4 17600
## 41  0  0  3  1  4 18075
## 42  0  0  3  0 11 18000
## 43  0  0  0  1  7 20999
## 44  1  0  3  1  3 17250
## 45  0  0  2  1  3 16500
## 46  0  0  2  1  1 16094
## 47  1  0  2  1  6 16150
## 48  1  0  2  1  2 15350
## 49  0  0  1  1  1 16244
## 50  1  0  1  1  1 16686
## 51  1  0  1  1  1 15000
## 52  1  0  0  1  2 20300
## 
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rk, data = gndrdscrmntn)
## 
## Coefficients:
## (Intercept)           sx           yr           dg           yd  
##    17761.82      -547.47       356.25      -559.33        77.37  
##          rk  
##     6856.45
## 
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rk, data = gndrdscrmntn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6066.3 -1719.5  -452.5   957.8  9826.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17761.82    1429.16  12.428 2.62e-16 ***
## sx           -547.47    1018.44  -0.538  0.59347    
## yr            356.25     109.64   3.249  0.00216 ** 
## dg           -559.33    1204.37  -0.464  0.64454    
## yd             77.37      76.84   1.007  0.31930    
## rk           6856.45    1186.70   5.778 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2880 on 46 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.763 
## F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14
##                   2.5 %     97.5 %
## (Intercept) 14885.07648 20638.5722
## sx          -2597.47771  1502.5290
## yr            135.56889   576.9402
## dg          -2983.60125  1864.9356
## yd            -77.31372   232.0466
## rk           4467.75405  9245.1439
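Output like the above could be produced along the following lines. This is a sketch under assumptions: the data frame name `gndrdscrmntn` is taken from the printed call, the `raw` object is hypothetical, the values are transcribed from the first printed table, and rk = 3 is assumed to code full professor because that is what the recoded table implies.

```r
# Data transcribed from the first printed table (52 rows, 10 per line).
raw <- data.frame(
  sx = c(0,0,0,1,0,0,1,0,0,0,  0,0,0,0,0,0,0,0,0,0,
         0,0,0,1,0,0,0,1,0,0,  1,0,0,1,1,0,1,0,0,0,
         0,0,0,1,0,0,1,1,0,1,  1,1),
  rk = c(3,3,3,3,3,3,3,3,3,3,  3,2,3,2,3,3,3,2,2,3,
         1,2,3,3,2,3,2,3,2,2,  1,2,1,2,2,2,1,1,1,1,
         1,1,2,1,1,1,1,1,1,1,  1,1),
  yr = c(25,13,10,7,19,16,0,16,13,13,  12,15,9,9,9,7,13,11,10,6,
         16,8,7,8,9,5,11,5,3,3,        10,11,9,4,6,1,8,4,4,4,
         3,3,0,3,2,2,2,2,1,1,          1,0),
  dg = c(1,1,1,1,0,1,0,1,0,0,  1,1,1,0,1,1,1,0,0,0,
         0,0,1,1,1,1,1,1,0,0,  0,0,0,0,0,1,1,1,1,1,
         1,0,1,1,1,1,1,1,1,1,  1,1),
  yd = c(35,22,23,27,30,21,32,18,30,31,  22,19,17,27,24,15,20,14,15,21,
         23,31,13,24,12,18,14,16,7,17,   15,31,14,33,29,9,14,4,5,4,
         4,11,7,3,3,1,6,2,1,1,           1,2),
  sl = c(36350,35350,28200,26775,33696,28516,24900,31909,31850,32850,
         27025,24750,28200,23712,25748,29342,31114,24742,22906,24450,
         19175,20525,27959,38045,24832,25400,24800,25500,26182,23725,
         21600,23300,23713,20690,22450,20850,18304,17095,16700,17600,
         18075,18000,20999,17250,16500,16094,16150,15350,16244,16686,
         15000,20300)
)

# Recode rank: 1 = full professor (rk == 3), 0 = the other two ranks.
gndrdscrmntn <- raw
gndrdscrmntn$rk <- as.integer(raw$rk == 3)

# Multiple linear regression of salary on all five predictors.
fit <- lm(sl ~ sx + yr + dg + yd + rk, data = gndrdscrmntn)

fit                          # coefficients only
summary(fit)                 # t-tests, R-squared, overall F-test
confint(fit, level = 0.95)   # 95% confidence intervals
```

With a 0/1 indicator for rank, the rk coefficient is the adjusted mean salary difference between full professors and the combined group of the other two ranks.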

Task 2, a.) Null Hypothesis Results

a.) Report a test of the null hypothesis that sl, the dependent variable, is not related to the entire set of independent variables (sx, yr, dg, yd, and the recoded rk variable).

To test the null hypothesis that sl (the dependent variable) is not related to the entire set of independent variables (sx, yr, dg, yd, and the recoded rk), we use the overall F-test from the regression output: F = 33.84 on 5 and 46 degrees of freedom, with a p-value of 2.461e-14. With alpha = .05, the p-value is far below alpha. Therefore, we reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
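The overall F statistic can be checked by hand from the R-squared in the summary output. A small base-R sketch using only numbers printed above (the variable names are illustrative):

```r
r2 <- 0.7863   # Multiple R-squared from the summary output
k  <- 5        # number of independent variables
n  <- 52       # number of observations

# Explained variance per predictor divided by unexplained variance
# per residual degree of freedom.
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability of the F distribution with (5, 46) df.
p_val <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

f_stat          # about 33.85; the small gap from 33.84 comes from the rounded R-squared
p_val < 0.05    # TRUE, so the null hypothesis is rejected
```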

Task 2, b.) Report of Null Hypothesis

b.) Report a test of the null hypothesis that sl is not related to sx.

The test for an individual coefficient is interpreted only after rejecting the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables, which was done in the previous task (a.).

So now let's examine the regression coefficient for sx to test the null hypothesis that there is no relationship between sl and sx, holding the other independent variables constant. The estimate is -547.47, with t = -0.538 and a p-value of 0.59347. With alpha = .05, the p-value is greater than alpha.
Therefore, we fail to reject the null hypothesis that there is no relationship between sl, the dependent variable, and sx, the independent variable.
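The t value and p-value for sx follow directly from the estimate and its standard error. A short base-R check, using the figures from the summary output (variable names are illustrative):

```r
estimate  <- -547.47   # sx coefficient from the summary output
std_error <- 1018.44   # its standard error
df_resid  <- 46        # residual degrees of freedom

t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)   # -0.538, as printed
p_value > 0.05      # TRUE: fail to reject at alpha = .05
```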

Task 2, c.) Report and Interpret a 95% C.I.

c.) Report and interpret a 95% confidence interval around the regression coefficient for sx.

How precise is the estimate of the relationship between academic-year salary (sl) and sex (sx)? The 95% confidence interval for the sx coefficient is:

         2.5 %      97.5 %
sx -2597.47771   1502.5290

My best estimate is that, holding yr, dg, yd, and the recoded rk constant, women (coded sx = 1) earn $547.47 less than comparable men. We are 95% confident that the true adjusted difference (female minus male) lies between -$2,597.48 and $1,502.53. Because this interval contains zero, the data are consistent with no salary difference by sex, which agrees with the failure to reject the null hypothesis in part b.
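The interval can be reproduced from the sx estimate and standard error in the summary output. A minimal base-R sketch (variable names are illustrative; the inputs are the rounded printed values, so the last digits may differ slightly from `confint()`):

```r
estimate  <- -547.47   # sx coefficient
std_error <- 1018.44   # its standard error
df_resid  <- 46        # residual degrees of freedom

# Two-sided 95% critical value of the t distribution.
t_crit <- qt(0.975, df = df_resid)

# Estimate plus or minus the margin of error.
ci <- estimate + c(-1, 1) * t_crit * std_error

round(ci, 2)              # close to the printed -2597.48 and 1502.53
ci[1] < 0 && ci[2] > 0    # TRUE: the interval contains zero
```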

Task 3.) Compute and Report a New Regression and T-test

3.) Compute and report a new regression equation with sl as the dependent variable and sx as the sole independent variable. Then, compute a t–test of the difference in mean sl by sx. Describe whether and how the results about the relationship between sl and sx from the regression analysis and from the t–test are similar.

Simple Linear Regression of sl and sx

## 
## Call:
## lm(formula = sl ~ sx, data = gndrdscrmntn)
## 
## Coefficients:
## (Intercept)           sx  
##       24697        -3340
## 
## Call:
## lm(formula = sl ~ sx, data = gndrdscrmntn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8602.8 -4296.6  -100.8  3513.1 16687.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    24697        938  26.330   <2e-16 ***
## sx             -3340       1808  -1.847   0.0706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5782 on 50 degrees of freedom
## Multiple R-squared:  0.0639, Adjusted R-squared:  0.04518 
## F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706
##                2.5 %    97.5 %
## (Intercept) 22812.81 26580.773
## sx          -6970.55   291.257

T-test of sl and sx

## 
##  Two Sample t-test
## 
## data:  sl by sx
## t = 1.8474, df = 50, p-value = 0.0706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -291.257 6970.550
## sample estimates:
## mean in group 0 mean in group 1 
##        24696.79        21357.14

Report on Simple Regression and T-test Results

The regression analysis and the t-test give the same result, apart from the sign of some quantities. Both have a p-value of 0.0706, so at alpha = .05 both fail to reject the null hypothesis that there is no relationship between sl and sx. The t statistic and the 95% confidence interval have the same magnitudes in both analyses but opposite signs: t is -1.8474 in the regression and 1.8474 in the t-test, and the interval is (-6970.55, 291.26) in the regression and (-291.26, 6970.55) in the t-test. The signs differ only because of the direction of subtraction: the regression coefficient for sx is the mean of group 1 minus the mean of group 0 (21357.14 - 24696.79 = -3339.65), while the t-test reports group 0 minus group 1. The results agree exactly because, with a single 0/1 independent variable, the regression is the same model as the equal-variance two-sample t-test (both have 50 degrees of freedom), and the regression intercept, 24697, is the mean of group 0.
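The equivalence can be checked directly. The sketch below uses the sx and sl columns transcribed from the printed data; the object names are hypothetical, and `var.equal = TRUE` is assumed because the output header reads "Two Sample t-test" with 50 degrees of freedom rather than a Welch test.

```r
# sx and sl transcribed from the printed table (52 rows, 10 per line).
sx <- c(0,0,0,1,0,0,1,0,0,0,  0,0,0,0,0,0,0,0,0,0,
        0,0,0,1,0,0,0,1,0,0,  1,0,0,1,1,0,1,0,0,0,
        0,0,0,1,0,0,1,1,0,1,  1,1)
sl <- c(36350,35350,28200,26775,33696,28516,24900,31909,31850,32850,
        27025,24750,28200,23712,25748,29342,31114,24742,22906,24450,
        19175,20525,27959,38045,24832,25400,24800,25500,26182,23725,
        21600,23300,23713,20690,22450,20850,18304,17095,16700,17600,
        18075,18000,20999,17250,16500,16094,16150,15350,16244,16686,
        15000,20300)
d <- data.frame(sx, sl)

fit <- lm(sl ~ sx, data = d)
tt  <- t.test(sl ~ sx, data = d, var.equal = TRUE)  # pooled-variance test

# The slope is mean(group 1) - mean(group 0).
slope     <- unname(coef(fit)["sx"])
mean_diff <- mean(d$sl[d$sx == 1]) - mean(d$sl[d$sx == 0])
all.equal(slope, mean_diff)                # TRUE

# Same t statistic with the sign flipped, and the same p-value.
reg_t <- summary(fit)$coefficients["sx", "t value"]
reg_p <- summary(fit)$coefficients["sx", "Pr(>|t|)"]
all.equal(reg_t, -unname(tt$statistic))    # TRUE
all.equal(reg_p, tt$p.value)               # TRUE
```

Swapping the order of the groups in the t-test would flip its sign to match the regression, without changing the p-value.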