12/01/2015

Require

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggvis)
## Loading required package: ggvis
require(magrittr)
## Loading required package: magrittr

Load data

data1<-read.csv(url("http://www.personal.psu.edu/dlp/w540/sexdisc.csv"))
data1<-tbl_df(data1)
data1
## Source: local data frame [52 x 6]
## 
##    sx rk yr dg yd    sl
## 1   0  3 25  1 35 36350
## 2   0  3 13  1 22 35350
## 3   0  3 10  1 23 28200
## 4   1  3  7  1 27 26775
## 5   0  3 19  0 30 33696
## 6   0  3 16  1 21 28516
## 7   1  3  0  0 32 24900
## 8   0  3 16  1 18 31909
## 9   0  3 13  0 30 31850
## 10  0  3 13  0 31 32850
## .. .. .. .. .. ..   ...
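The models later in these slides use a variable rk1, described as "full professor or not", which is not among the columns printed above. A minimal sketch of how such an indicator could be built, assuming (the slides do not say this) that rk == 3 codes full professor. The vector rk_example is hypothetical and only illustrates the recoding:

```r
# Hypothetical rank codes; assumption: 3 = full professor
rk_example <- c(1, 2, 3, 3, 1)

# Indicator: 1 if full professor, 0 otherwise
rk1_example <- as.integer(rk_example == 3)
print(rk1_example)
```

On the real data the same idea would be applied to data1$rk, provided the coding assumption holds.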

1ai. Boxplots for sl by sx.

data1 %>% ggvis(~sx, ~sl) %>% layer_boxplots()

1aii. Boxplots for sl by dg.

data1 %>% ggvis(~dg, ~sl) %>% layer_boxplots()

1bi. Scatterplots of points, with a smooth line among points for sl by yd.

data1 %>% ggvis(~yd, ~sl) %>% layer_points() %>% layer_smooths()

1bii. Scatterplots of points, with a smooth line among points for sl by yr.

data1 %>% ggvis(~yr, ~sl) %>% layer_points() %>% layer_smooths()

1c. Scatterplot of points, plotted with a linear model, and 95% confidence interval for the model, for sl by yd.

data1 %>% ggvis(~yd, ~sl) %>% layer_points() %>% layer_model_predictions(model = "lm", se = TRUE)
## Guessing formula = sl ~ yd
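A comparable confidence band can be computed in base R with predict(), which may help when the numbers behind the plot are needed. This is only a sketch: it uses the first ten rows of yd and sl as printed earlier, not the full 52-row data set, so the fit will not match the plot exactly. The names first10, grid, and band are made up for this illustration.

```r
# First ten rows of yd and sl as printed in the data listing (not the full data)
first10 <- data.frame(
  yd = c(35, 22, 23, 27, 30, 21, 32, 18, 30, 31),
  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909, 31850, 32850)
)

fit <- lm(sl ~ yd, data = first10)

# Grid of yd values spanning the observed range
grid <- data.frame(yd = seq(min(first10$yd), max(first10$yd), length.out = 5))

# Fitted values with lower and upper 95% confidence limits for the mean of sl
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
print(cbind(grid, band))
```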

1d. Scatterplot of points of sl by yr grouped by rk.

data1 %>% ggvis(~yr, ~sl) %>% layer_points(fill = ~factor(rk))

2a. Test of the null hypothesis that sl is not related to the entire set of independent variables

lm1<- lm(sl ~ sx + yr + dg + yd + rk1, data=data1)

2a. (continued)

summary(lm1)
## 
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rk1, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6066.3 -1719.5  -452.5   957.8  9826.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17761.82    1429.16  12.428 2.62e-16 ***
## sx           -547.47    1018.44  -0.538  0.59347    
## yr            356.25     109.64   3.249  0.00216 ** 
## dg           -559.33    1204.37  -0.464  0.64454    
## yd             77.37      76.84   1.007  0.31930    
## rk1          6856.45    1186.70   5.778 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2880 on 46 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.763 
## F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14

2a. (continued)

confint(lm1)
##                   2.5 %     97.5 %
## (Intercept) 14885.07648 20638.5722
## sx          -2597.47771  1502.5290
## yr            135.56889   576.9402
## dg          -2983.60125  1864.9356
## yd            -77.31372   232.0466
## rk1          4467.75405  9245.1439

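The overall F-test (F = 33.84 on 5 and 46 DF, p-value = 2.461e-14) rejects the null hypothesis that sl is not related to the entire set of independent variables. Individually, sx, dg, and yd are not statistically significant at the 5% level (their 95% confidence intervals include 0), while yr and rk1 (full professor or not) are positively and significantly related to sl.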
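The overall test can be checked by hand from the numbers reported in the summary(lm1) output. The sketch below recomputes the p-value from the F distribution and rebuilds F from R-squared; because the inputs are the rounded values printed in the output, small discrepancies are expected.

```r
# Values copied from the summary(lm1) output
f_stat    <- 33.84
df_model  <- 5
df_resid  <- 46
r_squared <- 0.7863

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# F rebuilt from R-squared: (R^2 / k) / ((1 - R^2) / residual df)
f_from_r2 <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

print(c(p_value = p_value, f_from_r2 = f_from_r2))
```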

2b. Test of the null hypothesis that sl is not related to sx.

lm2<- lm(sl ~ sx, data=data1)

2b. (continued)

summary(lm2)
## 
## Call:
## lm(formula = sl ~ sx, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8602.8 -4296.6  -100.8  3513.1 16687.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    24697        938  26.330   <2e-16 ***
## sx             -3340       1808  -1.847   0.0706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5782 on 50 degrees of freedom
## Multiple R-squared:  0.0639, Adjusted R-squared:  0.04518 
## F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

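The coefficient for sx is -3340 (t = -1.847, p-value = 0.0706), which is not statistically significant at the 5% level. Thus, we fail to reject the null hypothesis that sl is not related to sx. Failing to reject is not the same as showing that the null hypothesis is true.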
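The t value and p-value for sx follow directly from the estimate and its standard error. A small sketch using the rounded numbers from the summary(lm2) output (so the result agrees only up to rounding):

```r
# Values copied from the summary(lm2) output
estimate  <- -3340
std_error <- 1808
df_resid  <- 50

# t statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 50 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)

print(c(t_stat = t_stat, p_value = p_value))
```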

2c. Report and interpret a 95% confidence interval around the regression coefficient for sx.

confint(lm2)
##                2.5 %    97.5 %
## (Intercept) 22812.81 26580.773
## sx          -6970.55   291.257

The 95% confidence interval for the regression coefficient for sx runs from -6970.55 to 291.257. Because the interval includes 0, the coefficient is not significantly different from 0 at the 5% level. In other words, if we were to repeatedly collect new data generated from the same distribution and compute the interval each time, about 95 out of every 100 such intervals would contain the true coefficient.
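The interval reported by confint(lm2) is the estimate plus or minus a t critical value times the standard error. A sketch that rebuilds it from the rounded summary(lm2) values, so the endpoints match the output only approximately:

```r
# Values copied from the summary(lm2) output
estimate  <- -3340
std_error <- 1808
df_resid  <- 50

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df_resid)

# Lower and upper limits: estimate -/+ t_crit * standard error
ci <- estimate + c(-1, 1) * t_crit * std_error
print(ci)
```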

3. Describe whether and how the results about the relationship between sl and sx from the regression analysis and from the t-test are similar.

lm3<- lm(sl ~ sx, data=data1)

3. (continued)

summary(lm3)
## 
## Call:
## lm(formula = sl ~ sx, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8602.8 -4296.6  -100.8  3513.1 16687.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    24697        938  26.330   <2e-16 ***
## sx             -3340       1808  -1.847   0.0706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5782 on 50 degrees of freedom
## Multiple R-squared:  0.0639, Adjusted R-squared:  0.04518 
## F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

3. (continued)

t.test(data1$sl~data1$sx)
## 
##  Welch Two Sample t-test
## 
## data:  data1$sl by data1$sx
## t = 1.7744, df = 21.591, p-value = 0.09009
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -567.8539 7247.1471
## sample estimates:
## mean in group 0 mean in group 1 
##        24696.79        21357.14

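In the regression analysis, the coefficient for sx (-3340, t = -1.847, p-value = 0.0706) is not significant at the 5% level. The Welch t-test (t = 1.7744, p-value = 0.09009) likewise finds no significant difference in academic year salary (sl) between females and males (sx). The two results are similar: the regression coefficient equals the difference between the two group means (21357.14 - 24696.79 = -3339.65), and both tests fail to reject the null hypothesis. The test statistics and p-values differ slightly because the regression assumes equal variances in the two groups, while the Welch test does not.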
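The link between the two analyses can be seen on a small example. The sketch below uses only the first ten rows of sx and sl as printed in the data listing, so the numbers differ from the full-data results above. It shows that the regression slope is the difference in group means, and that the regression t value matches a pooled-variance t-test (var.equal = TRUE) up to sign, while the Welch test gives a different statistic. The name first10 is made up for this illustration.

```r
# First ten rows of sx and sl as printed in the data listing (not the full data)
first10 <- data.frame(
  sx = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0),
  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909, 31850, 32850)
)

fit   <- lm(sl ~ sx, data = first10)
slope <- unname(coef(fit)["sx"])
t_lm  <- coef(summary(fit))["sx", "t value"]

# Difference in group means (group 1 minus group 0)
mean_diff <- mean(first10$sl[first10$sx == 1]) - mean(first10$sl[first10$sx == 0])

# Pooled-variance t-test (equal variances assumed) and Welch t-test (default)
t_pooled <- unname(t.test(sl ~ sx, data = first10, var.equal = TRUE)$statistic)
t_welch  <- unname(t.test(sl ~ sx, data = first10)$statistic)

print(c(slope = slope, mean_diff = mean_diff))
print(c(t_lm = t_lm, t_pooled = t_pooled, t_welch = t_welch))
```

t.test() reports group 0 minus group 1, which is why its statistic has the opposite sign to the regression t value.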