November 28, 2015

First of all, I trimmed the dataset: load the packages I will need, read the csv file into R, and convert the data frame to a table data frame with the dplyr function tbl_df.
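The loading code itself is not shown above; a minimal sketch of what it likely looked like is below. The file name salary.csv is an assumption, not the actual file used.

library(dplyr)
library(ggvis)
## "salary.csv" is an assumed file name; substitute the real csv
Profsal <- tbl_df(read.csv("salary.csv"))
Profsal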

## Source: local data frame [52 x 6]
## 
##       sx    rk    yr    dg    yd    sl
##    (int) (int) (int) (int) (int) (int)
## 1      0     3    25     1    35 36350
## 2      0     3    13     1    22 35350
## 3      0     3    10     1    23 28200
## 4      1     3     7     1    27 26775
## 5      0     3    19     0    30 33696
## 6      0     3    16     1    21 28516
## 7      1     3     0     0    32 24900
## 8      0     3    16     1    18 31909
## 9      0     3    13     0    30 31850
## 10     0     3    13     0    31 32850
## ..   ...   ...   ...   ...   ...   ...

#1. Using ggvis, construct
##a. boxplots
####i. for sl by sx

####ii. for sl by dg. 
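The plotting code for 1a is not reproduced above; a minimal ggvis sketch for the two boxplots (sl by sx and sl by dg) might look like this, with sx and dg wrapped in factor() so ggvis treats them as categorical:

## boxplot of sl by sx
Profsal %>% ggvis(~factor(sx), ~sl) %>% layer_boxplots()
## boxplot of sl by dg
Profsal %>% ggvis(~factor(dg), ~sl) %>% layer_boxplots()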

##b. scatterplots of points, with a smooth line among points
#####i. for sl by yd.

#####ii. for sl by yr. 
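The code for 1b is likewise not shown; a sketch using layer_points() followed by layer_smooths() for the default smooth line:

## scatterplot with smooth line, sl by yd
Profsal %>% ggvis(~yd, ~sl) %>% layer_points() %>% layer_smooths()
## scatterplot with smooth line, sl by yr
Profsal %>% ggvis(~yr, ~sl) %>% layer_points() %>% layer_smooths()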

##c. a scatterplot of points, plotted with a linear model, and 95% confidence interval for the model, for sl by yd.
## Guessing formula = sl ~ yd
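Assuming the guessed formula sl ~ yd, layer_model_predictions() with model = "lm" and se = TRUE draws the linear fit with its 95% confidence band; a sketch:

## linear model with 95% confidence band, sl by yd
Profsal %>% ggvis(~yd, ~sl) %>% layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE)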

##d. a scatterplot of points of sl by yr grouped by rk.
## Guessing formula = sl ~ yr
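Assuming the guessed formula sl ~ yr, mapping fill to factor(rk) colors the points by rank; a sketch:

## scatterplot of sl by yr, points grouped (colored) by rk
Profsal %>% ggvis(~yr, ~sl, fill = ~factor(rk)) %>% layer_points()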

##Recode rk
#####(full professor=1, assistant and associate professor=0)
## Note: na.rm = TRUE inside summarize() is stored as a constant column, not treated as a missing-value option; it is harmless here.
profrank <- group_by(Profsal, rk)
summarize(profrank, count = n(), na.rm = TRUE)
## Source: local data frame [3 x 3]
## 
##      rk count na.rm
##   (int) (int) (lgl)
## 1     1    18  TRUE
## 2     2    14  TRUE
## 3     3    20  TRUE
## 
##  0  1 
## 32 20
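The recode step itself is not shown above. Assuming rk is coded 1 = assistant, 2 = associate, 3 = full professor (consistent with the group counts of 18, 14, and 20), a sketch that reproduces the 0/1 counts of 32 and 20 is:

## rk == 3 is assumed to mark full professors
Profsal$rkrecode <- ifelse(Profsal$rk == 3, 1, 0)
table(Profsal$rkrecode)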

#2. Compute a simple linear regression 
Profsaltest <- lm(sl ~ sx + yr + dg + yd + rkrecode, data = Profsal)
Profsaltest
## 
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rkrecode, data = Profsal)
## 
## Coefficients:
## (Intercept)           sx           yr           dg           yd  
##    17761.82      -547.47       356.25      -559.33        77.37  
##    rkrecode  
##     6856.45

summary(Profsaltest)
## 
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rkrecode, data = Profsal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6066.3 -1719.5  -452.5   957.8  9826.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17761.82    1429.16  12.428 2.62e-16 ***
## sx           -547.47    1018.44  -0.538  0.59347    
## yr            356.25     109.64   3.249  0.00216 ** 
## dg           -559.33    1204.37  -0.464  0.64454    
## yd             77.37      76.84   1.007  0.31930    
## rkrecode     6856.45    1186.70   5.778 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2880 on 46 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.763 
## F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14

##a. Report a test of the null hypothesis that sl is not related to the set of independent variables.
confint(Profsaltest)
##                   2.5 %     97.5 %
## (Intercept) 14885.07648 20638.5722
## sx          -2597.47771  1502.5290
## yr            135.56889   576.9402
## dg          -2983.60125  1864.9356
## yd            -77.31372   232.0466
## rkrecode     4467.75405  9245.1439
#####  Multiple R-squared:  0.7863, Adjusted R-squared:  0.763 
#####  F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14 
If α = .05, then the p-value, 2.461e-14, is less than α. Therefore, we reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
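That overall p-value comes from the F test reported at the bottom of summary(); as a quick check, it can be recomputed from the stored F statistic:

Fstat <- summary(Profsaltest)$fstatistic
pf(Fstat[1], Fstat[2], Fstat[3], lower.tail = FALSE)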

##b. Report a test of the null hypothesis that sl is not related to sx.
###### Estimate    Std.Error   t value       Pr(>|t|)
######-547.47    1018.44       -0.538         0.59347 
If α = .05, then the p-value, 0.59347, is greater than α. Therefore, we cannot reject the null hypothesis that there is no relationship between sl and sx.
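The quoted row can be pulled straight from the fitted model rather than copied by hand:

summary(Profsaltest)$coefficients["sx", ]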

##c. Report and interpret a 95% confidence interval around the regression coefficient for sx.
The estimate of the population coefficient is -547.47, which suggests a negative relationship between academic-year salary (sl) and sex (sx): on average, holding the other variables constant, women's salaries are 547.47 dollars lower than men's. The 95% confidence interval for this coefficient, from the confint() output above, is (-2597.48, 1502.53); because this interval includes zero, the effect of sx is not statistically significant at the .05 level.

##3. Regression & t-test

####Compute and report a new regression equation with sl as the dependent variable and sx as the sole independent variable.
Profsal_sxtest <- lm(sl ~ sx, data=Profsal)
summary(Profsal_sxtest)$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 24696.789   937.9776 26.32983 5.761530e-31
## sx          -3339.647  1807.7156 -1.84744 7.060394e-02

If α = .05, then the p-value, 0.0706, is greater than α. Therefore, we cannot reject the null hypothesis that there is no relationship between the dependent variable, sl, and the independent variable, sx.

####Compute a t-test of the difference in mean sl by sx
## sx == 0 codes male, sx == 1 female (consistent with the interpretation in 2c)
sl_male <- Profsal[Profsal$sx == 0, ]$sl
sl_female <- Profsal[Profsal$sx == 1, ]$sl
t.test(sl_male, sl_female, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  sl_male and sl_female
## t = 1.8474, df = 50, p-value = 0.0706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -291.257 6970.550
## sample estimates:
## mean of x mean of y 
##  24696.79  21357.14

####Describe whether and how the results about the relationship between sl and sx from the regression analysis and from the t-test are similar.

1. The p-value associated with a t-value of 1.8474 is 0.0706

2. This p-value of 0.0706 is greater than alpha, 0.05.

3. Therefore, we cannot reject the null hypothesis in either analysis.

All in all, the regression analysis and the t-test give the same result: the same t statistic in absolute value (1.8474) and the same p-value (0.0706), so neither shows a statistically significant difference in sl by sx.
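As a quick check of that equivalence, the t statistic from the regression of sl on sx matches the pooled-variance t-test statistic up to sign (the sign flips because the regression codes males as sx = 0):

## t statistic from the regression of sl on sx
summary(Profsal_sxtest)$coefficients["sx", "t value"]
## t statistic from the pooled-variance two-sample t-test
t.test(sl_male, sl_female, var.equal = TRUE)$statistic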