Task 1

Analyze whether there is gender discrimination in university professors’ salaries.

1. Preparation

First, I load the packages that I will need: dplyr, ggvis, and magrittr.

Next, I read the data file into R and then convert the data frame to a tbl_df (dplyr's table data frame) by using the dplyr function tbl_df.

## Observations: 52
## Variables: 6
## $ sx (int) 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ rk (int) 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 3,...
## $ yr (int) 25, 13, 10, 7, 19, 16, 0, 16, 13, 13, 12, 15, 9, 9, 9, 7, 1...
## $ dg (int) 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,...
## $ yd (int) 35, 22, 23, 27, 30, 21, 32, 18, 30, 31, 22, 19, 17, 27, 24,...
## $ sl (int) 36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909, 318...
##        sx               rk              yr               dg        
##  Min.   :0.0000   Min.   :1.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.:0.0000  
##  Median :0.0000   Median :2.000   Median : 7.000   Median :1.0000  
##  Mean   :0.2692   Mean   :2.038   Mean   : 7.481   Mean   :0.6538  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:11.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :3.000   Max.   :25.000   Max.   :1.0000  
##        yd              sl       
##  Min.   : 1.00   Min.   :15000  
##  1st Qu.: 6.75   1st Qu.:18247  
##  Median :15.50   Median :23719  
##  Mean   :16.12   Mean   :23798  
##  3rd Qu.:23.25   3rd Qu.:27258  
##  Max.   :35.00   Max.   :38045
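The preparation step above might look like the following minimal sketch. The report shows neither its code nor the name of the data file, so the file read is replaced by a stand-in: `salary_head` is a hypothetical name, and it holds only the first eight faculty rows, the ones whose values are all visible in the printed output. In current dplyr, `tbl_df()` is deprecated in favor of `as_tibble()`, so the sketch uses the latter.

```r
library(dplyr)

# Stand-in for the real 52-row file: only the first eight rows are fully
# visible in the printed output, so only those are reproduced here.
salary_head <- data.frame(
  sx = c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L),        # sex: 0 = male, 1 = female
  rk = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),        # rank: 3 = full professor
  yr = c(25L, 13L, 10L, 7L, 19L, 16L, 0L, 16L),  # years in current rank
  dg = c(1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L),        # degree: 0 = masters, 1 = doctorate
  yd = c(35L, 22L, 23L, 27L, 30L, 21L, 32L, 18L),# years since highest degree
  sl = c(36350L, 35350L, 28200L, 26775L, 33696L, 28516L, 24900L, 31909L)
)

# Convert the plain data frame to a tibble, then inspect it.
salary_head <- as_tibble(salary_head)
glimpse(salary_head)
summary(salary_head)
```

Because the stand-in has eight rows rather than 52, its summary statistics will differ from the ones printed above.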

1. Using ggvis, construct the following:

a. Boxplots

For sl (academic year salary) by sx (sex, coded 0 for male and 1 for female)

For sl (academic year salary) by dg (Highest degree, coded 0 for masters and 1 for doctoral degree)

b. Scatterplots of points, with a smooth line among points

Scatterplot: sl (academic year salary) by yd (Number of years since highest degree was earned)

Scatterplot: sl (academic year salary) by yr (Number of years in current rank)

c. Scatterplot of points, plotted with a linear model, and 95% confidence interval for the model, for sl (academic year salary) by yd (Number of years since highest degree was earned).

d. Scatterplot of points of sl (academic year salary) by yr (Number of years in current rank) grouped by rk (academic rank, coded 1 for assistant professor, 2 for associate professor, and 3 for full professor).

In this scatterplot, data points are colored by rank as follows:

  • Assistant professors (1) - blue

  • Associate professors (2) - orange

  • Full professors (3) - green
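The four kinds of plot requested above could be built with ggvis roughly as follows. This is a sketch under stated assumptions, not the report's own code: `salary_head` is a hypothetical stand-in holding only the first eight rows visible in the printed data (all of them full professors), and the object names are illustrative. The blue, orange, and green colors listed above would be consistent with ggvis's default palette for a categorical `fill`, although the report does not say how the colors were chosen.

```r
library(dplyr)
library(ggvis)

# Toy stand-in: the first eight rows visible in the printed data.
salary_head <- data.frame(
  sx = c(0, 0, 0, 1, 0, 0, 1, 0),
  rk = c(3, 3, 3, 3, 3, 3, 3, 3),
  yr = c(25, 13, 10, 7, 19, 16, 0, 16),
  dg = c(1, 1, 1, 1, 0, 1, 0, 1),
  yd = c(35, 22, 23, 27, 30, 21, 32, 18),
  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909)
)

# a. Boxplots of salary by sex and by highest degree (grouping variable as a factor).
box_sex <- salary_head %>% ggvis(~factor(sx), ~sl) %>% layer_boxplots()
box_dg  <- salary_head %>% ggvis(~factor(dg), ~sl) %>% layer_boxplots()

# b. Scatterplots with a smooth line through the points.
smooth_yd <- salary_head %>% ggvis(~yd, ~sl) %>% layer_points() %>% layer_smooths()
smooth_yr <- salary_head %>% ggvis(~yr, ~sl) %>% layer_points() %>% layer_smooths()

# c. Scatterplot with a fitted linear model; se = TRUE adds the confidence
#    band, which is drawn at the 95% level by default.
lm_yd <- salary_head %>%
  ggvis(~yd, ~sl) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE)

# d. Scatterplot of salary by years in rank, with points filled by rank.
by_rank <- salary_head %>%
  ggvis(~yr, ~sl, fill = ~factor(rk)) %>%
  layer_points()
```

Printing any of these objects (for example `by_rank`) in an interactive session renders the plot.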

2. Compute a multiple linear regression with sl as the dependent variable and sx, yr, dg, yd, and a recoded rk variable as independent variables. The recode of the rk variable should result in a new categorical variable that allows a mean salary comparison of full professors with another group composed of both assistant professors and associate professors.

## 
##  1  2  3 
## 18 14 20
## 
##  0  1 
## 32 20
## 
## Call:
## lm(formula = sl ~ sx + yr + dg + yd + rkrecode, data = sexdiscrimination)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6066.3 -1719.5  -452.5   957.8  9826.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17761.82    1429.16  12.428 2.62e-16 ***
## sx           -547.47    1018.44  -0.538  0.59347    
## yr            356.25     109.64   3.249  0.00216 ** 
## dg           -559.33    1204.37  -0.464  0.64454    
## yd             77.37      76.84   1.007  0.31930    
## rkrecode     6856.45    1186.70   5.778 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2880 on 46 degrees of freedom
## Multiple R-squared:  0.7863, Adjusted R-squared:  0.763 
## F-statistic: 33.84 on 5 and 46 DF,  p-value: 2.461e-14
##                   2.5 %     97.5 %
## (Intercept) 14885.07648 20638.5722
## sx          -2597.47771  1502.5290
## yr            135.56889   576.9402
## dg          -2983.60125  1864.9356
## yd            -77.31372   232.0466
## rkrecode     4467.75405  9245.1439
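The recode behind the two frequency tables above can be sketched as follows. The rank counts (18, 14, 20) come from the first table, and coding full professors as 1 is inferred from the second table (32 zeros, 20 ones). The `ifelse()` call itself is an assumption about how the recode was done. The model call is copied from the printed `Call:` and only runs if a data frame named `sexdiscrimination` has been loaded.

```r
# Rank counts taken from the first table in the output above.
rk <- rep(1:3, times = c(18, 14, 20))

# 1 = full professor, 0 = assistant or associate professor.
rkrecode <- ifelse(rk == 3, 1, 0)

table(rk)
table(rkrecode)

# With the full data loaded, the fit and its 95% confidence intervals
# would follow from the same recode applied to the data frame.
if (exists("sexdiscrimination")) {
  sexdiscrimination$rkrecode <- ifelse(sexdiscrimination$rk == 3, 1, 0)
  fit_full <- lm(sl ~ sx + yr + dg + yd + rkrecode, data = sexdiscrimination)
  print(summary(fit_full))
  print(confint(fit_full))
}
```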

3. Compute and report a new regression equation with sl as the dependent variable and sx as the sole independent variable. Then compute a t-test of the difference in mean sl by sx. Describe whether and how the results about the relationship between sl and sx from the regression analysis and from the t-test are similar.

Compute and report a new regression equation with sl as the dependent variable and sx as the sole independent variable.

## 
## Call:
## lm(formula = sexdiscrimination$sl ~ sexdiscrimination$sx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8602.8 -4296.6  -100.8  3513.1 16687.9 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             24697        938  26.330   <2e-16 ***
## sexdiscrimination$sx    -3340       1808  -1.847   0.0706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5782 on 50 degrees of freedom
## Multiple R-squared:  0.0639, Adjusted R-squared:  0.04518 
## F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706
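
Written out with the rounded coefficients from the output above, the fitted regression equation is:

\(\widehat{sl} = 24697 - 3340 \cdot sx\)

The predicted salary is therefore 24697 for men (sx = 0) and 24697 - 3340 = 21357 for women (sx = 1).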

Is sl related to sx?

I read from the findings that:

## Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518

## F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706

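With α = .05, the p-value of 0.0706 is greater than α, so I fail to reject the null hypothesis that there is no relationship between sl (academic year salary) and sx (sex). The R-squared of 0.0639 also indicates that sex alone accounts for only about 6% of the variation in salary. In other words, although the estimated coefficient puts women about 3,340 below men on average, this small sample of university faculty does not provide statistically significant evidence of a relationship between academic year salary and sex.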

Compute a t-test of the difference in mean sl by sx.

For this t-test, I have the following hypotheses:

\(H_{0}: \mu_{Male} - \mu_{Female} = 0\)

Alternatively

\(H_{1}: \mu_{Male} - \mu_{Female} \neq 0\)

I set \(\alpha\) = 0.05. (Note: Male = 0 and Female = 1.)

## 
##  Two Sample t-test
## 
## data:  sl by sx
## t = 1.8474, df = 50, p-value = 0.0706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -291.257 6970.550
## sample estimates:
## mean in group 0 mean in group 1 
##        24696.79        21357.14

Describe whether and how the results about the relationship between sl and sx from the regression analysis and from the t-test are similar.

The regression of sl on sx alone and the t-test of the difference in mean sl by sx give the same result. The regression intercept (24697) is the mean salary for men, and the slope (-3340) is the difference between the women's mean (21357.14) and the men's mean (24696.79). The t statistics have the same magnitude (1.847) on 50 degrees of freedom, with opposite signs only because the t-test subtracts the women's mean from the men's mean, and both yield a p-value of 0.0706. Since that p-value is greater than my α of .05, both tests fail to reject the null hypothesis that there is no difference in mean academic year salary between men and women. In other words, there is no statistically significant relationship between academic year salary and sex in this small sample of university faculty.
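
The agreement between the two analyses is a general property, not a coincidence: a regression on a single 0/1 predictor and a pooled-variance two-sample t-test are the same test. The sketch below checks this on a hypothetical stand-in, `salary_head`, built from the first eight rows visible in the printed data, so its numbers differ from the full-sample results. `var.equal = TRUE` is used because the printed header reads "Two Sample t-test" with 50 degrees of freedom, which indicates the pooled version, not Welch's.

```r
# Toy stand-in: sex and salary for the first eight rows of the printed data.
salary_head <- data.frame(
  sx = c(0, 0, 0, 1, 0, 0, 1, 0),
  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909)
)

fit <- lm(sl ~ sx, data = salary_head)
tt  <- t.test(sl ~ sx, data = salary_head, var.equal = TRUE)

# Coefficient table row for sx: estimate, standard error, t value, p-value.
slope_row <- coef(summary(fit))["sx", ]

# Women's mean minus men's mean.
mean_diff <- mean(salary_head$sl[salary_head$sx == 1]) -
  mean(salary_head$sl[salary_head$sx == 0])

# The slope equals the difference in group means.
c(slope = unname(coef(fit)["sx"]), mean_diff = mean_diff)

# The t statistics agree in magnitude (signs differ with the direction of
# subtraction), and the p-values are identical.
c(lm_t = unname(slope_row["t value"]), ttest_t = unname(tt$statistic))
c(lm_p = unname(slope_row["Pr(>|t|)"]), ttest_p = tt$p.value)
```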

Task 2

Analyze whether there are associations among U.S. state-level indicators.

I first read the state-level indicators data file into R and then convert the data frame to a tbl_df (dplyr's table data frame) by using the dplyr function tbl_df.

## Observations: 50
## Variables: 7
## $ stateNames (fctr) Alabama, Alaska, Arizona, Arkansas, California, Co...
## $ Population (int) 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277...
## $ Income     (int) 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 481...
## $ Illiteracy (dbl) 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1...
## $ LifeExp    (dbl) 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70...
## $ Murder     (dbl) 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 1...
## $ HSGrad     (dbl) 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52....
##       stateNames   Population        Income       Illiteracy   
##  Alabama   : 1   Min.   :  365   Min.   :3098   Min.   :0.500  
##  Alaska    : 1   1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625  
##  Arizona   : 1   Median : 2838   Median :4519   Median :0.950  
##  Arkansas  : 1   Mean   : 4246   Mean   :4436   Mean   :1.170  
##  California: 1   3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575  
##  Colorado  : 1   Max.   :21198   Max.   :6315   Max.   :2.800  
##  (Other)   :44                                                 
##     LifeExp          Murder           HSGrad     
##  Min.   :67.96   Min.   : 1.400   Min.   :37.80  
##  1st Qu.:70.12   1st Qu.: 4.350   1st Qu.:48.05  
##  Median :70.67   Median : 6.850   Median :53.25  
##  Mean   :70.88   Mean   : 7.378   Mean   :53.11  
##  3rd Qu.:71.89   3rd Qu.:10.675   3rd Qu.:59.15  
##  Max.   :73.60   Max.   :15.100   Max.   :67.30  
## 
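
The printed values (for example, Alabama's population of 3615 and Alaska's income of 6315) match R's built-in `state.x77` matrix, so a self-contained stand-in for the unnamed data file can be assembled from it. That the report's file comes from the same source is an assumption, and `states` is a hypothetical name.

```r
library(dplyr)

# Assemble the six indicators plus state names from the built-in datasets.
# state.x77 uses "Life Exp" and "HS Grad" as column names; they are renamed
# here to match the names shown in the report.
states <- data.frame(
  stateNames = factor(state.name),
  Population = unname(state.x77[, "Population"]),
  Income     = unname(state.x77[, "Income"]),
  Illiteracy = unname(state.x77[, "Illiteracy"]),
  LifeExp    = unname(state.x77[, "Life Exp"]),
  Murder     = unname(state.x77[, "Murder"]),
  HSGrad     = unname(state.x77[, "HS Grad"])
) %>%
  as_tibble()

glimpse(states)
summary(states)
```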

1. Compute and report correlations among these six variables and plot a correlogram representing these correlations.

Then I load the R packages I’ll need, as well as the cormat functions.

Next, I estimate Pearson product-moment correlations for the six numeric variables in the state-level indicators dataset we were provided. First, I must select all but the stateNames variable.

## Source: local data frame [50 x 6]
## 
##    Population Income Illiteracy LifeExp Murder HSGrad
##         (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl)
## 1        3615   3624        2.1   69.05   15.1   41.3
## 2         365   6315        1.5   69.31   11.3   66.7
## 3        2212   4530        1.8   70.55    7.8   58.1
## 4        2110   3378        1.9   70.66   10.1   39.9
## 5       21198   5114        1.1   71.71   10.3   62.6
## 6        2541   4884        0.7   72.06    6.8   63.9
## 7        3100   5348        1.1   72.48    3.1   56.0
## 8         579   4809        0.9   70.06    6.2   54.6
## 9        8277   4815        1.3   70.66   10.7   52.6
## 10       4931   4091        2.0   68.54   13.9   40.6
## ..        ...    ...        ...     ...    ...    ...

## $r
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp          1                                           
## Income        0.34      1                                    
## HSGrad        0.58   0.62      1                             
## Population  -0.068   0.21 -0.098          1                  
## Illiteracy   -0.59  -0.44  -0.66       0.11          1       
## Murder       -0.78  -0.23  -0.49       0.34        0.7      1
## 
## $p
##            LifeExp  Income  HSGrad Population Illiteracy Murder
## LifeExp          0                                             
## Income       0.016       0                                     
## HSGrad     9.2e-06 1.6e-06       0                             
## Population    0.64    0.15     0.5          0                  
## Illiteracy   7e-06  0.0015 2.2e-07       0.46          0       
## Murder     2.3e-11    0.11 0.00032      0.015    1.3e-08      0
## 
## $sym
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp    1                                                 
## Income     .       1                                         
## HSGrad     .       ,      1                                  
## Population                       1                           
## Illiteracy .       .      ,                 1                
## Murder     ,              .      .          ,          1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
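
The correlations above can be approximated with base R alone. This sketch does not use the report's cormat functions. It assumes that the data match the built-in `state.x77` matrix, and `states_num` is a hypothetical name. The correlogram is drawn only if the corrplot package happens to be installed, and whether the report used corrplot is not stated.

```r
# Six numeric indicators, in the column order used in the output above.
states_num <- data.frame(
  LifeExp    = state.x77[, "Life Exp"],
  Income     = state.x77[, "Income"],
  HSGrad     = state.x77[, "HS Grad"],
  Population = state.x77[, "Population"],
  Illiteracy = state.x77[, "Illiteracy"],
  Murder     = state.x77[, "Murder"]
)

# Pearson correlation matrix (Pearson is the default method of cor()).
r <- cor(states_num)
round(r, 2)

# p-value for a single pair; cor.test() handles one pair at a time.
p_life_murder <- cor.test(states_num$LifeExp, states_num$Murder)$p.value
p_life_murder

# Symbolic view of the matrix, using the same cutpoints as the legend above.
symnum(r)

# Optional correlogram, only if corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(r, method = "circle", type = "lower")
}
```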

2. Using ggvis, construct:

a. Plots that demonstrate the relationship between

i. HSGrad and Income

ii. Illiteracy and Income

b. A scatterplot of Murder by Illiteracy grouped by HSGrad

## 
## 37.8 38.5 39.9 40.6   41 41.3 41.6 41.8 42.2 46.4 47.4 47.8 48.8 50.2 50.3 
##    1    2    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 51.6 52.3 52.5 52.6 52.7 52.8 52.9 53.2 53.3 54.5 54.6 54.7 55.2   56 57.1 
##    1    1    1    2    1    1    1    1    1    1    1    1    1    1    1 
## 57.6 58.1 58.5   59 59.2 59.3 59.5 59.9   60 61.9 62.6 62.9 63.5 63.9 65.2 
##    2    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 66.7 67.3 
##    1    1

## 
## 30 40 50 60 
##  4 10 28  8
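
The second table above groups HSGrad into four bins. Its counts (4, 10, 28, 8) are what right-closed decade intervals produce, with the single state at exactly 60 falling in the 50 group. The sketch below reproduces that grouping with `cut()` and builds the requested plots. The use of `cut()`, the name `HSGradGroup`, the other object names, and the use of `state.x77` as a stand-in for the report's data file are all assumptions.

```r
library(dplyr)
library(ggvis)

# Stand-in data from the built-in state.x77 matrix.
states <- data.frame(
  Income     = unname(state.x77[, "Income"]),
  Illiteracy = unname(state.x77[, "Illiteracy"]),
  Murder     = unname(state.x77[, "Murder"]),
  HSGrad     = unname(state.x77[, "HS Grad"])
)

# Right-closed bins (30,40], (40,50], (50,60], (60,70], labeled by lower edge.
states$HSGradGroup <- cut(
  states$HSGrad,
  breaks = c(30, 40, 50, 60, 70),
  labels = c("30", "40", "50", "60")
)
table(states$HSGradGroup)

# a. Relationship of Income with HSGrad and with Illiteracy.
hs_income  <- states %>% ggvis(~HSGrad, ~Income) %>%
  layer_points() %>% layer_smooths()
ill_income <- states %>% ggvis(~Illiteracy, ~Income) %>%
  layer_points() %>% layer_smooths()

# b. Murder by Illiteracy, with points filled by HSGrad group.
murder_ill <- states %>%
  ggvis(~Illiteracy, ~Murder, fill = ~HSGradGroup) %>%
  layer_points()
```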