Week 5 Examples

This week we look at a variety of hypothesis tests that we can do. To illustrate various tests, we will use Bank Salaries dataset which contains the following variables:

Employee - just a observation count

Education - level of education with 1-HS, 2-Some College, 3-Bachelors, 4-Some Graduate, 5-Graduate degree

Grade - job grade coded from 1 to 6, 6 being the highest

Years 1 - years of experience with the bank

Years 2 - years of experience prior to joining the bank

Age

Salary

Gender

PC job - yes if the job is primarily computer-related.

NOTE: source is real data from a bank from 1995 (hence the PC job variable :) ).

Let’s just get familiar with data

First, let’s load the data

library(readxl)
Bank_Salaries <- read_excel("C:/Users/atomi/Downloads/Bank Salaries.xlsx")
View(Bank_Salaries)
library(psych)
describe(Bank_Salaries)
##           vars   n     mean       sd  median  trimmed     mad   min   max range
## Employee     1 208   104.50    60.19   104.5   104.50   77.10     1   208   207
## Education    2 208     3.16     1.47     3.0     3.20    1.48     1     5     4
## Grade        3 208     2.76     1.57     3.0     2.62    1.48     1     6     5
## Years1       4 208     9.67     6.99     8.0     8.58    5.93     2    39    37
## Years2       5 208     2.38     3.14     1.0     1.80    1.48     0    18    18
## Age          6 208    40.39    10.32    38.5    39.78   11.12    22    65    43
## Salary       7 208 39921.92 11256.15 37000.0 38099.70 8154.30 26700 97000 70300
## Gender*      8 208     1.33     0.47     1.0     1.29    0.00     1     2     1
## PC Job*      9 208     1.09     0.29     1.0     1.00    0.00     1     2     1
##           skew kurtosis     se
## Employee  0.00    -1.22   4.17
## Education 0.00    -1.33   0.10
## Grade     0.53    -0.82   0.11
## Years1    1.65     3.15   0.48
## Years2    1.59     2.80   0.22
## Age       0.46    -0.73   0.72
## Salary    2.43     8.34 780.47
## Gender*   0.73    -1.47   0.03
## PC Job*   2.82     5.96   0.02

So, notice, we have a problem. R is treating Education and Grade as continuous variables. Let’s fix that

Bank_Salaries$Education<-as.factor(Bank_Salaries$Education)
Bank_Salaries$Grade<-as.factor(Bank_Salaries$Grade)
describe(Bank_Salaries)
##            vars   n     mean       sd  median  trimmed     mad   min   max
## Employee      1 208   104.50    60.19   104.5   104.50   77.10     1   208
## Education*    2 208     3.16     1.47     3.0     3.20    1.48     1     5
## Grade*        3 208     2.76     1.57     3.0     2.62    1.48     1     6
## Years1        4 208     9.67     6.99     8.0     8.58    5.93     2    39
## Years2        5 208     2.38     3.14     1.0     1.80    1.48     0    18
## Age           6 208    40.39    10.32    38.5    39.78   11.12    22    65
## Salary        7 208 39921.92 11256.15 37000.0 38099.70 8154.30 26700 97000
## Gender*       8 208     1.33     0.47     1.0     1.29    0.00     1     2
## PC Job*       9 208     1.09     0.29     1.0     1.00    0.00     1     2
##            range skew kurtosis     se
## Employee     207 0.00    -1.22   4.17
## Education*     4 0.00    -1.33   0.10
## Grade*         5 0.53    -0.82   0.11
## Years1        37 1.65     3.15   0.48
## Years2        18 1.59     2.80   0.22
## Age           43 0.46    -0.73   0.72
## Salary     70300 2.43     8.34 780.47
## Gender*        1 0.73    -1.47   0.03
## PC Job*        1 2.82     5.96   0.02

Notice, R now puts an “*” next to those two variables as it had for Gender and PC Job. For Education and Grade, there is really not much more that we should do. Now, for Gender and PC Job, it sure looks like it is using 1 and 2, giving means between 1 and 2. It may be more convenient to code options as 0 and 1 so that when we use “describe” command, the mean will give us the proportion of observations in each group.

Bank_Salaries$Female<-ifelse(Bank_Salaries$Gender=="Male", 0, 1)
describe(Bank_Salaries$Female)
##    vars   n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 208 0.67 0.47      1    0.71   0   0   1     1 -0.73    -1.47 0.03
table(Bank_Salaries$Gender)
## 
## Female   Male 
##    140     68
(sum(Bank_Salaries$Gender=="Female")/length(Bank_Salaries$Gender)) - mean(Bank_Salaries$Female)
## [1] 0
#seems to check out
# now for PC Job
Bank_Salaries$PCJobYes<-ifelse(Bank_Salaries$`PC Job`=="Yes",1,0)

(sum(Bank_Salaries$`PC Job`=="Yes")/length(Bank_Salaries$`PC Job`))-mean(Bank_Salaries$PCJobYes)
## [1] 0
#seems to check out

This is not necessary for regression analysis, but it is a nice break.

Test for population mean

OK, now let’s look for some population tests. Salary mean is around $39K. Let’s test for population mean being $40K

t.test((Bank_Salaries$Salary), mu=40000 )
## 
##  One Sample t-test
## 
## data:  (Bank_Salaries$Salary)
## t = -0.10004, df = 207, p-value = 0.9204
## alternative hypothesis: true mean is not equal to 40000
## 95 percent confidence interval:
##  38383.23 41460.62
## sample estimates:
## mean of x 
##  39921.92

Note, the confidence interval tells us that, based on this sample, we can conclude that average salary in banking is between $38,383 and 41,460. Let’s look at 99% confidence interval

t.test((Bank_Salaries$Salary), mu=40000, conf.level=.99)
## 
##  One Sample t-test
## 
## data:  (Bank_Salaries$Salary)
## t = -0.10004, df = 207, p-value = 0.9204
## alternative hypothesis: true mean is not equal to 40000
## 99 percent confidence interval:
##  37892.86 41950.99
## sample estimates:
## mean of x 
##  39921.92

Still fail to reject the null.

Test of variance

Let’s test whether population standard deviation is $13,000. For this, we will have to construct the Chi-squared variable and use critical value approach.

popvar<-13000^2
samvar<-var(Bank_Salaries$Salary)
n<-length(Bank_Salaries$Salary)
chi_squared_stat<-(n-1)*samvar/popvar

p_value<-pchisq(chi_squared_stat, df=n-1, lower.tail=FALSE)
p_value
## [1] 0.9971359
#to make it prettier :)
cat("Chi-squared statistic:", chi_squared_stat)
## Chi-squared statistic: 155.19
cat("p-value", p_value)
## p-value 0.9971359

Test of Proportion

Let’s test the hypothesis that 70% of bank employees are female

prop.test(sum(Bank_Salaries$Gender=="Female"), length(Bank_Salaries$Gender), p=0.7)
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(Bank_Salaries$Gender == "Female") out of length(Bank_Salaries$Gender), null probability 0.7
## X-squared = 0.59547, df = 1, p-value = 0.4403
## alternative hypothesis: true p is not equal to 0.7
## 95 percent confidence interval:
##  0.6042159 0.7354266
## sample estimates:
##         p 
## 0.6730769

We fail to reject the null and see that 95% confidence interval is between roughly 60% and 74% of bank workforce being female.

Test of two populations

Now, let’s test whether average wage is different between the two genders.

gender_avg_salary_test<-t.test(Bank_Salaries$Salary[Bank_Salaries$Gender=="Male"], Bank_Salaries$Salary[Bank_Salaries$Gender=="Female"])
gender_avg_salary_test
## 
##  Welch Two Sample t-test
## 
## data:  Bank_Salaries$Salary[Bank_Salaries$Gender == "Male"] and Bank_Salaries$Salary[Bank_Salaries$Gender == "Female"]
## t = 4.141, df = 78.898, p-value = 8.604e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4308.082 12282.943
## sample estimates:
## mean of x mean of y 
##  45505.44  37209.93
gender_avg_salary_testEV<-t.test(Bank_Salaries$Salary[Bank_Salaries$Gender=="Male"], Bank_Salaries$Salary[Bank_Salaries$Gender=="Female"], var.equal = TRUE)
gender_avg_salary_testEV
## 
##  Two Sample t-test
## 
## data:  Bank_Salaries$Salary[Bank_Salaries$Gender == "Male"] and Bank_Salaries$Salary[Bank_Salaries$Gender == "Female"]
## t = 5.3024, df = 206, p-value = 2.935e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   5211.041 11379.984
## sample estimates:
## mean of x mean of y 
##  45505.44  37209.93

As we can see, means are not equal, and men are paid more than women. We can model this further with dummy variables and interactions.

let’s test if there is a premium for PC jobs.

PCjob_avg_salary_test<-t.test(Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="Yes"], Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="No"], alternative="greater")
PCjob_avg_salary_test
## 
##  Welch Two Sample t-test
## 
## data:  Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "Yes"] and Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "No"]
## t = 0.26008, df = 33.648, p-value = 0.3982
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -2321.735       Inf
## sample estimates:
## mean of x mean of y 
##  40305.26  39883.39
#notice, default is unequal variance

PCjob_avg_salary_testEV<-t.test(Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="Yes"], Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="No"],alternative="greater", var.equal = TRUE)
PCjob_avg_salary_testEV
## 
##  Two Sample t-test
## 
## data:  Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "Yes"] and Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "No"]
## t = 0.15536, df = 206, p-value = 0.4383
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -4064.8     Inf
## sample estimates:
## mean of x mean of y 
##  40305.26  39883.39

Apparently, there was no premium to doing computer job in 1995 :)

Let’s test if younger workers are more likely to hold a computer job

Test of Proportion

So, we see that men are paid more than women. Let’s now see if women are under-represented in highest-grade jobs and over-represented in lowest-grade jobs.

Grade6Gender<-Bank_Salaries$Gender[Bank_Salaries$Grade==6]
num_females_grade6<-sum(Grade6Gender=="Female")
total_grade6<-length(Grade6Gender)

percent_females_grade6 <-num_females_grade6/total_grade6

prop_test_grade6gender<-prop.test(num_females_grade6, total_grade6, p=.5, alternative="less")
prop_test_grade6gender
## 
##  1-sample proportions test with continuity correction
## 
## data:  num_females_grade6 out of total_grade6, null probability 0.5
## X-squared = 8.6429, df = 1, p-value = 0.001642
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.3105581
## sample estimates:
##          p 
## 0.07142857

We see that in this sample there is 7% of people in grade 6 that are female. So, in the population, 95% confidence interval is between 0.3% and 36% (based on the two-tailed test), so we can reject the null that women are equally represented in grade 6 jobs. Above are results of a one-tailed test.

Lets look at grade 1 jobs. Here we are testing for over-representation, i.e. p>50%

Grade1Gender<-Bank_Salaries$Gender[Bank_Salaries$Grade==1]
num_females_grade1<-sum(Grade1Gender=="Female")
total_grade1<-length(Grade1Gender)

percent_females_grade1 <-num_females_grade1/total_grade1

prop_test_grade1gender<-prop.test(num_females_grade1, total_grade1, p=.5, alternative="greater")
prop_test_grade1gender
## 
##  1-sample proportions test with continuity correction
## 
## data:  num_females_grade1 out of total_grade1, null probability 0.5
## X-squared = 20.417, df = 1, p-value = 3.114e-06
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.6937909 1.0000000
## sample estimates:
##   p 
## 0.8

We can reject the null very strongly and can see that proportion in the sample is 0.8. 95% confidence interval (based on two-sided test) is between 67% and 89%

Variance Test

Let’s test variance for salary between men and women.

salary_var_test<-var.test(Bank_Salaries$Salary[Bank_Salaries$Gender=="Male"], Bank_Salaries$Salary[Bank_Salaries$Gender=="Female"])
salary_var_test
## 
##  F test to compare two variances
## 
## data:  Bank_Salaries$Salary[Bank_Salaries$Gender == "Male"] and Bank_Salaries$Salary[Bank_Salaries$Gender == "Female"]
## F = 5.5735, num df = 67, denom df = 139, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  3.736535 8.570519
## sample estimates:
## ratio of variances 
##            5.57352

Maybe we could have run this before conducting t-test for salaries. Based on results of this test, we should use the unequal variance (default) version of the t-test.

ANOVA work

Now, let’s see if there is difference in salaries between grades.

sal_grade_anova<-aov(Bank_Salaries$Salary ~ Bank_Salaries$Grade)
summary(sal_grade_anova)
##                      Df    Sum Sq   Mean Sq F value Pr(>F)    
## Bank_Salaries$Grade   5 1.850e+10 3.699e+09   96.64 <2e-16 ***
## Residuals           202 7.732e+09 3.828e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, there seems to be a difference. Let’s see if it is between all grades or just some.

tukeys_anova<-TukeyHSD(sal_grade_anova)
tukeys_anova
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Bank_Salaries$Salary ~ Bank_Salaries$Grade)
## 
## $`Bank_Salaries$Grade`
##          diff        lwr       upr     p adj
## 2-1  2329.357 -1252.0561  5910.770 0.4228303
## 3-1  6329.484  2772.6522  9886.317 0.0000105
## 4-1 11816.976  7742.7858 15891.167 0.0000000
## 5-1 17993.405 13479.9212 22506.888 0.0000000
## 6-1 35664.833 30381.2223 40948.444 0.0000000
## 3-2  4000.127   138.1901  7862.065 0.0375396
## 4-2  9487.619  5144.5175 13830.721 0.0000000
## 5-2 15664.048 10906.4182 20421.677 0.0000000
## 6-2 33335.476 27841.8390 38829.113 0.0000000
## 4-3  5487.492  1164.6378  9810.346 0.0044146
## 5-3 11663.920  6924.7672 16403.073 0.0000000
## 6-3 29335.349 23857.7048 34812.993 0.0000000
## 5-4  6176.429  1037.6015 11315.256 0.0085767
## 6-4 23847.857 18020.9750 29674.739 0.0000000
## 6-5 17671.429 11529.3555 23813.502 0.0000000

ANCOVA

Now, let’s see if differences hold when we take into account time at the bank.

sal_grade_exp1_ancova<-aov(Bank_Salaries$Salary~Bank_Salaries$Grade+Bank_Salaries$Years1+Bank_Salaries$Grade*Bank_Salaries$Years1)
summary(sal_grade_exp1_ancova)
##                                           Df    Sum Sq   Mean Sq F value
## Bank_Salaries$Grade                        5 1.850e+10 3.699e+09 122.210
## Bank_Salaries$Years1                       1 8.060e+08 8.060e+08  26.628
## Bank_Salaries$Grade:Bank_Salaries$Years1   5 9.937e+08 1.987e+08   6.566
## Residuals                                196 5.932e+09 3.027e+07        
##                                            Pr(>F)    
## Bank_Salaries$Grade                       < 2e-16 ***
## Bank_Salaries$Years1                     6.02e-07 ***
## Bank_Salaries$Grade:Bank_Salaries$Years1 1.13e-05 ***
## Residuals                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukeys_ancova<-TukeyHSD(sal_grade_exp1_ancova, "Bank_Salaries$Grade")
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Years1
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Grade, Bank_Salaries$Years1
tukeys_ancova
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Bank_Salaries$Salary ~ Bank_Salaries$Grade + Bank_Salaries$Years1 + Bank_Salaries$Grade * Bank_Salaries$Years1)
## 
## $`Bank_Salaries$Grade`
##          diff        lwr       upr     p adj
## 2-1  2329.357  -856.2950  5515.009 0.2892672
## 3-1  6329.484  3165.6971  9493.272 0.0000005
## 4-1 11816.976  8193.0009 15440.951 0.0000000
## 5-1 17993.405 13978.6801 22008.129 0.0000000
## 6-1 35664.833 30965.0834 40364.583 0.0000000
## 3-2  4000.127   564.9503  7435.304 0.0122233
## 4-2  9487.619  5624.4484 13350.790 0.0000000
## 5-2 15664.048 11432.1562 19895.939 0.0000000
## 6-2 33335.476 28448.9089 38222.043 0.0000000
## 4-3  5487.492  1642.3313  9332.652 0.0008238
## 5-3 11663.920  7448.4635 15879.377 0.0000000
## 6-3 29335.349 24463.0074 34207.690 0.0000000
## 5-4  6176.429  1605.4634 10747.394 0.0018918
## 6-4 23847.857 18664.8699 29030.844 0.0000000
## 6-5 17671.429 12208.0803 23134.777 0.0000000
sal_grade_expinter_ancova<-aov(Bank_Salaries$Salary~ Bank_Salaries$Grade*Bank_Salaries$Years1)
summary(sal_grade_expinter_ancova)
##                                           Df    Sum Sq   Mean Sq F value
## Bank_Salaries$Grade                        5 1.850e+10 3.699e+09 122.210
## Bank_Salaries$Years1                       1 8.060e+08 8.060e+08  26.628
## Bank_Salaries$Grade:Bank_Salaries$Years1   5 9.937e+08 1.987e+08   6.566
## Residuals                                196 5.932e+09 3.027e+07        
##                                            Pr(>F)    
## Bank_Salaries$Grade                       < 2e-16 ***
## Bank_Salaries$Years1                     6.02e-07 ***
## Bank_Salaries$Grade:Bank_Salaries$Years1 1.13e-05 ***
## Residuals                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Grade is very significant, but so are years at the bank and the interaction term. This means that both grade and time with the bank make a difference, but also that effect of grade on salary changes with number of years at the bank.

What about years outside of the bank?

sal_grade_exp2_ancova<-aov(Bank_Salaries$Salary~Bank_Salaries$Grade*Bank_Salaries$Years2)

summary(sal_grade_exp2_ancova)
##                                           Df    Sum Sq   Mean Sq F value Pr(>F)
## Bank_Salaries$Grade                        5 1.850e+10 3.699e+09  97.733 <2e-16
## Bank_Salaries$Years2                       1 3.325e+07 3.325e+07   0.878  0.350
## Bank_Salaries$Grade:Bank_Salaries$Years2   5 2.807e+08 5.613e+07   1.483  0.197
## Residuals                                196 7.418e+09 3.785e+07               
##                                             
## Bank_Salaries$Grade                      ***
## Bank_Salaries$Years2                        
## Bank_Salaries$Grade:Bank_Salaries$Years2    
## Residuals                                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukeys_ancova_years2<-TukeyHSD(sal_grade_exp2_ancova, "Bank_Salaries$Grade")
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Years2
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Grade, Bank_Salaries$Years2
tukeys_ancova_years2
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Bank_Salaries$Salary ~ Bank_Salaries$Grade * Bank_Salaries$Years2)
## 
## $`Bank_Salaries$Grade`
##          diff        lwr       upr     p adj
## 2-1  2329.357 -1232.9404  5891.655 0.4163407
## 3-1  6329.484  2791.6367  9867.332 0.0000094
## 4-1 11816.976  7764.5317 15869.421 0.0000000
## 5-1 17993.405 13504.0118 22482.798 0.0000000
## 6-1 35664.833 30409.4234 40920.243 0.0000000
## 3-2  4000.127   158.8031  7841.452 0.0358770
## 4-2  9487.619  5167.6987 13807.539 0.0000000
## 5-2 15664.048 10931.8120 20396.283 0.0000000
## 6-2 33335.476 27871.1611 38799.791 0.0000000
## 4-3  5487.492  1187.7109  9787.272 0.0041318
## 5-3 11663.920  6950.0623 16377.778 0.0000000
## 6-3 29335.349 23886.9416 34783.756 0.0000000
## 5-4  6176.429  1065.0299 11287.827 0.0080766
## 6-4 23847.857 18052.0758 29643.638 0.0000000
## 6-5 17671.429 11562.1386 23780.719 0.0000000

Apparently, years outside of the bank do not matter. Grade is still important, but years outside of the bank do not seem to matter, and the effect of grade does not change with years of experience outside of the bank.