This week we look at a variety of hypothesis tests. To illustrate them, we will use the Bank Salaries dataset, which contains the following variables:
Employee - just an observation count
Education - level of education, coded 1 = HS, 2 = Some College, 3 = Bachelors, 4 = Some Graduate, 5 = Graduate degree
Grade - job grade coded from 1 to 6, 6 being the highest
Years 1 - years of experience with the bank
Years 2 - years of experience prior to joining the bank
Age
Salary
Gender
PC Job - Yes if the job is primarily computer-related.
NOTE: source is real data from a bank from 1995 (hence the PC job variable :) ).
Let's get familiar with the data.
First, let's load it:
library(readxl)
Bank_Salaries <- read_excel("C:/Users/atomi/Downloads/Bank Salaries.xlsx")
View(Bank_Salaries)
library(psych)
describe(Bank_Salaries)
## vars n mean sd median trimmed mad min max range
## Employee 1 208 104.50 60.19 104.5 104.50 77.10 1 208 207
## Education 2 208 3.16 1.47 3.0 3.20 1.48 1 5 4
## Grade 3 208 2.76 1.57 3.0 2.62 1.48 1 6 5
## Years1 4 208 9.67 6.99 8.0 8.58 5.93 2 39 37
## Years2 5 208 2.38 3.14 1.0 1.80 1.48 0 18 18
## Age 6 208 40.39 10.32 38.5 39.78 11.12 22 65 43
## Salary 7 208 39921.92 11256.15 37000.0 38099.70 8154.30 26700 97000 70300
## Gender* 8 208 1.33 0.47 1.0 1.29 0.00 1 2 1
## PC Job* 9 208 1.09 0.29 1.0 1.00 0.00 1 2 1
## skew kurtosis se
## Employee 0.00 -1.22 4.17
## Education 0.00 -1.33 0.10
## Grade 0.53 -0.82 0.11
## Years1 1.65 3.15 0.48
## Years2 1.59 2.80 0.22
## Age 0.46 -0.73 0.72
## Salary 2.43 8.34 780.47
## Gender* 0.73 -1.47 0.03
## PC Job* 2.82 5.96 0.02
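If we want to double-check how R stored each column, a quick generic check (not in the original output) is str(), which prints the class of every variable:
# show the storage type (numeric, character, factor, ...) of every column
str(Bank_Salaries)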
So, notice, we have a problem: R is treating Education and Grade as continuous variables. Let's fix that.
Bank_Salaries$Education<-as.factor(Bank_Salaries$Education)
Bank_Salaries$Grade<-as.factor(Bank_Salaries$Grade)
describe(Bank_Salaries)
## vars n mean sd median trimmed mad min max
## Employee 1 208 104.50 60.19 104.5 104.50 77.10 1 208
## Education* 2 208 3.16 1.47 3.0 3.20 1.48 1 5
## Grade* 3 208 2.76 1.57 3.0 2.62 1.48 1 6
## Years1 4 208 9.67 6.99 8.0 8.58 5.93 2 39
## Years2 5 208 2.38 3.14 1.0 1.80 1.48 0 18
## Age 6 208 40.39 10.32 38.5 39.78 11.12 22 65
## Salary 7 208 39921.92 11256.15 37000.0 38099.70 8154.30 26700 97000
## Gender* 8 208 1.33 0.47 1.0 1.29 0.00 1 2
## PC Job* 9 208 1.09 0.29 1.0 1.00 0.00 1 2
## range skew kurtosis se
## Employee 207 0.00 -1.22 4.17
## Education* 4 0.00 -1.33 0.10
## Grade* 5 0.53 -0.82 0.11
## Years1 37 1.65 3.15 0.48
## Years2 18 1.59 2.80 0.22
## Age 43 0.46 -0.73 0.72
## Salary 70300 2.43 8.34 780.47
## Gender* 1 0.73 -1.47 0.03
## PC Job* 1 2.82 5.96 0.02
Notice that R now puts an "*" next to those two variables, just as it did for Gender and PC Job. For Education and Grade there is not much more we need to do. For Gender and PC Job, though, describe is clearly coding the two categories as 1 and 2, which is why the means fall between 1 and 2. It is more convenient to code a binary variable as 0 and 1, so that the mean reported by describe is simply the proportion of observations in the "1" group.
Bank_Salaries$Female<-ifelse(Bank_Salaries$Gender=="Male", 0, 1)
describe(Bank_Salaries$Female)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 208 0.67 0.47 1 0.71 0 0 1 1 -0.73 -1.47 0.03
table(Bank_Salaries$Gender)
##
## Female Male
## 140 68
(sum(Bank_Salaries$Gender=="Female")/length(Bank_Salaries$Gender)) - mean(Bank_Salaries$Female)
## [1] 0
#seems to check out
# now for PC Job
Bank_Salaries$PCJobYes<-ifelse(Bank_Salaries$`PC Job`=="Yes",1,0)
(sum(Bank_Salaries$`PC Job`=="Yes")/length(Bank_Salaries$`PC Job`))-mean(Bank_Salaries$PCJobYes)
## [1] 0
#seems to check out
This recoding is not strictly necessary for regression analysis, since R creates dummy variables from factor and character predictors automatically (see the sketch below), but it is a nice exercise and it makes the describe output easier to read.
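As a minimal sketch of that point (these regressions are not part of the analysis below), R builds the dummy for us when the predictor is a factor or character variable, so regressing Salary on Gender and on our 0/1 Female variable gives the same fit, just with a different baseline category:
# R converts the character Gender column to a factor and codes it internally,
# so these two fits are equivalent (only the intercept/baseline label differs)
summary(lm(Salary ~ Gender, data = Bank_Salaries))
summary(lm(Salary ~ Female, data = Bank_Salaries))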
OK, now let's test some hypotheses about population parameters. The sample mean salary is around $39K; let's test whether the population mean is $40K.
t.test((Bank_Salaries$Salary), mu=40000 )
##
## One Sample t-test
##
## data: (Bank_Salaries$Salary)
## t = -0.10004, df = 207, p-value = 0.9204
## alternative hypothesis: true mean is not equal to 40000
## 95 percent confidence interval:
## 38383.23 41460.62
## sample estimates:
## mean of x
## 39921.92
Note that the confidence interval tells us that, based on this sample, the bank's average salary is between about $38,383 and $41,461, with 95% confidence. Let's look at the 99% confidence interval.
t.test((Bank_Salaries$Salary), mu=40000, conf.level=.99)
##
## One Sample t-test
##
## data: (Bank_Salaries$Salary)
## t = -0.10004, df = 207, p-value = 0.9204
## alternative hypothesis: true mean is not equal to 40000
## 99 percent confidence interval:
## 37892.86 41950.99
## sample estimates:
## mean of x
## 39921.92
We still fail to reject the null.
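As a sanity check (just a sketch, not needed for the analysis), the confidence intervals reported by t.test can be reproduced by hand from mean plus or minus the t critical value times the standard error:
x <- Bank_Salaries$Salary
n <- length(x)
# 95% interval: mean +/- t critical value * standard error
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
# 99% interval: same formula with the 0.995 quantile
mean(x) + c(-1, 1) * qt(0.995, df = n - 1) * sd(x) / sqrt(n)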
Let's test whether the population standard deviation is $13,000. Base R has no built-in one-sample variance test, so we construct the chi-squared test statistic ourselves and compute the p-value from the chi-squared distribution.
popvar<-13000^2
samvar<-var(Bank_Salaries$Salary)
n<-length(Bank_Salaries$Salary)
chi_squared_stat<-(n-1)*samvar/popvar
p_value<-pchisq(chi_squared_stat, df=n-1, lower.tail=FALSE)
p_value
## [1] 0.9971359
#to make it prettier :)
cat("Chi-squared statistic:", chi_squared_stat)
## Chi-squared statistic: 155.19
cat("p-value", p_value)
## p-value 0.9971359
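One caveat: pchisq(..., lower.tail = FALSE) gives only the upper-tail probability. If we read the hypothesis as two-sided (the standard deviation is $13,000 versus it is not), a common sketch is to double the smaller of the two tail probabilities:
# two-sided p-value for the variance test: twice the smaller tail probability
lower_tail <- pchisq(chi_squared_stat, df = n - 1, lower.tail = TRUE)
upper_tail <- pchisq(chi_squared_stat, df = n - 1, lower.tail = FALSE)
2 * min(lower_tail, upper_tail)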
Let’s test the hypothesis that 70% of bank employees are female
prop.test(sum(Bank_Salaries$Gender=="Female"), length(Bank_Salaries$Gender), p=0.7)
##
## 1-sample proportions test with continuity correction
##
## data: sum(Bank_Salaries$Gender == "Female") out of length(Bank_Salaries$Gender), null probability 0.7
## X-squared = 0.59547, df = 1, p-value = 0.4403
## alternative hypothesis: true p is not equal to 0.7
## 95 percent confidence interval:
## 0.6042159 0.7354266
## sample estimates:
## p
## 0.6730769
We fail to reject the null, and the 95% confidence interval says that roughly 60% to 74% of the bank's workforce is female.
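prop.test uses a normal approximation with a continuity correction. As a cross-check (a sketch, same data and same null), base R also has the exact binomial test:
# exact binomial test of H0: p = 0.7
binom.test(sum(Bank_Salaries$Gender == "Female"), length(Bank_Salaries$Gender), p = 0.7)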
Now, let’s test whether average wage is different between the two genders.
gender_avg_salary_test<-t.test(Bank_Salaries$Salary[Bank_Salaries$Gender=="Male"], Bank_Salaries$Salary[Bank_Salaries$Gender=="Female"])
gender_avg_salary_test
##
## Welch Two Sample t-test
##
## data: Bank_Salaries$Salary[Bank_Salaries$Gender == "Male"] and Bank_Salaries$Salary[Bank_Salaries$Gender == "Female"]
## t = 4.141, df = 78.898, p-value = 8.604e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4308.082 12282.943
## sample estimates:
## mean of x mean of y
## 45505.44 37209.93
gender_avg_salary_testEV<-t.test(Bank_Salaries$Salary[Bank_Salaries$Gender=="Male"], Bank_Salaries$Salary[Bank_Salaries$Gender=="Female"], var.equal = TRUE)
gender_avg_salary_testEV
##
## Two Sample t-test
##
## data: Bank_Salaries$Salary[Bank_Salaries$Gender == "Male"] and Bank_Salaries$Salary[Bank_Salaries$Gender == "Female"]
## t = 5.3024, df = 206, p-value = 2.935e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5211.041 11379.984
## sample estimates:
## mean of x mean of y
## 45505.44 37209.93
As we can see, we reject the null that the means are equal: on average, men are paid more than women. We can model this further with dummy variables and interactions.
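As an aside, the same Welch test can be written with the formula interface, which is a bit more compact than subsetting (a sketch; with the levels ordered Female, Male the difference is computed as Female minus Male, so the sign flips but the p-value is the same):
# same Welch two-sample t-test via the formula interface
t.test(Salary ~ Gender, data = Bank_Salaries)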
Let's test whether there is a salary premium for PC jobs.
PCjob_avg_salary_test<-t.test(Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="Yes"], Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="No"], alternative="greater")
PCjob_avg_salary_test
##
## Welch Two Sample t-test
##
## data: Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "Yes"] and Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "No"]
## t = 0.26008, df = 33.648, p-value = 0.3982
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -2321.735 Inf
## sample estimates:
## mean of x mean of y
## 40305.26 39883.39
#notice, the default is the unequal-variance (Welch) test
PCjob_avg_salary_testEV<-t.test(Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="Yes"], Bank_Salaries$Salary[Bank_Salaries$`PC Job`=="No"],alternative="greater", var.equal = TRUE)
PCjob_avg_salary_testEV
##
## Two Sample t-test
##
## data: Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "Yes"] and Bank_Salaries$Salary[Bank_Salaries$`PC Job` == "No"]
## t = 0.15536, df = 206, p-value = 0.4383
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -4064.8 Inf
## sample estimates:
## mean of x mean of y
## 40305.26 39883.39
Apparently, there was no premium for doing a computer job in 1995 :)
Let's test whether younger workers are more likely to hold a computer job, say by comparing the average age of PC-job holders with everyone else.
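This one is left as a sketch (it is not run in the output above): a one-sided Welch t-test of Age, with the alternative that PC-job holders are younger on average.
# H0: mean age is equal; Ha: PC-job holders are younger on average
t.test(Bank_Salaries$Age[Bank_Salaries$`PC Job` == "Yes"],
       Bank_Salaries$Age[Bank_Salaries$`PC Job` == "No"],
       alternative = "less")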
So, we see that men are paid more than women. Let's now see whether women are under-represented in the highest-grade jobs and over-represented in the lowest-grade jobs.
Grade6Gender<-Bank_Salaries$Gender[Bank_Salaries$Grade==6]
num_females_grade6<-sum(Grade6Gender=="Female")
total_grade6<-length(Grade6Gender)
percent_females_grade6 <-num_females_grade6/total_grade6
prop_test_grade6gender<-prop.test(num_females_grade6, total_grade6, p=.5, alternative="less")
prop_test_grade6gender
##
## 1-sample proportions test with continuity correction
##
## data: num_females_grade6 out of total_grade6, null probability 0.5
## X-squared = 8.6429, df = 1, p-value = 0.001642
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.3105581
## sample estimates:
## p
## 0.07142857
We see that only about 7% of the grade 6 employees in this sample are female. The output above is a one-tailed test, and we reject the null that women hold half of the grade 6 jobs; a two-tailed version of the same test gives a 95% confidence interval of roughly 0.3% to 36%, which also excludes 50%.
Let's look at grade 1 jobs. Here we are testing for over-representation, i.e., p > 0.5.
Grade1Gender<-Bank_Salaries$Gender[Bank_Salaries$Grade==1]
num_females_grade1<-sum(Grade1Gender=="Female")
total_grade1<-length(Grade1Gender)
percent_females_grade1 <-num_females_grade1/total_grade1
prop_test_grade1gender<-prop.test(num_females_grade1, total_grade1, p=.5, alternative="greater")
prop_test_grade1gender
##
## 1-sample proportions test with continuity correction
##
## data: num_females_grade1 out of total_grade1, null probability 0.5
## X-squared = 20.417, df = 1, p-value = 3.114e-06
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.6937909 1.0000000
## sample estimates:
## p
## 0.8
We can reject the null very strongly; the sample proportion is 0.8, and the 95% confidence interval (based on the two-sided test) runs from roughly 67% to 89%.
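These two tests look at one grade at a time. A broader check (just a sketch, not part of the original output) is a chi-squared test of independence between Gender and Grade on the full cross-tabulation:
# are gender and job grade associated at all?
chisq.test(table(Bank_Salaries$Gender, Bank_Salaries$Grade))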
Let's test whether the variance of salary differs between men and women.
salary_var_test<-var.test(Bank_Salaries$Salary[Bank_Salaries$Gender=="Male"], Bank_Salaries$Salary[Bank_Salaries$Gender=="Female"])
salary_var_test
##
## F test to compare two variances
##
## data: Bank_Salaries$Salary[Bank_Salaries$Gender == "Male"] and Bank_Salaries$Salary[Bank_Salaries$Gender == "Female"]
## F = 5.5735, num df = 67, denom df = 139, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 3.736535 8.570519
## sample estimates:
## ratio of variances
## 5.57352
Maybe we should have run this before conducting the t-tests for salaries. Based on the results of this test, we should use the unequal-variance (default, Welch) version of the t-test.
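One caveat: the F test is sensitive to non-normality, and we saw earlier that Salary is quite skewed. If the car package is installed, Levene's test is a more robust alternative (a sketch, not run here):
# Levene's test is less sensitive to departures from normality than the F test
library(car)
leveneTest(Salary ~ factor(Gender), data = Bank_Salaries)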
Now, let's see if there is a difference in salaries across grades, using a one-way ANOVA.
sal_grade_anova<-aov(Bank_Salaries$Salary ~ Bank_Salaries$Grade)
summary(sal_grade_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Bank_Salaries$Grade 5 1.850e+10 3.699e+09 96.64 <2e-16 ***
## Residuals 202 7.732e+09 3.828e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, there seems to be a difference. Let's use Tukey's HSD to see whether it holds between all pairs of grades or just some.
tukeys_anova<-TukeyHSD(sal_grade_anova)
tukeys_anova
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Bank_Salaries$Salary ~ Bank_Salaries$Grade)
##
## $`Bank_Salaries$Grade`
## diff lwr upr p adj
## 2-1 2329.357 -1252.0561 5910.770 0.4228303
## 3-1 6329.484 2772.6522 9886.317 0.0000105
## 4-1 11816.976 7742.7858 15891.167 0.0000000
## 5-1 17993.405 13479.9212 22506.888 0.0000000
## 6-1 35664.833 30381.2223 40948.444 0.0000000
## 3-2 4000.127 138.1901 7862.065 0.0375396
## 4-2 9487.619 5144.5175 13830.721 0.0000000
## 5-2 15664.048 10906.4182 20421.677 0.0000000
## 6-2 33335.476 27841.8390 38829.113 0.0000000
## 4-3 5487.492 1164.6378 9810.346 0.0044146
## 5-3 11663.920 6924.7672 16403.073 0.0000000
## 6-3 29335.349 23857.7048 34812.993 0.0000000
## 5-4 6176.429 1037.6015 11315.256 0.0085767
## 6-4 23847.857 18020.9750 29674.739 0.0000000
## 6-5 17671.429 11529.3555 23813.502 0.0000000
Looking at the adjusted p-values, all pairwise differences are significant except grade 2 versus grade 1. Now, let's see whether these differences hold when we take time at the bank into account (an ANCOVA with Years1 as a covariate).
sal_grade_exp1_ancova<-aov(Bank_Salaries$Salary~Bank_Salaries$Grade+Bank_Salaries$Years1+Bank_Salaries$Grade*Bank_Salaries$Years1)
summary(sal_grade_exp1_ancova)
## Df Sum Sq Mean Sq F value
## Bank_Salaries$Grade 5 1.850e+10 3.699e+09 122.210
## Bank_Salaries$Years1 1 8.060e+08 8.060e+08 26.628
## Bank_Salaries$Grade:Bank_Salaries$Years1 5 9.937e+08 1.987e+08 6.566
## Residuals 196 5.932e+09 3.027e+07
## Pr(>F)
## Bank_Salaries$Grade < 2e-16 ***
## Bank_Salaries$Years1 6.02e-07 ***
## Bank_Salaries$Grade:Bank_Salaries$Years1 1.13e-05 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukeys_ancova<-TukeyHSD(sal_grade_exp1_ancova, "Bank_Salaries$Grade")
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Years1
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Grade, Bank_Salaries$Years1
tukeys_ancova
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Bank_Salaries$Salary ~ Bank_Salaries$Grade + Bank_Salaries$Years1 + Bank_Salaries$Grade * Bank_Salaries$Years1)
##
## $`Bank_Salaries$Grade`
## diff lwr upr p adj
## 2-1 2329.357 -856.2950 5515.009 0.2892672
## 3-1 6329.484 3165.6971 9493.272 0.0000005
## 4-1 11816.976 8193.0009 15440.951 0.0000000
## 5-1 17993.405 13978.6801 22008.129 0.0000000
## 6-1 35664.833 30965.0834 40364.583 0.0000000
## 3-2 4000.127 564.9503 7435.304 0.0122233
## 4-2 9487.619 5624.4484 13350.790 0.0000000
## 5-2 15664.048 11432.1562 19895.939 0.0000000
## 6-2 33335.476 28448.9089 38222.043 0.0000000
## 4-3 5487.492 1642.3313 9332.652 0.0008238
## 5-3 11663.920 7448.4635 15879.377 0.0000000
## 6-3 29335.349 24463.0074 34207.690 0.0000000
## 5-4 6176.429 1605.4634 10747.394 0.0018918
## 6-4 23847.857 18664.8699 29030.844 0.0000000
## 6-5 17671.429 12208.0803 23134.777 0.0000000
sal_grade_expinter_ancova<-aov(Bank_Salaries$Salary~ Bank_Salaries$Grade*Bank_Salaries$Years1)
summary(sal_grade_expinter_ancova)
## Df Sum Sq Mean Sq F value
## Bank_Salaries$Grade 5 1.850e+10 3.699e+09 122.210
## Bank_Salaries$Years1 1 8.060e+08 8.060e+08 26.628
## Bank_Salaries$Grade:Bank_Salaries$Years1 5 9.937e+08 1.987e+08 6.566
## Residuals 196 5.932e+09 3.027e+07
## Pr(>F)
## Bank_Salaries$Grade < 2e-16 ***
## Bank_Salaries$Years1 6.02e-07 ***
## Bank_Salaries$Grade:Bank_Salaries$Years1 1.13e-05 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Grade is very significant, but so are years at the bank and the interaction term. This means that both grade and time with the bank matter, and also that the effect of grade on salary changes with the number of years at the bank. Note that in an R formula Grade*Years1 already expands to both main effects plus the interaction, which is why the second, shorter specification gives exactly the same table as the first.
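Since aov() fits an ordinary linear model under the hood, the same specification can be written as a regression with Grade dummies and their interactions with Years1, which is one way to do the "dummy variables and interactions" modeling mentioned earlier (a sketch of the equivalent call):
# same model as the aov() fit, shown as a regression with dummies and interactions
summary(lm(Salary ~ Grade * Years1, data = Bank_Salaries))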
What about years of experience outside the bank (Years2)?
sal_grade_exp2_ancova<-aov(Bank_Salaries$Salary~Bank_Salaries$Grade*Bank_Salaries$Years2)
summary(sal_grade_exp2_ancova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Bank_Salaries$Grade 5 1.850e+10 3.699e+09 97.733 <2e-16
## Bank_Salaries$Years2 1 3.325e+07 3.325e+07 0.878 0.350
## Bank_Salaries$Grade:Bank_Salaries$Years2 5 2.807e+08 5.613e+07 1.483 0.197
## Residuals 196 7.418e+09 3.785e+07
##
## Bank_Salaries$Grade ***
## Bank_Salaries$Years2
## Bank_Salaries$Grade:Bank_Salaries$Years2
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukeys_ancova_years2<-TukeyHSD(sal_grade_exp2_ancova, "Bank_Salaries$Grade")
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Years2
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Bank_Salaries$Grade, Bank_Salaries$Years2
tukeys_ancova_years2
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Bank_Salaries$Salary ~ Bank_Salaries$Grade * Bank_Salaries$Years2)
##
## $`Bank_Salaries$Grade`
## diff lwr upr p adj
## 2-1 2329.357 -1232.9404 5891.655 0.4163407
## 3-1 6329.484 2791.6367 9867.332 0.0000094
## 4-1 11816.976 7764.5317 15869.421 0.0000000
## 5-1 17993.405 13504.0118 22482.798 0.0000000
## 6-1 35664.833 30409.4234 40920.243 0.0000000
## 3-2 4000.127 158.8031 7841.452 0.0358770
## 4-2 9487.619 5167.6987 13807.539 0.0000000
## 5-2 15664.048 10931.8120 20396.283 0.0000000
## 6-2 33335.476 27871.1611 38799.791 0.0000000
## 4-3 5487.492 1187.7109 9787.272 0.0041318
## 5-3 11663.920 6950.0623 16377.778 0.0000000
## 6-3 29335.349 23886.9416 34783.756 0.0000000
## 5-4 6176.429 1065.0299 11287.827 0.0080766
## 6-4 23847.857 18052.0758 29643.638 0.0000000
## 6-5 17671.429 11562.1386 23780.719 0.0000000
Apparently, years of experience outside the bank do not matter: grade is still important, but Years2 is not significant, and the effect of grade on salary does not change with prior outside experience.