One of the basic statistical techniques for analyzing populations is the Student’s t-test. We mainly use it to achieve one of the following two objectives: comparing the mean of one population with a reference value, or comparing the means of two populations.
Suppose we have information on the entire population. In that case, we don’t need to carry out any test because we only need to compare the value with the reference (when we have one population) or compare the values of the two populations. But, in many cases, we don’t have data on the complete population. We only have a sample of each population. In these cases, the Student’s t-test allows us to draw conclusions about the populations based on the samples.
To carry out the Student’s t-test, we need to define the null and the alternative hypotheses. On the one hand, the null hypothesis states that nothing is going on: there is no difference between the values we compare. On the other hand, the alternative hypothesis is the opposite of the null hypothesis. For example, the null hypothesis could state that the average salary of a group is equal to 1,930 euros, while the alternative hypothesis states that it is different from 1,930 euros.
There are two assumptions for carrying out the Student’s t-test:
Normality - All samples should follow a normal distribution. To check normality visually, we can plot a histogram or a QQ-plot. A more formal approach is a normality test such as the Shapiro-Wilk or the Kolmogorov-Smirnov test. When a sample has more than 30 observations, checking normality is usually considered unnecessary because, by the central limit theorem, the sample mean is approximately normally distributed.
Homogeneity of the variances - Also called homoscedasticity. We need to check it when we have two samples. To check homoscedasticity visually, we can plot a boxplot or a dot plot. A more formal approach is Levene’s test or Fisher’s F-test.
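Both assumptions can also be checked with formal tests available in base R. The following is a minimal sketch on simulated data: the vectors `group_a` and `group_b` are hypothetical and are not the employee data set used later. It runs a Kolmogorov-Smirnov test for normality and Fisher's F-test for equal variances. Because the mean and standard deviation are estimated from the same sample, the Kolmogorov-Smirnov p-value should be read as approximate.

```r
# Simulated samples; names and values are illustrative assumptions.
set.seed(123)
group_a <- rnorm(100, mean = 70, sd = 10)
group_b <- rnorm(120, mean = 72, sd = 10)

# Normality: Kolmogorov-Smirnov test against a normal distribution whose
# parameters are estimated from the sample (so the p-value is approximate).
ks_a <- ks.test(group_a, "pnorm", mean = mean(group_a), sd = sd(group_a))

# Homoscedasticity: Fisher's F-test on the ratio of the two variances.
f_ab <- var.test(group_a, group_b)

ks_a$p.value
f_ab$p.value
```

In both tests, a large p-value means that we cannot reject the assumption being checked.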
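The steps for carrying out a hypothesis test using the Student’s t-test are: (1) check the assumptions, (2) define the null and the alternative hypotheses, (3) define the significance level, and (4) run the test and interpret the p-value.
We need to install and load several packages to analyze our data set.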
library(tidyverse)
library(gridExtra)
The tidyverse is an opinionated collection of R packages
designed for data science. The gridExtra package provides a
number of user-level functions to work with “grid” graphics, notably to
arrange multiple grid-based plots on a page, and draw tables.
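We have a data set with information on 550 employees in two consecutive years (2020 and 2021). Besides the employee identifier, the data set contains three variables: gender, salary, and performance. Salary and performance are recorded once per year.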
The first step is to load the data set and check that everything is correct. Instead of using a standard R data.frame, we have decided to use a tibble because this makes it much easier to work with large data. You can download the data set from https://github.com/vicencfernandez/WorkforceAnalytics.
employees <- read.csv("t-test-hypotheses.csv", sep = ",") %>% as_tibble()
employees
## # A tibble: 550 × 6
## employeeID gender salary20 salary21 performance20 performance21
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 female 1850. 1867. 58.9 63.0
## 2 2 female 2449. 2481. 73.0 73.8
## 3 3 female 2154. 2179. 91.2 87.9
## 4 4 female 2011. 2032. 55.9 60.7
## 5 5 female 1910. 1929. 69.6 71.2
## 6 6 female 2324. 2353. 72.3 73.3
## 7 7 female 1939. 1958. 79.8 79.1
## 8 8 female 1919. 1939. 67.5 69.6
## 9 9 female 1912. 1931. 96.4 91.8
## 10 10 female 2012. 2034. 68.8 70.6
## # … with 540 more rows
We can see that the variable gender has been defined as
a character, but we prefer to describe it as a factor. So, let’s change
it.
employees$gender <- as.factor(employees$gender)
employees
## # A tibble: 550 × 6
## employeeID gender salary20 salary21 performance20 performance21
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 female 1850. 1867. 58.9 63.0
## 2 2 female 2449. 2481. 73.0 73.8
## 3 3 female 2154. 2179. 91.2 87.9
## 4 4 female 2011. 2032. 55.9 60.7
## 5 5 female 1910. 1929. 69.6 71.2
## 6 6 female 2324. 2353. 72.3 73.3
## 7 7 female 1939. 1958. 79.8 79.1
## 8 8 female 1919. 1939. 67.5 69.6
## 9 9 female 1912. 1931. 96.4 91.8
## 10 10 female 2012. 2034. 68.8 70.6
## # … with 540 more rows
Now, our data set is ready for analysis.
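As an alternative to converting the column after loading, base R can create factors at import time. The sketch below is only an illustration: it reads a hypothetical three-row extract from a string instead of the real file, and uses the `stringsAsFactors` argument of `read.csv()`.

```r
# Hypothetical three-row extract; the real file has 550 rows.
csv_text <- "employeeID,gender,salary21
1,female,1867
2,male,2481
3,female,2179"

# stringsAsFactors = TRUE converts character columns to factors on import.
sample_df <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(sample_df$gender)
levels(sample_df$gender)
```

This converts every character column, so the explicit `as.factor()` call is preferable when only some columns should become factors.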
There are several versions of the Student’s t-test depending on the number of samples, the definition of the alternative hypothesis, and the relationship between the samples.
Let’s start with the simplest test. Consider that we want to know if the average salary of female employees in 2021 is different from 1,930 euros.
female_employees <- employees %>% filter(gender == "female")
female_employees %>% summarize(mean = mean(salary21))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 1955.
We can see that the average salary of female employees in the sample is 1955.376 euros. Still, we want to know if the average salary of female employees in the population is 1,930 euros. To do so, we need to carry out the Student’s t-test. So, the first step is to analyze the assumptions that the sample has to satisfy.
The first assumption is about normality. As the sample has more than 30 observations, checking it is not strictly necessary. However, we are going to analyze the normality to show how to do it. The first way is to use a histogram or a QQ-plot.
tmp1 <- female_employees %>% ggplot(aes(x = salary21)) +
  geom_histogram(color = "steelblue", fill = "steelblue", binwidth = 58)
tmp2 <- female_employees %>% ggplot(aes(sample = salary21)) +
  geom_qq(color = "steelblue") + geom_qq_line()
grid.arrange(tmp1, tmp2, ncol = 2)
As we can see in both plots, the sample appears to follow a normal distribution. Another approach is a more formal test. In this case, we have decided to carry out a Shapiro-Wilk test.
female_employees$salary21 %>% shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.99466, p-value = 0.5296
As the p-value is large (0.5296), we cannot reject the hypothesis that the sample comes from a normally distributed population. So, we can proceed under the assumption that the population follows a normal distribution.
After checking the assumption, we can move on to the next step: defining the null and the alternative hypotheses. We describe the hypotheses in the following way:
Null hypothesis - The average salary of female employees in 2021 is equal to 1,930 euros.
Alternative hypothesis - The average salary of female employees in 2021 is different from 1,930 euros.
We can mathematically write these hypotheses as \(H_0: \mu = 1930\) and \(H_1: \mu \neq 1930\).
The third step is to define the maximum probability of error that we accept (the significance level). It’s common to use \(\alpha=0.05\) as a threshold. In other words, we accept at most a 5% probability of rejecting the null hypothesis when it is actually true.
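The meaning of the significance level can be illustrated with a small simulation. The sketch below is built entirely on assumed values (sample size, standard deviation, and number of repetitions are arbitrary): it repeatedly draws samples for which the null hypothesis is true and counts how often the test rejects it at \(\alpha=0.05\).

```r
# Simulation sketch: all values here are illustrative assumptions.
set.seed(42)
alpha <- 0.05
n_sim <- 2000

# Draw samples for which the null hypothesis (mu = 1930) is true and
# record the p-value of each one-sample t-test.
p_values <- replicate(n_sim, {
  x <- rnorm(50, mean = 1930, sd = 200)
  t.test(x, mu = 1930)$p.value
})

# Share of simulations in which we would wrongly reject the null hypothesis.
type1_rate <- mean(p_values < alpha)
type1_rate
```

The resulting share should be close to the chosen significance level, which is what "maximum probability of error that we accept" means in practice.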
Now, it’s time to carry out the Student’s t-test. The function t.test() carries out the Student’s t-test. We need to define the following parameters:
x - In this scenario, it’s female_employees$salary21
mu - In this scenario, it’s \(1930\)
alternative - We can face three types of alternative hypothesis: (1) different to (“two.sided”), (2) greater than (“greater”), and (3) less than (“less”). In this scenario, we have a ‘different to’ hypothesis.
conf.level - Its value is one minus the significance level. In this scenario, the confidence level is \(0.95\).
female_employees$salary21 %>% t.test(mu = 1930, alternative = "two.sided", conf.level = 0.95)
##
## One Sample t-test
##
## data: .
## t = 1.9721, df = 249, p-value = 0.0497
## alternative hypothesis: true mean is not equal to 1930
## 95 percent confidence interval:
## 1930.034 1980.719
## sample estimates:
## mean of x
## 1955.376
The p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. As the p-value is smaller than 0.05 (our significance level), we can reject the null hypothesis and accept the alternative hypothesis, although only by a narrow margin (0.0497). We conclude that the average salary of female employees in 2021 is different from 1,930 euros.
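To see where these numbers come from, the one-sample statistic \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\) can be computed by hand and compared with the result of t.test(). The sketch below uses a simulated vector `salary` as a stand-in for female_employees$salary21, so the values differ from the output above.

```r
# Simulated salaries; a stand-in for female_employees$salary21.
set.seed(1)
salary <- rnorm(250, mean = 1950, sd = 200)
mu0 <- 1930

n <- length(salary)
std_error <- sd(salary) / sqrt(n)

# t statistic: distance between sample mean and reference, in standard errors.
t_stat <- (mean(salary) - mu0) / std_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval for the population mean.
ci <- mean(salary) + c(-1, 1) * qt(0.975, df = n - 1) * std_error

# The built-in test should agree with the manual computation.
built_in <- t.test(salary, mu = mu0, alternative = "two.sided", conf.level = 0.95)
all.equal(unname(built_in$statistic), t_stat)
```

The same comparison can be made for `built_in$p.value` and `built_in$conf.int`.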
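Let’s see what happens if we define our alternative hypothesis in a different way. Consider these two new hypotheses.
Null hypothesis - The average salary of female employees in 2021 is equal to or greater than 1,930 euros.
Alternative hypothesis - The average salary of female employees in 2021 is less than 1,930 euros.
We can mathematically write these hypotheses as \(H_0: \mu \geq 1930\) and \(H_1: \mu < 1930\).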
In this new scenario, the alternative hypothesis is ‘less than’, so
we need to define the parameter alternative as “less”.
female_employees$salary21 %>% t.test(mu = 1930, alternative = "less", conf.level = 0.95)
##
## One Sample t-test
##
## data: .
## t = 1.9721, df = 249, p-value = 0.9752
## alternative hypothesis: true mean is less than 1930
## 95 percent confidence interval:
## -Inf 1976.62
## sample estimates:
## mean of x
## 1955.376
As the p-value is greater than 0.05 (our significance level), we can’t reject the null hypothesis. We have no evidence that the average salary of female employees in 2021 is less than 1,930 euros.
Let’s move on to a more complex scenario. Consider that we want to know if the average performance of female employees is different from that of male employees in 2021.
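The two-sided and one-sided p-values come from the same t statistic and only differ in which tail of the t distribution is used. As a rough check, the sketch below recomputes them from the rounded values reported in the output above (t = 1.9721, df = 249), so small rounding differences are to be expected.

```r
# t statistic and degrees of freedom as reported (rounded) in the output above.
t_stat <- 1.9721
df <- 249

p_two_sided <- 2 * pt(-abs(t_stat), df)           # alternative = "two.sided"
p_less <- pt(t_stat, df)                          # alternative = "less"
p_greater <- pt(t_stat, df, lower.tail = FALSE)   # alternative = "greater"

round(c(two_sided = p_two_sided, less = p_less, greater = p_greater), 4)
```

Because the sample mean is above 1,930, almost all of the probability mass lies below the observed statistic, which is why the ‘less than’ p-value is so large.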
female_employees <- employees %>% filter(gender == "female")
female_employees %>% summarize(mean = mean(performance21))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 72.4
male_employees <- employees %>% filter(gender == "male")
male_employees %>% summarize(mean = mean(performance21))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 70.3
We can see that the average performance of female employees in the sample is greater than that of male employees. But we want to know if they are different in the population. So, the first step is to check both assumptions in the two samples.
female_employees$performance21 %>% shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.99119, p-value = 0.1378
male_employees$performance21 %>% shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.99729, p-value = 0.9012
In both samples, the p-value is high, so we cannot reject the hypothesis that the samples come from normally distributed populations. So, we can proceed under the assumption that both populations follow a normal distribution.
The second assumption is about the homoscedasticity of the variance. The first way to test it is by using a boxplot.
tmp1 <- female_employees %>% ggplot(aes(x = performance21)) +
  geom_boxplot(color = "steelblue", fill = "steelblue") +
  coord_flip() +
  ylab("female")
tmp2 <- male_employees %>% ggplot(aes(x = performance21)) +
  geom_boxplot(color = "tomato", fill = "tomato") +
  coord_flip() +
  ylab("male")
grid.arrange(tmp1, tmp2, ncol = 2)
As we can see in the plot, the two samples have a very similar spread, which suggests homoscedasticity. Another approach is a more formal test. In this case, we have decided to carry out Fisher’s F-test.
var.test(female_employees$performance21, male_employees$performance21, ratio = 1, alternative = "two.sided", conf.level = 0.95)
##
## F test to compare two variances
##
## data: female_employees$performance21 and male_employees$performance21
## F = 1.2491, num df = 249, denom df = 299, p-value = 0.0658
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.985561 1.587353
## sample estimates:
## ratio of variances
## 1.249133
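As the p-value is greater than 0.05, we cannot reject the hypothesis that the variances of both populations are equal. So, we can proceed under the assumption that both populations have the same (or very similar) variance.
Following the same approach as in the previous scenarios, we need to define the null and the alternative hypotheses:
Null hypothesis - The average performance of female employees is equal to that of male employees in 2021.
Alternative hypothesis - The average performance of female employees is different from that of male employees in 2021.
We can mathematically write these hypotheses as \(H_0: \mu_{female} = \mu_{male}\) and \(H_1: \mu_{female} \neq \mu_{male}\).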
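The F statistic reported by var.test() is simply the ratio of the two sample variances. The sketch below reproduces it by hand on simulated data: `perf_a` and `perf_b` are hypothetical stand-ins for the two performance samples, with the same sample sizes (250 and 300) as implied by the degrees of freedom above.

```r
# Simulated performance scores; stand-ins for the two employee samples.
set.seed(7)
perf_a <- rnorm(250, mean = 72, sd = 11)
perf_b <- rnorm(300, mean = 70, sd = 10)

# F statistic: ratio of the two sample variances.
f_stat <- var(perf_a) / var(perf_b)
df1 <- length(perf_a) - 1
df2 <- length(perf_b) - 1

# Two-sided p-value: twice the smaller tail probability.
p_lower <- pf(f_stat, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)

built_in <- var.test(perf_a, perf_b, ratio = 1, alternative = "two.sided")
all.equal(unname(built_in$statistic), f_stat)
```

A ratio close to 1 indicates similar variances. The F-test is sensitive to departures from normality, which is one reason to check normality first.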
The third step is to define the maximum probability of error that we accept (the significance level). Again, we define the threshold as 0.05.
Now, it’s time to carry out the Student’s t-test. The function t.test() carries out the Student’s t-test. We need to define the following parameters:
x - In this scenario, it’s female_employees$performance21
y - In this scenario, it’s male_employees$performance21
alternative - In this scenario, we have a ‘different to’ hypothesis
paired - In this scenario, there isn’t a direct relationship between the two samples, so the value is FALSE (see the following scenarios to know more)
var.equal - In this scenario, we have tested that both samples have the same variance, so the value is TRUE
conf.level - In this scenario, the confidence level is \(0.95\).
Now you may be wondering about the var.equal parameter. We said that we
need homoscedasticity between both samples to carry out the Student’s
t-test. So, why do we have to indicate this assumption explicitly? The
answer is simple. Internally, the function t.test() implements two
variants of the test. If the parameter var.equal is TRUE, the
function runs a standard Student’s t-test. Otherwise, the function runs
a Welch (or Satterthwaite) t-test, where the variances of the samples
don’t need to be the same.
t.test(female_employees$performance21, male_employees$performance21, alternative = "two.sided", paired = FALSE, var.equal = TRUE, conf.level = 0.95)
##
## Two Sample t-test
##
## data: female_employees$performance21 and male_employees$performance21
## t = 2.332, df = 548, p-value = 0.02006
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3201269 3.7408334
## sample estimates:
## mean of x mean of y
## 72.36632 70.33584
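The p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. As the p-value is smaller than 0.05 (our significance level), we can reject the null hypothesis and accept the alternative hypothesis. The average performance of female employees is different from that of male employees in 2021.
Let’s see what happens if we define our alternative hypothesis differently. Consider these two new hypotheses.
Null hypothesis - The average performance of female employees is equal to or less than that of male employees in 2021.
Alternative hypothesis - The average performance of female employees is greater than that of male employees in 2021.
We can mathematically write these hypotheses as \(H_0: \mu_{female} \leq \mu_{male}\) and \(H_1: \mu_{female} > \mu_{male}\).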
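The effect of var.equal is easiest to see on samples whose variances clearly differ. The following sketch uses simulated data (not the employee data set; the standard deviations 15 and 8 are arbitrary choices) and compares the degrees of freedom used by the two variants of t.test().

```r
# Simulated samples with clearly different variances (illustrative only).
set.seed(2021)
perf_a <- rnorm(250, mean = 72, sd = 15)
perf_b <- rnorm(300, mean = 70, sd = 8)

student <- t.test(perf_a, perf_b, var.equal = TRUE)   # pooled variance
welch <- t.test(perf_a, perf_b, var.equal = FALSE)    # R's default

# Student's test uses n1 + n2 - 2 degrees of freedom; Welch's test uses
# an adjusted, usually non-integer, value.
c(student = unname(student$parameter), welch = unname(welch$parameter))
```

Note that var.equal = FALSE is the default in R, so a call to t.test() without this argument runs the Welch variant.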
In this new scenario, the alternative hypothesis is ‘greater than’,
so we need to define the parameter alternative as
“greater”.
t.test(female_employees$performance21, male_employees$performance21, alternative = "greater", paired = FALSE, var.equal = TRUE, conf.level = 0.95)
##
## Two Sample t-test
##
## data: female_employees$performance21 and male_employees$performance21
## t = 2.332, df = 548, p-value = 0.01003
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.5958514 Inf
## sample estimates:
## mean of x mean of y
## 72.36632 70.33584
As the p-value is less than 0.05 (our significance level), we can reject the null hypothesis and accept the alternative hypothesis. The average performance of female employees is greater than that of male employees in 2021.
In this document, we have seen the different ways to use the Student’s t-test. We have seen that the samples need to satisfy two criteria: normality and homogeneity of the variances. Just two final recommendations:
If the samples don’t satisfy the homogeneity of the variances, we can use the Welch t-test with t.test(var.equal = FALSE).
If the samples don’t follow a normal distribution, we can use a non-parametric alternative such as the Wilcoxon test with wilcox.test().
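As a brief illustration of the second recommendation, the sketch below applies wilcox.test() to two simulated, right-skewed samples. The data and the names `group_a` and `group_b` are assumptions made for this example only.

```r
# Simulated right-skewed samples (illustrative only, not the employee data).
set.seed(99)
group_a <- rexp(40, rate = 1 / 10)
group_b <- rexp(40, rate = 1 / 14)

# Wilcoxon rank-sum (Mann-Whitney) test for two independent samples.
rank_test <- wilcox.test(group_a, group_b, alternative = "two.sided")
rank_test$p.value
```

The p-value is interpreted as in the t-test, but the test compares the ranks of the observations instead of their means, so it does not rely on normality.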