Understanding and Coding in R
A Z-test is a statistical test concerning the true mean or proportion of a population, or the difference between the true means or proportions of two independent populations [1].
It allows us to draw inferences about a population parameter (here, a true mean or proportion) from sample data [1].
A t-test shares this goal: both a z-test and a t-test draw inferences about the true mean.
However, the two tests apply under different conditions:
We use a z-test when:
The sample is randomly selected from a normally distributed population
The population variance is known
OR
The sample size is large enough for the Central Limit Theorem (CLT) to apply; the rule of thumb is n \(\geq\) 30
(As a reminder) We use a t-test when:
The population variance is unknown
The sample size is small (n < 30)
For large samples (n \(\ge\) 30), the formula [2] for the Z-test statistic, which has approximately a standard normal distribution, is:
\[ Z = \frac{\hat\theta-\theta}{\sigma_{\hat\theta}} \]
The table [2] below lists the target parameters for the hypothesis test, the sample sizes, the point estimators, and the standard errors of the point estimators. The standard error \(\sigma_{\hat\theta}\) is the standard deviation of the estimator's sampling distribution. It indicates how much a point estimate obtained from a sample is likely to vary around the true population parameter.
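The standard large-sample results that such a table summarizes can be written as follows. This is a reconstruction from standard textbook results (with \(q = 1 - p\)), not necessarily the exact layout of the table in reference 2:

\[
\begin{array}{llll}
\text{Target parameter } \theta & \text{Sample size(s)} & \text{Point estimator } \hat\theta & \text{Standard error } \sigma_{\hat\theta} \\
\hline
\mu & n & \bar{Y} & \dfrac{\sigma}{\sqrt{n}} \\
p & n & \hat{p} = \dfrac{Y}{n} & \sqrt{\dfrac{pq}{n}} \\
\mu_1 - \mu_2 & n_1 \text{ and } n_2 & \bar{Y}_1 - \bar{Y}_2 & \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} \\
p_1 - p_2 & n_1 \text{ and } n_2 & \hat{p}_1 - \hat{p}_2 & \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}
\end{array}
\]

The last two rows are the ones used in Case A and Case B of the worked example.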
Random Sampling: The sample data should be obtained through a random sampling method to ensure it represents the population accurately.
Normal Distribution: The data are drawn from a normally distributed population; this requirement can be relaxed when the sample size is large (n \(\geq\) 30).
Known Population Variance (or Large Sample Size): The population variance should be known or, in the absence of this knowledge, the sample size should be large (typically at least 30). When the population variance is unknown and the sample size is small, the t-test is preferred.
Independent Populations: For two-sample analysis, the two samples are assumed to be independent.
For the example analysis, we will use the Titanic dataset, which is publicly available for download on GitHub. This dataset contains information about the passengers aboard the Titanic, which famously sank on April 15, 1912 [Citation]. It includes variables such as age, sex, and survival status. The Titanic dataset is widely used for introductory data analysis and for various machine learning tasks [Citation, 5].
For this practice, we will learn how to run a Z-test in R when:
Case A: You want to compare the difference in proportions between two populations
Case B: You want to compare the difference in means between two populations
Research Question: Is there a significant difference in the survival rates between female and male passengers aboard the Titanic?
Null Hypothesis (H0): There is no significant difference between the proportion of female and male survivors aboard the Titanic.
Alternative Hypothesis (Ha): There is a significant difference between the proportion of female and male survivors aboard the Titanic.
We will need the Basic Statistics and Data Analysis (BSDA) package to run a Z-test. BSDA is a comprehensive package that provides various statistical methods and tools for data analysis and hypothesis testing.
We will also need the gmodels package, which provides various tools for model fitting and includes CrossTable(), a function for generating contingency (cross-tabulation) tables.
#install.packages("BSDA")
library(BSDA)
#install.packages("gmodels")
library(gmodels)
First, load your data, which should contain two independent samples with a binary outcome (e.g., yes/no or survived/died).
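A minimal sketch of the loading step, assuming the CSV has been saved locally as titanic.csv (an assumed file name). The helper load_titanic() is hypothetical and only wraps read.csv() with a check that the columns used later are present; head() and str() then preview the data:

```r
# Hypothetical helper: read the Titanic CSV and confirm that the columns
# used in this tutorial (Survived, Sex, Age) are present
load_titanic <- function(path) {
  data <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot(all(c("Survived", "Sex", "Age") %in% names(data)))
  data
}

# "titanic.csv" is an assumed local file name; adjust it to your download
if (file.exists("titanic.csv")) {
  titanic <- load_titanic("titanic.csv")
  print(head(titanic))  # first six rows
  str(titanic)          # column types and dimensions
}
```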
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
Name Sex Age SibSp Parch
1 Braund, Mr. Owen Harris male 22 1 0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
3 Heikkinen, Miss. Laina female 26 0 0
4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
5 Allen, Mr. William Henry male 35 0 0
6 Moran, Mr. James male NA 0 0
Ticket Fare Cabin Embarked
1 A/5 21171 7.2500 S
2 PC 17599 71.2833 C85 C
3 STON/O2. 3101282 7.9250 S
4 113803 53.1000 C123 S
5 373450 8.0500 S
6 330877 8.4583 Q
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
This step will allow us to display the relationship between our two categorical variables.
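A sketch of how such a cross table could be produced. To keep the block self-contained, it rebuilds one row per passenger from the four cell counts reported in this section; with the real data you would pass titanic$Survived and titanic$Sex directly. The CrossTable() arguments shown are an assumption chosen to match the "N" and "N / Table Total" cell contents, and the call only runs if gmodels is installed:

```r
# Cell counts as reported in the cross table in this section
cells <- data.frame(
  Survived = c(0, 1, 0, 1),
  Sex      = c("female", "female", "male", "male"),
  N        = c(81, 233, 468, 109)
)

# Expand to one row per passenger (a stand-in for the full data frame)
passengers <- cells[rep(seq_len(nrow(cells)), cells$N), c("Survived", "Sex")]

# Base R version: counts, table proportions, and margins
counts <- table(Survived = passengers$Survived, Sex = passengers$Sex)
print(addmargins(counts))
print(round(prop.table(counts), 3))  # N / Table Total

# gmodels version (arguments are assumed, not taken from the original code)
if (requireNamespace("gmodels", quietly = TRUE)) {
  gmodels::CrossTable(passengers$Survived, passengers$Sex,
                      prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE)
}
```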
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 891
| titanic$Sex
titanic$Survived | female | male | Row Total |
-----------------|-----------|-----------|-----------|
0 | 81 | 468 | 549 |
| 0.091 | 0.525 | |
-----------------|-----------|-----------|-----------|
1 | 233 | 109 | 342 |
| 0.262 | 0.122 | |
-----------------|-----------|-----------|-----------|
Column Total | 314 | 577 | 891 |
-----------------|-----------|-----------|-----------|
Both samples should be randomly drawn from two independent populations
The number of successes in each sample should follow a binomial distribution, and both samples should be large enough for the normal approximation to hold (denote \(q = 1 - p\)):
\[ n_1p_1 \ge 10 \enspace \text{and} \enspace n_2p_2 \ge 10 \]
\[ n_1q_1 \ge 10 \enspace \text{and} \enspace n_2q_2 \ge 10 \]
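The sample-size conditions can be checked, and the test run, along the following lines. This is a sketch that uses the survivor counts and group sizes from the cross table above; prop.test() comes with base R's stats package, and the variable names are illustrative:

```r
# Survivors (x) and group sizes (n), taken from the cross table
x <- c(female = 233, male = 109)
n <- c(female = 314, male = 577)
p_hat <- x / n

# Large-sample checks: expected successes, expected failures, and n * p * q
print(n * p_hat)                # 233 and 109
print(n * (1 - p_hat))          # 81 and 468
print(n * p_hat * (1 - p_hat))  # both well above 10

# Two-sample test for equality of proportions (continuity correction is on
# by default)
result <- prop.test(x = c(233, 109), n = c(314, 577))
print(result)
```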
2-sample test for equality of proportions with continuity correction
data: c(233, 109) out of c(314, 577)
X-squared = 260.72, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.4926894 0.6135708
sample estimates:
prop 1 prop 2
0.7420382 0.1889081
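Note
This Pearson's chi-square test is equivalent to the z-test for two independent proportions: the square of the z-test statistic, \(Z^{2}\), equals the Pearson chi-square statistic \(\chi^{2}\), provided both use the same continuity correction. More generally, for independent normal random variables \(Y_i\) with mean \(\mu\) and standard deviation \(\sigma\),
\[ \sum_{i=1}^n Z_{i}^2 = \sum_{i=1}^n \left(\frac{Y_{i} - \mu}{\sigma}\right)^{2} \]
has a \(\chi^2\) distribution with \(n\) degrees of freedom (df). Taking \(n = 1\) gives the single degree of freedom reported in the output above.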
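The equivalence can be checked numerically. The sketch below computes the pooled two-proportion z statistic by hand from the counts in this section and compares its square with the chi-square statistic from prop.test(), both without and with the continuity correction; the variable names are illustrative:

```r
x <- c(233, 109)   # survivors: female, male
n <- c(314, 577)   # group sizes: female, male

p_hat   <- x / n
p_pool  <- sum(x) / sum(n)                          # pooled proportion under H0
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# z statistic without and with the continuity correction
z_plain     <- (p_hat[1] - p_hat[2]) / se_pool
z_corrected <- (abs(p_hat[1] - p_hat[2]) - 0.5 * sum(1 / n)) / se_pool

chisq_plain     <- unname(prop.test(x, n, correct = FALSE)$statistic)
chisq_corrected <- unname(prop.test(x, n, correct = TRUE)$statistic)

print(c(z_squared = z_plain^2, chi_squared = chisq_plain))
print(c(z_squared = z_corrected^2, chi_squared = chisq_corrected))
```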
Our p-value is less than 2.2e-16 (0.00000000000000022). At the 0.05 level of significance, we therefore reject the null hypothesis and conclude that there is strong evidence of a difference between the proportions of female and male passengers who survived the Titanic.
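First, we find the critical value for the rejection region. Because the chi-square statistic is the square of the z statistic, the two-tailed z rejection region at the 0.05 level corresponds to the upper tail of the chi-square distribution with 1 df, whose critical value is qchisq(0.95, df = 1) = 3.841459.
Since our chi-square test statistic, 260.72, is greater than 3.841459, it falls into the rejection region on the right, hence we reject the null hypothesis.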
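A minimal sketch of the critical-value lookup with base R's qchisq(). It prints the 0.95 quantile, which matches a two-sided z-test at the 0.05 level (1.96 squared is about 3.84), alongside the more conservative 0.975 quantile of 5.023886; the observed statistic exceeds either threshold:

```r
alpha      <- 0.05
chisq_stat <- 260.72   # test statistic reported by prop.test()

# Upper-tail critical value that matches a two-sided z-test at level alpha
crit_upper <- qchisq(1 - alpha, df = 1)

# The 0.975 quantile, a stricter threshold
crit_strict <- qchisq(1 - alpha / 2, df = 1)

print(c(crit_upper = crit_upper, crit_strict = crit_strict))
print(chisq_stat > crit_upper)   # TRUE means the statistic is in the rejection region
```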
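According to the 95% confidence interval, we are 95% confident that the survival rate of female passengers on the Titanic exceeded that of male passengers by between 0.49 and 0.61 (49 to 61 percentage points). Because the interval does not contain 0, it agrees with the decision to reject the null hypothesis.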
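For reference, the interval reported by prop.test() can be reproduced by hand. This sketch assumes the default settings (95% level, continuity correction on): the interval is the difference in sample proportions plus or minus the normal quantile times the unpooled standard error, widened by the continuity correction.

```r
x <- c(233, 109)   # survivors: female, male
n <- c(314, 577)   # group sizes: female, male

p_hat    <- x / n
diff_hat <- p_hat[1] - p_hat[2]

# Unpooled standard error of the difference in proportions
se_unpooled <- sqrt(sum(p_hat * (1 - p_hat) / n))

z_crit     <- qnorm(0.975)        # two-sided 95% interval
correction <- 0.5 * sum(1 / n)    # continuity correction used by prop.test()

ci <- diff_hat + c(-1, 1) * (z_crit * se_unpooled + correction)
print(ci)
```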
Research Question: Is there a significant difference in the mean age between the survivors and non-survivors of the Titanic?
Null Hypothesis (H0): There is no significant difference in the mean ages between survivors and non-survivors of the Titanic.
Alternative Hypothesis (Ha): There is a significant difference in the mean ages between survivors and non-survivors of the Titanic.
We will need the Basic Statistics and Data Analysis (BSDA) package to run a Z-test. BSDA is a comprehensive package that provides various statistical methods and tools for data analysis and hypothesis testing.
#install.packages("BSDA")
library(BSDA)
Since we are comparing the means of two populations (survivors and non-survivors), we can proceed with a Z-test: both sample sizes (below, counting only passengers with a recorded age) meet the rule of thumb for the Central Limit Theorem (n \(\geq\) 30).
| Category | N |
|---|---|
| Survived | 290 |
| Died | 424 |
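A sketch of how the two age samples could be built and tested. The names survive_no and survive_yes match the test output in this section; split_ages() and two_sample_z() are hypothetical helpers, and using the sample standard deviations for sigma.x and sigma.y in BSDA's z.test() is an assumption about how the test was run:

```r
# Hypothetical helper: split ages by survival status, dropping missing ages
split_ages <- function(data) {
  age_known <- !is.na(data$Age)
  list(
    survive_no  = data$Age[age_known & data$Survived == 0],
    survive_yes = data$Age[age_known & data$Survived == 1]
  )
}

# Hypothetical helper: large-sample two-sample z statistic, with the sample
# variances standing in for the unknown population variances
two_sample_z <- function(x, y) {
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  z  <- (mean(x) - mean(y)) / se
  c(z = z, p_value = 2 * pnorm(-abs(z)))
}

# Quick check on the six rows printed earlier (Survived and Age columns)
first_six <- data.frame(Survived = c(0, 1, 1, 1, 0, 0),
                        Age      = c(22, 38, 26, 35, 35, NA))
str(split_ages(first_six))

# On the full data (assumes a data frame named titanic is already loaded)
if (exists("titanic") && requireNamespace("BSDA", quietly = TRUE)) {
  ages <- split_ages(titanic)
  survive_no  <- ages$survive_no
  survive_yes <- ages$survive_yes
  print(c(Survived = length(survive_yes), Died = length(survive_no)))
  print(BSDA::z.test(x = survive_no, y = survive_yes,
                     sigma.x = sd(survive_no), sigma.y = sd(survive_yes)))
}
```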
Two-sample z-Test
data: survive_no and survive_yes
z = 2.046, p-value = 0.04075
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09601277 4.46896641
sample estimates:
mean of x mean of y
30.62618 28.34369
Because the p-value (0.04075) is less than the 0.05 significance level, we reject the null hypothesis and conclude that there is sufficient evidence of a difference in average age between those who survived and those who did not.
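The p-value can be recovered from the reported test statistic with base R's pnorm(). This small sketch uses the rounded statistic from the output, so the result agrees with the reported p-value only up to rounding:

```r
z_stat <- 2.046   # test statistic reported in the output
alpha  <- 0.05

# Two-sided p-value: probability of a |Z| at least this large under H0
p_value <- 2 * pnorm(-abs(z_stat))

print(p_value)
print(p_value < alpha)   # TRUE means we reject the null hypothesis
```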
In this example, alpha is set to 0.05. For a two-tailed test, the critical values are therefore ±1.96.
Our test statistic was 2.046, which is greater than the upper critical value (2.046 > 1.96). Therefore we can reject the null hypothesis.
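The critical value comes from the standard normal quantile function qnorm(). The sketch below also backs out an approximate standard error from the reported means and z statistic to reconstruct the 95% confidence interval; because the inputs are rounded, it matches the reported interval only to about three decimal places:

```r
alpha  <- 0.05
z_stat <- 2.046                  # reported test statistic

# Two-tailed critical value
z_crit <- qnorm(1 - alpha / 2)
print(z_crit)
print(abs(z_stat) > z_crit)      # TRUE means we reject the null hypothesis

# Approximate reconstruction of the confidence interval from the output
diff_means <- 30.62618 - 28.34369   # mean age: non-survivors minus survivors
se_approx  <- diff_means / z_stat   # standard error implied by the z statistic
ci_approx  <- diff_means + c(-1, 1) * z_crit * se_approx
print(ci_approx)
```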