Introduction to Z-test

Understanding and Coding in R

Linh Cao & Deidre Okeke

What is a Z-test?

  • A Z-test is a statistical test concerning the true mean or proportion of a population, or the difference in true means or proportions between two independent populations [1].

  • It allows us to make inferences about the population parameter (in this case, true means or proportions) from sample data [1].

Wait, this sounds like the goal of a t-test…

Yes! That’s correct. Both a z-test and a t-test share the goal of estimating the true mean.

However, there is a difference:

We use a z-test based on the following conditions:

  1. The sample is randomly selected from a normally distributed population

  2. The population variance is known

    OR

  3. The sample size is large enough for the Central Limit Theorem (CLT) to apply; a common rule of thumb is n \(\geq\) 30

(As a reminder) We use a t-test based on the following conditions:

  1. Population variance is unknown

  2. Sample size is small (n < 30)
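
In practice, the choice between the two tests shows up in the critical values. The short sketch below (the degrees of freedom are arbitrary illustrative choices, not values from this tutorial) compares two-sided 5% critical values from the t distribution with the standard normal value, showing how the t values shrink toward the z value as the sample grows.

```r
# Two-sided 5% critical values: t distribution vs. standard normal
alpha <- 0.05
z_crit <- qnorm(1 - alpha / 2)          # standard normal critical value

df <- c(5, 10, 29, 100)                 # illustrative degrees of freedom
t_crit <- qt(1 - alpha / 2, df = df)    # t critical values for each df

# The t critical values are always larger, but approach z as df grows
data.frame(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

With around 30 or more observations the two sets of critical values are already close, which is why the rule of thumb above is commonly used.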

Mathematical Formula

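For large samples (n \(\ge\) 30), the formula [2] for the Z-test statistic, which has approximately a standard normal distribution, is:

\[ Z = \frac{\hat\theta-\theta_0}{\sigma_{\hat\theta}} \]

where \(\hat\theta\) is the point estimator, \(\theta_0\) is the value of the parameter under the null hypothesis, and \(\sigma_{\hat\theta}\) is the standard error of \(\hat\theta\).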

The table [2] below lists the target parameters for the hypothesis test, the sample sizes, the point estimators, and the standard errors of the point estimators. The standard error is the standard deviation of an estimator's sampling distribution. It describes how far a point estimate obtained from a sample is likely to fall from the true population parameter.
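
As a sketch of the standard large-sample quantities for the four target parameters named earlier (one or two means, one or two proportions), with \(q = 1 - p\) and \(Y\) denoting the number of successes; the exact layout is an assumption and may differ from the cited table:

\[
\begin{array}{llll}
\text{Target parameter } \theta & \text{Sample size(s)} & \text{Point estimator } \hat\theta & \text{Standard error } \sigma_{\hat\theta} \\
\hline
\mu & n & \bar{Y} & \dfrac{\sigma}{\sqrt{n}} \\
p & n & \hat{p} = \dfrac{Y}{n} & \sqrt{\dfrac{pq}{n}} \\
\mu_1 - \mu_2 & n_1 \text{ and } n_2 & \bar{Y}_1 - \bar{Y}_2 & \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} \\
p_1 - p_2 & n_1 \text{ and } n_2 & \hat{p}_1 - \hat{p}_2 & \sqrt{\dfrac{p_1q_1}{n_1} + \dfrac{p_2q_2}{n_2}}
\end{array}
\]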

To ensure validity [3, 4] (assumption criteria):

  1. Random Sampling: The sample data should be obtained through a random sampling method to ensure it represents the population accurately.

  2. Normal Distribution: The data are drawn from a normally distributed population; this requirement can be relaxed when the sample size is large (n \(\geq\) 30).

  3. Known Population Variance (or Large Sample Size): The population variance should be known, or in the absence of this knowledge, the sample size should be large (typically at least 30). When the population variance is unknown and the sample size is small, the t-test is often preferred.

  4. Independent Populations: For two-sample analysis, the two samples are assumed to be independent.

Let’s Practice!

For the example analysis, we will use the Titanic dataset, which is publicly available for download on GitHub. This dataset contains information about the passengers aboard the Titanic, which famously sank on April 15, 1912 [Citation]. It includes variables such as age, sex, and survival status. The Titanic dataset is widely used for introductory data analysis and for various machine learning tasks [Citation, 5].

For this practice, we will learn how to run a Z-test in R when:

  1. Case A: You want to compare the difference in proportions between two populations

  2. Case B: You want to compare the difference in means between two populations

Case A: Proportion Differences

Research Question: Is there a significant difference in the survival rates between female and male passengers aboard the Titanic?

  • Null Hypothesis (H0): There is no significant difference between the proportion of female and male survivors aboard the Titanic.

  • Alternative Hypothesis (Ha): There is a significant difference between the proportion of female and male survivors aboard the Titanic.

Step 0: Install Required Package

We will need the Basic Statistics and Data Analysis (BSDA) package to run a Z-test. BSDA is a comprehensive package that provides various statistical methods and tools for data analysis and hypothesis testing.

We will also need the gmodels package, which provides various tools for model fitting and includes functions such as CrossTable(), which is useful for generating contingency tables.

#install.packages("BSDA")
library(BSDA)

#install.packages("gmodels")
library(gmodels)

Step 1: Load and Inspect the Data

First, load your data, which should contain two independent samples with a binary outcome (e.g. yes/no, survived/died).

# Load the Titanic dataset
titanic <- read.csv("~/GitHub/Z_test/titanic.csv")

head(titanic)
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
5           5        0      3
6           6        0      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  NA     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q

str(titanic)
'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

# Factor Survived and Sex
titanic$Survived <- factor(titanic$Survived)
titanic$Sex <- factor(titanic$Sex)

# Glimpse the data
summary(titanic$Survived)
  0   1 
549 342 
summary(titanic$Sex)
female   male 
   314    577 

Step 2: Organize the Data (Two-way Table)

This step will allow us to display the relationship between our two categorical variables.

CrossTable(titanic$Survived, titanic$Sex,
           prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  891 

 
                 | titanic$Sex 
titanic$Survived |    female |      male | Row Total | 
-----------------|-----------|-----------|-----------|
               0 |        81 |       468 |       549 | 
                 |     0.091 |     0.525 |           | 
-----------------|-----------|-----------|-----------|
               1 |       233 |       109 |       342 | 
                 |     0.262 |     0.122 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |       314 |       577 |       891 | 
-----------------|-----------|-----------|-----------|

 

Step 3: Check Assumptions

  • Both samples should be randomly drawn from two independent populations

  • Each sample should be large enough for the normal approximation to the binomial distribution to hold (denote \(q = 1 - p\)):

\[ n_1p_1 \ge 10 \enspace \text{and} \enspace n_2p_2 \ge 10 \]

\[ n_1q_1 \ge 10 \enspace \text{and} \enspace n_2q_2 \ge 10 \]
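
One common way to check the large-sample condition is to confirm that each group has at least 10 observed successes and at least 10 observed failures, using the sample counts in place of the unknown true proportions. The sketch below does this with the counts from the two-way table in Step 2; the variable names are illustrative.

```r
# Counts taken from the two-way table in Step 2
n <- c(female = 314, male = 577)           # group sizes
survived <- c(female = 233, male = 109)    # "successes" in each group
died <- n - survived                       # "failures" in each group

# Every observed count should be at least 10
checks <- rbind(survived = survived >= 10, died = died >= 10)
checks

# TRUE only if all four counts pass
all(checks)
```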

Step 4: Run Z-test

prop.test(x = c(233, 109), n = c(314, 577), alternative = "two.sided")

    2-sample test for equality of proportions with continuity correction

data:  c(233, 109) out of c(314, 577)
X-squared = 260.72, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.4926894 0.6135708
sample estimates:
   prop 1    prop 2 
0.7420382 0.1889081 

Step 5: Interpret Results

Note

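The test reported by prop.test() is Pearson's chi-square test, which is equivalent to the z-test for independent proportions. Without the continuity correction, the square of the z-test statistic \(Z^{2}\) is identical to Pearson's chi-square statistic \(\chi^{2}\) with 1 degree of freedom. More generally, for independent normal observations \(Y_i\) with mean \(\mu\) and standard deviation \(\sigma\),

\[ \sum_{i=1}^n Z_{i}^2 = \sum_{i=1}^n \left(\frac{Y_{i} - \mu}{\sigma}\right)^{2} \]

has a \(\chi^2\) distribution with \(n\) degrees of freedom (df).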
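
The relationship between the two statistics can be checked numerically. The sketch below computes the pooled two-proportion z statistic by hand from the counts used in Step 4 and compares its square with the chi-square statistic from prop.test() when the continuity correction is switched off (correct = FALSE). With the default correction, as in the Step 4 output, the reported statistic is slightly smaller than \(Z^2\).

```r
# Survivors and group sizes (female, male), as used in Step 4
x <- c(233, 109)
n <- c(314, 577)

p_hat <- x / n                     # sample proportions
p_pool <- sum(x) / sum(n)          # pooled proportion under H0
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Two-proportion z statistic computed by hand
z <- (p_hat[1] - p_hat[2]) / se_pool

# Pearson chi-square statistic without the continuity correction
chisq <- unname(prop.test(x = x, n = n, correct = FALSE)$statistic)

c(z = z, z_squared = z^2, chi_squared = chisq)
```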

Interpretation Method #1: P-value

Our p-value is < 2.2e-16 (that is, below 0.00000000000000022). Because this is far smaller than the 0.05 level of significance, we reject the null hypothesis and conclude that there is strong evidence of a difference between the proportions of female and male passengers who survived the Titanic.

Interpretation Method #2: Rejection Region

Because the chi-square statistic is the square of a two-sided Z statistic, the rejection region lies entirely in the upper tail. We find the critical value at the 0.05 level using the following code:

qchisq(0.95, df = 1)
[1] 3.841459

Since our chi-square test statistic is 260.72 > 3.841459, it falls in the rejection region, so we reject the null hypothesis.

According to the 95% confidence interval, we are 95% confident that the survival rate of female passengers on the Titanic exceeded that of male passengers by between 0.49 and 0.61 (that is, 49 to 61 percentage points).
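
To see where the interval comes from, it can be rebuilt by hand as a rough check. The sketch below uses the unpooled standard error of the difference plus a continuity correction of 0.5 * (1/n1 + 1/n2), which appears to be what prop.test() adds by default; treat that detail as an assumption rather than a documented guarantee.

```r
# Survivors and group sizes (female, male), as used in Step 4
x <- c(233, 109)
n <- c(314, 577)

p_hat <- x / n
diff_hat <- p_hat[1] - p_hat[2]             # difference in sample proportions

se <- sqrt(sum(p_hat * (1 - p_hat) / n))    # unpooled standard error
cc <- 0.5 * sum(1 / n)                      # assumed continuity correction

half_width <- qnorm(0.975) * se + cc
ci <- c(lower = diff_hat - half_width, upper = diff_hat + half_width)
round(ci, 4)
```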

Case B: Means Difference

Research Question: Is there a significant difference in the mean age between the survivors and non-survivors of the Titanic?

  • Null Hypothesis (H0): There is no significant difference in the mean ages between survivors and non-survivors of the Titanic.

  • Alternative Hypothesis (Ha): There is a significant difference in the mean ages between survivors and non-survivors of the Titanic.

Step 0: Install Required Package

We will need the Basic Statistics and Data Analysis (BSDA) package to run a Z-test. BSDA is a comprehensive package that provides various statistical methods and tools for data analysis and hypothesis testing.

#install.packages("BSDA")
library(BSDA)

Step 1: Load and Inspect the Data

# Load the titanic dataset
titanic <- read.csv("~/GitHub/Z_test/titanic.csv")

head(titanic)
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
5           5        0      3
6           6        0      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  NA     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q

str(titanic)
'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
summary(titanic$Survived)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.3838  1.0000  1.0000 
summary(titanic$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.42   20.12   28.00   29.70   38.00   80.00     177 

Step 2: Organize the Data

# Split the data into non-survivors (0) and survivors (1)
survive_no <- titanic[titanic$Survived == 0, ]
survive_yes <- titanic[titanic$Survived == 1, ]

# Keep only the Age values, dropping any missing (NA) entries
survive_no <- na.omit(survive_no$Age)
survive_yes <- na.omit(survive_yes$Age)

Step 3: Calculate standard deviations

sd(survive_no)
[1] 14.17211
sd(survive_yes)
[1] 14.95095

Step 4: Check Assumptions

Since we are comparing the means of two independent groups (survivors and non-survivors), we can proceed with a Z-test because both sample sizes (below) meet the rule of thumb for the Central Limit Theorem (n \(\geq\) 30). With samples this large, the sample standard deviations from Step 3 serve as reasonable stand-ins for the unknown population standard deviations.

Sample sizes

Category    N
Survived  290
Died      424

Step 5: Calculate the test statistic (z-score)

z.test(x = survive_no, y = survive_yes,
       sigma.x = sd(survive_no), sigma.y = sd(survive_yes),
       alternative = "two.sided")

    Two-sample z-Test

data:  survive_no and survive_yes
z = 2.046, p-value = 0.04075
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.09601277 4.46896641
sample estimates:
mean of x mean of y 
 30.62618  28.34369 
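
The output above can be reproduced from summary statistics alone, which makes the formula behind z.test() more concrete. This sketch plugs in the group means, standard deviations, and sample sizes reported in the earlier steps (rounded as printed, so the last digits may differ slightly); the variable names are illustrative.

```r
# Summary statistics reported earlier (non-survivors = "no", survivors = "yes")
n_no <- 424;          n_yes <- 290
mean_no <- 30.62618;  mean_yes <- 28.34369
sd_no <- 14.17211;    sd_yes <- 14.95095

# Difference in means and its standard error
diff_means <- mean_no - mean_yes
se_diff <- sqrt(sd_no^2 / n_no + sd_yes^2 / n_yes)

# z statistic, two-sided p-value, and 95% confidence interval
z <- diff_means / se_diff
p_value <- 2 * pnorm(-abs(z))
ci <- diff_means + c(-1, 1) * qnorm(0.975) * se_diff

round(c(z = z, p_value = p_value, lower = ci[1], upper = ci[2]), 4)
```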

Step 6: Interpret Z-test results


Because the p-value (0.04075) is less than the 0.05 significance level, we reject the null hypothesis and conclude that there is sufficient evidence of a difference in average age between those who survived and those who did not.

Step 7: Determine critical value (CV) and compare Z-test statistic score


In this example, alpha is set to 0.05. For a two-tailed test, the critical values are therefore ±1.96.

Our test statistic was 2.046, which is greater than the upper critical value (2.046 > 1.96). Therefore we can reject the null hypothesis.
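
The critical value itself can be obtained from the standard normal quantile function rather than a table. A minimal sketch, using the z statistic reported by z.test() above:

```r
alpha <- 0.05
z_stat <- 2.046                      # test statistic from the z.test() output

# Upper critical value for a two-tailed test at level alpha
z_crit <- qnorm(1 - alpha / 2)
round(z_crit, 2)

# Reject H0 when the statistic falls beyond either critical value
reject_null <- abs(z_stat) > z_crit
reject_null
```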