library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')
Even when the proper statistical procedures are applied, we might sometimes arrive at the wrong conclusion when rejecting or failing to reject the null hypothesis. Two types of error are distinguished: a Type I error is rejecting the null hypothesis when it is actually true, and a Type II error is failing to reject the null hypothesis when it is actually false.
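A Type I error (rejecting a null hypothesis that is actually true) can be illustrated with a small simulation. The sketch below is not based on the NHEFS data: the sample size, mean, standard deviation, seed, and number of replicates are arbitrary assumptions chosen for illustration. Both groups are drawn from the same population, so every rejection is a Type I error, and the rejection rate should land close to the significance level.

```r
# Simulated illustration (not NHEFS data): both groups share the same
# population mean, so the null hypothesis is true in every replicate.
set.seed(1)        # arbitrary seed, for reproducibility only
alpha  <- 0.05     # significance level
n_sims <- 2000     # number of simulated studies (arbitrary)

p_values <- replicate(n_sims, {
  x <- rnorm(30, mean = 220, sd = 40)
  y <- rnorm(30, mean = 220, sd = 40)
  t.test(x, y, var.equal = TRUE)$p.value
})

# Share of replicates that wrongly reject H0, i.e. Type I errors
type1_rate <- mean(p_values <= alpha)
type1_rate
```

Lowering `alpha` reduces the Type I error rate, at the cost of making it harder to detect a real difference (more Type II errors).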
Two samples are independent if the sample values from one population are not related to, or somehow naturally paired or matched with, the sample values from the other population.
Two samples are dependent if the sample values are somehow matched, where the matching is based on some inherent relationship. For example, sample values consist of two measurements from the same subject (before and after data), or a pair of sample values consists of matched pairs such as husband/wife data.
Example 1: A P1 pharmacy student conducted a study to compare the mean age of registered pharmacists in San Diego vs. San Francisco. The P1 pharmacy student randomly surveyed registered pharmacists in San Diego and San Francisco via phone. This is an example of independent samples because they are not matched according to some inherent relationship.
Example 2: You discovered a new drug that could potentially reduce blood pressure in patients with hypertension. You conducted a clinical trial to evaluate the efficacy of the drug by measuring the participants’ blood pressure before and after receiving the drug. This is an example of paired samples because the measurements belong to the same person.
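The distinction matters because it changes the analysis. As a minimal sketch, the blood pressure values below are invented for illustration (they do not come from any study). The same numbers are analysed once as paired measurements and once, incorrectly for this design, as independent samples.

```r
# Hypothetical systolic blood pressure (mmHg) for 8 participants,
# measured before and after treatment; values are invented.
before <- c(152, 148, 160, 155, 149, 163, 158, 151)
after  <- c(145, 146, 151, 150, 147, 155, 152, 148)

# Paired analysis: tests the mean of the within-person differences
paired_res <- t.test(before, after, paired = TRUE)

# Independent-samples analysis: ignores that each pair shares a person
indep_res <- t.test(before, after, var.equal = TRUE)

paired_res$p.value
indep_res$p.value
```

In this invented data set the paired test gives the smaller p-value, because pairing removes the person-to-person variability from the comparison.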
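The independent two-sample t-test is used to compare the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The following assumptions must be met for the independent two-sample t-test to be valid: the two samples are independent; both are simple random samples; and either both sample sizes are large (\(n_1\) ≥ 30 and \(n_2\) ≥ 30) or both samples come from normally distributed populations.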
In general, we use the pooled-variance t-test as long as \(s_1/s_2<2\) and \(s_2/s_1<2\), where \(s_1,s_2\) are the standard deviations of the outcome in the two groups.
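The rule of thumb above can be written as a small helper. The function names `use_pooled_variance()` and `pooled_sd()` are hypothetical (they are not part of any package), and the input vectors are invented. The second function computes the standard pooled standard deviation, \(\sqrt{((n_1-1)s_1^2+(n_2-1)s_2^2)/(n_1+n_2-2)}\).

```r
# Hypothetical helper: TRUE when the larger standard deviation is less
# than twice the smaller one, i.e. s1/s2 < 2 and s2/s1 < 2.
use_pooled_variance <- function(x, y) {
  s1 <- sd(x, na.rm = TRUE)
  s2 <- sd(y, na.rm = TRUE)
  max(s1, s2) / min(s1, s2) < 2
}

# Hypothetical helper: pooled standard deviation of two samples
pooled_sd <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  n1 <- length(x)
  n2 <- length(y)
  sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
}

a <- c(10, 12, 14, 16, 18)   # invented values
b <- c(11, 13, 13, 15, 19)   # invented values
use_pooled_variance(a, b)
pooled_sd(a, b)
```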
Example: As part of the NHANES study, serum cholesterol level (mg/100 mL) was collected in 1971 to determine the effect of cholesterol on health. Using a significance level of 0.05, determine if the mean cholesterol level is equal between males and females.
\({H_0}: \mu_{females} = \mu_{males}\)
\({H_1}: \mu_{females} \neq\ \mu_{males}\)
Check if the data is numeric
class(NHANES_df$cholesterol)
## [1] "numeric"
Let’s explore the data
by(NHANES_df$cholesterol, NHANES_df$sex, length)
## NHANES_df$sex: Female
## [1] 830
## ------------------------------------------------------------
## NHANES_df$sex: Male
## [1] 799
by(NHANES_df$cholesterol, NHANES_df$sex, summary)
## NHANES_df$sex: Female
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 78.0 188.0 218.0 221.6 252.0 377.0 11
## ------------------------------------------------------------
## NHANES_df$sex: Male
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 79.0 189.0 215.0 218.3 242.0 416.0 5
Next, draw a few plots. We start with a boxplot of cholesterol level grouped by sex
NHANES_df$sex <- as.factor(NHANES_df$sex) #convert sex to categorical variable
NHANES_df %>% ggplot(aes(y=cholesterol,x=sex)) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
geom_boxplot(fill='deepskyblue',outlier.colour="red", outlier.size=4) +
theme_light()
## Warning: Removed 16 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 16 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Next, we plot a histogram of cholesterol level grouped by sex.
NHANES_df %>% ggplot(aes(x=cholesterol)) +
geom_histogram( bins=29, fill="deepskyblue", color="black") +
theme_light() + facet_grid(sex ~.)
## Warning: Removed 16 rows containing non-finite outside the scale range
## (`stat_bin()`).
We assess normality using a Q-Q plot.
NHANES_df %>% ggplot(aes(sample = cholesterol)) +
stat_qq_line(linewidth = 2, color = 'red') + # fixed colour is set outside aes()
stat_qq(size=2) +
theme_light()+
facet_grid(sex ~.)
The plots suggest that males have slightly lower serum cholesterol than females, and that the data are slightly right-skewed in both groups. Although the data are slightly skewed, both samples are large (\(n_1\) ≥ 30 and \(n_2\) ≥ 30), so the use of the independent two-sample t-test is appropriate.
To assess if it is appropriate to use pooled variance, we calculate the standard deviations of cholesterol levels in males and females.
s1 <- sd(NHANES_df$cholesterol[NHANES_df$sex=='Male'],na.rm = TRUE)
s2 <- sd(NHANES_df$cholesterol[NHANES_df$sex=='Female'],na.rm = TRUE)
s1/s2
## [1] 0.9249073
s2/s1
## [1] 1.081189
Since both ratios are less than 2, we perform the t-test with pooled variance.
t.test(cholesterol ~ sex,data=NHANES_df,var.equal=TRUE)
##
## Two Sample t-test
##
## data: cholesterol by sex
## t = 1.487, df = 1611, p-value = 0.1372
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -1.073375 7.801962
## sample estimates:
## mean in group Female mean in group Male
## 221.6300 218.2657
If s1/s2 or s2/s1 were ≥ 2, we would have used Welch’s correction (var.equal=FALSE).
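For comparison, a minimal sketch of the Welch version on simulated data. The groups, their standard deviations, and the seed are assumptions made up for illustration and do not come from the NHEFS data; the second group is given a much larger spread so that the ratio of standard deviations is well above 2.

```r
# Simulated data with clearly unequal spread (not NHEFS data)
set.seed(42)   # arbitrary seed
group_a <- rnorm(40, mean = 200, sd = 10)
group_b <- rnorm(40, mean = 200, sd = 35)

sim_df <- data.frame(
  value = c(group_a, group_b),
  group = factor(rep(c("A", "B"), each = 40))
)

# Ratio of standard deviations, to compare against the cut-off of 2
sd(group_b) / sd(group_a)

# Welch's t-test: variances are not pooled
welch_res <- t.test(value ~ group, data = sim_df, var.equal = FALSE)
welch_res$method
welch_res$parameter   # Welch degrees of freedom, below n1 + n2 - 2 here
```

Note that `var.equal = FALSE` is the default in `t.test()`, so the pooled version has to be requested explicitly.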
Interpretation: Since the P-value is > 0.05, we fail to reject the null hypothesis and conclude that there is not sufficient evidence to warrant rejection of the claim that males and females have equal mean serum cholesterol.
If the P-value were ≤ 0.05, we would reject the null hypothesis and conclude that there is sufficient evidence that males and females have different mean serum cholesterol.
The z-test and t-test are examples of parametric tests. Parametric tests assume that the sample data come from a population that can be adequately modeled by a probability distribution. Nonparametric tests are distribution-free tests that don’t require the samples to come from a population with a normal distribution or any other particular distribution.
Advantages of nonparametric tests:
Disadvantages of nonparametric tests:
The Wilcoxon signed-rank test is used to test a claim that a single population of individual values has a median equal to some claimed value. By using ranks, the Wilcoxon signed-rank test takes the magnitudes of the differences into account; it therefore tends to yield conclusions that reflect the true nature of the data.
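To make the role of the ranks concrete, the test statistic can be computed by hand. This is a sketch on invented values (not NHEFS data) with a hypothesised median `m0` of 48: subtract the claimed median, rank the absolute differences, and sum the ranks that belong to positive differences. The result is compared with the statistic `V` reported by `wilcox.test()`.

```r
# Invented ages, chosen so that there are no ties and no zero differences
x  <- c(25, 31, 36, 52, 61, 43, 29, 70)
m0 <- 48                      # hypothesised median

d <- x - m0                   # signed differences from the claimed median
d <- d[d != 0]                # zero differences would be dropped
r <- rank(abs(d))             # ranks of the absolute differences

V <- sum(r[d > 0])            # sum of ranks of the positive differences
V

# Statistic reported by the built-in test, for comparison
wilcox.test(x, mu = m0)$statistic
```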
Example: Using a random sample of size 20 from the NHANES study, determine if the study participants have a median age of 48 years.
\({H_0}: median=48\)
\({H_1}: median \neq\ 48\)
set.seed(700)
NHANES_df_20 <- NHANES_df %>% sample_n(20) #select a sample of 20 participants
Let’s explore the data!
summary(NHANES_df_20$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 29.75 38.50 41.80 51.00 74.00
Graph a histogram of age.
NHANES_df_20 %>% ggplot(aes(x=age)) +
geom_histogram( binwidth=5, fill="deepskyblue", color="black") +
theme_light()
Graph a boxplot of age.
NHANES_df_20 %>% ggplot(aes(y=age)) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
geom_boxplot(fill='deepskyblue',outlier.colour="red", outlier.size=4) +
theme_light()
Assess normality using a Q-Q plot.
NHANES_df_20 %>% ggplot(aes(sample = age)) +
stat_qq_line(linewidth = 2, color = 'red') + # fixed colour is set outside aes()
stat_qq(size=2) +
theme_light()
The data do not appear to be normally distributed and the sample size is < 30; therefore, the one-sample t-test is not appropriate and we should instead use the Wilcoxon signed-rank test.
wilcox.test(NHANES_df_20$age,mu =48 ,exact=FALSE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: NHANES_df_20$age
## V = 54.5, p-value = 0.06177
## alternative hypothesis: true location is not equal to 48
Interpretation: Since the P-value is > 0.05, we fail to reject the null hypothesis and conclude that there is not sufficient evidence to warrant rejection of the claim that the median age of NHANES participants is 48 years.
The Wilcoxon rank-sum test uses ranks of values from two independent samples to test the null hypothesis that the samples are from populations having equal medians. The Wilcoxon rank-sum test is equivalent to the Mann-Whitney U test.
The basic idea underlying the Wilcoxon rank-sum test: If two samples are drawn from identical populations and the individual values are all ranked as one combined collection of values, then the high and low ranks should fall evenly between the two samples. If the low ranks are found predominantly in one sample and the high ranks are found predominantly in the other sample, we have an indication that the two populations have different medians.
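The ranking idea can be sketched by hand on a small invented data set (the ages below are made up and contain no ties). The two samples are ranked as one combined collection, the ranks of the first sample are summed, and the smallest possible rank sum is subtracted. That last step is how `wilcox.test()` defines its reported statistic `W`.

```r
# Invented ages for two small independent samples (no ties)
males   <- c(25, 29, 34, 53, 69)
females <- c(30, 37, 42, 46)

# Rank all values as one combined collection
all_ranks <- rank(c(males, females))

# Sum of the ranks that belong to the first sample
n1 <- length(males)
R1 <- sum(all_ranks[seq_len(n1)])

# Subtract the minimum possible rank sum, n1 * (n1 + 1) / 2
W <- R1 - n1 * (n1 + 1) / 2
W

# Statistic reported by the built-in test, for comparison
wilcox.test(males, females)$statistic
```

A value of `W` near 0 or near `n1 * n2` means one sample holds mostly the low or mostly the high ranks; values near the middle mean the ranks are spread evenly.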
Example: Using a random sample of 20 participants from the NHANES study, determine if males and females have equal median age.
\({H_0}: median_{males} = median_{females}\)
\({H_1}: median_{males} \neq\ median_{females}\)
set.seed(600)
NHANES_df_20 <- NHANES_df %>% sample_n(20) # draw a random sample of 20 participants
Let’s evaluate the counts and proportions of males and females.
table(NHANES_df_20$sex)
##
## Female Male
## 9 11
prop.table(table(NHANES_df_20$sex))
##
## Female Male
## 0.45 0.55
Summarize data grouped by sex.
by(NHANES_df_20$age, NHANES_df_20$sex, summary)
## NHANES_df_20$sex: Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29 37 42 41 46 51
## ------------------------------------------------------------
## NHANES_df_20$sex: Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 29.00 34.00 41.45 53.50 69.00
Plot a histogram of age by sex.
NHANES_df_20 %>% ggplot(aes(x=age)) +
geom_histogram( binwidth=5, fill="deepskyblue", color="black") +
theme_light() + facet_grid(sex ~.)
Plot a boxplot of age by sex.
NHANES_df_20 %>% ggplot(aes(y=age,x=sex)) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
geom_boxplot(fill='deepskyblue',outlier.colour="red", outlier.size=4) +
theme_light()
Assess normality using a Q-Q plot.
NHANES_df_20 %>% ggplot(aes(sample = age)) +
stat_qq_line(linewidth = 2, color = 'red') + # fixed colour is set outside aes()
stat_qq(size=2) +
theme_light()+
facet_grid(sex ~.)
The data do not appear to be normally distributed and the sample size is < 30; therefore, the two-sample t-test is not appropriate and we should instead use the Wilcoxon rank-sum test.
wilcox.test(age~sex,data = NHANES_df_20,exact=FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: age by sex
## W = 52, p-value = 0.879
## alternative hypothesis: true location shift is not equal to 0
Interpretation: Since the P-value is > 0.05, we fail to reject the null hypothesis and conclude that there is not sufficient evidence to warrant rejection of the claim that males and females have equal median age in the NHANES study.