lecture link: https://rpubs.com/zaidyousif/1111089
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')
Two samples are independent if the sample values from one population are not related to, or naturally paired or matched with, the sample values from the other population.
Two samples are dependent if the sample values are matched, where the matching is based on some inherent relationship. For example, each pair of sample values may consist of two measurements from the same subject (before and after data), or of measurements from two subjects who have been matched to each other.
Example 1: A P1 pharmacy student conducted a study comparing the mean age of registered pharmacists in San Diego and San Francisco by surveying a random sample of registered pharmacists in each city. This is an example of independent samples because the pharmacists in one city are not matched to those in the other according to any inherent relationship.
Example 2: You discovered a new drug that could potentially reduce blood pressure in patients with hypertension. You conducted a clinical trial to evaluate the drug’s efficacy by measuring the subjects’ blood pressure before and after receiving the drug. This is an example of dependent samples because the measurements belong to the same person.
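The paired-samples t-test is used to conduct a hypothesis test of a claim about dependent means. The following assumptions must be met for a paired t-test to return valid results: the sample data are dependent (matched pairs), the pairs are a simple random sample, and either the differences come from an approximately normally distributed population or the number of pairs is large (n ≥ 30, the threshold used later in this lecture).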
In principle, the paired t-test is a one-sample t-test applied to the paired differences. The formula for the paired t-test is:
\[t= \frac{\bar{d}-\mu_d}{\frac{s_d}{\sqrt{n}}}\]
where \(\bar{d}\) is the mean of the differences in the paired sample data, \(\mu_d\) is the mean of the differences in the population, \(s_d\) is the standard deviation of the differences in the sample data, and \(n\) is the number of pairs.
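The link between the formula and the one-sample t-test can be checked numerically. The sketch below uses a small set of hypothetical before/after values (invented for illustration, not taken from the lecture data) and compares the statistic computed from the formula with the one returned by t.test.

```r
# Hypothetical before/after measurements for six subjects (illustration only)
before <- c(72.1, 80.4, 65.0, 90.3, 77.8, 68.5)
after  <- c(74.0, 79.9, 67.2, 93.1, 78.5, 70.0)

d <- after - before                      # paired differences
n <- length(d)                           # number of pairs

# Paired t statistic from the formula, with mu_d = 0 under H0
t_manual <- (mean(d) - 0) / (sd(d) / sqrt(n))

# The same statistic from a one-sample t-test on the differences
t_one_sample <- unname(t.test(d, mu = 0)$statistic)

# ... and from a paired t-test on the two columns
t_paired_call <- unname(t.test(after, before, paired = TRUE)$statistic)

c(manual = t_manual, one_sample = t_one_sample, paired = t_paired_call)
```

All three values should agree, which is why a column of precomputed differences can be analysed with a one-sample test.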
Example: As part of the NHANES study, subjects’ weights in kilograms were recorded in 1971 and 1982, and the variable wt82_71 holds the change between the two measurements. Using a significance level of 0.05, determine whether the mean weight change differs significantly from zero.
\({H_0}: \mu_d = 0\)
\({H_1}: \mu_d \neq\ 0\)
NHANES_df_delta_weight <- NHANES_df %>% filter(!is.na(wt82_71)) #remove record where change in weight is missing
Check if the data is numeric.
class(NHANES_df_delta_weight$wt82_71)
## [1] "numeric"
Let’s explore the data!
length(NHANES_df_delta_weight$wt82_71)
## [1] 1566
summary(NHANES_df_delta_weight$wt82_71)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -41.280 -1.478 2.604 2.638 6.690 48.538
Next, draw a few plots. We start with a boxplot of change in weight.
NHANES_df_delta_weight %>% ggplot(aes(y=wt82_71)) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
geom_boxplot(fill='deepskyblue',outlier.colour="red", outlier.size=4) +
theme_light()
Next, we plot a histogram of change in weight.
NHANES_df_delta_weight %>% ggplot(aes(x=wt82_71)) +
geom_histogram( bins = 40, fill="deepskyblue", color="black") +
theme_light()
We assess normality using a Q-Q plot.
NHANES_df_delta_weight %>% ggplot(aes(sample = wt82_71)) +
stat_qq_line(linewidth = 2, color = 'red') + #draw the reference line in red (set outside aes so no legend is created)
stat_qq(size=2) +
theme_light()
We can test for normality using the Shapiro-Wilk test, whose null hypothesis is that the data come from a normally distributed population. If the P-Value is ≤ 0.05, we reject the null hypothesis and conclude that there is sufficient evidence that the data do not come from a normally distributed population.
shapiro.test(NHANES_df_delta_weight$wt82_71)
##
## Shapiro-Wilk normality test
##
## data: NHANES_df_delta_weight$wt82_71
## W = 0.95809, p-value < 2.2e-16
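The plots reveal that the data are over-dispersed relative to a normal distribution, and the Shapiro-Wilk test is significant. However, because the sample size is large (n = 1566, far above the usual threshold of 30), the sampling distribution of the mean difference is approximately normal by the central limit theorem, so the t-test is still appropriate.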
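The large-sample argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is a hypothetical right-skewed exponential distribution (not the weight data), the seed and sample sizes are arbitrary, and skewness() is a helper defined here rather than a base R function.

```r
set.seed(204)  # arbitrary seed so the sketch is reproducible

# Moment-based sample skewness, defined here to avoid extra packages
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# A clearly non-normal (right-skewed) hypothetical population
raw_sample <- rexp(5000, rate = 1)

# Sampling distribution of the mean for samples of size n = 30
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))

skewness(raw_sample)     # strongly right-skewed
skewness(sample_means)   # much closer to 0, i.e. closer to a normal shape
```

Even though individual observations are far from normal, the means of samples of size 30 are much more symmetric, which is the property the t-test relies on.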
wt_df_mean <- mean(NHANES_df_delta_weight$wt82_71,na.rm = TRUE) #calculate sample mean
wt_df_sd <- sd(NHANES_df_delta_weight$wt82_71,na.rm = TRUE) #calculate sample standard deviation
wt_df_n <- length(NHANES_df_delta_weight$wt82_71) #calculate sample size
wt_df_mean
## [1] 2.6383
wt_df_sd
## [1] 7.879913
wt_df_n
## [1] 1566
t_paired <- (wt_df_mean-0)/(wt_df_sd/sqrt(wt_df_n))
t_paired
## [1] 13.24947
We can confirm the value of the t statistic using the t.test command in R.
t.test(x=NHANES_df_delta_weight$wt82_71,mu = 0)
##
## One Sample t-test
##
## data: NHANES_df_delta_weight$wt82_71
## t = 13.249, df = 1565, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 2.247720 3.028879
## sample estimates:
## mean of x
## 2.6383
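The p-value and confidence interval reported by t.test can also be reproduced by hand from the t distribution. A minimal sketch, with the summary values copied from the output above and hard-coded so the block runs on its own (the variable names are chosen here for illustration):

```r
# Summary values copied from the output above
d_mean <- 2.6383      # mean change in weight (kg)
d_sd   <- 7.879913    # standard deviation of the change (kg)
n      <- 1566        # number of subjects

se     <- d_sd / sqrt(n)          # standard error of the mean difference
t_stat <- (d_mean - 0) / se       # test statistic under H0: mu_d = 0
df     <- n - 1                   # degrees of freedom

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval for the mean difference
t_crit <- qt(0.975, df = df)
ci <- d_mean + c(-1, 1) * t_crit * se

t_stat
p_value
ci
```

The results should match the t statistic, the p-value bound, and the 95 percent confidence interval printed by t.test, up to rounding of the hard-coded inputs.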
If the difference in weights is not provided, we can pass the weights from 1982 and 1971 to t.test directly and request a paired test.
t.test(x=NHANES_df_delta_weight$wt82,y=NHANES_df_delta_weight$wt71,paired =TRUE)
##
## Paired t-test
##
## data: NHANES_df_delta_weight$wt82 and NHANES_df_delta_weight$wt71
## t = 13.249, df = 1565, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 2.247720 3.028879
## sample estimates:
## mean difference
## 2.6383
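Interpretation: Since the P-Value is ≤ 0.05, we reject the null hypothesis and conclude that there is sufficient evidence to reject the claim that the mean change in weight is equal to zero kg. On average, subjects gained about 2.64 kg between 1971 and 1982 (95% confidence interval: 2.25 to 3.03 kg).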
The Wilcoxon matched-pairs signed-ranks test is a nonparametric test of the claim that the differences within a population of matched pairs have a median equal to zero.
Example: You performed a small clinical trial to evaluate the efficacy of a new medication to treat hypertension. As part of the study, you collected participants’ systolic blood pressure before and after receiving the treatment. Using a significance level of 0.05, determine if the drug is effective.
hypertension_df <- data.frame(
pre_sbp=c(145,130,141,127,120,153,120,122,125,127),
post_sbp=c(170,130,135,130,122,145,120,120,170,150)
) #create data frame
hypertension_df$delta_sbp <- hypertension_df$post_sbp-hypertension_df$pre_sbp #calculate change in systolic blood pressure
Let’s explore the data
summary(hypertension_df$delta_sbp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.0 -1.5 1.0 8.2 18.0 45.0
Next, draw a few plots. We start with a boxplot of change in systolic blood pressure.
hypertension_df %>% ggplot(aes(y=delta_sbp)) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
geom_boxplot(fill='deepskyblue',outlier.colour="red", outlier.size=4) +
theme_light()
Next, we plot a histogram of change in systolic blood pressure.
hypertension_df %>% ggplot(aes(x=delta_sbp)) +
geom_histogram( bins =4, fill="deepskyblue", color="black") +
theme_light()
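We assess normality using a Q-Q plot and the Shapiro-Wilk test.
hypertension_df %>% ggplot(aes(sample = delta_sbp)) +
stat_qq_line(linewidth = 2, color = 'red') + #draw the reference line in red (set outside aes so no legend is created)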
stat_qq(size=2) +
theme_light()
shapiro.test(hypertension_df$delta_sbp)
##
## Shapiro-Wilk normality test
##
## data: hypertension_df$delta_sbp
## W = 0.82105, p-value = 0.02609
Since the data are right-skewed (the Shapiro-Wilk test is significant, P-Value = 0.026) and the sample size is < 30, the paired t-test is not appropriate; we should use the Wilcoxon matched-pairs signed-ranks test instead.
\({H_0}: \text{median}_d = 0\)
\({H_1}: \text{median}_d \neq\ 0\)
wilcox.test(x = hypertension_df$delta_sbp,mu = 0)
##
## Wilcoxon signed rank test with continuity correction
##
## data: hypertension_df$delta_sbp
## V = 25.5, p-value = 0.3264
## alternative hypothesis: true location is not equal to 0
The same test can be run directly on the two paired columns. Because this call computes the differences as pre_sbp - post_sbp rather than post_sbp - pre_sbp, the V statistic changes (10.5 instead of 25.5), but the two-sided P-Value is identical.
wilcox.test(x=hypertension_df$pre_sbp,y = hypertension_df$post_sbp,paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: hypertension_df$pre_sbp and hypertension_df$post_sbp
## V = 10.5, p-value = 0.3264
## alternative hypothesis: true location shift is not equal to 0
Interpretation: Since the P-Value is > 0.05, we fail to reject the null hypothesis and conclude that there is not sufficient evidence to reject the claim that the median change in systolic blood pressure is equal to zero mmHg.
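The V statistics printed by wilcox.test can be reproduced by hand, which makes the ranking step of the signed-ranks test explicit. The sketch below re-enters the blood pressure values from the example so it runs on its own; the names d, r, v_pos, and v_neg are chosen here for illustration.

```r
# Blood pressure data from the example above
pre_sbp  <- c(145, 130, 141, 127, 120, 153, 120, 122, 125, 127)
post_sbp <- c(170, 130, 135, 130, 122, 145, 120, 120, 170, 150)

d <- post_sbp - pre_sbp      # paired differences (post - pre)
d <- d[d != 0]               # zero differences are dropped
r <- rank(abs(d))            # rank absolute differences; ties get average ranks

v_pos <- sum(r[d > 0])       # sum of ranks of positive differences
v_neg <- sum(r[d < 0])       # sum of ranks of negative differences

v_pos   # 25.5, the V from the test on delta_sbp
v_neg   # 10.5, the V when the differences are taken as pre - post
```

The two sums always add up to the total of the ranks (36 for the eight non-zero differences here), so either one carries the same information for a two-sided test.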