library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')
The standard normal distribution is a normal distribution with \(\mu\) = 0 and \(\sigma\) = 1. The total area under its density curve is equal to 1.
A \(z\) score (or standard score or standardized value) is the number of standard deviations that a given value \(x\) is above or below the mean. \(z\) scores are expressed as numbers with no units of measure. A \(z\) score can be calculated for any normal distribution using the following formula:
\[z= \frac{x-\mu}\sigma\]
where \(x\) is an observed value, \(\mu\) is the population mean, and \(\sigma\) is the population standard deviation.
Because the total area under any density curve is equal to 1, there is a correspondence between area and probability. The \(z\) score can therefore be used to find probabilities.
Example: The mean age of P1 students in the United States is 24 years with a standard deviation of 3 years. Assuming ages are normally distributed, find the probability of randomly selecting a P1 student who is (1) less than 28 years old and (2) 28 years or older.
\[z= \frac{28-24}3= 1.33\]
Using a z-score table, the probability of randomly selecting a P1 student who is less than 28 years old is 0.91 (91%). The probability of randomly selecting a P1 student who is 28 years or older is 1 - 0.91 = 0.09 (9%).
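The table lookup above can also be done in R. The following is a minimal sketch that assumes ages are normally distributed; the variable names are illustrative only, and `pnorm()` is the base R cumulative distribution function for the normal distribution.

```r
# Values taken from the example above
mu    <- 24   # population mean age
sigma <- 3    # population standard deviation
x     <- 28   # age of interest

z <- (x - mu) / sigma        # z score for x = 28

p_less     <- pnorm(z)       # area to the left of z: P(age < 28)
p_at_least <- 1 - p_less     # area to the right of z: P(age >= 28)

round(c(z = z, p_less = p_less, p_at_least = p_at_least), 2)
```

Because R does not round \(z\) to two decimals before the lookup, the unrounded probabilities may differ slightly in the third decimal place from the table values.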
Given any population with any distribution (uniform, skewed, whatever), the distribution of sample means \(\bar{x}\) can be approximated by a normal distribution when the samples are large enough with n > 30. The sampling distribution of \(\bar{x}\) is approximated by a normal distribution with mean \(\mu\) and standard deviation (standard error of the mean) \(\sigma/\sqrt{n}\), where \(n\) is the sample size. Check out this cool simulation (https://onlinestatbook.com/stat_sim/sampling_dist/).
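A small simulation can make the theorem concrete. The sketch below is an illustration only: the exponential population (with \(\mu\) = 1 and \(\sigma\) = 1), the sample size of 40, and the number of repetitions are arbitrary choices, not values from the lecture.

```r
set.seed(204)        # arbitrary seed so the simulation is reproducible

n         <- 40      # size of each sample (assumed for illustration)
n_samples <- 5000    # number of samples drawn (assumed for illustration)

# Draw many samples from a right-skewed population (exponential, mu = 1, sigma = 1)
# and keep the mean of each sample
sample_means <- replicate(n_samples, mean(rexp(n, rate = 1)))

mean(sample_means)   # should be close to mu = 1
sd(sample_means)     # should be close to sigma / sqrt(n)
1 / sqrt(n)          # theoretical standard error of the mean

# The histogram of sample means looks roughly bell-shaped
# even though the population itself is strongly skewed
hist(sample_means, breaks = 40, col = "deepskyblue",
     main = "Sampling distribution of the mean", xlab = "Sample mean")
```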
The central limit theorem can be used to solve many practical problems in statistics by working with the sample mean instead of individual values. When working with a sample with mean \(\bar{x}\), the \(z\) score can be calculated using the following formula:
\[z= \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\] where \(\mu\) is the population mean, and \(\sigma\) is the standard deviation of the population, \(\bar{x}\) is the sample mean, and \(n\) is the sample size.
Example: The mean age of P1 students in the United States is 24 years with a standard deviation of 3 years. Find the probability of randomly selecting a sample of 30 P1 students with a mean age equal to or greater than 25 years.
\[z= \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} = \frac{25-24}{\frac{3}{\sqrt{30}}} = 1.83 \] Using a z-score table, we find that the cumulative area to the left of z = 1.83 is 0.97. The cumulative area to the right of z = 1.83 is 1 - 0.97 = 0.03. There is a 0.03 (3%) probability that 30 randomly selected P1 students will have a mean age of 25 years or more.
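The same calculation can be sketched in R. The variable names are illustrative, and `lower.tail = FALSE` asks `pnorm()` for the area to the right of \(z\).

```r
# Values taken from the example above
mu    <- 24   # population mean
sigma <- 3    # population standard deviation
n     <- 30   # sample size
x_bar <- 25   # sample mean of interest

se <- sigma / sqrt(n)        # standard error of the mean
z  <- (x_bar - mu) / se      # z score of the sample mean

p_upper <- pnorm(z, lower.tail = FALSE)   # P(sample mean >= 25)

round(c(se = se, z = z, p_upper = p_upper), 2)
```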
Realistically, we rarely encounter a situation where the population mean and standard deviation are known. The sample mean \(\bar{x}\) is an unbiased estimate of the population mean \(\mu\). Thus, we can approximate the population mean using the sample mean. Unfortunately, the sample standard deviation \(s\) is a biased estimate of the population standard deviation \(\sigma\) and varies from sample to sample, so simply substituting \(s\) for \(\sigma\) in the \(z\) formula does not give a statistic that follows the standard normal distribution, especially when the sample is small.
This challenge was solved by a statistician named William Gosset, who published his work under the pseudonym “Student”. Gosset discovered that when the population standard deviation \(\sigma\) is replaced by the sample standard deviation \(s\), the resulting statistic follows a distribution referred to as Student’s t distribution.
\[t= \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}\]
where \(\mu\) is the population mean, \(\bar{x}\) is the sample mean, \(s\) is the standard deviation of the sample, and \(n\) is the sample size.
The t distribution is not a single distribution but a family of distributions indexed by a parameter referred to as the degrees of freedom (DF). For the t statistic above, the number of degrees of freedom is the sample size minus 1:
\[DF = n-1\] where \(n\) is the sample size.
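To see how the degrees of freedom change the shape of the distribution, the sketch below compares the 97.5th percentile of several t distributions with that of the standard normal distribution. The particular DF values are arbitrary choices for illustration.

```r
dfs <- c(5, 10, 30, 100, 1000)   # degrees of freedom to compare (arbitrary)

t_crit <- qt(0.975, df = dfs)    # 97.5th percentile of each t distribution
z_crit <- qnorm(0.975)           # 97.5th percentile of the standard normal

data.frame(DF = dfs,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```

The t percentiles are always larger than the normal percentile (heavier tails) and move toward it as DF increases.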
Example: A recent publication claimed that the average age of P1 students in the United States is 25 years. During the National ASHP meeting, you surveyed 40 pharmacy students. The average age of the sample was 24 years with a standard deviation of 3.3 years. If the publication's claim is true, what is the probability that a sample of 40 P1 pharmacy students would have an average age of no more than 24 years?
\[t= \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} = \frac{24-25}{\frac{3.3}{\sqrt{40}}} = -1.92 \]
Using a t table, we find that the cumulative area to the left of t = -1.92 with DF = 40 - 1 = 39 is 0.03. Assuming the publication's claim is true, there is a 0.03 (3%) probability that 40 randomly selected P1 students will have a mean age of 24 years or less.
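The t table lookup can be reproduced with `pt()`, the base R cumulative distribution function for the t distribution. This is a minimal sketch using the numbers from the example; the variable names are illustrative.

```r
# Values taken from the example above
mu0   <- 25    # claimed population mean
x_bar <- 24    # sample mean
s     <- 3.3   # sample standard deviation
n     <- 40    # sample size

t_stat  <- (x_bar - mu0) / (s / sqrt(n))   # t statistic
p_lower <- pt(t_stat, df = n - 1)          # area to the left of t with DF = 39

round(c(t = t_stat, p_lower = p_lower), 2)
```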
A confidence interval (or interval estimate) is a range (or an interval) of values used to estimate the true value of a population parameter. A confidence interval is sometimes abbreviated as CI.
A confidence interval can be calculated using the following formula:
\[CI=(\bar{x} - E,\bar{x} + E) \] where \(\bar{x}\) is the sample mean, and \(E\) is the margin of error.
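The notes do not spell out how \(E\) is obtained. One common choice when \(\sigma\) is unknown, assumed here for illustration, is \(E = t_{\alpha/2,\,n-1} \cdot s/\sqrt{n}\). The sketch below applies it to the survey of 40 pharmacy students from the earlier example to build a 95% confidence interval.

```r
# Sample values from the earlier survey example
x_bar <- 24     # sample mean
s     <- 3.3    # sample standard deviation
n     <- 40     # sample size
conf  <- 0.95   # confidence level (assumed)

alpha  <- 1 - conf
t_crit <- qt(1 - alpha / 2, df = n - 1)   # critical t value for DF = 39
E      <- t_crit * s / sqrt(n)            # margin of error (assumed formula)

ci <- c(lower = x_bar - E, upper = x_bar + E)
round(ci, 2)
```

This is the same construction `t.test()` uses for the confidence interval it reports for a single sample.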
In statistics, a hypothesis is a claim or a statement about a property of a population. A hypothesis test is a procedure for testing a claim about a population. Consider the following examples of hypotheses:
In hypothesis testing, the hypotheses being considered can be formulated in terms of null and alternative hypotheses, which can be defined as follows:
The null hypothesis (denoted by \({H_0}\)) is a statement that the value of a population parameter (such as proportion, mean) is equal to some claimed value.
The alternative hypothesis (denoted by \(H_1\) or \(H_a\) or \(H_A\)) is a statement that the parameter has a value that somehow contradicts the null hypothesis. The symbolic form of the alternative hypothesis uses one of the symbols <, >, or ≠; for a two-tailed test it is ≠.
It is important to note that null indicates no change, no effect, or no difference. We conduct the hypothesis test by assuming that the parameter is equal to some specified value so that we can work with a single distribution having a specific value.
Example: The mean hours of sleep P1 students receive is 8 hours daily.
Null hypothesis: The mean hours of sleep P1 pharmacy students receive is equal to 8 hours daily. \[H_0: \mu = 8\]
Alternative hypothesis: The mean hours of sleep P1 pharmacy students receive is NOT equal to 8 hours daily. \[H_1: \mu \neq\ 8\]
The critical region (or rejection region) is the area corresponding to all values of the test statistic that cause us to reject the null hypothesis.
The significance level \(\alpha\) for a hypothesis test is the probability value used as the cutoff for determining when the sample evidence constitutes significant evidence against the null hypothesis. The most common choice for \(\alpha\) is 0.05.
In this course, we will use two-tailed tests, in which the critical region is split between the two extreme regions (tails) under the curve. Most scientific journals and the FDA generally expect two-sided tests.
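For a two-tailed test, \(\alpha\) is split equally between the two tails, so each tail has area \(\alpha/2\). The sketch below finds the boundaries of the critical region for \(\alpha\) = 0.05. The DF of 39 and the helper `in_critical_region()` are hypothetical, introduced only for illustration.

```r
alpha <- 0.05

# Critical values: the test statistic must be beyond +/- these values to reject H0
z_crit <- qnorm(1 - alpha / 2)        # standard normal
t_crit <- qt(1 - alpha / 2, df = 39)  # t distribution, DF = 39 (assumed)

round(c(z_crit = z_crit, t_crit = t_crit), 3)

# Hypothetical helper: TRUE when the statistic falls in either tail
in_critical_region <- function(stat, crit) abs(stat) > crit

in_critical_region(2.5, t_crit)   # a statistic of 2.5 lies in the upper tail
in_critical_region(-1.0, t_crit)  # a statistic of -1.0 does not
```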
The P-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
The one-sample t-test is used to test if the sample comes from a particular population with a specific mean. The following assumptions must be met for a one-sample t-test to return valid results:
Example: The mean age of the adult population in the United States is claimed to be 43 years. Using a significance level of 0.05, test the claim using the age of the NHANES study participants.
\({H_0}: \mu = 43\)
\({H_1}: \mu \neq\ 43\)
Let’s check if the data is numeric using the class() function.
class(NHANES_df$age) #Check if data numeric
## [1] "numeric"
Let’s explore the data.
summary(NHANES_df$age) ## generate summary stat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 33.00 44.00 43.92 53.00 74.00
Next, let’s plot the data to determine the distribution and identify outliers.
NHANES_df %>% ggplot(aes(x = age)) +
  geom_histogram(bins = 40, fill = "deepskyblue", color = "black") +
  theme_light()
# x = numeric variable mapped to the x-axis.
# bins = the number of bins (use binwidth instead to set the width of each bin).
NHANES_df %>% ggplot(aes(x=age)) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
geom_boxplot(fill='deepskyblue',outlier.colour="red") + theme_light()
Since the data is only slightly skewed with no significant outliers, and the sample size is well above 30 (n = 1629), the sampling distribution of the mean can be treated as approximately normal and we can use the t-test.
t.test(x = NHANES_df$age, mu = 43)
## x = numeric vector of data values.
## mu = the hypothesized population mean under the null hypothesis.
##
## One Sample t-test
##
## data: NHANES_df$age
## t = 3.0354, df = 1628, p-value = 0.002441
## alternative hypothesis: true mean is not equal to 43
## 95 percent confidence interval:
## 43.32384 44.50673
## sample estimates:
## mean of x
## 43.91529
Interpretation: Because the P-value (0.002) is less than 0.05, we reject the null hypothesis and conclude that there is sufficient evidence to reject the claim that the mean age of adults in the United States is 43 years.
Had the P-value been ≥ 0.05, we would have failed to reject the null hypothesis and concluded that there is not sufficient evidence to reject the claim that the mean age of adults in the United States is 43 years.
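As a check on the t.test() output, the two-tailed P-value can be recomputed from the reported test statistic and degrees of freedom. This is a minimal sketch; the values of `t_stat` and `df_t` are copied from the output above, and the variable names are illustrative.

```r
t_stat <- 3.0354   # t statistic reported by t.test()
df_t   <- 1628     # degrees of freedom reported by t.test()
alpha  <- 0.05     # significance level

# Two-tailed P-value: area beyond |t| in both tails
p_value <- 2 * pt(abs(t_stat), df = df_t, lower.tail = FALSE)

p_value           # should agree with the reported p-value up to rounding
p_value < alpha   # TRUE means we reject the null hypothesis
```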