Before begin..
Let’s load the SBP dataset.
dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")
head(dataset_sbp)
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 116 78 94 4 16.6
## 2 1 1 100 60 79 4 22.3
## 3 1 1 100 60 87 4 21.9
## 4 1 1 111 70 72 4 20.2
## 5 1 1 120 80 98 4 20.0
## 6 1 1 115 79 95 4 23.1
Sampling distribution and variability
Example 1
Let’s see this example. If we sample a small subset
of
data, the distribution will look different from larger set of
samples.
Original data’s histogram
hist(dataset_sbp$SBP, main = "Histogram of SBP")
Sample variance
Let’s say we sampled 2 samples from the original dataset.
set.seed(1)
print("Two randomly selecte samples")
## [1] "Two randomly selecte samples"
dataset_sbp$SBP %>%
sample(2)
## [1] 97 102
print("Two randomly selecte samples -2 ")
## [1] "Two randomly selecte samples -2 "
set.seed(2)
dataset_sbp$SBP %>%
sample(2)
## [1] 135 110
Two different random samplings showed difference in mean values.
We call this sample variance.
Sampling distribution
Let’s see how they looks like in histogram.
This time, the sample number is slightly higher.
Histogram of 10 samples from the dataset (run 1)
dataset_sbp$SBP %>%
sample(10) %>%
hist(main = "Histogram of SBP, N = 10", 10)
Histogram of 10 samples from the dataset (run 2)
dataset_sbp$SBP %>%
sample(10) %>%
hist(main = "Histogram of SBP, N = 10")
We randomly sampled 10 dataset from the
dataset_SBP
and looked at their distribution. Do you think
they look the same?
No, unfortunately they are not.
Since the sampling takes samples randomly, their mean, distribution is different.
Of course, since this sampling was done by computer, we can make the
sampling process reproducible by a
set.seed()
function in R.
set.seed(1)
dataset_sbp$SBP %>%
sample(10) %>%
hist(main = "Histogram of SBP, N = 10")
set.seed(1)
dataset_sbp$SBP %>%
sample(10) %>%
hist(main = "Histogram of SBP, N = 10")
If we set the seed, the sample sampling procedure will use the
seed number
(in this case it was 1
) for it’s
algorithm. When the seed number is the same, the computer will generate
the same data every time.
However, can we set seed for actual sampling processes in
the real world? As it is impossible, the samples are always different
from the population, and this phenomena is called
sampling variability
.
That means, in other words, the samples and their summary statistics actually cannot represent the population.
Then, how can we estimate the population?
Example data set for presentation material
set.seed(1)
dataset_sbp$SBP %>%
sample(20) %>%
hist(main = "Histogram of SBP, N = 20")
dataset_sbp$SBP %>%
sample(20) %>%
data.frame %>%
summarise(mean = mean(.),
sd = sd(.))
## mean sd
## 1 123.65 12.60441
set.seed(1)
dataset_sbp$SBP %>%
sample(50) %>%
hist(main = "Histogram of SBP, N = 50")
dataset_sbp$SBP %>%
sample(50) %>%
data.frame %>%
summarise(mean = mean(.),
sd = sd(.))
## mean sd
## 1 119.72 14.07566
Sample mean
This time, let’s do some other calculations.
We are going to sample 20 samples (randomly), 5000 times. And then, let’s see its histogram
set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 20",
xlim = c(105, 135))
What if we increase the number of random sampling?
set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 20",
xlim = c(105, 135))
set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 50)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 50",
xlim = c(105, 135))
set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 150)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 150",
xlim = c(105, 135))
How do they look?
Let’s see their variation in boxplots
set.seed(1)
dataset_pbs_r5000_n50_250_400 <- data.frame(`n = 50` = (replicate(5000, sample(dataset_sbp$FBS, 50)) %>%
data.frame %>%
colMeans()),
`n = 250` = (replicate(5000, sample(dataset_sbp$FBS, 250)) %>%
data.frame %>%
colMeans()),
`n = 400` = (replicate(5000, sample(dataset_sbp$FBS, 400)) %>%
data.frame %>%
colMeans()),
check.names = F
) %>%
gather(condition, FBS, `n = 50`:`n = 400`, factor_key=TRUE)
dataset_pbs_r5000_n50_250_400 %>%
boxplot(data = .,
FBS ~ condition,
main = "Estimated sampling distribution, \nsample mean of 5000 random samples of \nn = 50, n = 250, and n = 400",
xlab = "Sample N of each ramdom sampling")
Summary table
dataset_pbs_r5000_n50_250_400 %>%
group_by(condition) %>%
summarise(mean = mean(FBS),
sd = sd(FBS))
## # A tibble: 3 × 3
## condition mean sd
## <fct> <dbl> <dbl>
## 1 n = 50 99.0 3.31
## 2 n = 250 98.9 1.44
## 3 n = 400 98.9 1.14
dataset_sbp %>%
summarise(mean = mean(FBS),
sd = sd(FBS))
## mean sd
## 1 98.86443 22.9813
Note that the change in variation is dependent on N, not the number of repeat
set.seed(1)
dataset_sbp_r500_n20_50_150 <- data.frame(`n = 20` = (replicate(500, sample(dataset_sbp$SBP, 20)) %>%
data.frame %>%
colMeans()),
`n = 100` = (replicate(500, sample(dataset_sbp$SBP, 100)) %>%
data.frame %>%
colMeans()),
`n = 150` = (replicate(500, sample(dataset_sbp$SBP, 150)) %>%
data.frame %>%
colMeans()),
check.names = F
) %>%
gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE)
set.seed(1)
dataset_sbp_r50_n20_50_150 <- data.frame(`n = 20` = (replicate(50, sample(dataset_sbp$SBP, 20)) %>%
data.frame %>%
colMeans()),
`n = 100` = (replicate(50, sample(dataset_sbp$SBP, 100)) %>%
data.frame %>%
colMeans()),
`n = 150` = (replicate(50, sample(dataset_sbp$SBP, 150)) %>%
data.frame %>%
colMeans()),
check.names = F
) %>%
gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE)
set.seed(1)
dataset_sbp_r5000_n20_50_150 <- data.frame(`n = 20` = (replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
data.frame %>%
colMeans()),
`n = 100` = (replicate(50, sample(dataset_sbp$SBP, 100)) %>%
data.frame %>%
colMeans()),
`n = 150` = (replicate(50, sample(dataset_sbp$SBP, 150)) %>%
data.frame %>%
colMeans()),
check.names = F
) %>%
gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE)
dataset_sbp_r50_n20_50_150$`Number of repeat` <- 50
dataset_sbp_r500_n20_50_150$`Number of repeat` <- 500
dataset_sbp_r5000_n20_50_150$`Number of repeat` <- 5000
rbind(dataset_sbp_r50_n20_50_150,
dataset_sbp_r500_n20_50_150,
dataset_sbp_r5000_n20_50_150) %>%
mutate(`Number of repeat` = as.factor(`Number of repeat`)) %>%
ggplot(data = ., aes(x = condition, y = SBP)) +
geom_boxplot(aes(fill = `Number of repeat`)) +
ggtitle("Estimated sampling distribution, \nsample mean of some random samples of \nn = 20, n = 50, and n = 150") +
xlab("Sample N of each ramdom sampling")
?geom_boxplot()
Sample mean and sample distribution of skewed data
As same as we observed the tendency of sample means with bell-shaped dataset, we are going to investigate skewed data
Original data’s histogram
hist(dataset_sbp$FBS, main = "Histogram of FBS", 10)
dataset_sbp %>% summarise(mean = mean(FBS),
sd = sd(FBS))
## mean sd
## 1 98.86443 22.9813
Sample variance
Let’s say we sampled 2 samples from the original dataset.
set.seed(1)
print("Two randomly selected samples")
## [1] "Two randomly selected samples"
dataset_sbp$FBS %>%
sample(2)
## [1] 79 109
print("Two randomly selected samples -2 ")
## [1] "Two randomly selected samples -2 "
set.seed(2)
dataset_sbp$FBS %>%
sample(2)
## [1] 95 91
Two different random samplings showed difference in mean values.
We call this sample variance.
Sampling distribution
Let’s see how they looks like in histogram.
This time, the sample number is slightly higher.
Histogram of 10 samples from the dataset (run 1)
set.seed(3)
dataset_sbp$FBS %>%
sample(10) %>%
hist(main = "Histogram of FBS, N = 10", 10)
set.seed(3)
dataset_sbp$FBS %>%
sample(10) %>%
data.frame() %>%
summarise(mean = mean(.),
sd = sd(.))
## mean sd
## 1 98.5 23.0904
Histograms of multiple random sampling and their sample mean
set.seed(1)
replicate(5000, sample(dataset_sbp$FBS, 50)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean FBS from 1M subjects.\nEach sample N = 50",
xlim = c(85, 120))
set.seed(1)
replicate(5000, sample(dataset_sbp$FBS, 250)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean FBS from 1M subjects.\nEach sample N = 250",
xlim = c(85, 120))
set.seed(1)
replicate(5000, sample(dataset_sbp$FBS, 400)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean FBS from 1M subjects.\nEach sample N = 400",
xlim = c(85, 120))
How do they look?
Let’s see their variation in boxplots
set.seed(1)
dataset_sbp_r5000_n20_50_150 <- data.frame(`n = 20` = (replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
data.frame %>%
colMeans()),
`n = 100` = (replicate(5000, sample(dataset_sbp$SBP, 100)) %>%
data.frame %>%
colMeans()),
`n = 150` = (replicate(5000, sample(dataset_sbp$SBP, 150)) %>%
data.frame %>%
colMeans()),
check.names = F
) %>%
gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE)
dataset_sbp_r5000_n20_50_150 %>%
boxplot(data = .,
SBP ~ condition,
main = "Estimated sampling distribution, \nsample mean of 5000 random samples of \nn = 20, n = 50, and n = 150",
xlab = "Sample N of each ramdom sampling")
Summary table
dataset_sbp_r5000_n20_50_150 %>%
group_by(condition) %>%
summarise(mean = mean(SBP),
sd = sd(SBP))
## # A tibble: 3 × 3
## condition mean sd
## <fct> <dbl> <dbl>
## 1 n = 20 122. 3.26
## 2 n = 100 122. 1.45
## 3 n = 150 122. 1.20
dataset_sbp %>%
summarise(mean = mean(SBP),
sd = sd(SBP))
## mean sd
## 1 121.8718 14.56171
Binary variables
Summary table
The criteria for diagnosing hyper tension (high blood pressure) is as follows.
Normal: SBP less than 120 and DBP less than 80 mm Hg; Elevated: SBP 120 to 129 and DBP less than 80 mm Hg; Stage 1 hypertension: SBP 130 to 139 or DBP 80 to 89 mm Hg; Stage 2 hypertension: SBP greater than or equal to 140 mm Hg or DBP greater than or equal to 90 mm Hg.
Using that categorization, we are going to create a new variable
called hypertension
.
dataset_sbp$hypertension <- ifelse(dataset_sbp$SBP > 130 | dataset_sbp$DBP > 80,
1,
0)
And the histogram of hypertension will look like this.
dataset_sbp$hypertension %>% hist(main = "Histogram of hypertension")
dataset_sbp$hypertension %>% mean
## [1] 0.316803
Sample mean of binary variable
If we take random 50 samples, due to sampling variation, the sample mean will show different results.
set.seed(1)
dataset_sbp$hypertension %>% sample(50) %>% hist(main = "Histogram of hypertension of \n 50 random samples")
set.seed(1)
dataset_sbp$hypertension %>% sample(50) %>% mean
## [1] 0.3
dataset_sbp$hypertension %>% hist(main = "Histogram of hypertension")
dataset_sbp$hypertension %>% mean
## [1] 0.316803
Sampling distribution of binary variable
set.seed(1)
replicate(5000, sample(dataset_sbp$hypertension, 50)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean p-hat of hypertation from 1M subjects.\nEach sample N = 50",
xlim = c(0, 0.6))
set.seed(1)
replicate(5000, sample(dataset_sbp$hypertension, 150)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean p-hat of hypertation from 1M subjects.\nEach sample N = 150",
xlim = c(0, 0.6))
set.seed(1)
replicate(5000, sample(dataset_sbp$hypertension, 500)) %>%
data.frame() %>%
colMeans() %>%
hist(100,
main = "Distribution of 5000 sample mean p-hat of hypertation from 1M subjects.\nEach sample N = 500",
xlim = c(0, 0.6))
box plot
set.seed(1)
dataset_hypertension_r5000_n_50_150_500 <- data.frame(`n = 50` = (replicate(5000, sample(dataset_sbp$hypertension, 50)) %>%
data.frame %>%
colMeans()),
`n = 150` = (replicate(5000, sample(dataset_sbp$hypertension, 150)) %>%
data.frame %>%
colMeans()),
`n = 500` = (replicate(5000, sample(dataset_sbp$hypertension, 500)) %>%
data.frame %>%
colMeans()),
check.names = F
) %>%
gather(condition, hypertension, `n = 50`:`n = 500`, factor_key=TRUE)
dataset_hypertension_r5000_n_50_150_500 %>%
boxplot(data = .,
hypertension ~ condition,
main = "Estimated sampling distribution, \nsample mean of 5000 random samples of \nn = 50, n = 150, and n = 500",
xlab = "Sample N of each ramdom sampling",
ylab = "P-hat of mean of 5000 random sample sets")
## Summary table
dataset_hypertension_r5000_n_50_150_500 %>%
group_by(condition) %>%
summarise(mean = mean(hypertension),
sd = sd(hypertension))
## # A tibble: 3 × 3
## condition mean sd
## <fct> <dbl> <dbl>
## 1 n = 50 0.316 0.0639
## 2 n = 150 0.317 0.0378
## 3 n = 500 0.317 0.0206
dataset_sbp %>% summarise(mean = mean(hypertension),
sd = sd(hypertension))
## mean sd
## 1 0.316803 0.4652301
Standard error (SE)
Standard error of and it’s theoretical value for it’s subsets and sampling distribution
dataset_sbp %>% summarise(mean = mean(SBP),
sd = sd(SBP))
## mean sd
## 1 121.8718 14.56171
sd(dataset_sbp$SBP)/sqrt(20)
## [1] 3.256096
sd(dataset_sbp$SBP)/sqrt(100)
## [1] 1.456171
sd(dataset_sbp$SBP)/sqrt(150)
## [1] 1.188958
Standard error of sample proportions
dataset_sbp %>% summarise(mean = mean(hypertension),
sd = sd(hypertension))
## mean sd
## 1 0.316803 0.4652301
sd(dataset_sbp$hypertension)/sqrt(50)
## [1] 0.06579348
sd(dataset_sbp$hypertension)/sqrt(150)
## [1] 0.03798588
sd(dataset_sbp$hypertension)/sqrt(500)
## [1] 0.02080572
sd(dataset_sbp$hypertension)
## [1] 0.4652301
set.seed(8)
dataset_sbp$hypertension %>% sample(100) %>% sd
## [1] 0.4943111
set.seed(1)
dataset_sbp$hypertension %>% sample(100) %>% sd
## [1] 0.4648232
set.seed(5)
dataset_sbp$hypertension %>% sample(100) %>% sd
## [1] 0.4512609
Bibliography
## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.
## Springer-Verlag New York, 2016.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. R package version 2.4.5, <https://CRAN.R-project.org/package=swirl>.