COD_week4_1_MGK_BTE3207

Minsik Kim

2023-09-19

Before begin..

Let’s load the SBP dataset.

dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv") 

head(dataset_sbp)
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 116  78  94   4 16.6
## 2   1     1 100  60  79   4 22.3
## 3   1     1 100  60  87   4 21.9
## 4   1     1 111  70  72   4 20.2
## 5   1     1 120  80  98   4 20.0
## 6   1     1 115  79  95   4 23.1

Sampling distribution and variability

Example 1

Let’s see this example. If we sample a small subset of data, the distribution will look different from larger set of samples.

Original data’s histogram

hist(dataset_sbp$SBP, main = "Histogram of SBP")

Sample variance

Let’s say we sampled 2 samples from the original dataset.

set.seed(1)

print("Two randomly selecte samples")
## [1] "Two randomly selecte samples"
dataset_sbp$SBP %>%
             sample(2)
## [1]  97 102
print("Two randomly selecte samples -2 ")
## [1] "Two randomly selecte samples -2 "
set.seed(2)


dataset_sbp$SBP %>%
             sample(2)
## [1] 135 110

Two different random samplings showed difference in mean values.

We call this sample variance.

Sampling distribution

Let’s see how they looks like in histogram.

This time, the sample number is slightly higher.

Histogram of 10 samples from the dataset (run 1)

dataset_sbp$SBP %>%
             sample(10) %>%
        hist(main = "Histogram of SBP, N = 10", 10)

Histogram of 10 samples from the dataset (run 2)

dataset_sbp$SBP %>%
             sample(10) %>%
        hist(main = "Histogram of SBP, N = 10")

We randomly sampled 10 dataset from the dataset_SBP and looked at their distribution. Do you think they look the same?

No, unfortunately they are not.

Since the sampling takes samples randomly, their mean, distribution is different.

Of course, since this sampling was done by computer, we can make the sampling process reproducible by a set.seed() function in R.

set.seed(1)

dataset_sbp$SBP %>%
             sample(10) %>%
        hist(main = "Histogram of SBP, N = 10")

set.seed(1)

dataset_sbp$SBP %>%
             sample(10) %>%
        hist(main = "Histogram of SBP, N = 10")

If we set the seed, the sample sampling procedure will use the seed number (in this case it was 1) for it’s algorithm. When the seed number is the same, the computer will generate the same data every time.

However, can we set seed for actual sampling processes in the real world? As it is impossible, the samples are always different from the population, and this phenomena is called sampling variability.

That means, in other words, the samples and their summary statistics actually cannot represent the population.

Then, how can we estimate the population?

Example data set for presentation material

set.seed(1)

dataset_sbp$SBP %>%
             sample(20) %>%
        hist(main = "Histogram of SBP, N = 20")

dataset_sbp$SBP %>%
        sample(20) %>% 
        data.frame %>% 
        summarise(mean = mean(.),
                  sd = sd(.))
##     mean       sd
## 1 123.65 12.60441
set.seed(1)

dataset_sbp$SBP %>%
             sample(50) %>%
        hist(main = "Histogram of SBP, N = 50")

dataset_sbp$SBP %>%
        sample(50) %>% 
        data.frame %>% 
        summarise(mean = mean(.),
                  sd = sd(.))
##     mean       sd
## 1 119.72 14.07566

Sample mean

This time, let’s do some other calculations.

We are going to sample 20 samples (randomly), 5000 times. And then, let’s see its histogram

set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 20",
             xlim = c(105, 135))

What if we increase the number of random sampling?

set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 20",
             xlim = c(105, 135))

set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 50)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 50",
             xlim = c(105, 135))

set.seed(1)
replicate(5000, sample(dataset_sbp$SBP, 150)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean SBP from 1M subjects.\nEach sample N = 150",
             xlim = c(105, 135))

How do they look?

Let’s see their variation in boxplots

set.seed(1)
dataset_pbs_r5000_n50_250_400 <- data.frame(`n = 50` = (replicate(5000, sample(dataset_sbp$FBS, 50)) %>%
                               data.frame %>%
                               colMeans()),
           `n = 250` = (replicate(5000, sample(dataset_sbp$FBS, 250)) %>%
                                data.frame %>%
                                colMeans()),
           `n = 400` = (replicate(5000, sample(dataset_sbp$FBS, 400)) %>%
                                data.frame %>%
                                colMeans()),
           check.names = F
           ) %>%
        gather(condition, FBS, `n = 50`:`n = 400`, factor_key=TRUE) 

dataset_pbs_r5000_n50_250_400 %>% 
        boxplot(data = .,
                FBS ~ condition, 
                main = "Estimated sampling distribution, \nsample mean of 5000 random samples of \nn = 50, n = 250, and n = 400",
                xlab = "Sample N of each ramdom sampling")

Summary table

dataset_pbs_r5000_n50_250_400 %>% 
        group_by(condition) %>%
        summarise(mean = mean(FBS),
                  sd = sd(FBS))
## # A tibble: 3 × 3
##   condition  mean    sd
##   <fct>     <dbl> <dbl>
## 1 n = 50     99.0  3.31
## 2 n = 250    98.9  1.44
## 3 n = 400    98.9  1.14
dataset_sbp %>% 
        summarise(mean = mean(FBS),
                  sd = sd(FBS))
##       mean      sd
## 1 98.86443 22.9813

Note that the change in variation is dependent on N, not the number of repeat

set.seed(1)

dataset_sbp_r500_n20_50_150 <- data.frame(`n = 20` = (replicate(500, sample(dataset_sbp$SBP, 20)) %>%
                               data.frame %>%
                               colMeans()),
           `n = 100` = (replicate(500, sample(dataset_sbp$SBP, 100)) %>%
                                data.frame %>%
                                colMeans()),
           `n = 150` = (replicate(500, sample(dataset_sbp$SBP, 150)) %>%
                                data.frame %>%
                                colMeans()),
           check.names = F
           ) %>%
        gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE) 

set.seed(1)

dataset_sbp_r50_n20_50_150 <- data.frame(`n = 20` = (replicate(50, sample(dataset_sbp$SBP, 20)) %>%
                               data.frame %>%
                               colMeans()),
           `n = 100` = (replicate(50, sample(dataset_sbp$SBP, 100)) %>%
                                data.frame %>%
                                colMeans()),
           `n = 150` = (replicate(50, sample(dataset_sbp$SBP, 150)) %>%
                                data.frame %>%
                                colMeans()),
           check.names = F
           ) %>%
        gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE) 

set.seed(1)

dataset_sbp_r5000_n20_50_150 <- data.frame(`n = 20` = (replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
                               data.frame %>%
                               colMeans()),
           `n = 100` = (replicate(50, sample(dataset_sbp$SBP, 100)) %>%
                                data.frame %>%
                                colMeans()),
           `n = 150` = (replicate(50, sample(dataset_sbp$SBP, 150)) %>%
                                data.frame %>%
                                colMeans()),
           check.names = F
           ) %>%
        gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE) 

dataset_sbp_r50_n20_50_150$`Number of repeat` <- 50
dataset_sbp_r500_n20_50_150$`Number of repeat` <- 500
dataset_sbp_r5000_n20_50_150$`Number of repeat` <- 5000


rbind(dataset_sbp_r50_n20_50_150,
      dataset_sbp_r500_n20_50_150,
      dataset_sbp_r5000_n20_50_150) %>%
        mutate(`Number of repeat` = as.factor(`Number of repeat`)) %>% 
        ggplot(data = ., aes(x = condition, y = SBP)) +
        geom_boxplot(aes(fill = `Number of repeat`)) +
        ggtitle("Estimated sampling distribution, \nsample mean of some random samples of \nn = 20, n = 50, and n = 150") +
        xlab("Sample N of each ramdom sampling")

?geom_boxplot()

Sample mean and sample distribution of skewed data

As same as we observed the tendency of sample means with bell-shaped dataset, we are going to investigate skewed data

Original data’s histogram

hist(dataset_sbp$FBS, main = "Histogram of FBS", 10)

dataset_sbp %>% summarise(mean = mean(FBS),
                              sd = sd(FBS))
##       mean      sd
## 1 98.86443 22.9813

Sample variance

Let’s say we sampled 2 samples from the original dataset.

set.seed(1)

print("Two randomly selected samples")
## [1] "Two randomly selected samples"
dataset_sbp$FBS %>%
             sample(2)
## [1]  79 109
print("Two randomly selected samples -2 ")
## [1] "Two randomly selected samples -2 "
set.seed(2)


dataset_sbp$FBS %>%
             sample(2)
## [1] 95 91

Two different random samplings showed difference in mean values.

We call this sample variance.

Sampling distribution

Let’s see how they looks like in histogram.

This time, the sample number is slightly higher.

Histogram of 10 samples from the dataset (run 1)

set.seed(3)
dataset_sbp$FBS %>%
             sample(10) %>%
        hist(main = "Histogram of FBS, N = 10", 10)

set.seed(3)
dataset_sbp$FBS %>%
        sample(10) %>%
        data.frame() %>%
        summarise(mean = mean(.),
                              sd = sd(.))
##   mean      sd
## 1 98.5 23.0904

Histograms of multiple random sampling and their sample mean

set.seed(1)
replicate(5000, sample(dataset_sbp$FBS, 50)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean FBS from 1M subjects.\nEach sample N = 50",
             xlim = c(85, 120))

set.seed(1)
replicate(5000, sample(dataset_sbp$FBS, 250)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean FBS from 1M subjects.\nEach sample N = 250",
             xlim = c(85, 120))

set.seed(1)
replicate(5000, sample(dataset_sbp$FBS, 400)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean FBS from 1M subjects.\nEach sample N = 400",
             xlim = c(85, 120))

How do they look?

Let’s see their variation in boxplots

set.seed(1)
dataset_sbp_r5000_n20_50_150 <- data.frame(`n = 20` = (replicate(5000, sample(dataset_sbp$SBP, 20)) %>%
                               data.frame %>%
                               colMeans()),
           `n = 100` = (replicate(5000, sample(dataset_sbp$SBP, 100)) %>%
                                data.frame %>%
                                colMeans()),
           `n = 150` = (replicate(5000, sample(dataset_sbp$SBP, 150)) %>%
                                data.frame %>%
                                colMeans()),
           check.names = F
           ) %>%
        gather(condition, SBP, `n = 20`:`n = 150`, factor_key=TRUE) 

dataset_sbp_r5000_n20_50_150 %>% 
        boxplot(data = .,
                SBP ~ condition, 
                main = "Estimated sampling distribution, \nsample mean of 5000 random samples of \nn = 20, n = 50, and n = 150",
                xlab = "Sample N of each ramdom sampling")

Summary table

dataset_sbp_r5000_n20_50_150 %>% 
        group_by(condition) %>%
        summarise(mean = mean(SBP),
                  sd = sd(SBP))
## # A tibble: 3 × 3
##   condition  mean    sd
##   <fct>     <dbl> <dbl>
## 1 n = 20     122.  3.26
## 2 n = 100    122.  1.45
## 3 n = 150    122.  1.20
dataset_sbp %>% 
        summarise(mean = mean(SBP),
                  sd = sd(SBP))
##       mean       sd
## 1 121.8718 14.56171

Binary variables

Summary table

The criteria for diagnosing hyper tension (high blood pressure) is as follows.

Normal: SBP less than 120 and DBP less than 80 mm Hg; Elevated: SBP 120 to 129 and DBP less than 80 mm Hg; Stage 1 hypertension: SBP 130 to 139 or DBP 80 to 89 mm Hg; Stage 2 hypertension: SBP greater than or equal to 140 mm Hg or DBP greater than or equal to 90 mm Hg.

Using that categorization, we are going to create a new variable called hypertension.

dataset_sbp$hypertension <- ifelse(dataset_sbp$SBP > 130 | dataset_sbp$DBP > 80,
                                   1,
                                   0)

And the histogram of hypertension will look like this.

dataset_sbp$hypertension %>% hist(main = "Histogram of hypertension")

dataset_sbp$hypertension %>% mean
## [1] 0.316803

Sample mean of binary variable

If we take random 50 samples, due to sampling variation, the sample mean will show different results.

set.seed(1)
dataset_sbp$hypertension %>% sample(50) %>% hist(main = "Histogram of hypertension of \n 50 random samples")

set.seed(1)
dataset_sbp$hypertension %>% sample(50) %>% mean
## [1] 0.3
dataset_sbp$hypertension %>% hist(main = "Histogram of hypertension")

dataset_sbp$hypertension %>% mean
## [1] 0.316803

Sampling distribution of binary variable

set.seed(1)
replicate(5000, sample(dataset_sbp$hypertension, 50)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean p-hat of hypertation from 1M subjects.\nEach sample N = 50",
             xlim = c(0, 0.6))

set.seed(1)
replicate(5000, sample(dataset_sbp$hypertension, 150)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean p-hat of hypertation from 1M subjects.\nEach sample N = 150",
             xlim = c(0, 0.6))

set.seed(1)
replicate(5000, sample(dataset_sbp$hypertension, 500)) %>%
        data.frame() %>% 
        colMeans() %>%
        hist(100,
             main = "Distribution of 5000 sample mean p-hat of hypertation from 1M subjects.\nEach sample N = 500",
xlim = c(0, 0.6))

box plot

set.seed(1)
dataset_hypertension_r5000_n_50_150_500 <- data.frame(`n = 50` = (replicate(5000, sample(dataset_sbp$hypertension, 50)) %>%
                               data.frame %>%
                               colMeans()),
           `n = 150` = (replicate(5000, sample(dataset_sbp$hypertension, 150)) %>%
                                data.frame %>%
                                colMeans()),
           `n = 500` = (replicate(5000, sample(dataset_sbp$hypertension, 500)) %>%
                                data.frame %>%
                                colMeans()),
           check.names = F
           ) %>%
        gather(condition, hypertension, `n = 50`:`n = 500`, factor_key=TRUE) 

dataset_hypertension_r5000_n_50_150_500 %>% 
        boxplot(data = .,
                hypertension ~ condition, 
                main = "Estimated sampling distribution, \nsample mean of 5000 random samples of \nn = 50, n = 150, and n = 500",
                xlab = "Sample N of each ramdom sampling",
                ylab = "P-hat of mean of 5000 random sample sets")

## Summary table

dataset_hypertension_r5000_n_50_150_500 %>%
        group_by(condition) %>%
        summarise(mean = mean(hypertension),
                  sd = sd(hypertension))
## # A tibble: 3 × 3
##   condition  mean     sd
##   <fct>     <dbl>  <dbl>
## 1 n = 50    0.316 0.0639
## 2 n = 150   0.317 0.0378
## 3 n = 500   0.317 0.0206
dataset_sbp %>% summarise(mean = mean(hypertension),
                  sd = sd(hypertension))
##       mean        sd
## 1 0.316803 0.4652301

Standard error (SE)

Standard error of and it’s theoretical value for it’s subsets and sampling distribution

dataset_sbp %>% summarise(mean = mean(SBP),
                              sd = sd(SBP))
##       mean       sd
## 1 121.8718 14.56171
sd(dataset_sbp$SBP)/sqrt(20)
## [1] 3.256096
sd(dataset_sbp$SBP)/sqrt(100)
## [1] 1.456171
sd(dataset_sbp$SBP)/sqrt(150)
## [1] 1.188958

Standard error of sample proportions

dataset_sbp %>% summarise(mean = mean(hypertension),
                              sd = sd(hypertension))
##       mean        sd
## 1 0.316803 0.4652301
sd(dataset_sbp$hypertension)/sqrt(50)
## [1] 0.06579348
sd(dataset_sbp$hypertension)/sqrt(150)
## [1] 0.03798588
sd(dataset_sbp$hypertension)/sqrt(500)
## [1] 0.02080572
sd(dataset_sbp$hypertension)
## [1] 0.4652301
set.seed(8)
dataset_sbp$hypertension %>% sample(100) %>% sd
## [1] 0.4943111
set.seed(1)
dataset_sbp$hypertension %>% sample(100) %>% sd 
## [1] 0.4648232
set.seed(5)
dataset_sbp$hypertension %>% sample(100) %>% sd 
## [1] 0.4512609

Bibliography

## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.
## Springer-Verlag New York, 2016.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. R package version 2.4.5, <https://CRAN.R-project.org/package=swirl>.