Before we begin

Let’s load the SBP dataset.

library(tidyverse) # provides %>% and ggplot2, both used below

dataset_sbp <- read.csv(file = "Inha/5_Lectures/2024/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")

head(dataset_sbp)
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 116  78  94   4 16.6
## 2   1     1 100  60  79   4 22.3
## 3   1     1 100  60  87   4 21.9
## 4   1     1 111  70  72   4 20.2
## 5   1     1 120  80  98   4 20.0
## 6   1     1 115  79  95   4 23.1

The normal distribution

set.seed(1)
sim_100_5 <- rnorm(1000, mean = 100, sd = 5) # simulate 1000 values with a mean of 100 and an SD of 5

hist(sim_100_5, 100, probability = TRUE) # what does it look like?
curve(dnorm(x, mean=100, sd=5), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

mean(sim_100_5) 
## [1] 99.94176
sd(sim_100_5)
## [1] 5.174579
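
As a rough check of how closely the simulated values follow the theoretical curve, the sketch below compares the share of simulated values within 1, 2, and 3 SD of the mean with the shares implied by the normal distribution. The helper name `share_within` is introduced here only for illustration.

```r
set.seed(1)
sim_100_5 <- rnorm(1000, mean = 100, sd = 5)

# Hypothetical helper: share of values within k SDs of the mean
share_within <- function(x, k, mu = 100, sigma = 5) {
        mean(abs(x - mu) <= k * sigma)
}

# Observed shares in the simulated data for k = 1, 2, 3
observed <- sapply(1:3, function(k) share_within(sim_100_5, k))

# Theoretical shares for a normal distribution (about 0.683, 0.954, 0.997)
expected <- pnorm(1:3) - pnorm(-(1:3))

round(rbind(observed, expected), 3)
```

With only 1000 draws, the observed shares should be close to, but not exactly equal to, the theoretical ones.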

normal distribution

Let’s compare some data collected from real people (10,000 randomly sampled Koreans) with the normal distribution.

set.seed(1)
sbp_10000 <- sample(dataset_sbp$SBP, 10000) %>%
        data.frame()

hist_sbp_10000 <- ggplot(data = sbp_10000) +
        geom_histogram(aes(x = .,
                           y = after_stat(density)), # replaces the deprecated ..density..
                       bins = 15,
                       fill = "white",
                       col = "black") +
        ggtitle("Histogram of SBP of 10000 Koreans") +
        ylab("Density") + # the y-axis shows density, not probability
        xlab("SBP (mmHg)")

hist_sbp_10000

hist_sbp_10000 + 
        stat_function(fun = dnorm, args = list(mean = mean(sbp_10000$.), sd = sd(sbp_10000$.)))

sbp_10000$. %>% mean
## [1] 121.7429
sbp_10000$. %>% sd
## [1] 14.68246
sbp_10000$. %>% median
## [1] 120

z-score

How can we describe where a person with an SBP of 131 mmHg falls within the distribution? We usually calculate a z-score, which makes the location of a single observation easier to understand.

z_sbp_10000 <- (131-mean(sbp_10000$.))/sd(sbp_10000$.)

z_sbp_10000
## [1] 0.6304871

Here, the z-score of a person with SBP of 131 is 0.63.

Using this information, we can calculate his percentile (assuming the SBP data follow a normal distribution).

pnorm(z_sbp_10000)
## [1] 0.735812

His percentile is 73.6%.

In other words,

  1. About 26.4% of samples have an SBP greater than 131

  2. About 73.6% of samples have an SBP smaller than 131
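
The z-score and percentile steps can be sketched as a small reusable function. This is a minimal illustration: the mean and SD are copied from the output printed earlier, and `z_score` is a hypothetical helper name, not part of the original script.

```r
# Summary statistics copied from the printed output for sbp_10000
mean_sbp <- 121.7429
sd_sbp <- 14.68246

# Hypothetical helper for the z-score formula: (value - mean) / SD
z_score <- function(x, mean_x, sd_x) {
        (x - mean_x) / sd_x
}

z_131 <- z_score(131, mean_sbp, sd_sbp)

below <- pnorm(z_131)                     # share expected below 131 under normality
above <- pnorm(z_131, lower.tail = FALSE) # share expected above 131 under normality

# pnorm() can also take the mean and SD directly, skipping the explicit z-score
same_below <- pnorm(131, mean = mean_sbp, sd = sd_sbp)

round(c(z = z_131, below = below, above = above), 3)
```

The two shares always add up to 1, so either one is enough to describe the person's position.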

Meanwhile, the actual 73.6th percentile of our dataset (sbp_10000) was 130. Why?

quantile(sbp_10000$., 0.736)
## 73.6% 
##   130
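
One way to explore this question is to compare the empirical percentile of a value, obtained with `ecdf()`, against the percentile predicted by `pnorm()`. The sketch below uses simulated data as a hypothetical stand-in for SBP: normal draws rounded to whole mmHg, on the assumption that integer recording creates many tied values. It is not the real dataset.

```r
set.seed(1)
# Hypothetical stand-in for SBP: normal draws rounded to whole mmHg
sbp_sim <- round(rnorm(10000, mean = 122, sd = 15))

# Empirical percentile: share of observed values at or below 130
emp_cdf <- ecdf(sbp_sim)
emp_130 <- emp_cdf(130)

# Normal-theory percentile using the sample mean and SD
norm_130 <- pnorm(130, mean = mean(sbp_sim), sd = sd(sbp_sim))

c(empirical = emp_130, normal = norm_130)
```

Even with data generated from a normal distribution, the two numbers need not agree exactly once values are rounded and tied; real data add departures from normality on top of that.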

Normalization

Min-max normalization rescales the data so that the minimum value becomes 0 and the maximum value becomes 1.

dataset_sbp$norm_SBP <- (dataset_sbp$SBP - min(dataset_sbp$SBP))/
        (max(dataset_sbp$SBP) - min(dataset_sbp$SBP))
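
The same formula can be checked on a handful of values. This is a minimal sketch: `min_max` is a hypothetical helper name, and the five numbers are simply the first few distinct SBP values shown in the `head()` output.

```r
# Hypothetical helper implementing (x - min) / (max - min)
min_max <- function(x) {
        (x - min(x)) / (max(x) - min(x))
}

sbp_example <- c(100, 111, 115, 116, 120)
norm_example <- min_max(sbp_example)

norm_example
## [1] 0.00 0.55 0.75 0.80 1.00
range(norm_example) # the smallest value maps to 0 and the largest to 1
## [1] 0 1
```

Note that the result depends entirely on the observed minimum and maximum, so a single extreme value compresses all the others.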


ggplot(data = dataset_sbp) +
        geom_histogram(aes(x = SBP,
                           y = after_stat(count)), # replaces the deprecated ..count..
                       binwidth = 1, # one bin per mmHg; 0.01 only suits the 0-1 normalized scale
                       fill = "white",
                       col = "black") +
        ggtitle("Histogram of SBP of 10000 Koreans") +
        ylab("Count") +
        theme_classic(base_size = 20, base_family = "serif") +
        xlab("SBP (mmHg)")

ggplot(data = dataset_sbp) +
        geom_histogram(aes(x = norm_SBP,
                           y = after_stat(count)), # replaces the deprecated ..count..
                       binwidth = 0.01,
                       fill = "white",
                       col = "black") +
        ggtitle("Histogram of Min-Max Normalized\nSBP of 10000 Koreans") +
        ylab("Count") +
        theme_classic(base_size = 20, base_family = "serif") +
        xlab("Normalized SBP (unitless, 0 to 1)") # normalized values no longer carry mmHg

#        stat_function(fun = dnorm, args = list(mean = mean(dataset_sbp$norm_SBP),
#                                               sd = sd(dataset_sbp$norm_SBP)))

Standardization

Standardization rescales the data so that the mean becomes 0 and the SD becomes 1.

dataset_sbp$std_SBP <- (dataset_sbp$SBP - mean(dataset_sbp$SBP))/
        sd(dataset_sbp$SBP)
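
As a small check, the manual formula can be compared with base R's `scale()`, which centers and scales a vector in the same way. This is a sketch on five example values taken from the `head()` output, not on the full dataset.

```r
sbp_example <- c(100, 111, 115, 116, 120)

# Manual standardization: subtract the mean, divide by the SD
std_manual <- (sbp_example - mean(sbp_example)) / sd(sbp_example)

# scale() returns a one-column matrix, so convert it back to a plain vector
std_scale <- as.numeric(scale(sbp_example))

all.equal(std_manual, std_scale) # both approaches agree
mean(std_manual)                 # effectively 0 (up to floating-point error)
sd(std_manual)                   # 1
```

Each standardized value is the z-score of the original observation, so this step ties directly back to the z-score section.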

ggplot(data = dataset_sbp) +
        geom_histogram(aes(x = std_SBP,
                           y = after_stat(count)), # replaces the deprecated ..count..
                       binwidth = 0.3,
                       fill = "white",
                       col = "black") +
        ggtitle("Histogram of standardized\nSBP of 10000 Koreans") +
        ylab("Count") + # the y-axis shows counts, not probabilities
        xlab("Standardized SBP (z-score)") + # standardized values no longer carry mmHg
        theme_classic(base_family = "serif", base_size = 20) 

        #stat_function(fun = dnorm, args = list(mean = mean(dataset_sbp$std_SBP), sd = sd(dataset_sbp$std_SBP)))

Skewed data and z-score

We can calculate z-scores for skewed data, but turning them into percentiles relies on the assumption that “the data is approximately normal”. When that assumption does not hold, the resulting predictions are wrong.

dataset_sbp$FBS %>%
        hist(breaks = 15, probability = TRUE, main = "Histogram of blood sugar level (FBS)")

dataset_sbp$FBS %>% mean
## [1] 98.86443
dataset_sbp$FBS %>% sd
## [1] 22.9813
dataset_sbp$FBS %>% median
## [1] 94

So, if we use the z-score to guess the FBS level at the 2.5th percentile (roughly the mean minus 2 SD),

(dataset_sbp$FBS %>% mean) - 2 * (dataset_sbp$FBS %>% sd)
## [1] 52.90183

However, in the actual dataset,

quantile(dataset_sbp$FBS, 0.025)
## 2.5% 
##   74

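The z-score cannot represent the actual dataset here: FBS is right-skewed (its mean, 98.9, is higher than its median, 94), so the normal approximation places the 2.5th percentile far too low.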
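
The same mismatch can be reproduced with simulated data. The sketch below uses a log-normal sample as a hypothetical right-skewed stand-in for FBS; the parameters are illustrative assumptions, not estimates from the real dataset.

```r
set.seed(1)
# Hypothetical right-skewed stand-in for FBS (log-normal), not the real data
fbs_sim <- rlnorm(10000, meanlog = log(94), sdlog = 0.2)

# Normal-theory guess for the 2.5th percentile: mean minus 2 SD
normal_guess <- mean(fbs_sim) - 2 * sd(fbs_sim)

# Empirical 2.5th percentile, read directly from the data
empirical <- unname(quantile(fbs_sim, 0.025))

c(mean = mean(fbs_sim), median = median(fbs_sim),
  normal_guess = normal_guess, empirical = empirical)
```

For right-skewed data like this, the mean sits above the median and the normal-theory guess falls below the empirical percentile, so reading percentiles with `quantile()` is the safer choice.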
