Let’s load the SBP dataset.
dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")
head(dataset_sbp)
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 116 78 94 4 16.6
## 2 1 1 100 60 79 4 22.3
## 3 1 1 100 60 87 4 21.9
## 4 1 1 111 70 72 4 20.2
## 5 1 1 120 80 98 4 20.0
## 6 1 1 115 79 95 4 23.1
set.seed(1)
sim_100_5 <- rnorm(1000, mean = 100, sd = 5) # simulate 1000 values from a normal distribution with mean 100 and SD 5
hist(sim_100_5, 100, probability = TRUE) # what does it look like?
curve(dnorm(x, mean=100, sd=5),
col="darkblue", lwd=2, add=TRUE, yaxt="n")
mean(sim_100_5)
## [1] 99.94176
sd(sim_100_5)
## [1] 5.174579
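As a quick check of the simulation above, the sketch below compares the share of simulated values within 1 and 2 SD of the mean against the shares expected under a normal distribution. The variable names (`sim`, `within_1sd`, and so on) are chosen here for illustration and are not part of the original script.

```r
# Re-create the simulated data with the same seed and parameters as above
set.seed(1)
sim <- rnorm(1000, mean = 100, sd = 5)

# Empirical share of values within 1 SD and 2 SD of the true mean
within_1sd <- mean(abs(sim - 100) <= 5)
within_2sd <- mean(abs(sim - 100) <= 10)

# Theoretical shares under the normal distribution (about 0.68 and 0.95)
theory_1sd <- pnorm(1) - pnorm(-1)
theory_2sd <- pnorm(2) - pnorm(-2)

# Side-by-side comparison
data.frame(range = c("1 SD", "2 SD"),
           empirical = c(within_1sd, within_2sd),
           theoretical = c(theory_1sd, theory_2sd))
```

With 1000 draws, the empirical shares should land close to, but not exactly on, the theoretical values.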
Let’s compare data collected from real people (a random sample of 10,000 Koreans) with the normal distribution.
set.seed(1)
sbp_10000 <- sample(dataset_sbp$SBP, 10000) %>%
data.frame()
hist_sbp_10000 <- ggplot(data = sbp_10000) +
geom_histogram(aes(x = .,
y =..density..),
bins = 15,
fill = "white",
col = "black") +
ggtitle("Histogram of SBP of 10000 Koreans") +
ylab("Probability") +
xlab("SBP (mmHg)")
hist_sbp_10000
hist_sbp_10000 +
stat_function(fun = dnorm, args = list(mean = mean(sbp_10000$.), sd = sd(sbp_10000$.)))
sbp_10000$. %>% mean
## [1] 121.7429
sbp_10000$. %>% sd
## [1] 14.68246
sbp_10000$. %>% median
## [1] 120
How can we describe where a person with an SBP of 131 mmHg sits within the distribution? We usually calculate a z-score, which expresses how many standard deviations a value lies from the mean, to make the location of a single observation easier to interpret.
z_sbp_10000 <- (131-mean(sbp_10000$.))/sd(sbp_10000$.)
z_sbp_10000
## [1] 0.6304871
Here, the z-score of a person with SBP of 131 is 0.63.
Using this information, we can calculate his percentile (assuming that SBP follows a normal distribution).
pnorm(z_sbp_10000)
## [1] 0.735812
His percentile is 73.6%.
In other words,
about 26.4% of samples have an SBP greater than 131
about 73.6% of samples have an SBP smaller than 131
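The z-score step can be checked by hand. The sketch below plugs in the sample mean and SD printed above (121.7429 and 14.68246) as fixed numbers, so it runs without the dataset; the variable names are illustrative only.

```r
# Sample mean and SD copied from the output above, and the SBP of interest
m <- 121.7429
s <- 14.68246
x <- 131

# z-score: distance from the mean in units of SD
z <- (x - m) / s

# Percentile from the standard normal distribution
p_from_z <- pnorm(z)

# The same percentile without standardizing first
p_direct <- pnorm(x, mean = m, sd = s)

# Share of the distribution above x
p_upper <- 1 - p_from_z

c(z = z, p_from_z = p_from_z, p_direct = p_direct, p_upper = p_upper)
```

Both routes give the same probability, because `pnorm(x, mean, sd)` standardizes internally.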
Meanwhile, the actual 73.6th percentile of our dataset (sbp_10000) was 130, not 131. Why?
quantile(sbp_10000$., 0.736)
## 73.6%
## 130
#Normalization
Min-max normalization rescales the data so that the minimum value becomes 0 and the maximum value becomes 1.
dataset_sbp$norm_SBP <- (dataset_sbp$SBP - min(dataset_sbp$SBP))/
(max(dataset_sbp$SBP) - min(dataset_sbp$SBP))
ggplot(data = dataset_sbp) +
geom_histogram(aes(x = SBP,
y = ..count..),
binwidth = 1,
fill = "white",
col = "black") +
ggtitle("Histogram of SBP of 10000 Koreans") +
ylab("Count") +
theme_classic(base_size = 20, base_family = "serif") +
xlab("SBP (mmHg)")
ggplot(data = dataset_sbp) +
geom_histogram(aes(x = norm_SBP,
y = ..count..),
binwidth = 0.01,
fill = "white",
col = "black") +
ggtitle("Histogram of Min-Max Normalized\nSBP of 10000 Koreans") +
ylab("Count") +
theme_classic(base_size = 20, base_family = "serif") +
xlab("Normalized SBP (mmHg)")
# stat_function(fun = dnorm, args = list(mean = mean(dataset_sbp$norm_SBP),
# sd = sd(dataset_sbp$norm_SBP)))
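To see what min-max normalization does on a scale small enough to check by eye, here is a minimal sketch on a made-up vector. The helper `min_max()` and the values in `sbp_toy` are hypothetical and are not taken from the dataset.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Made-up SBP values for illustration
sbp_toy <- c(100, 110, 120, 130, 140)

norm_toy <- min_max(sbp_toy)
norm_toy        # 0.00 0.25 0.50 0.75 1.00
range(norm_toy) # 0 1
```

The shape of the distribution is unchanged; only the axis is rescaled, so the normalized values no longer carry the unit mmHg.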
Standardization rescales the data so that the mean becomes 0 and the SD becomes 1.
dataset_sbp$std_SBP <- (dataset_sbp$SBP - mean(dataset_sbp$SBP))/
sd(dataset_sbp$SBP)
ggplot(data = dataset_sbp) +
geom_histogram(aes(x = std_SBP,
y = ..count..),
binwidth = 0.3,
fill = "white",
col = "black") +
ggtitle("Histogram of standardized\nSBP of 10000 Koreans") +
ylab("Count") +
xlab("Standardized SBP (mmHg)") +
theme_classic(base_family = "serif", base_size = 20)
#stat_function(fun = dnorm, args = list(mean = mean(dataset_sbp$std_SBP), sd = sd(dataset_sbp$std_SBP)))
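The same calculation can be verified on a small made-up vector, and compared with base R's `scale()`, which centers and scales using the sample SD by default. The vector `sbp_toy` is hypothetical and used only for illustration.

```r
# Made-up SBP values for illustration
sbp_toy <- c(100, 110, 120, 130, 140)

# Manual standardization, as in the code above
std_manual <- (sbp_toy - mean(sbp_toy)) / sd(sbp_toy)

# Same result with base R's scale(), converted from a matrix to a plain vector
std_scale <- as.numeric(scale(sbp_toy))

mean(std_manual) # 0, up to floating-point error
sd(std_manual)   # 1
all.equal(std_manual, std_scale)
```

Each standardized value is the z-score of the corresponding observation.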
We can calculate z-scores for skewed data, but the calculation relies on the assumption that “the data is approximately normal”. When that assumption does not hold, the z-score leads to wrong predictions.
dataset_sbp$FBS %>%
hist(probability = T, main = "Histogram of blood sugar level (FBS)", 15)
dataset_sbp$FBS %>% mean
## [1] 98.86443
dataset_sbp$FBS %>% sd
## [1] 22.9813
dataset_sbp$FBS %>% median
## [1] 94
So, if we use the z-score to guess the FBS level at the 2.5th percentile (roughly the mean minus 2 SD),
(dataset_sbp$FBS %>% mean) - 2 * (dataset_sbp$FBS %>% sd)
## [1] 52.90183
However, in the actual dataset,
quantile(dataset_sbp$FBS, 0.025)
## 2.5%
## 74
The z-score cannot represent the actual dataset, because FBS is skewed rather than approximately normal.
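The same mismatch can be reproduced without the dataset. The sketch below simulates right-skewed values from a log-normal distribution; this is an assumption made purely for illustration (with arbitrary parameters) and is not a model of the FBS data.

```r
# Simulated right-skewed data (log-normal, arbitrary parameters)
set.seed(1)
skewed <- rlnorm(10000, meanlog = 4.5, sdlog = 0.5)

# 2.5th percentile predicted by a normal model with the sample mean and SD
# (approximately mean - 1.96 * SD)
normal_guess <- qnorm(0.025, mean = mean(skewed), sd = sd(skewed))

# 2.5th percentile actually observed in the data
actual_q <- quantile(skewed, 0.025)

c(normal_guess = normal_guess, actual = unname(actual_q))
```

For right-skewed data the long upper tail inflates the SD, so the normal-based guess for a low percentile falls well below the observed value, just as it did for FBS.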