Basic visualization of data
We are going to visualize this example dataset, SBP (systolic blood pressure) data from
https://nhiss.nhis.or.kr/bd/ab/bdabf003cv.do.
dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")
Summary statistics of data
We can use the mean(), sd(), and median() functions to calculate summary statistics.
cat("#Mean of SBP of 1M subject\n",
mean(dataset_sbp$SBP),
"\n\n#Standard deviation of SBP of 1M subject\n",
sd(dataset_sbp$SBP),
"\n\n#Median of SBP of 1M subject\n",
median(dataset_sbp$SBP)
)
## #Mean of SBP of 1M subject
## 121.8718
##
## #Standard deviation of SBP of 1M subject
## 14.56171
##
## #Median of SBP of 1M subject
## 120
The cat() function prints character strings as they are, and \n starts a new line in the console.
Alternatively, print() can be used, but print() shows a value the way R stores it as data (for example, with a [1] index and quotes around strings), whereas cat() just writes lines to the console.
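The difference between the two functions can be seen with a short sketch. The string `msg` is a made-up example, not part of the lecture script.

```r
# Compare how cat() and print() display the same character value.
msg <- "Mean of SBP\n"   # hypothetical example string

cat(msg)     # writes the text itself; \n becomes a line break
print(msg)   # shows the R representation, with [1] and quotes

# cat() returns NULL, whereas print() returns its argument (invisibly),
# so only the value returned by print() can be reused afterwards.
cat_value   <- cat(msg)
print_value <- print(msg)

is.null(cat_value)            # TRUE
identical(print_value, msg)   # TRUE
```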
summary()
Instead, R has a convenient function called summary().
summary(dataset_sbp$SBP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.0 110.0 120.0 121.9 130.0 190.0
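To see what summary() reports, the sketch below computes the same six numbers one at a time. The vector `sbp_toy` holds made-up values, not values from the SBP dataset.

```r
# Hypothetical toy values (not taken from the SBP dataset).
sbp_toy <- c(100, 110, 120, 130, 190)

s <- summary(sbp_toy)
s

# The same numbers, computed individually:
min(sbp_toy)                       # Min.
quantile(sbp_toy, probs = 0.25)    # 1st Qu.
median(sbp_toy)                    # Median
mean(sbp_toy)                      # Mean
quantile(sbp_toy, probs = 0.75)    # 3rd Qu.
max(sbp_toy)                       # Max.
```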
Voilà! Now we can roughly see what the data look like. However, it is more straightforward to look at the data as a figure.
hist()
The hist() function creates a histogram in R. It has multiple arguments for making the output more informative. For example,
hist(dataset_sbp$SBP)
hist() - breaks
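hist() has an argument called breaks =, which can manually set the number of bars in the histogram. When breaks is a single number, R treats it as a suggestion and chooses rounded break points close to that number.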
hist(dataset_sbp$SBP,
breaks = 5)
hist(dataset_sbp$SBP,
breaks = 10)
We can change the number of bars by setting the breaks = argument of the hist() function.
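The effect of breaks = can be checked without drawing anything by using plot = FALSE. This is a minimal sketch on simulated values (`sbp_sim` is a hypothetical stand-in for the SBP column). It also shows that a vector of break points, unlike a single number, is used exactly as given.

```r
# Simulated stand-in for the SBP column (hypothetical values, not the real data).
set.seed(1)
sbp_sim <- round(rnorm(1000, mean = 120, sd = 15))

# A single number is only a suggestion: R picks rounded break points near it.
h_suggested <- hist(sbp_sim, breaks = 5, plot = FALSE)
h_suggested$breaks
length(h_suggested$counts)   # number of bars; may differ from 5

# A vector of break points is used exactly as given.
# The breaks are built from the data so that they cover its whole range.
exact_breaks <- seq(from = 20 * floor(min(sbp_sim) / 20),
                    to   = 20 * ceiling(max(sbp_sim) / 20),
                    by   = 20)
h_exact <- hist(sbp_sim, breaks = exact_breaks, plot = FALSE)
length(h_exact$counts)       # always length(exact_breaks) - 1
```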
hist() - continued
By assigning main = (title of the histogram), xlab = (x-axis label), and ylab = (y-axis label), we can create a histogram with more detailed information.
hist(dataset_sbp$SBP,
breaks = 10,
main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
xlab = "SBP (mmHg) of 1M Koreans",
ylab = "Number of measurements"
)
hist() - percentage
By overwriting the density element of the histogram object, we can create a histogram whose y-axis shows percentages.
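# Compute the histogram without drawing it
h <- hist(dataset_sbp$SBP,
          breaks = 10,
          plot = FALSE)
# Replace the density with the percentage of observations in each bar
h$density <- h$counts/sum(h$counts)*100
# freq = FALSE makes plot() draw h$density on the y-axis
plot(h,
     freq = FALSE,
     main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
     xlab = "SBP (mmHg) of 1M Koreans",
     ylab = "Percentage of subjects"
)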
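As a sanity check on the percentage calculation, the sketch below confirms that the bar heights add up to 100. It uses simulated values (`sbp_sim` is hypothetical, not the real SBP column) and skips the drawing step.

```r
# Simulated stand-in for the SBP column (hypothetical values, not the real data).
set.seed(1)
sbp_sim <- round(rnorm(1000, mean = 120, sd = 15))

h <- hist(sbp_sim, breaks = 10, plot = FALSE)

# By default, density is scaled so that the bar areas add up to 1.
default_area <- sum(h$density * diff(h$breaks))
default_area

# Percentage of observations in each bar, computed as in the text above.
pct <- h$counts / sum(h$counts) * 100
sum(pct)   # the percentages add up to 100
```

Because the default density depends on the bar width, it is not a percentage by itself, which is why the counts are rescaled by hand.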
Boxplot
However, what should we do if we want to show summary statistics as a figure? Statisticians often use boxplots. Boxplots can be generated with the boxplot() command, together with information about what goes on the x-axis or which colors to use.
boxplot() - simple
We can directly pass the variable (vector) of interest as x.
boxplot(x = dataset_sbp$SBP,
main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
ylab = "SBP (mmHg)"
)
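The numbers behind a boxplot can be inspected by calling boxplot() with plot = FALSE, which returns the statistics instead of drawing them. The values in `sbp_toy` are made up for illustration.

```r
# Hypothetical toy values (not taken from the SBP dataset).
sbp_toy <- c(100, 105, 110, 115, 120, 125, 130, 135, 190)

b <- boxplot(sbp_toy, plot = FALSE)

# Lower whisker, lower hinge (box bottom), median, upper hinge (box top),
# upper whisker
b$stats

# Points drawn individually because they lie more than 1.5 box lengths
# beyond the box
b$out
```

Here the box spans 110 to 130, so 190 is drawn as a separate point and the upper whisker stops at 135.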
boxplot() - with x axis
A boxplot is useful for comparing data, because more information can be added along the x-axis. To make a grouped boxplot, use formula = y ~ x. Here, y is the variable on the y-axis and x is the grouping variable on the x-axis.
Remember that x should be categorical (discrete) data.
boxplot(formula = SBP ~ SEX,
data = dataset_sbp,
main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
ylab = "SBP (mmHg)",
xlab = "Gender (male: 1, female: 2)"
)
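The formula SBP ~ SEX splits the y values by the categories of x. The same split can be used to compute a number per group, sketched here with tapply() on a made-up data frame (`toy` is hypothetical and only mimics the column names and coding of the dataset).

```r
# Hypothetical toy data using the same coding as above (1 = male, 2 = female).
toy <- data.frame(
  SBP = c(118, 125, 131, 122, 110, 115, 121, 108),
  SEX = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Median SBP within each category of SEX; these are the center lines
# of the two boxes that boxplot(SBP ~ SEX, data = toy) would draw.
group_medians <- tapply(toy$SBP, toy$SEX, median)
group_medians
```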
boxplot() - x-axis label
We can also change the x-axis text with the names = argument.
boxplot(SBP ~ SEX,
data = dataset_sbp,
main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
ylab = "SBP (mmHg)",
xlab = "Gender",
names = c("Male",
"Female")
)
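An alternative to names = is to recode the grouping variable as a labelled factor, so that the labels travel with the data. This is a sketch on a made-up data frame; `toy` and `SEX_label` are hypothetical names, and the coding (1 = male, 2 = female) is taken from the axis label used above.

```r
# Hypothetical toy data using the same coding as above (1 = male, 2 = female).
toy <- data.frame(
  SBP = c(118, 125, 131, 122, 110, 115, 121, 108),
  SEX = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Recode SEX as a factor with readable labels.
toy$SEX_label <- factor(toy$SEX,
                        levels = c(1, 2),
                        labels = c("Male", "Female"))

# plot = FALSE returns the boxplot statistics instead of drawing them.
b <- boxplot(SBP ~ SEX_label, data = toy, plot = FALSE)
b$names   # group labels that would appear on the x-axis
b$n       # number of observations in each group
```

With names =, the labels are matched to groups by position only, so a labelled factor can be less error-prone when the group order is not obvious.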
Question
The histogram of SBP seems to have multiple peaks. Can you tell why?
hist(dataset_sbp$SBP,
breaks = 500,
main = "Histogram of Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014 with 500 breaks",
xlab = "SBP (mmHg) of 1M Koreans",
ylab = "Number of measurements"
)
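One way to explore the question is to count how often each exact value occurs, since a histogram with very narrow bars is close to a frequency table. The sketch below shows the mechanics on a made-up vector (`sbp_toy` is hypothetical and says nothing about the real data).

```r
# Hypothetical toy values (not taken from the SBP dataset).
sbp_toy <- c(118, 120, 120, 121, 120, 130, 130, 129, 110, 120)

# How often does each exact value occur?
value_counts <- table(sbp_toy)
sort(value_counts, decreasing = TRUE)

# How often does each last digit occur?
last_digit_counts <- table(sbp_toy %% 10)
last_digit_counts
```

Running the same two table() calls on dataset_sbp$SBP may help in answering the question.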
Common distributions
Symmetric & bell-shaped
hist(dataset_sbp$SBP,
breaks = 10,
main = "Histogram of Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014 with 10 breaks",
xlab = "SBP (mmHg) of 1M Koreans",
ylab = "Number of measurements"
)
Right (positively) skewed
hist(dataset_sbp$FBS,
breaks = 10,
main = "Fasting Blood Sugar (FBS) levels\nof 1M Koreans in 2013-2014",
xlab = "FBS (mg/dL) of 1M Koreans",
ylab = "Number of measurements"
)
Left (negatively) skewed
hist(100-dataset_sbp$FBS,
breaks = 10,
main = "100 - FBS",
xlab = "100 - FBS",
ylab = "Number of measurements"
)
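The direction of skew can also be checked numerically: in a right-skewed sample the long upper tail pulls the mean above the median, and mirroring the values (as with 100 - FBS) reverses this. The sketch below uses simulated values; `right_skewed` is a hypothetical stand-in drawn from an exponential distribution, not the FBS column.

```r
# Simulated right-skewed values (hypothetical, not the FBS data).
set.seed(1)
right_skewed <- rexp(1000, rate = 1 / 100)

# Mirroring the values flips the direction of the skew.
left_skewed <- 100 - right_skewed

# Right skew: the mean is larger than the median.
c(mean = mean(right_skewed), median = median(right_skewed))

# Left skew: the mean is smaller than the median.
c(mean = mean(left_skewed), median = median(left_skewed))
```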
Uniform distributions
hist(dataset_sbp$BTH_G,
breaks = 30,
main = "Age of cohort",
xlab = "Age (years old)",
ylab = "Number of measurements"
)