COD_week3_1_MGK_BTE3207

Minsik Kim

2024-09-15

Basic visualization of data

We are going to visualize this example dataset, SBP (systolic blood pressure) data from

https://nhiss.nhis.or.kr/bd/ab/bdabf003cv.do.

dataset_sbp <- read.csv(
        file = "Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv") 
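After loading, it is worth checking that the file was read as expected. The sketch below uses a tiny inline table in place of the real file, so it runs anywhere. The values are invented for illustration; only the column names SEX, SBP, and FBS are taken from this document.

```r
# Toy stand-in for the real CSV; the values are invented for illustration.
toy_csv <- "SEX,SBP,FBS
1,120,95
2,110,88
1,135,102
2,118,91"

# read.csv() can read from a string via text = instead of file =
toy_sbp <- read.csv(text = toy_csv)

str(toy_sbp)    # column names and types
head(toy_sbp)   # first rows
nrow(toy_sbp)   # number of rows (subjects)
```

The same three calls can be applied to dataset_sbp once the real file has been read.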

Summary statistics of data

We can use the mean(), sd(), and median() functions to calculate summary statistics.

cat("#Mean of SBP of 1M subject\n",
      mean(dataset_sbp$SBP),
      "\n\n#Standard deviation of SBP of 1M subject\n",
      sd(dataset_sbp$SBP),
      "\n\n#Median of SBP of 1M subject\n",
      median(dataset_sbp$SBP)
      )
## #Mean of SBP of 1M subject
##  121.8718 
## 
## #Standard deviation of SBP of 1M subject
##  14.56171 
## 
## #Median of SBP of 1M subject
##  120

The cat() function writes text to the console. \n inserts a line break.

Alternatively, print() can be used, but print() shows the value the way R stores it: with an index such as [1], quotes around character strings, and \n left unexpanded. cat() simply writes the text to the console.
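The difference is easiest to see side by side. This is a minimal sketch; the variable names msg, out_print, and out_cat are made up for the example.

```r
msg <- "Mean of SBP\n"

# cat() writes the text itself; "\n" becomes a real line break
cat(msg)

# print() shows the R representation: index prefix, quotes, and a literal \n
print(msg)

# print() hands its argument back (invisibly); cat() returns NULL
out_print <- print(msg)
out_cat   <- cat(msg)
```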

summary()

Instead of calling each function separately, we can use a convenient R function called summary(), which reports several of these statistics at once.

summary(dataset_sbp$SBP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    82.0   110.0   120.0   121.9   130.0   190.0
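To see where these numbers come from, each piece of the summary() output can be computed separately. The sketch below uses a small made-up vector rather than the real data.

```r
# Small made-up vector, used only for illustration
x <- c(100, 110, 120, 130, 190)

# The pieces that summary() reports, computed one by one
min(x)
quantile(x, probs = c(0.25, 0.5, 0.75))  # 1st quartile, median, 3rd quartile
mean(x)
max(x)

# summary() returns a named object, so single values can be extracted
s <- summary(x)
s["Median"]
```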

Voilà! Now we can roughly see what the data look like. However, it is more straightforward to look at the data as a figure.

hist()

The hist() function creates a histogram in R. It has multiple arguments for making the output more informative. For example,

hist(dataset_sbp$SBP)

hist() - breaks

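hist() has an argument called breaks =, which sets the number of bars in the histogram. When a single number is given, R treats it as a suggestion and chooses round break points, so the actual number of bars can differ slightly.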

hist(dataset_sbp$SBP,
     breaks = 5)

hist(dataset_sbp$SBP,
     breaks = 10)

As the two plots show, changing the breaks = argument of hist() changes the number of bars.
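When an exact number of bars is needed, breaks = also accepts a vector of cut points. The sketch below contrasts the two forms on simulated values (a stand-in, not the real SBP data) and uses plot = FALSE so that only the computed bins are inspected.

```r
set.seed(1)
# Simulated stand-in for SBP (not the real data)
x <- rnorm(1000, mean = 120, sd = 15)

# A single number is only a suggestion: R picks "pretty" break points
h_suggested <- hist(x, breaks = 5, plot = FALSE)
length(h_suggested$counts)   # number of bars R actually chose

# A vector of cut points is used exactly as given
cuts <- seq(from = floor(min(x)), to = ceiling(max(x)), length.out = 6)
h_exact <- hist(x, breaks = cuts, plot = FALSE)
length(h_exact$counts)       # 6 cut points give 5 bars
```

The cut points must cover the full range of the data, which is why the sketch builds them from floor(min(x)) and ceiling(max(x)).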

hist() - continued

By assigning main = (title of histogram), xlab = (x-axis label) and ylab = (y-axis label), we can create a histogram with more detailed information.

hist(dataset_sbp$SBP, 
     breaks = 10,
     main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
     xlab = "SBP (mmHg) of 1M Koreans",
     ylab = "Number of measurements"
     )

hist() - percentage

By overwriting the density component of the histogram object, we can draw a histogram whose y-axis shows percentages.

h <- hist(dataset_sbp$SBP,
         breaks = 10, 
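         plot = FALSE)  # compute the bins without drawing

# Replace the density with the percentage of observations in each bin
h$density <- h$counts/sum(h$counts)*100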

plot(h,
     freq = FALSE,
     main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
     xlab = "SBP (mmHg) of 1M Koreans",
     ylab = "Percentage of subjects"
     )
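A quick check helps confirm what the rescaling does. By default, the density values are scaled so that the bar areas sum to 1; after the rescaling, the bar heights sum to 100. The sketch below verifies both on simulated values (not the real data); the name pct is made up for the example.

```r
set.seed(1)
x <- rnorm(1000, mean = 120, sd = 15)   # simulated stand-in, not the real data

h <- hist(x, breaks = 10, plot = FALSE)

# Default density: bar areas (height x width) sum to 1
sum(h$density * diff(h$breaks))

# Percentage of observations per bin: heights sum to 100
pct <- h$counts / sum(h$counts) * 100
sum(pct)
```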

Boxplot

However, what should we do if we want to show summary statistics as a figure? Statisticians commonly use boxplots. Boxplots can be generated with the boxplot() function, together with information such as which variable goes on the x-axis or which colors to use.

boxplot() - simple

We can directly pass the variable (vector) of interest as x.

boxplot(x = dataset_sbp$SBP)

By adding arguments, we can adjust the visualization as we want.

boxplot(x = dataset_sbp$SBP,
        main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
        ylab = "SBP (mmHg)"
        )
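The numbers behind a boxplot can be inspected with boxplot.stats(), which returns the values used for the whiskers, the box, and the outlier points. The sketch below uses a small made-up vector with one extreme value.

```r
# Small made-up vector with one extreme value, for illustration only
x <- c(100, 110, 115, 120, 125, 130, 190)

b <- boxplot.stats(x)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points drawn individually as outliers
```

With the default settings, a point is treated as an outlier when it lies more than 1.5 times the box length beyond either hinge.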

boxplot() - with x axis

Boxplots are useful for comparing groups, because more information can be added along the x-axis. To make a more informative boxplot, use formula = y ~ x. Here, y is the variable on the y-axis and x is the variable on the x-axis.

Remember that x should be categorical (discrete) data.

boxplot(formula = SBP ~ SEX,
        data = dataset_sbp)

Again, with more arguments,

boxplot(formula = SBP ~ SEX,
        data = dataset_sbp,
        main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
        ylab = "SBP (mmHg)",
        xlab = "Gender (male: 1, female: 2)"
        )

boxplot() - x-axis label

We can also change the x-axis tick labels with the names = argument.

boxplot(SBP ~ SEX,
        data = dataset_sbp,
        main = "Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014",
        ylab = "SBP (mmHg)",
        xlab = "Gender",
        names = c("Male",
                  "Female")
        )
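An alternative to relabelling each plot is to attach the labels to the data once by converting the codes to a factor. This is a sketch on toy data; the data frame toy and the column SEX_label are hypothetical names, and the coding 1 = male, 2 = female follows the axis label used above.

```r
# Toy data for illustration; the SBP values are invented
toy <- data.frame(
  SEX = c(1, 2, 1, 2, 1, 2),
  SBP = c(125, 115, 130, 110, 120, 118)
)

# Attach labels to the codes once, instead of relabelling every plot
toy$SEX_label <- factor(toy$SEX, levels = c(1, 2), labels = c("Male", "Female"))

# Median SBP per group: the centre line of each box
group_median <- tapply(toy$SBP, toy$SEX_label, median)
group_median
```

Calling boxplot(SBP ~ SEX_label, data = toy) afterwards would show "Male" and "Female" on the x-axis without the names = argument.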

Question

With many breaks, the histogram of SBP seems to have multiple peaks. Can you tell why?

hist(dataset_sbp$SBP,
     breaks = 500, 
     main = "Histogram of Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014 with 500 breaks",
     xlab = "SBP (mmHg) of 1M Koreans",
     ylab = "Number of measurements"
     )

Common distributions

Symmetric & bell-shaped

hist(dataset_sbp$SBP,
     breaks = 10, 
     main = "Histogram of Systolic Blood Pressure (SBP)\nof 1M Koreans in 2013-2014 with 10 breaks",
     xlab = "SBP (mmHg) of 1M Koreans",
     ylab = "Number of measurements"
     )

Right (positively) skewed

hist(dataset_sbp$FBS,
     breaks = 10, 
     main = "Fasting Blood Sugar (FBS) levels\nof 1M Koreans in 2013-2014",
     xlab = "FBS (mg/dL) of 1M Koreans",
     ylab = "Number of measurements"
     )

Left (negatively) skewed

hist(100-dataset_sbp$FBS,
     breaks = 10, 
     main = "100 - FBS",
     xlab = "100 - FBS",
     ylab = "Number of measurements"
     )
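Skewness also shows up in the summary statistics: a long tail pulls the mean toward it, while the median barely moves. The sketch below illustrates this with simulated exponential values (not the real FBS data) and uses the same mirroring trick as above to obtain a left-skewed version.

```r
set.seed(1)
# Simulated right-skewed values (exponential), not the real FBS data
right_skewed <- rexp(10000, rate = 1)
left_skewed  <- -right_skewed          # mirror image: left-skewed

# Right skew: the long right tail pulls the mean above the median
mean(right_skewed) > median(right_skewed)

# Left skew: the long left tail pulls the mean below the median
mean(left_skewed) < median(left_skewed)
```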

Uniform distributions

hist(dataset_sbp$BTH_G,
     breaks = 30, 
     main = "Age of cohort",
     xlab = "Age (years old)",
     ylab = "Number of measurements"
     )
