COD_week2_2_MGK_BTE3207

Minsik Kim

2023-09-05

Before we begin

Basics of R

Get the current path of the R working directory

getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics"

Listing files in the current path

list.files()
##  [1] "BTE3207_Advanced_Biostatistics.Rproj"
##  [2] "COD_week1_2_MGK_BTE3207.html"        
##  [3] "COD_week1_2_MGK_BTE3207.R"           
##  [4] "COD_week1_2_MGK_BTE3207.Rmd"         
##  [5] "COD_week2_1_MGK_BTE3207.html"        
##  [6] "COD_week2_1_MGK_BTE3207.Rmd"         
##  [7] "COD_week2_2_MGK_BTE3207.html"        
##  [8] "COD_week2_2_MGK_BTE3207.Rmd"         
##  [9] "dataset"                             
## [10] "README.md"                           
## [11] "rsconnect"

Changing directory

Use the Tab key to navigate folders!

setwd("dataset/")

getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset"

Now you are in the dataset folder!

Going back to original folder

To go back up to the parent folder,

setwd("..")

getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts"

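In an interactive session, this brings you back to the original working directory. In the rendered output above, the path is one level above the project folder instead, most likely because the working directory is reset to the project folder at the start of each code chunk when the document is rendered.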
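A minimal sketch of the round trip above, assuming a throwaway folder (the name demo_folder is hypothetical) created inside R's temporary directory, so that nothing in your project is touched. Saving the starting path first means you can always return to it, however many times you move.

```r
# Remember the starting directory so we can always return to it.
original_dir <- getwd()

# Create a throwaway folder inside R's temporary directory.
# "demo_folder" is a made-up name used only for this illustration.
demo_dir <- file.path(tempdir(), "demo_folder")
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                      # move into the new folder
current_folder <- basename(getwd())  # name of the folder we are in now
print(current_folder)

setwd("..")                          # ".." means one level up (the parent folder)
setwd(original_dir)                  # jump straight back to where we started
print(identical(getwd(), original_dir))
```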

Logical values

a = 1

a
## [1] 1

For assignment, = does the same thing as <-.

To test whether two things are the same, R uses ==.

a == 1
## [1] TRUE

As we assigned a = 1 in the previous code chunk, this test results in TRUE.

"a" == 1
## [1] FALSE

This tests whether the character "a" is the same as the numeric value 1. As they are not the same, it returns FALSE.
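One caveat worth knowing, shown as a small sketch: when == compares a character with a number, R first converts the number to a character string, so the comparison can succeed even though the types differ. identical() is a stricter base R check that also compares the type.

```r
# The number 1 is converted to the string "1" before the comparison.
loose_check <- "1" == 1
print(loose_check)    # TRUE

# identical() does not convert anything, so the types must match too.
strict_check <- identical("1", 1)
print(strict_check)   # FALSE

# The result of a comparison is itself a logical value.
print(class(loose_check))   # "logical"
```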

Here, the TRUE and FALSE results are logical values, which are one type of binary variable. The comparison works for longer vectors or variables as well.

c(1, 2, 3, 4, 5) == c(1, 2, 2, 4, 5)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE
c(1, 2, 3, 4, 5) == 1
## [1]  TRUE FALSE FALSE FALSE FALSE

The result is a vector holding the outcome of every logical test. Using this, we can filter data easily!
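As a small sketch of that filtering idea, a logical vector placed inside square brackets keeps only the elements where the test is TRUE. The blood pressure values below are made up for illustration.

```r
# Hypothetical systolic blood pressure values.
sbp <- c(116, 100, 130, 139, 111)

# One logical test per element.
is_high <- sbp > 120
print(is_high)        # FALSE FALSE TRUE TRUE FALSE

# A logical vector inside [] keeps only the TRUE positions.
high_values <- sbp[is_high]
print(high_values)    # 130 139

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches.
n_high <- sum(is_high)
print(n_high)         # 2
```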

How to select values

To select values from a vector, we use [] (square brackets).

a <- c(1, 2, 3)

a[1]
## [1] 1

a[1] returns the first element of this vector.

It works slightly differently for data with names.

names(a) <- c("first", "second", "third")

str(a)
##  Named num [1:3] 1 2 3
##  - attr(*, "names")= chr [1:3] "first" "second" "third"

Now the vector a is a named numeric vector.

In this case,

a[1] 
## first 
##     1

The result includes both the name and the value!

This sometimes causes problems in calculations.

In that case, we use double brackets, [[]], which return only the value.

a[[1]]
## [1] 1
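The difference between single and double brackets can be checked directly, as in this sketch. It rebuilds the same named vector and also selects by name, which works with both bracket types.

```r
a <- c(first = 1, second = 2, third = 3)

# Single brackets keep the name attached to the value.
kept_name <- names(a[1])
print(kept_name)           # "first"

# Double brackets return the bare value, with no name.
dropped_name <- names(a[[1]])
print(dropped_name)        # NULL

# Names can be used instead of positions.
second_value <- a[["second"]]
print(second_value)        # 2

# unname() strips all names at once if they get in the way.
print(unname(a))           # 1 2 3
```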

By the way, a run of consecutive numbers can be written with a colon (:).

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
5:20
##  [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

As the colon produces a vector (multiple elements), both lines of code below work the same way.

a[c(1, 2, 3)]
##  first second  third 
##      1      2      3
a[1:3]
##  first second  third 
##      1      2      3
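A few related ways of picking several elements, sketched with base R only: negative positions drop elements, a vector of names selects by name, and seq() builds sequences with a step other than 1.

```r
a <- c(first = 1, second = 2, third = 3)

# Positions 2 to 3.
last_two <- a[2:3]

# A negative position drops that element instead of selecting it.
without_first <- a[-1]
print(identical(last_two, without_first))   # TRUE

# Several names at once.
by_name <- a[c("first", "third")]
print(by_name)

# seq() generalizes the colon: from 1 to 10 in steps of 3.
every_third <- seq(1, 10, by = 3)
print(every_third)                          # 1 4 7 10
```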

Selecting data in data frames works the same way, but rows and columns are separated with a comma. Here is one example.

dataframe_example <- data.frame(Joe = c(1:1000),
                                Trump = sample(1:1000, 100),
                                Obama = sample(1:1000, 100),
                                George = sample(1:1000, 100)
                                )

head(dataframe_example)
##   Joe Trump Obama George
## 1   1   922   208    239
## 2   2   519   909      3
## 3   3   942   214    858
## 4   4   432   137    198
## 5   5   202   724    668
## 6   6   763   400     79

This data frame has no real meaning. Each column holds 100 numbers randomly selected from 1 to 1000 (recycled to fill all 1000 rows), except Joe, which holds the ordered numbers from 1 to 1000.

To select some data, we use row and column numbers again, separating the two with a comma (,): rows come first, then columns.

Selecting the 1st row and 1st column

dataframe_example[1,1]
## [1] 1

Selecting multiple rows in column 1

dataframe_example[1:10, 1]
##  [1]  1  2  3  4  5  6  7  8  9 10

Selecting multiple rows and columns

dataframe_example[3:5, 1:2]
##   Joe Trump
## 3   3   942
## 4   4   432
## 5   5   202
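Rows and columns can also be selected by name or by a logical test, not only by number. The sketch below uses a small made-up data frame (the names scores, id and score are hypothetical) and an arbitrary seed so that the random numbers are reproducible.

```r
set.seed(3207)   # arbitrary seed, only to make sample() reproducible

# Hypothetical data frame with 10 rows.
scores <- data.frame(id = 1:10,
                     score = sample(1:100, 10))

# Row by number, column by name.
one_value <- scores[2, "score"]

# $ extracts a whole column as a vector, which can then be indexed.
first_three <- scores$score[1:3]

# A logical test in the row position keeps the matching rows.
late_rows <- scores[scores$id > 7, ]
print(late_rows)

# Leaving the row position empty selects all rows;
# drop = FALSE keeps the result as a data frame instead of a vector.
one_column <- scores[, "id", drop = FALSE]
```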

How to install more functions in R

We use the install.packages() function to install packages from the CRAN (Comprehensive R Archive Network) server.

install.packages("tidyverse")
## 
## The downloaded binary packages are in
##  /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//RtmpsYsGtn/downloaded_packages

However, the package you just installed is on your computer (somewhere in a folder called a library), but it is not loaded into R yet (simply put, you have not opened it). To use an installed package, you need to load it with the library() function.

library(tidyverse)

Now you can use the tidyverse package!

The tidyverse is a collection of packages that helps you write code and summarize results.
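Installing takes time, so it is common to install a package only when it is missing. The helper below is a hypothetical sketch (load_or_install is not a function from any package); it relies only on the base functions requireNamespace(), install.packages() and library().

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) if pkg is missing.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept a name stored in a variable.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded here.
load_or_install("stats")

# For a CRAN package you would call, for example:
# load_or_install("tidyverse")
```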

Once a package is installed, you can call a function in that package explicitly by using two colons (::).

tidyverse::tidyverse_conflicts()
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

As we loaded a new package, there can be conflicts between functions. tidyverse_conflicts() shows the list of those conflicts.

Developers work independently, and CRAN does not have a system controlling the names of newly developed functions. That means a new function from a package that you installed can have the same name as a function from another package!
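The same kind of name clash can be reproduced without installing anything. In this sketch we deliberately define our own sd(), which then hides the sd() from the stats package until we remove it. The :: form still reaches the original.

```r
# Define our own function that happens to share a name with stats::sd().
sd <- function(x) "my own sd()"

masked_result <- sd(c(1, 2, 3))            # our version is found first
explicit_result <- stats::sd(c(1, 2, 3))   # the original, called explicitly

print(masked_result)
print(explicit_result)                     # 1

# Remove our version so that sd() means stats::sd() again.
rm(sd)
```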

dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv") 

head(dplyr::filter(dataset_sbp, SEX == 1))
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 116  78  94   4 16.6
## 2   1     1 100  60  79   4 22.3
## 3   1     1 100  60  87   4 21.9
## 4   1     1 111  70  72   4 20.2
## 5   1     1 120  80  98   4 20.0
## 6   1     1 115  79  95   4 23.1
head(stats::filter(dataset_sbp$SEX, rep(1,3)))
## [1] NA  3  3  3  3  3

dplyr::filter() keeps the rows that satisfy the given condition, SEX == 1.

(It does the same thing as subset().)

However, stats::filter() (from the stats package, which comes with base R) does a different thing. It applies linear filtering to a univariate time series or to each series separately of a multivariate time series.
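To see what linear filtering means here, the sketch below applies the same weights, rep(1, 3), to a short made-up vector. With these weights each output value is the sum of an element and its two neighbours, which is why the SEX column above gives 3 (1 + 1 + 1), and why the ends, which lack a neighbour, are NA.

```r
x <- c(1, 2, 3, 4, 5)

# Three weights of 1: a centred moving sum over three neighbours.
moving_sum <- stats::filter(x, rep(1, 3))
print(as.numeric(moving_sum))   # NA 6 9 12 NA

# The same sums computed by hand for the three interior positions.
manual <- x[1:3] + x[2:4] + x[3:5]
print(manual)                   # 6 9 12
```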

The safest practice is to write every function name together with its package, using ::. But generally you don’t have to, as conflicts are not that common a problem.

Basic tidyverse

The tidyverse helps you write code efficiently. But how?

Let’s see this example. We want to filter the samples based on some condition, and we have multiple conditions.

head(filter(dataset_sbp, SEX == 1 & SBP > 120))
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 130  80  90   4 28.4
## 2   1     1 130  80  92   4 27.8
## 3   1     1 130  80  87   4 23.5
## 4   1     1 124  75  92   4 21.7
## 5   1     1 139  89  86   4 21.7
## 6   1     1 138  86  95   4 27.2

This call filtered on multiple conditions at once, keeping the rows where SEX == 1 and SBP > 120. Conditions can also be applied one at a time by nesting filter() calls:

head(filter(filter(filter(dataset_sbp, SEX == 1), SBP > 120), FBS > 110))
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 135  85 114   4 31.4
## 2   1     1 130  80 130   4 27.5
## 3   1     1 160  95 113   4 38.3
## 4   1     1 130  70 111   4 18.4
## 5   1     1 130  80 114   2 24.0
## 6   1     1 150 100 123   4 27.8

This code filtered on three conditions, SEX == 1, SBP > 120, and FBS > 110, with the head() function wrapped around the outside again. Nested calls like this must be read from the inside out.

The same thing can be done with the following code, which uses the pipe operator %>%.

dataset_sbp %>%
        filter(SEX == 1) %>%
        filter(SBP > 120) %>%
        filter(FBS > 110) %>%
        head()
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 135  85 114   4 31.4
## 2   1     1 130  80 130   4 27.5
## 3   1     1 160  95 113   4 38.3
## 4   1     1 130  70 111   4 18.4
## 5   1     1 130  80 114   2 24.0
## 6   1     1 150 100 123   4 27.8
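That the nested, piped and combined forms really give the same result can be checked on a small example. This sketch assumes the dplyr package is installed and uses a made-up data frame (toy) that borrows the column names of the lecture dataset.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the lecture dataset.
toy <- data.frame(SEX = c(1, 1, 2, 1, 2),
                  SBP = c(130, 110, 125, 140, 118),
                  FBS = c(115, 120, 112, 100, 130))

# Nested calls: read from the inside out.
nested <- filter(filter(toy, SEX == 1), SBP > 120)

# Piped calls: read from top to bottom.
piped <- toy %>%
        filter(SEX == 1) %>%
        filter(SBP > 120)

# One call with several conditions (commas act like "and").
combined <- filter(toy, SEX == 1, SBP > 120)

print(identical(nested, piped))
print(identical(piped, combined))
```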

But what if we want to do some computation first, and then summarize the data by group?

Let’s look at another example.

We can try adding multiple lines of code to do this. Let’s say we are interested in the difference between SBP and DBP. Then we want to split the data by sex, and summarize the difference within each group with its mean and standard deviation.

dataset_sbp$Diff_SBP_DBP <- dataset_sbp$SBP - dataset_sbp$DBP

dataset_sbp_male <- filter(dataset_sbp, SEX == 1)
dataset_sbp_female <- filter(dataset_sbp, SEX == 2)

avg_male <- mean(dataset_sbp_male$Diff_SBP_DBP)

avg_female <- mean(dataset_sbp_female$Diff_SBP_DBP)

sd_male <- sd(dataset_sbp_male$Diff_SBP_DBP)

sd_female <- sd(dataset_sbp_female$Diff_SBP_DBP)

data.frame(SEX = c(1, 2),
           average_by_group = c(avg_male, avg_female),
           sd_by_group = c(sd_male, sd_female))
##   SEX average_by_group sd_by_group
## 1   1         46.66498     9.35818
## 2   2         45.47853    10.20447

We did it! However, the code is quite messy, and we have generated unnecessary intermediate data frames as well. Isn’t there a smarter way?

Piping

The good news is that the tidyverse has a great feature called piping. In base R, if we do not assign a value with <-, the computer just shows the result and does not store the output.

Piping lets us use that output temporarily by passing it on to the next function with %>%.

dataset_sbp %>% head()
##   SEX BTH_G SBP DBP FBS DIS  BMI Diff_SBP_DBP
## 1   1     1 116  78  94   4 16.6           38
## 2   1     1 100  60  79   4 22.3           40
## 3   1     1 100  60  87   4 21.9           40
## 4   1     1 111  70  72   4 20.2           41
## 5   1     1 120  80  98   4 20.0           40
## 6   1     1 115  79  95   4 23.1           36

Within a pipe, the piped data can be referred to with a dot (.).

dataset_sbp %>% .$SEX %>% head()
## [1] 1 1 1 1 1 1

The data is passed on to the next function and used there for the calculation.

dataset_sbp %>% mutate(Diff_SBP_DBP = SBP - DBP) %>% head()
##   SEX BTH_G SBP DBP FBS DIS  BMI Diff_SBP_DBP
## 1   1     1 116  78  94   4 16.6           38
## 2   1     1 100  60  79   4 22.3           40
## 3   1     1 100  60  87   4 21.9           40
## 4   1     1 111  70  72   4 20.2           41
## 5   1     1 120  80  98   4 20.0           40
## 6   1     1 115  79  95   4 23.1           36

See? Here, mutate() is the tidyverse function for calculating a new variable.
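The mechanics of the pipe can be seen on plain numbers, as in this sketch (assuming dplyr is installed, which makes %>% available). The value on the left becomes the first argument of the function on the right, unless a dot says where it should go instead.

```r
library(dplyr)   # makes the %>% operator available

x <- c(1, 4, 9)

# Nested form: read from the inside out.
nested_total <- sum(sqrt(x))

# Piped form: x goes into sqrt(), and that result goes into sum().
piped_total <- x %>% sqrt() %>% sum()
print(piped_total)    # 6

# The dot marks where the piped value goes when it is not the first argument.
odd_numbers <- 2 %>% seq(1, 10, by = .)   # same as seq(1, 10, by = 2)
print(odd_numbers)    # 1 3 5 7 9
```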

Let’s do the same thing with tidyverse.

dataset_sbp %>% 
        mutate(Diff_SBP_DBP = SBP - DBP) %>% 
        group_by(SEX) %>%
        summarise(average_by_group = mean(Diff_SBP_DBP),
                  sd_by_group = sd(Diff_SBP_DBP))
## # A tibble: 2 × 3
##     SEX average_by_group sd_by_group
##   <int>            <dbl>       <dbl>
## 1     1             46.7        9.36
## 2     2             45.5       10.2
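To convince yourself that group_by() followed by summarise() computes one value per group, the sketch below runs the same pipeline on four made-up rows (toy is hypothetical data, and dplyr is assumed to be installed) and cross-checks the group means against base R's tapply().

```r
library(dplyr)

# Hypothetical toy data: two rows per sex.
toy <- data.frame(SEX = c(1, 1, 2, 2),
                  SBP = c(120, 130, 110, 125),
                  DBP = c(80, 85, 70, 75))

toy_summary <- toy %>%
        mutate(Diff_SBP_DBP = SBP - DBP) %>%   # 40, 45, 40, 50
        group_by(SEX) %>%
        summarise(average_by_group = mean(Diff_SBP_DBP),
                  n_by_group = n())            # n() counts rows per group
print(toy_summary)

# Base R cross-check: mean of the difference within each SEX value.
base_means <- tapply(toy$SBP - toy$DBP, toy$SEX, mean)
print(base_means)   # 42.5 for SEX 1, 45 for SEX 2
```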

Try to use piping as much as you can, and use plenty of line breaks and indentation. It will make your code look tidier.

This kind of application is also possible (I did not write the non-piped counterexample, as it would take too much of my personal time).

dataset_sbp %>% 
        group_by(SEX, DIS) %>%
        summarise(average = mean(Diff_SBP_DBP),
                  sd = sd(Diff_SBP_DBP))
## # A tibble: 8 × 4
## # Groups:   SEX [2]
##     SEX   DIS average    sd
##   <int> <int>   <dbl> <dbl>
## 1     1     1    51.4 11.2 
## 2     1     2    49.9 10.6 
## 3     1     3    47.4  9.72
## 4     1     4    45.6  8.59
## 5     2     1    53.1 12.0 
## 6     2     2    51.1 11.3 
## 7     2     3    47.9 10.5 
## 8     2     4    43.5  8.96
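The "# Groups: SEX [2]" line in the output above means that the result is still grouped by SEX: by default, summarise() removes only the last grouping variable. If you want a plain, ungrouped table, one option is the .groups argument of summarise(), sketched here on made-up data (toy is hypothetical, and dplyr is assumed to be installed).

```r
library(dplyr)

# Hypothetical toy data with two grouping variables.
toy <- data.frame(SEX = c(1, 1, 1, 2, 2, 2),
                  DIS = c(1, 1, 4, 1, 4, 4),
                  Diff_SBP_DBP = c(50, 54, 44, 52, 42, 46))

by_sex_dis <- toy %>%
        group_by(SEX, DIS) %>%
        summarise(average = mean(Diff_SBP_DBP),
                  n = n(),
                  .groups = "drop")   # return an ungrouped table
print(by_sex_dis)

# FALSE: no grouping is left on the result.
print(is_grouped_df(by_sex_dis))
```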

How to learn basic R (optional)

swirl()

swirl teaches you R programming and data science interactively, at your own pace, and right in the R console!

install.packages("swirl")
## 
## The downloaded binary packages are in
##  /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//RtmpsYsGtn/downloaded_packages
library(swirl)

Don’t go too far, though: it will do almost half of my job, teaching (bio)stats.
