COD_week3_2_MGK_BTE3207

Minsik Kim

2024-09-15

Introduction

In this lecture, we will explore basic R programming concepts and data manipulation using the tidyverse package. This includes variable assignment, vector operations, indexing, data frame manipulation, and data summarization.

Setup

First, we’ll set up our R environment by loading necessary libraries and setting the working directory.

Explanation:

•   Global Options: We suppress messages and warnings in the output for cleaner presentation.
•   Working Directory: Adjust path_working to match your system’s directory structure.
•   Package Management: We use the pacman package to load and manage other packages efficiently.
•   Reproducibility: Setting a seed ensures that random operations produce the same results every time.
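The setup chunk itself is not shown in this rendering. A minimal sketch of what such a chunk could look like is given below. It assumes knitr and pacman are installed; the value of path_working and the seed value 1 are placeholders chosen for illustration, not the lecture's actual settings.

```r
# Global options: hide messages and warnings in knitted output (needs knitr)
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(message = FALSE, warning = FALSE)
}

# Working directory: placeholder only, replace with your own folder
path_working <- getwd()
setwd(path_working)

# Package management: pacman::p_load() installs a package if needed, then loads it
if (requireNamespace("pacman", quietly = TRUE)) {
  pacman::p_load(tidyverse)
} else {
  message("pacman is not installed; run install.packages(\"pacman\") first")
}

# Reproducibility: the seed value 1 is an arbitrary choice for this sketch
set.seed(1)
first_draw <- sample(1:1000, 5)
set.seed(1)
second_draw <- sample(1:1000, 5)
identical(first_draw, second_draw)  # same seed, same random draw
```

Resetting the seed before each draw is what makes the two samples identical; without the second set.seed() call the draws would differ.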

Environment Report

Let’s output a report of the current R environment to verify our setup.

•   R Version: Displays the current R version.
•   Loaded Packages: Lists all packages currently loaded in the session.
•   Session Information: Provides detailed information about the R session for debugging purposes.
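The report chunk is also not shown in this rendering. One way to produce the three items listed above with base R functions is sketched here; the variable names are only illustrative.

```r
# R version as a single character string
r_version <- R.version.string
print(r_version)

# Names of the packages currently attached to the session
loaded_packages <- (.packages())
print(loaded_packages)

# Full session details (platform, locale, package versions), useful for debugging
session_details <- sessionInfo()
print(session_details)
```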

Before we begin

Basics of R

Get the current working directory of R

getwd()

## [1] "/Users/minsikkim/Dropbox (Personal)"

List the files in the current working directory

list.files()

##  [1] "@Lab_Administrative"                                  
##  [2] "@minsik"                                              
##  [3] "@wet_lab"                                             
##  [4] "2024_lailab_tech"                                     
##  [5] "Backup"                                               
##  [6] "COD_20240828_MGK_SICAS2_kegg_tax_table_validation.Rmd"
##  [7] "CV, papers"                                           
##  [8] "Database"                                             
##  [9] "ETC"                                                  
## [10] "Finance"                                              
## [11] "Forms_US"                                             
## [12] "Git"                                                  
## [13] "Graduate school data"                                 
## [14] "Icon\r"                                               
## [15] "Inha"                                                 
## [16] "KFTP.kaist.ac.kr"                                     
## [17] "KRIBB"                                                
## [18] "Lectures"                                             
## [19] "MGH"                                                  
## [20] "Modified.zip"                                         
## [21] "Photos"                                               
## [22] "Project_CFB"                                          
## [23] "Project_Freezer"                                      
## [24] "Project_SICAS2_microbiome"                            
## [25] "Project_Uganda_CAS"                                   
## [26] "Project_Uganda_CAS (view-only conflicts 2024-08-12)"  
## [27] "Project_Uganda_CAS (view-only conflicts 2024-08-26)"  
## [28] "R"                                                    
## [29] "Review"                                               
## [30] "sbp_dataset_korea_2013-2014.csv"                      
## [31] "scripts"                                              
## [32] "Sequencing_archive"                                   
## [33] "SICAS2_season_git"                                    
## [34] "Summer_Student_Projects"                              
## [35] "Undergraduate school (2011fall_2014spring)"           
## [36] "volume_reduction_data.csv"                            
## [37] "발표자료 작성방"

Changing directory

Use the Tab key to autocomplete folder names while typing a path!

list.files("Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset/")

## [1] "korea_population_2017_2019.csv"    "restaurants_by_size_2017_2019.csv"
## [3] "sbp_dataset_korea_2013-2014.csv"

setwd("Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset/")

getwd()

## [1] "/Users/minsikkim/Dropbox (Personal)/Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset"

Now you are in the dataset folder!

Going back to the parent folder

To move up one level, to the parent folder, use "..":

setwd("..")

getwd()

## [1] "/Users/minsikkim"

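setwd("..") moves the working directory up one level. In the output above it lands in /Users/minsikkim, the parent of the starting folder, most likely because each code chunk in a knitted R Markdown document starts again from the original working directory. In an interactive session, the same command run from the dataset folder would move up only one level from there.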
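Because setwd() changes the working directory for the rest of the session, a cautious pattern is to store the current directory first and restore it afterwards. The sketch below uses tempdir() as a stand-in destination so that it runs on any machine; the variable names are illustrative.

```r
# Remember where we started
original_dir <- getwd()

# Move somewhere else (tempdir() is just a folder that always exists)
setwd(tempdir())
files_in_temp <- list.files()

# Go back to the starting folder, no matter how many levels we moved
setwd(original_dir)
getwd()
```

Restoring a stored path is more reliable than repeating setwd("..") a guessed number of times.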

Logical values

a = 1

a

## [1] 1

Here, = assigns a value, just as <- does.

To test whether two things are equal, R uses ==.

a == 1

## [1] TRUE

Because we assigned 1 to a in the previous code chunk, this test returns TRUE.

"a" == 1

## [1] FALSE

This tests whether the character "a" is the same as the numeric value 1. As they are not the same, it returns FALSE.

Here, the TRUE and FALSE results are logical values, which are one type of binary variable. The comparison also works element by element on longer vectors.

c(1, 2, 3, 4, 5) == c(1, 2, 2, 4, 5)

## [1]  TRUE  TRUE FALSE  TRUE  TRUE

c(1, 2, 3, 4, 5) == 1

## [1]  TRUE FALSE FALSE FALSE FALSE

Each comparison returns a vector with one logical result per element. Using this, we can filter data easily!
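As a small illustration of filtering with a logical vector, the sketch below keeps only the elements for which a test is TRUE. The vector values and variable names are made up for this example.

```r
values <- c(1, 2, 3, 4, 5)

# One TRUE/FALSE per element
is_large <- values > 2

# Keep only the elements where the test is TRUE
large_values <- values[is_large]
large_values

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_large <- sum(is_large)
n_large
```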

How to select values

To select values from a vector, we use square brackets, [].

a <- c(1, 2, 3)

a[1]

## [1] 1

a[1] returns the first element of this vector.

It works slightly differently when the elements have names.

names(a) <- c("first", "second", "third")

str(a)

##  Named num [1:3] 1 2 3
##  - attr(*, "names")= chr [1:3] "first" "second" "third"

Now the vector a is a named numeric vector.

In this case,

a[1]

## first 
##     1

the result contains both the name and the value!

This sometimes causes problems in calculations.

In that case, we need to use double brackets, [[]], which return the value without its name.

a[[1]]

## [1] 1
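The difference between single and double brackets on a named vector can be checked directly. The sketch below rebuilds the same named vector and also shows unname(), a base R function that strips all names at once; the result variable names are illustrative.

```r
a <- c(first = 1, second = 2, third = 3)

# Single brackets keep the name attached to the value
single_named <- a["second"]
names(single_named)

# Double brackets return only the value
single_bare <- a[["second"]]
names(single_bare)

# unname() removes the names from the whole vector
all_bare <- unname(a)
all_bare
```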

By the way, a sequence of consecutive numbers can be created with a colon (:).

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

5:20

##  [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Because the colon returns a vector (multiple elements), the two lines of code below work the same way.

a[c(1, 2, 3)]

##  first second  third 
##      1      2      3

a[1:3]

##  first second  third 
##      1      2      3

Selecting data in a data frame works the same way, except that rows and columns are separated by a comma. Here is one example.

dataframe_example <- data.frame(Joe = 1:100,
                                Trump = sample(1:1000, 100),
                                Obama = sample(1:1000, 100),
                                George = sample(1:1000, 100)
                                )

head(dataframe_example)

##   Joe Trump Obama George
## 1   1   371   136    973
## 2   2   792   593    332
## 3   3   242   694    530
## 4   4   383   929     89
## 5   5   539   903     17
## 6   6   260   915    932

This data frame has no real meaning: each column holds 100 numbers randomly sampled from 1 to 1000, except for the Joe column, which holds the ordered numbers from 1 to 100.

To select some data, we again use square brackets, but now with a row index and a column index separated by a comma (,).

Selecting the 1st row and 1st column

dataframe_example[1,1]

## [1] 1

Selecting multiple rows in column 1

dataframe_example[1:10, 1]

##  [1]  1  2  3  4  5  6  7  8  9 10

Selecting multiple rows and columns

dataframe_example[3:5, 1:2]

##   Joe Trump
## 3   3   242
## 4   4   383
## 5   5   539
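Besides numeric positions, rows and columns can also be selected by column name or by a logical condition. The sketch below uses a small made-up data frame (scores, with hypothetical columns id, height, and weight) so that the results are easy to check by eye.

```r
scores <- data.frame(
  id = 1:5,
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# One column as a plain vector, using $
height_column <- scores$height

# Several columns by name (rows left empty means "all rows")
two_columns <- scores[, c("id", "weight")]

# Rows that satisfy a condition (columns left empty means "all columns")
tall_rows <- scores[scores$height > 165, ]

# A single column normally comes back as a vector;
# drop = FALSE keeps it as a data frame
one_column_df <- scores[, "height", drop = FALSE]
```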

How to install more functions in R

We use the install.packages() function to install packages from the CRAN (Comprehensive R Archive Network) server.

install.packages("tidyverse")

## 
## The downloaded binary packages are in
##  /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//Rtmp2fNQa4/downloaded_packages

However, the package you just installed is only stored on your computer (in a folder called the library); it is not yet loaded into your R session (simply put, you have not opened it). To use an installed package, you need to load it with the library() function.

library(tidyverse)

Now you can use the tidyverse package!

The tidyverse is a collection of packages that helps you write code and summarize results.

Once a package is installed, you can call a function from it explicitly by writing the package name and the function name separated by two colons (::).

tidyverse::tidyverse_conflicts()

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Because we loaded a new package, there may be conflicts between function names. tidyverse_conflicts() lists those conflicts.

Developers work independently, and CRAN has no system for controlling the names of newly developed functions. That means a function from a package you install can have the same name as a function from another package!

dataset_sbp <- read.csv(file = "Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv") 

head(dplyr::filter(dataset_sbp, SEX == 1))

##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 116  78  94   4 16.6
## 2   1     1 100  60  79   4 22.3
## 3   1     1 100  60  87   4 21.9
## 4   1     1 111  70  72   4 20.2
## 5   1     1 120  80  98   4 20.0
## 6   1     1 115  79  95   4 23.1

head(stats::filter(dataset_sbp$SEX, rep(1,3)))

## [1] NA  3  3  3  3  3

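dplyr::filter() keeps the rows that satisfy the given condition, SEX == 1.

(This does the same thing as base R’s subset().)

However, stats::filter() (from the stats package, which ships with R) does something different. It applies linear filtering to a univariate time series, or to each series separately of a multivariate time series. In the example above, rep(1, 3) is a three-point filter, so each output value is the sum of three neighbouring SEX values (3 when all three are 1), and the first value is NA because it has no left neighbour.

The safest practice is to write every function call with its package prefix (package::function). In general, though, you don’t have to, because such conflicts are not that common.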
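To make the difference concrete without needing the lecture dataset, the sketch below runs stats::filter() on a short made-up vector and subset() on a small made-up data frame. Only base R is used; the object names are illustrative.

```r
# stats::filter() with weights (1, 1, 1): each value becomes the sum of
# itself and its two neighbours; the ends have a missing neighbour, hence NA
x <- 1:5
moving_sum <- stats::filter(x, rep(1, 3))
moving_sum

# subset() keeps the rows that satisfy a condition
toy <- data.frame(SEX = c(1, 2, 1), SBP = c(120, 130, 110))
kept_rows <- subset(toy, SEX == 1)
kept_rows
```

The two functions share a name in some packages but answer entirely different questions, which is why the package prefix matters when a conflict exists.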

Basic tidyverse

The tidyverse helps you write code efficiently. But how?

Let’s look at an example. We want to filter the samples based on a condition, and we have multiple conditions.

head(filter(dataset_sbp, SEX == 1 & SBP > 120))

##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 130  80  90   4 28.4
## 2   1     1 130  80  92   4 27.8
## 3   1     1 130  80  87   4 23.5
## 4   1     1 124  75  92   4 21.7
## 5   1     1 139  89  86   4 21.7
## 6   1     1 138  86  95   4 27.2

This call filters on two conditions at once: SEX == 1 and SBP > 120. Filters can also be nested, one condition per call. Here a third condition, FBS > 110, is added:

head(filter(filter(filter(dataset_sbp, SEX == 1), SBP > 120), FBS > 110))

##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 135  85 114   4 31.4
## 2   1     1 130  80 130   4 27.5
## 3   1     1 160  95 113   4 38.3
## 4   1     1 130  70 111   4 18.4
## 5   1     1 130  80 114   2 24.0
## 6   1     1 150 100 123   4 27.8

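This nested call filters on three conditions (SEX == 1, SBP > 120, and FBS > 110), and the head() function is wrapped around the outside again. Nested calls have to be read from the inside out, which quickly becomes hard to follow.

The code below does the same thing in a more readable way.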

dataset_sbp %>%
        filter(SEX == 1) %>%
        filter(SBP > 120) %>%
        filter(FBS > 110) %>%
        head()

##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 135  85 114   4 31.4
## 2   1     1 130  80 130   4 27.5
## 3   1     1 160  95 113   4 38.3
## 4   1     1 130  70 111   4 18.4
## 5   1     1 130  80 114   2 24.0
## 6   1     1 150 100 123   4 27.8
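To check that chaining several filter() calls gives the same rows as one filter() call with several conditions, the sketch below compares both forms on a small made-up data frame. It assumes the dplyr package (part of the tidyverse) is installed; the toy values are invented for illustration.

```r
library(dplyr)

toy <- data.frame(
  SEX = c(1, 1, 2, 1),
  SBP = c(130, 110, 140, 125),
  FBS = c(115, 120, 130, 100)
)

# One condition per step, chained with the pipe
chained <- toy %>%
  filter(SEX == 1) %>%
  filter(SBP > 120) %>%
  filter(FBS > 110)

# All conditions in one call; commas inside filter() mean "and"
combined <- toy %>%
  filter(SEX == 1, SBP > 120, FBS > 110)

all.equal(chained, combined)
```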

But what if we want to do some computation first, and then summarize the result by group?

Let’s look at another example.

We can do this by writing several lines of code. Let’s say we are interested in the difference between SBP and DBP. We then want to split the data by gender, and compute the average and standard deviation of that difference for each group.

dataset_sbp$Diff_SBP_DBP <- dataset_sbp$SBP - dataset_sbp$DBP

dataset_sbp_male <- filter(dataset_sbp, SEX == 1)
dataset_sbp_female <- filter(dataset_sbp, SEX == 2)

avg_male <- mean(dataset_sbp_male$Diff_SBP_DBP)

avg_female <- mean(dataset_sbp_female$Diff_SBP_DBP)

sd_male <- sd(dataset_sbp_male$Diff_SBP_DBP)

sd_female <- sd(dataset_sbp_female$Diff_SBP_DBP)

data.frame(SEX = c(1, 2),
           average_by_group = c(avg_male, avg_female),
           sd_by_group = c(sd_male, sd_female))

##   SEX average_by_group sd_by_group
## 1   1         46.66498     9.35818
## 2   2         45.47853    10.20447

We did it! However, the code is quite messy, and we created unnecessary intermediate data frames along the way. Isn’t there a smarter way?

Piping

The good news is that the tidyverse has a great feature called piping (the %>% operator, which comes from the magrittr package). In base R, if we do not assign a value with <-, the computer just shows the result and does not store the output.

Piping passes that output on to the next function temporarily, using %>%.

dataset_sbp %>% head()

##   SEX BTH_G SBP DBP FBS DIS  BMI Diff_SBP_DBP
## 1   1     1 116  78  94   4 16.6           38
## 2   1     1 100  60  79   4 22.3           40
## 3   1     1 100  60  87   4 21.9           40
## 4   1     1 111  70  72   4 20.2           41
## 5   1     1 120  80  98   4 20.0           40
## 6   1     1 115  79  95   4 23.1           36

Inside a pipe, the piped data can be referred to with a dot (.).

dataset_sbp %>% .$SEX %>% head()

## [1] 1 1 1 1 1 1

The data is passed to the next function and used there for the calculation.
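The dot is most useful when the piped data should not go into the first argument of the next function. The sketch below, which assumes dplyr is installed, uses a made-up data frame to show the dot with $, the dplyr function pull() as an alternative, and the dot as the data argument of lm().

```r
library(dplyr)

toy <- data.frame(
  SEX = c(1, 2, 1, 2),
  SBP = c(100, 110, 120, 130),
  DBP = c(60, 70, 80, 90)
)

# Extract one column with the dot placeholder
sex_with_dot <- toy %>% .$SEX

# pull() does the same job without the dot
sex_with_pull <- toy %>% pull(SEX)

# lm() takes the formula first, so the dot says where the data goes
fit <- toy %>% lm(SBP ~ DBP, data = .)
coef(fit)
```

In this toy data SBP is always DBP + 40, so the fitted intercept is 40 and the slope is 1.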

# Calculate the difference between SBP and DBP
dataset_sbp <- dataset_sbp %>%
  mutate(Diff_SBP_DBP = SBP - DBP)

# View the first few rows
head(dataset_sbp)

##   SEX BTH_G SBP DBP FBS DIS  BMI Diff_SBP_DBP
## 1   1     1 116  78  94   4 16.6           38
## 2   1     1 100  60  79   4 22.3           40
## 3   1     1 100  60  87   4 21.9           40
## 4   1     1 111  70  72   4 20.2           41
## 5   1     1 120  80  98   4 20.0           40
## 6   1     1 115  79  95   4 23.1           36

See? Here, mutate() is the tidyverse function for calculating a new variable.
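mutate() can create several variables in one call, and a later variable may use one created earlier in the same call. A minimal sketch on made-up data, assuming dplyr is installed; the column high_diff and its cutoff of 45 are invented for illustration.

```r
library(dplyr)

toy <- data.frame(SBP = c(120, 140), DBP = c(80, 90))

toy_new <- toy %>%
  mutate(
    Diff_SBP_DBP = SBP - DBP,       # new numeric column
    high_diff = Diff_SBP_DBP > 45   # uses the column created just above
  )

toy_new
```

The original toy data frame is unchanged; mutate() returns a new data frame, which is why the result is assigned to a name.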

Now let’s compute the same summary by group with the tidyverse.

# Calculate average and standard deviation of Diff_SBP_DBP by SEX
summary_by_sex <- dataset_sbp %>%
  group_by(SEX) %>%
  summarise(
    average_diff = mean(Diff_SBP_DBP, na.rm = TRUE),
    sd_diff = sd(Diff_SBP_DBP, na.rm = TRUE)
  )

# View the summary
print(summary_by_sex)

## # A tibble: 2 × 3
##     SEX average_diff sd_diff
##   <int>        <dbl>   <dbl>
## 1     1         46.7    9.36
## 2     2         45.5   10.2

Grouping by Multiple Variables

Explanation:

•   Multiple Grouping Variables: Allows for more granular analysis.
•   Nested Groups: The data is first grouped by SEX, then by DIS within each SEX.

# Calculate average and standard deviation by SEX and DIS
summary_by_sex_dis <- dataset_sbp %>%
  group_by(SEX, DIS) %>%
  summarise(
    average_diff = mean(Diff_SBP_DBP, na.rm = TRUE),
    sd_diff = sd(Diff_SBP_DBP, na.rm = TRUE)
  )

# View the summary
print(summary_by_sex_dis)

## # A tibble: 8 × 4
## # Groups:   SEX [2]
##     SEX   DIS average_diff sd_diff
##   <int> <int>        <dbl>   <dbl>
## 1     1     1         51.4   11.2 
## 2     1     2         49.9   10.6 
## 3     1     3         47.4    9.72
## 4     1     4         45.6    8.59
## 5     2     1         53.1   12.0 
## 6     2     2         51.1   11.3 
## 7     2     3         47.9   10.5 
## 8     2     4         43.5    8.96
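The line "# Groups: SEX [2]" in the output shows that the result is still grouped by SEX: by default summarise() removes only the last grouping variable. The sketch below, on made-up data and assuming dplyr 1.0.0 or later, uses the .groups argument to return an ungrouped result and adds n() to count the rows in each group.

```r
library(dplyr)

toy <- data.frame(
  SEX = c(1, 1, 1, 1, 2, 2, 2, 2),
  DIS = c(1, 1, 2, 2, 1, 1, 2, 2),
  Diff_SBP_DBP = c(40, 50, 44, 46, 38, 42, 50, 60)
)

summary_toy <- toy %>%
  group_by(SEX, DIS) %>%
  summarise(
    n = n(),                              # rows per group
    average_diff = mean(Diff_SBP_DBP),
    sd_diff = sd(Diff_SBP_DBP),
    .groups = "drop"                      # return an ungrouped table
  )

summary_toy
```

Dropping the grouping avoids surprises when the summary table is passed to further mutate() or summarise() calls.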

Conclusion

In this lecture, we covered:

•   Basic R Operations: Variable assignment, comparisons, and vector indexing.
•   Data Frames: Creation and element access.
•   Data Manipulation with tidyverse:
    •   Reading data from files.
    •   Filtering data using conditions.
    •   Adding new variables with mutate().
    •   Grouping and summarizing data with group_by() and summarise().

How to learn basic R (optional)

swirl()

swirl teaches you R programming and data science interactively, at your own pace, and right in the R console!

install.packages("swirl")

## 
## The downloaded binary packages are in
##  /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//Rtmp2fNQa4/downloaded_packages

library(swirl)

After loading the package, type swirl() in the console to start a lesson.

Don’t go too far, though: it will do almost half of my job of teaching (bio)stats.
