Introduction
In this lecture, we will explore basic R programming concepts and data manipulation using the tidyverse package. This includes variable assignment, vector operations, indexing, data frame manipulation, and data summarization.
Setup
First, we’ll set up our R environment by loading necessary libraries and setting the working directory.
Explanation:
• Global Options: We suppress messages and warnings in the output for cleaner presentation.
• Working Directory: Adjust path_working to match your system’s directory structure.
• Package Management: We use the pacman package to load and manage other packages efficiently.
• Reproducibility: Setting a seed ensures that random operations produce the same results every time.
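The reproducibility point can be sketched with base R alone. The seed value 123 and the variable names below are arbitrary choices for illustration, not the values used in this lecture's setup chunk.

```r
# Draw five random numbers twice, resetting the seed before each draw.
set.seed(123)                      # 123 is an arbitrary illustrative seed
first_draw <- sample(1:1000, 5)

set.seed(123)                      # same seed, so the random stream restarts
second_draw <- sample(1:1000, 5)

# Both draws contain exactly the same numbers.
identical(first_draw, second_draw)

# Without resetting the seed, the stream simply continues,
# so this draw will generally differ from the first two.
third_draw <- sample(1:1000, 5)
```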
Environment Report
Let’s output a report of the current R environment to verify our setup.
• R Version: Displays the current R version.
• Loaded Packages: Lists all packages currently loaded in the session.
• Session Information: Provides detailed information about the R session for debugging purposes.
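A minimal sketch of such a report, using only base R functions. The variable names are illustrative, and the exact output depends on your own installation.

```r
# R version as a single character string
r_version <- R.version.string
r_version

# Names of the packages attached in the current session
attached <- .packages()
attached

# Detailed session information (platform, locale, package versions)
info <- sessionInfo()
info
```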
Before we begin
Basics of R
Get the current path of the R working directory
getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)"
List the files in the current path
list.files()
## [1] "@Lab_Administrative"
## [2] "@minsik"
## [3] "@wet_lab"
## [4] "2024_lailab_tech"
## [5] "Backup"
## [6] "COD_20240828_MGK_SICAS2_kegg_tax_table_validation.Rmd"
## [7] "CV, papers"
## [8] "Database"
## [9] "ETC"
## [10] "Finance"
## [11] "Forms_US"
## [12] "Git"
## [13] "Graduate school data"
## [14] "Icon\r"
## [15] "Inha"
## [16] "KFTP.kaist.ac.kr"
## [17] "KRIBB"
## [18] "Lectures"
## [19] "MGH"
## [20] "Modified.zip"
## [21] "Photos"
## [22] "Project_CFB"
## [23] "Project_Freezer"
## [24] "Project_SICAS2_microbiome"
## [25] "Project_Uganda_CAS"
## [26] "Project_Uganda_CAS (view-only conflicts 2024-08-12)"
## [27] "Project_Uganda_CAS (view-only conflicts 2024-08-26)"
## [28] "R"
## [29] "Review"
## [30] "sbp_dataset_korea_2013-2014.csv"
## [31] "scripts"
## [32] "Sequencing_archive"
## [33] "SICAS2_season_git"
## [34] "Summer_Student_Projects"
## [35] "Undergraduate school (2011fall_2014spring)"
## [36] "volume_reduction_data.csv"
## [37] "발표자료 작성방"
Changing directory
Use the Tab key to autocomplete folder names while navigating!
list.files("Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset/")
## [1] "korea_population_2017_2019.csv" "restaurants_by_size_2017_2019.csv"
## [3] "sbp_dataset_korea_2013-2014.csv"
setwd("Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset/")
getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset"
Now you are in the dataset folder!
Going back to original folder
To go up to the parent folder, use two dots (..):
setwd("..")
getwd()
## [1] "/Users/minsikkim"
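Each setwd("..") call moves up one level from the current folder. Note that this document was rendered with R Markdown, which restores the working directory after every code chunk, so the output above shows the parent of the starting folder rather than the starting folder itself.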
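Here is a small sketch you can run safely in an interactive session. It works inside R's temporary folder, and the folder name dataset_demo is made up for illustration. Saving the starting path first lets you return to it reliably, whatever happens in between.

```r
# Remember where we started so we can always come back.
start_dir <- getwd()

# Create a throwaway folder inside R's temporary directory.
demo_dir <- file.path(tempdir(), "dataset_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Move into the new folder and check where we are.
setwd(demo_dir)
inside <- basename(getwd())   # "dataset_demo"

# ".." means "one level up", so this moves to the parent folder.
setwd("..")
parent_dir <- getwd()

# Return to the starting folder using the saved path.
setwd(start_dir)
```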
Logical values
a = 1
a
## [1] 1
= does the same thing as <-.
To test whether two things are equal, R uses ==.
a == 1
## [1] TRUE
As we assigned a = 1 in the previous code chunk, this test results in TRUE.
"a" == 1
## [1] FALSE
This tests whether the character "a" is the same as the numeric value 1. As they are not the same, it returns FALSE.
Here, the TRUE and FALSE results are logical values, and they are one type of binary variable. The comparison works for longer vectors or variables as well.
c(1, 2, 3, 4, 5) == c(1, 2, 2, 4, 5)
## [1] TRUE TRUE FALSE TRUE TRUE
c(1, 2, 3, 4, 5) == 1
## [1] TRUE FALSE FALSE FALSE FALSE
The result is a vector with one logical value per comparison. Using this, we can filter data easily!
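As a quick sketch of that filtering idea, a logical vector placed inside square brackets keeps only the elements where it is TRUE. The blood pressure values below are made up for illustration.

```r
# Made-up systolic blood pressure values
sbp <- c(116, 100, 130, 124, 139)

# One logical value per element: is it above 120?
is_high <- sbp > 120
is_high            # FALSE FALSE TRUE TRUE TRUE

# Keep only the elements where the test is TRUE
sbp[is_high]       # 130 124 139

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(is_high)       # 3
```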
How to select values
To select values from a vector, we use [] (square brackets).
a <- c(1, 2, 3)
a[1]
## [1] 1
a[1] returns the first element of this vector.
It works slightly differently when the data have names.
names(a) <- c("first", "second", "third")
str(a)
## Named num [1:3] 1 2 3
## - attr(*, "names")= chr [1:3] "first" "second" "third"
Now the vector a is a named numeric vector.
In this case,
a[1]
## first
## 1
The result contains both the name and the value!
This sometimes causes problems in calculations.
In that case, we need to use double brackets, [[]].
a[[1]]
## [1] 1
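The difference between the two bracket styles can be checked directly. This sketch rebuilds the same named vector in one step; the variable names with_name and value_only are just for illustration.

```r
# The same named vector as above, created in one step
a <- c(first = 1, second = 2, third = 3)

with_name  <- a[1]     # single brackets keep the name
value_only <- a[[1]]   # double brackets return only the value

names(with_name)       # "first"
names(value_only)      # NULL

# Double brackets also accept a name instead of a position
a[["second"]]          # 2

# unname() strips the names from the whole vector at once
unname(a)              # 1 2 3
```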
By the way, a run of consecutive numbers can be generated with a colon (:), which makes it easy to select several elements at once.
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
5:20
## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Because the colon produces a vector (multiple elements), both lines of code below work the same way.
a[c(1, 2, 3)]
## first second third
## 1 2 3
a[1:3]
## first second third
## 1 2 3
Selecting data from a data frame works the same way, except that rows and columns are given separately, with a comma between them. Here is one example.
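# Joe holds the ordered numbers 1 to 100; the other columns each hold
# 100 numbers sampled at random (without replacement) from 1 to 1000.
dataframe_example <- data.frame(Joe = 1:100,
                                Trump = sample(1:1000, 100),
                                Obama = sample(1:1000, 100),
                                George = sample(1:1000, 100)
)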
head(dataframe_example)
## Joe Trump Obama George
## 1 1 371 136 973
## 2 2 792 593 332
## 3 3 242 694 530
## 4 4 383 929 89
## 5 5 539 903 17
## 6 6 260 915 932
This data frame has no real meaning: each column holds 100 numbers randomly sampled from 1 to 1000, except Joe, which holds the ordered numbers from 1 to 100.
To select data, we again use index numbers, but we separate the row and column indices with a comma (,).
Selecting 1st row and 1st column
dataframe_example[1,1]
## [1] 1
Selecting multiple rows in column 1
dataframe_example[1:10, 1]
## [1] 1 2 3 4 5 6 7 8 9 10
Selecting multiple rows and columns
dataframe_example[3:5, 1:2]
## Joe Trump
## 3 3 242
## 4 4 383
## 5 5 539
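Rows and columns can also be selected by name or by a logical test, not only by position. The small data frame below (toy, with columns id and score) is made up for illustration.

```r
# A small made-up data frame
toy <- data.frame(id = 1:5, score = c(10, 40, 25, 5, 60))

# Row 2 of the column called "score"
toy[2, "score"]              # 40

# A whole column as a vector, using $
toy$score                    # 10 40 25 5 60

# Leave the column slot empty to keep all columns;
# the logical test keeps only rows with score above 20
high_rows <- toy[toy$score > 20, ]
nrow(high_rows)              # 3

# Combine both: ids of the rows with score above 20
high_ids <- toy[toy$score > 20, "id"]
high_ids                     # 2 3 5
```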
How to install more functions in R
We use the install.packages() function to install packages from the CRAN (Comprehensive R Archive Network) server.
install.packages("tidyverse")
##
## The downloaded binary packages are in
## /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//Rtmp2fNQa4/downloaded_packages
The package you just installed is now on your computer (somewhere in a folder called a library), but it is not yet loaded into R (simply put, you have not opened it).
To use an installed package, you need to call the library() function.
library(tidyverse)
Now you can use the tidyverse package!
The tidyverse package helps you write code and summarize results.
Once a package is installed, you can call a specific function from it by writing the package name, two colons (::), and the function name.
tidyverse::tidyverse_conflicts()
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Because we loaded a new package, there can be conflicts between function names. tidyverse_conflicts() shows the list of those conflicts.
Package developers work independently, and CRAN has no system for controlling the names of newly developed functions. That means a function from a package you installed can have the same name as a function from another package!
dataset_sbp <- read.csv(file = "Git/BTE3207_Advanced_Biostatistics(Macmini)_git/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")
head(dplyr::filter(dataset_sbp, SEX == 1))
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 116 78 94 4 16.6
## 2 1 1 100 60 79 4 22.3
## 3 1 1 100 60 87 4 21.9
## 4 1 1 111 70 72 4 20.2
## 5 1 1 120 80 98 4 20.0
## 6 1 1 115 79 95 4 23.1
head(stats::filter(dataset_sbp$SEX, rep(1,3)))
## [1] NA 3 3 3 3 3
dplyr::filter() keeps the rows that satisfy the given condition, SEX == 1 (similar to what subset() does).
However, stats::filter() (from the stats package, which comes with R) does something different. It applies linear filtering to a univariate time series, or to each series separately of a multivariate time series.
The safest practice is to write every function with its package name and ::. In everyday work you usually don't have to, because such conflicts are not that common.
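The contrast can be reproduced with functions that ship with R, so no extra package is needed. The toy data frame below is made up for illustration; subset() plays the role of row filtering, and stats::filter() with rep(1, 3) computes a sum over a sliding window of three values.

```r
# Made-up data: sex code and systolic blood pressure
toy <- data.frame(SEX = c(1, 2, 1, 2, 1),
                  SBP = c(116, 100, 130, 124, 139))

# Row filtering in base R: keep rows where SEX is 1
males <- subset(toy, SEX == 1)
nrow(males)                       # 3

# stats::filter() is a linear filter, not a row filter.
# With weights rep(1, 3) it sums each value with its two neighbours;
# the first and last positions have no complete window, so they are NA.
moving_sum <- stats::filter(toy$SBP, rep(1, 3))
as.numeric(moving_sum)            # NA 346 354 393 NA
```

Writing the package name explicitly, as in stats::filter(), makes it clear which of the two behaviours you are asking for even when another package with a filter() function is loaded.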
Basic tidyverse
Tidyverse helps you write code efficiently. But how?
Let's look at an example. We want to filter samples based on a condition, and we have multiple conditions.
head(filter(dataset_sbp, SEX == 1 & SBP > 120))
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 130 80 90 4 28.4
## 2 1 1 130 80 92 4 27.8
## 3 1 1 130 80 87 4 23.5
## 4 1 1 124 75 92 4 21.7
## 5 1 1 139 89 86 4 21.7
## 6 1 1 138 86 95 4 27.2
This call filtered on two conditions at once: SEX == 1 and SBP > 120. But what if we want to apply several steps one after another, each working on the result of the previous one? With ordinary function calls, the steps have to be nested inside each other.
head(filter(filter(filter(dataset_sbp, SEX == 1), SBP > 120), FBS > 110))
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 135 85 114 4 31.4
## 2 1 1 130 80 130 4 27.5
## 3 1 1 160 95 113 4 38.3
## 4 1 1 130 70 111 4 18.4
## 5 1 1 130 80 114 2 24.0
## 6 1 1 150 100 123 4 27.8
This nested call filtered on three conditions, SEX == 1, SBP > 120, and FBS > 110, and the head function wraps around the outside once more, so the code has to be read from the inside out.
The following code does the same thing, but reads from top to bottom.
dataset_sbp %>%
  filter(SEX == 1) %>%
  filter(SBP > 120) %>%
  filter(FBS > 110) %>%
  head()
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 135 85 114 4 31.4
## 2 1 1 130 80 130 4 27.5
## 3 1 1 160 95 113 4 38.3
## 4 1 1 130 70 111 4 18.4
## 5 1 1 130 80 114 2 24.0
## 6 1 1 150 100 123 4 27.8
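To convince yourself that the nested and the step-by-step versions really are equivalent, here is a sketch on a tiny made-up data frame (toy is not the lecture dataset). It assumes the dplyr package, which is part of tidyverse, is installed.

```r
library(dplyr)

# Made-up data with the same column names as the lecture dataset
toy <- data.frame(
  SEX = c(1, 1, 2, 1, 2, 1),
  SBP = c(135, 118, 140, 130, 125, 160),
  FBS = c(114, 120, 115, 95, 130, 113)
)

# Nested calls: read from the inside out
nested <- filter(filter(filter(toy, SEX == 1), SBP > 120), FBS > 110)

# Step by step: read from top to bottom
piped <- toy %>%
  filter(SEX == 1) %>%
  filter(SBP > 120) %>%
  filter(FBS > 110)

# filter() also accepts several conditions separated by commas,
# which are combined with "and"
combined <- toy %>% filter(SEX == 1, SBP > 120, FBS > 110)

identical(nested, piped)   # TRUE
piped$SBP                  # 135 160
```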
But how do we perform some computation first and then work with the result group by group?
Let's look at another example.
We can do this by adding multiple lines of code. Say we are interested in the difference between SBP and DBP, and we want to compare it between the sexes by summarizing each group with its mean and standard deviation.
dataset_sbp$Diff_SBP_DBP <- dataset_sbp$SBP - dataset_sbp$DBP
dataset_sbp_male <- filter(dataset_sbp, SEX == 1)
dataset_sbp_female <- filter(dataset_sbp, SEX == 2)
avg_male <- mean(dataset_sbp_male$Diff_SBP_DBP)
avg_female <- mean(dataset_sbp_female$Diff_SBP_DBP)
sd_male <- sd(dataset_sbp_male$Diff_SBP_DBP)
sd_female <- sd(dataset_sbp_female$Diff_SBP_DBP)
data.frame(SEX = c(1, 2),
           average_by_group = c(avg_male, avg_female),
           sd_by_group = c(sd_male, sd_female))
## SEX average_by_group sd_by_group
## 1 1 46.66498 9.35818
## 2 2 45.47853 10.20447
We did it! However, the code is quite messy, and we have generated unnecessary intermediate data frames as well. Isn't there a smarter way?
Piping
The good news is that the tidyverse package has a great feature called piping. In base R, if we do not assign a value with <-, the computer just shows the result and does not store it.
Piping lets us use that output temporarily, by passing it straight to the next function with %>%.
dataset_sbp %>% head()
## SEX BTH_G SBP DBP FBS DIS BMI Diff_SBP_DBP
## 1 1 1 116 78 94 4 16.6 38
## 2 1 1 100 60 79 4 22.3 40
## 3 1 1 100 60 87 4 21.9 40
## 4 1 1 111 70 72 4 20.2 41
## 5 1 1 120 80 98 4 20.0 40
## 6 1 1 115 79 95 4 23.1 36
Inside a pipe, the piped data can be referred to with a dot (.).
dataset_sbp %>% .$SEX %>% head()
## [1] 1 1 1 1 1 1
The data is passed on to the next function and used there for the calculation.
# Calculate the difference between SBP and DBP
dataset_sbp <- dataset_sbp %>%
  mutate(Diff_SBP_DBP = SBP - DBP)
# View the first few rows
head(dataset_sbp)
## SEX BTH_G SBP DBP FBS DIS BMI Diff_SBP_DBP
## 1 1 1 116 78 94 4 16.6 38
## 2 1 1 100 60 79 4 22.3 40
## 3 1 1 100 60 87 4 21.9 40
## 4 1 1 111 70 72 4 20.2 41
## 5 1 1 120 80 98 4 20.0 40
## 6 1 1 115 79 95 4 23.1 36
See? Here, mutate() is the tidyverse function for calculating a new variable.
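For comparison, here is a sketch of mutate() next to the base R function transform(), which does a similar job. The three-row data frame toy is made up for illustration, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Made-up blood pressure values
toy <- data.frame(SBP = c(116, 100, 130),
                  DBP = c(78, 60, 80))

# tidyverse: add a new column inside a pipe
with_mutate <- toy %>%
  mutate(Diff_SBP_DBP = SBP - DBP)

# base R: transform() adds the same column
with_base <- transform(toy, Diff_SBP_DBP = SBP - DBP)

with_mutate$Diff_SBP_DBP   # 38 40 50
with_base$Diff_SBP_DBP     # 38 40 50

# The original data frame is unchanged until you assign the result
names(toy)                 # "SBP" "DBP"
```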
Now let's compute the same group means and standard deviations as before, this time with tidyverse.
# Calculate average and standard deviation of Diff_SBP_DBP by SEX
summary_by_sex <- dataset_sbp %>%
  group_by(SEX) %>%
  summarise(
    average_diff = mean(Diff_SBP_DBP, na.rm = TRUE),
    sd_diff = sd(Diff_SBP_DBP, na.rm = TRUE)
  )
# View the summary
print(summary_by_sex)
## # A tibble: 2 × 3
## SEX average_diff sd_diff
## <int> <dbl> <dbl>
## 1 1 46.7 9.36
## 2 2 45.5 10.2
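On a tiny made-up data frame the group summaries can be checked by hand. This sketch assumes dplyr is installed; the column name Diff and the use of tapply() as a base R cross-check are illustrative choices, not part of the lecture code.

```r
library(dplyr)

# Made-up data: two people per sex
toy <- data.frame(SEX  = c(1, 1, 2, 2),
                  Diff = c(40, 50, 30, 50))

# tidyverse: one row per group, one column per summary
tidy_summary <- toy %>%
  group_by(SEX) %>%
  summarise(
    average_diff = mean(Diff),
    sd_diff = sd(Diff)
  )

tidy_summary$average_diff    # 45 40

# base R cross-check: apply mean() to Diff within each SEX
base_means <- tapply(toy$Diff, toy$SEX, mean)
as.numeric(base_means)       # 45 40
```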
Grouping by Multiple Variables
Explanation:
• Multiple Grouping Variables: Allows for more granular analysis.
• Nested Groups: The data is first grouped by SEX, then by DIS within each SEX.
# Calculate average and standard deviation by SEX and DIS
summary_by_sex_dis <- dataset_sbp %>%
  group_by(SEX, DIS) %>%
  summarise(
    average_diff = mean(Diff_SBP_DBP, na.rm = TRUE),
    sd_diff = sd(Diff_SBP_DBP, na.rm = TRUE)
  )
# View the summary
print(summary_by_sex_dis)
## # A tibble: 8 × 4
## # Groups: SEX [2]
## SEX DIS average_diff sd_diff
## <int> <int> <dbl> <dbl>
## 1 1 1 51.4 11.2
## 2 1 2 49.9 10.6
## 3 1 3 47.4 9.72
## 4 1 4 45.6 8.59
## 5 2 1 53.1 12.0
## 6 2 2 51.1 11.3
## 7 2 3 47.9 10.5
## 8 2 4 43.5 8.96
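The line "# Groups: SEX [2]" in the output above shows that the result is still grouped by SEX: by default, summarise() removes only the last grouping variable. The sketch below, on made-up data and assuming dplyr is installed, shows that default next to the .groups = "drop" argument, which returns an ungrouped result.

```r
library(dplyr)

# Made-up data with two grouping variables
toy <- data.frame(
  SEX  = c(1, 1, 1, 2, 2, 2),
  DIS  = c(1, 1, 2, 1, 2, 2),
  Diff = c(50, 54, 46, 52, 44, 40)
)

# Default: the result stays grouped by the first variable, SEX
default_result <- toy %>%
  group_by(SEX, DIS) %>%
  summarise(average_diff = mean(Diff))
group_vars(default_result)        # "SEX"

# .groups = "drop" removes all grouping from the result
by_sex_dis <- toy %>%
  group_by(SEX, DIS) %>%
  summarise(average_diff = mean(Diff), .groups = "drop")
group_vars(by_sex_dis)            # character(0)

by_sex_dis$average_diff           # 52 46 52 42
```

Dropping the groups is a matter of preference here; it mainly avoids surprises when later steps such as mutate() or filter() would otherwise be applied within each SEX group.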
Conclusion
In this lecture, we covered:
• Basic R Operations: Variable assignment, comparisons, and vector indexing.
• Data Frames: Creation and element access.
• Data Manipulation with tidyverse:
• Reading data from files.
• Filtering data using conditions.
• Adding new variables with mutate().
• Grouping and summarizing data with group_by() and summarise().
How to learn basic R (optional)
swirl()
swirl teaches you R programming and data science interactively, at your own pace, and right in the R console!
install.packages("swirl")
##
## The downloaded binary packages are in
## /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//Rtmp2fNQa4/downloaded_packages
library(swirl)
Don't go too far with it, though: it would do almost half of my job of teaching (bio)stats.