Before we begin
Basics of R
Get the current path of the R working directory
getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics"
Listing files in the current directory
list.files()
## [1] "BTE3207_Advanced_Biostatistics.Rproj"
## [2] "COD_week1_2_MGK_BTE3207.html"
## [3] "COD_week1_2_MGK_BTE3207.R"
## [4] "COD_week1_2_MGK_BTE3207.Rmd"
## [5] "COD_week2_1_MGK_BTE3207.html"
## [6] "COD_week2_1_MGK_BTE3207.Rmd"
## [7] "COD_week2_2_MGK_BTE3207.html"
## [8] "COD_week2_2_MGK_BTE3207.Rmd"
## [9] "dataset"
## [10] "README.md"
## [11] "rsconnect"
Changing directory
Use the Tab key for navigating folders!
setwd("dataset/")
getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset"
Now you are in the dataset folder!
Going back to the original folder
To go back up to the parent folder, use "..":
setwd("..")
getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts"
Now you have moved back up to the parent folder.
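Because setwd() changes the working directory for the rest of the session, it can help to remember where you started. The following is a minimal sketch using only base R: setwd() invisibly returns the previous directory, so it can be saved and restored later. Here tempdir() is only a stand-in destination that always exists, and old_wd is a name chosen for illustration.

```r
# setwd() invisibly returns the directory you were in before the change
old_wd <- setwd(tempdir())   # tempdir() is just a stand-in destination
getwd()                      # now inside the temporary folder

# Restore the directory you started from
setwd(old_wd)
getwd()
```

Saving the old directory this way avoids having to count how many ".." steps are needed to get back.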
Logical values
a = 1
a
## [1] 1
= does the same thing as <- for assignment.
To test whether two things are the same, R uses ==.
a == 1
## [1] TRUE
As we assigned a = 1 in the previous code chunk, this test results in TRUE.
"a" == 1
## [1] FALSE
This tests whether the character "a" is the same as the numeric value 1. As they are not the same, it returns FALSE.
Here, the results TRUE and FALSE are logical values, which are one type of binary variable. The comparison works for longer vectors or variables as well.
c(1, 2, 3, 4, 5) == c(1, 2, 2, 4, 5)
## [1] TRUE TRUE FALSE TRUE TRUE
c(1, 2, 3, 4, 5) == 1
## [1] TRUE FALSE FALSE FALSE FALSE
The result is a vector holding the outcome of each logical test. Using this, we can filter data easily!
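As a small sketch of that filtering idea: a logical vector placed inside square brackets (covered in the next section) keeps only the positions that are TRUE. The values and the names sbp and is_high below are made up for illustration and are not taken from the course dataset.

```r
# Hypothetical example values (not from the course dataset)
sbp <- c(110, 125, 131, 118, 140)

# The comparison returns one TRUE/FALSE per element
is_high <- sbp > 120
is_high

# Using the logical vector inside [] keeps only the TRUE positions
sbp[is_high]

# sum() counts the TRUE values, because TRUE counts as 1 and FALSE as 0
sum(is_high)
```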
How to select values
To select values from a vector, we use [] (square brackets).
a <- c(1, 2, 3)
a[1]
## [1] 1
a[1] returns the first element of this vector.
Things are slightly different for data with names.
names(a) <- c("first", "second", "third")
str(a)
## Named num [1:3] 1 2 3
## - attr(*, "names")= chr [1:3] "first" "second" "third"
Now the vector a is a named numeric vector.
In this case,
a[1]
## first
## 1
The result shows both the name and the value!
This sometimes causes problems when calculating with the data.
In that case, we need to use double brackets, [[]].
a[[1]]
## [1] 1
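Named vectors can also be indexed by name instead of by position. The sketch below reuses the vector a from the text and shows the single-bracket and double-bracket forms side by side, plus unname(), a base R function that drops all names at once.

```r
# Same named vector as in the text
a <- c(1, 2, 3)
names(a) <- c("first", "second", "third")

a["second"]     # single brackets: the name is kept
a[["second"]]   # double brackets: value only
unname(a)       # drops all names at once
```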
By the way, a sequence of consecutive numbers can be generated with a colon (:).
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
5:20
## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
As the colon produces a vector (multiple elements), both lines of code below work the same way.
a[c(1, 2, 3)]
## first second third
## 1 2 3
a[1:3]
## first second third
## 1 2 3
For selecting data in data frames, it works the same way, but rows and columns are separated by a comma. Here is an example.
dataframe_example <- data.frame(Joe = 1:100,
                                Trump = sample(1:1000, 100),
                                Obama = sample(1:1000, 100),
                                George = sample(1:1000, 100))
head(dataframe_example)
## Joe Trump Obama George
## 1 1 922 208 239
## 2 2 519 909 3
## 3 3 942 214 858
## 4 4 432 137 198
## 5 5 202 724 668
## 6 6 763 400 79
This data frame has no real meaning: the Trump, Obama, and George columns each hold 100 numbers randomly sampled from the numbers 1 to 1000. The exception is the Joe column, which holds the ordered numbers from 1 to 100.
To select some data, we again use row and column numbers, but we separate the two with a comma (,).
Selecting the 1st row and 1st column
dataframe_example[1,1]
## [1] 1
Selecting multiple rows in column 1
dataframe_example[1:10, 1]
## [1] 1 2 3 4 5 6 7 8 9 10
Selecting multiple rows and columns
dataframe_example[3:5, 1:2]
## Joe Trump
## 3 3 942
## 4 4 432
## 5 5 202
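Rows and columns can also be selected by column name or by a logical condition, which ties back to the logical vectors introduced earlier. This is a minimal sketch on a small made-up data frame; the name scores and its values are hypothetical and only for illustration.

```r
# Small hypothetical data frame, only for illustration
scores <- data.frame(id = 1:5,
                     score = c(10, 35, 20, 50, 5))

scores[, "score"]             # one column by name, returned as a vector
scores$score                  # the same column using $
scores[scores$score > 15, ]   # rows where the condition is TRUE, all columns
```

Leaving the position after the comma empty means "all columns"; leaving the position before it empty means "all rows".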
How to install more functions in R
We use the install.packages() function to install packages from the CRAN (Comprehensive R Archive Network) server.
install.packages("tidyverse")
##
## The downloaded binary packages are in
## /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//RtmpsYsGtn/downloaded_packages
However, the package you just installed is on your computer (somewhere in a folder called a library), but it is not yet loaded into the R session (simply put, you have not opened it).
To use an installed package, you need to load it with the library() function.
library(tidyverse)
Now you can use the tidyverse package!
The tidyverse package helps you write code and summarize results.
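Installing only needs to happen once per computer, while loading happens in every session. One common pattern, sketched here under the assumption that you want to skip re-downloading, is to check whether a package is already available with base R's requireNamespace() before installing. The helper name is_installed is hypothetical.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # the stats package ships with R, so this is TRUE

# Install only when the package is missing, then load it
# (commented out so that running this sketch does not trigger a download)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
# library(tidyverse)
```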
Once a package is installed, you can access functions in that package using two colons (::).
tidyverse::tidyverse_conflicts()
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
As we installed and loaded a new package, there can be conflicts between function names. tidyverse_conflicts() shows the list of those conflicts.
Developers work independently, and CRAN has no system for controlling the names of newly developed functions. That means a function from a package you just installed can have the same name as a function from another package!
dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")
head(dplyr::filter(dataset_sbp, SEX == 1))
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 116 78 94 4 16.6
## 2 1 1 100 60 79 4 22.3
## 3 1 1 100 60 87 4 21.9
## 4 1 1 111 70 72 4 20.2
## 5 1 1 120 80 98 4 20.0
## 6 1 1 115 79 95 4 23.1
head(stats::filter(dataset_sbp$SEX, rep(1,3)))
## [1] NA 3 3 3 3 3
dplyr::filter() keeps the rows that satisfy the given condition, SEX == 1 (doing the same thing as subset()).
However, stats::filter() (from the stats package, which comes with base R) does a different thing. It applies linear filtering to a univariate time series or to each series separately of a multivariate time series.
The best practice is to write every function name with its package and ::. But generally you don't have to, as such conflicts are not that common a problem.
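To make the difference concrete, here is a small base R sketch on made-up numbers. With weights rep(1, 3), stats::filter() returns a moving sum of each value and its two neighbours, which is why the SEX column (all 1s in its first rows) produced NA followed by 3s above. The objects x, moving_sum, and df are hypothetical names for illustration.

```r
x <- c(1, 1, 1, 2, 2)

# stats::filter() with weights c(1, 1, 1): each output is the sum of a value
# and its two neighbours; the ends are NA because a neighbour is missing
moving_sum <- stats::filter(x, rep(1, 3))
as.numeric(moving_sum)

# Row filtering in base R, the task dplyr::filter() is used for in the text
df <- data.frame(x = x, id = 1:5)   # hypothetical toy data
subset(df, x == 1)
```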
Basic tidyverse
The tidyverse helps you write code efficiently. But how?
Let's look at an example. We want to filter the samples based on some condition, and we have multiple conditions.
head(filter(dataset_sbp, SEX == 1 & SBP > 120))
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 130 80 90 4 28.4
## 2 1 1 130 80 92 4 27.8
## 3 1 1 130 80 87 4 23.5
## 4 1 1 124 75 92 4 21.7
## 5 1 1 139 89 86 4 21.7
## 6 1 1 138 86 95 4 27.2
This call filtered on multiple conditions at once, SEX == 1 and SBP > 120. But how would we chain several steps, such as a computation followed by filtering on further conditions? One option is to nest function calls:
head(filter(filter(filter(dataset_sbp, SEX == 1), SBP > 120), FBS > 110))
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 135 85 114 4 31.4
## 2 1 1 130 80 130 4 27.5
## 3 1 1 160 95 113 4 38.3
## 4 1 1 130 70 111 4 18.4
## 5 1 1 130 80 114 2 24.0
## 6 1 1 150 100 123 4 27.8
This code filtered on SEX == 1 and SBP > 120 as before, plus FBS > 110, by nesting filter() calls, and the head() function is wrapped around the outside again.
The following code does the same thing.
dataset_sbp %>%
  filter(SEX == 1) %>%
  filter(SBP > 120) %>%
  filter(FBS > 110) %>%
  head()
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 135 85 114 4 31.4
## 2 1 1 130 80 130 4 27.5
## 3 1 1 160 95 113 4 38.3
## 4 1 1 130 70 111 4 18.4
## 5 1 1 130 80 114 2 24.0
## 6 1 1 150 100 123 4 27.8
But how are we going to do some computation first, and then summarize or filter based on some conditions?
Let's look at another example.
We can try adding multiple lines of code to do this. Let's say we are interested in the difference between SBP and DBP. We then want to group the data by sex and summarize each group by its mean and standard deviation.
dataset_sbp$Diff_SBP_DBP <- dataset_sbp$SBP - dataset_sbp$DBP
dataset_sbp_male <- filter(dataset_sbp, SEX == 1)
dataset_sbp_female <- filter(dataset_sbp, SEX == 2)
avg_male <- mean(dataset_sbp_male$Diff_SBP_DBP)
avg_female <- mean(dataset_sbp_female$Diff_SBP_DBP)
sd_male <- sd(dataset_sbp_male$Diff_SBP_DBP)
sd_female <- sd(dataset_sbp_female$Diff_SBP_DBP)
data.frame(SEX = c(1, 2),
           average_by_group = c(avg_male, avg_female),
           sd_by_group = c(sd_male, sd_female))
## SEX average_by_group sd_by_group
## 1 1 46.66498 9.35818
## 2 2 45.47853 10.20447
We did it! However, the code is quite messy, and we have generated unnecessary intermediate data frames as well. Isn't there a smarter way?
Piping
The good news is that the tidyverse has a great feature called piping. In base R, if we do not assign values with <-, the computer will just show the result and will not store the output.
Piping lets us use that output temporarily by passing it on to the next function with %>%.
dataset_sbp %>% head()
## SEX BTH_G SBP DBP FBS DIS BMI Diff_SBP_DBP
## 1 1 1 116 78 94 4 16.6 38
## 2 1 1 100 60 79 4 22.3 40
## 3 1 1 100 60 87 4 21.9 40
## 4 1 1 111 70 72 4 20.2 41
## 5 1 1 120 80 98 4 20.0 40
## 6 1 1 115 79 95 4 23.1 36
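The idea behind piping can be sketched without any packages. Base R (version 4.1 and later) ships a native pipe, |>, which behaves like %>% for simple cases such as the one below; this sketch assumes such an R version, and the vector values is made up for illustration.

```r
values <- c(1, 4, 9)

# Nested call: read from the inside out
sum(sqrt(values))

# Piped call: read from left to right, same result
values |> sqrt() |> sum()
```

In both forms the output of one function becomes the first argument of the next; the pipe only changes the order in which you read it.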
Within a pipe, the piped data can be referred to with a dot (.).
dataset_sbp %>% .$SEX %>% head()
## [1] 1 1 1 1 1 1
The data is passed on to the next function, where it is used for the calculation.
dataset_sbp %>% mutate(Diff_SBP_DBP = SBP - DBP) %>% head()
## SEX BTH_G SBP DBP FBS DIS BMI Diff_SBP_DBP
## 1 1 1 116 78 94 4 16.6 38
## 2 1 1 100 60 79 4 22.3 40
## 3 1 1 100 60 87 4 21.9 40
## 4 1 1 111 70 72 4 20.2 41
## 5 1 1 120 80 98 4 20.0 40
## 6 1 1 115 79 95 4 23.1 36
See? Here, mutate() is a tidyverse function for calculating a new variable.
Let's do the same group summary as before, this time with the tidyverse.
dataset_sbp %>%
  mutate(Diff_SBP_DBP = SBP - DBP) %>%
  group_by(SEX) %>%
  summarise(average_by_group = mean(Diff_SBP_DBP),
            sd_by_group = sd(Diff_SBP_DBP))
## # A tibble: 2 × 3
## SEX average_by_group sd_by_group
## <int> <dbl> <dbl>
## 1 1 46.7 9.36
## 2 2 45.5 10.2
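For comparison, a rough base R counterpart of the group-then-summarize pattern is aggregate(). The sketch below uses made-up toy data rather than the course dataset, and the names toy and group_means are hypothetical; it is meant only to show what "one summary row per group" looks like.

```r
# Hypothetical toy data, not the course dataset
toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(1, 3, 10, 20))

# One row per group, holding the mean of value within that group
group_means <- aggregate(value ~ group, data = toy, FUN = mean)
group_means
```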
Try to use piping as much as you can, and use plenty of line breaks and indentation. It will make your code look tidier.
This kind of application is also possible (I did not write the non-piped version, as it would take too much of my time).
dataset_sbp %>%
  group_by(SEX, DIS) %>%
  summarise(average = mean(Diff_SBP_DBP),
            sd = sd(Diff_SBP_DBP))
## # A tibble: 8 × 4
## # Groups: SEX [2]
## SEX DIS average sd
## <int> <int> <dbl> <dbl>
## 1 1 1 51.4 11.2
## 2 1 2 49.9 10.6
## 3 1 3 47.4 9.72
## 4 1 4 45.6 8.59
## 5 2 1 53.1 12.0
## 6 2 2 51.1 11.3
## 7 2 3 47.9 10.5
## 8 2 4 43.5 8.96
How to learn basic R (optional)
swirl()
swirl teaches you R programming and data science interactively, at your own pace, and right in the R console!
install.packages("swirl")
##
## The downloaded binary packages are in
## /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//RtmpsYsGtn/downloaded_packages
library(swirl)
Don't go too far, though: it would do almost half of my job, teaching (bio)stats.