knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
Previous sections:
Note: This analysis is used for my own study purpose. In this section, I will summerize some basic commands for data wrangling/ cleaning in R environment based on several online sources.
R Packages:
# More packages will be shown during the analysis process
# Load the required library
library(tidyverse) # Data Wrangling
library(conflicted) # Dealing with conflict package
library(readxl) # Read csv file
Dealing with Conflicts
There is a lot of packages here, and sometimes individual functions are
in conflict. R’s default conflict resolution system gives precedence to
the most recently loaded package. This can make it hard to detect
conflicts, particularly when introduced by an update to an existing
package.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")
Main data used in this notes: fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)
The fakestroke.csv file contains the following 18 variables for 500 patients:
# To make a table in R markdown: 1st row: header, 2nd row: Alignment; the remaining row: for content
## |Variable | Description |
## |:------- | :---------- |
## |studyid | Study ID # (z001 through z500) |
| Variable | Type | Description |
|---|---|---|
| studyid | Category | Study ID # (z001 through z500) |
| trt | Category | Treatment group (Intervention or Control) |
| age | Numeric | Age in years |
| sex | Category | Male or Female |
| nihss | Numeric | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location | Category | Stroke Location - Left or Right Hemisphere |
| hx.isch | Category | History of Ischemic Stroke (Yes/No) |
| afib | Category | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm | Category | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin | Category | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
| sbp | Numeric | Systolic blood pressure, in mm Hg |
| iv.altep | Category | Treatment with IV alteplase (Yes/No) |
| time.iv | Numeric | Time from stroke onset to start of IV alteplase (minutes) if iv.altep=Yes |
| aspects | Category | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Category | Intracranial arterial occlusion, based on vessel imaging - five categories |
| extra.ica | Category | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Numeric | Time from stroke onset to study randomization, in minutes |
| time.punc | Numeric | Time from stroke onset to groin puncture, in minutes (only if Intervention) |
4 things to have for data analysis:
Raw data (sometimes called source data, atomic data or primary data) is data that has not been processed. Raw data that has undergone processing is sometimes referred to as cooked data.
Although raw data has the potential to become “information,” it requires selective extraction, organization and sometimes analysis and formatting for presentation. Because of processing, raw data sometimes ends up in a database, which enables the data to become accessible for further processing and analysis in a number of different ways.
In many instances, users must clean raw data before it can be used. Cleaning raw data may require parsing the data for easier ingestion into a computer, removing outliers or spurious results and, occasionally, reformatting or translating the data.
There are different levels of raw data:
It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected.
There are three interrelated rules which make a dataset tidy:
# 
It provides:
It is a computer script that helps you to go back to the raw data and reprocess it and get the same tidy data set. If that doesn’t happen, then there’s something going wrong in your data processing pipeline, and so, you want to be able to identify that and fix it.
Ideally a script in R, but can be in Python, as well.
The input for the script is the, raw data.
The output is going to be the processed, tidy data.
What R version are you using? There are frequently real flaws in earlier or latest versions of R or R packages, and your problem may be solved if you simply upgrade to the most recent version. It’s crucial to note if you’re using the most recent version of R. Therefore, it is important to know your R system and the information of packages.
Command: sessionInfo()
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.4.1 conflicted_1.1.0 forcats_0.5.2 stringr_1.4.1
## [5] dplyr_1.0.10 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [9] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.32 bslib_0.4.0
## [4] haven_2.5.1 gargle_1.2.0 colorspace_2.0-3
## [7] vctrs_0.4.1 generics_0.1.3 htmltools_0.5.3
## [10] yaml_2.3.5 utf8_1.2.2 rlang_1.0.5
## [13] jquerylib_0.1.4 pillar_1.8.1 withr_2.5.0
## [16] glue_1.6.2 DBI_1.1.3 dbplyr_2.2.1
## [19] modelr_0.1.9 lifecycle_1.0.1 munsell_0.5.0
## [22] gtable_0.3.1 cellranger_1.1.0 rvest_1.0.3
## [25] memoise_2.0.1 evaluate_0.16 knitr_1.40
## [28] tzdb_0.3.0 fastmap_1.1.0 fansi_1.0.3
## [31] broom_1.0.1 backports_1.4.1 scales_1.2.1
## [34] googlesheets4_1.0.1 cachem_1.0.6 jsonlite_1.8.0
## [37] fs_1.5.2 hms_1.1.2 digest_0.6.29
## [40] stringi_1.7.8 grid_4.2.1 cli_3.3.0
## [43] tools_4.2.1 magrittr_2.0.3 sass_0.4.2
## [46] crayon_1.5.1 pkgconfig_2.0.3 ellipsis_0.3.2
## [49] xml2_1.3.3 reprex_2.0.2 googledrive_2.0.0
## [52] lubridate_1.8.0 assertthat_0.2.1 rmarkdown_2.16
## [55] httr_1.4.4 rstudioapi_0.14 R6_2.5.1
## [58] compiler_4.2.1
Command: devtools::session_info(c(“package name”))
devtools::session_info(c("ggplot2", "broom"))
## - Session info ---------------------------------------------------------------
## setting value
## version R version 4.2.1 (2022-06-23 ucrt)
## os Windows 10 x64 (build 17134)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## ctype English_United States.1252
## tz Asia/Bangkok
## date 2022-11-10
## pandoc 2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
##
## - Packages -------------------------------------------------------------------
## package * version date (UTC) lib source
## backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
## broom 1.0.1 2022-08-29 [1] CRAN (R 4.2.1)
## cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.1)
## colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.1)
## cpp11 0.4.2 2021-11-30 [1] CRAN (R 4.2.1)
## digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.1)
## dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.1)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
## fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.1)
## farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.1)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.1)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.1)
## gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.1)
## isoband 0.2.5 2021-07-13 [1] CRAN (R 4.2.1)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.1)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
## MASS 7.3-57 2022-04-22 [2] CRAN (R 4.2.1)
## Matrix 1.4-1 2022-03-23 [2] CRAN (R 4.2.1)
## mgcv 1.8-40 2022-03-29 [2] CRAN (R 4.2.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.1)
## nlme 3.1-157 2022-03-25 [2] CRAN (R 4.2.1)
## pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.1)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.1)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
## rlang 1.0.5 2022-08-31 [1] CRAN (R 4.2.1)
## scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.1)
## stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
## stringr * 1.4.1 2022-08-20 [1] CRAN (R 4.2.1)
## tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.1)
## tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.1)
## tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.1)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.1)
## vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.1)
## viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.1)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.1)
##
## [1] C:/Users/Admin/AppData/Local/R/win-library/4.2
## [2] C:/Program Files/R/R-4.2.1/library
##
## ------------------------------------------------------------------------------
If you know the name of the function you seek support with, simply put a question mark ? at the command line prompt, followed by the function name.
Command: ? function_name()
? ggplot2()
Sometimes you can’t recall the exact name of the function, but you know the topic for which you need assistance (e.g. data input in this case). Use the help.search function (no question mark) with your query enclosed in double quotes, as shown below:
Command: help.search(“data input”)
# Instead of ggplot2, search gg:
help.search("gg")
The find function tells you what package something is in:
Command: find(“object name”)
find("glm")
## [1] "package:stats"
find("lm")
## [1] "package:stats"
apropos delivers a character vector containing the names of all objects in the search list that correspond to your (possibly incomplete) query:
Command: apropos(object name )
apropos("glm")
## [1] "glm" "glm.control" "glm.fit" "predict.glm"
## [5] "residuals.glm" "summary.glm"
apropos("lm")
## [1] ".colMeans" ".lm.fit" "colMeans" "confint.lm"
## [5] "contr.helmert" "dummy.coef.lm" "glm" "glm.control"
## [9] "glm.fit" "KalmanForecast" "KalmanLike" "KalmanRun"
## [13] "KalmanSmooth" "kappa.lm" "lm" "lm.fit"
## [17] "lm.influence" "lm.wfit" "lmap" "lmap_at"
## [21] "lmap_if" "model.matrix.lm" "nlm" "nlminb"
## [25] "predict.glm" "predict.lm" "residuals.glm" "residuals.lm"
## [29] "summary.glm" "summary.lm"
Install packages in two ways: through the menus or using a command.
Command: install.packages(“package.name”)
Example: > install.packages(“ggplot2”)
You need to install the package only once but you need to reference
it each time you start a new session of R. To reference a package, we
simply execute this general command:
library(package.name)
Example: > library(ggplot2)
If we have saved the data as a CSV file, then we can import these data to a dataframe using the read.csv function. Or we can replace read.csv to read.delim or read_excel. Besides, we can import directly through R markdown. The general form of this function is:
Command: df = read.csv(“filename.extension”, header = TRUE)
For example: I will using R markdown to import the dataset df:
df <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
col_types = c("text", "text", "numeric",
"text", "numeric", "text", "text",
"numeric", "numeric", "text", "numeric",
"text", "numeric", "numeric", "text",
"numeric", "numeric", "numeric"))
This command is useful when we want to check the format of the dataset or variable within dataset before analysing:
| Command | Description |
|---|---|
| is.data.frame() | whether the target is a dataframe. |
| is.tibble() | whether the target format is tibble. |
| is.na() | whether the object is NA. |
| is.factor() | whether the object’s class is factor. |
| is.numeric() | whether the object’s class is numeric. |
Example 1: I want to check whether or not the df is a data frame and does it contain NA value ?
is.data.frame(df)
## [1] TRUE
is.tibble(df)
## [1] TRUE
head(is.na(df))
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## iv.altep time.iv aspects ia.occlus extra.ica time.rand time.punc
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Example 2: I want to check whether or not the studyid ; age are factor and numeric, respectively ?
is.factor(df$studyid)
## [1] FALSE
is.numeric(df$age)
## [1] TRUE
Discuss
To be able to use the tidyverse package, the
dataset format should be tibble. df are both dataframe
and tibble.
The is.na() command returns a value of true and
false for each value in a data set. If the value is NA the is.na()
function return the value of true, otherwise, return to a value of
false. It is easy to note that the time.punc variable
contains lots of NA value.
The studyid variable is not a factor. So we
should convert these character variables to factor format before
progressing to analysis stage.
The age variable is numeric variable. => Later, I will show you how to convert all character variables to factor, while maintaining the other variables as numeric variables.
To check the number of variables (column) and observations (row) in the dataset:
Command: dim(dataset)
dim(df)
## [1] 500 18
Discuss
This dataset has 500 observations with 18 variables.
In the previous section, we know that this dataset has 500 observations with 18 variables. So what are these variables:
Command: names(dataset)
names(df)
## [1] "studyid" "trt" "age" "sex" "nihss" "location"
## [7] "hx.isch" "afib" "dm" "mrankin" "sbp" "iv.altep"
## [13] "time.iv" "aspects" "ia.occlus" "extra.ica" "time.rand" "time.punc"
To have some information of every variables in the dataframe:
Command: summary(dataset)
Example 1: Using command with the original dataset:
summary(df)
## studyid trt age sex
## Length:500 Length:500 Min. :23.00 Length:500
## Class :character Class :character 1st Qu.:55.00 Class :character
## Mode :character Mode :character Median :65.75 Mode :character
## Mean :64.71
## 3rd Qu.:76.00
## Max. :96.00
##
## nihss location hx.isch afib
## Min. :10.00 Length:500 Length:500 Min. :0.00
## 1st Qu.:14.00 Class :character Class :character 1st Qu.:0.00
## Median :18.00 Mode :character Mode :character Median :0.00
## Mean :18.03 Mean :0.27
## 3rd Qu.:22.00 3rd Qu.:1.00
## Max. :28.00 Max. :1.00
##
## dm mrankin sbp iv.altep
## Min. :0.000 Length:500 Min. : 78.0 Length:500
## 1st Qu.:0.000 Class :character 1st Qu.:128.5 Class :character
## Median :0.000 Mode :character Median :145.0 Mode :character
## Mean :0.126 Mean :145.5
## 3rd Qu.:0.000 3rd Qu.:162.5
## Max. :1.000 Max. :231.0
## NA's :1
## time.iv aspects ia.occlus extra.ica
## Min. : 42.00 Min. : 5.000 Length:500 Min. :0.0000
## 1st Qu.: 67.00 1st Qu.: 7.000 Class :character 1st Qu.:0.0000
## Median : 86.00 Median : 9.000 Mode :character Median :0.0000
## Mean : 92.64 Mean : 8.506 Mean :0.2906
## 3rd Qu.:115.00 3rd Qu.:10.000 3rd Qu.:1.0000
## Max. :218.00 Max. :10.000 Max. :1.0000
## NA's :55 NA's :4 NA's :1
## time.rand time.punc
## Min. :100.0 Min. :180
## 1st Qu.:151.2 1st Qu.:212
## Median :201.5 Median :260
## Mean :208.6 Mean :263
## 3rd Qu.:257.8 3rd Qu.:313
## Max. :360.0 Max. :360
## NA's :2 NA's :267
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : chr [1:500] "z001" "z002" "z003" "z004" ...
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
Discuss
There are 2 problems with this data:
Example 2: Using the same dataset, but solving the 2 problems mentioned above (The way to do this will be shown later in this practice):
data = df
# Changing coding variable to categorical:
data$afib <- factor(data$afib , levels=0:1, labels=c("No", "Yes"))
data$dm <- factor(data$dm , levels=0:1, labels=c("No", "Yes"))
data$extra.ica <- factor(data$extra.ica, levels=0:1, labels=c("No", "Yes"))
# Convert character to categorical variables:
data <- data %>%mutate_if(is.character, as.factor)
# Relevel factors:
data$trt = factor(data$trt, levels=c("Intervention", "Control"))
data$mrankin = factor(data$mrankin, levels=c("0", "1", "2", "> 2"))
data$ia.occlus = factor(data$ia.occlus, levels=c("Intracranial ICA", "ICA with M1",
"M1", "M2", "A1 or A2"))
# Test the commands:
summary(data)
## studyid trt age sex nihss
## z001 : 1 Intervention:233 Min. :23.00 Female:208 Min. :10.00
## z002 : 1 Control :267 1st Qu.:55.00 Male :292 1st Qu.:14.00
## z003 : 1 Median :65.75 Median :18.00
## z004 : 1 Mean :64.71 Mean :18.03
## z005 : 1 3rd Qu.:76.00 3rd Qu.:22.00
## z006 : 1 Max. :96.00 Max. :28.00
## (Other):494
## location hx.isch afib dm mrankin sbp iv.altep
## Left :269 No :446 No :365 No :437 0 :404 Min. : 78.0 No : 55
## Right:231 Yes: 54 Yes:135 Yes: 63 1 : 50 1st Qu.:128.5 Yes:445
## 2 : 25 Median :145.0
## > 2: 21 Mean :145.5
## 3rd Qu.:162.5
## Max. :231.0
## NA's :1
## time.iv aspects ia.occlus extra.ica
## Min. : 42.00 Min. : 5.000 Intracranial ICA: 4 No :354
## 1st Qu.: 67.00 1st Qu.: 7.000 ICA with M1 :134 Yes :145
## Median : 86.00 Median : 9.000 M1 :319 NA's: 1
## Mean : 92.64 Mean : 8.506 M2 : 39
## 3rd Qu.:115.00 3rd Qu.:10.000 A1 or A2 : 3
## Max. :218.00 Max. :10.000 NA's : 1
## NA's :55 NA's :4
## time.rand time.punc
## Min. :100.0 Min. :180
## 1st Qu.:151.2 1st Qu.:212
## Median :201.5 Median :260
## Mean :208.6 Mean :263
## 3rd Qu.:257.8 3rd Qu.:313
## Max. :360.0 Max. :360
## NA's :2 NA's :267
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ trt : Factor w/ 2 levels "Intervention",..: 2 1 2 2 2 2 1 2 2 1 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
Discuss
Note: From now to the end of this practicle, let clarify the 2 datasets that I will use:
df: The raw dataset
data: The tidy dataset
These 2 commands help to view the top / bottom n rows of the dataset. By default, n=6 :
Commands:
head(data, n = 3)
## # A tibble: 3 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z002 Intervent~ 51 Male 23 Left No Yes No 0 137
## 3 z003 Control 68 Fema~ 11 Right No No No 0 138
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## # abbreviated variable name 1: location
tail(data, n=3)
## # A tibble: 3 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z498 Intervent~ 61 Fema~ 21 Right No Yes No 1 121
## 2 z499 Intervent~ 23 Male 23 Left Yes Yes No 0 123
## 3 z500 Control 66 Male 24 Left No Yes No 0 190
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## # abbreviated variable name 1: location
Detail about table Descriptive Analysis: https://rpubs.com/minhtri/929114
Our goal, then, is to take the data in and use it to generate a Table for the study that compares the 233 patients in the Intervention group to the 267 patients in the Control group, on all of the other variables (except study ID #) available. I’ll use the compareGroups package of functions available in R to help me complete this task.
Package: compareGroups
Command: createTable (compareGroups (group ~ var1 + var2 + ….,
data = … , method = c ( var = NA)))
1: forces analysis as normal-distributed
2: forces analysis as continuous non-normal
3: forces analysis as categorical
NA: performs a Shapiro-Wilks test to decide between
normal or non-normal (preferable). If the method argument is stated as
NA for a variable, then a Shapiro-Wilk test for normality is used to
decide if the variable is normal or non-normal distributed.
Example 1: Becasue I don’t know the distribution of the variables in the data, I will use argument method = c(var = NA). With this argument, the program will decide which mean or median should be shown based on the data distribution in each variable groups:
library(compareGroups)
OR = createTable( compareGroups( trt ~ age + sex + nihss + location + hx.isch + afib + dm +
mrankin + sbp + iv.altep + time.iv + aspects + ia.occlus +
extra.ica + time.rand + time.punc,
data = data,
method = c(age=NA, nihss=NA, sbp = NA, time.iv=NA, aspects=NA,
time.rand =NA, time.punc =NA)))
OR
##
## --------Summary descriptives table by 'trt'---------
##
## ________________________________________________________________
## Intervention Control p.overall
## N=233 N=267
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## age 65.8 [54.5;76.0] 65.7 [55.8;76.2] 0.579
## sex: 0.917
## Female 98 (42.1%) 110 (41.2%)
## Male 135 (57.9%) 157 (58.8%)
## nihss 17.0 [14.0;21.0] 18.0 [14.0;22.0] 0.453
## location: 0.111
## Left 116 (49.8%) 153 (57.3%)
## Right 117 (50.2%) 114 (42.7%)
## hx.isch: 0.335
## No 204 (87.6%) 242 (90.6%)
## Yes 29 (12.4%) 25 (9.36%)
## afib: 0.601
## No 167 (71.7%) 198 (74.2%)
## Yes 66 (28.3%) 69 (25.8%)
## dm: 1.000
## No 204 (87.6%) 233 (87.3%)
## Yes 29 (12.4%) 34 (12.7%)
## mrankin: 0.922
## 0 190 (81.5%) 214 (80.1%)
## 1 21 (9.01%) 29 (10.9%)
## 2 12 (5.15%) 13 (4.87%)
## > 2 10 (4.29%) 11 (4.12%)
## sbp 146 (26.0) 145 (24.4) 0.649
## iv.altep: 0.267
## No 30 (12.9%) 25 (9.36%)
## Yes 203 (87.1%) 242 (90.6%)
## time.iv 85.0 [67.0;110] 87.0 [65.0;116] 0.596
## aspects 9.00 [7.00;10.0] 9.00 [8.00;10.0] 0.075
## ia.occlus: 0.819
## Intracranial ICA 1 (0.43%) 3 (1.13%)
## ICA with M1 59 (25.3%) 75 (28.2%)
## M1 154 (66.1%) 165 (62.0%)
## M2 18 (7.73%) 21 (7.89%)
## A1 or A2 1 (0.43%) 2 (0.75%)
## extra.ica: 0.179
## No 158 (67.8%) 196 (73.7%)
## Yes 75 (32.2%) 70 (26.3%)
## time.rand 204 [152;250] 196 [149;266] 0.251
## time.punc 260 [212;313] . [.;.] .
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Example 2: Show missingness
missingTable(OR)
##
## --------Missingness table by 'trt'---------
##
## ___________________________________________
## Intervention Control p.overall
## N=233 N=267
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## age 0 (0.00%) 0 (0.00%) .
## sex 0 (0.00%) 0 (0.00%) .
## nihss 0 (0.00%) 0 (0.00%) .
## location 0 (0.00%) 0 (0.00%) .
## hx.isch 0 (0.00%) 0 (0.00%) .
## afib 0 (0.00%) 0 (0.00%) .
## dm 0 (0.00%) 0 (0.00%) .
## mrankin 0 (0.00%) 0 (0.00%) .
## sbp 0 (0.00%) 1 (0.37%) 1.000
## iv.altep 0 (0.00%) 0 (0.00%) .
## time.iv 30 (12.9%) 25 (9.36%) 0.267
## aspects 0 (0.00%) 4 (1.50%) 0.127
## ia.occlus 0 (0.00%) 1 (0.37%) 1.000
## extra.ica 0 (0.00%) 1 (0.37%) 1.000
## time.rand 2 (0.86%) 0 (0.00%) 0.217
## time.punc 0 (0.00%) 267 (100%) <0.001
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Command: table1( ~ var1 + var2 + … | group, data = …)
Using built-in styles for table1():
Adjust the components of table:
library(table1)
table1( ~ age + sex + nihss + location + hx.isch + afib + dm + mrankin +
sbp + iv.altep + time.iv + aspects + ia.occlus + extra.ica + time.rand +
time.punc | trt, data = data, topclass="Rtable1-zebra",
render.continuous = c(.="Mean (SD)", .="Median [Q1, Q3]"),
overall="Total")
| Intervention (N=233) |
Control (N=267) |
Total (N=500) |
|
|---|---|---|---|
| age | |||
| Mean (SD) | 63.9 (18.1) | 65.4 (16.1) | 64.7 (17.1) |
| Median [Q1, Q3] | 65.8 [54.5, 76.0] | 65.7 [55.8, 76.2] | 65.8 [55.0, 76.0] |
| sex | |||
| Female | 98 (42.1%) | 110 (41.2%) | 208 (41.6%) |
| Male | 135 (57.9%) | 157 (58.8%) | 292 (58.4%) |
| nihss | |||
| Mean (SD) | 18.0 (5.04) | 18.1 (4.32) | 18.0 (4.67) |
| Median [Q1, Q3] | 17.0 [14.0, 21.0] | 18.0 [14.0, 22.0] | 18.0 [14.0, 22.0] |
| location | |||
| Left | 116 (49.8%) | 153 (57.3%) | 269 (53.8%) |
| Right | 117 (50.2%) | 114 (42.7%) | 231 (46.2%) |
| hx.isch | |||
| No | 204 (87.6%) | 242 (90.6%) | 446 (89.2%) |
| Yes | 29 (12.4%) | 25 (9.4%) | 54 (10.8%) |
| afib | |||
| No | 167 (71.7%) | 198 (74.2%) | 365 (73.0%) |
| Yes | 66 (28.3%) | 69 (25.8%) | 135 (27.0%) |
| dm | |||
| No | 204 (87.6%) | 233 (87.3%) | 437 (87.4%) |
| Yes | 29 (12.4%) | 34 (12.7%) | 63 (12.6%) |
| mrankin | |||
| 0 | 190 (81.5%) | 214 (80.1%) | 404 (80.8%) |
| 1 | 21 (9.0%) | 29 (10.9%) | 50 (10.0%) |
| 2 | 12 (5.2%) | 13 (4.9%) | 25 (5.0%) |
| > 2 | 10 (4.3%) | 11 (4.1%) | 21 (4.2%) |
| sbp | |||
| Mean (SD) | 146 (26.0) | 145 (24.4) | 145 (25.1) |
| Median [Q1, Q3] | 146 [129, 164] | 145 [128, 161] | 145 [129, 163] |
| Missing | 0 (0%) | 1 (0.4%) | 1 (0.2%) |
| iv.altep | |||
| No | 30 (12.9%) | 25 (9.4%) | 55 (11.0%) |
| Yes | 203 (87.1%) | 242 (90.6%) | 445 (89.0%) |
| time.iv | |||
| Mean (SD) | 98.2 (45.5) | 88.0 (26.0) | 92.6 (36.5) |
| Median [Q1, Q3] | 85.0 [67.0, 110] | 87.0 [65.0, 116] | 86.0 [67.0, 115] |
| Missing | 30 (12.9%) | 25 (9.4%) | 55 (11.0%) |
| aspects | |||
| Mean (SD) | 8.35 (1.64) | 8.65 (1.47) | 8.51 (1.56) |
| Median [Q1, Q3] | 9.00 [7.00, 10.0] | 9.00 [8.00, 10.0] | 9.00 [7.00, 10.0] |
| Missing | 0 (0%) | 4 (1.5%) | 4 (0.8%) |
| ia.occlus | |||
| Intracranial ICA | 1 (0.4%) | 3 (1.1%) | 4 (0.8%) |
| ICA with M1 | 59 (25.3%) | 75 (28.1%) | 134 (26.8%) |
| M1 | 154 (66.1%) | 165 (61.8%) | 319 (63.8%) |
| M2 | 18 (7.7%) | 21 (7.9%) | 39 (7.8%) |
| A1 or A2 | 1 (0.4%) | 2 (0.7%) | 3 (0.6%) |
| Missing | 0 (0%) | 1 (0.4%) | 1 (0.2%) |
| extra.ica | |||
| No | 158 (67.8%) | 196 (73.4%) | 354 (70.8%) |
| Yes | 75 (32.2%) | 70 (26.2%) | 145 (29.0%) |
| Missing | 0 (0%) | 1 (0.4%) | 1 (0.2%) |
| time.rand | |||
| Mean (SD) | 203 (57.3) | 214 (70.3) | 209 (64.8) |
| Median [Q1, Q3] | 204 [152, 250] | 196 [149, 266] | 202 [151, 258] |
| Missing | 2 (0.9%) | 0 (0%) | 2 (0.4%) |
| time.punc | |||
| Mean (SD) | 263 (54.2) | NA | 263 (54.2) |
| Median [Q1, Q3] | 260 [212, 313] | NA | 260 [212, 313] |
| Missing | 0 (0%) | 267 (100%) | 267 (53.4%) |
Summary:
compareGroups()
table1()
Provide full detail Mean-Median for numeric variable; Count-% for
character variable, missing value.
For numeric variable: We must include extra steps to know
the distribution normal or non-normal.
Normal distribution: Mean (SD)
** Non-normal: Median (Q1-Q3)
Do not show the p-value.
The function by allow summary of the dataframe on the basis of factor levels.
Package: pastecs
Command: by(dataframe, categorical variables, FUN = stat.desc,
basic = , norm = )
Example: Summary the data based on sex:
library(pastecs)
by(data, data$sex, stat.desc, basic = F, norm = T)
## data$sex: Female
## studyid trt age sex nihss location hx.isch afib dm
## median NA NA 6.600000e+01 NA 1.700000e+01 NA NA NA NA
## mean NA NA 6.562837e+01 NA 1.790865e+01 NA NA NA NA
## SE.mean NA NA 1.246515e+00 NA 3.334470e-01 NA NA NA NA
## CI.mean NA NA 2.457492e+00 NA 6.573876e-01 NA NA NA NA
## var NA NA 3.231902e+02 NA 2.312688e+01 NA NA NA NA
## std.dev NA NA 1.797749e+01 NA 4.809042e+00 NA NA NA NA
## coef.var NA NA 2.739287e-01 NA 2.685317e-01 NA NA NA NA
## skewness NA NA -3.380714e-01 NA 3.601314e-01 NA NA NA NA
## skew.2SE NA NA -1.002383e+00 NA 1.067792e+00 NA NA NA NA
## kurtosis NA NA -6.166876e-01 NA -9.104475e-01 NA NA NA NA
## kurt.2SE NA NA -9.184789e-01 NA -1.355997e+00 NA NA NA NA
## normtest.W NA NA 9.660446e-01 NA 9.535481e-01 NA NA NA NA
## normtest.p NA NA 6.691984e-05 NA 2.789767e-06 NA NA NA NA
## mrankin sbp iv.altep time.iv aspects ia.occlus
## median NA 145.0000000 NA 9.000000e+01 9.000000e+00 NA
## mean NA 144.6714976 NA 9.562500e+01 8.509615e+00 NA
## SE.mean NA 1.8169069 NA 2.660020e+00 1.060223e-01 NA
## CI.mean NA 3.5821167 NA 5.248250e+00 2.090219e-01 NA
## var NA 683.3381643 NA 1.301930e+03 2.338071e+00 NA
## std.dev NA 26.1407376 NA 3.608226e+01 1.529075e+00 NA
## coef.var NA 0.1806903 NA 3.773308e-01 1.796879e-01 NA
## skewness NA 0.0565986 NA 8.671872e-01 -8.330367e-01 NA
## skew.2SE NA 0.1674170 NA 2.420556e+00 -2.469958e+00 NA
## kurtosis NA -0.2152592 NA 4.601769e-01 -3.927604e-01 NA
## kurt.2SE NA -0.3198477 NA 6.455886e-01 -5.849674e-01 NA
## normtest.W NA 0.9971516 NA 9.388386e-01 8.463962e-01 NA
## normtest.p NA 0.9715080 NA 4.801444e-07 1.450610e-13 NA
## extra.ica time.rand time.punc
## median NA 1.955000e+02 2.530000e+02
## mean NA 2.041068e+02 2.611429e+02
## SE.mean NA 4.121473e+00 5.592442e+00
## CI.mean NA 8.125911e+00 1.109945e+01
## var NA 3.499228e+03 3.064990e+03
## std.dev NA 5.915427e+01 5.536235e+01
## coef.var NA 2.898202e-01 2.120002e-01
## skewness NA 3.619128e-01 1.966935e-01
## skew.2SE NA 1.067975e+00 4.034589e-01
## kurtosis NA -7.925322e-01 -1.467434e+00
## kurt.2SE NA -1.174820e+00 -1.519200e+00
## normtest.W NA 9.685555e-01 9.091972e-01
## normtest.p NA 1.467393e-04 4.785416e-06
## ------------------------------------------------------------
## data$sex: Male
## studyid trt age sex nihss location hx.isch afib dm
## median NA NA 6.500000e+01 NA 1.800000e+01 NA NA NA NA
## mean NA NA 6.404863e+01 NA 1.810959e+01 NA NA NA NA
## SE.mean NA NA 9.577055e-01 NA 2.675487e-01 NA NA NA NA
## CI.mean NA NA 1.884908e+00 NA 5.265759e-01 NA NA NA NA
## var NA NA 2.678224e+02 NA 2.090204e+01 NA NA NA NA
## std.dev NA NA 1.636528e+01 NA 4.571875e+00 NA NA NA NA
## coef.var NA NA 2.555133e-01 NA 2.524560e-01 NA NA NA NA
## skewness NA NA -3.460726e-01 NA 2.139443e-01 NA NA NA NA
## skew.2SE NA NA -1.213298e+00 NA 7.500685e-01 NA NA NA NA
## kurtosis NA NA -2.248394e-01 NA -9.225550e-01 NA NA NA NA
## kurt.2SE NA NA -3.954482e-01 NA -1.622592e+00 NA NA NA NA
## normtest.W NA NA 9.768745e-01 NA 9.658956e-01 NA NA NA NA
## normtest.p NA NA 1.155718e-04 NA 2.184316e-06 NA NA NA NA
## mrankin sbp iv.altep time.iv aspects ia.occlus
## median NA 145.50000000 NA 8.300000e+01 9.000000e+00 NA
## mean NA 146.05136986 NA 9.054023e+01 8.503472e+00 NA
## SE.mean NA 1.42957605 NA 2.275761e+00 9.294175e-02 NA
## CI.mean NA 2.81361946 NA 4.481269e+00 1.829339e-01 NA
## var NA 596.75680224 NA 1.351742e+03 2.487793e+00 NA
## std.dev NA 24.42860623 NA 3.676604e+01 1.577274e+00 NA
## coef.var NA 0.16726037 NA 4.060741e-01 1.854859e-01 NA
## skewness NA 0.04620720 NA 1.289112e+00 -7.841729e-01 NA
## skew.2SE NA 0.16199812 NA 4.275439e+00 -2.730533e+00 NA
## kurtosis NA -0.05936071 NA 1.804211e+00 -5.072399e-01 NA
## kurt.2SE NA -0.10440377 NA 3.003037e+00 -8.861054e-01 NA
## normtest.W NA 0.99667438 NA 8.907261e-01 8.400948e-01 NA
## normtest.p NA 0.80618706 NA 8.833036e-13 1.404705e-16 NA
## extra.ica time.rand time.punc
## median NA 2.055000e+02 2.650000e+02
## mean NA 2.117774e+02 2.643852e+02
## SE.mean NA 4.003036e+00 4.608942e+00
## CI.mean NA 7.878573e+00 9.115683e+00
## var NA 4.679095e+03 2.867716e+03
## std.dev NA 6.840391e+01 5.355106e+01
## coef.var NA 3.229991e-01 2.025494e-01
## skewness NA 3.772407e-01 4.516589e-02
## skew.2SE NA 1.322571e+00 1.082980e-01
## kurtosis NA -9.122173e-01 -1.259536e+00
## kurt.2SE NA -1.604410e+00 -1.520617e+00
## normtest.W NA 9.506005e-01 9.479431e-01
## normtest.p NA 2.359023e-08 5.716403e-05
by(df$age, df$sex, shapiro.test)
## df$sex: Female
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.96604, p-value = 6.692e-05
##
## ------------------------------------------------------------
## df$sex: Male
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.97687, p-value = 0.0001156
Discuss
The Shapiro-Wilk test result between the 2 commands are the same:
normtest.W and normtest.p in 1st command equal to W and p-value in 2nd
command.
This by() commands provide us an overview about the basic characteristics of dataset.
To summerize factors in each variable.
Frequency table: table(variable, useNA = “ifany”)
Command for conditions: table(variable %in% value)
To list the variable for condition above: dataset[variable %in% c(“Condition 1”, “Condition 2”) , ]
OR similar formular dataset [ variable == Condition 1 | variable == Condition 1, ]
Example 1: Frquency for 1 variable:
table(data$trt)
##
## Intervention Control
## 233 267
Example 2: Frquency for 2 variables:
table(data$trt, data$sex)
##
## Female Male
## Intervention 98 135
## Control 110 157
table(data$trt, data$age)
##
## 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## Intervention 3 1 0 1 3 1 1 3 5 1 1 1 1 0 1 2 0 4 1 3 1
## Control 0 1 1 1 0 2 0 0 0 1 3 3 5 3 1 3 2 2 1 0 1
##
## 44 45 46 47 50 51 52 53 54 54.5 55 55.5 56 57 58 59 60 61 62 63
## Intervention 3 4 4 2 2 3 2 1 2 2 5 0 9 9 4 2 6 4 5 5
## Control 2 0 0 0 2 2 2 5 12 1 10 1 7 3 15 8 6 2 2 1
##
## 64 65 65.7 65.8 66 67 68 69 70 71 72 73 74 75 76 76.4 77 78 79
## Intervention 6 2 0 1 6 6 8 5 3 2 7 4 5 5 10 0 2 1 5
## Control 5 17 1 0 8 6 9 9 8 4 5 7 3 6 1 1 0 6 1
##
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## Intervention 0 4 3 1 3 5 1 3 5 4 4 1 2 6 1 2 2
## Control 2 7 4 1 5 3 5 5 3 3 4 3 9 4 1 0 0
Example 3: How many observations have age 75 and age 96 and list them?
table(data$age %in% c("75", "96"))
##
## FALSE TRUE
## 487 13
data[data$age %in% c("75", "96") , ]
## # A tibble: 13 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z007 Interven~ 75 Male 25 Right No No No 0 140
## 2 z009 Control 75 Male 25 Left No Yes No 2 129
## 3 z055 Interven~ 75 Fema~ 25 Right No No No 0 127
## 4 z086 Interven~ 96 Male 14 Right No No Yes 0 139
## 5 z148 Control 75 Male 23 Right No No No 0 231
## 6 z170 Interven~ 75 Male 18 Left No Yes No 0 126
## 7 z180 Control 75 Male 18 Right No No No 0 177
## 8 z343 Interven~ 75 Male 14 Left No No No 0 162
## 9 z344 Control 75 Fema~ 13 Left Yes No Yes 0 125
## 10 z374 Interven~ 96 Male 26 Left No No No 0 124
## 11 z401 Control 75 Male 20 Left No No No 0 176
## 12 z404 Control 75 Fema~ 22 Left No No No 0 154
## 13 z485 Interven~ 75 Male 20 Right No Yes No 0 140
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## # abbreviated variable name 1: location
data [ data$age == 75 | data$age == 96, ]
## # A tibble: 13 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z007 Interven~ 75 Male 25 Right No No No 0 140
## 2 z009 Control 75 Male 25 Left No Yes No 2 129
## 3 z055 Interven~ 75 Fema~ 25 Right No No No 0 127
## 4 z086 Interven~ 96 Male 14 Right No No Yes 0 139
## 5 z148 Control 75 Male 23 Right No No No 0 231
## 6 z170 Interven~ 75 Male 18 Left No Yes No 0 126
## 7 z180 Control 75 Male 18 Right No No No 0 177
## 8 z343 Interven~ 75 Male 14 Left No No No 0 162
## 9 z344 Control 75 Fema~ 13 Left Yes No Yes 0 125
## 10 z374 Interven~ 96 Male 26 Left No No No 0 124
## 11 z401 Control 75 Male 20 Left No No No 0 176
## 12 z404 Control 75 Fema~ 22 Left No No No 0 154
## 13 z485 Interven~ 75 Male 20 Right No Yes No 0 140
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## # abbreviated variable name 1: location
To create the frequency table of 1 or 2 categorical variables:
xtabs( ~ variable 1 + variable 2, data = dataframe)
Example 1: Frquency for 1 variable:
xtabs( ~ data$trt, data = data)
## data$trt
## Intervention Control
## 233 267
Example 2: Frquency for 2 variable:
xtabs( ~ data$trt + data$sex, data = data)
## data$sex
## data$trt Female Male
## Intervention 98 135
## Control 110 157
This is the function for Categorical variables testing: Chi squared and Fisher exact test. I will menton this topic in another day. Let focus on how this function help us create 2x2 table:
Full command: CrossTable(predictor, outcome, fisher = TRUE, chisq = TRUE, expected = TRUE, sresid = TRUE, format = “SAS”/“SPSS”)
library(gmodels)
CrossTable(data$trt, data$sex)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 500
##
##
## | data$sex
## data$trt | Female | Male | Row Total |
## -------------|-----------|-----------|-----------|
## Intervention | 98 | 135 | 233 |
## | 0.012 | 0.008 | |
## | 0.421 | 0.579 | 0.466 |
## | 0.471 | 0.462 | |
## | 0.196 | 0.270 | |
## -------------|-----------|-----------|-----------|
## Control | 110 | 157 | 267 |
## | 0.010 | 0.007 | |
## | 0.412 | 0.588 | 0.534 |
## | 0.529 | 0.538 | |
## | 0.220 | 0.314 | |
## -------------|-----------|-----------|-----------|
## Column Total | 208 | 292 | 500 |
## | 0.416 | 0.584 | |
## -------------|-----------|-----------|-----------|
##
##
CrossTable(data$trt, data$sex, fisher = TRUE, chisq = TRUE, expected = TRUE, sresid = TRUE, format = "SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## | Std Residual |
## |-------------------------|
##
## Total Observations in Table: 500
##
## | data$sex
## data$trt | Female | Male | Row Total |
## -------------|-----------|-----------|-----------|
## Intervention | 98 | 135 | 233 |
## | 96.928 | 136.072 | |
## | 0.012 | 0.008 | |
## | 42.060% | 57.940% | 46.600% |
## | 47.115% | 46.233% | |
## | 19.600% | 27.000% | |
## | 0.109 | -0.092 | |
## -------------|-----------|-----------|-----------|
## Control | 110 | 157 | 267 |
## | 111.072 | 155.928 | |
## | 0.010 | 0.007 | |
## | 41.199% | 58.801% | 53.400% |
## | 52.885% | 53.767% | |
## | 22.000% | 31.400% | |
## | -0.102 | 0.086 | |
## -------------|-----------|-----------|-----------|
## Column Total | 208 | 292 | 500 |
## | 41.600% | 58.400% | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 0.03801773 d.f. = 1 p = 0.8454075
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 0.01082402 d.f. = 1 p = 0.9171387
##
##
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio: 1.03604
##
## Alternative hypothesis: true odds ratio is not equal to 1
## p = 0.8561188
## 95% confidence interval: 0.7138417 1.50323
##
## Alternative hypothesis: true odds ratio is less than 1
## p = 0.6126348
## 95% confidence interval: 0 1.419597
##
## Alternative hypothesis: true odds ratio is greater than 1
## p = 0.4584408
## 95% confidence interval: 0.7559892 Inf
##
##
##
## Minimum expected frequency: 96.928
R has basic commands that allow users to check the default commands
or to modify data.
## Default check of command: args() vs str()
The args() function in R is used to return the required
arguments or parameters of a given function. The args()
function takes the name of the function as a parameter value, and
returns the parameters that are required by the function.
str() function in R Language is used for compactly displaying the internal structure of a R object. It can display even the internal structure of large lists which are nested. It provides one liner output for the basic R objects letting the user know about the object and its constituents.
Commands:
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr",
## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
## contrasts = NULL, offset, ...)
## NULL
str(pivot_longer)
## function (data, cols, names_to = "name", names_prefix = NULL, names_sep = NULL,
## names_pattern = NULL, names_ptypes = NULL, names_transform = NULL,
## names_repair = "check_unique", values_to = "value", values_drop_na = FALSE,
## values_ptypes = NULL, values_transform = NULL, ...)
To create a new dataframe that contains only the variables or cases that you want.
Command: newDataframe <- oldDataframe[rows, columns]
Functions:
There are many functions and operators that are useful when constructing
the expressions used to filter the data:
Example 1: Choose rows with 1 condition:
data[data$trt == "Control", ] %>% head (n=5)
## # A tibble: 5 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z003 Control 68 Female 11 Right No No No 0 138
## 3 z004 Control 28 Male 22 Left No No No 0 122
## 4 z005 Control 91 Male 24 Right No No No 0 162
## 5 z006 Control 34 Female 18 Left No No No 2 166
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>
Example 2: Choose rows with 2 conditions:
data[data$trt == "Control" & data$age >=50, ] %>% head (n=5)
## # A tibble: 5 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z003 Control 68 Female 11 Right No No No 0 138
## 3 z005 Control 91 Male 24 Right No No No 0 162
## 4 z008 Control 89 Female 18 Right No No No 0 157
## 5 z009 Control 75 Male 25 Left No Yes No 2 129
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>
Example 3: Choose rows that trt is Control and age > global mean(age). If you want to extract age > mean(age) for each group (trt), group_by() + filter() should be used:
data[data$trt == "Control" & data$age > mean(data$age), ] %>% head (n=5)
## # A tibble: 5 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z003 Control 68 Female 11 Right No No No 0 138
## 2 z005 Control 91 Male 24 Right No No No 0 162
## 3 z008 Control 89 Female 18 Right No No No 0 157
## 4 z009 Control 75 Male 25 Left No Yes No 2 129
## 5 z015 Control 91 Male 15 Right No No No 1 144
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>
Example 4: Drop all row that are not Intervention in trt:
data[! (data$trt == "Intervention"), ] %>% head (n=5)
## # A tibble: 5 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z003 Control 68 Female 11 Right No No No 0 138
## 3 z004 Control 28 Male 22 Left No No No 0 122
## 4 z005 Control 91 Male 24 Right No No No 0 162
## 5 z006 Control 34 Female 18 Left No No No 2 166
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>
Another way to select parts of your dataframe is to use the subset() function. Everything which can be used with subcripts can also be used with subset():
Command: newDataframe<-subset(oldDataframe, cases to retain (rows), select = c(list of variables - column))
Example 1: Choose rows with 1 condition:
data %>% subset(trt == "Control", ) %>% head (n=5)
## # A tibble: 5 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z003 Control 68 Female 11 Right No No No 0 138
## 3 z004 Control 28 Male 22 Left No No No 0 122
## 4 z005 Control 91 Male 24 Right No No No 0 162
## 5 z006 Control 34 Female 18 Left No No No 2 166
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>
Example 2: Choose rows with 2 conditions:
data %>% subset(trt == "Control" & data$age >=50, ) %>% head (n=5)
## # A tibble: 5 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z003 Control 68 Female 11 Right No No No 0 138
## 3 z005 Control 91 Male 24 Right No No No 0 162
## 4 z008 Control 89 Female 18 Right No No No 0 157
## 5 z009 Control 75 Male 25 Left No Yes No 2 129
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>
Example 3: Subset data with top 10% age. After this command, we will get top 10% highest age (46 rows)
quantile(data$age, 0.9)
## 90%
## 88
subset(data, age> quantile(data$age, 0.9))
## # A tibble: 46 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z005 Control 91 Male 24 Right No No No 0 162
## 2 z008 Control 89 Fema~ 18 Right No No No 0 157
## 3 z015 Control 91 Male 15 Right No No No 1 144
## 4 z017 Control 92 Fema~ 12 Right No No No 0 99
## 5 z024 Control 92 Male 14 Left No No Yes 2 122
## 6 z030 Interven~ 91 Male 14 Left No Yes No 0 154
## 7 z045 Interven~ 93 Fema~ 28 Right No No No 1 162
## 8 z060 Control 93 Fema~ 18 Left No No No 0 142
## 9 z067 Interven~ 93 Fema~ 17 Right No Yes No 0 156
## 10 z078 Control 89 Fema~ 13 Left No No No 0 172
## # ... with 36 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Merge two data frames by common columns or row names, or do other versions of database join operations.
Command: merge(x, y, by = , by.x = by, by.y = by, all = FALSE
Parameter:
* x,y: 2 dataframes
* by.x by.y: which of the column should merge
* by default: it will merge all of the column by common name. For
example: it will merge all “id”, “start,”“stop”, “time left” as they
appear in both dataframe (even though, they are not the same)
* all = TRUE: if there’s a value that appears in one
but not in the other, it should include another row but with na values
for the missing values that are, don’t appear in the other data
frame.
* all = FALSE: only include the same value between 2
dataframes.
Let create 2 dataframes contains following variables:
# Create 1st dataframe:
id = c(1:5)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva")
sex = c("F", "M", "F", "M", "F")
age = c(20, 17, 50, 30, 80)
merge1 = as.tibble(data.frame(id, names, sex, age))
# Create 2nd dataframe:
studyid = c(1:6)
names = c("Ana", "Boc", "Cindy", "Peter", "Yuly", "Doe")
group = c("Control", "Intervention", "Control", "Control", "Intervention", "Control")
age = c(20, 17, 50, 64, 34, 30)
merge2 = as.tibble(data.frame(studyid, names, group, age))
# Check dataframes:
merge1
## # A tibble: 5 x 4
## id names sex age
## <int> <chr> <chr> <dbl>
## 1 1 Ana F 20
## 2 2 Boc M 17
## 3 3 Cindy F 50
## 4 4 Doe M 30
## 5 5 Eva F 80
merge2
## # A tibble: 6 x 4
## studyid names group age
## <int> <chr> <chr> <dbl>
## 1 1 Ana Control 20
## 2 2 Boc Intervention 17
## 3 3 Cindy Control 50
## 4 4 Peter Control 64
## 5 5 Yuly Intervention 34
## 6 6 Doe Control 30
Example 1: Merge 2 dataframe by names. COmpare the parameter all= TRUE vs FALSE
merge(merge1, merge2, by = "names", all = TRUE)
## names id sex age.x studyid group age.y
## 1 Ana 1 F 20 1 Control 20
## 2 Boc 2 M 17 2 Intervention 17
## 3 Cindy 3 F 50 3 Control 50
## 4 Doe 4 M 30 6 Control 30
## 5 Eva 5 F 80 NA <NA> NA
## 6 Peter NA <NA> NA 4 Control 64
## 7 Yuly NA <NA> NA 5 Intervention 34
merge(merge1, merge2, by = "names", all = F)
## names id sex age.x studyid group age.y
## 1 Ana 1 F 20 1 Control 20
## 2 Boc 2 M 17 2 Intervention 17
## 3 Cindy 3 F 50 3 Control 50
## 4 Doe 4 M 30 6 Control 30
Example 2: We can see that in our dataframes, there are 2 similar variables: names and age. Let merge by both names and age:
merge(merge1, merge2, by.x = c("names", "age"),
by.y = c("names", "age"), all = TRUE)
## names age id sex studyid group
## 1 Ana 20 1 F 1 Control
## 2 Boc 17 2 M 2 Intervention
## 3 Cindy 50 3 F 3 Control
## 4 Doe 30 4 M 6 Control
## 5 Eva 80 5 F NA <NA>
## 6 Peter 64 NA <NA> 4 Control
## 7 Yuly 34 NA <NA> 5 Intervention
Commands:
factor(variable, levels = c(“x”,“y”, …” z”), labels =
c(“label1”, “label2”, … “label3”))
Let create a dataframe:
id = c(1:5)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva")
sex = c(0, 1, 0, 1, 0)
age = c(20, 17, 50, 30, 80)
merge1 = as.tibble(data.frame(id, names, sex, age))
merge1
## # A tibble: 5 x 4
## id names sex age
## <int> <chr> <dbl> <dbl>
## 1 1 Ana 0 20
## 2 2 Boc 1 17
## 3 3 Cindy 0 50
## 4 4 Doe 1 30
## 5 5 Eva 0 80
Example: Change the 0,1 in sex to Female and Male, respectively:
merge1$sex = factor(merge1$sex, levels = c(0, 1), labels = c("Female ", "Male"))
merge1
## # A tibble: 5 x 4
## id names sex age
## <int> <chr> <fct> <dbl>
## 1 1 Ana "Female " 20
## 2 2 Boc "Male" 17
## 3 3 Cindy "Female " 50
## 4 4 Doe "Male" 30
## 5 5 Eva "Female " 80
Let create a dataframe:
id = c(1:6)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva", "EOI")
sex = c(0, 1, 0, 1, 0, 1)
age = c(20, 17, 50, 30, 80, 40)
merge1 = as.tibble(data.frame(id, names, sex, age))
merge1
## # A tibble: 6 x 4
## id names sex age
## <int> <chr> <dbl> <dbl>
## 1 1 Ana 0 20
## 2 2 Boc 1 17
## 3 3 Cindy 0 50
## 4 4 Doe 1 30
## 5 5 Eva 0 80
## 6 6 EOI 1 40
Example: Create group_age categorical variables with >=30 and the rest
merge1$group_age[merge1$age >=30] <- ">=30"
merge1$group_age[merge1$age <30] <- "Phan con lai"
merge1
## # A tibble: 6 x 5
## id names sex age group_age
## <int> <chr> <dbl> <dbl> <chr>
## 1 1 Ana 0 20 Phan con lai
## 2 2 Boc 1 17 Phan con lai
## 3 3 Cindy 0 50 >=30
## 4 4 Doe 1 30 >=30
## 5 5 Eva 0 80 >=30
## 6 6 EOI 1 40 >=30
Replace 20 = 0; Missing value = 1000
id = c(1:6)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva", "EOI")
sex = c(0, 1, 0, 1, 0, 1)
age = c(20, 17, 50, 20, NA, NA)
merge1 = as.tibble(data.frame(id, names, sex, age))
merge1
## # A tibble: 6 x 4
## id names sex age
## <int> <chr> <dbl> <dbl>
## 1 1 Ana 0 20
## 2 2 Boc 1 17
## 3 3 Cindy 0 50
## 4 4 Doe 1 20
## 5 5 Eva 0 NA
## 6 6 EOI 1 NA
merge1$age[merge1$age == 20] <- 0
merge1$age[is.na(merge1$age)]<- 1000
merge1
## # A tibble: 6 x 4
## id names sex age
## <int> <chr> <dbl> <dbl>
## 1 1 Ana 0 0
## 2 2 Boc 1 17
## 3 3 Cindy 0 50
## 4 4 Doe 1 0
## 5 5 Eva 0 1000
## 6 6 EOI 1 1000
Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
Package used: tidyverse.
library(tidyverse)
tidyverse aims to provide a function for each basic verb of data manipulation. These verbs can be organised into three categories based on the component of the dataset that they work with:
Rows:
Columns:
Groups of rows:
Others:
These can all be used in conjunction with group_by()
which changes the scope of each function from operating on the entire
dataset to operating on it group-by-group. More infomation is available
here:
https://dplyr.tidyverse.org/articles/base.html#one-table-verbs
The pipe operator, denoted as %>%. It takes the output of one function and provides it as an input to another function. This enables us to connect a series of analytic stages. You can use the pipe to rewrite multiple operations that you can read left-to-right, top-to-bottom (reading the pipe operator as “then”).
The filter() method is applied to subset a data frame, keeping only the rows that meet your criteria. To be kept, the row must return TRUE for all requirements. When a condition evaluates to NA, the row is discarded, as opposed to base subsetting with.
Command: filter(dataframe, condition, .preserve = FALSE)
The filter() method is used to subset the rows of data by applying expressions in… to column values to decide which rows should be kept. It may be used on both grouped and ungrouped data (see group by() and ungroup() for further information). dplyr, on the other hand, is not yet intelligent enough to optimize the filtering procedure on grouped datasets that do not require grouped computations. As a result, sifting ungrouped data is frequently more quicker.
There are many functions and operators that are useful when constructing the expressions used to filter the data:
Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”.
The output has the following properties:
Example 1: Filtering by one criterion
filter(data, trt == "Control")
## # A tibble: 267 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z003 Control 68 Female 11 Right No No No 0 138
## 3 z004 Control 28 Male 22 Left No No No 0 122
## 4 z005 Control 91 Male 24 Right No No No 0 162
## 5 z006 Control 34 Female 18 Left No No No 2 166
## 6 z008 Control 89 Female 18 Right No No No 0 157
## 7 z009 Control 75 Male 25 Left No Yes No 2 129
## 8 z011 Control 35 Female 17 Right No Yes No 2 148
## 9 z012 Control 55 Male 13 Right No No No 0 114
## 10 z013 Control 59 Male 21 Right No No No 0 140
## # ... with 257 more rows, and 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>
filter(data, age >= 50)
## # A tibble: 421 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z002 Interven~ 51 Male 23 Left No Yes No 0 137
## 3 z003 Control 68 Fema~ 11 Right No No No 0 138
## 4 z005 Control 91 Male 24 Right No No No 0 162
## 5 z007 Interven~ 75 Male 25 Right No No No 0 140
## 6 z008 Control 89 Fema~ 18 Right No No No 0 157
## 7 z009 Control 75 Male 25 Left No Yes No 2 129
## 8 z012 Control 55 Male 13 Right No No No 0 114
## 9 z013 Control 59 Male 21 Right No No No 0 140
## 10 z015 Control 91 Male 15 Right No No No 1 144
## # ... with 411 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Example 2: Filtering by multiple criteria within a single logical expression
# Joint conditions "& and ,"
# filter(data, trt == "Control" , age >= 50)
filter(data, trt == "Control" & age >= 50)
## # A tibble: 235 x 18
## studyid trt age sex nihss location hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z003 Control 68 Female 11 Right No No No 0 138
## 3 z005 Control 91 Male 24 Right No No No 0 162
## 4 z008 Control 89 Female 18 Right No No No 0 157
## 5 z009 Control 75 Male 25 Left No Yes No 2 129
## 6 z012 Control 55 Male 13 Right No No No 0 114
## 7 z013 Control 59 Male 21 Right No No No 0 140
## 8 z015 Control 91 Male 15 Right No No No 1 144
## 9 z017 Control 92 Female 12 Right No No No 0 99
## 10 z018 Control 59 Male 22 Right No No No 0 170
## # ... with 225 more rows, and 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>
# Seperate conditions "| or"
filter(data, trt == "Control" | age >= 50)
## # A tibble: 453 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z002 Interven~ 51 Male 23 Left No Yes No 0 137
## 3 z003 Control 68 Fema~ 11 Right No No No 0 138
## 4 z004 Control 28 Male 22 Left No No No 0 122
## 5 z005 Control 91 Male 24 Right No No No 0 162
## 6 z006 Control 34 Fema~ 18 Left No No No 2 166
## 7 z007 Interven~ 75 Male 25 Right No No No 0 140
## 8 z008 Control 89 Fema~ 18 Right No No No 0 157
## 9 z009 Control 75 Male 25 Left No Yes No 2 129
## 10 z011 Control 35 Fema~ 17 Right No Yes No 2 148
## # ... with 443 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Note 1: x %in% y. This will select every row where x is one of the values in y. In this example, I will use 2 conditions. The 1st one is list all observations that has age = 50 or age = 60. The 2nd condition is list all observations that has age range from 50 to 60.
filter(data, age == 50 | age == 60)
## # A tibble: 16 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z037 Control 60 Fema~ 16 Left No Yes No 0 204
## 2 z076 Interven~ 50 Male 17 Left No No No 0 121
## 3 z095 Interven~ 60 Male 17 Right No Yes No 0 141
## 4 z147 Interven~ 60 Male 22 Left No No No 2 119
## 5 z152 Interven~ 60 Male 24 Right No No No 0 138
## 6 z161 Interven~ 60 Male 21 Left No No No 1 147
## 7 z163 Interven~ 60 Fema~ 10 Right No No Yes 0 176
## 8 z181 Control 60 Male 17 Right No Yes No 0 141
## 9 z231 Control 60 Male 22 Right No No No 0 105
## 10 z365 Control 60 Male 12 Left No No No 0 190
## 11 z373 Control 60 Male 14 Right Yes No No 0 178
## 12 z378 Control 50 Fema~ 22 Left No No No 1 130
## 13 z400 Interven~ 50 Male 20 Right No No No 0 135
## 14 z417 Interven~ 60 Male 14 Right No Yes No 0 170
## 15 z427 Control 50 Fema~ 22 Left No No No 2 126
## 16 z444 Control 60 Male 20 Left No No No > 2 159
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## # abbreviated variable name 1: location
filter(data, age %in% c(50, 60))
## # A tibble: 16 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z037 Control 60 Fema~ 16 Left No Yes No 0 204
## 2 z076 Interven~ 50 Male 17 Left No No No 0 121
## 3 z095 Interven~ 60 Male 17 Right No Yes No 0 141
## 4 z147 Interven~ 60 Male 22 Left No No No 2 119
## 5 z152 Interven~ 60 Male 24 Right No No No 0 138
## 6 z161 Interven~ 60 Male 21 Left No No No 1 147
## 7 z163 Interven~ 60 Fema~ 10 Right No No Yes 0 176
## 8 z181 Control 60 Male 17 Right No Yes No 0 141
## 9 z231 Control 60 Male 22 Right No No No 0 105
## 10 z365 Control 60 Male 12 Left No No No 0 190
## 11 z373 Control 60 Male 14 Right Yes No No 0 178
## 12 z378 Control 50 Fema~ 22 Left No No No 1 130
## 13 z400 Interven~ 50 Male 20 Right No No No 0 135
## 14 z417 Interven~ 60 Male 14 Right No Yes No 0 170
## 15 z427 Control 50 Fema~ 22 Left No No No 2 126
## 16 z444 Control 60 Male 20 Left No No No > 2 159
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## # ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## # abbreviated variable name 1: location
filter(data, age %in% c(50: 60))
## # A tibble: 117 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 Control 53 Male 21 Right No No No 2 127
## 2 z002 Interven~ 51 Male 23 Left No Yes No 0 137
## 3 z012 Control 55 Male 13 Right No No No 0 114
## 4 z013 Control 59 Male 21 Right No No No 0 140
## 5 z018 Control 59 Male 22 Right No No No 0 170
## 6 z019 Control 58 Male 22 Left No Yes No 0 130
## 7 z021 Interven~ 57 Male 17 Right No No No > 2 108
## 8 z031 Control 59 Fema~ 12 Right No No No 0 129
## 9 z032 Control 53 Male 22 Right No No Yes 0 157
## 10 z034 Interven~ 58 Male 15 Right No No No 0 105
## # ... with 107 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Note 2: Sometimes you can simplify complicated subsetting: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find observations that whose age is less than 60 and nihss is less than 15, you could use either of the following two filters. The following commands produce similar results:
filter(data, !(age > 60 | nihss > 15))
## # A tibble: 72 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z012 Control 55 Male 13 Right No No No 0 114
## 2 z031 Control 59 Fema~ 12 Right No No No 0 129
## 3 z034 Interven~ 58 Male 15 Right No No No 0 105
## 4 z039 Interven~ 59 Fema~ 14 Right No No Yes 0 148
## 5 z048 Interven~ 46 Fema~ 10 Left No No No 0 168
## 6 z052 Control 54.5 Male 12 Right No Yes No 0 149
## 7 z059 Control 58 Fema~ 15 Left No Yes No 1 130
## 8 z062 Interven~ 56 Male 14 Left No No Yes 0 146
## 9 z074 Interven~ 24 Fema~ 14 Right No No No 0 162
## 10 z080 Interven~ 31 Male 10 Left No No No 0 131
## # ... with 62 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
filter(data, !(age > 60) , ! (nihss > 15))
## # A tibble: 72 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z012 Control 55 Male 13 Right No No No 0 114
## 2 z031 Control 59 Fema~ 12 Right No No No 0 129
## 3 z034 Interven~ 58 Male 15 Right No No No 0 105
## 4 z039 Interven~ 59 Fema~ 14 Right No No Yes 0 148
## 5 z048 Interven~ 46 Fema~ 10 Left No No No 0 168
## 6 z052 Control 54.5 Male 12 Right No Yes No 0 149
## 7 z059 Control 58 Fema~ 15 Left No Yes No 1 130
## 8 z062 Interven~ 56 Male 14 Left No No Yes 0 146
## 9 z074 Interven~ 24 Fema~ 14 Right No No No 0 162
## 10 z080 Interven~ 31 Male 10 Left No No No 0 131
## # ... with 62 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
filter(data, age <= 60, nihss <= 15)
## # A tibble: 72 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z012 Control 55 Male 13 Right No No No 0 114
## 2 z031 Control 59 Fema~ 12 Right No No No 0 129
## 3 z034 Interven~ 58 Male 15 Right No No No 0 105
## 4 z039 Interven~ 59 Fema~ 14 Right No No Yes 0 148
## 5 z048 Interven~ 46 Fema~ 10 Left No No No 0 168
## 6 z052 Control 54.5 Male 12 Right No Yes No 0 149
## 7 z059 Control 58 Fema~ 15 Left No Yes No 1 130
## 8 z062 Interven~ 56 Male 14 Left No No Yes 0 146
## 9 z074 Interven~ 24 Fema~ 14 Right No No No 0 162
## 10 z080 Interven~ 31 Male 10 Left No No No 0 131
## # ... with 62 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Example 3: The filtering operation may yield different results on grouped tibbles because the expressions are computed within groups.
# The following filters rows where `age` is greater than the global average:
data %>% filter(age > mean(age, na.rm = TRUE))
## # A tibble: 270 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z003 Control 68 Fema~ 11 Right No No No 0 138
## 2 z005 Control 91 Male 24 Right No No No 0 162
## 3 z007 Interven~ 75 Male 25 Right No No No 0 140
## 4 z008 Control 89 Fema~ 18 Right No No No 0 157
## 5 z009 Control 75 Male 25 Left No Yes No 2 129
## 6 z015 Control 91 Male 15 Right No No No 1 144
## 7 z016 Interven~ 76 Male 16 Right No No No 0 135
## 8 z017 Control 92 Fema~ 12 Right No No No 0 99
## 9 z020 Interven~ 73 Fema~ 15 Right No Yes Yes 0 169
## 10 z022 Interven~ 67 Male 27 Left No Yes No > 2 148
## # ... with 260 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
# Whereas this keeps rows with `age` greater than the average value of age based on trt groups:
data %>% group_by(trt) %>% filter( age > mean(age, na.rm = TRUE))
## # A tibble: 259 x 18
## # Groups: trt [2]
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z003 Control 68 Fema~ 11 Right No No No 0 138
## 2 z005 Control 91 Male 24 Right No No No 0 162
## 3 z007 Interven~ 75 Male 25 Right No No No 0 140
## 4 z008 Control 89 Fema~ 18 Right No No No 0 157
## 5 z009 Control 75 Male 25 Left No Yes No 2 129
## 6 z015 Control 91 Male 15 Right No No No 1 144
## 7 z016 Interven~ 76 Male 16 Right No No No 0 135
## 8 z017 Control 92 Fema~ 12 Right No No No 0 99
## 9 z020 Interven~ 73 Fema~ 15 Right No Yes Yes 0 169
## 10 z022 Interven~ 67 Male 27 Left No Yes No > 2 148
## # ... with 249 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
Command: arrange(dataframe, by_group = FALSE)
Unlike base sorting with sort(), NA are always sorted to the end for local data, even when wrapped with desc().
The output has the following properties:
Example 1: Arrange the data based on age and nihss in ascending order
arrange(data, age, nihss)
## # A tibble: 500 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z481 Interven~ 23 Male 12 Right No Yes No 2 169
## 2 z191 Interven~ 23 Male 20 Right No Yes No 0 137
## 3 z499 Interven~ 23 Male 23 Left Yes Yes No 0 123
## 4 z074 Interven~ 24 Fema~ 14 Right No No No 0 162
## 5 z204 Control 24 Fema~ 22 Left No No No 0 123
## 6 z387 Control 25 Male 17 Left No Yes No 1 138
## 7 z406 Control 26 Male 15 Right No No No > 2 177
## 8 z010 Interven~ 26 Fema~ 27 Right No No No 0 143
## 9 z122 Interven~ 27 Fema~ 19 Left No Yes Yes 2 195
## 10 z347 Interven~ 27 Fema~ 20 Right No No No 0 133
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Example 2: Arrange the data based on age and nihss in descending order
arrange(data, desc(age), desc(nihss))
## # A tibble: 500 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z374 Interven~ 96 Male 26 Left No No No 0 124
## 2 z086 Interven~ 96 Male 14 Right No No Yes 0 139
## 3 z434 Interven~ 95 Fema~ 19 Right No No No 0 111
## 4 z340 Interven~ 95 Male 18 Left No Yes No > 2 188
## 5 z357 Interven~ 94 Fema~ 26 Right No No No 0 200
## 6 z135 Control 94 Fema~ 18 Right Yes No Yes 0 120
## 7 z045 Interven~ 93 Fema~ 28 Right No No No 1 162
## 8 z440 Control 93 Male 25 Right No No No 0 129
## 9 z219 Interven~ 93 Fema~ 20 Right No No No 0 146
## 10 z227 Interven~ 93 Fema~ 20 Left No No No 0 134
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Example 3: Grouped arrange: Grouped by age, arrange based on nihss. Compare the result with arrange without group:
# Arrange with group:
data %>% group_by(age) %>% arrange(nihss, .by_group = TRUE)
## # A tibble: 500 x 18
## # Groups: age [77]
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z481 Interven~ 23 Male 12 Right No Yes No 2 169
## 2 z191 Interven~ 23 Male 20 Right No Yes No 0 137
## 3 z499 Interven~ 23 Male 23 Left Yes Yes No 0 123
## 4 z074 Interven~ 24 Fema~ 14 Right No No No 0 162
## 5 z204 Control 24 Fema~ 22 Left No No No 0 123
## 6 z387 Control 25 Male 17 Left No Yes No 1 138
## 7 z406 Control 26 Male 15 Right No No No > 2 177
## 8 z010 Interven~ 26 Fema~ 27 Right No No No 0 143
## 9 z122 Interven~ 27 Fema~ 19 Left No Yes Yes 2 195
## 10 z347 Interven~ 27 Fema~ 20 Right No No No 0 133
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
# Arrange without group:
data %>% arrange(nihss)
## # A tibble: 500 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <fct> <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z048 Interven~ 46 Fema~ 10 Left No No No 0 168
## 2 z080 Interven~ 31 Male 10 Left No No No 0 131
## 3 z163 Interven~ 60 Fema~ 10 Right No No Yes 0 176
## 4 z297 Interven~ 40 Male 10 Left Yes No No 0 134
## 5 z298 Interven~ 72 Fema~ 10 Left No No No 0 183
## 6 z301 Interven~ 67 Male 10 Left No No No 0 166
## 7 z310 Interven~ 30 Male 10 Right No No No 0 164
## 8 z330 Interven~ 33 Male 10 Left No No No 1 176
## 9 z334 Interven~ 64 Male 10 Right Yes No No 0 160
## 10 z003 Control 68 Fema~ 11 Right No No No 0 138
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
Example 4: Use across() access select()-style semantics:
data %>% select(sbp, nihss, time.iv, time.rand, time.punc) %>%
arrange(across(starts_with("time.")))
## # A tibble: 500 x 5
## sbp nihss time.iv time.rand time.punc
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 111 13 42 276 295
## 2 134 17 43 138 267
## 3 121 17 43 146 186
## 4 159 28 44 189 327
## 5 144 21 44 212 NA
## 6 139 14 44 299 331
## 7 135 20 45 189 336
## 8 133 11 45 299 329
## 9 165 22 46 128 NA
## 10 148 14 46 148 333
## # ... with 490 more rows
data %>% select(sbp, nihss, time.iv, time.rand, time.punc) %>%
arrange(across(starts_with("time."), desc))
## # A tibble: 500 x 5
## sbp nihss time.iv time.rand time.punc
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 119 22 218 195 182
## 2 142 18 218 152 247
## 3 154 15 212 256 321
## 4 166 10 211 235 260
## 5 160 10 204 300 184
## 6 176 21 200 276 219
## 7 139 17 200 162 194
## 8 131 10 199 292 190
## 9 140 15 199 267 198
## 10 131 24 197 300 200
## # ... with 490 more rows
Using actions based on variable names, select() lets you to quickly zoom in on a suitable subset.
Command: select(dataframe, …)
Tidyverse selections implement a dialect of R where operators make it easy to select variables:
Some helpers select specific columns:
Some helpers select variables by matching patterns in their names:
Some helpers select variables from a character vector:
This helper selects variables with a function:
The output has the following properties:
Example 1: Select variables by name:
data %>% select(trt, nihss, location, mrankin)
## # A tibble: 500 x 4
## trt nihss location mrankin
## <fct> <dbl> <fct> <fct>
## 1 Control 21 Right 2
## 2 Intervention 23 Left 0
## 3 Control 11 Right 0
## 4 Control 22 Left 0
## 5 Control 24 Right 0
## 6 Control 18 Left 2
## 7 Intervention 25 Right 0
## 8 Control 18 Right 0
## 9 Control 25 Left 2
## 10 Intervention 27 Right 0
## # ... with 490 more rows
Example 2: Select a range of consecutive variables with :
data %>% select(trt : nihss)
## # A tibble: 500 x 4
## trt age sex nihss
## <fct> <dbl> <fct> <dbl>
## 1 Control 53 Male 21
## 2 Intervention 51 Male 23
## 3 Control 68 Female 11
## 4 Control 28 Male 22
## 5 Control 91 Male 24
## 6 Control 34 Female 18
## 7 Intervention 75 Male 25
## 8 Control 89 Female 18
## 9 Control 75 Male 25
## 10 Intervention 26 Female 27
## # ... with 490 more rows
Example 3: Negates a selection with !. These commands produces similar results:
# Choose variables except studyid; location: time.punc:
data %>% select(! c( studyid, location : time.punc))
## # A tibble: 500 x 4
## trt age sex nihss
## <fct> <dbl> <fct> <dbl>
## 1 Control 53 Male 21
## 2 Intervention 51 Male 23
## 3 Control 68 Female 11
## 4 Control 28 Male 22
## 5 Control 91 Male 24
## 6 Control 34 Female 18
## 7 Intervention 75 Male 25
## 8 Control 89 Female 18
## 9 Control 75 Male 25
## 10 Intervention 26 Female 27
## # ... with 490 more rows
data %>% select(!studyid & !c(location : time.punc))
## # A tibble: 500 x 4
## trt age sex nihss
## <fct> <dbl> <fct> <dbl>
## 1 Control 53 Male 21
## 2 Intervention 51 Male 23
## 3 Control 68 Female 11
## 4 Control 28 Male 22
## 5 Control 91 Male 24
## 6 Control 34 Female 18
## 7 Intervention 75 Male 25
## 8 Control 89 Female 18
## 9 Control 75 Male 25
## 10 Intervention 26 Female 27
## # ... with 490 more rows
Example 4: Select variables start with time
data %>% select(starts_with("time"))
## # A tibble: 500 x 3
## time.iv time.rand time.punc
## <dbl> <dbl> <dbl>
## 1 63 139 NA
## 2 68 118 281
## 3 NA 178 NA
## 4 78 160 NA
## 5 121 214 NA
## 6 78 154 NA
## 7 97 122 268
## 8 NA 147 NA
## 9 49 271 NA
## 10 99 141 259
## # ... with 490 more rows
Example 5: & and | take the intersection or the union of two selections:
data %>% select(starts_with("time.") & ends_with("iv"))
## # A tibble: 500 x 1
## time.iv
## <dbl>
## 1 63
## 2 68
## 3 NA
## 4 78
## 5 121
## 6 78
## 7 97
## 8 NA
## 9 49
## 10 99
## # ... with 490 more rows
data %>% select(starts_with("time.") | ends_with("iv"))
## # A tibble: 500 x 3
## time.iv time.rand time.punc
## <dbl> <dbl> <dbl>
## 1 63 139 NA
## 2 68 118 281
## 3 NA 178 NA
## 4 78 160 NA
## 5 121 214 NA
## 6 78 154 NA
## 7 97 122 268
## 8 NA 147 NA
## 9 49 271 NA
## 10 99 141 259
## # ... with 490 more rows
Example 6: Use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
select(data, studyid, nihss, trt, everything())
## # A tibble: 500 x 18
## studyid nihss trt age sex locat~1 hx.isch afib dm mrankin sbp
## <fct> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 z001 21 Control 53 Male Right No No No 2 127
## 2 z002 23 Interven~ 51 Male Left No Yes No 0 137
## 3 z003 11 Control 68 Fema~ Right No No No 0 138
## 4 z004 22 Control 28 Male Left No No No 0 122
## 5 z005 24 Control 91 Male Right No No No 0 162
## 6 z006 18 Control 34 Fema~ Left No No No 2 166
## 7 z007 25 Interven~ 75 Male Right No No No 0 140
## 8 z008 18 Control 89 Fema~ Right No No No 0 157
## 9 z009 25 Control 75 Male Left No Yes No 2 129
## 10 z010 27 Interven~ 26 Fema~ Right No No No 0 143
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
rename() changes the names of individual variables using new_name = old_name syntax; rename_with() renames columns using a function.
Command: rename(dataframe, new name = old name …)
The output has the following properties:
Example 1: Rename variable trt to Group:
data %>% select(studyid, trt, sex) %>% rename (Group = trt)
## # A tibble: 500 x 3
## studyid Group sex
## <fct> <fct> <fct>
## 1 z001 Control Male
## 2 z002 Intervention Male
## 3 z003 Control Female
## 4 z004 Control Male
## 5 z005 Control Male
## 6 z006 Control Female
## 7 z007 Intervention Male
## 8 z008 Control Female
## 9 z009 Control Male
## 10 z010 Intervention Female
## # ... with 490 more rows
Example 2: Change the variable names start with time. to upper case TIME. Or change all variables to upper case:
data %>% select(trt, starts_with("time.")) %>% rename_with (toupper, starts_with("time."))
## # A tibble: 500 x 4
## trt TIME.IV TIME.RAND TIME.PUNC
## <fct> <dbl> <dbl> <dbl>
## 1 Control 63 139 NA
## 2 Intervention 68 118 281
## 3 Control NA 178 NA
## 4 Control 78 160 NA
## 5 Control 121 214 NA
## 6 Control 78 154 NA
## 7 Intervention 97 122 268
## 8 Control NA 147 NA
## 9 Control 49 271 NA
## 10 Intervention 99 141 259
## # ... with 490 more rows
data %>% select(studyid, trt, sex) %>% rename_with (toupper)
## # A tibble: 500 x 3
## STUDYID TRT SEX
## <fct> <fct> <fct>
## 1 z001 Control Male
## 2 z002 Intervention Male
## 3 z003 Control Female
## 4 z004 Control Male
## 5 z005 Control Male
## 6 z006 Control Female
## 7 z007 Intervention Male
## 8 z008 Control Female
## 9 z009 Control Male
## 10 z010 Intervention Female
## # ... with 490 more rows
Example 3: Change the variable names start with time. to lower case TIME, replace “.” to “_” between each words:
data %>% select (studyid, starts_with("time.")) %>%
rename_with(~ tolower(gsub(".", "_", .x, fixed = TRUE)))
## # A tibble: 500 x 4
## studyid time_iv time_rand time_punc
## <fct> <dbl> <dbl> <dbl>
## 1 z001 63 139 NA
## 2 z002 68 118 281
## 3 z003 NA 178 NA
## 4 z004 78 160 NA
## 5 z005 121 214 NA
## 6 z006 78 154 NA
## 7 z007 97 122 268
## 8 z008 NA 147 NA
## 9 z009 49 271 NA
## 10 z010 99 141 259
## # ... with 490 more rows
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate(). mutate() always adds new columns at the end of your dataset.
Command:
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
row_number(y)
## [1] 1 2 3 NA 4 5
dense_rank(y)
## [1] 1 2 2 NA 3 4
percent_rank(y)
## [1] 0.00 0.25 0.25 NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6 NA 0.8 1.0
Example 1: Newly created variables are available immediately:
data %>%
select(studyid, nihss) %>%
mutate(
nihss2 = nihss * 2,
nihss_squared = nihss * nihss
)
## # A tibble: 500 x 4
## studyid nihss nihss2 nihss_squared
## <fct> <dbl> <dbl> <dbl>
## 1 z001 21 42 441
## 2 z002 23 46 529
## 3 z003 11 22 121
## 4 z004 22 44 484
## 5 z005 24 48 576
## 6 z006 18 36 324
## 7 z007 25 50 625
## 8 z008 18 36 324
## 9 z009 25 50 625
## 10 z010 27 54 729
## # ... with 490 more rows
Example 2: Remove variables and modify existing variables:
data %>%
select(studyid, nihss, sbp) %>%
mutate(
sbp = NULL,
nihss = nihss /10
)
## # A tibble: 500 x 2
## studyid nihss
## <fct> <dbl>
## 1 z001 2.1
## 2 z002 2.3
## 3 z003 1.1
## 4 z004 2.2
## 5 z005 2.4
## 6 z006 1.8
## 7 z007 2.5
## 8 z008 1.8
## 9 z009 2.5
## 10 z010 2.7
## # ... with 490 more rows
Example 3: Apply a transformation to multiple columns in a tibble:
data %>%
select(studyid, sex, sbp) %>%
mutate(across( !sbp, as.character))
## # A tibble: 500 x 3
## studyid sex sbp
## <chr> <chr> <dbl>
## 1 z001 Male 127
## 2 z002 Male 137
## 3 z003 Female 138
## 4 z004 Male 122
## 5 z005 Male 162
## 6 z006 Female 166
## 7 z007 Male 140
## 8 z008 Female 157
## 9 z009 Male 129
## 10 z010 Female 143
## # ... with 490 more rows
Example 4: Functions are useful for grouped mutates: min_rank; row_number(y); dense_rank(y); percent_rank(y); cume_dist(y):
data %>%
select(studyid, nihss, sbp) %>%
group_by(sbp) %>%
mutate(rank = min_rank(nihss))
## # A tibble: 500 x 4
## # Groups: sbp [113]
## studyid nihss sbp rank
## <fct> <dbl> <dbl> <int>
## 1 z001 21 127 3
## 2 z002 23 137 7
## 3 z003 11 138 1
## 4 z004 22 122 5
## 5 z005 24 162 12
## 6 z006 18 166 6
## 7 z007 25 140 9
## 8 z008 18 157 4
## 9 z009 25 129 6
## 10 z010 27 143 6
## # ... with 490 more rows
data %>%
select(studyid, nihss, sbp) %>%
group_by(sbp) %>%
mutate(rank = dense_rank(nihss))
## # A tibble: 500 x 4
## # Groups: sbp [113]
## studyid nihss sbp rank
## <fct> <dbl> <dbl> <int>
## 1 z001 21 127 3
## 2 z002 23 137 5
## 3 z003 11 138 1
## 4 z004 22 122 4
## 5 z005 24 162 7
## 6 z006 18 166 5
## 7 z007 25 140 8
## 8 z008 18 157 4
## 9 z009 25 129 5
## 10 z010 27 143 5
## # ... with 490 more rows
Example 5: New column can be placed in any column position: .before and .after. The 2 following commands are similar:
data %>%
select(studyid, nihss, sbp) %>%
mutate( proportion = nihss / sbp, .before = nihss)
## # A tibble: 500 x 4
## studyid proportion nihss sbp
## <fct> <dbl> <dbl> <dbl>
## 1 z001 0.165 21 127
## 2 z002 0.168 23 137
## 3 z003 0.0797 11 138
## 4 z004 0.180 22 122
## 5 z005 0.148 24 162
## 6 z006 0.108 18 166
## 7 z007 0.179 25 140
## 8 z008 0.115 18 157
## 9 z009 0.194 25 129
## 10 z010 0.189 27 143
## # ... with 490 more rows
data %>%
select(studyid, nihss, sbp) %>%
mutate( proportion = nihss / sbp, .after = studyid)
## # A tibble: 500 x 4
## studyid proportion nihss sbp
## <fct> <dbl> <dbl> <dbl>
## 1 z001 0.165 21 127
## 2 z002 0.168 23 137
## 3 z003 0.0797 11 138
## 4 z004 0.180 22 122
## 5 z005 0.148 24 162
## 6 z006 0.108 18 166
## 7 z007 0.179 25 140
## 8 z008 0.115 18 157
## 9 z009 0.194 25 129
## 10 z010 0.189 27 143
## # ... with 490 more rows
Example 6: Calculate based on group:
# Global average:
data %>%
select(studyid, ia.occlus, nihss) %>%
mutate(global_proportion = nihss / mean(nihss, na.rm = TRUE))
## # A tibble: 500 x 4
## studyid ia.occlus nihss global_proportion
## <fct> <fct> <dbl> <dbl>
## 1 z001 M1 21 1.16
## 2 z002 M1 23 1.28
## 3 z003 ICA with M1 11 0.610
## 4 z004 ICA with M1 22 1.22
## 5 z005 M1 24 1.33
## 6 z006 A1 or A2 18 0.999
## 7 z007 M1 25 1.39
## 8 z008 M1 18 0.999
## 9 z009 M1 25 1.39
## 10 z010 M1 27 1.50
## # ... with 490 more rows
# Group-Local average: calculate based on variable "sex"
data %>%
select(studyid, ia.occlus, nihss) %>%
group_by(ia.occlus) %>%
mutate(sex_proportion = nihss / mean(nihss, na.rm = TRUE))
## # A tibble: 500 x 4
## # Groups: ia.occlus [6]
## studyid ia.occlus nihss sex_proportion
## <fct> <fct> <dbl> <dbl>
## 1 z001 M1 21 1.15
## 2 z002 M1 23 1.26
## 3 z003 ICA with M1 11 0.638
## 4 z004 ICA with M1 22 1.28
## 5 z005 M1 24 1.32
## 6 z006 A1 or A2 18 0.806
## 7 z007 M1 25 1.37
## 8 z008 M1 18 0.988
## 9 z009 M1 25 1.37
## 10 z010 M1 27 1.48
## # ... with 490 more rows
summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
Command: summarise(dataframe, x = )
Example 1: Calculate the mean and sd of age based on trt group
data %>%
group_by (trt) %>%
summarise( count = n(), mean_age = mean(age), sd_age = sd(age) )
## # A tibble: 2 x 4
## trt count mean_age sd_age
## <fct> <int> <dbl> <dbl>
## 1 Intervention 233 63.9 18.1
## 2 Control 267 65.4 16.1
Example 2: Calculate the mean and sd of age based on ia.occlus group
data %>%
group_by (ia.occlus) %>%
summarise( count = n(), mean_age = mean(age), sd_age = sd(age) )
## # A tibble: 6 x 4
## ia.occlus count mean_age sd_age
## <fct> <int> <dbl> <dbl>
## 1 Intracranial ICA 4 68.5 31.4
## 2 ICA with M1 134 62.2 17.1
## 3 M1 319 65.8 17.1
## 4 M2 39 64.3 14.2
## 5 A1 or A2 3 53.7 17.1
## 6 <NA> 1 75 NA
pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().
Command:
* pivot_longer( dataframe, names_to = ,values_to = , names_sep =
“.”)
* pivot_wider ( dataframe, names_from = ,values_from = ,
names_sep = “.”)
The variables in the data is already in pivot_longer format.
Example 1: Choosing the 3 variables in the data and transform 2 of them using pivot_wider. Compare it with the original variables:
# Original
data %>% select (studyid, trt, age)
## # A tibble: 500 x 3
## studyid trt age
## <fct> <fct> <dbl>
## 1 z001 Control 53
## 2 z002 Intervention 51
## 3 z003 Control 68
## 4 z004 Control 28
## 5 z005 Control 91
## 6 z006 Control 34
## 7 z007 Intervention 75
## 8 z008 Control 89
## 9 z009 Control 75
## 10 z010 Intervention 26
## # ... with 490 more rows
# Transform
data %>% select (studyid, trt, age) %>% pivot_wider (names_from = trt,values_from = age)
## # A tibble: 500 x 3
## studyid Control Intervention
## <fct> <dbl> <dbl>
## 1 z001 53 NA
## 2 z002 NA 51
## 3 z003 68 NA
## 4 z004 28 NA
## 5 z005 91 NA
## 6 z006 34 NA
## 7 z007 NA 75
## 8 z008 89 NA
## 9 z009 75 NA
## 10 z010 NA 26
## # ... with 490 more rows
Example 2: Choosing the 4 variables in the data and transform 3 of them using pivot_wider. Compare it with the original variables:
# Original
data %>% select (studyid, trt, age, sbp)
## # A tibble: 500 x 4
## studyid trt age sbp
## <fct> <fct> <dbl> <dbl>
## 1 z001 Control 53 127
## 2 z002 Intervention 51 137
## 3 z003 Control 68 138
## 4 z004 Control 28 122
## 5 z005 Control 91 162
## 6 z006 Control 34 166
## 7 z007 Intervention 75 140
## 8 z008 Control 89 157
## 9 z009 Control 75 129
## 10 z010 Intervention 26 143
## # ... with 490 more rows
# Transform
data %>% select (studyid, trt, age, sbp) %>%
pivot_wider (names_from = trt,values_from = c(age, sbp), names_sep = "_")
## # A tibble: 500 x 5
## studyid age_Control age_Intervention sbp_Control sbp_Intervention
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 z001 53 NA 127 NA
## 2 z002 NA 51 NA 137
## 3 z003 68 NA 138 NA
## 4 z004 28 NA 122 NA
## 5 z005 91 NA 162 NA
## 6 z006 34 NA 166 NA
## 7 z007 NA 75 NA 140
## 8 z008 89 NA 157 NA
## 9 z009 75 NA 129 NA
## 10 z010 NA 26 NA 143
## # ... with 490 more rows
You can create strings with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using “, unless you want to create a string that contains multiple”.
string1 <- “This is a string”
Base R has several functions for working with strings, but we’ll omit them since they can be inconsistent, making them difficult to remember. Instead, we’ll utilize stringr functions. These have more logical names and all begin with str_. For instance, str_length() returns the number of characters in a string:
str_length(c("a", "Basic R programming", NA))
## [1] 1 19 NA
To combine two or more strings, use str_c(). For a vector of strings, we will need a complicated command, I will mention its command at the end of this part:
Command: str_c (“string 1”, “string 2”, “string 3”, sep = ” “)
str_c("a", "b")
## [1] "ab"
str_c("I", "like", "cats")
## [1] "Ilikecats"
Use the sep argument to control how they’re separated:
str_c("a", "b", sep = ", ")
## [1] "a, b"
str_c("I", "like", "cats", sep = " ")
## [1] "I like cats"
Like most other functions in R, missing values are contagious. If you want them to print as “NA”, use str_replace_na():
x <- c("abc", NA)
str_c("|-", x, "-|")
## [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
## [1] "|-abc-|" "|-NA-|"
As previously demonstrated, str_c() is vectorised,
and it recycles lesser vectors to the same length as the longest:
Command: str_c (“string 1”, c(“string 2”, “string 3”), “string
4”, sep = ” “)
str_c("prefix-", c("a", "b", "c"), "-suffix")
## [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
str_c("I", c("like", "hate", "love"), "animal", sep = " ")
## [1] "I like animal" "I hate animal" "I love animal"
To collapse a vector of strings into a single
string, use collapse:
Command: str_c (Vector string) , collapse = ” “)
str_c(c("a", "b", "c"), collapse = ", ")
## [1] "a, b, c"
str_sub() can be used to extract sections of a string. In addition to the string, str_sub() accepts start and end parameters that specify the (inclusive) location of the substring:
Command: str_sub(x, start, end)
#Positive numbers count forwards from start
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
# Negative numbers count backwards from end
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear"
Note: str_sub() won’t fail if the string is too short: it will just return as much as possible:
str_sub("App", 1, 5)
## [1] "App"
You can also use the assignment form of str_sub() to modify strings:
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
## [1] "apple" "banana" "pear"
I used str_to_lower() above to convert the text to lower case. str_to_upper() and str_to_title() can also be used (). However, changing case is more complex than it appears since various languages have distinct case-changing laws. By defining a location, you may choose which set of rules to use:
str_to_upper(str_sub(x, 1, 3))
## [1] "APP" "BAN" "PEA"
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
x
## [1] "Apple" "Banana" "Pear"
This function can be used to modify the string within variable, so that they will have the same format:
df2 = df
str_sub(df2$sex, 1, ) <- str_to_lower(str_sub(df2$sex, 1, ))
as.factor(df2$sex[1:3])
## [1] male male female
## Levels: female male
x = c("male", "FEMALE", "MaLe", "Female", "FeMale")
str_sub(x,1, ) = str_to_lower(str_sub(x, 1, ))
x
## [1] "male" "female" "male" "female" "female"
The simplest patterns match exact strings:
Command: str_view(x, ” y “)
x <- c("apple", "banana", "pear")
str_view(x, "an")
The next step up in complexity is ., which matches any character
(except a newline):
Command: str_view(x, “. y .”)
str_view(x, ".a.")
But how can you match the character “.” if “.” matches any character? You must use a “escape” to notify the regular expression that you want it to match it exactly, rather than using its unique behavior. Regexps, like strings, employ the backslash “\”, to avoid specific behavior. To match an “.”, you’ll need the regexp “\”. This, unfortunately, causes a difficulty. Strings are used to represent regular expressions, and is also used in strings as an escape symbol. So we need the string “\.” to generate the regular expression “.”.
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
## \.
#> \.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
If is used as an escape character in regular expressions, how do you match a literal ? To match a literal you need to write “\\” — you need four backslashes to match one!
x <- "a\\b"
writeLines(x)
## a\b
#> a\b
str_view(x, "\\\\")
For more information, please refer to this link: https://r4ds.had.co.nz/strings.html#strings
For more information, please refer to this link: https://r4ds.had.co.nz/dates-and-times.html
This chapter will concentrate on the lubridate package,
which simplifies working with dates and timings in R.
Lubridate is not part of the core tidyverse because it
is only required when working with dates/times.
library(tidyverse)
library(lubridate)
Create a dataframe dat:
my_dates<-c( "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04")
my_values<-c( 1,2,3,4,5,6,7,8)
dat<-tibble(my_dates,my_values)
dat
## # A tibble: 8 x 2
## my_dates my_values
## <chr> <dbl>
## 1 2020-01-01 1
## 2 2020-01-02 2
## 3 2020-01-03 3
## 4 2020-01-04 4
## 5 2020-02-01 5
## 6 2020-02-02 6
## 7 2020-02-03 7
## 8 2020-02-04 8
ymd() will parse the character string that falls
into that format.
Example 1: Using ymd(), change the dat$my_dates from character to date
format.
dat$my_dates<-ymd(dat$my_dates)
dat
## # A tibble: 8 x 2
## my_dates my_values
## <date> <dbl>
## 1 2020-01-01 1
## 2 2020-01-02 2
## 3 2020-01-03 3
## 4 2020-01-04 4
## 5 2020-02-01 5
## 6 2020-02-02 6
## 7 2020-02-03 7
## 8 2020-02-04 8
Other date parsers are mdy, myd, dmy, yq, and so forth. Once you have data as a date, you can extract components of the date:
Example 2: Apply the above commands to dat$my_dates:
month(dat$my_dates)
## [1] 1 1 1 1 2 2 2 2
year(dat$my_dates)
## [1] 2020 2020 2020 2020 2020 2020 2020 2020
day(dat$my_dates)
## [1] 1 2 3 4 1 2 3 4
wday(dat$my_dates, label=TRUE)
## [1] Wed Thu Fri Sat Sat Sun Mon Tue
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.
For more information, please refer to the following link:
https://dplyr.tidyverse.org/articles/two-table.html#mutating-joins
There are four types of mutating join, which differ in their behaviour when a match is not found.
Remember the Example we make in merge section earlier? Let create 2 dataframes contains following variables:
# Create 1st dataframe:
id = c(1:5)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva")
sex = c("F", "M", "F", "M", "F")
age = c(20, 17, 50, 30, 80)
merge1 = as.tibble(data.frame(id, names, sex, age))
# Create 2nd dataframe:
studyid = c(1:6)
names = c("Ana", "Boc", "Cindy", "Peter", "Yuly", "Doe")
group = c("Control", "Intervention", "Control", "Control", "Intervention", "Control")
age = c(20, 17, 50, 64, 34, 30)
merge2 = as.tibble(data.frame(studyid, names, group, age))
# Check dataframes:
merge1
## # A tibble: 5 x 4
## id names sex age
## <int> <chr> <chr> <dbl>
## 1 1 Ana F 20
## 2 2 Boc M 17
## 3 3 Cindy F 50
## 4 4 Doe M 30
## 5 5 Eva F 80
merge2
## # A tibble: 6 x 4
## studyid names group age
## <int> <chr> <chr> <dbl>
## 1 1 Ana Control 20
## 2 2 Boc Intervention 17
## 3 3 Cindy Control 50
## 4 4 Peter Control 64
## 5 5 Yuly Intervention 34
## 6 6 Doe Control 30
merge1 %>% inner_join(merge2, by = c("names", "age"))
## # A tibble: 4 x 6
## id names sex age studyid group
## <int> <chr> <chr> <dbl> <int> <chr>
## 1 1 Ana F 20 1 Control
## 2 2 Boc M 17 2 Intervention
## 3 3 Cindy F 50 3 Control
## 4 4 Doe M 30 6 Control
merge1 %>% left_join(merge2, by = c("names", "age"))
## # A tibble: 5 x 6
## id names sex age studyid group
## <int> <chr> <chr> <dbl> <int> <chr>
## 1 1 Ana F 20 1 Control
## 2 2 Boc M 17 2 Intervention
## 3 3 Cindy F 50 3 Control
## 4 4 Doe M 30 6 Control
## 5 5 Eva F 80 NA <NA>
merge1 %>% right_join(merge2, by = c("names", "age"))
## # A tibble: 6 x 6
## id names sex age studyid group
## <int> <chr> <chr> <dbl> <int> <chr>
## 1 1 Ana F 20 1 Control
## 2 2 Boc M 17 2 Intervention
## 3 3 Cindy F 50 3 Control
## 4 4 Doe M 30 6 Control
## 5 NA Peter <NA> 64 4 Control
## 6 NA Yuly <NA> 34 5 Intervention
merge1 %>% full_join(merge2, by = c("names", "age"))
## # A tibble: 7 x 6
## id names sex age studyid group
## <int> <chr> <chr> <dbl> <int> <chr>
## 1 1 Ana F 20 1 Control
## 2 2 Boc M 17 2 Intervention
## 3 3 Cindy F 50 3 Control
## 4 4 Doe M 30 6 Control
## 5 5 Eva F 80 NA <NA>
## 6 NA Peter <NA> 64 4 Control
## 7 NA Yuly <NA> 34 5 Intervention
Note: While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:
df1 <- tibble(id = c(1, 1, 2), value = 1:3)
df2 <- tibble(id = c(1, 1, 2), gender = c("Male", "Female", "Male"))
df1 %>% left_join(df2)
## # A tibble: 5 x 3
## id value gender
## <dbl> <int> <chr>
## 1 1 1 Male
## 2 1 1 Female
## 3 1 2 Male
## 4 1 2 Female
## 5 2 3 Male
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
merge1
## # A tibble: 5 x 4
## id names sex age
## <int> <chr> <chr> <dbl>
## 1 1 Ana F 20
## 2 2 Boc M 17
## 3 3 Cindy F 50
## 4 4 Doe M 30
## 5 5 Eva F 80
merge2
## # A tibble: 6 x 4
## studyid names group age
## <int> <chr> <chr> <dbl>
## 1 1 Ana Control 20
## 2 2 Boc Intervention 17
## 3 3 Cindy Control 50
## 4 4 Peter Control 64
## 5 5 Yuly Intervention 34
## 6 6 Doe Control 30
semi_join(merge1, merge2, by = c("names", "age"))
## # A tibble: 4 x 4
## id names sex age
## <int> <chr> <chr> <dbl>
## 1 1 Ana F 20
## 2 2 Boc M 17
## 3 3 Cindy F 50
## 4 4 Doe M 30
anti_join(merge1, merge2, by = c("names", "age"))
## # A tibble: 1 x 4
## id names sex age
## <int> <chr> <chr> <dbl>
## 1 5 Eva F 80
If you’re worried about what observations your joins will match, start with a semi_join() or anti_join(). semi_join() and anti_join() never duplicate; they only ever remove observations.
df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))
# Four rows to start with:
df1 %>% nrow()
## [1] 4
# And we get four rows after the join
df1 %>% inner_join(df2, by = "x") %>% nrow()
## [1] 4
# But only two rows actually match
df1 %>% semi_join(df2, by = "x") %>% nrow()
## [1] 2
Discuss
Here, we get 4 rows to start with, and after inner_join
- we get 4 rows that match. But when we check with
semi_join, only 2 rows are actually match. That means
extra 2 rows are newly created because when a match is not
unique, a join will add all possible combinations (the Cartesian
product) of the matching observations.
The final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:
More infomation can be found here:
https://dplyr.tidyverse.org/articles/two-table.html#observations
https://www.techtarget.com/searchdatamanagement/definition/raw-data
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html#:~:text=Tidy%20data%20is%20a%20standard,Every%20row%20is%20an%20observation.
https://r4ds.had.co.nz/tidy-data.html#introduction-6
https://dplyr.tidyverse.org/reference/filter.html
Field, A., Miles, J. and Field, Z., 2012. Discovering statistics using
R. Sage publications.