knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Previous sections:

  1. Clustering practice: https://rpubs.com/minhtri/923711
  2. Table Descriptive Analysis: https://rpubs.com/minhtri/929114
  3. Imputation - Dealing with missing data: https://rpubs.com/minhtri/968586

 

Note: This analysis is used for my own study purpose. In this section, I will summerize some basic commands for data wrangling/ cleaning in R environment based on several online sources.

 

R Packages:

# More packages will be shown during the analysis process
# Load the required library
library(tidyverse)    # Data Wrangling
library(conflicted)   # Dealing with conflict package
library(readxl)       # Read csv file

 

Dealing with Conflicts
There is a lot of packages here, and sometimes individual functions are in conflict. R’s default conflict resolution system gives precedence to the most recently loaded package. This can make it hard to detect conflicts, particularly when introduced by an update to an existing package.

  • Using the code below helps the entire section run properly. You may or may not need to look into the conflicted package for your work.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")

 

1 Simulated fakestroke data

Main data used in this notes: fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)

The fakestroke.csv file contains the following 18 variables for 500 patients:

# To make a table in R markdown: 1st row: header, 2nd row: Alignment; the remaining row: for content
## |Variable |  Description |
## |:------- | :----------  |
## |studyid  | Study ID  # (z001 through z500) | 
Variable Type Description
studyid Category Study ID # (z001 through z500)
trt Category Treatment group (Intervention or Control)
age Numeric Age in years
sex Category Male or Female
nihss Numeric NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits)
location Category Stroke Location - Left or Right Hemisphere
hx.isch Category History of Ischemic Stroke (Yes/No)
afib Category Atrial Fibrillation (1 = Yes, 0 = No)
dm Category Diabetes Mellitus (1 = Yes, 0 = No)
mrankin Category Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death)
sbp Numeric Systolic blood pressure, in mm Hg
iv.altep Category Treatment with IV alteplase (Yes/No)
time.iv Numeric Time from stroke onset to start of IV alteplase (minutes) if iv.altep=Yes
aspects Category Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes
ia.occlus Category Intracranial arterial occlusion, based on vessel imaging - five categories
extra.ica Category Extracranial ICA occlusion (1 = Yes, 0 = No)
time.rand Numeric Time from stroke onset to study randomization, in minutes
time.punc Numeric Time from stroke onset to groin puncture, in minutes (only if Intervention)

 

2 Getting and Cleaning data

4 things to have for data analysis:

  • A raw dataset
  • A tidy dataset
  • A code book
  • An instruction list that you use to go from raw dataset to tidy dataset.

2.1 Raw dataset

Raw data (sometimes called source data, atomic data or primary data) is data that has not been processed. Raw data that has undergone processing is sometimes referred to as cooked data.

Although raw data has the potential to become “information,” it requires selective extraction, organization and sometimes analysis and formatting for presentation. Because of processing, raw data sometimes ends up in a database, which enables the data to become accessible for further processing and analysis in a number of different ways.

In many instances, users must clean raw data before it can be used. Cleaning raw data may require parsing the data for easier ingestion into a computer, removing outliers or spurious results and, occasionally, reformatting or translating the data.

There are different levels of raw data:

  • It might be a strange binary file that whatever your measuring spits out.
  • An unformatted Excel file with ten worksheets that some company you contracted with sent you.
  • Something that was handed to you as an Excel file.

2.2 Tidy dataset

It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected.

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.
# ![](/Statistics/R/R data/Model practise/image/4. Tidy data/4. Three rules make dataset tidy.jpg)   

2.3 A codebook

It provides:

  • The information about the variables in the data set not contained in the tidy data: Explaining the meaning of each variable and the unit that is used.
  • Information about the summary choices that was made: which information you used to describe the data: mean, median, mode, etc
  • Information about the experimental study design that you used.

2.4 Instruction list

It is a computer script that helps you to go back to the raw data and reprocess it and get the same tidy data set. If that doesn’t happen, then there’s something going wrong in your data processing pipeline, and so, you want to be able to identify that and fix it.

  • Ideally a script in R, but can be in Python, as well.

  • The input for the script is the, raw data.

  • The output is going to be the processed, tidy data.

     

3 The R environment

3.1 Printing your system info

What R version are you using? There are frequently real flaws in earlier or latest versions of R or R packages, and your problem may be solved if you simply upgrade to the most recent version. It’s crucial to note if you’re using the most recent version of R. Therefore, it is important to know your R system and the information of packages.

Command: sessionInfo()

sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] readxl_1.4.1     conflicted_1.1.0 forcats_0.5.2    stringr_1.4.1   
##  [5] dplyr_1.0.10     purrr_0.3.4      readr_2.1.2      tidyr_1.2.0     
##  [9] tibble_3.1.8     ggplot2_3.3.6    tidyverse_1.3.2 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2    xfun_0.32           bslib_0.4.0        
##  [4] haven_2.5.1         gargle_1.2.0        colorspace_2.0-3   
##  [7] vctrs_0.4.1         generics_0.1.3      htmltools_0.5.3    
## [10] yaml_2.3.5          utf8_1.2.2          rlang_1.0.5        
## [13] jquerylib_0.1.4     pillar_1.8.1        withr_2.5.0        
## [16] glue_1.6.2          DBI_1.1.3           dbplyr_2.2.1       
## [19] modelr_0.1.9        lifecycle_1.0.1     munsell_0.5.0      
## [22] gtable_0.3.1        cellranger_1.1.0    rvest_1.0.3        
## [25] memoise_2.0.1       evaluate_0.16       knitr_1.40         
## [28] tzdb_0.3.0          fastmap_1.1.0       fansi_1.0.3        
## [31] broom_1.0.1         backports_1.4.1     scales_1.2.1       
## [34] googlesheets4_1.0.1 cachem_1.0.6        jsonlite_1.8.0     
## [37] fs_1.5.2            hms_1.1.2           digest_0.6.29      
## [40] stringi_1.7.8       grid_4.2.1          cli_3.3.0          
## [43] tools_4.2.1         magrittr_2.0.3      sass_0.4.2         
## [46] crayon_1.5.1        pkgconfig_2.0.3     ellipsis_0.3.2     
## [49] xml2_1.3.3          reprex_2.0.2        googledrive_2.0.0  
## [52] lubridate_1.8.0     assertthat_0.2.1    rmarkdown_2.16     
## [55] httr_1.4.4          rstudioapi_0.14     R6_2.5.1           
## [58] compiler_4.2.1

Command: devtools::session_info(c(“package name”))

devtools::session_info(c("ggplot2", "broom"))
## - Session info ---------------------------------------------------------------
##  setting  value
##  version  R version 4.2.1 (2022-06-23 ucrt)
##  os       Windows 10 x64 (build 17134)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United States.1252
##  ctype    English_United States.1252
##  tz       Asia/Bangkok
##  date     2022-11-10
##  pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
## 
## - Packages -------------------------------------------------------------------
##  package      * version date (UTC) lib source
##  backports      1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
##  broom          1.0.1   2022-08-29 [1] CRAN (R 4.2.1)
##  cli            3.3.0   2022-04-25 [1] CRAN (R 4.2.1)
##  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.2.1)
##  cpp11          0.4.2   2021-11-30 [1] CRAN (R 4.2.1)
##  digest         0.6.29  2021-12-01 [1] CRAN (R 4.2.1)
##  dplyr        * 1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.2.1)
##  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
##  farver         2.1.1   2022-07-06 [1] CRAN (R 4.2.1)
##  generics       0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
##  ggplot2      * 3.3.6   2022-05-03 [1] CRAN (R 4.2.1)
##  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
##  gtable         0.3.1   2022-09-01 [1] CRAN (R 4.2.1)
##  isoband        0.2.5   2021-07-13 [1] CRAN (R 4.2.1)
##  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
##  lattice        0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
##  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.2.1)
##  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
##  MASS           7.3-57  2022-04-22 [2] CRAN (R 4.2.1)
##  Matrix         1.4-1   2022-03-23 [2] CRAN (R 4.2.1)
##  mgcv           1.8-40  2022-03-29 [2] CRAN (R 4.2.1)
##  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.2.1)
##  nlme           3.1-157 2022-03-25 [2] CRAN (R 4.2.1)
##  pillar         1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
##  purrr        * 0.3.4   2020-04-17 [1] CRAN (R 4.2.1)
##  R6             2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
##  RColorBrewer   1.1-3   2022-04-03 [1] CRAN (R 4.2.0)
##  rlang          1.0.5   2022-08-31 [1] CRAN (R 4.2.1)
##  scales         1.2.1   2022-08-20 [1] CRAN (R 4.2.1)
##  stringi        1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
##  stringr      * 1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
##  tibble       * 3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
##  tidyr        * 1.2.0   2022-02-01 [1] CRAN (R 4.2.1)
##  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.2.1)
##  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
##  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.2.1)
##  viridisLite    0.4.1   2022-08-22 [1] CRAN (R 4.2.1)
##  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.1)
## 
##  [1] C:/Users/Admin/AppData/Local/R/win-library/4.2
##  [2] C:/Program Files/R/R-4.2.1/library
## 
## ------------------------------------------------------------------------------

3.2 Getting help in R

If you know the name of the function you seek support with, simply put a question mark ? at the command line prompt, followed by the function name.

Command: ? function_name()

? ggplot2()

Sometimes you can’t recall the exact name of the function, but you know the topic for which you need assistance (e.g. data input in this case). Use the help.search function (no question mark) with your query enclosed in double quotes, as shown below:

Command: help.search(“data input”)

# Instead of ggplot2, search gg:
help.search("gg")

The find function tells you what package something is in:

Command: find(“object name”)

find("glm")
## [1] "package:stats"
find("lm")
## [1] "package:stats"

 

apropos delivers a character vector containing the names of all objects in the search list that correspond to your (possibly incomplete) query:

Command: apropos(object name )

apropos("glm")
## [1] "glm"           "glm.control"   "glm.fit"       "predict.glm"  
## [5] "residuals.glm" "summary.glm"
apropos("lm")
##  [1] ".colMeans"       ".lm.fit"         "colMeans"        "confint.lm"     
##  [5] "contr.helmert"   "dummy.coef.lm"   "glm"             "glm.control"    
##  [9] "glm.fit"         "KalmanForecast"  "KalmanLike"      "KalmanRun"      
## [13] "KalmanSmooth"    "kappa.lm"        "lm"              "lm.fit"         
## [17] "lm.influence"    "lm.wfit"         "lmap"            "lmap_at"        
## [21] "lmap_if"         "model.matrix.lm" "nlm"             "nlminb"         
## [25] "predict.glm"     "predict.lm"      "residuals.glm"   "residuals.lm"   
## [29] "summary.glm"     "summary.lm"

 

3.3 Installing + Loading R packages

Install packages in two ways: through the menus or using a command.

Command: install.packages(“package.name”)
Example: > install.packages(“ggplot2”)

 

You need to install the package only once but you need to reference it each time you start a new session of R. To reference a package, we simply execute this general command:
library(package.name)
Example: > library(ggplot2)

 

3.4 Import data to R

If we have saved the data as a CSV file, then we can import these data to a dataframe using the read.csv function. Or we can replace read.csv to read.delim or read_excel. Besides, we can import directly through R markdown. The general form of this function is:

Command: df = read.csv(“filename.extension”, header = TRUE)

 

For example: I will using R markdown to import the dataset df:

df <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 

 

4 Descriptive analysis

4.1 Format checking: is. ()

This command is useful when we want to check the format of the dataset or variable within dataset before analysing:

Command Description
is.data.frame() whether the target is a dataframe.
is.tibble() whether the target format is tibble.
is.na() whether the object is NA.
is.factor() whether the object’s class is factor.
is.numeric() whether the object’s class is numeric.

Example 1: I want to check whether or not the df is a data frame and does it contain NA value ?

is.data.frame(df)
## [1] TRUE
is.tibble(df)
## [1] TRUE
head(is.na(df))
##      studyid   trt   age   sex nihss location hx.isch  afib    dm mrankin   sbp
## [1,]   FALSE FALSE FALSE FALSE FALSE    FALSE   FALSE FALSE FALSE   FALSE FALSE
## [2,]   FALSE FALSE FALSE FALSE FALSE    FALSE   FALSE FALSE FALSE   FALSE FALSE
## [3,]   FALSE FALSE FALSE FALSE FALSE    FALSE   FALSE FALSE FALSE   FALSE FALSE
## [4,]   FALSE FALSE FALSE FALSE FALSE    FALSE   FALSE FALSE FALSE   FALSE FALSE
## [5,]   FALSE FALSE FALSE FALSE FALSE    FALSE   FALSE FALSE FALSE   FALSE FALSE
## [6,]   FALSE FALSE FALSE FALSE FALSE    FALSE   FALSE FALSE FALSE   FALSE FALSE
##      iv.altep time.iv aspects ia.occlus extra.ica time.rand time.punc
## [1,]    FALSE   FALSE   FALSE     FALSE     FALSE     FALSE      TRUE
## [2,]    FALSE   FALSE   FALSE     FALSE     FALSE     FALSE     FALSE
## [3,]    FALSE    TRUE   FALSE     FALSE     FALSE     FALSE      TRUE
## [4,]    FALSE   FALSE   FALSE     FALSE     FALSE     FALSE      TRUE
## [5,]    FALSE   FALSE   FALSE     FALSE     FALSE     FALSE      TRUE
## [6,]    FALSE   FALSE   FALSE     FALSE     FALSE     FALSE      TRUE

 

Example 2: I want to check whether or not the studyid ; age are factor and numeric, respectively ?

is.factor(df$studyid)
## [1] FALSE
is.numeric(df$age)
## [1] TRUE

Discuss

  • To be able to use the tidyverse package, the dataset format should be tibble. df are both dataframe and tibble.

  • The is.na() command returns a value of true and false for each value in a data set. If the value is NA the is.na() function return the value of true, otherwise, return to a value of false. It is easy to note that the time.punc variable contains lots of NA value.

  • The studyid variable is not a factor. So we should convert these character variables to factor format before progressing to analysis stage.

  • The age variable is numeric variable. => Later, I will show you how to convert all character variables to factor, while maintaining the other variables as numeric variables.

     

4.2 dim(dataset)

To check the number of variables (column) and observations (row) in the dataset:

Command: dim(dataset)

dim(df)
## [1] 500  18

Discuss
This dataset has 500 observations with 18 variables.

 

4.3 names()

In the previous section, we know that this dataset has 500 observations with 18 variables. So what are these variables:

Command: names(dataset)

names(df)
##  [1] "studyid"   "trt"       "age"       "sex"       "nihss"     "location" 
##  [7] "hx.isch"   "afib"      "dm"        "mrankin"   "sbp"       "iv.altep" 
## [13] "time.iv"   "aspects"   "ia.occlus" "extra.ica" "time.rand" "time.punc"

 

4.4 summary() and str()

To have some information of every variables in the dataframe:

  • For text-based variable: It will tell you the count of each categorical variable.
  • For a quantitative variable: min, median, max, Q1, Q3, etc.

Command: summary(dataset)

 

Example 1: Using command with the original dataset:

summary(df)
##    studyid              trt                 age            sex           
##  Length:500         Length:500         Min.   :23.00   Length:500        
##  Class :character   Class :character   1st Qu.:55.00   Class :character  
##  Mode  :character   Mode  :character   Median :65.75   Mode  :character  
##                                        Mean   :64.71                     
##                                        3rd Qu.:76.00                     
##                                        Max.   :96.00                     
##                                                                          
##      nihss         location           hx.isch               afib     
##  Min.   :10.00   Length:500         Length:500         Min.   :0.00  
##  1st Qu.:14.00   Class :character   Class :character   1st Qu.:0.00  
##  Median :18.00   Mode  :character   Mode  :character   Median :0.00  
##  Mean   :18.03                                         Mean   :0.27  
##  3rd Qu.:22.00                                         3rd Qu.:1.00  
##  Max.   :28.00                                         Max.   :1.00  
##                                                                      
##        dm          mrankin               sbp          iv.altep        
##  Min.   :0.000   Length:500         Min.   : 78.0   Length:500        
##  1st Qu.:0.000   Class :character   1st Qu.:128.5   Class :character  
##  Median :0.000   Mode  :character   Median :145.0   Mode  :character  
##  Mean   :0.126                      Mean   :145.5                     
##  3rd Qu.:0.000                      3rd Qu.:162.5                     
##  Max.   :1.000                      Max.   :231.0                     
##                                     NA's   :1                         
##     time.iv          aspects        ia.occlus           extra.ica     
##  Min.   : 42.00   Min.   : 5.000   Length:500         Min.   :0.0000  
##  1st Qu.: 67.00   1st Qu.: 7.000   Class :character   1st Qu.:0.0000  
##  Median : 86.00   Median : 9.000   Mode  :character   Median :0.0000  
##  Mean   : 92.64   Mean   : 8.506                      Mean   :0.2906  
##  3rd Qu.:115.00   3rd Qu.:10.000                      3rd Qu.:1.0000  
##  Max.   :218.00   Max.   :10.000                      Max.   :1.0000  
##  NA's   :55       NA's   :4                           NA's   :1       
##    time.rand       time.punc  
##  Min.   :100.0   Min.   :180  
##  1st Qu.:151.2   1st Qu.:212  
##  Median :201.5   Median :260  
##  Mean   :208.6   Mean   :263  
##  3rd Qu.:257.8   3rd Qu.:313  
##  Max.   :360.0   Max.   :360  
##  NA's   :2       NA's   :267
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : chr [1:500] "z001" "z002" "z003" "z004" ...
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

Discuss
There are 2 problems with this data:

  1. All the categorical variables are defined as character. As a result, the summary shows the length of them instead of which factor the variables have.
  2. In the raw dataset: afib, dm, extra.ica variables are noted as “1” and “0”, and the program calculate the mean-sd instead of showing the factor of “1”-Yes and “0”-No. Therefore, I must convert them back to “Yes-No” for character variables.

 

Example 2: Using the same dataset, but solving the 2 problems mentioned above (The way to do this will be shown later in this practice):

data = df

# Changing coding variable to categorical:
data$afib       <- factor(data$afib , levels=0:1, labels=c("No", "Yes"))
data$dm         <- factor(data$dm , levels=0:1, labels=c("No", "Yes"))
data$extra.ica  <- factor(data$extra.ica, levels=0:1, labels=c("No", "Yes"))

# Convert character to categorical variables:
data <- data %>%mutate_if(is.character, as.factor)

# Relevel factors:
data$trt = factor(data$trt, levels=c("Intervention", "Control"))
data$mrankin = factor(data$mrankin, levels=c("0", "1", "2", "> 2"))
data$ia.occlus = factor(data$ia.occlus, levels=c("Intracranial ICA", "ICA with M1", 
                                             "M1", "M2", "A1 or A2"))

# Test the commands:
summary(data)
##     studyid              trt           age            sex          nihss      
##  z001   :  1   Intervention:233   Min.   :23.00   Female:208   Min.   :10.00  
##  z002   :  1   Control     :267   1st Qu.:55.00   Male  :292   1st Qu.:14.00  
##  z003   :  1                      Median :65.75                Median :18.00  
##  z004   :  1                      Mean   :64.71                Mean   :18.03  
##  z005   :  1                      3rd Qu.:76.00                3rd Qu.:22.00  
##  z006   :  1                      Max.   :96.00                Max.   :28.00  
##  (Other):494                                                                  
##   location   hx.isch    afib       dm      mrankin        sbp        iv.altep 
##  Left :269   No :446   No :365   No :437   0  :404   Min.   : 78.0   No : 55  
##  Right:231   Yes: 54   Yes:135   Yes: 63   1  : 50   1st Qu.:128.5   Yes:445  
##                                            2  : 25   Median :145.0            
##                                            > 2: 21   Mean   :145.5            
##                                                      3rd Qu.:162.5            
##                                                      Max.   :231.0            
##                                                      NA's   :1                
##     time.iv          aspects                  ia.occlus   extra.ica 
##  Min.   : 42.00   Min.   : 5.000   Intracranial ICA:  4   No  :354  
##  1st Qu.: 67.00   1st Qu.: 7.000   ICA with M1     :134   Yes :145  
##  Median : 86.00   Median : 9.000   M1              :319   NA's:  1  
##  Mean   : 92.64   Mean   : 8.506   M2              : 39             
##  3rd Qu.:115.00   3rd Qu.:10.000   A1 or A2        :  3             
##  Max.   :218.00   Max.   :10.000   NA's            :  1             
##  NA's   :55       NA's   :4                                         
##    time.rand       time.punc  
##  Min.   :100.0   Min.   :180  
##  1st Qu.:151.2   1st Qu.:212  
##  Median :201.5   Median :260  
##  Mean   :208.6   Mean   :263  
##  3rd Qu.:257.8   3rd Qu.:313  
##  Max.   :360.0   Max.   :360  
##  NA's   :2       NA's   :267
str(data)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : Factor w/ 500 levels "z001","z002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ trt      : Factor w/ 2 levels "Intervention",..: 2 1 2 2 2 2 1 2 2 1 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "0","1","2","> 2": 3 1 1 1 1 3 1 1 3 1 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 5 levels "Intracranial ICA",..: 3 3 2 2 3 5 3 3 3 3 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...

Discuss

  • After converting, in the summary command, the factors and their levels of categorical variables are shown instead of length.
  • Similar results are also shown in str command.

Note: From now to the end of this practicle, let clarify the 2 datasets that I will use:

  • df: The raw dataset

  • data: The tidy dataset

     

4.5 head() and tail()

These 2 commands help to view the top / bottom n rows of the dataset. By default, n=6 :

Commands:

  • head(dataset, n = )
  • tail(dataset, n = )
head(data, n = 3)
## # A tibble: 3 x 18
##   studyid trt          age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>      <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
## 1 z001    Control       53 Male     21 Right   No      No    No    2         127
## 2 z002    Intervent~    51 Male     23 Left    No      Yes   No    0         137
## 3 z003    Control       68 Fema~    11 Right   No      No    No    0         138
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## #   abbreviated variable name 1: location
tail(data, n=3)
## # A tibble: 3 x 18
##   studyid trt          age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>      <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
## 1 z498    Intervent~    61 Fema~    21 Right   No      Yes   No    1         121
## 2 z499    Intervent~    23 Male     23 Left    Yes     Yes   No    0         123
## 3 z500    Control       66 Male     24 Left    No      Yes   No    0         190
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## #   abbreviated variable name 1: location

 

4.6 compareGroups() and table1()

Detail about table Descriptive Analysis: https://rpubs.com/minhtri/929114

4.6.1 compareGroups()

Our goal, then, is to take the data in and use it to generate a Table for the study that compares the 233 patients in the Intervention group to the 267 patients in the Control group, on all of the other variables (except study ID #) available. I’ll use the compareGroups package of functions available in R to help me complete this task.

Package: compareGroups
Command: createTable (compareGroups (group ~ var1 + var2 + …., data = … , method = c ( var = NA)))

  • group: name of the main group(s) you want to summerize/ put to header.
  • var1; var2: numeric or categorical variables.
    ** For numeric var: Mean (SD) for normal distribution or Median (Q1, Q3) for non-normal
    ** For categorical var: arragne in order: count (%)
  • method: Possible values in methods statement:

1: forces analysis as normal-distributed
2: forces analysis as continuous non-normal
3: forces analysis as categorical
NA: performs a Shapiro-Wilks test to decide between normal or non-normal (preferable). If the method argument is stated as NA for a variable, then a Shapiro-Wilk test for normality is used to decide if the variable is normal or non-normal distributed.

 

Example 1: Becasue I don’t know the distribution of the variables in the data, I will use argument method = c(var = NA). With this argument, the program will decide which mean or median should be shown based on the data distribution in each variable groups:

library(compareGroups)
OR = createTable( compareGroups( trt ~ age + sex + nihss + location + hx.isch + afib + dm +
                              mrankin + sbp + iv.altep + time.iv + aspects + ia.occlus +
                              extra.ica + time.rand + time.punc, 
                            data = data,
                            method = c(age=NA, nihss=NA, sbp = NA, time.iv=NA, aspects=NA, 
                                       time.rand =NA, time.punc =NA)))
OR
## 
## --------Summary descriptives table by 'trt'---------
## 
## ________________________________________________________________ 
##                        Intervention       Control      p.overall 
##                           N=233            N=267                 
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## age                  65.8 [54.5;76.0] 65.7 [55.8;76.2]   0.579   
## sex:                                                     0.917   
##     Female              98 (42.1%)      110 (41.2%)              
##     Male               135 (57.9%)      157 (58.8%)              
## nihss                17.0 [14.0;21.0] 18.0 [14.0;22.0]   0.453   
## location:                                                0.111   
##     Left               116 (49.8%)      153 (57.3%)              
##     Right              117 (50.2%)      114 (42.7%)              
## hx.isch:                                                 0.335   
##     No                 204 (87.6%)      242 (90.6%)              
##     Yes                 29 (12.4%)       25 (9.36%)              
## afib:                                                    0.601   
##     No                 167 (71.7%)      198 (74.2%)              
##     Yes                 66 (28.3%)       69 (25.8%)              
## dm:                                                      1.000   
##     No                 204 (87.6%)      233 (87.3%)              
##     Yes                 29 (12.4%)       34 (12.7%)              
## mrankin:                                                 0.922   
##     0                  190 (81.5%)      214 (80.1%)              
##     1                   21 (9.01%)       29 (10.9%)              
##     2                   12 (5.15%)       13 (4.87%)              
##     > 2                 10 (4.29%)       11 (4.12%)              
## sbp                     146 (26.0)       145 (24.4)      0.649   
## iv.altep:                                                0.267   
##     No                  30 (12.9%)       25 (9.36%)              
##     Yes                203 (87.1%)      242 (90.6%)              
## time.iv              85.0 [67.0;110]  87.0 [65.0;116]    0.596   
## aspects              9.00 [7.00;10.0] 9.00 [8.00;10.0]   0.075   
## ia.occlus:                                               0.819   
##     Intracranial ICA    1 (0.43%)        3 (1.13%)               
##     ICA with M1         59 (25.3%)       75 (28.2%)              
##     M1                 154 (66.1%)      165 (62.0%)              
##     M2                  18 (7.73%)       21 (7.89%)              
##     A1 or A2            1 (0.43%)        2 (0.75%)               
## extra.ica:                                               0.179   
##     No                 158 (67.8%)      196 (73.7%)              
##     Yes                 75 (32.2%)       70 (26.3%)              
## time.rand             204 [152;250]    196 [149;266]     0.251   
## time.punc             260 [212;313]       . [.;.]          .     
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Example 2: Show missingness

missingTable(OR)
## 
## --------Missingness table by 'trt'---------
## 
## ___________________________________________ 
##           Intervention  Control   p.overall 
##              N=233       N=267              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## age        0 (0.00%)   0 (0.00%)      .     
## sex        0 (0.00%)   0 (0.00%)      .     
## nihss      0 (0.00%)   0 (0.00%)      .     
## location   0 (0.00%)   0 (0.00%)      .     
## hx.isch    0 (0.00%)   0 (0.00%)      .     
## afib       0 (0.00%)   0 (0.00%)      .     
## dm         0 (0.00%)   0 (0.00%)      .     
## mrankin    0 (0.00%)   0 (0.00%)      .     
## sbp        0 (0.00%)   1 (0.37%)    1.000   
## iv.altep   0 (0.00%)   0 (0.00%)      .     
## time.iv    30 (12.9%)  25 (9.36%)   0.267   
## aspects    0 (0.00%)   4 (1.50%)    0.127   
## ia.occlus  0 (0.00%)   1 (0.37%)    1.000   
## extra.ica  0 (0.00%)   1 (0.37%)    1.000   
## time.rand  2 (0.86%)   0 (0.00%)    0.217   
## time.punc  0 (0.00%)   267 (100%)  <0.001   
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

4.6.2 table1()

Command: table1( ~ var1 + var2 + … | group, data = …)

  • var1; var2: numeric or categorical variables
    ** For numeric var: Mean (SD) – Median (Q1, Q3)
    ** For categorical var: arragne in order: count (%)
  • group: name of the main group(s) you want to summerize/ put to header.

Using built-in styles for table1():

  • zebra: alternating shaded and unshaded rows (zebra stripes)
  • grid: show all grid lines
  • shade: shade the header row(s) in gray
  • times: use a serif font
  • center: center all columns, including the first which contains the row labels

Adjust the components of table:

  • render.continuous = c(.=“Mean (SD)”, .=“Median [Q1, Q3]”)
  • render.continuous = c(.=“Mean (SD)”, .=“Median [Mix, Max]”)
library(table1)
table1( ~ age + sex + nihss + location + hx.isch + afib + dm + mrankin +
          sbp + iv.altep + time.iv + aspects + ia.occlus + extra.ica + time.rand + 
          time.punc | trt, data = data, topclass="Rtable1-zebra",
          render.continuous = c(.="Mean (SD)", .="Median [Q1, Q3]"),
          overall="Total")
Intervention
(N=233)
Control
(N=267)
Total
(N=500)
age
Mean (SD) 63.9 (18.1) 65.4 (16.1) 64.7 (17.1)
Median [Q1, Q3] 65.8 [54.5, 76.0] 65.7 [55.8, 76.2] 65.8 [55.0, 76.0]
sex
Female 98 (42.1%) 110 (41.2%) 208 (41.6%)
Male 135 (57.9%) 157 (58.8%) 292 (58.4%)
nihss
Mean (SD) 18.0 (5.04) 18.1 (4.32) 18.0 (4.67)
Median [Q1, Q3] 17.0 [14.0, 21.0] 18.0 [14.0, 22.0] 18.0 [14.0, 22.0]
location
Left 116 (49.8%) 153 (57.3%) 269 (53.8%)
Right 117 (50.2%) 114 (42.7%) 231 (46.2%)
hx.isch
No 204 (87.6%) 242 (90.6%) 446 (89.2%)
Yes 29 (12.4%) 25 (9.4%) 54 (10.8%)
afib
No 167 (71.7%) 198 (74.2%) 365 (73.0%)
Yes 66 (28.3%) 69 (25.8%) 135 (27.0%)
dm
No 204 (87.6%) 233 (87.3%) 437 (87.4%)
Yes 29 (12.4%) 34 (12.7%) 63 (12.6%)
mrankin
0 190 (81.5%) 214 (80.1%) 404 (80.8%)
1 21 (9.0%) 29 (10.9%) 50 (10.0%)
2 12 (5.2%) 13 (4.9%) 25 (5.0%)
> 2 10 (4.3%) 11 (4.1%) 21 (4.2%)
sbp
Mean (SD) 146 (26.0) 145 (24.4) 145 (25.1)
Median [Q1, Q3] 146 [129, 164] 145 [128, 161] 145 [129, 163]
Missing 0 (0%) 1 (0.4%) 1 (0.2%)
iv.altep
No 30 (12.9%) 25 (9.4%) 55 (11.0%)
Yes 203 (87.1%) 242 (90.6%) 445 (89.0%)
time.iv
Mean (SD) 98.2 (45.5) 88.0 (26.0) 92.6 (36.5)
Median [Q1, Q3] 85.0 [67.0, 110] 87.0 [65.0, 116] 86.0 [67.0, 115]
Missing 30 (12.9%) 25 (9.4%) 55 (11.0%)
aspects
Mean (SD) 8.35 (1.64) 8.65 (1.47) 8.51 (1.56)
Median [Q1, Q3] 9.00 [7.00, 10.0] 9.00 [8.00, 10.0] 9.00 [7.00, 10.0]
Missing 0 (0%) 4 (1.5%) 4 (0.8%)
ia.occlus
Intracranial ICA 1 (0.4%) 3 (1.1%) 4 (0.8%)
ICA with M1 59 (25.3%) 75 (28.1%) 134 (26.8%)
M1 154 (66.1%) 165 (61.8%) 319 (63.8%)
M2 18 (7.7%) 21 (7.9%) 39 (7.8%)
A1 or A2 1 (0.4%) 2 (0.7%) 3 (0.6%)
Missing 0 (0%) 1 (0.4%) 1 (0.2%)
extra.ica
No 158 (67.8%) 196 (73.4%) 354 (70.8%)
Yes 75 (32.2%) 70 (26.2%) 145 (29.0%)
Missing 0 (0%) 1 (0.4%) 1 (0.2%)
time.rand
Mean (SD) 203 (57.3) 214 (70.3) 209 (64.8)
Median [Q1, Q3] 204 [152, 250] 196 [149, 266] 202 [151, 258]
Missing 2 (0.9%) 0 (0%) 2 (0.4%)
time.punc
Mean (SD) 263 (54.2) NA 263 (54.2)
Median [Q1, Q3] 260 [212, 313] NA 260 [212, 313]
Missing 0 (0%) 267 (100%) 267 (53.4%)

 

Summary:

compareGroups()

  • Provide detail Mean-Median for numeric variable; Count-% for character variable, missing value.
  • For numeric variable: Provide formula for normal or non-normal distribution data.
  • Provide p-value.

table1()

  • Provide full detail Mean-Median for numeric variable; Count-% for character variable, missing value.

  • For numeric variable: We must include extra steps to know the distribution normal or non-normal.
    Normal distribution: Mean (SD)
    ** Non-normal: Median (Q1-Q3)

  • Do not show the p-value.

     

4.7 by()

The function by allow summary of the dataframe on the basis of factor levels.

Package: pastecs
Command: by(dataframe, categorical variables, FUN = stat.desc, basic = , norm = )

  • dataframe: Name of the data we use.
  • categorical variable: the group variable.
  • Fun = : a function to be applied to (usually data-frame) subsets of data.
  • basic = : whether or not to show basic attribute.
  • norm = : whether or not to include normality test result (Shapiro-Wilk test).

Example: Summary the data based on sex:

library(pastecs)
by(data, data$sex, stat.desc, basic = F, norm = T) 
## data$sex: Female
##            studyid trt           age sex         nihss location hx.isch afib dm
## median          NA  NA  6.600000e+01  NA  1.700000e+01       NA      NA   NA NA
## mean            NA  NA  6.562837e+01  NA  1.790865e+01       NA      NA   NA NA
## SE.mean         NA  NA  1.246515e+00  NA  3.334470e-01       NA      NA   NA NA
## CI.mean         NA  NA  2.457492e+00  NA  6.573876e-01       NA      NA   NA NA
## var             NA  NA  3.231902e+02  NA  2.312688e+01       NA      NA   NA NA
## std.dev         NA  NA  1.797749e+01  NA  4.809042e+00       NA      NA   NA NA
## coef.var        NA  NA  2.739287e-01  NA  2.685317e-01       NA      NA   NA NA
## skewness        NA  NA -3.380714e-01  NA  3.601314e-01       NA      NA   NA NA
## skew.2SE        NA  NA -1.002383e+00  NA  1.067792e+00       NA      NA   NA NA
## kurtosis        NA  NA -6.166876e-01  NA -9.104475e-01       NA      NA   NA NA
## kurt.2SE        NA  NA -9.184789e-01  NA -1.355997e+00       NA      NA   NA NA
## normtest.W      NA  NA  9.660446e-01  NA  9.535481e-01       NA      NA   NA NA
## normtest.p      NA  NA  6.691984e-05  NA  2.789767e-06       NA      NA   NA NA
##            mrankin         sbp iv.altep      time.iv       aspects ia.occlus
## median          NA 145.0000000       NA 9.000000e+01  9.000000e+00        NA
## mean            NA 144.6714976       NA 9.562500e+01  8.509615e+00        NA
## SE.mean         NA   1.8169069       NA 2.660020e+00  1.060223e-01        NA
## CI.mean         NA   3.5821167       NA 5.248250e+00  2.090219e-01        NA
## var             NA 683.3381643       NA 1.301930e+03  2.338071e+00        NA
## std.dev         NA  26.1407376       NA 3.608226e+01  1.529075e+00        NA
## coef.var        NA   0.1806903       NA 3.773308e-01  1.796879e-01        NA
## skewness        NA   0.0565986       NA 8.671872e-01 -8.330367e-01        NA
## skew.2SE        NA   0.1674170       NA 2.420556e+00 -2.469958e+00        NA
## kurtosis        NA  -0.2152592       NA 4.601769e-01 -3.927604e-01        NA
## kurt.2SE        NA  -0.3198477       NA 6.455886e-01 -5.849674e-01        NA
## normtest.W      NA   0.9971516       NA 9.388386e-01  8.463962e-01        NA
## normtest.p      NA   0.9715080       NA 4.801444e-07  1.450610e-13        NA
##            extra.ica     time.rand     time.punc
## median            NA  1.955000e+02  2.530000e+02
## mean              NA  2.041068e+02  2.611429e+02
## SE.mean           NA  4.121473e+00  5.592442e+00
## CI.mean           NA  8.125911e+00  1.109945e+01
## var               NA  3.499228e+03  3.064990e+03
## std.dev           NA  5.915427e+01  5.536235e+01
## coef.var          NA  2.898202e-01  2.120002e-01
## skewness          NA  3.619128e-01  1.966935e-01
## skew.2SE          NA  1.067975e+00  4.034589e-01
## kurtosis          NA -7.925322e-01 -1.467434e+00
## kurt.2SE          NA -1.174820e+00 -1.519200e+00
## normtest.W        NA  9.685555e-01  9.091972e-01
## normtest.p        NA  1.467393e-04  4.785416e-06
## ------------------------------------------------------------ 
## data$sex: Male
##            studyid trt           age sex         nihss location hx.isch afib dm
## median          NA  NA  6.500000e+01  NA  1.800000e+01       NA      NA   NA NA
## mean            NA  NA  6.404863e+01  NA  1.810959e+01       NA      NA   NA NA
## SE.mean         NA  NA  9.577055e-01  NA  2.675487e-01       NA      NA   NA NA
## CI.mean         NA  NA  1.884908e+00  NA  5.265759e-01       NA      NA   NA NA
## var             NA  NA  2.678224e+02  NA  2.090204e+01       NA      NA   NA NA
## std.dev         NA  NA  1.636528e+01  NA  4.571875e+00       NA      NA   NA NA
## coef.var        NA  NA  2.555133e-01  NA  2.524560e-01       NA      NA   NA NA
## skewness        NA  NA -3.460726e-01  NA  2.139443e-01       NA      NA   NA NA
## skew.2SE        NA  NA -1.213298e+00  NA  7.500685e-01       NA      NA   NA NA
## kurtosis        NA  NA -2.248394e-01  NA -9.225550e-01       NA      NA   NA NA
## kurt.2SE        NA  NA -3.954482e-01  NA -1.622592e+00       NA      NA   NA NA
## normtest.W      NA  NA  9.768745e-01  NA  9.658956e-01       NA      NA   NA NA
## normtest.p      NA  NA  1.155718e-04  NA  2.184316e-06       NA      NA   NA NA
##            mrankin          sbp iv.altep      time.iv       aspects ia.occlus
## median          NA 145.50000000       NA 8.300000e+01  9.000000e+00        NA
## mean            NA 146.05136986       NA 9.054023e+01  8.503472e+00        NA
## SE.mean         NA   1.42957605       NA 2.275761e+00  9.294175e-02        NA
## CI.mean         NA   2.81361946       NA 4.481269e+00  1.829339e-01        NA
## var             NA 596.75680224       NA 1.351742e+03  2.487793e+00        NA
## std.dev         NA  24.42860623       NA 3.676604e+01  1.577274e+00        NA
## coef.var        NA   0.16726037       NA 4.060741e-01  1.854859e-01        NA
## skewness        NA   0.04620720       NA 1.289112e+00 -7.841729e-01        NA
## skew.2SE        NA   0.16199812       NA 4.275439e+00 -2.730533e+00        NA
## kurtosis        NA  -0.05936071       NA 1.804211e+00 -5.072399e-01        NA
## kurt.2SE        NA  -0.10440377       NA 3.003037e+00 -8.861054e-01        NA
## normtest.W      NA   0.99667438       NA 8.907261e-01  8.400948e-01        NA
## normtest.p      NA   0.80618706       NA 8.833036e-13  1.404705e-16        NA
##            extra.ica     time.rand     time.punc
## median            NA  2.055000e+02  2.650000e+02
## mean              NA  2.117774e+02  2.643852e+02
## SE.mean           NA  4.003036e+00  4.608942e+00
## CI.mean           NA  7.878573e+00  9.115683e+00
## var               NA  4.679095e+03  2.867716e+03
## std.dev           NA  6.840391e+01  5.355106e+01
## coef.var          NA  3.229991e-01  2.025494e-01
## skewness          NA  3.772407e-01  4.516589e-02
## skew.2SE          NA  1.322571e+00  1.082980e-01
## kurtosis          NA -9.122173e-01 -1.259536e+00
## kurt.2SE          NA -1.604410e+00 -1.520617e+00
## normtest.W        NA  9.506005e-01  9.479431e-01
## normtest.p        NA  2.359023e-08  5.716403e-05
by(df$age, df$sex, shapiro.test)
## df$sex: Female
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.96604, p-value = 6.692e-05
## 
## ------------------------------------------------------------ 
## df$sex: Male
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.97687, p-value = 0.0001156

Discuss

  • The Shapiro-Wilk test result between the 2 commands are the same: normtest.W and normtest.p in 1st command equal to W and p-value in 2nd command.

  • This by() commands provide us an overview about the basic characteristics of dataset.

     

4.8 Frequency: table() + xtab() + Crosstable()

To summerize factors in each variable.

4.8.1 table()

Frequency table: table(variable, useNA = “ifany”)

  • useNA= “ifany”: if there are any missing value, they will also be created in the table.
  • By default, the table() function does not tell the number of missing value.

Command for conditions: table(variable %in% value)

To list the variable for condition above: dataset[variable %in% c(“Condition 1”, “Condition 2”) , ]

OR similar formular dataset [ variable == Condition 1 | variable == Condition 1, ]

 

Example 1: Frquency for 1 variable:

table(data$trt)
## 
## Intervention      Control 
##          233          267

 

Example 2: Frquency for 2 variables:

table(data$trt, data$sex)
##               
##                Female Male
##   Intervention     98  135
##   Control         110  157
table(data$trt, data$age)
##               
##                23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
##   Intervention  3  1  0  1  3  1  1  3  5  1  1  1  1  0  1  2  0  4  1  3  1
##   Control       0  1  1  1  0  2  0  0  0  1  3  3  5  3  1  3  2  2  1  0  1
##               
##                44 45 46 47 50 51 52 53 54 54.5 55 55.5 56 57 58 59 60 61 62 63
##   Intervention  3  4  4  2  2  3  2  1  2    2  5    0  9  9  4  2  6  4  5  5
##   Control       2  0  0  0  2  2  2  5 12    1 10    1  7  3 15  8  6  2  2  1
##               
##                64 65 65.7 65.8 66 67 68 69 70 71 72 73 74 75 76 76.4 77 78 79
##   Intervention  6  2    0    1  6  6  8  5  3  2  7  4  5  5 10    0  2  1  5
##   Control       5 17    1    0  8  6  9  9  8  4  5  7  3  6  1    1  0  6  1
##               
##                80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
##   Intervention  0  4  3  1  3  5  1  3  5  4  4  1  2  6  1  2  2
##   Control       2  7  4  1  5  3  5  5  3  3  4  3  9  4  1  0  0

 

Example 3: How many observations have age 75 and age 96 and list them?

table(data$age %in% c("75", "96"))
## 
## FALSE  TRUE 
##   487    13
data[data$age %in% c("75", "96") ,  ]
## # A tibble: 13 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z007    Interven~    75 Male     25 Right   No      No    No    0         140
##  2 z009    Control      75 Male     25 Left    No      Yes   No    2         129
##  3 z055    Interven~    75 Fema~    25 Right   No      No    No    0         127
##  4 z086    Interven~    96 Male     14 Right   No      No    Yes   0         139
##  5 z148    Control      75 Male     23 Right   No      No    No    0         231
##  6 z170    Interven~    75 Male     18 Left    No      Yes   No    0         126
##  7 z180    Control      75 Male     18 Right   No      No    No    0         177
##  8 z343    Interven~    75 Male     14 Left    No      No    No    0         162
##  9 z344    Control      75 Fema~    13 Left    Yes     No    Yes   0         125
## 10 z374    Interven~    96 Male     26 Left    No      No    No    0         124
## 11 z401    Control      75 Male     20 Left    No      No    No    0         176
## 12 z404    Control      75 Fema~    22 Left    No      No    No    0         154
## 13 z485    Interven~    75 Male     20 Right   No      Yes   No    0         140
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## #   abbreviated variable name 1: location
data [ data$age == 75 | data$age == 96, ]
## # A tibble: 13 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z007    Interven~    75 Male     25 Right   No      No    No    0         140
##  2 z009    Control      75 Male     25 Left    No      Yes   No    2         129
##  3 z055    Interven~    75 Fema~    25 Right   No      No    No    0         127
##  4 z086    Interven~    96 Male     14 Right   No      No    Yes   0         139
##  5 z148    Control      75 Male     23 Right   No      No    No    0         231
##  6 z170    Interven~    75 Male     18 Left    No      Yes   No    0         126
##  7 z180    Control      75 Male     18 Right   No      No    No    0         177
##  8 z343    Interven~    75 Male     14 Left    No      No    No    0         162
##  9 z344    Control      75 Fema~    13 Left    Yes     No    Yes   0         125
## 10 z374    Interven~    96 Male     26 Left    No      No    No    0         124
## 11 z401    Control      75 Male     20 Left    No      No    No    0         176
## 12 z404    Control      75 Fema~    22 Left    No      No    No    0         154
## 13 z485    Interven~    75 Male     20 Right   No      Yes   No    0         140
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## #   abbreviated variable name 1: location

 

4.8.2 xtabs()

To create the frequency table of 1 or 2 categorical variables:
xtabs( ~ variable 1 + variable 2, data = dataframe)

Example 1: Frquency for 1 variable:

xtabs( ~ data$trt, data = data)
## data$trt
## Intervention      Control 
##          233          267

Example 2: Frquency for 2 variable:

xtabs( ~ data$trt + data$sex, data = data)
##               data$sex
## data$trt       Female Male
##   Intervention     98  135
##   Control         110  157

 

4.8.3 CrossTable()

This is the function for Categorical variables testing: Chi squared and Fisher exact test. I will menton this topic in another day. Let focus on how this function help us create 2x2 table:

Full command: CrossTable(predictor, outcome, fisher = TRUE, chisq = TRUE, expected = TRUE, sresid = TRUE, format = “SAS”/“SPSS”)

library(gmodels)
CrossTable(data$trt, data$sex)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  500 
## 
##  
##              | data$sex 
##     data$trt |    Female |      Male | Row Total | 
## -------------|-----------|-----------|-----------|
## Intervention |        98 |       135 |       233 | 
##              |     0.012 |     0.008 |           | 
##              |     0.421 |     0.579 |     0.466 | 
##              |     0.471 |     0.462 |           | 
##              |     0.196 |     0.270 |           | 
## -------------|-----------|-----------|-----------|
##      Control |       110 |       157 |       267 | 
##              |     0.010 |     0.007 |           | 
##              |     0.412 |     0.588 |     0.534 | 
##              |     0.529 |     0.538 |           | 
##              |     0.220 |     0.314 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       208 |       292 |       500 | 
##              |     0.416 |     0.584 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
CrossTable(data$trt, data$sex, fisher = TRUE, chisq = TRUE, expected = TRUE, sresid = TRUE, format = "SPSS")
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |         Expected Values |
## | Chi-square contribution |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |            Std Residual |
## |-------------------------|
## 
## Total Observations in Table:  500 
## 
##              | data$sex 
##     data$trt |   Female  |     Male  | Row Total | 
## -------------|-----------|-----------|-----------|
## Intervention |       98  |      135  |      233  | 
##              |   96.928  |  136.072  |           | 
##              |    0.012  |    0.008  |           | 
##              |   42.060% |   57.940% |   46.600% | 
##              |   47.115% |   46.233% |           | 
##              |   19.600% |   27.000% |           | 
##              |    0.109  |   -0.092  |           | 
## -------------|-----------|-----------|-----------|
##      Control |      110  |      157  |      267  | 
##              |  111.072  |  155.928  |           | 
##              |    0.010  |    0.007  |           | 
##              |   41.199% |   58.801% |   53.400% | 
##              |   52.885% |   53.767% |           | 
##              |   22.000% |   31.400% |           | 
##              |   -0.102  |    0.086  |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      208  |      292  |      500  | 
##              |   41.600% |   58.400% |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  0.03801773     d.f. =  1     p =  0.8454075 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  0.01082402     d.f. =  1     p =  0.9171387 
## 
##  
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio:  1.03604 
## 
## Alternative hypothesis: true odds ratio is not equal to 1
## p =  0.8561188 
## 95% confidence interval:  0.7138417 1.50323 
## 
## Alternative hypothesis: true odds ratio is less than 1
## p =  0.6126348 
## 95% confidence interval:  0 1.419597 
## 
## Alternative hypothesis: true odds ratio is greater than 1
## p =  0.4584408 
## 95% confidence interval:  0.7559892 Inf 
## 
## 
##  
##        Minimum expected frequency: 96.928

 

5 Manipulating data commands

R has basic commands that allow users to check the default commands or to modify data.
## Default check of command: args() vs str()
The args() function in R is used to return the required arguments or parameters of a given function. The args() function takes the name of the function as a parameter value, and returns the parameters that are required by the function.

str() function in R Language is used for compactly displaying the internal structure of a R object. It can display even the internal structure of large lists which are nested. It provides one liner output for the basic R objects letting the user know about the object and its constituents.

Commands:

  • args(function)
  • str(function)
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL
str(pivot_longer)
## function (data, cols, names_to = "name", names_prefix = NULL, names_sep = NULL, 
##     names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, 
##     names_repair = "check_unique", values_to = "value", values_drop_na = FALSE, 
##     values_ptypes = NULL, values_transform = NULL, ...)

 

5.1 Selecting part of dataframe: Subcripts vs subset()

5.1.1 Subcripts

To create a new dataframe that contains only the variables or cases that you want.

Command: newDataframe <- oldDataframe[rows, columns]

Functions:
There are many functions and operators that are useful when constructing the expressions used to filter the data:

  • ==, >, >= etc
  • &, |, !

Example 1: Choose rows with 1 condition:

data[data$trt == "Control", ] %>% head (n=5)
## # A tibble: 5 x 18
##   studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
## 1 z001    Control    53 Male      21 Right    No      No    No    2         127
## 2 z003    Control    68 Female    11 Right    No      No    No    0         138
## 3 z004    Control    28 Male      22 Left     No      No    No    0         122
## 4 z005    Control    91 Male      24 Right    No      No    No    0         162
## 5 z006    Control    34 Female    18 Left     No      No    No    2         166
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>

 

Example 2: Choose rows with 2 conditions:

data[data$trt == "Control" & data$age >=50, ] %>% head (n=5)
## # A tibble: 5 x 18
##   studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
## 1 z001    Control    53 Male      21 Right    No      No    No    2         127
## 2 z003    Control    68 Female    11 Right    No      No    No    0         138
## 3 z005    Control    91 Male      24 Right    No      No    No    0         162
## 4 z008    Control    89 Female    18 Right    No      No    No    0         157
## 5 z009    Control    75 Male      25 Left     No      Yes   No    2         129
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>

 

Example 3: Choose rows that trt is Control and age > global mean(age). If you want to extract age > mean(age) for each group (trt), group_by() + filter() should be used:

data[data$trt == "Control" & data$age > mean(data$age), ] %>% head (n=5)
## # A tibble: 5 x 18
##   studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
## 1 z003    Control    68 Female    11 Right    No      No    No    0         138
## 2 z005    Control    91 Male      24 Right    No      No    No    0         162
## 3 z008    Control    89 Female    18 Right    No      No    No    0         157
## 4 z009    Control    75 Male      25 Left     No      Yes   No    2         129
## 5 z015    Control    91 Male      15 Right    No      No    No    1         144
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>

 

Example 4: Drop all row that are not Intervention in trt:

data[! (data$trt == "Intervention"), ] %>% head (n=5)
## # A tibble: 5 x 18
##   studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
## 1 z001    Control    53 Male      21 Right    No      No    No    2         127
## 2 z003    Control    68 Female    11 Right    No      No    No    0         138
## 3 z004    Control    28 Male      22 Left     No      No    No    0         122
## 4 z005    Control    91 Male      24 Right    No      No    No    0         162
## 5 z006    Control    34 Female    18 Left     No      No    No    2         166
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>

 

5.1.2 Subset

Another way to select parts of your dataframe is to use the subset() function. Everything which can be used with subcripts can also be used with subset():

Command: newDataframe<-subset(oldDataframe, cases to retain (rows), select = c(list of variables - column))

Example 1: Choose rows with 1 condition:

data %>% subset(trt == "Control", ) %>% head (n=5)
## # A tibble: 5 x 18
##   studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
## 1 z001    Control    53 Male      21 Right    No      No    No    2         127
## 2 z003    Control    68 Female    11 Right    No      No    No    0         138
## 3 z004    Control    28 Male      22 Left     No      No    No    0         122
## 4 z005    Control    91 Male      24 Right    No      No    No    0         162
## 5 z006    Control    34 Female    18 Left     No      No    No    2         166
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>

 

Example 2: Choose rows with 2 conditions:

data %>% subset(trt == "Control" & data$age >=50, ) %>% head (n=5)
## # A tibble: 5 x 18
##   studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##   <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
## 1 z001    Control    53 Male      21 Right    No      No    No    2         127
## 2 z003    Control    68 Female    11 Right    No      No    No    0         138
## 3 z005    Control    91 Male      24 Right    No      No    No    0         162
## 4 z008    Control    89 Female    18 Right    No      No    No    0         157
## 5 z009    Control    75 Male      25 Left     No      Yes   No    2         129
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>

 

Example 3: Subset data with top 10% age. After this command, we will get top 10% highest age (46 rows)

quantile(data$age, 0.9)
## 90% 
##  88
subset(data, age> quantile(data$age, 0.9))
## # A tibble: 46 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z005    Control      91 Male     24 Right   No      No    No    0         162
##  2 z008    Control      89 Fema~    18 Right   No      No    No    0         157
##  3 z015    Control      91 Male     15 Right   No      No    No    1         144
##  4 z017    Control      92 Fema~    12 Right   No      No    No    0          99
##  5 z024    Control      92 Male     14 Left    No      No    Yes   2         122
##  6 z030    Interven~    91 Male     14 Left    No      Yes   No    0         154
##  7 z045    Interven~    93 Fema~    28 Right   No      No    No    1         162
##  8 z060    Control      93 Fema~    18 Left    No      No    No    0         142
##  9 z067    Interven~    93 Fema~    17 Right   No      Yes   No    0         156
## 10 z078    Control      89 Fema~    13 Left    No      No    No    0         172
## # ... with 36 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

5.2 Combine 2 dataframes: merge()

Merge two data frames by common columns or row names, or do other versions of database join operations.

Command: merge(x, y, by = , by.x = by, by.y = by, all = FALSE

Parameter:
* x,y: 2 dataframes
* by.x by.y: which of the column should merge
* by default: it will merge all of the column by common name. For example: it will merge all “id”, “start,”“stop”, “time left” as they appear in both dataframe (even though, they are not the same)
* all = TRUE: if there’s a value that appears in one but not in the other, it should include another row but with na values for the missing values that are, don’t appear in the other data frame.
* all = FALSE: only include the same value between 2 dataframes.

 

Let create 2 dataframes contains following variables:

  • merge1: id, names, sex.
  • merge2: studyid, names, group, age.
# Create 1st dataframe:
id = c(1:5)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva")
sex = c("F", "M", "F", "M", "F")
age = c(20, 17, 50, 30, 80)
merge1 = as.tibble(data.frame(id, names, sex, age))

# Create 2nd dataframe:
studyid = c(1:6)
names = c("Ana", "Boc", "Cindy", "Peter", "Yuly", "Doe")
group = c("Control", "Intervention", "Control", "Control", "Intervention", "Control")
age = c(20, 17, 50, 64, 34, 30)
merge2 = as.tibble(data.frame(studyid, names, group, age))

# Check dataframes:
merge1
## # A tibble: 5 x 4
##      id names sex     age
##   <int> <chr> <chr> <dbl>
## 1     1 Ana   F        20
## 2     2 Boc   M        17
## 3     3 Cindy F        50
## 4     4 Doe   M        30
## 5     5 Eva   F        80
merge2
## # A tibble: 6 x 4
##   studyid names group          age
##     <int> <chr> <chr>        <dbl>
## 1       1 Ana   Control         20
## 2       2 Boc   Intervention    17
## 3       3 Cindy Control         50
## 4       4 Peter Control         64
## 5       5 Yuly  Intervention    34
## 6       6 Doe   Control         30

Example 1: Merge 2 dataframe by names. COmpare the parameter all= TRUE vs FALSE

merge(merge1, merge2, by = "names", all = TRUE)
##   names id  sex age.x studyid        group age.y
## 1   Ana  1    F    20       1      Control    20
## 2   Boc  2    M    17       2 Intervention    17
## 3 Cindy  3    F    50       3      Control    50
## 4   Doe  4    M    30       6      Control    30
## 5   Eva  5    F    80      NA         <NA>    NA
## 6 Peter NA <NA>    NA       4      Control    64
## 7  Yuly NA <NA>    NA       5 Intervention    34
merge(merge1, merge2, by = "names", all = F)
##   names id sex age.x studyid        group age.y
## 1   Ana  1   F    20       1      Control    20
## 2   Boc  2   M    17       2 Intervention    17
## 3 Cindy  3   F    50       3      Control    50
## 4   Doe  4   M    30       6      Control    30

 

Example 2: We can see that in our dataframes, there are 2 similar variables: names and age. Let merge by both names and age:

merge(merge1, merge2, by.x = c("names", "age"), 
      by.y = c("names", "age"), all = TRUE)
##   names age id  sex studyid        group
## 1   Ana  20  1    F       1      Control
## 2   Boc  17  2    M       2 Intervention
## 3 Cindy  50  3    F       3      Control
## 4   Doe  30  4    M       6      Control
## 5   Eva  80  5    F      NA         <NA>
## 6 Peter  64 NA <NA>       4      Control
## 7  Yuly  34 NA <NA>       5 Intervention

 

5.3 Rename + Reorder level of factor: factor()

Commands:
factor(variable, levels = c(“x”,“y”, …” z”), labels = c(“label1”, “label2”, … “label3”))
Let create a dataframe:

id = c(1:5)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva")
sex = c(0, 1, 0, 1, 0)
age = c(20, 17, 50, 30, 80)

merge1 = as.tibble(data.frame(id, names, sex, age))
merge1
## # A tibble: 5 x 4
##      id names   sex   age
##   <int> <chr> <dbl> <dbl>
## 1     1 Ana       0    20
## 2     2 Boc       1    17
## 3     3 Cindy     0    50
## 4     4 Doe       1    30
## 5     5 Eva       0    80

 

Example: Change the 0,1 in sex to Female and Male, respectively:

merge1$sex = factor(merge1$sex, levels = c(0, 1), labels = c("Female ", "Male"))
merge1
## # A tibble: 5 x 4
##      id names sex         age
##   <int> <chr> <fct>     <dbl>
## 1     1 Ana   "Female "    20
## 2     2 Boc   "Male"       17
## 3     3 Cindy "Female "    50
## 4     4 Doe   "Male"       30
## 5     5 Eva   "Female "    80

 

5.4 Assign group to numeric value

Let create a dataframe:

id = c(1:6)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva", "EOI")
sex = c(0, 1, 0, 1, 0, 1)
age = c(20, 17, 50, 30, 80, 40)

merge1 = as.tibble(data.frame(id, names, sex, age))
merge1
## # A tibble: 6 x 4
##      id names   sex   age
##   <int> <chr> <dbl> <dbl>
## 1     1 Ana       0    20
## 2     2 Boc       1    17
## 3     3 Cindy     0    50
## 4     4 Doe       1    30
## 5     5 Eva       0    80
## 6     6 EOI       1    40

 

Example: Create group_age categorical variables with >=30 and the rest

merge1$group_age[merge1$age >=30] <- ">=30"
merge1$group_age[merge1$age <30] <- "Phan con lai"
merge1
## # A tibble: 6 x 5
##      id names   sex   age group_age   
##   <int> <chr> <dbl> <dbl> <chr>       
## 1     1 Ana       0    20 Phan con lai
## 2     2 Boc       1    17 Phan con lai
## 3     3 Cindy     0    50 >=30        
## 4     4 Doe       1    30 >=30        
## 5     5 Eva       0    80 >=30        
## 6     6 EOI       1    40 >=30

 

5.5 Replace value

Replace 20 = 0; Missing value = 1000

id = c(1:6)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva", "EOI")
sex = c(0, 1, 0, 1, 0, 1)
age = c(20, 17, 50, 20, NA, NA)

merge1 = as.tibble(data.frame(id, names, sex, age))
merge1
## # A tibble: 6 x 4
##      id names   sex   age
##   <int> <chr> <dbl> <dbl>
## 1     1 Ana       0    20
## 2     2 Boc       1    17
## 3     3 Cindy     0    50
## 4     4 Doe       1    20
## 5     5 Eva       0    NA
## 6     6 EOI       1    NA
merge1$age[merge1$age == 20] <- 0

merge1$age[is.na(merge1$age)]<- 1000
merge1
## # A tibble: 6 x 4
##      id names   sex   age
##   <int> <chr> <dbl> <dbl>
## 1     1 Ana       0     0
## 2     2 Boc       1    17
## 3     3 Cindy     0    50
## 4     4 Doe       1     0
## 5     5 Eva       0  1000
## 6     6 EOI       1  1000

 

6 Data cleaning: tidyverse

Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

Package used: tidyverse.

library(tidyverse)

6.1 Single table verbs

tidyverse aims to provide a function for each basic verb of data manipulation. These verbs can be organised into three categories based on the component of the dataset that they work with:

Rows:

  • filter(): chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows.

Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.
  • relocate(): changes the order of the columns.

Groups of rows:

  • summarise(): collapses a group into a single row.

Others:

  • pivot_longer and pivot_wider: changing format of the data.
  • str_length() and their friends: dealing with string.

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. More infomation is available here:
https://dplyr.tidyverse.org/articles/base.html#one-table-verbs

6.1.1 Combining operations with pipe: %>%

The pipe operator, denoted as %>%. It takes the output of one function and provides it as an input to another function. This enables us to connect a series of analytic stages. You can use the pipe to rewrite multiple operations that you can read left-to-right, top-to-bottom (reading the pipe operator as “then”).

6.1.2 Filter rows: filter()

6.1.2.1 Introduction

The filter() method is applied to subset a data frame, keeping only the rows that meet your criteria. To be kept, the row must return TRUE for all requirements. When a condition evaluates to NA, the row is discarded, as opposed to base subsetting with.

Command: filter(dataframe, condition, .preserve = FALSE)

The filter() method is used to subset the rows of data by applying expressions in… to column values to decide which rows should be kept. It may be used on both grouped and ungrouped data (see group by() and ungroup() for further information). dplyr, on the other hand, is not yet intelligent enough to optimize the filtering procedure on grouped datasets that do not require grouped computations. As a result, sifting ungrouped data is frequently more quicker.

6.1.2.2 Filter functions

There are many functions and operators that are useful when constructing the expressions used to filter the data:

  • ==, >, >= etc
  • &, |, !, xor()
  • is.na()
  • between(), near()

Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”.

6.1.2.3 Value

The output has the following properties:

  • Rows are a subset of the input, but appear in the same order.
  • Columns are not modified.
  • The number of groups may be reduced (if .preserve is not TRUE).
  • Data frame attributes are preserved.

6.1.2.4 Examples

Example 1: Filtering by one criterion

filter(data, trt == "Control")
## # A tibble: 267 x 18
##    studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
##  1 z001    Control    53 Male      21 Right    No      No    No    2         127
##  2 z003    Control    68 Female    11 Right    No      No    No    0         138
##  3 z004    Control    28 Male      22 Left     No      No    No    0         122
##  4 z005    Control    91 Male      24 Right    No      No    No    0         162
##  5 z006    Control    34 Female    18 Left     No      No    No    2         166
##  6 z008    Control    89 Female    18 Right    No      No    No    0         157
##  7 z009    Control    75 Male      25 Left     No      Yes   No    2         129
##  8 z011    Control    35 Female    17 Right    No      Yes   No    2         148
##  9 z012    Control    55 Male      13 Right    No      No    No    0         114
## 10 z013    Control    59 Male      21 Right    No      No    No    0         140
## # ... with 257 more rows, and 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>
filter(data, age >= 50)
## # A tibble: 421 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z001    Control      53 Male     21 Right   No      No    No    2         127
##  2 z002    Interven~    51 Male     23 Left    No      Yes   No    0         137
##  3 z003    Control      68 Fema~    11 Right   No      No    No    0         138
##  4 z005    Control      91 Male     24 Right   No      No    No    0         162
##  5 z007    Interven~    75 Male     25 Right   No      No    No    0         140
##  6 z008    Control      89 Fema~    18 Right   No      No    No    0         157
##  7 z009    Control      75 Male     25 Left    No      Yes   No    2         129
##  8 z012    Control      55 Male     13 Right   No      No    No    0         114
##  9 z013    Control      59 Male     21 Right   No      No    No    0         140
## 10 z015    Control      91 Male     15 Right   No      No    No    1         144
## # ... with 411 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

Example 2: Filtering by multiple criteria within a single logical expression

# Joint conditions "& and ,"
# filter(data, trt == "Control" , age >= 50)
filter(data, trt == "Control" & age >= 50)
## # A tibble: 235 x 18
##    studyid trt       age sex    nihss location hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>   <dbl> <fct>  <dbl> <fct>    <fct>   <fct> <fct> <fct>   <dbl>
##  1 z001    Control    53 Male      21 Right    No      No    No    2         127
##  2 z003    Control    68 Female    11 Right    No      No    No    0         138
##  3 z005    Control    91 Male      24 Right    No      No    No    0         162
##  4 z008    Control    89 Female    18 Right    No      No    No    0         157
##  5 z009    Control    75 Male      25 Left     No      Yes   No    2         129
##  6 z012    Control    55 Male      13 Right    No      No    No    0         114
##  7 z013    Control    59 Male      21 Right    No      No    No    0         140
##  8 z015    Control    91 Male      15 Right    No      No    No    1         144
##  9 z017    Control    92 Female    12 Right    No      No    No    0          99
## 10 z018    Control    59 Male      22 Right    No      No    No    0         170
## # ... with 225 more rows, and 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>
# Seperate conditions "| or"
filter(data, trt == "Control" | age >= 50)
## # A tibble: 453 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z001    Control      53 Male     21 Right   No      No    No    2         127
##  2 z002    Interven~    51 Male     23 Left    No      Yes   No    0         137
##  3 z003    Control      68 Fema~    11 Right   No      No    No    0         138
##  4 z004    Control      28 Male     22 Left    No      No    No    0         122
##  5 z005    Control      91 Male     24 Right   No      No    No    0         162
##  6 z006    Control      34 Fema~    18 Left    No      No    No    2         166
##  7 z007    Interven~    75 Male     25 Right   No      No    No    0         140
##  8 z008    Control      89 Fema~    18 Right   No      No    No    0         157
##  9 z009    Control      75 Male     25 Left    No      Yes   No    2         129
## 10 z011    Control      35 Fema~    17 Right   No      Yes   No    2         148
## # ... with 443 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

Note 1: x %in% y. This will select every row where x is one of the values in y. In this example, I will use 2 conditions. The 1st one is list all observations that has age = 50 or age = 60. The 2nd condition is list all observations that has age range from 50 to 60.

filter(data, age == 50 | age == 60)
## # A tibble: 16 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z037    Control      60 Fema~    16 Left    No      Yes   No    0         204
##  2 z076    Interven~    50 Male     17 Left    No      No    No    0         121
##  3 z095    Interven~    60 Male     17 Right   No      Yes   No    0         141
##  4 z147    Interven~    60 Male     22 Left    No      No    No    2         119
##  5 z152    Interven~    60 Male     24 Right   No      No    No    0         138
##  6 z161    Interven~    60 Male     21 Left    No      No    No    1         147
##  7 z163    Interven~    60 Fema~    10 Right   No      No    Yes   0         176
##  8 z181    Control      60 Male     17 Right   No      Yes   No    0         141
##  9 z231    Control      60 Male     22 Right   No      No    No    0         105
## 10 z365    Control      60 Male     12 Left    No      No    No    0         190
## 11 z373    Control      60 Male     14 Right   Yes     No    No    0         178
## 12 z378    Control      50 Fema~    22 Left    No      No    No    1         130
## 13 z400    Interven~    50 Male     20 Right   No      No    No    0         135
## 14 z417    Interven~    60 Male     14 Right   No      Yes   No    0         170
## 15 z427    Control      50 Fema~    22 Left    No      No    No    2         126
## 16 z444    Control      60 Male     20 Left    No      No    No    > 2       159
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## #   abbreviated variable name 1: location
filter(data, age %in% c(50, 60))
## # A tibble: 16 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z037    Control      60 Fema~    16 Left    No      Yes   No    0         204
##  2 z076    Interven~    50 Male     17 Left    No      No    No    0         121
##  3 z095    Interven~    60 Male     17 Right   No      Yes   No    0         141
##  4 z147    Interven~    60 Male     22 Left    No      No    No    2         119
##  5 z152    Interven~    60 Male     24 Right   No      No    No    0         138
##  6 z161    Interven~    60 Male     21 Left    No      No    No    1         147
##  7 z163    Interven~    60 Fema~    10 Right   No      No    Yes   0         176
##  8 z181    Control      60 Male     17 Right   No      Yes   No    0         141
##  9 z231    Control      60 Male     22 Right   No      No    No    0         105
## 10 z365    Control      60 Male     12 Left    No      No    No    0         190
## 11 z373    Control      60 Male     14 Right   Yes     No    No    0         178
## 12 z378    Control      50 Fema~    22 Left    No      No    No    1         130
## 13 z400    Interven~    50 Male     20 Right   No      No    No    0         135
## 14 z417    Interven~    60 Male     14 Right   No      Yes   No    0         170
## 15 z427    Control      50 Fema~    22 Left    No      No    No    2         126
## 16 z444    Control      60 Male     20 Left    No      No    No    > 2       159
## # ... with 7 more variables: iv.altep <fct>, time.iv <dbl>, aspects <dbl>,
## #   ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>, time.punc <dbl>, and
## #   abbreviated variable name 1: location
filter(data, age %in% c(50: 60))
## # A tibble: 117 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z001    Control      53 Male     21 Right   No      No    No    2         127
##  2 z002    Interven~    51 Male     23 Left    No      Yes   No    0         137
##  3 z012    Control      55 Male     13 Right   No      No    No    0         114
##  4 z013    Control      59 Male     21 Right   No      No    No    0         140
##  5 z018    Control      59 Male     22 Right   No      No    No    0         170
##  6 z019    Control      58 Male     22 Left    No      Yes   No    0         130
##  7 z021    Interven~    57 Male     17 Right   No      No    No    > 2       108
##  8 z031    Control      59 Fema~    12 Right   No      No    No    0         129
##  9 z032    Control      53 Male     22 Right   No      No    Yes   0         157
## 10 z034    Interven~    58 Male     15 Right   No      No    No    0         105
## # ... with 107 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

Note 2: Sometimes you can simplify complicated subsetting: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find observations that whose age is less than 60 and nihss is less than 15, you could use either of the following two filters. The following commands produce similar results:

filter(data, !(age > 60 | nihss > 15))
## # A tibble: 72 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z012    Control    55   Male     13 Right   No      No    No    0         114
##  2 z031    Control    59   Fema~    12 Right   No      No    No    0         129
##  3 z034    Interven~  58   Male     15 Right   No      No    No    0         105
##  4 z039    Interven~  59   Fema~    14 Right   No      No    Yes   0         148
##  5 z048    Interven~  46   Fema~    10 Left    No      No    No    0         168
##  6 z052    Control    54.5 Male     12 Right   No      Yes   No    0         149
##  7 z059    Control    58   Fema~    15 Left    No      Yes   No    1         130
##  8 z062    Interven~  56   Male     14 Left    No      No    Yes   0         146
##  9 z074    Interven~  24   Fema~    14 Right   No      No    No    0         162
## 10 z080    Interven~  31   Male     10 Left    No      No    No    0         131
## # ... with 62 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location
filter(data, !(age > 60) , ! (nihss > 15))
## # A tibble: 72 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z012    Control    55   Male     13 Right   No      No    No    0         114
##  2 z031    Control    59   Fema~    12 Right   No      No    No    0         129
##  3 z034    Interven~  58   Male     15 Right   No      No    No    0         105
##  4 z039    Interven~  59   Fema~    14 Right   No      No    Yes   0         148
##  5 z048    Interven~  46   Fema~    10 Left    No      No    No    0         168
##  6 z052    Control    54.5 Male     12 Right   No      Yes   No    0         149
##  7 z059    Control    58   Fema~    15 Left    No      Yes   No    1         130
##  8 z062    Interven~  56   Male     14 Left    No      No    Yes   0         146
##  9 z074    Interven~  24   Fema~    14 Right   No      No    No    0         162
## 10 z080    Interven~  31   Male     10 Left    No      No    No    0         131
## # ... with 62 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location
filter(data, age <= 60, nihss <= 15)
## # A tibble: 72 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z012    Control    55   Male     13 Right   No      No    No    0         114
##  2 z031    Control    59   Fema~    12 Right   No      No    No    0         129
##  3 z034    Interven~  58   Male     15 Right   No      No    No    0         105
##  4 z039    Interven~  59   Fema~    14 Right   No      No    Yes   0         148
##  5 z048    Interven~  46   Fema~    10 Left    No      No    No    0         168
##  6 z052    Control    54.5 Male     12 Right   No      Yes   No    0         149
##  7 z059    Control    58   Fema~    15 Left    No      Yes   No    1         130
##  8 z062    Interven~  56   Male     14 Left    No      No    Yes   0         146
##  9 z074    Interven~  24   Fema~    14 Right   No      No    No    0         162
## 10 z080    Interven~  31   Male     10 Left    No      No    No    0         131
## # ... with 62 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

Example 3: The filtering operation may yield different results on grouped tibbles because the expressions are computed within groups.

# The following filters rows where `age` is greater than the global average:
data %>% filter(age > mean(age, na.rm = TRUE))
## # A tibble: 270 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z003    Control      68 Fema~    11 Right   No      No    No    0         138
##  2 z005    Control      91 Male     24 Right   No      No    No    0         162
##  3 z007    Interven~    75 Male     25 Right   No      No    No    0         140
##  4 z008    Control      89 Fema~    18 Right   No      No    No    0         157
##  5 z009    Control      75 Male     25 Left    No      Yes   No    2         129
##  6 z015    Control      91 Male     15 Right   No      No    No    1         144
##  7 z016    Interven~    76 Male     16 Right   No      No    No    0         135
##  8 z017    Control      92 Fema~    12 Right   No      No    No    0          99
##  9 z020    Interven~    73 Fema~    15 Right   No      Yes   Yes   0         169
## 10 z022    Interven~    67 Male     27 Left    No      Yes   No    > 2       148
## # ... with 260 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location
# Whereas this keeps rows with `age` greater than the average value of age based on trt groups:
data %>% group_by(trt) %>% filter( age > mean(age, na.rm = TRUE))
## # A tibble: 259 x 18
## # Groups:   trt [2]
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z003    Control      68 Fema~    11 Right   No      No    No    0         138
##  2 z005    Control      91 Male     24 Right   No      No    No    0         162
##  3 z007    Interven~    75 Male     25 Right   No      No    No    0         140
##  4 z008    Control      89 Fema~    18 Right   No      No    No    0         157
##  5 z009    Control      75 Male     25 Left    No      Yes   No    2         129
##  6 z015    Control      91 Male     15 Right   No      No    No    1         144
##  7 z016    Interven~    76 Male     16 Right   No      No    No    0         135
##  8 z017    Control      92 Fema~    12 Right   No      No    No    0          99
##  9 z020    Interven~    73 Fema~    15 Right   No      Yes   Yes   0         169
## 10 z022    Interven~    67 Male     27 Left    No      Yes   No    > 2       148
## # ... with 249 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

6.1.3 Arrange rows with arrange()

6.1.3.1 Introduction

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

Command: arrange(dataframe, by_group = FALSE)

  • dataframe: A data frame, data frame extension. Use desc() to sort a variable in descending order.
  • by_group: If TRUE, will sort first by grouping variable. Applies to grouped data frames only.

6.1.3.2 Missing values

Unlike base sorting with sort(), NA are always sorted to the end for local data, even when wrapped with desc().

6.1.3.3 Value

The output has the following properties:

  • All rows appear in the output, but (usually) in a different place.
  • Columns are not modified.
  • Groups are not modified.
  • Data frame attributes are preserved.

6.1.3.4 Examples

Example 1: Arrange the data based on age and nihss in ascending order

arrange(data, age, nihss)
## # A tibble: 500 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z481    Interven~    23 Male     12 Right   No      Yes   No    2         169
##  2 z191    Interven~    23 Male     20 Right   No      Yes   No    0         137
##  3 z499    Interven~    23 Male     23 Left    Yes     Yes   No    0         123
##  4 z074    Interven~    24 Fema~    14 Right   No      No    No    0         162
##  5 z204    Control      24 Fema~    22 Left    No      No    No    0         123
##  6 z387    Control      25 Male     17 Left    No      Yes   No    1         138
##  7 z406    Control      26 Male     15 Right   No      No    No    > 2       177
##  8 z010    Interven~    26 Fema~    27 Right   No      No    No    0         143
##  9 z122    Interven~    27 Fema~    19 Left    No      Yes   Yes   2         195
## 10 z347    Interven~    27 Fema~    20 Right   No      No    No    0         133
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

Example 2: Arrange the data based on age and nihss in descending order

arrange(data, desc(age), desc(nihss))
## # A tibble: 500 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z374    Interven~    96 Male     26 Left    No      No    No    0         124
##  2 z086    Interven~    96 Male     14 Right   No      No    Yes   0         139
##  3 z434    Interven~    95 Fema~    19 Right   No      No    No    0         111
##  4 z340    Interven~    95 Male     18 Left    No      Yes   No    > 2       188
##  5 z357    Interven~    94 Fema~    26 Right   No      No    No    0         200
##  6 z135    Control      94 Fema~    18 Right   Yes     No    Yes   0         120
##  7 z045    Interven~    93 Fema~    28 Right   No      No    No    1         162
##  8 z440    Control      93 Male     25 Right   No      No    No    0         129
##  9 z219    Interven~    93 Fema~    20 Right   No      No    No    0         146
## 10 z227    Interven~    93 Fema~    20 Left    No      No    No    0         134
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

Example 3: Grouped arrange: Grouped by age, arrange based on nihss. Compare the result with arrange without group:

# Arrange with group:
data %>% group_by(age) %>% arrange(nihss, .by_group = TRUE)
## # A tibble: 500 x 18
## # Groups:   age [77]
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z481    Interven~    23 Male     12 Right   No      Yes   No    2         169
##  2 z191    Interven~    23 Male     20 Right   No      Yes   No    0         137
##  3 z499    Interven~    23 Male     23 Left    Yes     Yes   No    0         123
##  4 z074    Interven~    24 Fema~    14 Right   No      No    No    0         162
##  5 z204    Control      24 Fema~    22 Left    No      No    No    0         123
##  6 z387    Control      25 Male     17 Left    No      Yes   No    1         138
##  7 z406    Control      26 Male     15 Right   No      No    No    > 2       177
##  8 z010    Interven~    26 Fema~    27 Right   No      No    No    0         143
##  9 z122    Interven~    27 Fema~    19 Left    No      Yes   Yes   2         195
## 10 z347    Interven~    27 Fema~    20 Right   No      No    No    0         133
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location
# Arrange without group:
data %>% arrange(nihss)
## # A tibble: 500 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <fct>     <dbl> <fct> <dbl> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z048    Interven~    46 Fema~    10 Left    No      No    No    0         168
##  2 z080    Interven~    31 Male     10 Left    No      No    No    0         131
##  3 z163    Interven~    60 Fema~    10 Right   No      No    Yes   0         176
##  4 z297    Interven~    40 Male     10 Left    Yes     No    No    0         134
##  5 z298    Interven~    72 Fema~    10 Left    No      No    No    0         183
##  6 z301    Interven~    67 Male     10 Left    No      No    No    0         166
##  7 z310    Interven~    30 Male     10 Right   No      No    No    0         164
##  8 z330    Interven~    33 Male     10 Left    No      No    No    1         176
##  9 z334    Interven~    64 Male     10 Right   Yes     No    No    0         160
## 10 z003    Control      68 Fema~    11 Right   No      No    No    0         138
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

Example 4: Use across() access select()-style semantics:

data %>% select(sbp, nihss, time.iv, time.rand, time.punc) %>%
  arrange(across(starts_with("time.")))
## # A tibble: 500 x 5
##      sbp nihss time.iv time.rand time.punc
##    <dbl> <dbl>   <dbl>     <dbl>     <dbl>
##  1   111    13      42       276       295
##  2   134    17      43       138       267
##  3   121    17      43       146       186
##  4   159    28      44       189       327
##  5   144    21      44       212        NA
##  6   139    14      44       299       331
##  7   135    20      45       189       336
##  8   133    11      45       299       329
##  9   165    22      46       128        NA
## 10   148    14      46       148       333
## # ... with 490 more rows
data %>% select(sbp, nihss, time.iv, time.rand, time.punc) %>%
  arrange(across(starts_with("time."), desc))
## # A tibble: 500 x 5
##      sbp nihss time.iv time.rand time.punc
##    <dbl> <dbl>   <dbl>     <dbl>     <dbl>
##  1   119    22     218       195       182
##  2   142    18     218       152       247
##  3   154    15     212       256       321
##  4   166    10     211       235       260
##  5   160    10     204       300       184
##  6   176    21     200       276       219
##  7   139    17     200       162       194
##  8   131    10     199       292       190
##  9   140    15     199       267       198
## 10   131    24     197       300       200
## # ... with 490 more rows

 

6.1.4 Select columns: select()

6.1.4.1 Introduction

Using actions based on variable names, select() lets you to quickly zoom in on a suitable subset.

Command: select(dataframe, …)

6.1.4.2 Selection features

Tidyverse selections implement a dialect of R where operators make it easy to select variables:

  • : for selecting a range of consecutive variables.
  • ! for taking the complement of a set of variables.
  • & and | for selecting the intersection or the union of two sets of variables.
  • c() for combining selections.

Some helpers select specific columns:

  • everything(): Matches all variables.
  • last_col(): Select last variable, possibly with an offset.

Some helpers select variables by matching patterns in their names:

  • starts_with(): Matches names that begin with ().
  • ends_with(): Ends with a suffix.
  • contains(): Contains a literal string.
  • matches(): Matches a regular expression.
  • num_range(): Matches a numerical range like x01, x02, x03.

Some helpers select variables from a character vector:

  • all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
  • any_of(): Same as all_of(), except that no error is thrown for names that don’t exist.

This helper selects variables with a function:

  • where(): Applies a function to all variables and selects those for which the function returns TRUE.

6.1.4.3 Value

The output has the following properties:

  • Rows are not affected.
  • Output columns are a subset of input columns, potentially with a different order. Columns will be renamed if new_name = old_name form is used.
  • Data frame attributes are preserved.
  • Groups are maintained; you can’t select off grouping variables.

6.1.4.4 Examples

Example 1: Select variables by name:

data %>% select(trt, nihss, location, mrankin)
## # A tibble: 500 x 4
##    trt          nihss location mrankin
##    <fct>        <dbl> <fct>    <fct>  
##  1 Control         21 Right    2      
##  2 Intervention    23 Left     0      
##  3 Control         11 Right    0      
##  4 Control         22 Left     0      
##  5 Control         24 Right    0      
##  6 Control         18 Left     2      
##  7 Intervention    25 Right    0      
##  8 Control         18 Right    0      
##  9 Control         25 Left     2      
## 10 Intervention    27 Right    0      
## # ... with 490 more rows

 

Example 2: Select a range of consecutive variables with :

data %>% select(trt : nihss)
## # A tibble: 500 x 4
##    trt            age sex    nihss
##    <fct>        <dbl> <fct>  <dbl>
##  1 Control         53 Male      21
##  2 Intervention    51 Male      23
##  3 Control         68 Female    11
##  4 Control         28 Male      22
##  5 Control         91 Male      24
##  6 Control         34 Female    18
##  7 Intervention    75 Male      25
##  8 Control         89 Female    18
##  9 Control         75 Male      25
## 10 Intervention    26 Female    27
## # ... with 490 more rows

 

Example 3: Negates a selection with !. These commands produces similar results:

# Choose variables except studyid; location: time.punc:
data %>% select(! c( studyid, location : time.punc))
## # A tibble: 500 x 4
##    trt            age sex    nihss
##    <fct>        <dbl> <fct>  <dbl>
##  1 Control         53 Male      21
##  2 Intervention    51 Male      23
##  3 Control         68 Female    11
##  4 Control         28 Male      22
##  5 Control         91 Male      24
##  6 Control         34 Female    18
##  7 Intervention    75 Male      25
##  8 Control         89 Female    18
##  9 Control         75 Male      25
## 10 Intervention    26 Female    27
## # ... with 490 more rows
data %>% select(!studyid & !c(location : time.punc))
## # A tibble: 500 x 4
##    trt            age sex    nihss
##    <fct>        <dbl> <fct>  <dbl>
##  1 Control         53 Male      21
##  2 Intervention    51 Male      23
##  3 Control         68 Female    11
##  4 Control         28 Male      22
##  5 Control         91 Male      24
##  6 Control         34 Female    18
##  7 Intervention    75 Male      25
##  8 Control         89 Female    18
##  9 Control         75 Male      25
## 10 Intervention    26 Female    27
## # ... with 490 more rows

 

Example 4: Select variables start with time

data %>% select(starts_with("time"))
## # A tibble: 500 x 3
##    time.iv time.rand time.punc
##      <dbl>     <dbl>     <dbl>
##  1      63       139        NA
##  2      68       118       281
##  3      NA       178        NA
##  4      78       160        NA
##  5     121       214        NA
##  6      78       154        NA
##  7      97       122       268
##  8      NA       147        NA
##  9      49       271        NA
## 10      99       141       259
## # ... with 490 more rows

 

Example 5: & and | take the intersection or the union of two selections:

data %>% select(starts_with("time.") & ends_with("iv"))
## # A tibble: 500 x 1
##    time.iv
##      <dbl>
##  1      63
##  2      68
##  3      NA
##  4      78
##  5     121
##  6      78
##  7      97
##  8      NA
##  9      49
## 10      99
## # ... with 490 more rows
data %>% select(starts_with("time.") | ends_with("iv"))
## # A tibble: 500 x 3
##    time.iv time.rand time.punc
##      <dbl>     <dbl>     <dbl>
##  1      63       139        NA
##  2      68       118       281
##  3      NA       178        NA
##  4      78       160        NA
##  5     121       214        NA
##  6      78       154        NA
##  7      97       122       268
##  8      NA       147        NA
##  9      49       271        NA
## 10      99       141       259
## # ... with 490 more rows

 

Example 6: Use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

select(data, studyid, nihss, trt, everything())
## # A tibble: 500 x 18
##    studyid nihss trt         age sex   locat~1 hx.isch afib  dm    mrankin   sbp
##    <fct>   <dbl> <fct>     <dbl> <fct> <fct>   <fct>   <fct> <fct> <fct>   <dbl>
##  1 z001       21 Control      53 Male  Right   No      No    No    2         127
##  2 z002       23 Interven~    51 Male  Left    No      Yes   No    0         137
##  3 z003       11 Control      68 Fema~ Right   No      No    No    0         138
##  4 z004       22 Control      28 Male  Left    No      No    No    0         122
##  5 z005       24 Control      91 Male  Right   No      No    No    0         162
##  6 z006       18 Control      34 Fema~ Left    No      No    No    2         166
##  7 z007       25 Interven~    75 Male  Right   No      No    No    0         140
##  8 z008       18 Control      89 Fema~ Right   No      No    No    0         157
##  9 z009       25 Control      75 Male  Left    No      Yes   No    2         129
## 10 z010       27 Interven~    26 Fema~ Right   No      No    No    0         143
## # ... with 490 more rows, 7 more variables: iv.altep <fct>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <fct>, extra.ica <fct>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location

 

6.1.5 Change column name: rename()

6.1.5.1 Introduction

rename() changes the names of individual variables using new_name = old_name syntax; rename_with() renames columns using a function.

Command: rename(dataframe, new name = old name …)

6.1.5.2 Value

The output has the following properties:

  • Rows are not affected.
  • Column names are changed; column order is preserved.
  • Data frame attributes are preserved.
  • Groups are updated to reflect new names.

6.1.5.3 Examples

Example 1: Rename variable trt to Group:

data %>% select(studyid, trt, sex) %>% rename (Group = trt)
## # A tibble: 500 x 3
##    studyid Group        sex   
##    <fct>   <fct>        <fct> 
##  1 z001    Control      Male  
##  2 z002    Intervention Male  
##  3 z003    Control      Female
##  4 z004    Control      Male  
##  5 z005    Control      Male  
##  6 z006    Control      Female
##  7 z007    Intervention Male  
##  8 z008    Control      Female
##  9 z009    Control      Male  
## 10 z010    Intervention Female
## # ... with 490 more rows

 

Example 2: Change the variable names start with time. to upper case TIME. Or change all variables to upper case:

data %>% select(trt, starts_with("time.")) %>% rename_with (toupper, starts_with("time."))
## # A tibble: 500 x 4
##    trt          TIME.IV TIME.RAND TIME.PUNC
##    <fct>          <dbl>     <dbl>     <dbl>
##  1 Control           63       139        NA
##  2 Intervention      68       118       281
##  3 Control           NA       178        NA
##  4 Control           78       160        NA
##  5 Control          121       214        NA
##  6 Control           78       154        NA
##  7 Intervention      97       122       268
##  8 Control           NA       147        NA
##  9 Control           49       271        NA
## 10 Intervention      99       141       259
## # ... with 490 more rows
data %>% select(studyid, trt, sex) %>% rename_with (toupper)
## # A tibble: 500 x 3
##    STUDYID TRT          SEX   
##    <fct>   <fct>        <fct> 
##  1 z001    Control      Male  
##  2 z002    Intervention Male  
##  3 z003    Control      Female
##  4 z004    Control      Male  
##  5 z005    Control      Male  
##  6 z006    Control      Female
##  7 z007    Intervention Male  
##  8 z008    Control      Female
##  9 z009    Control      Male  
## 10 z010    Intervention Female
## # ... with 490 more rows

 

Example 3: Change the variable names start with time. to lower case TIME, replace “.” to “_” between each words:

data %>% select (studyid, starts_with("time.")) %>% 
  rename_with(~ tolower(gsub(".", "_", .x, fixed = TRUE)))
## # A tibble: 500 x 4
##    studyid time_iv time_rand time_punc
##    <fct>     <dbl>     <dbl>     <dbl>
##  1 z001         63       139        NA
##  2 z002         68       118       281
##  3 z003         NA       178        NA
##  4 z004         78       160        NA
##  5 z005        121       214        NA
##  6 z006         78       154        NA
##  7 z007         97       122       268
##  8 z008         NA       147        NA
##  9 z009         49       271        NA
## 10 z010         99       141       259
## # ... with 490 more rows

 

6.1.6 Add new variables: mutate()

6.1.6.1 Introduction

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate(). mutate() always adds new columns at the end of your dataset.

Command:

  • mutate(dataframe, var = …, .keep = c(“all”, “used”, “unused”, “none”), .before = NULL, .after = NULL)
  • transmute(dataframe, var = …)

6.1.6.2 Useful mutate functions

  • Arithmetic operators: +, -, *, /, ^.
  • Modular arithmetic: %/% (integer division) and %% (remainder). Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
  • Logs: log(), log2(), log10().
  • Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by().
  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
  • Logical comparisons: <, <=, >, >=, !=, and ==.
  • Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1]  1  2  2 NA  4  5
min_rank(desc(y))
## [1]  5  3  3 NA  2  1

 

  • If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile():
row_number(y)
## [1]  1  2  3 NA  4  5
dense_rank(y)
## [1]  1  2  2 NA  3  4
percent_rank(y)
## [1] 0.00 0.25 0.25   NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6  NA 0.8 1.0

6.1.6.3 Examples

Example 1: Newly created variables are available immediately:

data %>%
 select(studyid, nihss) %>%
 mutate(
  nihss2 = nihss * 2,
  nihss_squared = nihss * nihss
 )
## # A tibble: 500 x 4
##    studyid nihss nihss2 nihss_squared
##    <fct>   <dbl>  <dbl>         <dbl>
##  1 z001       21     42           441
##  2 z002       23     46           529
##  3 z003       11     22           121
##  4 z004       22     44           484
##  5 z005       24     48           576
##  6 z006       18     36           324
##  7 z007       25     50           625
##  8 z008       18     36           324
##  9 z009       25     50           625
## 10 z010       27     54           729
## # ... with 490 more rows

 

Example 2: Remove variables and modify existing variables:

data %>%
 select(studyid, nihss, sbp) %>%
 mutate(
   sbp = NULL,
   nihss = nihss /10
)
## # A tibble: 500 x 2
##    studyid nihss
##    <fct>   <dbl>
##  1 z001      2.1
##  2 z002      2.3
##  3 z003      1.1
##  4 z004      2.2
##  5 z005      2.4
##  6 z006      1.8
##  7 z007      2.5
##  8 z008      1.8
##  9 z009      2.5
## 10 z010      2.7
## # ... with 490 more rows

 

Example 3: Apply a transformation to multiple columns in a tibble:

data %>%
 select(studyid, sex, sbp) %>%
 mutate(across( !sbp, as.character))
## # A tibble: 500 x 3
##    studyid sex      sbp
##    <chr>   <chr>  <dbl>
##  1 z001    Male     127
##  2 z002    Male     137
##  3 z003    Female   138
##  4 z004    Male     122
##  5 z005    Male     162
##  6 z006    Female   166
##  7 z007    Male     140
##  8 z008    Female   157
##  9 z009    Male     129
## 10 z010    Female   143
## # ... with 490 more rows

 

Example 4: Functions are useful for grouped mutates: min_rank; row_number(y); dense_rank(y); percent_rank(y); cume_dist(y):

data %>%
 select(studyid, nihss, sbp) %>%
  group_by(sbp) %>%
  mutate(rank = min_rank(nihss))
## # A tibble: 500 x 4
## # Groups:   sbp [113]
##    studyid nihss   sbp  rank
##    <fct>   <dbl> <dbl> <int>
##  1 z001       21   127     3
##  2 z002       23   137     7
##  3 z003       11   138     1
##  4 z004       22   122     5
##  5 z005       24   162    12
##  6 z006       18   166     6
##  7 z007       25   140     9
##  8 z008       18   157     4
##  9 z009       25   129     6
## 10 z010       27   143     6
## # ... with 490 more rows
data %>%
 select(studyid, nihss, sbp) %>%
  group_by(sbp) %>%
  mutate(rank = dense_rank(nihss))
## # A tibble: 500 x 4
## # Groups:   sbp [113]
##    studyid nihss   sbp  rank
##    <fct>   <dbl> <dbl> <int>
##  1 z001       21   127     3
##  2 z002       23   137     5
##  3 z003       11   138     1
##  4 z004       22   122     4
##  5 z005       24   162     7
##  6 z006       18   166     5
##  7 z007       25   140     8
##  8 z008       18   157     4
##  9 z009       25   129     5
## 10 z010       27   143     5
## # ... with 490 more rows

 

Example 5: New column can be placed in any column position: .before and .after. The 2 following commands are similar:

data %>%
 select(studyid, nihss, sbp) %>%
 mutate( proportion = nihss / sbp, .before = nihss)
## # A tibble: 500 x 4
##    studyid proportion nihss   sbp
##    <fct>        <dbl> <dbl> <dbl>
##  1 z001        0.165     21   127
##  2 z002        0.168     23   137
##  3 z003        0.0797    11   138
##  4 z004        0.180     22   122
##  5 z005        0.148     24   162
##  6 z006        0.108     18   166
##  7 z007        0.179     25   140
##  8 z008        0.115     18   157
##  9 z009        0.194     25   129
## 10 z010        0.189     27   143
## # ... with 490 more rows
data %>%
 select(studyid, nihss, sbp) %>%
 mutate( proportion = nihss / sbp, .after = studyid)
## # A tibble: 500 x 4
##    studyid proportion nihss   sbp
##    <fct>        <dbl> <dbl> <dbl>
##  1 z001        0.165     21   127
##  2 z002        0.168     23   137
##  3 z003        0.0797    11   138
##  4 z004        0.180     22   122
##  5 z005        0.148     24   162
##  6 z006        0.108     18   166
##  7 z007        0.179     25   140
##  8 z008        0.115     18   157
##  9 z009        0.194     25   129
## 10 z010        0.189     27   143
## # ... with 490 more rows

 

Example 6: Calculate based on group:

# Global average:
data %>%
  select(studyid, ia.occlus, nihss) %>%
  mutate(global_proportion = nihss / mean(nihss, na.rm = TRUE))
## # A tibble: 500 x 4
##    studyid ia.occlus   nihss global_proportion
##    <fct>   <fct>       <dbl>             <dbl>
##  1 z001    M1             21             1.16 
##  2 z002    M1             23             1.28 
##  3 z003    ICA with M1    11             0.610
##  4 z004    ICA with M1    22             1.22 
##  5 z005    M1             24             1.33 
##  6 z006    A1 or A2       18             0.999
##  7 z007    M1             25             1.39 
##  8 z008    M1             18             0.999
##  9 z009    M1             25             1.39 
## 10 z010    M1             27             1.50 
## # ... with 490 more rows
# Group-Local average: calculate based on variable "sex"
data %>%
  select(studyid, ia.occlus, nihss) %>%
  group_by(ia.occlus) %>%
  mutate(sex_proportion = nihss / mean(nihss, na.rm = TRUE))
## # A tibble: 500 x 4
## # Groups:   ia.occlus [6]
##    studyid ia.occlus   nihss sex_proportion
##    <fct>   <fct>       <dbl>          <dbl>
##  1 z001    M1             21          1.15 
##  2 z002    M1             23          1.26 
##  3 z003    ICA with M1    11          0.638
##  4 z004    ICA with M1    22          1.28 
##  5 z005    M1             24          1.32 
##  6 z006    A1 or A2       18          0.806
##  7 z007    M1             25          1.37 
##  8 z008    M1             18          0.988
##  9 z009    M1             25          1.37 
## 10 z010    M1             27          1.48 
## # ... with 490 more rows

 

6.1.7 Grouped summaries: summarise()

6.1.7.1 Introduction

summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

Command: summarise(dataframe, x = )

6.1.7.2 Useful functions

  • Center: mean(), median()
  • Spread: sd(), IQR(), mad()
  • Range: min(), max(), quantile()
  • Position: first(), last(), nth()
  • Count: n(), n_distinct()
  • Logical: any(), all()

6.1.7.3 Examples

Example 1: Calculate the mean and sd of age based on trt group

data %>% 
  group_by (trt) %>%
  summarise( count = n(), mean_age = mean(age), sd_age = sd(age) )
## # A tibble: 2 x 4
##   trt          count mean_age sd_age
##   <fct>        <int>    <dbl>  <dbl>
## 1 Intervention   233     63.9   18.1
## 2 Control        267     65.4   16.1

Example 2: Calculate the mean and sd of age based on ia.occlus group

data %>% 
  group_by (ia.occlus) %>%
  summarise( count = n(), mean_age = mean(age), sd_age = sd(age) )
## # A tibble: 6 x 4
##   ia.occlus        count mean_age sd_age
##   <fct>            <int>    <dbl>  <dbl>
## 1 Intracranial ICA     4     68.5   31.4
## 2 ICA with M1        134     62.2   17.1
## 3 M1                 319     65.8   17.1
## 4 M2                  39     64.3   14.2
## 5 A1 or A2             3     53.7   17.1
## 6 <NA>                 1     75     NA

 

6.1.8 Pivot data

6.1.8.1 Introduction

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().

Command:
* pivot_longer( dataframe, names_to = ,values_to = , names_sep = “.”)
* pivot_wider ( dataframe, names_from = ,values_from = , names_sep = “.”)

6.1.8.2 Examples

The variables in the data is already in pivot_longer format.

Example 1: Choosing the 3 variables in the data and transform 2 of them using pivot_wider. Compare it with the original variables:

# Original
data %>% select (studyid, trt, age)
## # A tibble: 500 x 3
##    studyid trt            age
##    <fct>   <fct>        <dbl>
##  1 z001    Control         53
##  2 z002    Intervention    51
##  3 z003    Control         68
##  4 z004    Control         28
##  5 z005    Control         91
##  6 z006    Control         34
##  7 z007    Intervention    75
##  8 z008    Control         89
##  9 z009    Control         75
## 10 z010    Intervention    26
## # ... with 490 more rows
# Transform
data %>% select (studyid, trt, age) %>% pivot_wider (names_from = trt,values_from = age)
## # A tibble: 500 x 3
##    studyid Control Intervention
##    <fct>     <dbl>        <dbl>
##  1 z001         53           NA
##  2 z002         NA           51
##  3 z003         68           NA
##  4 z004         28           NA
##  5 z005         91           NA
##  6 z006         34           NA
##  7 z007         NA           75
##  8 z008         89           NA
##  9 z009         75           NA
## 10 z010         NA           26
## # ... with 490 more rows

 

Example 2: Choosing the 4 variables in the data and transform 3 of them using pivot_wider. Compare it with the original variables:

# Original
data %>% select (studyid, trt, age, sbp)
## # A tibble: 500 x 4
##    studyid trt            age   sbp
##    <fct>   <fct>        <dbl> <dbl>
##  1 z001    Control         53   127
##  2 z002    Intervention    51   137
##  3 z003    Control         68   138
##  4 z004    Control         28   122
##  5 z005    Control         91   162
##  6 z006    Control         34   166
##  7 z007    Intervention    75   140
##  8 z008    Control         89   157
##  9 z009    Control         75   129
## 10 z010    Intervention    26   143
## # ... with 490 more rows
# Transform
data %>% select (studyid, trt, age, sbp) %>% 
  pivot_wider (names_from = trt,values_from = c(age, sbp), names_sep = "_")
## # A tibble: 500 x 5
##    studyid age_Control age_Intervention sbp_Control sbp_Intervention
##    <fct>         <dbl>            <dbl>       <dbl>            <dbl>
##  1 z001             53               NA         127               NA
##  2 z002             NA               51          NA              137
##  3 z003             68               NA         138               NA
##  4 z004             28               NA         122               NA
##  5 z005             91               NA         162               NA
##  6 z006             34               NA         166               NA
##  7 z007             NA               75          NA              140
##  8 z008             89               NA         157               NA
##  9 z009             75               NA         129               NA
## 10 z010             NA               26          NA              143
## # ... with 490 more rows

6.1.9 Dealing with string str_()

You can create strings with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using “, unless you want to create a string that contains multiple”.

string1 <- “This is a string”

6.1.9.1 String length: str_length()

Base R has several functions for working with strings, but we’ll omit them since they can be inconsistent, making them difficult to remember. Instead, we’ll utilize stringr functions. These have more logical names and all begin with str_. For instance, str_length() returns the number of characters in a string:

str_length(c("a", "Basic R programming", NA))
## [1]  1 19 NA

 

6.1.9.2 Combining strings: str_c()

To combine two or more strings, use str_c(). For a vector of strings, we will need a complicated command, I will mention its command at the end of this part:

Command: str_c (“string 1”, “string 2”, “string 3”, sep = ” “)

str_c("a", "b")
## [1] "ab"
str_c("I", "like", "cats")
## [1] "Ilikecats"

 

Use the sep argument to control how they’re separated:

str_c("a", "b", sep = ", ")
## [1] "a, b"
str_c("I", "like", "cats", sep = " ")
## [1] "I like cats"

 

Like most other functions in R, missing values are contagious. If you want them to print as “NA”, use str_replace_na():

x <- c("abc", NA)
str_c("|-", x, "-|")
## [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
## [1] "|-abc-|" "|-NA-|"

 

As previously demonstrated, str_c() is vectorised, and it recycles lesser vectors to the same length as the longest:
Command: str_c (“string 1”, c(“string 2”, “string 3”), “string 4”, sep = ” “)

str_c("prefix-", c("a", "b", "c"), "-suffix")
## [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
str_c("I", c("like", "hate", "love"), "animal", sep = " ")
## [1] "I like animal" "I hate animal" "I love animal"

 

To collapse a vector of strings into a single string, use collapse:
Command: str_c (Vector string) , collapse = ” “)

str_c(c("a", "b", "c"), collapse = ", ")
## [1] "a, b, c"

 

6.1.9.3 Subsetting strings: str_sub()

str_sub() can be used to extract sections of a string. In addition to the string, str_sub() accepts start and end parameters that specify the (inclusive) location of the substring:

Command: str_sub(x, start, end)

#Positive numbers count forwards from start
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
# Negative numbers count backwards from end
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear"

 

Note: str_sub() won’t fail if the string is too short: it will just return as much as possible:

str_sub("App", 1, 5)
## [1] "App"

 

You can also use the assignment form of str_sub() to modify strings:

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
## [1] "apple"  "banana" "pear"

 

6.1.9.4 Locale: str_to_upper/lower ()

I used str_to_lower() above to convert the text to lower case. str_to_upper() and str_to_title() can also be used (). However, changing case is more complex than it appears since various languages have distinct case-changing laws. By defining a location, you may choose which set of rules to use:

str_to_upper(str_sub(x, 1, 3))
## [1] "APP" "BAN" "PEA"
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
x
## [1] "Apple"  "Banana" "Pear"

 

This function can be used to modify the string within variable, so that they will have the same format:

df2 = df
str_sub(df2$sex, 1, ) <- str_to_lower(str_sub(df2$sex, 1, ))
as.factor(df2$sex[1:3])
## [1] male   male   female
## Levels: female male
x = c("male", "FEMALE", "MaLe", "Female", "FeMale")
str_sub(x,1, ) = str_to_lower(str_sub(x, 1, ))
x
## [1] "male"   "female" "male"   "female" "female"

 

6.1.9.5 Matching patterns

The simplest patterns match exact strings:
Command: str_view(x, ” y “)

x <- c("apple", "banana", "pear")
str_view(x, "an")

 

The next step up in complexity is ., which matches any character (except a newline):
Command: str_view(x, “. y .”)

str_view(x, ".a.")

 

But how can you match the character “.” if “.” matches any character? You must use a “escape” to notify the regular expression that you want it to match it exactly, rather than using its unique behavior. Regexps, like strings, employ the backslash “\”, to avoid specific behavior. To match an “.”, you’ll need the regexp “\”. This, unfortunately, causes a difficulty. Strings are used to represent regular expressions, and is also used in strings as an escape symbol. So we need the string “\.” to generate the regular expression “.”.

# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)
## \.
#> \.

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

 

If  is used as an escape character in regular expressions, how do you match a literal ? To match a literal  you need to write “\\” — you need four backslashes to match one!

x <- "a\\b"
writeLines(x)
## a\b
#> a\b

str_view(x, "\\\\")

For more information, please refer to this link: https://r4ds.had.co.nz/strings.html#strings

6.1.10 Dates and times: lubridate()

For more information, please refer to this link: https://r4ds.had.co.nz/dates-and-times.html
This chapter will concentrate on the lubridate package, which simplifies working with dates and timings in R. Lubridate is not part of the core tidyverse because it is only required when working with dates/times.

library(tidyverse)
library(lubridate)

Create a dataframe dat:

my_dates<-c( "2020-01-01",  "2020-01-02",  "2020-01-03",  "2020-01-04",  "2020-02-01",  "2020-02-02",  "2020-02-03",  "2020-02-04")
my_values<-c(  1,2,3,4,5,6,7,8)

dat<-tibble(my_dates,my_values)
dat
## # A tibble: 8 x 2
##   my_dates   my_values
##   <chr>          <dbl>
## 1 2020-01-01         1
## 2 2020-01-02         2
## 3 2020-01-03         3
## 4 2020-01-04         4
## 5 2020-02-01         5
## 6 2020-02-02         6
## 7 2020-02-03         7
## 8 2020-02-04         8

 

ymd() will parse the character string that falls into that format.
Example 1: Using ymd(), change the dat$my_dates from character to date format.

dat$my_dates<-ymd(dat$my_dates)
dat
## # A tibble: 8 x 2
##   my_dates   my_values
##   <date>         <dbl>
## 1 2020-01-01         1
## 2 2020-01-02         2
## 3 2020-01-03         3
## 4 2020-01-04         4
## 5 2020-02-01         5
## 6 2020-02-02         6
## 7 2020-02-03         7
## 8 2020-02-04         8

 

Other date parsers are mdy, myd, dmy, yq, and so forth. Once you have data as a date, you can extract components of the date:

  • month(x)
  • year(x)
  • day(x)
  • wday(x, label=TRUE)

Example 2: Apply the above commands to dat$my_dates:

month(dat$my_dates)
## [1] 1 1 1 1 2 2 2 2
year(dat$my_dates)
## [1] 2020 2020 2020 2020 2020 2020 2020 2020
day(dat$my_dates)
## [1] 1 2 3 4 1 2 3 4
wday(dat$my_dates, label=TRUE)
## [1] Wed Thu Fri Sat Sat Sun Mon Tue
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

 

6.2 Two-table verbs

In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.

  • Mutating joins, which add new variables to one table from matching rows in another.
  • Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.
  • Set operations, which combine the observations in the data sets as if they were set elements.

For more information, please refer to the following link:
https://dplyr.tidyverse.org/articles/two-table.html#mutating-joins

 

6.2.1 Mutating joins

There are four types of mutating join, which differ in their behaviour when a match is not found.

Remember the Example we make in merge section earlier? Let create 2 dataframes contains following variables:

  • merge1: id, names, sex.
  • merge2: studyid, names, group, age.
# Create 1st dataframe:
id = c(1:5)
names = c("Ana", "Boc", "Cindy", "Doe", "Eva")
sex = c("F", "M", "F", "M", "F")
age = c(20, 17, 50, 30, 80)
merge1 = as.tibble(data.frame(id, names, sex, age))

# Create 2nd dataframe:
studyid = c(1:6)
names = c("Ana", "Boc", "Cindy", "Peter", "Yuly", "Doe")
group = c("Control", "Intervention", "Control", "Control", "Intervention", "Control")
age = c(20, 17, 50, 64, 34, 30)
merge2 = as.tibble(data.frame(studyid, names, group, age))

# Check dataframes:
merge1
## # A tibble: 5 x 4
##      id names sex     age
##   <int> <chr> <chr> <dbl>
## 1     1 Ana   F        20
## 2     2 Boc   M        17
## 3     3 Cindy F        50
## 4     4 Doe   M        30
## 5     5 Eva   F        80
merge2
## # A tibble: 6 x 4
##   studyid names group          age
##     <int> <chr> <chr>        <dbl>
## 1       1 Ana   Control         20
## 2       2 Boc   Intervention    17
## 3       3 Cindy Control         50
## 4       4 Peter Control         64
## 5       5 Yuly  Intervention    34
## 6       6 Doe   Control         30

 

  1. inner_join(x, y) only includes observations that match in both x and y.
merge1 %>% inner_join(merge2, by = c("names", "age"))
## # A tibble: 4 x 6
##      id names sex     age studyid group       
##   <int> <chr> <chr> <dbl>   <int> <chr>       
## 1     1 Ana   F        20       1 Control     
## 2     2 Boc   M        17       2 Intervention
## 3     3 Cindy F        50       3 Control     
## 4     4 Doe   M        30       6 Control

 

  1. left_join(x, y) includes all observations in x, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table.
merge1 %>% left_join(merge2, by = c("names", "age"))
## # A tibble: 5 x 6
##      id names sex     age studyid group       
##   <int> <chr> <chr> <dbl>   <int> <chr>       
## 1     1 Ana   F        20       1 Control     
## 2     2 Boc   M        17       2 Intervention
## 3     3 Cindy F        50       3 Control     
## 4     4 Doe   M        30       6 Control     
## 5     5 Eva   F        80      NA <NA>

 

  1. right_join(x, y) includes all observations in y. It’s equivalent to left_join(y, x), but the columns and rows will be ordered differently.
merge1 %>% right_join(merge2, by = c("names", "age"))
## # A tibble: 6 x 6
##      id names sex     age studyid group       
##   <int> <chr> <chr> <dbl>   <int> <chr>       
## 1     1 Ana   F        20       1 Control     
## 2     2 Boc   M        17       2 Intervention
## 3     3 Cindy F        50       3 Control     
## 4     4 Doe   M        30       6 Control     
## 5    NA Peter <NA>     64       4 Control     
## 6    NA Yuly  <NA>     34       5 Intervention

 

  1. full_join() includes all observations from x and y.
merge1 %>% full_join(merge2, by = c("names", "age"))
## # A tibble: 7 x 6
##      id names sex     age studyid group       
##   <int> <chr> <chr> <dbl>   <int> <chr>       
## 1     1 Ana   F        20       1 Control     
## 2     2 Boc   M        17       2 Intervention
## 3     3 Cindy F        50       3 Control     
## 4     4 Doe   M        30       6 Control     
## 5     5 Eva   F        80      NA <NA>        
## 6    NA Peter <NA>     64       4 Control     
## 7    NA Yuly  <NA>     34       5 Intervention

 

Note: While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:

df1 <- tibble(id = c(1, 1, 2), value = 1:3)
df2 <- tibble(id = c(1, 1, 2), gender = c("Male", "Female", "Male"))

df1 %>% left_join(df2)
## # A tibble: 5 x 3
##      id value gender
##   <dbl> <int> <chr> 
## 1     1     1 Male  
## 2     1     1 Female
## 3     1     2 Male  
## 4     1     2 Female
## 5     2     3 Male

 

6.2.2 Filtering joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.
merge1
## # A tibble: 5 x 4
##      id names sex     age
##   <int> <chr> <chr> <dbl>
## 1     1 Ana   F        20
## 2     2 Boc   M        17
## 3     3 Cindy F        50
## 4     4 Doe   M        30
## 5     5 Eva   F        80
merge2
## # A tibble: 6 x 4
##   studyid names group          age
##     <int> <chr> <chr>        <dbl>
## 1       1 Ana   Control         20
## 2       2 Boc   Intervention    17
## 3       3 Cindy Control         50
## 4       4 Peter Control         64
## 5       5 Yuly  Intervention    34
## 6       6 Doe   Control         30
semi_join(merge1, merge2, by = c("names", "age"))
## # A tibble: 4 x 4
##      id names sex     age
##   <int> <chr> <chr> <dbl>
## 1     1 Ana   F        20
## 2     2 Boc   M        17
## 3     3 Cindy F        50
## 4     4 Doe   M        30
anti_join(merge1, merge2, by = c("names", "age"))
## # A tibble: 1 x 4
##      id names sex     age
##   <int> <chr> <chr> <dbl>
## 1     5 Eva   F        80

 

If you’re worried about what observations your joins will match, start with a semi_join() or anti_join(). semi_join() and anti_join() never duplicate; they only ever remove observations.

df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

# Four rows to start with:
df1 %>% nrow()
## [1] 4
# And we get four rows after the join
df1 %>% inner_join(df2, by = "x") %>% nrow()
## [1] 4
# But only two rows actually match
df1 %>% semi_join(df2, by = "x") %>% nrow()
## [1] 2

Discuss
Here, we get 4 rows to start with, and after inner_join - we get 4 rows that match. But when we check with semi_join, only 2 rows are actually match. That means extra 2 rows are newly created because when a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations.

 

6.2.3 Set operations

The final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:

  • intersect(x, y): return only observations in both x and y.
  • union(x, y): return unique observations in x and y.
  • setdiff(x, y): return observations in x, but not in y.

More infomation can be found here:
https://dplyr.tidyverse.org/articles/two-table.html#observations

 

7 References

https://www.techtarget.com/searchdatamanagement/definition/raw-data
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html#:~:text=Tidy%20data%20is%20a%20standard,Every%20row%20is%20an%20observation.
https://r4ds.had.co.nz/tidy-data.html#introduction-6
https://dplyr.tidyverse.org/reference/filter.html
Field, A., Miles, J. and Field, Z., 2012. Discovering statistics using R. Sage publications.