.columns{display:flex;}
h1{color:red}
h2{color:blue}
h3{color:green}
Phastar Presentation
Why Clinical trial Should Embrace the R language
-Data manipulation is the changing of data to make it easier to read or be more organized.
-Companies may manipulate their data because it can provide them with well-organized databases. Categorization can allow companies to group similar data, which may make it easier to search for information.
-Another reason companies might manipulate their data is that it can allow them to archive project data and access it later if they want to use it as a reference while working on a new project or setting business goals.
-Data manipulation is also a valuable tool in identifying and correcting data redundancies in reporting.
The filter() function subsets the rows in a data frame by testing against a conditional statement.
Whereas filter() subsets a dataframe by row, select() returns a subset of the columns.
mutate() adds new columns of data.
arrange() is the simplest of the dplyr functions, which orders rows according to values in a given column. The default is to order numbers from lowest -> highest.
The summarize(), which as the name suggests, creates individual summary statistics from larger data sets.
group_by() allows us to create sub-groups based on labels in a particular column, and to run subsequent functions on all sub-groups.
(%>%) The point of the pipe is to help you write code in a way that is easier to read and understand
library(xaringanthemer)
suppressPackageStartupMessages(library(haven))
dmdata=haven::read_sas("dm.sas7bdat")
suppressPackageStartupMessages(library(tidyverse))#package for data manipulation
names(dmdata) #checking the variables in the data
## [1] "STUDYID" "DOMAIN" "USUBJID" "SUBJID" "RFSTDTC" "RFENDTC"
## [7] "RFXSTDTC" "RFXENDTC" "RFICDTC" "RFPENDTC" "DTHDTC" "DTHFL"
## [13] "SITEID" "BRTHDTC" "AGE" "AGEU" "SEX" "RACE"
## [19] "ETHNIC" "ARM" "ACTARM" "COUNTRY" "ARMCD" "ACTARMCD"
glimpse(dmdata) #checking the if the variable is numeric ,character etc
## Rows: 21
## Columns: 24
## $ STUDYID <chr> "AB-5365-101", "AB-5365-101", "AB-5365-101", "AB-5365-101", "…
## $ DOMAIN <chr> "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "…
## $ USUBJID <chr> "AB-5365-101-001-1001", "AB-5365-101-001-4002", "AB-5365-101-…
## $ SUBJID <chr> "001-1001", "001-4002", "001-4004", "002-1001", "002-3003", "…
## $ RFSTDTC <chr> "2018-04-03T11:24", "2018-04-12T20:40", "2019-03-04T23:30", "…
## $ RFENDTC <chr> "2018-07-23", "2019-05-06", "2019-04-01", "2018-11-03", "2019…
## $ RFXSTDTC <chr> "2018-04-03T11:24", "2018-04-12T20:40", "2019-03-04T23:30", "…
## $ RFXENDTC <chr> "2018-04-23", "2019-04-28", "2019-03-17", "2018-06-20", "2019…
## $ RFICDTC <chr> "2018-03-12", "2019-02-01", "2019-02-20", "2018-04-24", "2019…
## $ RFPENDTC <chr> "2018-07-23", "", "2019-04-01", "2018-11-03", "2019-04-22", "…
## $ DTHDTC <chr> "2018-07-23", "", "2019-04-01", "2018-11-03", "", "", "", "",…
## $ DTHFL <chr> "Y", "", "Y", "Y", "", "", "", "", "", "", "", "", "", "Y", "…
## $ SITEID <chr> "001", "001", "001", "002", "002", "002", "012", "012", "012"…
## $ BRTHDTC <chr> "1956-07-15", "1940-05-23", "1937-01-22", "1934-03-16", "1935…
## $ AGE <dbl> 61, 78, 82, 84, 83, 66, 58, 61, 65, 66, 76, 87, 68, 69, 63, 8…
## $ AGEU <chr> "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS"…
## $ SEX <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "F", "M", "M", "…
## $ RACE <chr> "WHITE", "WHITE", "WHITE", "WHITE", "WHITE", "WHITE", "WHITE"…
## $ ETHNIC <chr> "NOT HISPANIC OR LATINO", "HISPANIC OR LATINO", "NOT HISPANIC…
## $ ARM <chr> "AB-5365-101 50 MG", "AB-5365-101 100 MG", "AB-5365-101 100 M…
## $ ACTARM <chr> "AB-5365-101 50 MG", "AB-5365-101 100 MG", "AB-5365-101 100 M…
## $ COUNTRY <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
## $ ARMCD <chr> "AB-5365101-50", "AB-5365101-100", "AB-5365101-100", "AB-5365…
## $ ACTARMCD <chr> "AB-5365101-50", "AB-5365101-100", "AB-5365101-100", "AB-5365…
dmdata %>%
select_if(is.numeric) %>%
names()
## [1] "AGE"
dmdata %>%
select_if(is.character) %>%
names()
## [1] "STUDYID" "DOMAIN" "USUBJID" "SUBJID" "RFSTDTC" "RFENDTC"
## [7] "RFXSTDTC" "RFXENDTC" "RFICDTC" "RFPENDTC" "DTHDTC" "DTHFL"
## [13] "SITEID" "BRTHDTC" "AGEU" "SEX" "RACE" "ETHNIC"
## [19] "ARM" "ACTARM" "COUNTRY" "ARMCD" "ACTARMCD"
This allow you to select a column by a given condition. If the condition is true or false.
dmdata %>%
select_if(is.numeric) %>%
select_if(~mean(., na.rm=TRUE) > 5) %>% head(2) %>% knitr::kable()
AGE |
---|
61 |
78 |
#shorter way
dmdata %>%
select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 5) %>% head(2) %>% knitr::kable()#where mean is more than 5 and only numeric variables
AGE |
---|
61 |
78 |
If you have a lot of columns with a similar structure you can use partial matching by adding starts_with(), ends_with() or contains() in your select statement.
dmdata %>%
select( starts_with("ACT")) %>%#select columns that start with ACT
names() %>% knitr::kable()
x |
---|
ACTARM |
ACTARMCD |
#ends
dmdata %>%
select(ends_with("DTC")) %>%#select columns ending with DTC
colnames() %>% knitr::kable()
x |
---|
RFSTDTC |
RFENDTC |
RFXSTDTC |
RFXENDTC |
RFICDTC |
RFPENDTC |
DTHDTC |
BRTHDTC |
#contains
dmdata %>%
select(contains("RFX"))%>%#select columns that has RFX
colnames()
## [1] "RFXSTDTC" "RFXENDTC"
dmdata %>%
select(matches("RFX|ARM")) %>%
glimpse
## Rows: 21
## Columns: 6
## $ RFXSTDTC <chr> "2018-04-03T11:24", "2018-04-12T20:40", "2019-03-04T23:30", "…
## $ RFXENDTC <chr> "2018-04-23", "2019-04-28", "2019-03-17", "2018-06-20", "2019…
## $ ARM <chr> "AB-5365-101 50 MG", "AB-5365-101 100 MG", "AB-5365-101 100 M…
## $ ACTARM <chr> "AB-5365-101 50 MG", "AB-5365-101 100 MG", "AB-5365-101 100 M…
## $ ARMCD <chr> "AB-5365101-50", "AB-5365101-100", "AB-5365101-100", "AB-5365…
## $ ACTARMCD <chr> "AB-5365101-50", "AB-5365101-100", "AB-5365101-100", "AB-5365…
classification <- c("STUDYID", "DOMAIN", "USUBJID", "SUBJID", "RFSTDTC", "RFENDTC",
"RFXSTDTC", "RFXENDTC", "RFICDTC", "RFPENDTC", "DTHDTC", "DTHFL",
"SITEID", "BRTHDTC")
dmdata %>%
select(!!classification) %>% names()
## [1] "STUDYID" "DOMAIN" "USUBJID" "SUBJID" "RFSTDTC" "RFENDTC"
## [7] "RFXSTDTC" "RFXENDTC" "RFICDTC" "RFPENDTC" "DTHDTC" "DTHFL"
## [13] "SITEID" "BRTHDTC"
dmdata %>% select(last_col(0:4))
## # A tibble: 21 × 5
## ACTARMCD ARMCD COUNTRY ACTARM ARM
## <chr> <chr> <chr> <chr> <chr>
## 1 AB-5365101-50 AB-5365101-50 USA AB-5365-101 50 MG AB-5365-101 50 MG
## 2 AB-5365101-100 AB-5365101-100 USA AB-5365-101 100 MG AB-5365-101 100 MG
## 3 AB-5365101-100 AB-5365101-100 USA AB-5365-101 100 MG AB-5365-101 100 MG
## 4 AB-5365101-50 AB-5365101-50 USA AB-5365-101 50 MG AB-5365-101 50 MG
## 5 AB-5365101-50 AB-5365101-50 USA AB-5365-101 50 MG AB-5365-101 50 MG
## 6 AB-5365101-100 AB-5365101-100 USA AB-5365-101 100 MG AB-5365-101 100 MG
## 7 AB-5365101-50 AB-5365101-50 USA AB-5365-101 50 MG AB-5365-101 50 MG
## 8 AB-5365101-50 AB-5365101-50 USA AB-5365-101 50 MG AB-5365-101 50 MG
## 9 AB-5365101-100 AB-5365101-100 USA AB-5365-101 100 MG AB-5365-101 100 MG
## 10 AB-5365101-250 AB-5365101-250 USA AB-5365-101 250 MG AB-5365-101 250 MG
## # … with 11 more rows
(dmdata %>% select(last_col(0),everything())) %>% head(2)#putting the last columns to be the first
## # A tibble: 2 × 24
## ACTARMCD STUDYID DOMAIN USUBJID SUBJID RFSTDTC RFENDTC RFXSTDTC RFXENDTC
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 AB-5365101-50 AB-536… DM AB-536… 001-1… 2018-0… 2018-0… 2018-04… 2018-04…
## 2 AB-5365101-100 AB-536… DM AB-536… 001-4… 2018-0… 2019-0… 2018-04… 2019-04…
## # … with 15 more variables: RFICDTC <chr>, RFPENDTC <chr>, DTHDTC <chr>,
## # DTHFL <chr>, SITEID <chr>, BRTHDTC <chr>, AGE <dbl>, AGEU <chr>, SEX <chr>,
## # RACE <chr>, ETHNIC <chr>, ARM <chr>, ACTARM <chr>, COUNTRY <chr>,
## # ARMCD <chr>
dmdata$NULLcol=NA
dmdata %>%select_if(~ any(is.na(.))) %>%names() %>% knitr::kable()#empty column
x |
---|
NULLcol |
dmdata %>%select_if(~! any(is.na(.))) %>%names() %>% knitr::kable()#non-empty column
x |
---|
STUDYID |
DOMAIN |
USUBJID |
SUBJID |
RFSTDTC |
RFENDTC |
RFXSTDTC |
RFXENDTC |
RFICDTC |
RFPENDTC |
DTHDTC |
DTHFL |
SITEID |
BRTHDTC |
AGE |
AGEU |
SEX |
RACE |
ETHNIC |
ARM |
ACTARM |
COUNTRY |
ARMCD |
ACTARMCD |
suppressPackageStartupMessages(library(janitor))#calling janitor package
#(dfw=dmdata %>% remove_empty() %>% head(2)) # the fucntion drop empty columns from data
#(dfw1=dmdata %>% discard(~all(is.na(.))) %>% head(2))
We use a minus sign before the variable or the vector of the variables to be dropped
dmdata %>%
select(-STUDYID, -(DTHFL:ACTARMCD)) %>%
names()
## [1] "DOMAIN" "USUBJID" "SUBJID" "RFSTDTC" "RFENDTC" "RFXSTDTC"
## [7] "RFXENDTC" "RFICDTC" "RFPENDTC" "DTHDTC" "NULLcol"
1 na.omit the command drop all rows with NAs,it comes from base r
2 comlpete.case it works in similar way as na.omit
3 drop_na the function comes from ****dplyr*** package it works like na.omit
set.seed(4321)
mt <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)#creating a matrix (10 by 10)
dt <- as.data.frame(mt)#converting the matrix to a dataframe
dt#calling the data
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 4 5 2 8 9 1 4 8 9 NA
## 2 10 5 5 6 8 6 10 10 3 8
## 3 9 9 10 6 2 6 6 6 10 8
## 4 4 3 6 9 NA 6 10 7 5 NA
## 5 9 6 8 4 7 3 8 3 7 2
## 6 7 9 1 2 4 2 2 3 1 6
## 7 10 10 7 3 6 NA 6 2 3 6
## 8 3 3 4 10 8 3 4 10 NA 10
## 9 3 3 5 3 4 1 4 3 1 4
## 10 4 2 NA NA 3 7 5 3 7 5
colSums(is.na(dt))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 0 0 1 1 1 1 0 0 1 2
dt[is.na(dt)] <- 0# assigning zero to NAs
dt# calling the new data
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 4 5 2 8 9 1 4 8 9 0
## 2 10 5 5 6 8 6 10 10 3 8
## 3 9 9 10 6 2 6 6 6 10 8
## 4 4 3 6 9 0 6 10 7 5 0
## 5 9 6 8 4 7 3 8 3 7 2
## 6 7 9 1 2 4 2 2 3 1 6
## 7 10 10 7 3 6 0 6 2 3 6
## 8 3 3 4 10 8 3 4 10 0 10
## 9 3 3 5 3 4 1 4 3 1 4
## 10 4 2 0 0 3 7 5 3 7 5
colSums(is.na(dt))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 0 0 0 0 0 0 0 0 0 0
df <- data.frame(X1 = c(1:10),
X2 = rep(NA, 10),
X3 = letters[seq( from = 1, to = 10 )],
X4 = rep("",10))# creating hypothetical dataset in r
df
## X1 X2 X3 X4
## 1 1 NA a
## 2 2 NA b
## 3 3 NA c
## 4 4 NA d
## 5 5 NA e
## 6 6 NA f
## 7 7 NA g
## 8 8 NA h
## 9 9 NA i
## 10 10 NA j
df1=na.omit(df)#we drop all the NAs values
df1
## [1] X1 X2 X3 X4
## <0 rows> (or 0-length row.names)
df2=df[complete.cases(df),]#we keep only complete cases
df2
## [1] X1 X2 X3 X4
## <0 rows> (or 0-length row.names)
m <- structure(c( 1, NA, 3, 4, 5,
6, NA, 8, NA, 10,
11, NA, 13, NA, NA),
.Dim = c(5L, 3L))
m
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] NA NA NA
## [3,] 3 8 13
## [4,] 4 NA NA
## [5,] 5 10 NA
(mm=m[rowSums(is.na(m)) != ncol(m), ])#we delete all rows that have NAs
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 3 8 13
## [3,] 4 NA NA
## [4,] 5 10 NA
ind <- apply(m, 1, function(x) all(is.na(x)))
m1=m[!ind,]
m1
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 3 8 13
## [3,] 4 NA NA
## [4,] 5 10 NA
suppressPackageStartupMessages(library(lubridate))#we call lubridate and suppress any worning or message
dmdata %>% mutate(year1=substr(RFSTDTC,1,4),yr=year(RFSTDTC),hr=hour(RFSTDTC)) %>%
select(year1,yr,hr) %>% head(3) %>% knitr::kable()
year1 | yr | hr |
---|---|---|
2018 | 2018 | 0 |
2018 | 2018 | 0 |
2019 | 2019 | 0 |
dmdata %>%
mutate_if(is.numeric, round,1) %>%select_if(~is.numeric(.)) %>%
head(3) %>% knitr::kable()
AGE |
---|
61 |
78 |
82 |
dffd=dmdata %>%
select(STUDYID, SUBJID) %>%unite('usubjid1',c(STUDYID,SUBJID),sep = "/",remove = F) %>%
head(3)
dffd %>% knitr::kable()
usubjid1 | STUDYID | SUBJID |
---|---|---|
AB-5365-101/001-1001 | AB-5365-101 | 001-1001 |
AB-5365-101/001-4002 | AB-5365-101 | 001-4002 |
AB-5365-101/001-4004 | AB-5365-101 | 001-4004 |
dffd %>%separate(usubjid1,c('STUDYID_a','SUBJID_b'),sep = "/",remove = F) %>%
head(3) %>% knitr::kable()#separating the usubjid1 by separator/
usubjid1 | STUDYID_a | SUBJID_b | STUDYID | SUBJID |
---|---|---|---|---|
AB-5365-101/001-1001 | AB-5365-101 | 001-1001 | AB-5365-101 | 001-1001 |
AB-5365-101/001-4002 | AB-5365-101 | 001-4002 | AB-5365-101 | 001-4002 |
AB-5365-101/001-4004 | AB-5365-101 | 001-4004 | AB-5365-101 | 001-4004 |
dmdata %>% select(USUBJID,SEX,ETHNIC) %>% filter(SEX=="M" &USUBJID %in% c('AB-5365-101-017-3003','AB-5365-101-018-2004','AB-5365-101-019-2003')) %>%
knitr::kable() #select the identified subject and they should ne male
USUBJID | SEX | ETHNIC |
---|---|---|
AB-5365-101-017-3003 | M | NOT HISPANIC OR LATINO |
AB-5365-101-019-2003 | M | NOT HISPANIC OR LATINO |
dmdata %>% select(USUBJID,SEX,ETHNIC,AGE) %>% filter(SEX=="M" & AGE>78) %>%
knitr::kable()
USUBJID | SEX | ETHNIC | AGE |
---|---|---|---|
AB-5365-101-001-4004 | M | NOT HISPANIC OR LATINO | 82 |
AB-5365-101-002-1001 | M | NOT HISPANIC OR LATINO | 84 |
AB-5365-101-002-3003 | M | NOT HISPANIC OR LATINO | 83 |
AB-5365-101-012-5007 | M | NOT HISPANIC OR LATINO | 87 |
AB-5365-101-017-3003 | M | NOT HISPANIC OR LATINO | 81 |
dmdata %>% select(USUBJID,AGE) %>% arrange(AGE) %>% head(8) %>%
knitr::kable()#ascending
USUBJID | AGE |
---|---|
AB-5365-101-012-1002 | 58 |
AB-5365-101-001-1001 | 61 |
AB-5365-101-012-3003 | 61 |
AB-5365-101-018-2004 | 61 |
AB-5365-101-017-3002 | 63 |
AB-5365-101-019-2003 | 64 |
AB-5365-101-012-4004 | 65 |
AB-5365-101-002-4004 | 66 |
dmdata %>% select(USUBJID,AGE) %>% arrange(desc(AGE)) %>% head(8) %>%
knitr::kable()#descending age
USUBJID | AGE |
---|---|
AB-5365-101-012-5007 | 87 |
AB-5365-101-002-1001 | 84 |
AB-5365-101-002-3003 | 83 |
AB-5365-101-001-4004 | 82 |
AB-5365-101-017-3003 | 81 |
AB-5365-101-001-4002 | 78 |
AB-5365-101-012-5006 | 76 |
AB-5365-101-019-1001 | 76 |
dmdata %>% select(SUBJID,SEX,contains('ACT')) %>% rename(Subject=SUBJID,Gender=SEX) %>% head(3)%>%
knitr::kable() #we use rename to change columns name
Subject | Gender | ACTARM | ACTARMCD |
---|---|---|---|
001-1001 | M | AB-5365-101 50 MG | AB-5365101-50 |
001-4002 | M | AB-5365-101 100 MG | AB-5365101-100 |
001-4004 | M | AB-5365-101 100 MG | AB-5365101-100 |
dmdata %>% select(Subject1=SUBJID,Gender1=SEX,contains('ACT')) %>% head(3)%>%
knitr::kable()
Subject1 | Gender1 | ACTARM | ACTARMCD |
---|---|---|---|
001-1001 | M | AB-5365-101 50 MG | AB-5365101-50 |
001-4002 | M | AB-5365-101 100 MG | AB-5365101-100 |
001-4004 | M | AB-5365-101 100 MG | AB-5365101-100 |
(names(dmdata)[4]=('SBJD'))
## [1] "SBJD"
dmdata %>% group_by(ARM) %>% summarise(Average = mean(AGE),sdv=sd(AGE),median_=median(AGE))%>%
knitr::kable()
ARM | Average | sdv | median_ |
---|---|---|---|
AB-5365-101 100 MG | 70.14286 | 8.071112 | 66.0 |
AB-5365-101 250 MG | 76.33333 | 10.503967 | 76.0 |
AB-5365-101 450 MG | 68.00000 | NA | 68.0 |
AB-5365-101 50 MG | 70.80000 | 9.863513 | 70.5 |