*Tidyverse documentation (https://www.tidyverse.org/)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
setwd("E:/Biostat and Study Design/204/Lectures/Data")
Labs_SCr_df <- openxlsx::read.xlsx('Patients_Labs_SCr.xlsx')
Learning how to organize your data and preparing it for statistical analysis is an essential skill to acquire. Data comes in different shapes that require wrangling to be put into the proper form for statistical analysis.
Tidyverse package comes with excellent functions for data wrangling. We will be only scratching the surface during this course. If you are interested in the topic, I highly recommend R for Data Science (“https://r4ds.had.co.nz/”).
A common problem is when the column names are not names of variables, but values of a variable. For example, the column names “Day1” and “Day2” represent values of the “Date” variable, and the values in the “Day1” and “Day2” columns represent values of serum creatinine (SCr).
dim(Labs_SCr_df)
## [1] 100 12
head(Labs_SCr_df)
## patient_id ethnicity Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1 pt_01 1 1.48 0.84 0.89 0.90 1.43 1.18 0.79 1.10 1.08 0.88
## 2 pt_02 1 1.11 0.88 1.27 0.83 0.98 0.86 0.83 1.17 0.98 0.66
## 3 pt_03 1 0.96 0.62 0.77 1.15 1.27 0.67 0.74 0.77 0.28 0.93
## 4 pt_04 1 1.39 0.87 0.92 1.05 0.86 1.39 0.85 1.19 1.51 0.97
## 5 pt_05 1 0.68 1.01 1.06 0.60 1.25 0.97 0.83 1.07 0.69 1.39
## 6 pt_06 1 1.11 1.02 1.43 0.74 1.33 0.81 0.94 1.03 1.07 1.54
colnames(Labs_SCr_df)
## [1] "patient_id" "ethnicity" "Day1" "Day2" "Day3"
## [6] "Day4" "Day5" "Day6" "Day7" "Day8"
## [11] "Day9" "Day10"
To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables using the pivot_longer() function. To describe that operation, we need three parameters:
Labs_SCr_df_long <- Labs_SCr_df %>% pivot_longer(cols= c(Day1,Day2,Day3,Day4,Day5,Day6,Day7,Day8,Day9,Day10), names_to = "Date", values_to = "SCr")
Labs_SCr_df_long %>% head()
## # A tibble: 6 × 4
## patient_id ethnicity Date SCr
## <chr> <dbl> <chr> <dbl>
## 1 pt_01 1 Day1 1.48
## 2 pt_01 1 Day2 0.84
## 3 pt_01 1 Day3 0.89
## 4 pt_01 1 Day4 0.9
## 5 pt_01 1 Day5 1.43
## 6 pt_01 1 Day6 1.18
Labs_SCr_df_long %>% dim()
## [1] 1000 4
Alternatively, instead of specifying the columns to pivot, you specify the columns not to pivot by putting “-” before the vector of columns. This could save you time!
Labs_SCr_df_long <- Labs_SCr_df %>% pivot_longer(cols= -c(patient_id,ethnicity), names_to = 'Date', values_to = 'SCr')
head(Labs_SCr_df_long)
## # A tibble: 6 × 4
## patient_id ethnicity Date SCr
## <chr> <dbl> <chr> <dbl>
## 1 pt_01 1 Day1 1.48
## 2 pt_01 1 Day2 0.84
## 3 pt_01 1 Day3 0.89
## 4 pt_01 1 Day4 0.9
## 5 pt_01 1 Day5 1.43
## 6 pt_01 1 Day6 1.18
The figure below illustrates how pivot_longer() function works.
pivot_wider() is the opposite of pivot_longer(). You use it when observations are scattered across multiple rows. You need two parameters to use pivot_wider():
Labs_SCr_df_wide <- Labs_SCr_df_long %>% pivot_wider(names_from = Date,values_from = SCr)
Labs_SCr_df_wide %>% head()
## # A tibble: 6 × 12
## patient_id ethnicity Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pt_01 1 1.48 0.84 0.89 0.9 1.43 1.18 0.79 1.1 1.08
## 2 pt_02 1 1.11 0.88 1.27 0.83 0.98 0.86 0.83 1.17 0.98
## 3 pt_03 1 0.96 0.62 0.77 1.15 1.27 0.67 0.74 0.77 0.28
## 4 pt_04 1 1.39 0.87 0.92 1.05 0.86 1.39 0.85 1.19 1.51
## 5 pt_05 1 0.68 1.01 1.06 0.6 1.25 0.97 0.83 1.07 0.69
## 6 pt_06 1 1.11 1.02 1.43 0.74 1.33 0.81 0.94 1.03 1.07
## # ℹ 1 more variable: Day10 <dbl>
To extract columns from our dataset, we use the select() function to provide information about which columns we want or don’t want. If we have many columns we want to include but a few we want to exclude, we can use a minus (-) before the column names to be excluded.
Labs_SCr_df %>% select(patient_id,Day1, Day2, Day3) %>% head() # select column patient_id, Day1, Day2, Day3
## patient_id Day1 Day2 Day3
## 1 pt_01 1.48 0.84 0.89
## 2 pt_02 1.11 0.88 1.27
## 3 pt_03 0.96 0.62 0.77
## 4 pt_04 1.39 0.87 0.92
## 5 pt_05 0.68 1.01 1.06
## 6 pt_06 1.11 1.02 1.43
Labs_SCr_df %>% select(-Day2, -Day10) %>% head() #select all columns except Day2 and Day10
## patient_id ethnicity Day1 Day3 Day4 Day5 Day6 Day7 Day8 Day9
## 1 pt_01 1 1.48 0.89 0.90 1.43 1.18 0.79 1.10 1.08
## 2 pt_02 1 1.11 1.27 0.83 0.98 0.86 0.83 1.17 0.98
## 3 pt_03 1 0.96 0.77 1.15 1.27 0.67 0.74 0.77 0.28
## 4 pt_04 1 1.39 0.92 1.05 0.86 1.39 0.85 1.19 1.51
## 5 pt_05 1 0.68 1.06 0.60 1.25 0.97 0.83 1.07 0.69
## 6 pt_06 1 1.11 1.43 0.74 1.33 0.81 0.94 1.03 1.07
You can select columns based on index/location.
Labs_SCr_df %>% select(1) %>% head() #select column 1
## patient_id
## 1 pt_01
## 2 pt_02
## 3 pt_03
## 4 pt_04
## 5 pt_05
## 6 pt_06
Labs_SCr_df %>% select(1,2,3) %>% head() #select columns 1, 2, 3
## patient_id ethnicity Day1
## 1 pt_01 1 1.48
## 2 pt_02 1 1.11
## 3 pt_03 1 0.96
## 4 pt_04 1 1.39
## 5 pt_05 1 0.68
## 6 pt_06 1 1.11
Labs_SCr_df %>% select(3:12) %>% head() #select columns 3 through 12
## Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1 1.48 0.84 0.89 0.90 1.43 1.18 0.79 1.10 1.08 0.88
## 2 1.11 0.88 1.27 0.83 0.98 0.86 0.83 1.17 0.98 0.66
## 3 0.96 0.62 0.77 1.15 1.27 0.67 0.74 0.77 0.28 0.93
## 4 1.39 0.87 0.92 1.05 0.86 1.39 0.85 1.19 1.51 0.97
## 5 0.68 1.01 1.06 0.60 1.25 0.97 0.83 1.07 0.69 1.39
## 6 1.11 1.02 1.43 0.74 1.33 0.81 0.94 1.03 1.07 1.54
You can also select columns based on names. starts_with() function will return columns which the string you provide is at the beginning of a column name. ends_with() will return columns which the string you provide is at the end of a column name.
Labs_SCr_df %>% select(starts_with('patient'), Day1) %>% head()
## patient_id Day1
## 1 pt_01 1.48
## 2 pt_02 1.11
## 3 pt_03 0.96
## 4 pt_04 1.39
## 5 pt_05 0.68
## 6 pt_06 1.11
The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for the condition. We can put multiple logical conditions inside a single filter() function. You can comma separate conditions, which is equivalent to an “AND”, or you can use the compound logical operators “AND” (&) and “OR” (|) to construct more complex statements.
Labs_SCr_df %>% filter(patient_id=='pt_05') #filter patient with ID pt_05
## patient_id ethnicity Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1 pt_05 1 0.68 1.01 1.06 0.6 1.25 0.97 0.83 1.07 0.69 1.39
Labs_SCr_df %>% filter(Day1 > 1.5) %>% head() #filter observations with SCr > 1.5 on Day1
## [1] patient_id ethnicity Day1 Day2 Day3 Day4
## [7] Day5 Day6 Day7 Day8 Day9 Day10
## <0 rows> (or 0-length row.names)
Labs_SCr_df %>% filter(Day1 >1 & Day2>1) %>% dim() #filter observations with SCr > 1 on Day1 AND Day2
## [1] 24 12
Labs_SCr_df %>% filter(Day1>1 | Day2 >1) %>% dim() #filter observations with SCr > 1 on Day1 OR Day2
## [1] 72 12
Labs_SCr_df %>% filter((Day1 >1 & Day2 >1)| Day3 >1.5) %>% dim() #filter observations with SCr > 1 on Day1 and Day2 OR SCr > 1.5 on Day3
## [1] 26 12
Labs_SCr_df %>% filter(!is.na(Day1)) %>% dim() #filter observations with is NOT Day1
## [1] 87 12
The mutate() function is used to to modify data sets. The mutate() function takes any number of expressions and return new values of the same length.
Labs_SCr_df %>% group_by(patient_id) %>% mutate(SCrMean=mean(c(Day1, Day2, Day3, Day4, Day5, Day6, Day7, Day8, Day9, Day10),na.rm=TRUE)) %>% head() #calculate mean of SCr across observations per patient
## # A tibble: 6 × 13
## # Groups: patient_id [6]
## patient_id ethnicity Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pt_01 1 1.48 0.84 0.89 0.9 1.43 1.18 0.79 1.1 1.08
## 2 pt_02 1 1.11 0.88 1.27 0.83 0.98 0.86 0.83 1.17 0.98
## 3 pt_03 1 0.96 0.62 0.77 1.15 1.27 0.67 0.74 0.77 0.28
## 4 pt_04 1 1.39 0.87 0.92 1.05 0.86 1.39 0.85 1.19 1.51
## 5 pt_05 1 0.68 1.01 1.06 0.6 1.25 0.97 0.83 1.07 0.69
## 6 pt_06 1 1.11 1.02 1.43 0.74 1.33 0.81 0.94 1.03 1.07
## # ℹ 2 more variables: Day10 <dbl>, SCrMean <dbl>
The summarise() function is used to aggregate data and return one row or one row per group. To use summarise provide expressions that result in a single value and provide multiple expressions in a comma separated fashion.
Labs_SCr_df %>% summarise(Day1Mean=mean(Day1,na.rm=TRUE),Day2Mean=mean(Day2,na.rm=TRUE)) #calculate mean SCr on Day1 and Day2.
## Day1Mean Day2Mean
## 1 1.016207 0.9748
Labs_SCr_df %>% group_by(patient_id) %>% summarise(SCrMean=mean(c(Day1, Day2, Day3, Day4, Day5, Day6, Day7, Day8, Day9, Day10),na.rm=TRUE)) %>% head() #calculate mean of SCr across observations per patient
## # A tibble: 6 × 2
## patient_id SCrMean
## <chr> <dbl>
## 1 pt_01 1.06
## 2 pt_02 0.957
## 3 pt_03 0.816
## 4 pt_04 1.1
## 5 pt_05 0.955
## 6 pt_06 1.10
Labs_SCr_df_long %>% group_by(patient_id) %>% summarise(SCrMean=mean(SCr,na.rm=TRUE)) %>% head() #calculate mean of SCr across observations per patient
## # A tibble: 6 × 2
## patient_id SCrMean
## <chr> <dbl>
## 1 pt_01 1.06
## 2 pt_02 0.957
## 3 pt_03 0.816
## 4 pt_04 1.1
## 5 pt_05 0.955
## 6 pt_06 1.10
Labs_SCr_df_long %>% group_by(ethnicity) %>% summarise(SCrMean=mean(SCr,na.rm=TRUE)) %>% head() #calculate mean of SCr across observations per ethnicity
## # A tibble: 3 × 2
## ethnicity SCrMean
## <dbl> <dbl>
## 1 1 1.01
## 2 2 0.991
## 3 3 0.974
Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. Factors play very important role in data modeling because categorical variables are handled differently compared to continuous variables.
Depending on the function used to import data into R environment, factors may or may not be created automatically. Always, make sure to investigate your data before making any assumptions. The ethnicity field is coded as a numeric variable.
class(Labs_SCr_df$ethnicity)
## [1] "numeric"
We can convert as numeric variable to a factor using the as.factor() function.
Labs_SCr_df$ethnicity <- as.factor(Labs_SCr_df$ethnicity)
class(Labs_SCr_df$ethnicity)
## [1] "factor"
Labs_SCr_df$ethnicity
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3
The ethnicity factor has levels (1,2, and 3) associated with it. You can assign new levels using the levels function.
levels(Labs_SCr_df$ethnicity) <- c('White','Black','Other') #assign new levels
Labs_SCr_df$ethnicity
## [1] White White White White White White White White White White White White
## [13] White White White White White White White White White White White White
## [25] White White White White White White White White Black Black Black Black
## [37] Black Black Black Black Black Black Black Black Black Black Black Black
## [49] Black Black Black Black Black Black Black Black Black Black Black Black
## [61] Black Black Black Black Black Black Other Other Other Other Other Other
## [73] Other Other Other Other Other Other Other Other Other Other Other Other
## [85] Other Other Other Other Other Other Other Other Other Other Other Other
## [97] Other Other Other Other
## Levels: White Black Other
As you will learn later this quarter, the reference level in a factor variable is important in the interpreting statistical models. The first level in a factor variable is usually the default reference level. You can change that using the relevel() function.
Labs_SCr_df$ethnicity <- relevel(Labs_SCr_df$ethnicity,ref = 'Black') #change reference to Black
Labs_SCr_df$ethnicity
## [1] White White White White White White White White White White White White
## [13] White White White White White White White White White White White White
## [25] White White White White White White White White Black Black Black Black
## [37] Black Black Black Black Black Black Black Black Black Black Black Black
## [49] Black Black Black Black Black Black Black Black Black Black Black Black
## [61] Black Black Black Black Black Black Other Other Other Other Other Other
## [73] Other Other Other Other Other Other Other Other Other Other Other Other
## [85] Other Other Other Other Other Other Other Other Other Other Other Other
## [97] Other Other Other Other
## Levels: Black White Other