Data Wrangling with R

Learning Objectives

Learn how to pivot data frames wide and long.
Learn how to how to select, filter, mutate, and summarise data.
Learn to work with factors with R.

*Tidyverse documentation (https://www.tidyverse.org/)

RPubs

https://rpubs.com/zaidyousif/1140283

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)

Load data

setwd("E:/Biostat and Study Design/204/Lectures/Data")
Labs_SCr_df <- openxlsx::read.xlsx('Patients_Labs_SCr.xlsx')

Wrangling Data

Learning how to organize your data and preparing it for statistical analysis is an essential skill to acquire. Data comes in different shapes that require wrangling to be put into the proper form for statistical analysis.

Tidyverse package comes with excellent functions for data wrangling. We will be only scratching the surface during this course. If you are interested in the topic, I highly recommend R for Data Science (“https://r4ds.had.co.nz/”).

Pivoting Data

A common problem is when the column names are not names of variables, but values of a variable. For example, the column names “Day1” and “Day2” represent values of the “Date” variable, and the values in the “Day1” and “Day2” columns represent values of serum creatinine (SCr).

dim(Labs_SCr_df)

## [1] 100  12

head(Labs_SCr_df)

##   patient_id ethnicity Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1      pt_01         1 1.48 0.84 0.89 0.90 1.43 1.18 0.79 1.10 1.08  0.88
## 2      pt_02         1 1.11 0.88 1.27 0.83 0.98 0.86 0.83 1.17 0.98  0.66
## 3      pt_03         1 0.96 0.62 0.77 1.15 1.27 0.67 0.74 0.77 0.28  0.93
## 4      pt_04         1 1.39 0.87 0.92 1.05 0.86 1.39 0.85 1.19 1.51  0.97
## 5      pt_05         1 0.68 1.01 1.06 0.60 1.25 0.97 0.83 1.07 0.69  1.39
## 6      pt_06         1 1.11 1.02 1.43 0.74 1.33 0.81 0.94 1.03 1.07  1.54

colnames(Labs_SCr_df)

##  [1] "patient_id" "ethnicity"  "Day1"       "Day2"       "Day3"      
##  [6] "Day4"       "Day5"       "Day6"       "Day7"       "Day8"      
## [11] "Day9"       "Day10"

To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables using the pivot_longer() function. To describe that operation, we need three parameters:

cols: the set of columns whose names are values, not variables. In this example, those are the columns “Day1” to “Day10”.
names_to: the name of the variable to move the column names to. Here it is “Date”.
values_to: The name of the variable to move the column values to. Here it’s “SCr”.

Labs_SCr_df_long <- Labs_SCr_df %>% pivot_longer(cols= c(Day1,Day2,Day3,Day4,Day5,Day6,Day7,Day8,Day9,Day10), names_to = "Date", values_to = "SCr")

Labs_SCr_df_long %>% head()

## # A tibble: 6 × 4
##   patient_id ethnicity Date    SCr
##   <chr>          <dbl> <chr> <dbl>
## 1 pt_01              1 Day1   1.48
## 2 pt_01              1 Day2   0.84
## 3 pt_01              1 Day3   0.89
## 4 pt_01              1 Day4   0.9 
## 5 pt_01              1 Day5   1.43
## 6 pt_01              1 Day6   1.18

Labs_SCr_df_long %>% dim()

## [1] 1000    4

Alternatively, instead of specifying the columns to pivot, you specify the columns not to pivot by putting “-” before the vector of columns. This could save you time!

Labs_SCr_df_long <- Labs_SCr_df %>% pivot_longer(cols= -c(patient_id,ethnicity), names_to = 'Date', values_to = 'SCr')
head(Labs_SCr_df_long)

## # A tibble: 6 × 4
##   patient_id ethnicity Date    SCr
##   <chr>          <dbl> <chr> <dbl>
## 1 pt_01              1 Day1   1.48
## 2 pt_01              1 Day2   0.84
## 3 pt_01              1 Day3   0.89
## 4 pt_01              1 Day4   0.9 
## 5 pt_01              1 Day5   1.43
## 6 pt_01              1 Day6   1.18

The figure below illustrates how pivot_longer() function works.

Alt

pivot_wider() is the opposite of pivot_longer(). You use it when observations are scattered across multiple rows. You need two parameters to use pivot_wider():

names_from: the column to take variable names from. Here, it’s “Date”.
values_from: The column to take values from. Here it’s “SCr”.

Labs_SCr_df_wide <- Labs_SCr_df_long %>% pivot_wider(names_from = Date,values_from = SCr)
Labs_SCr_df_wide %>% head()

## # A tibble: 6 × 12
##   patient_id ethnicity  Day1  Day2  Day3  Day4  Day5  Day6  Day7  Day8  Day9
##   <chr>          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pt_01              1  1.48  0.84  0.89  0.9   1.43  1.18  0.79  1.1   1.08
## 2 pt_02              1  1.11  0.88  1.27  0.83  0.98  0.86  0.83  1.17  0.98
## 3 pt_03              1  0.96  0.62  0.77  1.15  1.27  0.67  0.74  0.77  0.28
## 4 pt_04              1  1.39  0.87  0.92  1.05  0.86  1.39  0.85  1.19  1.51
## 5 pt_05              1  0.68  1.01  1.06  0.6   1.25  0.97  0.83  1.07  0.69
## 6 pt_06              1  1.11  1.02  1.43  0.74  1.33  0.81  0.94  1.03  1.07
## # ℹ 1 more variable: Day10 <dbl>

Select

To extract columns from our dataset, we use the select() function to provide information about which columns we want or don’t want. If we have many columns we want to include but a few we want to exclude, we can use a minus (-) before the column names to be excluded.

Labs_SCr_df %>% select(patient_id,Day1, Day2, Day3) %>% head() # select column patient_id, Day1, Day2, Day3

##   patient_id Day1 Day2 Day3
## 1      pt_01 1.48 0.84 0.89
## 2      pt_02 1.11 0.88 1.27
## 3      pt_03 0.96 0.62 0.77
## 4      pt_04 1.39 0.87 0.92
## 5      pt_05 0.68 1.01 1.06
## 6      pt_06 1.11 1.02 1.43

Labs_SCr_df %>% select(-Day2, -Day10) %>% head() #select all columns except Day2 and Day10

##   patient_id ethnicity Day1 Day3 Day4 Day5 Day6 Day7 Day8 Day9
## 1      pt_01         1 1.48 0.89 0.90 1.43 1.18 0.79 1.10 1.08
## 2      pt_02         1 1.11 1.27 0.83 0.98 0.86 0.83 1.17 0.98
## 3      pt_03         1 0.96 0.77 1.15 1.27 0.67 0.74 0.77 0.28
## 4      pt_04         1 1.39 0.92 1.05 0.86 1.39 0.85 1.19 1.51
## 5      pt_05         1 0.68 1.06 0.60 1.25 0.97 0.83 1.07 0.69
## 6      pt_06         1 1.11 1.43 0.74 1.33 0.81 0.94 1.03 1.07

You can select columns based on index/location.

Labs_SCr_df %>% select(1) %>% head()  #select column 1

##   patient_id
## 1      pt_01
## 2      pt_02
## 3      pt_03
## 4      pt_04
## 5      pt_05
## 6      pt_06

Labs_SCr_df %>% select(1,2,3) %>% head()  #select columns 1, 2, 3

##   patient_id ethnicity Day1
## 1      pt_01         1 1.48
## 2      pt_02         1 1.11
## 3      pt_03         1 0.96
## 4      pt_04         1 1.39
## 5      pt_05         1 0.68
## 6      pt_06         1 1.11

Labs_SCr_df %>% select(3:12) %>% head()  #select columns 3 through 12

##   Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1 1.48 0.84 0.89 0.90 1.43 1.18 0.79 1.10 1.08  0.88
## 2 1.11 0.88 1.27 0.83 0.98 0.86 0.83 1.17 0.98  0.66
## 3 0.96 0.62 0.77 1.15 1.27 0.67 0.74 0.77 0.28  0.93
## 4 1.39 0.87 0.92 1.05 0.86 1.39 0.85 1.19 1.51  0.97
## 5 0.68 1.01 1.06 0.60 1.25 0.97 0.83 1.07 0.69  1.39
## 6 1.11 1.02 1.43 0.74 1.33 0.81 0.94 1.03 1.07  1.54

You can also select columns based on names. starts_with() function will return columns which the string you provide is at the beginning of a column name. ends_with() will return columns which the string you provide is at the end of a column name.

Labs_SCr_df %>% select(starts_with('patient'), Day1) %>% head()

##   patient_id Day1
## 1      pt_01 1.48
## 2      pt_02 1.11
## 3      pt_03 0.96
## 4      pt_04 1.39
## 5      pt_05 0.68
## 6      pt_06 1.11

Filter

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for the condition. We can put multiple logical conditions inside a single filter() function. You can comma separate conditions, which is equivalent to an “AND”, or you can use the compound logical operators “AND” (&) and “OR” (|) to construct more complex statements.

Labs_SCr_df %>% filter(patient_id=='pt_05') #filter patient with ID pt_05

##   patient_id ethnicity Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1      pt_05         1 0.68 1.01 1.06  0.6 1.25 0.97 0.83 1.07 0.69  1.39

Labs_SCr_df %>% filter(Day1 > 1.5) %>% head() #filter  observations with SCr > 1.5 on Day1

##  [1] patient_id ethnicity  Day1       Day2       Day3       Day4      
##  [7] Day5       Day6       Day7       Day8       Day9       Day10     
## <0 rows> (or 0-length row.names)

Labs_SCr_df %>% filter(Day1 >1 & Day2>1) %>% dim() #filter observations with SCr > 1 on Day1 AND Day2

## [1] 24 12

Labs_SCr_df %>% filter(Day1>1 | Day2 >1) %>% dim() #filter observations with SCr > 1 on Day1 OR Day2

## [1] 72 12

Labs_SCr_df %>% filter((Day1 >1 & Day2 >1)| Day3 >1.5) %>% dim() #filter observations with SCr > 1 on Day1 and Day2 OR SCr > 1.5 on Day3

## [1] 26 12

Labs_SCr_df %>% filter(!is.na(Day1)) %>% dim() #filter observations with is NOT Day1

## [1] 87 12

Summarise & Mutate

The mutate() function is used to to modify data sets. The mutate() function takes any number of expressions and return new values of the same length.

Labs_SCr_df %>% group_by(patient_id) %>% mutate(SCrMean=mean(c(Day1, Day2, Day3, Day4, Day5, Day6, Day7, Day8, Day9, Day10),na.rm=TRUE)) %>% head() #calculate mean of SCr across observations per patient

## # A tibble: 6 × 13
## # Groups:   patient_id [6]
##   patient_id ethnicity  Day1  Day2  Day3  Day4  Day5  Day6  Day7  Day8  Day9
##   <chr>          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pt_01              1  1.48  0.84  0.89  0.9   1.43  1.18  0.79  1.1   1.08
## 2 pt_02              1  1.11  0.88  1.27  0.83  0.98  0.86  0.83  1.17  0.98
## 3 pt_03              1  0.96  0.62  0.77  1.15  1.27  0.67  0.74  0.77  0.28
## 4 pt_04              1  1.39  0.87  0.92  1.05  0.86  1.39  0.85  1.19  1.51
## 5 pt_05              1  0.68  1.01  1.06  0.6   1.25  0.97  0.83  1.07  0.69
## 6 pt_06              1  1.11  1.02  1.43  0.74  1.33  0.81  0.94  1.03  1.07
## # ℹ 2 more variables: Day10 <dbl>, SCrMean <dbl>

The summarise() function is used to aggregate data and return one row or one row per group. To use summarise provide expressions that result in a single value and provide multiple expressions in a comma separated fashion.

Labs_SCr_df %>% summarise(Day1Mean=mean(Day1,na.rm=TRUE),Day2Mean=mean(Day2,na.rm=TRUE)) #calculate mean SCr on Day1 and Day2.

##   Day1Mean Day2Mean
## 1 1.016207   0.9748

Labs_SCr_df %>% group_by(patient_id) %>% summarise(SCrMean=mean(c(Day1, Day2, Day3, Day4, Day5, Day6, Day7, Day8, Day9, Day10),na.rm=TRUE)) %>% head() #calculate mean of SCr across observations per patient

## # A tibble: 6 × 2
##   patient_id SCrMean
##   <chr>        <dbl>
## 1 pt_01        1.06 
## 2 pt_02        0.957
## 3 pt_03        0.816
## 4 pt_04        1.1  
## 5 pt_05        0.955
## 6 pt_06        1.10

Labs_SCr_df_long %>% group_by(patient_id) %>% summarise(SCrMean=mean(SCr,na.rm=TRUE)) %>% head()  #calculate mean of SCr across observations per patient

## # A tibble: 6 × 2
##   patient_id SCrMean
##   <chr>        <dbl>
## 1 pt_01        1.06 
## 2 pt_02        0.957
## 3 pt_03        0.816
## 4 pt_04        1.1  
## 5 pt_05        0.955
## 6 pt_06        1.10

Labs_SCr_df_long %>% group_by(ethnicity) %>% summarise(SCrMean=mean(SCr,na.rm=TRUE)) %>% head()  #calculate mean of SCr across observations per ethnicity

## # A tibble: 3 × 2
##   ethnicity SCrMean
##       <dbl>   <dbl>
## 1         1   1.01 
## 2         2   0.991
## 3         3   0.974

Factors in R

Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. Factors play very important role in data modeling because categorical variables are handled differently compared to continuous variables.

Depending on the function used to import data into R environment, factors may or may not be created automatically. Always, make sure to investigate your data before making any assumptions. The ethnicity field is coded as a numeric variable.

class(Labs_SCr_df$ethnicity)

## [1] "numeric"

We can convert as numeric variable to a factor using the as.factor() function.

Labs_SCr_df$ethnicity <- as.factor(Labs_SCr_df$ethnicity)
class(Labs_SCr_df$ethnicity)

## [1] "factor"

Labs_SCr_df$ethnicity

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3

The ethnicity factor has levels (1,2, and 3) associated with it. You can assign new levels using the levels function.

levels(Labs_SCr_df$ethnicity) <- c('White','Black','Other') #assign new levels
Labs_SCr_df$ethnicity

##   [1] White White White White White White White White White White White White
##  [13] White White White White White White White White White White White White
##  [25] White White White White White White White White Black Black Black Black
##  [37] Black Black Black Black Black Black Black Black Black Black Black Black
##  [49] Black Black Black Black Black Black Black Black Black Black Black Black
##  [61] Black Black Black Black Black Black Other Other Other Other Other Other
##  [73] Other Other Other Other Other Other Other Other Other Other Other Other
##  [85] Other Other Other Other Other Other Other Other Other Other Other Other
##  [97] Other Other Other Other
## Levels: White Black Other

As you will learn later this quarter, the reference level in a factor variable is important in the interpreting statistical models. The first level in a factor variable is usually the default reference level. You can change that using the relevel() function.

Labs_SCr_df$ethnicity <- relevel(Labs_SCr_df$ethnicity,ref = 'Black') #change reference to Black
Labs_SCr_df$ethnicity

##   [1] White White White White White White White White White White White White
##  [13] White White White White White White White White White White White White
##  [25] White White White White White White White White Black Black Black Black
##  [37] Black Black Black Black Black Black Black Black Black Black Black Black
##  [49] Black Black Black Black Black Black Black Black Black Black Black Black
##  [61] Black Black Black Black Black Black Other Other Other Other Other Other
##  [73] Other Other Other Other Other Other Other Other Other Other Other Other
##  [85] Other Other Other Other Other Other Other Other Other Other Other Other
##  [97] Other Other Other Other
## Levels: Black White Other