The Data Scientist’s Workflow

This reading is provided as part of the Introduction to Clinical Data Science, the first of six courses in the Coursera Clinical Data Science Specialization created by the University of Colorado Anschutz Medical Campus and supported by our industry partner Google Cloud.

Loading Packages

Even before we can get data into R we need to tell R that we want to use libraries of code or “packages”. The tidyverse is a package of code.

Installing Packages

The first time you want to use a package you must install it. We’ve already installed most of the packages you need, but in case you want to try this on your own computer, to install the tidyverse you would run:

install.packages("tidyverse", dependencies = TRUE)

Making Packages Available

Once all the packages you want are installed in each R script you write, you will want to call the library() command to make the code available. For this project we are going to want the tidyverse, a package call magrittr, and bigrquery.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## Warning: package 'ggplot2' was built under R version 4.1.3

## Warning: package 'tibble' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'stringr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

library(magrittr)

## Warning: package 'magrittr' was built under R version 4.1.3

library(bigrquery)

## Warning: package 'bigrquery' was built under R version 4.1.3

Getting Data In/Out

Depending on the form your data take (database, csv file, excel file, etc.) you will use different R packages to load data. We are going to focus on the two most relevant packages

readr - tidyverse packaage that reads text files - files that end in .txt or .csv.
bigrquery - package used to access Google BigQuery tables from R.

readr

readr is fairly straightforward to use - simply provide a data frame you want your file to be loaded into and select the appropriate command:

test_csv_data <- read_csv(file = "myfilename.csv")
test_txt_data <- read_delim(file = "myfilename.txt", delim = "\t")

In this example code the first command will read a csv file named “myfilename.csv” into the data frame test_csv_data. The code in the the second command reads the text file “myfilename.txt”, tells R that a tab ( is used to separate each column, and reads that data into test_txt_data.

You export data in a similar way:

write_csv(x = mydataframe, path = "mynewfile.csv")
write_delim(x = mydataframe, path = "mynewfile.txt", delim = "\t")

In this example code, the first line writes a new csv file in the current directory with the data stored in mydataframe. The second command creates a tab-delimited file with the same content.

For more information on downloading file data, check out:

“Data Import” section of R for Data Science.
readr Tidyverse Page - reading csv and other text files
readxl Tidyverse Page - reading Excel spreadsheets
haven Tidyverse Page - reading data from SAS, SPSS, and Stata

bigrquery

bigrquery requires a bit more setup that reading in text files.

First you have to set up your connection details. This involves telling bigrquery the name of the billing project where your data is stored. For this course you will use the “learnclinicaldatascience” project. Remember that all the queries of data in this project is completely free to you thanks to our industry partner Google Cloud!

con <- DBI::dbConnect(drv = bigquery(),
                      project = "learnclinicaldatascience")

Using the dplyr package (which we cover in more detail in the “Tidying and Manipulatin Data” section), we can connect an R dataframe to any table in the project. Let’s try connecting the ADMISSIONS table:

admissions <- tbl(con, "mimic3_demo.ADMISSIONS")

## ! Using an auto-discovered, cached token.

##   To suppress this message, modify your code or options to clearly consent to
##   the use of a cached token.

##   See gargle's "Non-interactive auth" vignette for more details:

##   <https://gargle.r-lib.org/articles/non-interactive-auth.html>

## i The bigrquery package is using a cached token for 'mrxmf6@gmail.com'.

admissions

The first time you run this command, in the Console R will ask you if you want to

Use a local file ('.httr-oauth'), to cache OAuth access credentials between R sessions?
1: Yes
2: No

You should select 1 - this will make it easier to connect in the future. After entering 1 into the Console a window will pop up asking you to authenticate with Google. Use the google account you used for the tech-registration. This will cause Google to return a long key phrase that you can copy and paste into the console. Because you saved the credentials to a file, this should be the only time you have to go through this process. The one exception is when you work with RProjects. You will need to go through this process in each project.

Here is the code you can just copy/paste into your future R scripts:

library(bigrquery)
con <- DBI::dbConnect(drv = bigquery(),
                      project = "learnclinicaldatascience")
admissions <- tbl(src = con, "mimic3_demo.ADMISSIONS")

Try it out for yourself:

Connect to the PATIENTS table in mimic3_demo.

For more information on connecting to databases, check out:

Databases Using R - An RStudio help page for connecting to and querying databases
DBI GitHub Page - The Github page for the DBI package that you can use to connect to many types of databases
bigrquery GitHub Page - The Github page page for the database connections to Google BigQuery.

Tidying and Manipulating Data

There are three primary tidyverse packages you will use to tidy and manipulate data:

magrittr
tidyr
dplyr

There are also a handful of helpful packages for dealing with specific types of data.

lubridate - wrangles date-time data
stringr - wrangle string data

We will cover lubridate and stringr in more detail in later courses of the Clinical Data Science Specialization.

magrittr

The magrittr package provides a couple of helpful tools for clinical data science. This package introduced the concept of the pipe, a function that takes data from the left and passes it to functions on the right. There are two types of pipes you’ll see us use in the Specialization:

%>% - This is the basic pipe that takes data on the left and passes it to functions on the right.
%<>% - This pipe takes the data from the left, passes it through the function/s on the right and returns the result back into the original data frame. Caution! This pipe replaces the original data, use carefully!

These pipes allow you to actually create an analytic pipeline or chain, where you take the data and connect multiple functions in sequence to perform a particular analysis. You’ll see more in the dplyr and tidyr sections of this guide!

For more information, check out:

Pipes in the R for Data Science book
magrittr Tidyverse Documention.

dplyr

The dplyr package is the real workhorse of the tidyverse packages. It is the primary tool you will use for data munging. It actually behaves very similarly to SQL and in fact can connect to databases and actually write SQL code for you! Let’s go through the same functions that we did in SQL to see how you perform these tasks in R.

Let’s load a couple of tables to use as examples for the following tasks:

admissions <- tbl(con, "mimic3_demo.ADMISSIONS")
patients <- tbl(con, "mimic3_demo.PATIENTS")

head(admissions)

head(patients)

Selecting Variables (Columns)

You can use the select() command to select individuals columns. So if we wanted to select the SUBJECT_ID of the patients table we can run the following code:

patients %>% 
  select(SUBJECT_ID)

Try it out for yourself:

Select the date of birth column from patients.

For more information, check out:

Select Columns with select()in the R for Data Science book

Selecting Records (Rows)

We can use the filter() command to select rows that meet a particular condition. For instance if we want to only select records from patients who are female we can run the following code:

patients %>% 
  filter(GENDER=="F")

Try it out for yourself:

Filter the admissions table to include only those with an ADMISSION_TYPE of “EMERGENCY”.

For more information, check out:

Filter Rows with filter() in the R for Data Science book

Selecting Variables (Columns) and Records (Rows)

You can combine these values by using another pipe

patients %>% 
  filter(GENDER=="F") %>% 
  select(SUBJECT_ID)

Renaming Columns

Just like with SQL we can rename columns if we want to make it more clear what these two columns represent. We can use the dplyr function rename().

patients %>% 
  rename(patients_ROW_ID = ROW_ID)

Try it out for yourself:

Rename the EXPIRE_FLAG column in patients to DEAD

Creating New Variables (Columns)

A lot of times in clinical data science it is helpful to create new variables based on some data in the table. This functionality is support by the mutate() command in dplyr.

Let’s say that we want to make a new variable in the patients table called EXPIRE_HOSP that it is 1 if the date of death happened at the hospital (e.g., there is a date in the DOD_HOSP column), and a 0 if they did not have an entry in DOD_HOSP.

patients %>% 
  mutate(EXPIRE_HOSP = case_when(is.na(DOD_HOSP) ~ 0,
                                 TRUE ~ 1))

In this code mutate() makes the new column EXPIRE_HOSP. Then case_when() is used to set up the conditional statement. The function is.na() tells us whether DOD_HOSP is NA. If so, then we make EXPIRE_HOSP equal to 0. If not then we make EXPIRE_HOSP equal to 1.

Try it out for yourself:

Create a new column in the admissions table EMERGENCY that is 1 if the ADMISSION_TYPE is EMERGENCY and 0 if it is not.

For more information, check out:

Joining Tables

dplyr has the same types of joins as SQL, but it helpfully will automatically figure out which columns are common across the two data sets. To run an inner join we can run:

patients %>%
  inner_join(admissions)

## Joining, by = c("ROW_ID", "SUBJECT_ID")

Notice that dplyr recognized that the tables have two variables in common:

ROW_ID
SUBJECT_ID

In fact we know that ROW_ID is unique to each table in MIMIC, so we need to adjust the code to only join on the SUBJECT_ID column:

patients %>% 
  inner_join(admissions, 
             by = c("SUBJECT_ID"="SUBJECT_ID"),
             suffix = c("_p","_a"))

We also added the suffix argument that takes the duplicate ROW_ID columns from patients and admissions as turns them into:

ROW_ID_p - the ROW_ID from the patients table
ROW_ID_a - the ROW_ID from the admissions table

In addition to inner_join, dplyr also has the following joins:

left_join() - keeps all rows from the first table in the join
right_join() - keeps all rows from the second table in the join

Try it out for yourself:

Create a new dataframe called icustays that connects to the ICUSTAYS table then write an inner join between patients and icustays.

For more information, check out:

Relational Data in the R for Data Science book

Aggregating Data

dplyr has a few functions that can be used to aggregate data. The first is the group_by() function. It works just like group by in SQL.

Let’s say that we want to identify how many encounters each patient has had. We can group_by(SUBJECT_ID) to tell dplyr to consider all rows with the same SUBJECT_ID together. We can then use the summarise() function to apply a function like n_distinct() which counts all the unique instances of a variable.

admissions %>% 
  group_by(SUBJECT_ID) %>% 
  summarise(count_admissions = n_distinct(HADM_ID))

Should need to apply a different grouping to the same data in the future you can use ungroup() to remove the first grouping.

Try it out for yourself:

How many admissions had an admissions type of EMERGENCY?

For more information, check out:

Sorting/Ordering Data

In the aggregating data example we may have wanted to see who had the most admissions and wanted to sort descending by count. dplyr supports reordering rows with the arrange() command. To get the reverse or descending support with the desc() command.

admissions %>% 
  group_by(SUBJECT_ID) %>% 
  summarise(count_admissions = n_distinct(HADM_ID)) %>% 
  arrange(desc(count_admissions))

Try it out for yourself:

Using the icustays dataframe you created, what is the maximum number of ICU stays for any individual patient?
Using the icustays dataframe you created, what is the maximum number of ICU stays for any individual hospital stay?

For more information, check out:

Arrange Rows with arrange() in the R for Data Science book

tidyr

The tidyr package is a companion to dplyr. It is used to help reshape and tidy data sets. There are a few helpful functions you should know:

Joining and Separating Columns

Let’s say that we wanted to combine the SUBJECT_ID and HADM_ID variables into a single variable. We can use the unite() function.

Unfortunately tidyr functions do not work on external databases connected to R like we’ve been using. We can use the collect() function to load the data that’s in the admissions table directly to R. If you ever get odd errors from particular dplyr or tidyr functions, try adding this step to your chain.

admissions %>% 
  collect() %>% 
  unite(col = "SUBJECT_HADM_ID", c("SUBJECT_ID", "HADM_ID"), sep = "_", remove = TRUE)

After we get the data locally we can apply the tidyr unite() function. In this function col defines what the new, combined, column should be called. The c() argument lists which columns should be combined. sep defines what value should separate the two columns, and remove = TRUE says to remove the original columns from the data frame. If you wanted to keep them, just switch it to remove = FALSE.

If you wanted to undo this process and separate the column back into two variables you can use separate().

admissions %>%
  collect() %>% 
  unite(col = "SUBJECT_HADM_ID", c("SUBJECT_ID", "HADM_ID"), sep = "_", remove = TRUE) %>% 
  separate(col = SUBJECT_HADM_ID, into = c("SUBJECT_ID", "HADM_ID"), sep = "_", remove = TRUE)

Try it out for yourself:

Try joining the ADMISSION_TYPE and ADMISSION_LOCATION columns of admissions into a column called ADMISSION_DETAILS. How many unique values are inADMISSION_DETAILS`?

For more information, check out:

Separating and Uniting in the R for Data Science book

Converting Wide Data to Long Data and Back

The MIMIC-III data is actually already fairly tidy data, so we don’t typically need to use these tidyr functions. As such this section is outside the scope of this specialization. However in the real world, most of your data will not be so well currated. I recommended reading the Spreading and Gathering section of the R for Data Science book to learn more about these increadibly useful functions.

Analyzing Data

There are a number of packages in R for performing statistical and machine learning analyses. For the most part in this specialization instead of using the most advanced or complicated analysis tools we instead focus on simple methods that help you understand the impact either 1) the method of clinical data generation, or 2) the clinical implementation goal, has on impact on what analytics you choose to run.

While we believe that a strong foundation in statistics and computer science is important for success as a clinical data scientist, there are a lot of educational resources that will teach that content far more meaningfully than we could. As such we will be teaching you just the analytic skills you need within each of the subsequent courses in the specialization. For example Course 3 and 4 (“Identifying Patient Populations” and “Clinical Natural Language Processing”) both use 2x2 table and statistics like sensitivity and specificity to assess algorithm performance.

Visualizing Data

Somewhat similarly to the analyzing data section, we feel that there are a number of online resources for learning data visualization techniques. However we do want to introduce you to ggplot2 - a tidyverse package for plotting data. This is just a basic introduction - enough to get you started with the package!

Let’s plot the demographic features of the cohort using the PATIENTS table

patients %>% 
  ggplot() +
    geom_bar(aes(x = GENDER))

ggplot2 works by calling the ggplot() command and then adding different visualization layers. In this case we are using geom_bar() to make a bar plot. For each geom you have to pass information about you data. Bar plots require some variable for the x axis and then it counts the number of occurrences of the values of that variable to create the y axis. aes() is the function that communicates to ggplot what data mappings we want to create, in this case we are plotting GENDER. Notice that ggplot uses the same type of piping logic we’ve been using in the tidyverse with magrittr but instead of %>%, ggplot uses the +.

Let’s try a different table and look at admissions. I’m interested in where patients were discharged.

admissions %>% 
  ggplot() +
    geom_bar(aes(x = DISCHARGE_LOCATION))

Obviously we can’t read the discharge location - we can adjust the angle of that label to make it easier to read. We can use the theme() function to adjust these visual details.

admissions %>% 
  ggplot() +
    geom_bar(aes(x = DISCHARGE_LOCATION)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now let’s see what impact admission type had on the discharge location. We can apply a color fill to this variable in the aes() command.

admissions %>% 
  ggplot() +
    geom_bar(aes(x = DISCHARGE_LOCATION, fill = ADMISSION_TYPE)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

This let’s us see that the vast majority of admissions were EMERGENCY admissions. It can be hard to compare groups when the color bars are stacked together. We can change the position argument in the geom_bar() function to make them appear side-by-side instead.

admissions %>% 
  ggplot() +
    geom_bar(aes(x = DISCHARGE_LOCATION, fill = ADMISSION_TYPE), position=position_dodge()) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

This plot let’s us see that for those patients all patients who died during the hospital stay were an EMERGENCY or URGENT visit.

Try it for yourself:

Create a bar plot of ADMISSION_LOCATION that has a fill by ADMISSION_TYPE.

For more information, check out:

Data Visualization section of the R for Data Science book
ggplot2 Tidyverse Page
Grammar of Graphics book the underlying philosophy of the ggplot2 package
Garrett Grolemund ggplot2 tutorial a nice introduction to understanding how ggplot graphs are constructed
Graphics for Communication from the R for Data Science book

Reporting on Data

Please see the reading on using RMarkdown on Coursera.

For more information, check out:

RMarkdown section of the R for Data Science book
RMarkdown RStudio Page
RMarkdown Book by Yihui Xie, J.J. Allaire, and Garrett Grolemund

Answers to Questions

Q1 What code did you write to connect to the PATIENTS table in mimic3_demo?

con <- DBI::dbConnect(drv = bigquery(),
                      project = "learnclinicaldatascience")
patients <- tbl(con, "mimic3_demo.PATIENTS")

Q2 What code did you write to select the date of birth column from the patients table?

head(patients)   #look at the patients table

colnames(patients)  #get the column names for the patients table

## [1] "ROW_ID"      "SUBJECT_ID"  "GENDER"      "DOB"         "DOD"        
## [6] "DOD_HOSP"    "DOD_SSN"     "EXPIRE_FLAG"

# the birth column is DOB. lets select it

patients %>% 
  select(DOB)

Q3 What code did you write to filter the admissions table to include only those with an ADMISSION_TYPE of “EMERGENCY”?

admissions %>%
   filter(ADMISSION_TYPE == "EMERGENCY")

Question 4 What code did you write to rename the EXPIRE_FLAG column in patients to DEAD?

patients %>%
  rename(DEAD = EXPIRE_FLAG)

Question 5 What code did you write to create a new column in the admissions table EMERGENCY that is 1 if the ADMISSION_TYPE is EMERGENCY?

admissions %>% 
  mutate(EMERGENCY = case_when(ADMISSION_TYPE== "EMERGENCY" ~ 1,
                                 TRUE ~ 0))

Question 6 What code you write to create a new dataframe called icustays that connects to the ICUSTAYS table and create an inner join between patients and icustays?

icustays <- tbl(con, "mimic3_demo.ICUSTAYS")

patients %>% 
  inner_join(icustays, by = c("SUBJECT_ID"="SUBJECT_ID"))

icustays <- tbl(con, "mimic3_demo.ICUSTAYS")
patients %>% 
  inner_join(icustays, 
             by = c("SUBJECT_ID" = "SUBJECT_ID"), 
             suffix = c("_p","_i"))

How many admissions had an admissions type of EMERGENCY?

admissions %>% 
  group_by(ADMISSION_TYPE) %>% 
  summarise(count_admissions = n())

admissions %>% 
  group_by(ADMISSION_TYPE) %>% 
  summarise(count_admissions = n_distinct(HADM_ID))

Question 8 Using the icustays dataframe you created, what is the maximum number of ICU stays for any individual patient?

head(icustays)

icustays %>% 
  group_by(SUBJECT_ID) %>% 
  summarise(stay = n_distinct(ICUSTAY_ID)) %>% 
  arrange(desc(stay))

Q9 Using the icustays dataframe you created, what is the maximum number of ICU stays for any individual hospital stay?

icustays %>% 
  group_by(HADM_ID) %>% 
  summarise(count_icu = n_distinct(ICUSTAY_ID)) %>% 
  arrange(desc(count_icu))

admissions

Q11What code did you write to create a bar plot of ADMISSION_LOCATION that has a fill by ADMISSION_TYPE?

admissions %>% 
  ggplot() +
    geom_bar(aes(x = ADMISSION_LOCATION, fill = ADMISSION_TYPE)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Write R code to create a bar plot of FIRST_CAREUNIT from the ICUSTAYS table.

icustays

icustays %>% 
  ggplot() +
    geom_bar(aes(x = FIRST_CAREUNIT)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

How many ICU stays started in the MICU?

icustays %>% 
  group_by(FIRST_CAREUNIT) %>% 
  summarise(n = n())

The Data Scientist’s Workflow

Laura K. Wiley, PhD and Michael Frimpong

Loading Packages

Installing Packages

Making Packages Available

Getting Data In/Out

readr

For more information on downloading file data, check out:

bigrquery

Try it out for yourself:

For more information on connecting to databases, check out:

Tidying and Manipulating Data

magrittr

For more information, check out:

dplyr

Selecting Variables (Columns)

Try it out for yourself:

For more information, check out:

Selecting Records (Rows)

Try it out for yourself:

For more information, check out:

Selecting Variables (Columns) and Records (Rows)

Renaming Columns

Try it out for yourself:

Creating New Variables (Columns)

Try it out for yourself:

For more information, check out:

Joining Tables

Try it out for yourself:

For more information, check out:

Aggregating Data

Try it out for yourself:

For more information, check out:

Sorting/Ordering Data

Try it out for yourself:

For more information, check out:

tidyr

Joining and Separating Columns

Try it out for yourself:

For more information, check out:

Converting Wide Data to Long Data and Back

Analyzing Data

Visualizing Data

Try it for yourself:

For more information, check out:

Reporting on Data

For more information, check out:

Answers to Questions