1 Random Helpful Stuff

1.1 Clearing the Global Environment

  • All variables:

    remove(list = ls())
  • One Variable

    remove(list = VariableName)

1.2 The Help Command

  • The help command brings up useful documentation about both base R functions and even some R packages (if the documentation exists in the frist place)
    • Syntax:

      ?FunctionName

\(~\)

\(~\)

2 Creating a README

2.1 Install knitr

  • knitr allows R code to be embedded and actually work when used in with markup languages developed for report/docuemnt generation

  • Specifically, we will use knitr to run embedded R code in R Markdown documents

    Code:

    install.packages("knitr")

2.2 Create a README

  • Good guidelines for research/publishing READMEs: https://guides.lib.uci.edu/datamanagement/readme

  • Copy and paste the text below into the “body” of your README:

    ### MEEI Clinic X 2018 Study 
    
    This project analyzes a dataset of patient visits to Clinic X at Mass Eye and Ear during 2018. During 2018, the MBTA was affected by a flood that stopped all public transit in the Boston area from May 1st, 2018, to July 31st. The purpose of this study is to investigate the impact of the public transport closure on the attendance of patients at the clinic during as well as after the period of closure. 
    
    ### General Information / Data Acquisition 
    
    ### Dataset Information
    
    1. Patient visit data, with each row containing:
        + First Name 
        + Last Name
        + Age
        + Date of the visit
        + MRN
        + Gender
        + Race
        + Visit status 
            + Categorized by 3 options: Cancel, Complete, or No Show
        + Date the visit was scheduled 
        + Date the visit was canceled, if applicable 
        + Diagnosis (if completed visit)
            + ICD-10 code

\(~\)

\(~\)

3 Subscripts

Long projects (such as ours) are often broken into multiple scripts. This has a few benefits, such as better navigation/organization of the program. It can also make it easier to trouble shoot when things go wrong. It is often harder to find problems

3.1 Test Function

  • Below is the code for a simple function to use in your TestSubScript.r script

    Copy the code below:

    # This test function will average two numbers and return the answer to the user
    TestFunction = function(x, y){
      # this function averages x and y
      xy_average = (x + y)/2
    
      # return the value 
      return(xy_average)
    }

3.2 Importing subscripts

  • To import subscripts into your main script, use the source() command.

    Basic Syntax:

    source("filepath")

    Full code (if needed):

    source("Scripts/Subscripts/TestSubScript.r")

\(~\)

\(~\)

4 Basic Data Cleaning

4.1 Rename columns

  • To rename columns of a dataframe, we can use the pipe %>% and the rename function rename().

    Basic Syntax:

    df = df %>%
      rename(NewColName = OldColName)

    Class example:

    df = df %>%
      rename(FirstName = FIRSTNAME)

    Full Code:

    df = df %>%
      rename(
        FirstName = FIRSTNAME,
        LastName = LASTNAME,
        Age = AGE,
        Gender = GENDER, 
        Race = RACE,
        Provider = PROVIDER,
        ProviderName = PROVIDERNAME,
        Department = DEPARTMENT,
        DepartmentName = DEPARTMENTNAME,
        SchedulingStatus = SCHEDULINGSTATUS,
        PatientID = PATIENTID,
        ApptMadeDate = APPTMADEDATE,
        ApptCancelDate = APPTCANCELDATE,
        DiagnosisCode = DIAGNOSIS
      )

    Full Function (if needed):

    # rename columns from Screaming Case to Camel Case
    rename_columns = function(df){
    # rename columns to CamelCase
    df = df %>%
      rename(
        FirstName = FIRSTNAME,
        LastName = LASTNAME,
        Age = AGE,
        Gender = GENDER, 
        Race = RACE,
        Provider = PROVIDER,
        ProviderName = PROVIDERNAME,
        Department = DEPARTMENT,
        DepartmentName = DEPARTMENTNAME,
        SchedulingStatus = SCHEDULINGSTATUS,
        PatientID = PATIENTID,
        ApptMadeDate = APPTMADEDATE,
        ApptCancelDate = APPTCANCELDATE,
        DiagnosisCode = DIAGNOSIS
      )
    
      # return the updated dataframe
      return(df)
    }

\(~\)

4.2 Fix the Dates

When working with dates in a program, it is important to ensure that all dates are uniform. There are a few “systems” that can be used for dates in R, but I prefer lubridate, as it is technically part of tidyverse. But, our problems go beyond just ensuring all dates can be handled the same way. First, we have to fix the excel numeric serial date issue in the ApptCancelDate column. To do this, we will use janitor and the excel_numeric_to_date() function.

  • Janitor

    • Janitor has a lot of ‘everyday’ cleaning functions

    • Install:

      install.packages("janitor")
    • Syntax:

      new_date = excel_numeric_to_date(as.numeric(old_date))
    • Full code (if needed):

      # fix excel serial date issue 
      df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))
  • Lubridate

    • Lubridate is my favorite date handling system. It has a lot of useful functions, mainly ymd(). We will take about this all later, but for now lets just convert all the dates to ymd() format

    • Install:

      install.packages(lubridate)
    • Syntax:

      ymd_date = ymd(old_date)
  • Full Function (if needed):

    # Function for fixing the dates
    fix_the_dates = function(df){
      # fix excel serial date issue 
      df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))
    
      # convert everything to ymd()
      df$DOS = ymd(df$DOS)
      df$ApptMadeDate = ymd(df$ApptMadeDate)
      df$ApptCancelDate = ymd(df$ApptCancelDate)
    
      # return the updated dataframe
      return(df)
    }

\(~\)

4.3 Checkpoint

At this point, your RProject should have a main.r script, as well as a DataCleaningScript.r located in the subscripts folder. Make sure that the body of both scripts matches those outlined below. If needed, copy and paste the contents to ensure that they work properly.

  • main.r

    #### Main Script ####
    ######## Import Statements ########
    library(tidyverse)
    library(readxl)
    
    
    ### Source subscripts ####
    source("Scripts/Subscripts/DataCleaningScript.r")
    
    
    ######## Import Data ########
    # file path to raw data
    raw_data_path = "Data/ClinicX2018.xlsx"
    
    # raw data sheet in excel spreadsheed
    raw_data_sheet = "VISITS"
    
    # import raw data into a dataframe
    raw_data = read_excel(raw_data_path, sheet = raw_data_sheet)
    
    
    
    ######## Basic Data Cleaning ########
    # functions are imported from data cleaning script 
    # make a copy of raw_data to work with
    visit_data = raw_data
    
    # clean the data
    visit_data = clean_the_data(visit_data)
  • DataCleaningScript.r

    # Data Cleaning Subscript
    # This script contains the functions needed for cleaning the data 
    # these functions will be loaded into the [main.r] script
    
    ######## Import Statements ########
    library(janitor)
    library(lubridate)
    
    
    ######## Functions #########
    # rename columns from Screaming Case to Camel Case
    rename_columns = function(df){
      # rename columns to CamelCase
      df = df %>%
        rename(
          FirstName = FIRSTNAME,
          LastName = LASTNAME,
          Age = AGE,
          Gender = GENDER, 
          Race = RACE,
          Provider = PROVIDER,
          ProviderName = PROVIDERNAME,
          Department = DEPARTMENT,
          DepartmentName = DEPARTMENTNAME,
          SchedulingStatus = SCHEDULINGSTATUS,
          PatientID = PATIENTID,
          ApptMadeDate = APPTMADEDATE,
          ApptCancelDate = APPTCANCELDATE,
          DiagnosisCode = DIAGNOSIS
        )
    
      # return the updated dataframe
      return(df)
    }
    
    
    # Function for fixing the dates
    fix_the_dates = function(df){
      # fix excel serial date issue 
      df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))
    
      # convert everything to ymd()
      df$DOS = ymd(df$DOS)
      df$ApptMadeDate = ymd(df$ApptMadeDate)
      df$ApptCancelDate = ymd(df$ApptCancelDate)
    
      # return the updated dataframe
      return(df)
    }
    
    
    # combine all functions into one function
    clean_the_data = function(df){
      # remove any duplicate rows
      df = distinct(df)
    
      # rename columns 
      df = rename_columns(df)
    
      # fix the dates
      df = fix_the_dates(df)
    
      # return the clean dataframe
      return(df)
    }

\(~\)

\(~\)

5 Data Processing

5.1 Adding Gender Codes

5.1.1 Adding a column to a dataframe

  • In most programming languages you have to add an entire column at once. This becomes a little tricky when we are trying to row by row to determine the right gender code to add, but there is a common work around that’s common in programming. You first create a vector of dummy data, then you “correct” each value row by row

  • Creating a Dummy Vector

    • Use the rep() function to create a variable of repeated values of a certain length.

    • Basic Syntax:

      rep(value, length)
    • Full code (if needed):

      rep(-99, nrow(df))
  • Adding the Dummy Vector as a Column

    • Use the %>% pipe and the add_column() function to add the Male vector as a column to the dataframe

    • Basic Syntax:

      df = df %>%
        add_column(VectorName, 
          .after = Index)
    • Full code (if needed):

      df = df %>%
         add_column(Male, 
         .after)

5.1.2 Indexing with a for loop

  • The slide decks from class have the visual description of how these commands work, which are probably the most helpful. Here, I will breifly describe the functions.
  • Row by row:
    • Basic Syntax

      current_row = df[i, ]
  • Element by element
    • This form is used to go row by row through the elements of a single column

    • Basic Syntax:

      current_value = df[i, ]$ColumnName
    • Full code (if needed):

      df[i, ]$Male

5.1.3 Nested If/Else in a for loop

  • For each value of Gender, we want to check if its Male or Female and assign a number to another column accordingly. To do this, we use an If/Else statement and the indexing conventions described above

  • Full code(if needed):

    # indexing For Loop 
    for (i in 1:nrow(df)){
      if (df[i, ]$Gender == "Male"){
        df[i, ]$Male = 1
      } else {
        df[i, ]$Male = 0
      }
    }

5.1.4 Checkpoint

  • By now, you should have a working function that adds a numerical code to each row based on the value in “Gender”. Make sure your code matches and functions like the function below, and correct it if necessary
    • add_gender_codes()

      # Function for adding a numerical representation for gender
      add_gender_codes = function(df){
        # create a vector of dummy values 
        Male = rep(-99, nrow(df))
        #browser()
      
        # add column after Gender column
        df = df %>%
          add_column(
            Male,
            .after = grep("Gender", colnames(df)))
      
        # indexing For Loop 
        for (i in 1:nrow(df)){
          if (df[i, ]$Gender == "Male"){
            df[i, ]$Male = 1
          } else {
            df[i, ]$Male = 0
          }
        }
      
       return(df)
      }

\(~\)

5.2 Adding a Categorical variable for race

  • Because we have more than two options for race (as opposed to just Male/Female for gender), we will need more than 0 and 1 to represent them. Let’s use the following code:
Race Code
White or Caucasian 0
Asian 1
Black or African American 2
Hispanic or Latino 3
American Indian or Alaska Native 4
Native Hawaiian or Other Pacific Islander 5
Declined 6
Unavailable 7
Other 8
  • Use the code below in your function, if needed:
  # Add a vector of dummy values to dataframe
  # create the vector 
  RaceCode = rep(-99, nrow(df))
  
  # add vector after race column 
  df = df %>%
    add_column(RaceCode, .after = grep("Race", colnames(df)))
  
  # update dummy values based on information in Race column 
  df$RaceCode[df$Race == "White or Caucasian"] = 0
  df$RaceCode[df$Race == "Asian"] = 1
  df$RaceCode[df$Race == "Black or African American"] = 2
  df$RaceCode[df$Race == "Hispanic or Latino"] = 3
  df$RaceCode[df$Race == "American Indian or Alaskan Native"] = 4
  df$RaceCode[df$Race == "Native Hawaiian or Other Pacific Islander"] = 5
  df$RaceCode[df$Race == "Declined"] = 6
  df$RaceCode[df$Race == "Unavailable"] = 7
  df$RaceCode[df$Race == "Other"] = 8

\(~\)

5.3 The Progress Bar Function

  • Creating user friendly code is important, and considered a best practice. Because our code contains long loops that take a while to run, we don’t want to leave the user high and dry wondering what the program is doing or what step it’s on. We found this somewhat well documented progress bar function online, lets try and use it.

  • The Function:

    # progress bar function
    make_progress_bar = function(num, format_text){
      # Progress bar package
      # create the progress bar environment
      progbar = progress_bar$new(
    format = paste(format_text, "[:bar] :percent eta: :eta"),
    total = num, width = 100, 
    complete = "\u220e",
    current = "\u22b3",
    incomplete = "\u2218",
    clear = FALSE
      )
    
      # return new progress bar
      return(progbar)
    }
  • The Documentation

    • Inputs
      • num: Number of times a loop will run, often equal to ‘nrow(df)’, or whatever controls the loop
      • format_text: String input of text to be included with progress bar
    • Usage
      • Create a new progress bar before the intended loop. It is important to set the progress bar equal to a variable. This matters later when we want to “advance” the progress bar

      • write your loop as normal

      • In the last line of your loop, include the code VariableName$tick() to instruct the progress bar to advance

      • This function requires a specific package to operate. Include the below code in your script before running the function to ensure the function works:

        # Progress bar package
        if (!require(progress)) {install.packages("progress")}
        library(progress)

6 Demographic Statistitics

6.1 Creating the dataframe

  • We have covered these skills/process many times, so I will go ahead and give you the code
# make a patient demographics dataframe ----------------------------------------
# get list of patients 
pat_list = names(table(raw_data_processed$PatientID))

# empty data frame to fill out
pat_demo = data.frame(FirstName = character(),
                      LastName = character(),
                      Age = numeric(),
                      Race = character(),
                      RaceCode = numeric(),
                      Gender = character(),
                      Male = numeric(),
                      PatientID = character())

for(i in 1:length(pat_list)){
  # get current patient 
  current_pat = pat_list[i]
  
  # get a visit from raw data for reference 
  pat_visit = raw_data_processed %>%
    filter(PatientID == current_pat)
  pat_visit = pat_visit[1, ]
  
  # browser()
  
  # fill out information
  pat_demo[i, ]$FirstName = pat_visit$FirstName
  pat_demo[i, ]$LastName = pat_visit$LastName
  pat_demo[i, ]$Age = pat_visit$Age
  pat_demo[i, ]$Race = pat_visit$Race
  pat_demo[i, ]$RaceCode = pat_visit$RaceCode
  pat_demo[i, ]$Gender = pat_visit$Gender
  pat_demo[i, ]$Male = pat_visit$Male
  pat_demo[i, ]$PatientID = pat_visit$PatientID
}

6.2 2019 Census Dataframe

  • This contains projections of the 2019 population drawn from the 2010 cenus
# expected frequencies from US census data (in the same order as )
race_labels = c("American Indian or Alaskan Native", "Asian", "Black or African American",
                "Hispanic or Latino", "Native Hawaiian or Other Pacific Islander",
                "White or Caucasian")

# dataframe
expected_pop_2019 = data.frame(Race = race_labels, 
                          population = c(34310, 497517, 621902, 677211, 7410, 5054153),
                          percent = c(34310, 497517, 621902, 677211, 7410, 5054153)/6892503)

6.3 Make Fixed Factors

  • There is a lot of redundant typing here so I will give you the code
# encode demographic variables as factors --------------------------------------
make_fixed_factors = function(df){
  # create label vectors 
  gender_labels = c("Female", "Male")
  race_labels = c("White or Caucasian", "Asian", "Black or African American", 
                  "Hispanic or Latino", "American Indian or Alaskan Native",
                  "Native Hawaiian or Other Pacific Islander", "Declined", 
                  "Unavailable", "Other")
  
  # encode as factors 
  df$Male = factor(df$Male, levels = c(0, 1), labels = gender_labels)
  df$RaceCode = factor(df$RaceCode, levels = c(0:8), labels = race_labels)
  
  # return df
  return(df)
}