1 Random Helpful Stuff

This section just has reminders about some helpful stuff/commands in R

1.1 Clearing the Global Environment

All variables:
```
remove(list = ls())
```
One Variable
```
remove(list = VariableName)
```

1.2 The Help Command

The help command brings up useful documentation about both base R functions and even some R packages (if the documentation exists in the frist place)
- Syntax:
```
?FunctionName
```

$~$

2 Creating a README

2.1 Install knitr

knitr allows R code to be embedded and actually work when used in with markup languages developed for report/docuemnt generation
Specifically, we will use knitr to run embedded R code in R Markdown documents

Code:
```
install.packages("knitr")
```

2.2 Create a README

Good guidelines for research/publishing READMEs: https://guides.lib.uci.edu/datamanagement/readme

Copy and paste the text below into the “body” of your README:

### MEEI Clinic X 2018 Study 

This project analyzes a dataset of patient visits to Clinic X at Mass Eye and Ear during 2018. During 2018, the MBTA was affected by a flood that stopped all public transit in the Boston area from May 1st, 2018, to July 31st. The purpose of this study is to investigate the impact of the public transport closure on the attendance of patients at the clinic during as well as after the period of closure. 

### General Information / Data Acquisition 

### Dataset Information

1. Patient visit data, with each row containing:
    + First Name 
    + Last Name
    + Age
    + Date of the visit
    + MRN
    + Gender
    + Race
    + Visit status 
        + Categorized by 3 options: Cancel, Complete, or No Show
    + Date the visit was scheduled 
    + Date the visit was canceled, if applicable 
    + Diagnosis (if completed visit)
        + ICD-10 code

$~$

3 Subscripts

Long projects (such as ours) are often broken into multiple scripts. This has a few benefits, such as better navigation/organization of the program. It can also make it easier to trouble shoot when things go wrong. It is often harder to find problems

3.1 Test Function

Below is the code for a simple function to use in your TestSubScript.r script

Copy the code below:

# This test function will average two numbers and return the answer to the user
TestFunction = function(x, y){
  # this function averages x and y
  xy_average = (x + y)/2

  # return the value 
  return(xy_average)
}

3.2 Importing subscripts

To import subscripts into your main script, use the source() command.

Basic Syntax:
```
source("filepath")
```
Full code (if needed):
```
source("Scripts/Subscripts/TestSubScript.r")
```

$~$

4 Basic Data Cleaning

4.1 Rename columns

To rename columns of a dataframe, we can use the pipe %>% and the rename function rename().

Basic Syntax:

df = df %>%
  rename(NewColName = OldColName)

Class example:

df = df %>%
  rename(FirstName = FIRSTNAME)

Full Code:

df = df %>%
  rename(
    FirstName = FIRSTNAME,
    LastName = LASTNAME,
    Age = AGE,
    Gender = GENDER, 
    Race = RACE,
    Provider = PROVIDER,
    ProviderName = PROVIDERNAME,
    Department = DEPARTMENT,
    DepartmentName = DEPARTMENTNAME,
    SchedulingStatus = SCHEDULINGSTATUS,
    PatientID = PATIENTID,
    ApptMadeDate = APPTMADEDATE,
    ApptCancelDate = APPTCANCELDATE,
    DiagnosisCode = DIAGNOSIS
  )

Full Function (if needed):

# rename columns from Screaming Case to Camel Case
rename_columns = function(df){
# rename columns to CamelCase
df = df %>%
  rename(
    FirstName = FIRSTNAME,
    LastName = LASTNAME,
    Age = AGE,
    Gender = GENDER, 
    Race = RACE,
    Provider = PROVIDER,
    ProviderName = PROVIDERNAME,
    Department = DEPARTMENT,
    DepartmentName = DEPARTMENTNAME,
    SchedulingStatus = SCHEDULINGSTATUS,
    PatientID = PATIENTID,
    ApptMadeDate = APPTMADEDATE,
    ApptCancelDate = APPTCANCELDATE,
    DiagnosisCode = DIAGNOSIS
  )

  # return the updated dataframe
  return(df)
}

$~$

4.2 Fix the Dates

When working with dates in a program, it is important to ensure that all dates are uniform. There are a few “systems” that can be used for dates in R, but I prefer lubridate, as it is technically part of tidyverse. But, our problems go beyond just ensuring all dates can be handled the same way. First, we have to fix the excel numeric serial date issue in the ApptCancelDate column. To do this, we will use janitor and the excel_numeric_to_date() function.

Janitor

Janitor has a lot of ‘everyday’ cleaning functions
Install:
```
install.packages("janitor")
```

Syntax:

new_date = excel_numeric_to_date(as.numeric(old_date))

Full code (if needed):

# fix excel serial date issue 
df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))

Lubridate
- Lubridate is my favorite date handling system. It has a lot of useful functions, mainly ymd(). We will take about this all later, but for now lets just convert all the dates to ymd() format
- Install:
```
install.packages(lubridate)
```
- Syntax:
```
ymd_date = ymd(old_date)
```

Full Function (if needed):

# Function for fixing the dates
fix_the_dates = function(df){
  # fix excel serial date issue 
  df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))

  # convert everything to ymd()
  df$DOS = ymd(df$DOS)
  df$ApptMadeDate = ymd(df$ApptMadeDate)
  df$ApptCancelDate = ymd(df$ApptCancelDate)

  # return the updated dataframe
  return(df)
}

$~$

4.3 Checkpoint

At this point, your RProject should have a main.r script, as well as a DataCleaningScript.r located in the subscripts folder. Make sure that the body of both scripts matches those outlined below. If needed, copy and paste the contents to ensure that they work properly.

main.r

#### Main Script ####
######## Import Statements ########
library(tidyverse)
library(readxl)


### Source subscripts ####
source("Scripts/Subscripts/DataCleaningScript.r")


######## Import Data ########
# file path to raw data
raw_data_path = "Data/ClinicX2018.xlsx"

# raw data sheet in excel spreadsheed
raw_data_sheet = "VISITS"

# import raw data into a dataframe
raw_data = read_excel(raw_data_path, sheet = raw_data_sheet)



######## Basic Data Cleaning ########
# functions are imported from data cleaning script 
# make a copy of raw_data to work with
visit_data = raw_data

# clean the data
visit_data = clean_the_data(visit_data)

DataCleaningScript.r

# Data Cleaning Subscript
# This script contains the functions needed for cleaning the data 
# these functions will be loaded into the [main.r] script

######## Import Statements ########
library(janitor)
library(lubridate)


######## Functions #########
# rename columns from Screaming Case to Camel Case
rename_columns = function(df){
  # rename columns to CamelCase
  df = df %>%
    rename(
      FirstName = FIRSTNAME,
      LastName = LASTNAME,
      Age = AGE,
      Gender = GENDER, 
      Race = RACE,
      Provider = PROVIDER,
      ProviderName = PROVIDERNAME,
      Department = DEPARTMENT,
      DepartmentName = DEPARTMENTNAME,
      SchedulingStatus = SCHEDULINGSTATUS,
      PatientID = PATIENTID,
      ApptMadeDate = APPTMADEDATE,
      ApptCancelDate = APPTCANCELDATE,
      DiagnosisCode = DIAGNOSIS
    )

  # return the updated dataframe
  return(df)
}


# Function for fixing the dates
fix_the_dates = function(df){
  # fix excel serial date issue 
  df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))

  # convert everything to ymd()
  df$DOS = ymd(df$DOS)
  df$ApptMadeDate = ymd(df$ApptMadeDate)
  df$ApptCancelDate = ymd(df$ApptCancelDate)

  # return the updated dataframe
  return(df)
}


# combine all functions into one function
clean_the_data = function(df){
  # remove any duplicate rows
  df = distinct(df)

  # rename columns 
  df = rename_columns(df)

  # fix the dates
  df = fix_the_dates(df)

  # return the clean dataframe
  return(df)
}

$~$

5 Data Processing

5.1 Adding Gender Codes

5.1.1 Adding a column to a dataframe

In most programming languages you have to add an entire column at once. This becomes a little tricky when we are trying to row by row to determine the right gender code to add, but there is a common work around that’s common in programming. You first create a vector of dummy data, then you “correct” each value row by row
Creating a Dummy Vector
- Use the rep() function to create a variable of repeated values of a certain length.
- Basic Syntax:
```
rep(value, length)
```
- Full code (if needed):
```
rep(-99, nrow(df))
```
Adding the Dummy Vector as a Column
- Use the %>% pipe and the add_column() function to add the Male vector as a column to the dataframe
- Basic Syntax:
```
df = df %>%
  add_column(VectorName, 
    .after = Index)
```
- Full code (if needed):
```
df = df %>%
   add_column(Male, 
   .after)
```

5.1.2 Indexing with a for loop

The slide decks from class have the visual description of how these commands work, which are probably the most helpful. Here, I will breifly describe the functions.
Row by row:
- Basic Syntax
```
current_row = df[i, ]
```
Element by element
- This form is used to go row by row through the elements of a single column
- Basic Syntax:
```
current_value = df[i, ]$ColumnName
```
- Full code (if needed):
```
df[i, ]$Male
```

5.1.3 Nested If/Else in a for loop

For each value of Gender, we want to check if its Male or Female and assign a number to another column accordingly. To do this, we use an If/Else statement and the indexing conventions described above

Full code(if needed):

# indexing For Loop 
for (i in 1:nrow(df)){
  if (df[i, ]$Gender == "Male"){
    df[i, ]$Male = 1
  } else {
    df[i, ]$Male = 0
  }
}

5.1.4 Checkpoint

By now, you should have a working function that adds a numerical code to each row based on the value in “Gender”. Make sure your code matches and functions like the function below, and correct it if necessary

add_gender_codes()

# Function for adding a numerical representation for gender
add_gender_codes = function(df){
  # create a vector of dummy values 
  Male = rep(-99, nrow(df))
  #browser()

  # add column after Gender column
  df = df %>%
    add_column(
      Male,
      .after = grep("Gender", colnames(df)))

  # indexing For Loop 
  for (i in 1:nrow(df)){
    if (df[i, ]$Gender == "Male"){
      df[i, ]$Male = 1
    } else {
      df[i, ]$Male = 0
    }
  }

 return(df)
}

$~$

5.2 Adding a Categorical variable for race

Because we have more than two options for race (as opposed to just Male/Female for gender), we will need more than 0 and 1 to represent them. Let’s use the following code:

Race	Code
White or Caucasian	0
Asian	1
Black or African American	2
Hispanic or Latino	3
American Indian or Alaska Native	4
Native Hawaiian or Other Pacific Islander	5
Declined	6
Unavailable	7
Other	8

Use the code below in your function, if needed:

  # Add a vector of dummy values to dataframe
  # create the vector 
  RaceCode = rep(-99, nrow(df))
  
  # add vector after race column 
  df = df %>%
    add_column(RaceCode, .after = grep("Race", colnames(df)))
  
  # update dummy values based on information in Race column 
  df$RaceCode[df$Race == "White or Caucasian"] = 0
  df$RaceCode[df$Race == "Asian"] = 1
  df$RaceCode[df$Race == "Black or African American"] = 2
  df$RaceCode[df$Race == "Hispanic or Latino"] = 3
  df$RaceCode[df$Race == "American Indian or Alaskan Native"] = 4
  df$RaceCode[df$Race == "Native Hawaiian or Other Pacific Islander"] = 5
  df$RaceCode[df$Race == "Declined"] = 6
  df$RaceCode[df$Race == "Unavailable"] = 7
  df$RaceCode[df$Race == "Other"] = 8

$~$

5.3 The Progress Bar Function

Creating user friendly code is important, and considered a best practice. Because our code contains long loops that take a while to run, we don’t want to leave the user high and dry wondering what the program is doing or what step it’s on. We found this somewhat well documented progress bar function online, lets try and use it.

The Function:

# progress bar function
make_progress_bar = function(num, format_text){
  # Progress bar package
  # create the progress bar environment
  progbar = progress_bar$new(
format = paste(format_text, "[:bar] :percent eta: :eta"),
total = num, width = 100, 
complete = "\u220e",
current = "\u22b3",
incomplete = "\u2218",
clear = FALSE
  )

  # return new progress bar
  return(progbar)
}

The Documentation
- Inputs
  - num: Number of times a loop will run, often equal to ‘nrow(df)’, or whatever controls the loop
  - format_text: String input of text to be included with progress bar
- Usage
  - Create a new progress bar before the intended loop. It is important to set the progress bar equal to a variable. This matters later when we want to “advance” the progress bar
  - write your loop as normal
  - In the last line of your loop, include the code VariableName$tick() to instruct the progress bar to advance
  - This function requires a specific package to operate. Include the below code in your script before running the function to ensure the function works:
```
# Progress bar package
if (!require(progress)) {install.packages("progress")}
library(progress)
```

6 Demographic Statistitics

6.1 Creating the dataframe

We have covered these skills/process many times, so I will go ahead and give you the code

# make a patient demographics dataframe ----------------------------------------
# get list of patients 
pat_list = names(table(raw_data_processed$PatientID))

# empty data frame to fill out
pat_demo = data.frame(FirstName = character(),
                      LastName = character(),
                      Age = numeric(),
                      Race = character(),
                      RaceCode = numeric(),
                      Gender = character(),
                      Male = numeric(),
                      PatientID = character())

for(i in 1:length(pat_list)){
  # get current patient 
  current_pat = pat_list[i]
  
  # get a visit from raw data for reference 
  pat_visit = raw_data_processed %>%
    filter(PatientID == current_pat)
  pat_visit = pat_visit[1, ]
  
  # browser()
  
  # fill out information
  pat_demo[i, ]$FirstName = pat_visit$FirstName
  pat_demo[i, ]$LastName = pat_visit$LastName
  pat_demo[i, ]$Age = pat_visit$Age
  pat_demo[i, ]$Race = pat_visit$Race
  pat_demo[i, ]$RaceCode = pat_visit$RaceCode
  pat_demo[i, ]$Gender = pat_visit$Gender
  pat_demo[i, ]$Male = pat_visit$Male
  pat_demo[i, ]$PatientID = pat_visit$PatientID
}

6.2 2019 Census Dataframe

This contains projections of the 2019 population drawn from the 2010 cenus

# expected frequencies from US census data (in the same order as )
race_labels = c("American Indian or Alaskan Native", "Asian", "Black or African American",
                "Hispanic or Latino", "Native Hawaiian or Other Pacific Islander",
                "White or Caucasian")

# dataframe
expected_pop_2019 = data.frame(Race = race_labels, 
                          population = c(34310, 497517, 621902, 677211, 7410, 5054153),
                          percent = c(34310, 497517, 621902, 677211, 7410, 5054153)/6892503)

6.3 Make Fixed Factors

There is a lot of redundant typing here so I will give you the code

# encode demographic variables as factors --------------------------------------
make_fixed_factors = function(df){
  # create label vectors 
  gender_labels = c("Female", "Male")
  race_labels = c("White or Caucasian", "Asian", "Black or African American", 
                  "Hispanic or Latino", "American Indian or Alaskan Native",
                  "Native Hawaiian or Other Pacific Islander", "Declined", 
                  "Unavailable", "Other")
  
  # encode as factors 
  df$Male = factor(df$Male, levels = c(0, 1), labels = gender_labels)
  df$RaceCode = factor(df$RaceCode, levels = c(0:8), labels = race_labels)
  
  # return df
  return(df)
}

Data Processing Reference Sheet

Abigail Payne

5/4/2021