All variables:
One Variable
knitr allows R code to be embedded and actually work when used in with markup languages developed for report/docuemnt generation
Specifically, we will use knitr to run embedded R code in R Markdown documents
Code:
Good guidelines for research/publishing READMEs: https://guides.lib.uci.edu/datamanagement/readme
Copy and paste the text below into the “body” of your README:
### MEEI Clinic X 2018 Study
This project analyzes a dataset of patient visits to Clinic X at Mass Eye and Ear during 2018. During 2018, the MBTA was affected by a flood that stopped all public transit in the Boston area from May 1st, 2018, to July 31st. The purpose of this study is to investigate the impact of the public transport closure on the attendance of patients at the clinic during as well as after the period of closure.
### General Information / Data Acquisition
### Dataset Information
1. Patient visit data, with each row containing:
+ First Name
+ Last Name
+ Age
+ Date of the visit
+ MRN
+ Gender
+ Race
+ Visit status
+ Categorized by 3 options: Cancel, Complete, or No Show
+ Date the visit was scheduled
+ Date the visit was canceled, if applicable
+ Diagnosis (if completed visit)
+ ICD-10 code\(~\)
\(~\)
Long projects (such as ours) are often broken into multiple scripts. This has a few benefits, such as better navigation/organization of the program. It can also make it easier to trouble shoot when things go wrong. It is often harder to find problems
Below is the code for a simple function to use in your TestSubScript.r script
Copy the code below:
To rename columns of a dataframe, we can use the pipe %>% and the rename function rename().
Basic Syntax:
Class example:
Full Code:
df = df %>%
rename(
FirstName = FIRSTNAME,
LastName = LASTNAME,
Age = AGE,
Gender = GENDER,
Race = RACE,
Provider = PROVIDER,
ProviderName = PROVIDERNAME,
Department = DEPARTMENT,
DepartmentName = DEPARTMENTNAME,
SchedulingStatus = SCHEDULINGSTATUS,
PatientID = PATIENTID,
ApptMadeDate = APPTMADEDATE,
ApptCancelDate = APPTCANCELDATE,
DiagnosisCode = DIAGNOSIS
)Full Function (if needed):
# rename columns from Screaming Case to Camel Case
rename_columns = function(df){
# rename columns to CamelCase
df = df %>%
rename(
FirstName = FIRSTNAME,
LastName = LASTNAME,
Age = AGE,
Gender = GENDER,
Race = RACE,
Provider = PROVIDER,
ProviderName = PROVIDERNAME,
Department = DEPARTMENT,
DepartmentName = DEPARTMENTNAME,
SchedulingStatus = SCHEDULINGSTATUS,
PatientID = PATIENTID,
ApptMadeDate = APPTMADEDATE,
ApptCancelDate = APPTCANCELDATE,
DiagnosisCode = DIAGNOSIS
)
# return the updated dataframe
return(df)
}\(~\)
When working with dates in a program, it is important to ensure that all dates are uniform. There are a few “systems” that can be used for dates in R, but I prefer lubridate, as it is technically part of tidyverse. But, our problems go beyond just ensuring all dates can be handled the same way. First, we have to fix the excel numeric serial date issue in the ApptCancelDate column. To do this, we will use janitor and the excel_numeric_to_date() function.
Janitor
Janitor has a lot of ‘everyday’ cleaning functions
Install:
Syntax:
Full code (if needed):
Lubridate
Lubridate is my favorite date handling system. It has a lot of useful functions, mainly ymd(). We will take about this all later, but for now lets just convert all the dates to ymd() format
Install:
Syntax:
Full Function (if needed):
# Function for fixing the dates
fix_the_dates = function(df){
# fix excel serial date issue
df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))
# convert everything to ymd()
df$DOS = ymd(df$DOS)
df$ApptMadeDate = ymd(df$ApptMadeDate)
df$ApptCancelDate = ymd(df$ApptCancelDate)
# return the updated dataframe
return(df)
}\(~\)
At this point, your RProject should have a main.r script, as well as a DataCleaningScript.r located in the subscripts folder. Make sure that the body of both scripts matches those outlined below. If needed, copy and paste the contents to ensure that they work properly.
main.r
#### Main Script ####
######## Import Statements ########
library(tidyverse)
library(readxl)
### Source subscripts ####
source("Scripts/Subscripts/DataCleaningScript.r")
######## Import Data ########
# file path to raw data
raw_data_path = "Data/ClinicX2018.xlsx"
# raw data sheet in excel spreadsheed
raw_data_sheet = "VISITS"
# import raw data into a dataframe
raw_data = read_excel(raw_data_path, sheet = raw_data_sheet)
######## Basic Data Cleaning ########
# functions are imported from data cleaning script
# make a copy of raw_data to work with
visit_data = raw_data
# clean the data
visit_data = clean_the_data(visit_data)DataCleaningScript.r
# Data Cleaning Subscript
# This script contains the functions needed for cleaning the data
# these functions will be loaded into the [main.r] script
######## Import Statements ########
library(janitor)
library(lubridate)
######## Functions #########
# rename columns from Screaming Case to Camel Case
rename_columns = function(df){
# rename columns to CamelCase
df = df %>%
rename(
FirstName = FIRSTNAME,
LastName = LASTNAME,
Age = AGE,
Gender = GENDER,
Race = RACE,
Provider = PROVIDER,
ProviderName = PROVIDERNAME,
Department = DEPARTMENT,
DepartmentName = DEPARTMENTNAME,
SchedulingStatus = SCHEDULINGSTATUS,
PatientID = PATIENTID,
ApptMadeDate = APPTMADEDATE,
ApptCancelDate = APPTCANCELDATE,
DiagnosisCode = DIAGNOSIS
)
# return the updated dataframe
return(df)
}
# Function for fixing the dates
fix_the_dates = function(df){
# fix excel serial date issue
df$ApptCancelDate = excel_numeric_to_date(as.numeric(df$ApptCancelDate))
# convert everything to ymd()
df$DOS = ymd(df$DOS)
df$ApptMadeDate = ymd(df$ApptMadeDate)
df$ApptCancelDate = ymd(df$ApptCancelDate)
# return the updated dataframe
return(df)
}
# combine all functions into one function
clean_the_data = function(df){
# remove any duplicate rows
df = distinct(df)
# rename columns
df = rename_columns(df)
# fix the dates
df = fix_the_dates(df)
# return the clean dataframe
return(df)
}\(~\)
\(~\)
In most programming languages you have to add an entire column at once. This becomes a little tricky when we are trying to row by row to determine the right gender code to add, but there is a common work around that’s common in programming. You first create a vector of dummy data, then you “correct” each value row by row
Creating a Dummy Vector
Use the rep() function to create a variable of repeated values of a certain length.
Basic Syntax:
Full code (if needed):
Adding the Dummy Vector as a Column
Use the %>% pipe and the add_column() function to add the Male vector as a column to the dataframe
Basic Syntax:
Full code (if needed):
Basic Syntax
This form is used to go row by row through the elements of a single column
Basic Syntax:
Full code (if needed):
For each value of Gender, we want to check if its Male or Female and assign a number to another column accordingly. To do this, we use an If/Else statement and the indexing conventions described above
Full code(if needed):
add_gender_codes()
# Function for adding a numerical representation for gender
add_gender_codes = function(df){
# create a vector of dummy values
Male = rep(-99, nrow(df))
#browser()
# add column after Gender column
df = df %>%
add_column(
Male,
.after = grep("Gender", colnames(df)))
# indexing For Loop
for (i in 1:nrow(df)){
if (df[i, ]$Gender == "Male"){
df[i, ]$Male = 1
} else {
df[i, ]$Male = 0
}
}
return(df)
}\(~\)
| Race | Code |
|---|---|
| White or Caucasian | 0 |
| Asian | 1 |
| Black or African American | 2 |
| Hispanic or Latino | 3 |
| American Indian or Alaska Native | 4 |
| Native Hawaiian or Other Pacific Islander | 5 |
| Declined | 6 |
| Unavailable | 7 |
| Other | 8 |
# Add a vector of dummy values to dataframe
# create the vector
RaceCode = rep(-99, nrow(df))
# add vector after race column
df = df %>%
add_column(RaceCode, .after = grep("Race", colnames(df)))
# update dummy values based on information in Race column
df$RaceCode[df$Race == "White or Caucasian"] = 0
df$RaceCode[df$Race == "Asian"] = 1
df$RaceCode[df$Race == "Black or African American"] = 2
df$RaceCode[df$Race == "Hispanic or Latino"] = 3
df$RaceCode[df$Race == "American Indian or Alaskan Native"] = 4
df$RaceCode[df$Race == "Native Hawaiian or Other Pacific Islander"] = 5
df$RaceCode[df$Race == "Declined"] = 6
df$RaceCode[df$Race == "Unavailable"] = 7
df$RaceCode[df$Race == "Other"] = 8\(~\)
Creating user friendly code is important, and considered a best practice. Because our code contains long loops that take a while to run, we don’t want to leave the user high and dry wondering what the program is doing or what step it’s on. We found this somewhat well documented progress bar function online, lets try and use it.
The Function:
# progress bar function
make_progress_bar = function(num, format_text){
# Progress bar package
# create the progress bar environment
progbar = progress_bar$new(
format = paste(format_text, "[:bar] :percent eta: :eta"),
total = num, width = 100,
complete = "\u220e",
current = "\u22b3",
incomplete = "\u2218",
clear = FALSE
)
# return new progress bar
return(progbar)
}The Documentation
num: Number of times a loop will run, often equal to ‘nrow(df)’, or whatever controls the loopformat_text: String input of text to be included with progress barCreate a new progress bar before the intended loop. It is important to set the progress bar equal to a variable. This matters later when we want to “advance” the progress bar
write your loop as normal
In the last line of your loop, include the code VariableName$tick() to instruct the progress bar to advance
This function requires a specific package to operate. Include the below code in your script before running the function to ensure the function works:
# make a patient demographics dataframe ----------------------------------------
# get list of patients
pat_list = names(table(raw_data_processed$PatientID))
# empty data frame to fill out
pat_demo = data.frame(FirstName = character(),
LastName = character(),
Age = numeric(),
Race = character(),
RaceCode = numeric(),
Gender = character(),
Male = numeric(),
PatientID = character())
for(i in 1:length(pat_list)){
# get current patient
current_pat = pat_list[i]
# get a visit from raw data for reference
pat_visit = raw_data_processed %>%
filter(PatientID == current_pat)
pat_visit = pat_visit[1, ]
# browser()
# fill out information
pat_demo[i, ]$FirstName = pat_visit$FirstName
pat_demo[i, ]$LastName = pat_visit$LastName
pat_demo[i, ]$Age = pat_visit$Age
pat_demo[i, ]$Race = pat_visit$Race
pat_demo[i, ]$RaceCode = pat_visit$RaceCode
pat_demo[i, ]$Gender = pat_visit$Gender
pat_demo[i, ]$Male = pat_visit$Male
pat_demo[i, ]$PatientID = pat_visit$PatientID
}# expected frequencies from US census data (in the same order as )
race_labels = c("American Indian or Alaskan Native", "Asian", "Black or African American",
"Hispanic or Latino", "Native Hawaiian or Other Pacific Islander",
"White or Caucasian")
# dataframe
expected_pop_2019 = data.frame(Race = race_labels,
population = c(34310, 497517, 621902, 677211, 7410, 5054153),
percent = c(34310, 497517, 621902, 677211, 7410, 5054153)/6892503)# encode demographic variables as factors --------------------------------------
make_fixed_factors = function(df){
# create label vectors
gender_labels = c("Female", "Male")
race_labels = c("White or Caucasian", "Asian", "Black or African American",
"Hispanic or Latino", "American Indian or Alaskan Native",
"Native Hawaiian or Other Pacific Islander", "Declined",
"Unavailable", "Other")
# encode as factors
df$Male = factor(df$Male, levels = c(0, 1), labels = gender_labels)
df$RaceCode = factor(df$RaceCode, levels = c(0:8), labels = race_labels)
# return df
return(df)
}