Analytics Project Presentation

May 4, 2015

Reading the data

library(dplyr)
library(ggplot2)
library(reshape2)

providers = read.csv('providers.csv')
work = read.csv('work.csv')
emr_touches = read.csv('emr_touches.csv')
emr_time = read.csv('emr_time.csv')

Data cleaning

For ease of use, two of the tables are converted to wide format:

work %>% 
  dcast(Practice.ID + Provider.ID + Observation.Month ~ Variable) -> 
  work

emr_time %>% 
  dcast(Practice.ID + Provider.ID + Observation.Month ~ Variable) -> 
  emr_time

# replace every space in the column names with a dot
names(emr_time) <- gsub(" ", ".", names(emr_time), fixed = TRUE)
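
To make the long-to-wide step concrete, here is a minimal sketch on a toy table. The toy data and the value column name `Value` are assumptions for illustration; the real files may name that column differently. Passing `value.var` explicitly avoids relying on dcast guessing the value column.

```r
library(reshape2)

# Toy long-format table; the column name `Value` is an assumption.
long <- data.frame(
  Provider.ID = c(1, 1, 2, 2),
  Observation.Month = c("2015-01", "2015-01", "2015-01", "2015-01"),
  Variable = c("Encounters", "RVUs", "Encounters", "RVUs"),
  Value = c(120, 340.5, 95, 210)
)

# One row per provider-month, one column per level of Variable.
wide <- dcast(long, Provider.ID + Observation.Month ~ Variable,
              value.var = "Value")
print(wide)
```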

Then we join all tables:

df = inner_join(emr_time, emr_touches)
df = inner_join(df, work)
df = inner_join(df, providers)
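
Without a `by` argument, inner_join matches on every column name the two tables share. The sketch below, on toy tables, names the keys explicitly and uses anti_join to show which rows an inner join would drop. The key columns listed here are assumed to be the shared keys; the toy values are made up for illustration.

```r
library(dplyr)

# Assumed join keys (taken from the dcast formula used earlier).
keys <- c("Practice.ID", "Provider.ID", "Observation.Month")

a <- data.frame(Practice.ID = c(1, 1), Provider.ID = c(10, 11),
                Observation.Month = c("2015-01", "2015-01"),
                Keystrokes = c(500, 800))
b <- data.frame(Practice.ID = c(1, 1), Provider.ID = c(10, 12),
                Observation.Month = c("2015-01", "2015-01"),
                Encounters = c(40, 55))

# Rows present in both tables for the same key combination.
joined <- inner_join(a, b, by = keys)

# Rows of `a` with no match in `b`; these are lost by the inner join.
dropped <- anti_join(a, b, by = keys)
```

Checking the size of `dropped` before joining is a cheap way to see how much data each join removes.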

Data cleaning

The table is fairly clean, but we need to handle a few inconsistent data points.

# remove rows with a missing value in any column
df = df[complete.cases(df), ]

# remove cases with zero encounters
# remove cases with zero or negative RVUs
df %>% filter(Encounters > 0) %>%
  filter(RVUs > 0) -> 
  df

# remove cases with zero Keystrokes or zero Mouseclicks
df %>% filter(Keystrokes > 0, Mouseclicks > 0) -> df
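
Note that complete.cases looks at every column, not only Encounters. A small base R sketch on made-up data shows the behavior and how to count missing values per column before dropping rows:

```r
# Toy data with missing values in two different columns.
toy <- data.frame(
  Encounters = c(10, NA, 25, 30),
  RVUs = c(12.5, 8.0, NA, 20.0),
  Keystrokes = c(400, 350, 500, 0)
)

# Number of missing values per column.
na_per_column <- colSums(is.na(toy))

# complete.cases() is TRUE only for rows with no NA in any column,
# so both the row missing Encounters and the row missing RVUs are dropped.
kept <- toy[complete.cases(toy), ]
```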

To get the EMR time per encounter, we sum the four time components and divide by the number of encounters. We also normalize the RVUs by the number of encounters.

df %>% mutate(EMR_time = (Time.Intake + Time.Exam + 
                            Time.Signoff + Time.Postvisit) / Encounters, 
              rvu = RVUs / Encounters) -> df

Massaging the parameters

Now let's check the distribution of EMR time.

df %>% ggplot(aes(EMR_time)) +
  geom_density()

Since the distribution is not normal, we apply a log transform to make it approximately normal:

df %>% ggplot(aes(log10(EMR_time))) +
  geom_density()
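
As a rough numeric companion to the density plots, sample skewness can be compared before and after the transform. The sketch below uses simulated log-normal data as a stand-in (an assumption; it is not the study data), and `skewness` is a small helper defined here, not a base R function.

```r
set.seed(1)

# Simulated right-skewed data standing in for EMR_time (assumption).
x <- rlnorm(5000, meanlog = 0, sdlog = 1)

# Sample skewness: third central moment divided by sd cubed.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
```

A skewness close to zero after the transform is consistent with a roughly symmetric distribution, although it does not prove normality.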

The same transform is applied to other parameters, such as Keystrokes and Mouseclicks.

We also normalize the features by subtracting the mean and dividing by the standard deviation. This helps later in the modeling step, because the coefficients of normalized features are on a comparable scale.

normalize = function(x) {
  return((x - mean(x)) / sd(x))
}

df$mouse = normalize(log10(df$Mouseclicks / df$Encounters))
df$key = normalize(log10(df$Keystrokes / df$Encounters))

df$tenure = normalize(df$Tenure.Using.EMR)
df$rvu_cent = normalize(df$rvu)
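
A quick check of what the normalization does, on a made-up vector: the result has mean 0 and standard deviation 1, and it matches base R's `scale()`, which could be used instead.

```r
normalize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Made-up values for illustration.
x <- c(2, 4, 6, 8, 15)
z <- normalize(x)

# scale() centers and divides by the sample standard deviation as well;
# it returns a one-column matrix, so convert it back to a plain vector.
z_scale <- as.numeric(scale(x))
```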

Keystrokes and mouse clicks

Let's see how mouse clicks and keystrokes relate to EMR time.

df %>% ggplot(aes(key, mouse, color = log10(EMR_time))) +
  geom_point() + scale_colour_gradient2() +
  stat_smooth(method="lm") +
  theme(legend.position = "bottom") +
  xlab("Key strokes") + ylab("Mouse clicks")

Keystrokes and mouse clicks are correlated with time.

We define a new feature: the normalized log ratio of mouse clicks to keystrokes.

df$mouse_key = normalize(log10(df$Mouseclicks / df$Keystrokes))

Exploratory graphs

Model building

We use linear regression to estimate the effect of each parameter while controlling for the others.

df %>% lm(formula = log(EMR_time) ~ mouse_key + 
       Specialty + 
       tenure +
       rvu +
       Provider.Level) -> model

## 
## Call:
## lm(formula = log(EMR_time) ~ mouse_key + Specialty + tenure + 
##     rvu + Provider.Level, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0803 -0.3761  0.0202  0.4039  3.6669 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.420050   0.026875  52.839  < 2e-16 ***
## mouse_key               -0.380690   0.005256 -72.425  < 2e-16 ***
## SpecialtyOB/GYN         -0.512863   0.025171 -20.375  < 2e-16 ***
## SpecialtyPCP             0.253915   0.023020  11.030  < 2e-16 ***
## SpecialtySurgery        -0.811843   0.032771 -24.773  < 2e-16 ***
## tenure                  -0.157856   0.005238 -30.138  < 2e-16 ***
## rvu                      0.086667   0.006282  13.795  < 2e-16 ***
## Provider.LevelMid-level  0.087098   0.012284   7.091  1.4e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6341 on 14916 degrees of freedom
## Multiple R-squared:  0.4157, Adjusted R-squared:  0.4154 
## F-statistic:  1516 on 7 and 14916 DF,  p-value: < 2.2e-16
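
Because the response is the natural log of EMR time, each coefficient can be read as an approximate percentage change in EMR time via exp(coefficient) - 1. The sketch below does this arithmetic for three coefficients copied from the output above; `mid_level` is shorthand introduced here for Provider.LevelMid-level. For mouse_key and tenure the change is per one standard deviation, since both were normalized.

```r
# Coefficients copied from the regression output above.
coefs <- c(mouse_key = -0.380690,
           tenure = -0.157856,
           mid_level = 0.087098)

# Percentage change in EMR time per unit increase of each predictor,
# holding the other predictors fixed.
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 1))
```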

The model has a low R-squared (0.4156601). Two likely reasons are a lack of explanatory parameters and the complexity of the data.

Which parameters play the more important role?

We can use the relaimpo package to find out.
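
The lmg metric reported by relaimpo averages each regressor's sequential R-squared contribution over all orderings of the regressors. As a sketch of the idea only, the base R code below computes it by hand for two regressors on the built-in mtcars data (a stand-in, not the study data). With the package itself, the corresponding call is, to my understanding, `calc.relimp(model, type = "lmg", rela = TRUE)`.

```r
# R-squared of a linear model fitted to mtcars.
r2 <- function(formula) summary(lm(formula, data = mtcars))$r.squared

r2_wt <- r2(mpg ~ wt)
r2_hp <- r2(mpg ~ hp)
r2_full <- r2(mpg ~ wt + hp)

# Average the contribution of each regressor over both entry orders:
# entering first (its own R-squared) and entering second (the increment).
lmg_wt <- mean(c(r2_wt, r2_full - r2_hp))
lmg_hp <- mean(c(r2_hp, r2_full - r2_wt))

# Normalized shares, analogous to rela = TRUE: they sum to 1.
shares <- c(wt = lmg_wt, hp = lmg_hp) / r2_full
print(round(shares, 3))
```

The unnormalized contributions add up to the full model's R-squared, which is why the metric can be read as a decomposition of explained variance.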

## Response variable: log(EMR_time) 
## Total response variance: 0.6877858 
## Analysis based on 14924 observations 
## 
## 7 Regressors: 
## Some regressors combined in groups: 
##         Group  Specialty : SpecialtyOB/GYN SpecialtyPCP SpecialtySurgery 
## 
##  Relative importance of 5 (groups of) regressors assessed: 
##  Specialty mouse_key tenure rvu Provider.Level 
##  
## Proportion of variance explained by model: 41.57%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                       lmg
## Specialty      0.30002627
## mouse_key      0.50155768
## tenure         0.11053379
## rvu            0.07306116
## Provider.Level 0.01482111
## 
## Average coefficients for different model sizes: 
## 
##                      1group     2groups     3groups     4groups
## mouse_key        -0.3777059 -0.38108769 -0.38281942 -0.38273374
## SpecialtyOB/GYN  -0.4576141 -0.47856180 -0.49431536 -0.50553386
## SpecialtyPCP      0.1508838  0.17362434  0.19860659  0.22548518
## SpecialtySurgery -0.5266598 -0.60574876 -0.67925129 -0.74776481
## tenure           -0.1958891 -0.18724194 -0.17797448 -0.16818353
## rvu              -0.1141023 -0.06567245 -0.01621825  0.03450754
## Provider.Level    0.2100879  0.17348110  0.13945418  0.11010709
##                      5groups
## mouse_key        -0.38068958
## SpecialtyOB/GYN  -0.51286258
## SpecialtyPCP      0.25391539
## SpecialtySurgery -0.81184342
## tenure           -0.15785600
## rvu               0.08666688
## Provider.Level    0.08709754

Summary

The data are noisy, and the available parameters explain only about 42% of the variance in log EMR time.
Doctors with different specialties seem to use the interface differently.
A higher ratio of mouse clicks to keystrokes is associated with lower EMR time.
Users with longer EMR tenure have lower EMR time, but the effect is small.