May 4, 2015
    library(dplyr)
    library(ggplot2)
    library(reshape2)

    providers = read.csv('providers.csv')
    work = read.csv('work.csv')
    emr_touches = read.csv('emr_touches.csv')
    emr_time = read.csv('emr_time.csv')
For ease of use, two of the tables are converted to wide format:
    work %>%
      dcast(Practice.ID + Provider.ID + Observation.Month ~ Variable) -> work
    emr_time %>%
      dcast(Practice.ID + Provider.ID + Observation.Month ~ Variable) -> emr_time

    # correcting the column names
    names(emr_time) <- sub(" ", ".", names(emr_time))
Then we join all tables:
    df = inner_join(emr_time, emr_touches)
    df = inner_join(df, work)
    df = inner_join(df, providers)
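An inner join keeps only the rows whose keys appear in both tables, so it is worth checking how many rows each join drops. The sketch below is a minimal illustration on made-up toy tables, using base R's `merge()` as a stand-in for `inner_join` (with default arguments it also matches on the shared column names and keeps only matching keys). The `*_toy` names and all values are hypothetical.

```r
# Toy tables (made-up values); column names mirror the ones used above.
emr_time_toy <- data.frame(
  Provider.ID = c(1, 2, 3),
  Time.Exam   = c(120, 95, 140)
)
providers_toy <- data.frame(
  Provider.ID = c(1, 2, 4),
  Specialty   = c("PCP", "Surgery", "OB/GYN")
)

# merge() with default arguments keeps only keys present in both tables,
# which is the same matching rule an inner join applies.
joined <- merge(emr_time_toy, providers_toy)

# Keys that fail to match on either side are dropped from the result.
dropped_left  <- setdiff(emr_time_toy$Provider.ID, providers_toy$Provider.ID)
dropped_right <- setdiff(providers_toy$Provider.ID, emr_time_toy$Provider.ID)

print(joined)
cat("rows kept:", nrow(joined), "\n")
cat("unmatched in time table:", dropped_left, "\n")
cat("unmatched in providers table:", dropped_right, "\n")
```

Comparing `nrow()` before and after each join in the real pipeline would show whether any provider-months are lost along the way.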
The table is pretty clean, but we need to take care of a few inconsistent data points.
    # remove rows with a missing value in any column
    df = df[complete.cases(df), ]

    # remove cases with zero encounters and cases with zero or negative RVUs
    df %>%
      filter(Encounters > 0) %>%
      filter(RVUs > 0) -> df

    # remove cases with zero keystrokes or zero mouse clicks
    df %>% filter(Keystrokes > 0, Mouseclicks > 0) -> df
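To see how much each cleaning rule contributes, one option is to evaluate the rules as logical columns and count the failures before dropping anything. This is only a sketch on invented rows; the `toy` data frame and the rule names are hypothetical, while the column names follow the ones used above.

```r
# Made-up rows covering each problem the filters target.
toy <- data.frame(
  Encounters  = c(10, 0, 5, NA, 8, 12),
  RVUs        = c(20, 5, -3, 7, 0, 15),
  Keystrokes  = c(500, 100, 200, 300, 400, 0),
  Mouseclicks = c(300, 50, 100, 150, 200, 90)
)

# One logical column per rule; TRUE means the row passes that rule.
checks <- with(toy, cbind(
  complete       = complete.cases(toy),
  encounters_pos = !is.na(Encounters) & Encounters > 0,
  rvus_pos       = !is.na(RVUs) & RVUs > 0,
  input_pos      = Keystrokes > 0 & Mouseclicks > 0
))

# Number of rows failing each rule (a row can fail more than one).
failed <- colSums(!checks)
print(failed)

# Keep only the rows that pass every rule.
clean <- toy[rowSums(checks) == ncol(checks), ]
cat("rows kept:", nrow(clean), "of", nrow(toy), "\n")
```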
To get EMR time per encounter, we add up the four time components and divide the total by the number of encounters. We also normalize RVUs by the number of encounters.
    df %>%
      mutate(EMR_time = (Time.Intake + Time.Exam + Time.Signoff + Time.Postvisit) / Encounters,
             rvu = RVUs / Encounters) -> df
Now let's check the distribution of EMR time.
    df %>%
      ggplot(aes(EMR_time)) +
      geom_density()
Since the distribution is not normal, we apply a log transform to make it approximately normal:
    df %>%
      ggplot(aes(log10(EMR_time))) +
      geom_density()
The same transformation is applied to the other variables, such as Keystrokes and Mouseclicks.
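As a rough illustration of why a log transform helps, the sketch below compares the sample skewness of a simulated, right-skewed variable before and after taking `log10`. The data are simulated from a log-normal distribution purely for illustration; nothing here claims that the real EMR times follow that distribution, and the `skewness` helper is a hypothetical convenience function.

```r
set.seed(1)
# Simulated right-skewed "time per encounter" values (log-normal, made up).
x <- rlnorm(5000, meanlog = 1.5, sdlog = 0.8)

# Sample skewness: mean of the cubed standardized values
# (close to 0 for a symmetric distribution).
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

cat("skewness, raw scale:  ", round(skew_raw, 2), "\n")
cat("skewness, log10 scale:", round(skew_log, 2), "\n")
```

A skewness near zero after the transform is consistent with the roughly symmetric density seen on the log scale.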
We also standardize the features by subtracting the mean and dividing by the standard deviation. This helps later in the modeling step.
    normalize = function(x) {
      return((x - mean(x)) / sd(x))
    }

    df$mouse = normalize(log10(df$Mouseclicks / df$Encounters))
    df$key = normalize(log10(df$Keystrokes / df$Encounters))
    df$tenure = normalize(df$Tenure.Using.EMR)
    df$rvu_cent = normalize(df$rvu)
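A quick sanity check on the `normalize` helper: the result should have mean 0 and standard deviation 1, and it should agree with base R's `scale()`. The sketch also shows that a single missing value turns the whole result into `NA`, which is why the rows with missing values have to be removed first. The input values are made up, and `normalize_na` is a hypothetical variant that is not part of the analysis above.

```r
normalize <- function(x) (x - mean(x)) / sd(x)

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values
z <- normalize(x)

cat("mean:", round(mean(z), 10), " sd:", sd(z), "\n")

# Base R's scale() performs the same centering and scaling.
same_as_scale <- isTRUE(all.equal(z, as.numeric(scale(x))))

# With a missing value, mean() and sd() return NA, so every result is NA.
all_na <- all(is.na(normalize(c(1, 2, NA))))

# Hypothetical NA-tolerant variant: ignore NAs when computing mean and sd.
normalize_na <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
z_na <- normalize_na(c(1, 2, 3, NA))
print(z_na)
```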
Let's see how mouse clicks and keystrokes are related to EMR time:
    df %>%
      ggplot(aes(key, mouse, color = log10(EMR_time))) +
      geom_point() +
      scale_colour_gradient2() +
      stat_smooth(method="lm") +
      theme(legend.position = "bottom") +
      xlab("Key strokes") +
      ylab("Mouse clicks")
Both keystrokes and mouse clicks per encounter are correlated with EMR time.
We define a new feature: the standardized log ratio of mouse clicks to keystrokes.
    df$mouse_key = normalize(log10(df$Mouseclicks / df$Keystrokes))
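One property of this ratio is that the number of encounters cancels out: before standardization, log10(Mouseclicks / Keystrokes) equals the difference between the two per-encounter log rates. The sketch below checks this numerically on made-up counts; all variable names and values are hypothetical.

```r
# Made-up monthly counts for three provider-months.
mouseclicks <- c(1200, 800, 3000)
keystrokes  <- c(4000, 900, 1500)
encounters  <- c(40, 20, 60)

log_ratio      <- log10(mouseclicks / keystrokes)
log_mouse_rate <- log10(mouseclicks / encounters)
log_key_rate   <- log10(keystrokes / encounters)

# Encounters cancel: log10(M / K) = log10(M / E) - log10(K / E)
identity_holds <- isTRUE(all.equal(log_ratio, log_mouse_rate - log_key_rate))
cat("identity holds:", identity_holds, "\n")

# Positive values mean more clicks than keystrokes, negative the reverse.
print(round(log_ratio, 3))
```

So the feature describes the mix of input (clicking versus typing) rather than the overall volume of input per encounter.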
We use linear regression to estimate the effect of each parameter while controlling for the others.
    df %>%
      lm(formula = log(EMR_time) ~ mouse_key + Specialty + tenure + rvu + Provider.Level) -> model

    # print the fitted coefficients and fit statistics
    summary(model)
    ## 
    ## Call:
    ## lm(formula = log(EMR_time) ~ mouse_key + Specialty + tenure + 
    ##     rvu + Provider.Level, data = .)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -5.0803 -0.3761  0.0202  0.4039  3.6669 
    ## 
    ## Coefficients:
    ##                          Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)              1.420050   0.026875  52.839  < 2e-16 ***
    ## mouse_key               -0.380690   0.005256 -72.425  < 2e-16 ***
    ## SpecialtyOB/GYN         -0.512863   0.025171 -20.375  < 2e-16 ***
    ## SpecialtyPCP             0.253915   0.023020  11.030  < 2e-16 ***
    ## SpecialtySurgery        -0.811843   0.032771 -24.773  < 2e-16 ***
    ## tenure                  -0.157856   0.005238 -30.138  < 2e-16 ***
    ## rvu                      0.086667   0.006282  13.795  < 2e-16 ***
    ## Provider.LevelMid-level  0.087098   0.012284   7.091  1.4e-12 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.6341 on 14916 degrees of freedom
    ## Multiple R-squared:  0.4157, Adjusted R-squared:  0.4154 
    ## F-statistic:  1516 on 7 and 14916 DF,  p-value: < 2.2e-16
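Because the response is log(EMR_time) with the natural log, each estimate can be read as a multiplicative effect on EMR time by exponentiating it. The sketch below does this for the estimates copied from the summary above; the `multiplier` and `pct_change` names are just for illustration.

```r
# Estimates copied from the regression summary above.
coefs <- c(
  mouse_key                 = -0.380690,
  "SpecialtyOB/GYN"         = -0.512863,
  SpecialtyPCP              =  0.253915,
  SpecialtySurgery          = -0.811843,
  tenure                    = -0.157856,
  rvu                       =  0.086667,
  "Provider.LevelMid-level" =  0.087098
)

# The response is log(EMR_time) (natural log), so exp(estimate) is the
# multiplicative change in EMR time for a one-unit increase in the predictor
# (or relative to the reference level, for a factor), holding the rest fixed.
multiplier <- exp(coefs)
pct_change <- 100 * (multiplier - 1)

print(round(cbind(multiplier, pct_change), 3))
```

Note that `mouse_key` and `tenure` were standardized, so "one unit" for them means one standard deviation, while `rvu` enters the model on its original per-encounter scale.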
The model has a low R-squared (0.4156601): it explains only about 42% of the variance in log(EMR_time).
Which parameters play a more important role?
    ## Response variable: log(EMR_time) 
    ## Total response variance: 0.6877858 
    ## Analysis based on 14924 observations 
    ## 
    ## 7 Regressors: 
    ## Some regressors combined in groups: 
    ##         Group  Specialty : SpecialtyOB/GYN SpecialtyPCP SpecialtySurgery 
    ## 
    ##  Relative importance of 5 (groups of) regressors assessed: 
    ##  Specialty mouse_key tenure rvu Provider.Level 
    ##  
    ## Proportion of variance explained by model: 41.57%
    ## Metrics are normalized to sum to 100% (rela=TRUE). 
    ## 
    ## Relative importance metrics: 
    ## 
    ##                       lmg
    ## Specialty      0.30002627
    ## mouse_key      0.50155768
    ## tenure         0.11053379
    ## rvu            0.07306116
    ## Provider.Level 0.01482111
    ## 
    ## Average coefficients for different model sizes: 
    ## 
    ##                      1group     2groups     3groups     4groups
    ## mouse_key        -0.3777059 -0.38108769 -0.38281942 -0.38273374
    ## SpecialtyOB/GYN  -0.4576141 -0.47856180 -0.49431536 -0.50553386
    ## SpecialtyPCP      0.1508838  0.17362434  0.19860659  0.22548518
    ## SpecialtySurgery -0.5266598 -0.60574876 -0.67925129 -0.74776481
    ## tenure           -0.1958891 -0.18724194 -0.17797448 -0.16818353
    ## rvu              -0.1141023 -0.06567245 -0.01621825  0.03450754
    ## Provider.Level    0.2100879  0.17348110  0.13945418  0.11010709
    ##                      5groups
    ## mouse_key        -0.38068958
    ## SpecialtyOB/GYN  -0.51286258
    ## SpecialtyPCP      0.25391539
    ## SpecialtySurgery -0.81184342
    ## tenure           -0.15785600
    ## rvu               0.08666688
    ## Provider.Level    0.08709754
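The `lmg` metric reported above assigns each regressor the average increase in R-squared it brings when added to the model, averaged over all possible orderings of the regressors. As a minimal sketch of the idea, the base R code below computes it by hand for the simplest case of two correlated regressors on simulated data. All names and values are made up, and the sketch does not handle grouped regressors such as Specialty.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)                  # correlated with x1 on purpose
y  <- 1 + 0.8 * x1 + 0.3 * x2 + rnorm(n)
d  <- data.frame(y, x1, x2)

# R-squared of a linear model fitted on the simulated data.
r2 <- function(formula) summary(lm(formula, data = d))$r.squared

r2_x1   <- r2(y ~ x1)
r2_x2   <- r2(y ~ x2)
r2_full <- r2(y ~ x1 + x2)

# Average each regressor's R-squared gain over the two possible entry orders:
# entering first (its own R-squared) and entering second (gain over the other).
lmg_x1 <- mean(c(r2_x1, r2_full - r2_x2))
lmg_x2 <- mean(c(r2_x2, r2_full - r2_x1))

# Normalized shares that sum to 1, comparable to the rela=TRUE output above.
shares <- c(x1 = lmg_x1, x2 = lmg_x2) / r2_full
print(round(shares, 3))
```

By construction the two unnormalized contributions add up to the full model's R-squared, which is why the table above can be read as splitting the explained 41.57% among the five groups of regressors, with `mouse_key` taking about half of it.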