Analysis of Survey Data

Introduction

A department at a university hospital has physician staffing obligations at two community hospitals. Physicians working in that department can therefore work at up to three different locations. Work location preferences vary individual, but in general, working at the primary hospital is preferred. The process by which physicians’ overall hours were allocated among different locations led to a level of dissatisfaction. A project was undertaken to improve this process.

In theory, given everyone’s preferences for how many hours to spend working at which location, hours could be assigned in a way to minimize the discrepancy between hours preferred and hours actually assigned. There was a sentiment, however, that simply taking everyone’s preferences equally into consideration would not be desirable. For example, someone with ten years of experience should get more of a say in where they work than a first year faculty member. Physicians contributing in specific ways or with needs at a certain location should also have more of a say. The general idea is easy enough to understand, but what factors specifically determine whose preferences should be more strongly taken into consideration? And just how important are those various factors?

To help answer these questions, a series of surveys were given to physicians. The first survey collected some general demographic information about the physicians, as well as a series of questions regarding how satisfied they were with aspects of the department, such as fairness and openness in decision-making processes. Physicians were also asked to rate the importance of various factors (obtained from preliminary discussions) in determining the importance of a physician’s scheduling preferences. Additionally, open-ended response sections were provided to allow physicians to provide additional feedback. Based on the responses from the first survey, a follow-up survey was later sent out, asking physicians among other things to rate the importance of an updated list of factors. Both surveys were administered through Qualtrics, and responses were exported as .csv files.

Using R, survey data was imported and cleaned. The data was analyzed to look at trends evident in the satisfaction levels among physicians. Cluster analysis and random forest models were built.

Packages Required

The following packages are required for this analysis:

Package	Use
pacman	For easy installation and loading of packages
tidyverse	data manipulation
magrittr	for piping operators
data.table	data importing
anonymizer	anonymizing sensitive information
DT	data table creation
devtools	installation of packages from github
ggplot2	plotting
ggthemes	plot themes
vip	more plots
caret	model building
glmnet	modeling
ranger	random forest modeling
factoextra	cluster analysis
likert	plotting likert data
NbClust	determining optimal cluster number

# Installs "pacman" package if not already installed
if (!require("pacman")) install.packages("pacman")

# "pacman" is then used to install/load other packages
pacman::p_load(tidyverse,
               magrittr,
               data.table,
               anonymizer,
               DT,
               devtools,
               ggplot2,
               ggthemes,
               vip,
               caret,
               glmnet,
               ranger,
               factoextra,
               NbClust)

install_github("m-dev-/likert")
library("likert")

Data Preparation

Importing the data

Survey 1

Figuring out what to import

As previously mentioned, data was exported from the Qualtrics website as a .csv file. The files contain three rows at the top before column values begin. The table below shows the contents of the first three rows. The table was transposed (columns indicate row numbers and contents) for easier viewing.

# File path of data
fpath <- "data/survey1.csv"

# Reading in the first three rows of data


# fread(fpath, header = FALSE, nrows = 3, showProgress = FALSE) %>%
#   rownames_to_column() %>%
#   t() %>%
#   datatable(class = "nowrap compact", rownames = FALSE, colnames = c("Row 1", "Row 2", "Row 3"))

# Importing first three rows, using row 1 as header
view_s1 <- fread(fpath, header = TRUE, nrows = 2)

#Viewing
view_s1  %>%
  t() %>%
  datatable(colnames = c("Item", "Description", "Import ID"), options = list(pageLength = 109, deferRender = TRUE, scrollY = 200, scroller = TRUE, dom = 't'))

The first row of the file (column 1 above) contains what appear to be good candidate column names. Row two may appear to have no difference with row one, but looking at more entries shows that the second row contains valuable information regarding the actual questions on the survey. Row three does not appear to have any information worth keeping, though it did present a challenge with importing the data.

My approach to obtaining a data set to work with was as follows:

Importing the first two rows of the data to make a data dictionary
Skipping the first three rows and importing the remainder of the data, using the names from the dictionary as column names

In each case, however, I did not want to import all of the columns. Looking at the table above, we see that the first columns contain information that’s not of interest, or identifying information, such as name and e-mail address. For that reason, I drop all the columns leading up to and including recipient last name.

The recipient e-mail address is special, however. I don’t want to remove the column altogether because it may prove very useful later. Instead, I use the anonymize function to convert the adress of each respondent into an anonymous (but unique) value.

The first two steps worked as expected (see code below).

# For use as dictionary
info_s1 <- fread(fpath, header = TRUE, nrows = 1) %$% # Import
  mutate(., Email = anonymize(RecipientEmail, .algo = "crc32", .seed = 1)) %>% # Anonymized Email variable
  select(-StartDate:-Q3_2) # Unnecessary columns removed

# Import column names from "view_s1"
survey_1 <-
  fread(fpath, skip = "2017", col.names = colnames(view_s1)) %$%     # Import
  mutate(., Email = anonymize(RecipientEmail, .algo = "crc32", .seed = 1)) %>% 
  select(-StartDate:-Q3_2)

Now that we have an initial import of survey 1 (as well as a corresponding dictionary), lets take a look at the dictionary to see what else we may not need:

#Viewing info vector
info_s1 %>%
  t() %>%
  datatable(
  colnames = c("Item", "Description"),
  options = list(
  pageLength = 109,
  deferRender = TRUE,
  scrollY = 200,
  scroller = TRUE
  )
  )

The survey has a number of sections where respondents can answer with free text. Although there can be significant insights within those comments, this analysis will not focus on that content. Therefore, any free text sections will be removed. A few things become apparent when looking at the dictionary above:

A section of the survey asks respondents to rate the importance of a given factor. For each factor listed, there is a corresponding question where they can write in comments regarding the factor of interest. Where QX#1_Y_Z is a question asking for a rating, QX#2_Y_Z is a corresponding free response question. Thus, question (column) names with “#2_” can be removed.
Further inspection of the dictionary allows me to identify a few more free text portions remove before proceeding further -Q26:-Q25, -Q10, -Q30_5_TEXT:-Q11

These variables are dropped from both survey 1 and the dictionary as well

survey_1 <- survey_1 %>%
  select(-matches("#2_"), -Q26:-Q25, -Q10, -Q30_5_TEXT:-Q11)

info_s1 <- info_s1 %>%
  select(-matches("#2_"), -Q26:-Q25, -Q10, -Q30_5_TEXT:-Q11)

There remain some questions relating to comments that I do not want. Their positions are defined in the comment_remove vector, which is then removed from the survey and dictionary:

comment_remove <- c(33:34, 37:38, 42:43, 45:46, 48:49,52:54)

survey_1 <- survey_1 %>% 
  select(-comment_remove)

info_s1 <- info_s1 %>% 
  select(-comment_remove)

Now that we are left with the desired remaining data, the next steps are to rename the columns:

names_section_1 <-
  c(
    "years_MD",
    "years_residency",
    "years_UC",
    "gender",
    "age",
    "hrs_current_UC",
    "hrs_current_J",
    "hrs_current_WC",
    "hrs_ideal_UC",
    "hrs_ideal_J",
    "hrs_ideal_WC",
    "hrs_fair_UC",
    "hrs_fair_J",
    "hrs_fair_WC",
    "comm_pref",
    "The current hours allocation process overall",
    "Transparency of the current process",
    "Fairness of the current process",
    "Clarity of the current process",
    "Equity of the current process",
    "Consistency with how procedures are applied",
    "Lack of bias",
    "Ability to base decisions off of accurate information",
    "Ability to uphold ethical and moral standards in the decision-making process",
    "Express views and feelings",
    "Influence outcome",
    "Appeal outcomes"
  )

education <- c("student_ratings",
               "resident_ratings",
               "eval_cont",
               "GR_att",
               "JC_need")

seniority <- c("years_MD", "years_UC")

pcq <- c("pat_satis",
         "pat_ph",
         "total_clin_hrs")

ct <- c("team_fit")

tr <- c("research_involvement")

io <- c("perceived_effort",
        "location_fit")

EDU <- paste("EDU_", education, sep = "")
SEN <- paste("SEN_", seniority, sep = "")
PCQ <- paste("PCQ_", pcq , sep = "")
CT  <- paste("CT_", ct, sep = "")
TR  <- paste("TR_", tr, sep = "")
IO  <- paste("IO_", io, sep = "")


names_section_2 <- c(EDU, SEN, PCQ, CT, TR, IO)

new_names <- c(names_section_1, names_section_2, "total_hrs_factor", "Email")

#Assigning new names
names(survey_1) <- new_names

# Making a copy of survey_1
s1 <- survey_1

Let’s take a look at the data:

survey_1 %>%
  t() %>%
  datatable(
  extensions = 'FixedColumns',
    options = list(
      pageLength = 100,
      scrollY = 200,
      scrollX = TRUE,
      fixedColumns = TRUE,
      dom = 'ftirs'
  ))

In addition to renaming the data, some variables need to be recoded. These include:

gender: Currently indicated by 1, 2, and 4 for “male”, “female”, and “prefer not to say”, respectively

age: integer values actually indicate age groups, consisting of groups of 5 years from 30-59

satisfaction ratings: here, values of 1-7 indicate responses on a likert scale, ranging from “Strongly Disagree” to “Strongly Agree”

likert_lables <-
  c(
    "Strongly Disagree",
    "Disagree",
    "Somewhat Disagree",
    "Neutral",
    "Somewhat Agree",
    "Agree",
    "Strongly Agree"
  )
likert_levels <- c(1, 2, 3, 4, 5, 6, 7)

# Variables to refactor: 16-27

for (i in seq(16,27)) {
  s1[[i]] <- factor(s1[[i]], labels = likert_lables, levels = likert_levels, ordered = TRUE)
}


gender_labels <- c("Male",
                   "Female",
                   "Prefer not to say")

s1$gender <-  factor(s1$gender, labels = gender_labels)

age_labels <- c("30-34", 
                "35-39",
                "40-44",
                "45-49",
                "50-54",
                "55-59")

s1$age <- factor(s1$age, labels = age_labels, ordered = TRUE)

surv1 <- s1

x <- c(0,4,9,14,19,24)
surv1$years_UC <- cut(surv1$years_UC, breaks = x, ordered_result = TRUE, labels = c("0-4","4-9","9-14","14-19","19-24"))

Survey 2

Importing Survey 2 follows a similar sequence of steps as with Survey 1

# File path of data
fpath2 <- "data/survey2.csv"

# For use in viewing
view_s2 <- fread(fpath2, header = TRUE, nrows = 1) 

#Viewing Survey 2
view_s2 %>%
  t() %>%
  datatable(colnames = c("Item", "Description"), options = list(pageLength = 109, deferRender = TRUE, scrollY = 200, scroller = TRUE, dom = 't'))

Next, I import Survey 2 and make a corresponding dictionary. As with Survey 1, I drop unnecessary columns and anonymize e-mail addresses.

survey_2 <-
  fread(fpath2, skip = "2017", col.names = colnames(view_s2)) %$% # Import
  mutate(., Email = anonymize(RecipientEmail, .algo = "crc32", .seed = 1)) %>% # Anonymized Email variable
  select(-StartDate:-Q25, -Q18:-Q19, -matches("Topics")) # Unnecessary columns removed

info_s2 <- fread(fpath2, header = TRUE, nrows = 1) %$%
  mutate(., Email = "Anonymized e-mail address") %>%
  select(-StartDate:-Q25,-Q18:-Q19,-matches("Topics"))

# Copy of survey_2
s2 <- survey_2

Renaming

# names of variables containing demographic information
names_s2_dem <- c("years_MD",
                  "years_residency",
                  "years_UC",
                  "age")

names(s2)[1:4] <- names_s2_dem

# Renaming Q7 to "community preference"
s2 %<>%
  rename("community preference" = Q7)

# names indicating factors to be taken into consideration when assigning hours
s2_factor_names <- c(
  "med_rank",
  "res_rank",
  "eval_contr",
  "GR_att",
  "att_need",
  "mentoring",
  "GR_lec",
  "years_MD",
  "years_UC",
  "age",
  "rank",
  "pt_sat",
  "pt_per_hr",
  "yrl_clin_hrs",
  "wRVUs",
  "n_charts",
  "team",
  "director",
  "committee ",
  "clin_trials",
  "productivity",
  "non_clin",
  "loc_study",
  "funding",
  "perc_effort",
  "loc_fit",
  "peer_rat"
)

s2 %<>%
  rename_at(vars(matches("#2_")), funs(paste0("BIN.", s2_factor_names)))

s2 %<>%
  rename_at(vars(matches("#1_")), funs(paste0("RAT.", s2_factor_names)))


s2 %<>%
  rename_all(funs(str_replace_all(., "Q5_1", "current_alloc"))) %>%
  rename_all(funs(str_replace_all(., "Q5_2", "ideal_alloc"))) %>%
  rename_all(funs(str_replace_all(., "Q5_3", "fair_alloc"))) %>%
  rename_at(vars(matches("_alloc_")), funs(str_replace_all(., c("1", "2", "3"), c("UC", "JH", "WC"))))

Exploring Missing Data

Lets take a look at survey 2 to get an overview:

s2 %>% 
  t() %>%
  datatable(
  extensions = 'FixedColumns',
    options = list(
      pageLength = 100,
      scrollY = 200,
      scrollX = TRUE,
      fixedColumns = TRUE,
      dom = 'ftirs'
  ))

The survey has * demographic data * Allocation of hours at locations + Current, ideal, fair * Preference of community location to work at * Factor rankings: + Various factors to potentially take into consideration are asked about. For each factor, there are two questions: + Should this be considered? This is a Yes/No response, and these questions are labeled BIN for binary + How important is this factor? This is a rating from 0-10, and these questions are labeled RAT for ratings

The BIN and RAT data contain a number of missing values. Many are scattered throughout, rather than isolated to particular rows. Because listwise deletion would result in the loss of a lot of useful data, it would be useful if there were a way to reliably impute some of those missing values.

Some respondants answered the BIN question for a particular factor but not the corresponding RAT part, and vice versa. If a respondant left RAT.X blank and answered “No” for Bin.X, does that imply a value of 0 would have been given? To examine the plausibility of such assumptions, I first create a data set that allows for easy comparison

#creating binary and ratings data frames
s2binary <-
  s2 %>%
  select(matches("BIN."))

s2ratings <-
  s2 %>% 
  select(matches("RAT."))

#Combining binary and ratings
s2comb <- bind_cols(s2binary, s2ratings)
# adding ID column to track rows when gathering
s2comb$ID  <- 1:nrow(s2comb)

s2gathered <- s2comb %>% 
  gather(factor, rating, -ID) %>% # all variables collapsed into one column with values in another column
  separate(factor, c("ranking", "variable"), "\\.") %>%
  spread(ranking, rating)

# To see what is happening
s2steps <- s2comb %>% 
  gather(factor, rating, -ID) %>%   # 55 columns, 54 w/o ID, 54 * 44 obs = 2376 rows
  separate(factor, c("ranking", "variable"), "\\.") %>% # factor column split into "ranking" indicating BIN   or RAT and "variable" indicating variable
  spread(ranking, rating) # Ranking col indicating "BIN" or "RAT" spread across "rating" col indicating ratings

# Rewriting missing values in binary column as "Missing"
s2gathered$BIN <- ifelse(s2gathered$BIN == "", "Missing", ifelse(s2gathered$BIN == "No", "No", "Yes"))
# Converting to factor
s2gathered$BIN <- factor(s2gathered$BIN)

s2gathered$RAT <- as.integer(s2gathered$RAT)

s2gathered %>%
  select(-ID) %>% 
  datatable(extensions = 'Scroller', options = list(
  deferRender = TRUE,
  scrollY = 200,
  scroller = TRUE
), rownames = FALSE) %>% 
  formatStyle(columns = c('BIN','RAT'),
              background = styleEqual(c("Missing", NA), c("yellow", "yellow")))

There are many questions that can be asked. For example, for cases where RAT is missing, what is the corresponding BIN response?

#Where ratings data is missing, what does the binary data look like?
s2gathered %>% 
  filter(is.na(RAT)) %>% 
  count(BIN)

## # A tibble: 3 x 2
##   BIN         n
##   <fct>   <int>
## 1 Missing    54
## 2 No        208
## 3 Yes        24

Indeed, it appears that the majority of cases where RAT is left blank (73%), BIN has been selected as “No”

Does that mean we can replace all those instances with 0’s for RAT? Before making such claims, it is worth looking at the values of RAT when BIN = “No”

s2gathered %>% 
  filter(BIN == "No" & !is.na(RAT)) %>% 
  count(RAT)

## # A tibble: 10 x 2
##      RAT     n
##    <int> <int>
##  1     0   151
##  2     1    13
##  3     2    13
##  4     3    15
##  5     4     3
##  6     5     4
##  7     6     1
##  8     8     1
##  9     9     1
## 10    10     7

s2gathered %>% 
  filter(BIN == "No" & !is.na(RAT)) %>% 
  ggplot(aes(x = as.factor(RAT))) + geom_bar() + theme_few()

When subjects rate a factor as “No”, they most often give a RAT of 0. What is the mean RAT value associated with each value of BIN?

s2gathered %>% 
  filter(!is.na(RAT)) %>% 
  group_by(BIN) %>% 
  summarise(avg = mean(RAT), median = median(RAT))

## # A tibble: 3 x 3
##   BIN       avg median
##   <fct>   <dbl>  <dbl>
## 1 Missing  7.86      8
## 2 No       1         0
## 3 Yes      6.10      6

It seems reasonable from the data above to replace missing RAT values with 1, when the corresponding BIN value is “No”.

What about cases where BIN is missing but a value for RAT is provided? There are 14 such instances. The breakdown is as follows:

s2gathered %>% 
  filter(!is.na(RAT) & BIN == "Missing") %>% 
  count(RAT) %>% 
  arrange()

## # A tibble: 5 x 2
##     RAT     n
##   <int> <int>
## 1     2     1
## 2     7     1
## 3     8     9
## 4     9     1
## 5    10     2

When BIN is missing, RAT tends to be a higher number. From the above table, it seems reasonable to assign BIN as “No” when RAT is a 2, and to assign BIN as “Yes” for all the other cases (BIN >6)

From the rules determined above, certain missing values can be substituted. Instead of making such changes on the gathered data set, we go back to BIN and RAT data extracted from the original. To be safe, I make some copies first:

sv2rat <- s2ratings
sv2bin <- s2binary

# To add the "Missing" value back to the binary data:
sv2bin <- ifelse(sv2bin == "", "Missing", ifelse(sv2bin == "No", "No", "Yes")) %>% 
  as.data.frame() %>% # converts to data frame and converts to factor
  as.tibble() #converts to tibble

Lets look at how many missing values we have to start in the binary and ratings data:

as_tibble(list("Total BIN Missing" = sum(sv2bin == "Missing"), "Total RAT NAs" = sum(is.na(sv2rat))))

## # A tibble: 1 x 2
##   `Total BIN Missing` `Total RAT NAs`
##                 <int>           <int>
## 1                  68             286

Some for loops are utilized to make the desired subsitutions. First, we look for missing values of RAT. When the corresponding BIN is “No”, we set RAT = 0. If BIN = “Yes”, RAT is set to 6

for (i in 1:ncol(sv2rat)) {
  for (j in 1:nrow(sv2rat)) {
    if (is.na(sv2rat[j, i])) {
      if (sv2bin[j, i] != "Missing") {
        sv2rat[j, i] = ifelse(sv2bin[j,i] == "No", 0, 6)
      }
    }
  }
}

The next loop searches for BIN values of “Missing” that have a value for the corresponding RAT. Where RAT > 6, BIN is set to “Yes”, otherwise it is set to “No”

for (i in 1:ncol(sv2bin)) {
  for (j in 1:nrow(sv2bin)) {
    if (sv2bin[j,i] == "Missing") {
      if (!is.na(sv2rat[j,i])) {
        sv2bin[j,i] = ifelse(sv2rat[j,i] > 6, "Yes", "No")
      }
    }
  }
}

Now lets look at the overall missing values:

as_tibble(list("Total BIN Missing" = sum(sv2bin == "Missing"), "Total RAT NAs" = sum(is.na(sv2rat))))

## # A tibble: 1 x 2
##   `Total BIN Missing` `Total RAT NAs`
##                 <int>           <int>
## 1                  54              54

We now have 54 missing binary ratings, down from 68. Where we once had 286 missing RAT values, we now only have 54.

Analysis

Organizational Justice Perceptions

A number of questions in survey 1 asked respondents how satisfied they were with departmental procedures, as well as their ability to influence and/or appeal outcomes. These responses were on a seven-point scale, ranging from “very dissatisfied” to “very satisfied”. Previous analysis of these sentiments involved examining the median values for each question (by having responses coded as integers from 1-7). Such aggregate analysis can be misleading. For example, responses to a certain question may consist of two groups- one very satisfied, and one very dissatisfied, but appear “neutral” when analyzed in aggregate.

To gain better insights, responses among certain groups were plotted. Such groups included gender, age groupings, and groupings for the number of years a faculty member has been employed. The code to produce such plots are shown below:

org.just_1_group.age <- likert(surv1[16:20], grouping = surv1$age)
org.just_2_group.age <- likert(surv1[21:24], grouping = surv1$age)
org.just_3_group.age <- likert(surv1[25:27], grouping = surv1$age)
org.just_1_group.gender <- likert(surv1[16:20], grouping = surv1$gender)
org.just_2_group.gender <- likert(surv1[21:24], grouping = surv1$gender)
org.just_3_group.gender <- likert(surv1[25:27], grouping = surv1$gender)
org.just_1_group.years_UC <- likert(surv1[16:20], grouping = surv1$years_UC)
org.just_2_group.years_UC <- likert(surv1[21:24], grouping = surv1$years_UC)
org.just_3_group.years_UC <- likert(surv1[25:27], grouping = surv1$years_UC)

Satisfaction ratings by age

plot(org.just_1_group.age)

plot(org.just_2_group.age)

plot(org.just_3_group.age)

Satisfaction Ratings by Gender

lik_a1 <- plot(org.just_1_group.gender) 
lik_a2 <- plot(org.just_2_group.gender)
lik_a3 <- plot(org.just_3_group.gender)

Satisfaction Ratings by Years on Faculty

plot(org.just_1_group.years_UC, group.order = levels(org.just_1_group.years_UC$grouping))

plot(org.just_2_group.years_UC, group.order = levels(org.just_2_group.years_UC$grouping))

plot(org.just_3_group.years_UC, group.order = levels(org.just_3_group.years_UC$grouping))

These plots appear to indicate less satisfaction in procedures among younger and more junior faculty members. Men also appear more satisfied with departmental processes.

Clustering

One matter of interest is the grouping among faculty by their perceptions. To look at this, clustering analysis is used on the set of organizational justice questions.

First, a data set for use in analysis is constructed

# Questions from survey 1 regarding organizational justice are put into their own data set
s1_justice <- s1[16:27]

# Factor responses are converted to integers to allow for cluster analysis
for (x in 1:ncol(s1_justice)) {
  s1_justice[[x]] <- as.integer(s1_justice[[x]])
}

# NA values removed
s1j <- na.omit(s1_justice)

Next, the following is run to help determine the best number of clusters:

All three methods indicated 2 as the best number of clusters A plot of kmeans = 2 is shown:

final_s1j <- kmeans(s1j, 2, nstart = 25)
fviz_cluster(final_s1j, data = s1j)

Differences in average values between the two clusters are shown below:

ggplot(km_cluster_dem1, aes(y = Average, x = Item, fill = Item, color = Item)) +
  geom_bar(aes(group = factor(Cluster)), stat = "identity", position = "dodge", width = 0.5)

ggplot(km_cluster_pref, aes(y = Average, x = Item, fill = Item, color = Item)) +
  geom_bar(aes(group = factor(Cluster)), stat = "identity", position = "dodge", width = 0.5)

ggplot(km_cluster_just, aes(y = Average, x = Item, fill = Item, color = Item)) +
  geom_bar(aes(group = factor(Cluster)), stat = "identity", position = "dodge", width = 0.5) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x=element_blank())

ggplot(km_cluster_rat, aes(y = Average, x = Item, fill = Item, color = Item)) +
  geom_bar(aes(group = factor(Cluster)), stat = "identity", position = "dodge", width = 0.5) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x=element_blank())

Examining Ratings data

Joining Survey 1 and Survey 2 Data

s2metrics <- bind_cols(s2[69], sv2bin, sv2rat) %>% 
  na.omit()
# na.omit leaves 39 observations (from 44)

s2other <- bind_cols(s2[69], s2[1:14])

s2rebuild <- right_join(s2other, s2metrics, by = "Email")

s_joined <- inner_join(surv1, s2rebuild, by = "Email")

fviz_nbclust(s_joined[85:111], kmeans, method = "silhouette") # 3

fviz_nbclust(s_joined[85:111], cluster::pam, method = "silhouette") # 3

fviz_nbclust(s_joined[85:111], cluster::clara, method = "silhouette")

fviz_nbclust(s_joined[85:111], cluster::fanny, method = "silhouette") # 2

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/
## k. Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/
## k. Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/
## k. Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/
## k. Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/
## k. Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/
## k. Maybe decrease 'memb.exp' ?

fviz_nbclust(s_joined[85:111], hcut, method = "silhouette") # 3

final_s2RAT_km <- kmeans(s_joined[85:111], 3)
fviz_cluster(final_s2RAT_km, data = s_joined[85:111])

s_joined %>% 
  mutate_if(is.factor, as.integer) %>% 
  mutate(Cluster = final_s2RAT_km$cluster) %>% 
  group_by(Cluster) %>% 
  summarise_all("mean", na.rm = TRUE)

## Warning in mean.default(Email, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## Warning in mean.default(Email, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## Warning in mean.default(Email, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## Warning in mean.default(age.y, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## Warning in mean.default(age.y, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## Warning in mean.default(age.y, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## Warning in mean.default(`community preference`, na.rm = TRUE): argument is
## not numeric or logical: returning NA

## Warning in mean.default(`community preference`, na.rm = TRUE): argument is
## not numeric or logical: returning NA

## Warning in mean.default(`community preference`, na.rm = TRUE): argument is
## not numeric or logical: returning NA

## # A tibble: 3 x 112
##   Cluster years_MD.x years_residency~ years_UC.x gender age.x
##     <int>      <dbl>            <dbl>      <dbl>  <dbl> <dbl>
## 1       1       10.3             6.27       1.73   1.45  2.09
## 2       2       21.5            17.5        3.17   1.17  4   
## 3       3       16.4            12.4        3.08   1     3.33
## # ... with 106 more variables: hrs_current_UC <dbl>, hrs_current_J <dbl>,
## #   hrs_current_WC <dbl>, hrs_ideal_UC <dbl>, hrs_ideal_J <dbl>,
## #   hrs_ideal_WC <dbl>, hrs_fair_UC <dbl>, hrs_fair_J <dbl>,
## #   hrs_fair_WC <dbl>, comm_pref <dbl>, `The current hours allocation
## #   process overall` <dbl>, `Transparency of the current process` <dbl>,
## #   `Fairness of the current process` <dbl>, `Clarity of the current
## #   process` <dbl>, `Equity of the current process` <dbl>, `Consistency
## #   with how procedures are applied` <dbl>, `Lack of bias` <dbl>, `Ability
## #   to base decisions off of accurate information` <dbl>, `Ability to
## #   uphold ethical and moral standards in the decision-making
## #   process` <dbl>, `Express views and feelings` <dbl>, `Influence
## #   outcome` <dbl>, `Appeal outcomes` <dbl>, EDU_student_ratings <dbl>,
## #   EDU_resident_ratings <dbl>, EDU_eval_cont <dbl>, EDU_GR_att <dbl>,
## #   EDU_JC_need <dbl>, SEN_years_MD <dbl>, SEN_years_UC <dbl>,
## #   PCQ_pat_satis <dbl>, PCQ_pat_ph <dbl>, PCQ_total_clin_hrs <dbl>,
## #   CT_team_fit <dbl>, TR_research_involvement <dbl>,
## #   IO_perceived_effort <dbl>, IO_location_fit <dbl>,
## #   total_hrs_factor <dbl>, Email <dbl>, years_MD.y <dbl>,
## #   years_residency.y <dbl>, years_UC.y <dbl>, age.y <dbl>,
## #   current_alloc_UC <dbl>, current_alloc_JH <dbl>,
## #   current_alloc_WC <dbl>, ideal_alloc_UC <dbl>, ideal_alloc_JH <dbl>,
## #   ideal_alloc_WC <dbl>, fair_alloc_UC <dbl>, fair_alloc_JH <dbl>,
## #   fair_alloc_WC <dbl>, `community preference` <dbl>, BIN.med_rank <dbl>,
## #   BIN.res_rank <dbl>, BIN.eval_contr <dbl>, BIN.GR_att <dbl>,
## #   BIN.att_need <dbl>, BIN.mentoring <dbl>, BIN.GR_lec <dbl>,
## #   BIN.years_MD <dbl>, BIN.years_UC <dbl>, BIN.age <dbl>, BIN.rank <dbl>,
## #   BIN.pt_sat <dbl>, BIN.pt_per_hr <dbl>, BIN.yrl_clin_hrs <dbl>,
## #   BIN.wRVUs <dbl>, BIN.n_charts <dbl>, BIN.team <dbl>,
## #   BIN.director <dbl>, `BIN.committee ` <dbl>, BIN.clin_trials <dbl>,
## #   BIN.productivity <dbl>, BIN.non_clin <dbl>, BIN.loc_study <dbl>,
## #   BIN.funding <dbl>, BIN.perc_effort <dbl>, BIN.loc_fit <dbl>,
## #   BIN.peer_rat <dbl>, RAT.med_rank <dbl>, RAT.res_rank <dbl>,
## #   RAT.eval_contr <dbl>, RAT.GR_att <dbl>, RAT.att_need <dbl>,
## #   RAT.mentoring <dbl>, RAT.GR_lec <dbl>, RAT.years_MD <dbl>,
## #   RAT.years_UC <dbl>, RAT.age <dbl>, RAT.rank <dbl>, RAT.pt_sat <dbl>,
## #   RAT.pt_per_hr <dbl>, RAT.yrl_clin_hrs <dbl>, RAT.wRVUs <dbl>,
## #   RAT.n_charts <dbl>, RAT.team <dbl>, RAT.director <dbl>, `RAT.committee
## #   ` <dbl>, RAT.clin_trials <dbl>, RAT.productivity <dbl>, ...

Regression and Tree Building

lmControl <- trainControl(method = "cv", number = 10, verboseIter = FALSE)
lm_just <- train(x = s1_justice[2:12], y = s1_justice[[1]], method = "glm", preProcess = c("medianImpute"))
lmj <- train(x = s1j[2:12], y = s1j[[1]], method = "glm", trControl = lmControl)
lmj1 <- train(x = s1j[2:12], y = s1j[[1]], method = "glmnet", trControl = lmControl)

plot(varImp(lm_just))

plot(varImp(lmj))

plot(varImp(lmj1))

Random Forests

rfj <- surv1[16:27]
s1j_rev <- surv1[16:27] %>% 
  mutate_all(fct_rev) %>% 
  mutate_all(as.integer)

just_rf <- train(`The current hours allocation process overall` ~ ., data = rfj, method = "ranger", tuneLength = 10, importance = 'permutation', respect.unordered.factors = 'order', na.action = na.omit)

just_rf1 <- train(`The current hours allocation process overall` ~ ., data = s1j_rev, method = "ranger", tuneLength = 10, importance = 'permutation', na.action = na.omit)

vip(just_rf, bar = FALSE)

plot(varImp(just_rf1))

sjoined <- inner_join(surv1[, -c(15, 28:42)], bind_cols(s2[14], sv2rat, s2[69]), by = "Email")
sjoined_clean <- na.omit(sjoined)
sjust_ord <- bind_cols(sjoined[15], sjoined[1:14], sjoined[28:55])
sjust_ord_clean <- na.omit(sjust_ord)
sj_current_hrs_only <- sjust_ord_clean[,-c(10:15)]
sj_uc_percent_fair <- sjust_ord_clean[, -c(17:43)] %>% 
  mutate(UC_percent_fair = hrs_current_UC/hrs_fair_UC)

rf1 <- train(`The current hours allocation process overall` ~ ., data = sjust_ord_clean, method = "ranger", tuneLength = 10, importance = 'permutation')
rf2 <- train(`The current hours allocation process overall` ~ ., data = sj_current_hrs_only, method = "ranger", tuneLength = 10, importance = 'permutation')
rf3 <- train(`The current hours allocation process overall` ~ ., data = sj_uc_percent_fair, method = "ranger", tuneLength = 10, importance = 'permutation', na.action = na.omit)

vip(rf1)

vip(rf2)

vip(rf3)

Summary

Data from two surveys given to physicians was analyzed. These surveys contained demographic information about the respondents, their satisfaction with the department, preferences for work hours, and the importance of different factors for assessing contributions.

Visualizing satisfaction response data across different groupings showed clear patterns: older and more senior faculty were more content with processes. Clustering of respondents by those responses indicated 2 to be the best number of clusters. This suggests that faculty can be segmented into two groups. One group is happy with departmental procedures and tends to rank various potential contribution metrics highly. The other group is much less satisfied, and does not place as much importance on factors by which to asses their work. This seems to indicate a potential distrust of any new system to assign hours to employees.

Linear and random forest modeling was utilized to examine the most important factors contributing to overall satisfaction with the current process used by the department. First, overall satisfatction was modeled against other measures of satisfaction to gain insight into what was most important to employees. Consistently, satisfaction with the fairness of the current process was shown to be the most important variable in determining overall satisfaction. When examining overall satisfaction against different variables, age, years practicing medicine, and the proportion of total hours currently spent working at the primary location were shown to be the most important variables.

This analysis shows the importance of looking beyond aggregated statistics. Large differences were found in satisfaction among different physicians. The sample size available was small, and for that reason any implications for prediction arising from modeling should be cautiously assessed, however, it is clear that there appears to be two major groups within the department. Management should be aware of this and be prepared for the possibility that efforts at improving scheduling might be seen not only as ineffective, but possibly as making things worse. This is not to suggest any futility in the effort, only that a new scheduling process may need to go through a number of iterations before having the desired effect.

BANA 8090 Final Project

Amir Babar