Introduction

A department at a university hospital has physician staffing obligations at two community hospitals. Physicians working in that department can therefore work at up to three different locations. Work location preferences vary by individual, but in general, working at the primary hospital is preferred. The process by which physicians’ overall hours were allocated among the locations led to some dissatisfaction, and a project was undertaken to improve this process.

In theory, given everyone’s preferences for how many hours to spend working at each location, hours could be assigned so as to minimize the discrepancy between hours preferred and hours actually assigned. There was a sentiment, however, that simply weighing everyone’s preferences equally would not be desirable. For example, someone with ten years of experience should get more of a say in where they work than a first-year faculty member. Physicians contributing in specific ways, or with needs at a certain location, should also have more of a say. The general idea is easy enough to understand, but which factors specifically determine whose preferences should be taken more strongly into consideration? And just how important are those various factors?
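To make the idea of weighted preferences concrete, here is a minimal sketch in base R. It assumes an absolute-difference measure of discrepancy and per-physician weights; the hours, the weights, and the function name weighted_discrepancy are all hypothetical and are not taken from the project itself.

```r
# Hypothetical preferred and assigned weekly hours
# (rows: physicians, columns: work locations).
preferred <- matrix(c(30,  5,  5,
                      40,  0,  0),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("dr_a", "dr_b"),
                                    c("primary", "site_1", "site_2")))
assigned <- matrix(c(20, 10, 10,
                     30, 10,  0),
                   nrow = 2, byrow = TRUE,
                   dimnames = dimnames(preferred))

# Hypothetical weights: a larger weight means that physician's
# preferences count for more (e.g. because of seniority).
weights <- c(dr_a = 2, dr_b = 1)

# Assumed measure: weighted sum of absolute differences between
# preferred and assigned hours, summed over locations per physician.
weighted_discrepancy <- function(preferred, assigned, weights) {
  per_physician <- rowSums(abs(preferred - assigned))
  sum(weights * per_physician)
}

total <- weighted_discrepancy(preferred, assigned, weights)
```

With equal weights, both physicians' unmet preferences would count the same. The surveys described next are aimed at learning which factors should drive such weights in the first place.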

To help answer these questions, a series of surveys was given to physicians. The first survey collected general demographic information about the physicians and asked a series of questions about how satisfied they were with aspects of the department, such as fairness and openness in decision-making processes. Physicians were also asked to rate the importance of various factors (obtained from preliminary discussions) in determining how much weight a physician’s scheduling preferences should carry. Additionally, open-ended response sections allowed physicians to give further feedback. Based on the responses from the first survey, a follow-up survey was later sent out, asking physicians, among other things, to rate the importance of an updated list of factors. Both surveys were administered through Qualtrics, and responses were exported as .csv files.

I am interested in underlying patterns revealed in the survey data. Satisfaction with departmental procedures varied among physicians. Can a number of groups be identified, with specific underlying characteristics? This could assist in better understanding the department. Certain groups may differ in what they find important, and understanding that could help leadership understand what to focus on. This could be of particular importance if insights are lost when looking at responses in aggregate.

I am also interested in seeing the relationships among different ratings given to different factors. This could help identify the key things that need to be taken into consideration by management. It could also assist in identifying appropriate measures for such factors.

Packages required

# Installs "pacman" package if not already installed
if (!require("pacman")) install.packages("pacman")

# "pacman" is then used to install/load other packages
pacman::p_load(tidyverse,
               data.table,
               anonymizer,
               DT,
               likert)

Data preparation

Figuring out how and what to import

As previously mentioned, the data were exported from the Qualtrics website as .csv files. Each file contains three rows at the top before the response values begin. The table below shows the contents of the first three rows. The table was transposed (columns indicate row numbers and contents) for easier viewing.

The first row of the file contains what appear to be good candidate column names. Row two may at first appear identical to row one, but looking at more entries shows that it contains valuable information about the actual questions on the survey. Row three does not appear to have any information worth keeping, though it did present a challenge when importing the data.

My approach to obtaining a data set to work with was as follows:

Importing the first row of data only, as a character vector for use in naming columns of the data set
Importing the second row of data only, for use in creating a data dictionary
Importing the remainder of the data by skipping the first three rows of data and assigning values from row one as the column names

The first two steps worked as expected (see code below).

# Import column names (fpath is assumed to hold the path to the exported survey 1 .csv file)
cnames_s1 <- fread(fpath, header = FALSE, nrows = 1) %>%
  as.character()

# Import information for the data dictionary 
info_s1 <- fread(fpath, header = FALSE, nrows = 1, skip = 1) %>%
  as.character()

Initial data reference

Extracting the column names and descriptions allowed for the creation of a data reference:

# Tibble to make dictionary
dict <- tibble(Item = cnames_s1, Description = info_s1)
# And the table itself
datatable(dict, rownames = FALSE, class = "display compact", options = list(dom = 'lrtip'))

Some precautionary measures

I know I will end up removing many of the columns in the data set to keep only numerical values, factors, and characters of interest. Free text responses, though incredibly important, are not the focus of the current analysis. To determine which columns to keep or remove, it makes sense to import the whole data set first. This is what I do, with some minor exceptions regarding identifying information:

two columns containing “name” information are disregarded from the start
a column containing the respondent’s e-mail address is anonymized. I don’t want to remove the column altogether because it may prove very useful later
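As a rough illustration of these two precautions, the sketch below drops name columns and replaces each e-mail address with an arbitrary ID. It is a base R and data.table stand-in, not the anonymizer-based code used for the real data; the column names and values are hypothetical.

```r
library(data.table)

# Hypothetical responses; column names are illustrative only.
raw <- data.table(
  RecipientLastName  = c("A", "B", "A"),
  RecipientFirstName = c("X", "Y", "X"),
  RecipientEmail     = c("x.a@example.org", "y.b@example.org", "x.a@example.org"),
  Q1                 = c(4L, 2L, 5L)
)

# Drop every column whose name ends in "Name".
name_cols <- grep("Name$", names(raw), value = TRUE)
safe <- raw[, setdiff(names(raw), name_cols), with = FALSE]

# Replace each address with an arbitrary ID. The same address always
# maps to the same ID, so the column can still serve as a join key.
key <- unique(safe$RecipientEmail)
safe[, RecipientEmail := sprintf("R%03d", match(RecipientEmail, key))]
```

One caveat with a lookup-based scheme like this: to join two surveys later, the key would have to be built from the addresses of both surveys together, whereas a seeded hash gives consistent IDs without that step.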

Two steps in obtaining the data necessary to create a data set are shown below:

The characters present in the third row prevented the “skip = n” argument from working when importing the data, for any number n. That is to say, importing the data with “skip = 3” (or 4 or 5) did not remove the third row. I was able to get around this by setting “skip = ‘2017’” (as seen above). This option in the fread function searches for a given character string and begins importing data from the first row in which the string is found.
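The string-based skip can be tried on a small mock file. The sketch below assumes a Qualtrics-style layout with three header rows and assumes that every response row contains a 2017 timestamp; the file contents and the names tmp, cnames, and demo are made up for illustration.

```r
library(data.table)

# Hypothetical miniature export: names, question text, import metadata, data.
lines <- c(
  'StartDate,RecipientEmail,Q1',
  'Start Date,Recipient Email,How satisfied are you?',
  '"{""ImportId"":""startDate""}","{""ImportId"":""email""}","{""ImportId"":""QID1""}"',
  '2017-11-01 09:00:00,a@example.org,4',
  '2017-11-02 10:30:00,b@example.org,2'
)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# Row one only, as a character vector of column names.
cnames <- as.character(fread(tmp, header = FALSE, nrows = 1))

# Start reading at the first line that contains "2017" and
# attach the column names taken from row one.
demo <- fread(tmp, header = FALSE, skip = "2017", col.names = cnames)
```

This only works if the search string cannot appear in any of the header rows, so it is worth checking the first few lines of the file before relying on it.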

Now that I’ve read the data without any confidential information, I’m going to focus on finding the rest of the information I can discard before proceeding further. My selection decisions are as follows:

I don’t need anything from “StartDate” to “UserLanguage”, but I do want to keep RecipientEmail.
A number of questions are set up such that where there is a variable QX#1_Y_Z, there is a corresponding free text field named QX#2_Y_Z. Thus, question (column) names containing “#2_” can be removed.
Further inspection of the dictionary allows me to identify a few more free text portions to remove before proceeding further.

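# Selecting the columns I want based on the previously stated rationale
s1 <- select(survey_1,
             -StartDate:-UserLanguage,  # drop the metadata block...
             RecipientEmail,            # ...but keep the anonymized e-mail address
             -matches("#2_"),           # drop the free text companions of QX#1_Y_Z items
             -Q26:-Q25,                 # drop further free text items found in the dictionary
             -Q10,
             -Q30_5_TEXT:-Q11)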

# For my new and improved data dictionary
index <- tibble(Item = colnames(s1))
# Selecting the elements from the first data dictionary to keep
new_dict <- semi_join(dict, index, by = "Item")
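The selection and dictionary-filtering pattern can be checked on a toy example. The sketch below uses hypothetical column names that only mimic the naming pattern of the export, and it writes the range exclusion in the parenthesized form -(StartDate:UserLanguage).

```r
library(dplyr)

# Hypothetical one-row survey; names imitate the export's pattern only.
survey <- tibble(
  StartDate      = "2017-11-01",
  RecipientEmail = "R001",
  UserLanguage   = "EN",
  `Q5#1_1_1`     = 4L,
  `Q5#2_1_1`     = "free text",
  Q10            = "more free text"
)
dict_demo <- tibble(
  Item        = names(survey),
  Description = c("Start Date", "Recipient Email", "User Language",
                  "Rating", "Comment on rating", "Other comments")
)

# Drop the metadata block, add the e-mail column back,
# then drop the free text columns.
kept <- select(survey,
               -(StartDate:UserLanguage), RecipientEmail,
               -matches("#2_"), -Q10)

# Keep only the dictionary rows that describe surviving columns.
dict_kept <- semi_join(dict_demo, tibble(Item = names(kept)), by = "Item")
```

Because the e-mail column is excluded with its range and then added back, it may not keep its original position, so it is safer to refer to it by name than by index afterwards.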

Future plans

There is much I intend to do from this point. There is a second survey whose data I wish to import and process. In both surveys, e-mail addresses are present as unique identifiers for respondents. Joining the data and looking at relationships between responses in the first and second surveys is of particular interest, although examining groupings within the surveys themselves may reveal equally exciting insights. Before getting to any of that, showing basic information about the data through plots will be of interest. There are also missing values that must be dealt with. I did not simply remove them at this point, as I believe responses in other parts of the survey(s) will allow for some level of imputation. Once that has been taken care of, I plan to look at different forms of clustering analysis. I am particularly interested in latent class analysis and whether it can be meaningfully applied here.

Analysis of Survey Data

Amir Babar

November 10, 2018