Loading in data

This portion of the document focuses on loading the three data sets and combining them into a master table.

# Load the packages used throughout this analysis
library(readxl)
library(dplyr)
library(ggplot2)

# Read in Visits, Clients, and Visits_attendees
visits <- read_xlsx("Technical Assessment - Dataset.xlsx", 
                    sheet = "visits")
clients <- read_xlsx("Technical Assessment - Dataset.xlsx", 
                     sheet = "clients")
visits_attendees <- read_xlsx("Technical Assessment - Dataset.xlsx", 
                              sheet = "visits_attendees")

# Set up appropriate factor variables
visits$visit_scheduled_type <- as.factor(visits$visit_scheduled_type)
clients$client_location <- as.factor(clients$client_location)

# Combine the 3 data sets into a master table
full_data <- inner_join(clients, visits_attendees, by = "client_id") %>% 
  inner_join(visits, by = "visit_id")

# View summary of data
glimpse(full_data)
## Rows: 955
## Columns: 12
## $ client_id                          <chr> "6705e40f-1c8c-4263-a804-daf3e200b1…
## $ client_location                    <fct> USA, USA, USA, USA, USA, USA, USA, …
## $ client_referral_source             <chr> "Google / Search engine", "Google /…
## $ client_gender                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ client_gender_other                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ visit_id                           <chr> "08cbdbcf-3199-46e4-b534-1c8047db28…
## $ provider_id                        <chr> "8471ae04-eecd-44a9-b2f3-60a2f0c746…
## $ visit_scheduled_type               <fct> therapy_individual, therapy_individ…
## $ visit_scheduled_title              <chr> "ZZZEmily ZZZHelander - Individual …
## $ visit_scheduled_start_datetime_est <dttm> 2021-11-04 21:00:00, 2022-01-18 18…
## $ visit_scheduled_end_datetime_est   <dttm> 2021-11-04 21:45:00, 2022-01-18 18…
## $ visit_duration                     <dbl> 45, 45, NA, NA, 45, NA, NA, NA, 49,…

1. What does a “typical” client look like?

a. Client Location

Looking only at our registered clients table, the typical client comes from California (73.4%), followed by New York (12%). The remainder consists of missing values (N/A or NA) and three vague "USA" entries, which will require some data cleaning. NA values might be acceptable if the user chose not to provide their location; we should investigate whether any were left blank in error due to poor UI/UX.

# Registered Clients by Location
clients %>%
  group_by(client_location) %>%
  summarise(count = n() ) %>%
  mutate(prop = count / sum(count)) %>%
  arrange(desc(prop))
## # A tibble: 5 × 3
##   client_location count    prop
##   <fct>           <int>   <dbl>
## 1 California        226 0.734  
## 2 New York           37 0.120  
## 3 <NA>               24 0.0779 
## 4 N/A                18 0.0584 
## 5 USA                 3 0.00974
qplot(data=clients, x = client_location, main = 'Registered Clients by Location')
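
As a first pass at that cleaning, one option is to treat the literal "N/A" string as a true missing value and flag the vague "USA" entries for follow-up rather than guess a state. A minimal sketch (the `clients_clean` name and review flag are illustrative):

# Recode the literal "N/A" string to NA and flag the three vague "USA"
# entries for manual review (object and column names are illustrative)
clients_clean <- clients %>%
  mutate(client_location = na_if(as.character(client_location), "N/A"),
         location_needs_review = client_location %in% "USA",
         client_location = as.factor(client_location))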

The volume of visits by location is slightly different: California generates 71.7% of visit volume, the three clients with a location of "USA" generate 14.7%, and New York generates 9.8%.

full_data %>%
  group_by(client_location) %>%
  summarise(count = n() ) %>%
  mutate(prop = count / sum(count)) %>%
  arrange(desc(prop))
## # A tibble: 4 × 3
##   client_location count   prop
##   <fct>           <int>  <dbl>
## 1 California        685 0.717 
## 2 USA               140 0.147 
## 3 New York           94 0.0984
## 4 <NA>               36 0.0377
qplot(data=full_data, x = client_location, main = 'Visit Volume by Client Location')

b. Client Gender

When looking at registered clients by gender, we see that 57.1% of clients omit their gender. The next largest category, "Prefer not to disclose", accounts for 11.4%, followed by Woman (10.4%) and Trans Woman (9.4%).

Overall, our insight into our clients' gender profile is very incomplete, either because the data is missing or because clients chose not to disclose that information.

clients %>%
  group_by(client_gender) %>%
  summarise(count = n() ) %>%
  mutate(prop = count / sum(count)) %>%
  arrange(desc(prop))
## # A tibble: 11 × 3
##    client_gender                  count    prop
##    <chr>                          <int>   <dbl>
##  1 <NA>                             176 0.571  
##  2 Prefer not to disclose            35 0.114  
##  3 Woman                             32 0.104  
##  4 Trans Woman                       29 0.0942 
##  5 Man                               17 0.0552 
##  6 Genderqueer/Non-binary             9 0.0292 
##  7 Intersex                           5 0.0162 
##  8 I prefer to self-describe [as]     2 0.00649
##  9 Genderqueer                        1 0.00325
## 10 Nonbinary                          1 0.00325
## 11 Trans Man                          1 0.00325
clients %>%
  group_by(client_gender) %>%
  summarise(count = n() ) %>%
  arrange(desc(count)) %>%
  ggplot(aes(x=client_gender, y=count)) + 
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Registered Clients by Gender")

Now let's look at scheduled visits by gender. We see a somewhat similar picture: the top two categories are NA and "Prefer not to disclose", which together make up about 72.6% of visits. The third-highest category is "Nonbinary" at 16.8%.

full_data %>%
  group_by(client_gender) %>%
  summarise(count = n() ) %>%
  mutate(prop = count / sum(count)) %>%
  arrange(desc(prop))
## # A tibble: 6 × 3
##   client_gender          count    prop
##   <chr>                  <int>   <dbl>
## 1 <NA>                     359 0.376  
## 2 Prefer not to disclose   334 0.350  
## 3 Nonbinary                160 0.168  
## 4 Woman                     77 0.0806 
## 5 Man                       24 0.0251 
## 6 Genderqueer/Non-binary     1 0.00105
full_data %>%
  group_by(client_gender) %>%
  summarise(count = n() ) %>%
  arrange(desc(count)) %>%
  ggplot(aes(x=client_gender, y=count)) + 
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title='Visit Volume by Client Gender')


2. Which referral sources should the marketing team invest in?

The largest single referral source is "Psychology Today" at 28.6%. The next two highest categories, NA and "Other", do not provide us with much information. Word of mouth ("Through friends or family") is the fourth highest at 14%, followed by "Google / Search engine" at 12.3%.

clients %>%
  group_by(client_referral_source) %>%
  summarise(count = n() ) %>%
  mutate(prop = count / sum(count)) %>%
  arrange(desc(prop))
## # A tibble: 17 × 3
##    client_referral_source        count    prop
##    <chr>                         <int>   <dbl>
##  1 Psychology Today                 88 0.286  
##  2 <NA>                             56 0.182  
##  3 Other                            43 0.140  
##  4 Through friends or family        43 0.140  
##  5 Google / Search engine           38 0.123  
##  6 Facebook / Instagram              6 0.0195 
##  7 I was referred by One Medical     6 0.0195 
##  8 Yelp                              6 0.0195 
##  9 I was referred by my doctor       5 0.0162 
## 10 Insurance directory               5 0.0162 
## 11 My employer                       4 0.0130 
## 12 Through Friends or Family         3 0.00974
## 13 Google / Search Engine            1 0.00325
## 14 Insurance Directory               1 0.00325
## 15 Other: ALIFORNEY EMPLOYEE         1 0.00325
## 16 Other: s                          1 0.00325
## 17 University of California          1 0.00325
clients %>%
  group_by(client_referral_source) %>%
  summarise(count = n() ) %>%
  arrange(desc(count)) %>%
  ggplot(aes(x=client_referral_source, y=count)) + 
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Registered Clients by Referral Source")
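
The table above also contains duplicate spellings of the same source (e.g. "Through Friends or Family" vs. "Through friends or family", "Google / Search Engine" vs. "Google / Search engine"). A minimal sketch of consolidating these before reporting (the `clients_referral_clean` name is illustrative):

# Collapse duplicate spellings of the same referral source (casing differences
# only); the cleaned object name is illustrative
clients_referral_clean <- clients %>%
  mutate(client_referral_source = recode(client_referral_source,
    "Through Friends or Family" = "Through friends or family",
    "Google / Search Engine"    = "Google / Search engine",
    "Insurance Directory"       = "Insurance directory"))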

Note that not every one of these referral sources has generated an actual visit. "Visit Volume by Referral Source" better shows which referral sources are generating the largest number of actual visits.

full_data %>%
  group_by(client_referral_source) %>%
  summarise(count = n() ) %>%
  mutate(prop = count / sum(count)) %>%
  arrange(desc(prop))
## # A tibble: 6 × 3
##   client_referral_source    count   prop
##   <chr>                     <int>  <dbl>
## 1 Psychology Today            350 0.366 
## 2 Through friends or family   219 0.229 
## 3 Google / Search engine      171 0.179 
## 4 <NA>                        141 0.148 
## 5 Other                        61 0.0639
## 6 Other: s                     13 0.0136
full_data %>%
  group_by(client_referral_source) %>%
  summarise(count = n() ) %>%
  arrange(desc(count)) %>%
  ggplot(aes(x=client_referral_source, y=count)) + 
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Visit Volume by Referral Source")

One important metric to look at is Cost Per Acquisition (CPA). If Finance can provide the amount of money spent on each of our marketing sources, along with marketing (and/or sales) team salaries, we can determine how profitable each referral source is. Unfortunately, we do not have that data in this exercise, so we cannot answer this question. If we looked strictly at volume, we would blindly assume that we should spend more money on marketing via "Psychology Today"; however, we do not know whether those marketing dollars are being used efficiently.
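
If that spend data were available, the calculation itself is straightforward. A sketch, assuming a hypothetical `marketing_spend` table with columns `client_referral_source` and `spend`:

# Hypothetical sketch -- `marketing_spend` (client_referral_source, spend)
# is not available in this exercise
cpa_by_source <- clients %>%
  count(client_referral_source, name = "new_clients") %>%
  inner_join(marketing_spend, by = "client_referral_source") %>%
  mutate(cost_per_acquisition = spend / new_clients) %>%
  arrange(cost_per_acquisition)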


3. How many visits do you predict for next month?

We first note that our oldest scheduled visit date is 5/2/2021 and our most recent one is 4/7/2022. This means we can safely assume that April 2022 will end with more than the 8 visits currently recorded, since we do not yet have an entire month's worth of data.

min(full_data$visit_scheduled_start_datetime_est)
## [1] "2021-05-02 22:00:00 UTC"
max(full_data$visit_scheduled_start_datetime_est)
## [1] "2022-04-07 21:30:00 UTC"

Typically, we forecast future monthly volume by first looking at YOY growth over the past several years. We take the average YOY growth and apply it to our most recent year's worth of volume to estimate next year's total. From there, we calculate the average volume for each of the 12 calendar months, convert each monthly average into a ratio of the sum of the averages, and multiply each ratio by the estimated total to distribute it across the months of the following year.
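
For reference, a sketch of that approach on a hypothetical multi-year table `monthly_visits` (columns `year`, `month`, `n`), which we do not have in this exercise:

# Illustrative only -- `monthly_visits` is hypothetical multi-year history
yearly <- monthly_visits %>%
  group_by(year) %>%
  summarise(total = sum(n)) %>%
  arrange(year) %>%
  mutate(yoy_growth = total / lag(total) - 1)

# Apply the average YOY growth to the most recent year's total
next_year_total <- last(yearly$total) * (1 + mean(yearly$yoy_growth, na.rm = TRUE))

# Average volume per calendar month, expressed as a share of the yearly sum
seasonal <- monthly_visits %>%
  group_by(month) %>%
  summarise(avg_n = mean(n)) %>%
  mutate(ratio = avg_n / sum(avg_n))

# Distribute the estimated total across months using the seasonal ratios
forecast <- seasonal %>%
  mutate(forecast_n = next_year_total * ratio)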

Unfortunately, we cannot do that in this case since we only have a single year's worth of data. So we have to use a combination of one year's worth of seasonal trend, any current world events, and business intelligence (e.g. insight into new contracts that the company might be in the process of signing).

There is a downward trend from 2022-02 to 2022-03. Would it be safe to assume that this downward trend will continue into 2022-04 (currently incomplete) and the following month, 2022-05? Based on current world events, this may very well be the case, since many individuals are cutting costs, including in areas such as medical care and prescription drugs. Russia invaded Ukraine in 2022-02, which had negative consequences worldwide and might explain the large drop after that month.

From a seasonality point of view, the past year's visit volume bottoms out from 2021-06 through 2021-10, slowly increases during the fall/winter months, peaks in February, and then begins to decline again.

(visits_per_yearmo <- 
  full_data %>%
  mutate(year_month_visit_scheduled = format(visit_scheduled_start_datetime_est, 
                                             "%Y-%m")) %>%
  count(year_month_visit_scheduled) %>%
  arrange(year_month_visit_scheduled))
## # A tibble: 12 × 2
##    year_month_visit_scheduled     n
##    <chr>                      <int>
##  1 2021-05                       45
##  2 2021-06                        8
##  3 2021-07                        2
##  4 2021-08                        3
##  5 2021-09                       10
##  6 2021-10                       11
##  7 2021-11                       37
##  8 2021-12                       92
##  9 2022-01                      260
## 10 2022-02                      300
## 11 2022-03                      179
## 12 2022-04                        8
ggplot(aes(x=year_month_visit_scheduled, y=n, group=1), data=visits_per_yearmo) +
  geom_line() +
  ggtitle("Number of Visits by YearMo Scheduled")

I would safely assume that visit volume will decrease in May 2022 due to world events and seasonality trends; the count could be lower than the 45 visits we had a year ago. Note that we only have one year's worth of data. If we learn from our internal partners that we have signed a new contract (e.g. Aetna) or significantly increased our marketing spend, we can raise our estimate by a fixed percentage; that percentage can be determined in partnership with our internal stakeholders.
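
As a rough placeholder under those assumptions, we can anchor on the same month last year and apply an adjustment factor agreed with stakeholders (the 0.9 below is purely illustrative):

# Rough placeholder estimate for May 2022; the adjustment factor is an
# assumption to be agreed with internal stakeholders, not a fitted value
may_2021_visits <- visits_per_yearmo %>%
  filter(year_month_visit_scheduled == "2021-05") %>%
  pull(n)

assumed_adjustment <- 0.9  # illustrative: slightly below last May
round(may_2021_visits * assumed_adjustment)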


4. Which provider should we assign our next client to?

The first step would be to determine if the next client is a returning client. Does the client already have a preference for a provider? If they do, then it makes sense to keep them with their existing provider.
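
A quick check for this, assuming `new_client_id` holds the incoming client's id (a hypothetical placeholder):

# Look up the incoming client's previous providers, most recent first;
# `new_client_id` is a hypothetical placeholder
previous_providers <- full_data %>%
  filter(client_id == new_client_id) %>%
  arrange(desc(visit_scheduled_start_datetime_est)) %>%
  distinct(provider_id, .keep_all = TRUE) %>%
  select(provider_id, last_visit = visit_scheduled_start_datetime_est)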

If they do not have a preference, we would then need a master provider table listing each therapist's specialty, which we do not have for this assignment. Depending on the reason for the visit, we could then determine the subset of therapists who are eligible to see the patient.

Next, we can create a measure of 'best fit' among this subset of providers. Which providers have the most availability on their calendar (i.e. the percentage of open 15-minute slots that overlap with the client's availability)? Most EHR systems allow you to query a provider's open time slots for a specific date range.

Lastly, we can look at how long it has been since each provider last saw a patient (i.e. give higher priority to a provider who hasn't seen a patient in a while). We want to make sure we are distributing patient volume appropriately among our employees. This iterative process would allow us to build an automated ranking system for our scheduling team.
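
The recency piece can be computed from the data we already have; a sketch (specialty and calendar availability would have to be joined in from data we do not have here):

# Days since each provider's most recent scheduled visit -- one input to the
# ranking described above
provider_recency <- full_data %>%
  group_by(provider_id) %>%
  summarise(last_visit = max(visit_scheduled_start_datetime_est, na.rm = TRUE)) %>%
  mutate(days_since_last_visit = 
           as.numeric(difftime(Sys.time(), last_visit, units = "days"))) %>%
  arrange(desc(days_since_last_visit))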


5. What additional data or information would be helpful for you to provide deeper analysis on any of the questions above?

As previously mentioned, useful additional data sets would include marketing spend and sales (i.e. broker) spend, in order to calculate Cost per Acquisition.

A provider table detailing each provider's specialties, along with EHR/scheduling data, would help us create the automated ranking system for provider allocation described above.

To better understand our client demographics, we can encourage our patients to provide additional personal demographic data and assure them that this will help us with our provider matching process. A big concern for many is that providers will sell their personal information. We need to make them feel safe and reiterate our company values so that they don’t omit crucial information from us.

Lastly, we should partner with our UI/UX team to determine whether the NAs in our data are due to user error and whether there are ways to enhance users' experience when entering their personal information.