Let’s pretend we are helping our customer, a pharmaceutical company, optimize their sales and marketing activity.
Our customer sells Product A, and it competes against Product B and Product C
Our customer has a team of sales representatives that go to visit doctors on a regular basis and promote Product A.
In 2017 our customer developed a new sales message, but only gave it to half of their sales representatives (those who cover doctors in group #2). They told those sales representatives to use the new message all year, while the sales representatives who cover doctors in group #1 continued to use the same message as in the previous year"
1 - Explain any trends you find in the data and think are interesting
2 - Try to answer the question "“Was the new message (that was used on group #2) a successful message?”
Loading the required packages
library(readr)
library(reshape)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(plotly)
library(tidyverse)
List_of_doctors <- read_csv("case study - List of doctors.csv")
After loading into R, List of doctors dataframe has 9 columns with last 5 columns being NULL. Hence deleting last 5 columns
List_of_doctors <- List_of_doctors[,1:4]
The data look something like this
head(List_of_doctors,5)
## # A tibble: 5 x 4
## `Doctor ID` Role Group `Place of work`
## <dbl> <chr> <chr> <chr>
## 1 101 Specialist Group 1 Group practice
## 2 102 Specialist Group 1 Individual practice
## 3 103 Primary Care Group 2 Hospital
## 4 104 Specialist Group 2 Group practice
## 5 105 Primary Care Group 1 Individual practice
str(List_of_doctors)
## Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of 4 variables:
## $ Doctor ID : num 101 102 103 104 105 106 107 108 109 110 ...
## $ Role : chr "Specialist" "Specialist" "Primary Care" "Specialist" ...
## $ Group : chr "Group 1" "Group 1" "Group 2" "Group 2" ...
## $ Place of work: chr "Group practice" "Individual practice" "Hospital" "Group practice" ...
summary(List_of_doctors)
## Doctor ID Role Group Place of work
## Min. :101.0 Length:48 Length:48 Length:48
## 1st Qu.:112.8 Class :character Class :character Class :character
## Median :124.5 Mode :character Mode :character Mode :character
## Mean :124.5
## 3rd Qu.:136.2
## Max. :148.0
List_of_doctors$Role <- as.factor(List_of_doctors$Role)
List_of_doctors$Group <- as.factor(List_of_doctors$Group)
List_of_doctors$`Place of work` <- as.factor(List_of_doctors$`Place of work`)
List_of_doctors$`Doctor ID` <- as.factor(List_of_doctors$`Doctor ID`)
Checking the summary again after converting to factors
summary(List_of_doctors)
## Doctor ID Role Group Place of work
## 101 : 1 Primary Care:29 Group 1:22 Group practice :16
## 102 : 1 Specialist :19 Group 2:26 Hospital :16
## 103 : 1 Individual practice:16
## 104 : 1
## 105 : 1
## 106 : 1
## (Other):42
Doctors by role
Primary Care 29 60%
Specialist 19 40%
It is usual that PCP’s are more than specialist doctors.
List_of_doctors %>% group_by(Group) %>% summarise(Doctors=n()) %>% mutate('% per'=Doctors/sum(Doctors) *100)
## # A tibble: 2 x 3
## Group Doctors `% per`
## <fct> <int> <dbl>
## 1 Group 1 22 45.8
## 2 Group 2 26 54.2
new message is tried on more than half of the doctors.
List_of_doctors %>% group_by(Role,Group) %>% summarise(Doctors=n()) %>% mutate('% per'=Doctors/sum(Doctors) *100)
## # A tibble: 4 x 4
## # Groups: Role [2]
## Role Group Doctors `% per`
## <fct> <fct> <int> <dbl>
## 1 Primary Care Group 1 10 34.5
## 2 Primary Care Group 2 19 65.5
## 3 Specialist Group 1 12 63.2
## 4 Specialist Group 2 7 36.8
Prescription_volumes <- read_csv("case study - Prescription volumes.csv")
Looking at the summary of data
summary(Prescription_volumes)
## Doctor ID Product Month nRx
## Min. :101.0 Length:3459 Length:3459 Min. :-20.00
## 1st Qu.:113.0 Class :character Class :character 1st Qu.: 13.00
## Median :125.0 Mode :character Mode :character Median : 20.00
## Mean :124.5 Mean : 21.27
## 3rd Qu.:137.0 3rd Qu.: 29.00
## Max. :148.0 Max. :141.00
## NA's :6
Converting the charecter variables to factors and checking the summary of dataset
Prescription_volumes$Product <- as.factor(Prescription_volumes$Product)
Prescription_volumes$Month <- as.factor(Prescription_volumes$Month)
Prescription_volumes$`Doctor ID` <- as.factor(Prescription_volumes$`Doctor ID`)
summary(Prescription_volumes)
## Doctor ID Product Month nRx
## 147 : 74 Brand A:1151 2016-09: 145 Min. :-20.00
## 115 : 73 Brand B:1154 2017-07: 145 1st Qu.: 13.00
## 140 : 73 Brand C:1154 2017-12: 145 Median : 20.00
## 101 : 72 2016-01: 144 Mean : 21.27
## 102 : 72 2016-02: 144 3rd Qu.: 29.00
## 103 : 72 2016-03: 144 Max. :141.00
## (Other):3023 (Other):2592 NA's :6
Are the records repeated in Doctor, product and month combinations?
Prescription_volumes <- group_by(Prescription_volumes,Product,`Doctor ID`, Month) %>% mutate(s=1:n())
Prescription_volumes %>% filter(s >1)
## # A tibble: 4 x 5
## # Groups: Product, Doctor ID, Month [3]
## `Doctor ID` Product Month nRx s
## <fct> <fct> <fct> <dbl> <int>
## 1 115 Brand C 2017-12 16 2
## 2 140 Brand C 2016-09 2 2
## 3 147 Brand B 2017-07 7 2
## 4 147 Brand B 2017-07 7 3
Deleting duplicate records
Prescription_volumes <- Prescription_volumes[! (Prescription_volumes$s > 1) , ]
Prescription_volumes$s <- NULL
qplot(Prescription_volumes$nRx,
geom="histogram",
binwidth = 5,
main = "Distribution of NRx",
xlab = "NRx", fill=I("blue"), col=I("blue"),
alpha=I(.2))
## Warning: Removed 6 rows containing non-finite values (stat_bin).
One reason for negative values in NRx could be restatements i.e scripts got rejected after the entire claim life cycle and was adjusted in database as negative values.
For this demonstration i am removing these records
Deleting negative NRx records
Prescription_volumes <- subset(Prescription_volumes, nRx >= 0 | is.na(nRx))
summary(Prescription_volumes)
## Doctor ID Product Month nRx
## 101 : 72 Brand A:1149 2016-02: 144 Min. : 0.00
## 102 : 72 Brand B:1150 2016-03: 144 1st Qu.: 13.00
## 103 : 72 Brand C:1151 2016-04: 144 Median : 20.00
## 104 : 72 2016-05: 144 Mean : 21.33
## 105 : 72 2016-06: 144 3rd Qu.: 29.00
## 106 : 72 2016-07: 144 Max. :141.00
## (Other):3018 (Other):2586 NA's :6
Dealing with NA’s (missing values) in NRx.
For every doctor, we will calculate the median of NRx by product
And fill the missing values with median NRx for that product by same doctor.
NRx_by_Doc_and_Prod <- Prescription_volumes %>% group_by(`Doctor ID`, Product) %>%
summarise(median = median(nRx, na.rm = TRUE))
head(NRx_by_Doc_and_Prod,6)
## # A tibble: 6 x 3
## # Groups: Doctor ID [2]
## `Doctor ID` Product median
## <fct> <fct> <dbl>
## 1 101 Brand A 20
## 2 101 Brand B 19
## 3 101 Brand C 21
## 4 102 Brand A 18
## 5 102 Brand B 17
## 6 102 Brand C 20.5
NRx_NA <- which(is.na(Prescription_volumes$nRx))
head(NRx_NA,5)
## [1] 1139 2817 3039 3206 3279
For loop to replace each missing value in NRx with Median NRx for that product by same doctor
for(i in NRx_NA){
Doc_ID_NA <- Prescription_volumes$`Doctor ID`[i]
Product_NA <- Prescription_volumes$Product[i]
NRx_median <- NRx_by_Doc_and_Prod[(NRx_by_Doc_and_Prod$`Doctor ID`==Doc_ID_NA[1]) & (NRx_by_Doc_and_Prod$Product == Product_NA[1]),'median']
Prescription_volumes[i,'nRx'] = NRx_median[[1]]
}
There are two NRx values which are more than 100, which seems to be odd,we could replace them with median as the same procedure above
## [1] 3119 3404
Now that the dataset is clean. Checking the summary of dataset
summary(Prescription_volumes)
## Doctor ID Product Month nRx
## 101 : 72 Brand A:1149 2016-02: 144 Min. : 0.00
## 102 : 72 Brand B:1150 2016-03: 144 1st Qu.:13.00
## 103 : 72 Brand C:1151 2016-04: 144 Median :20.00
## 104 : 72 2016-05: 144 Mean :21.26
## 105 : 72 2016-06: 144 3rd Qu.:29.00
## 106 : 72 2016-07: 144 Max. :46.00
## (Other):3018 (Other):2586
Let us join the datasets to get Master Dataset
PV_Master<-merge(x=Prescription_volumes,y=List_of_doctors,by="Doctor ID",all.x=TRUE)
summary(PV_Master)
## Doctor ID Product Month nRx
## 101 : 72 Brand A:1149 2016-02: 144 Min. : 0.00
## 102 : 72 Brand B:1150 2016-03: 144 1st Qu.:13.00
## 103 : 72 Brand C:1151 2016-04: 144 Median :20.00
## 104 : 72 2016-05: 144 Mean :21.26
## 105 : 72 2016-06: 144 3rd Qu.:29.00
## 106 : 72 2016-07: 144 Max. :46.00
## (Other):3018 (Other):2586
## Role Group Place of work
## Primary Care:2084 Group 1:1582 Group practice :1152
## Specialist :1366 Group 2:1868 Hospital :1146
## Individual practice:1152
##
##
##
##
Looking at NRx Sales Overtime across products
NRx_by_Product_Mon <- PV_Master %>% group_by(Product,Month) %>% summarise(NRx_Total =sum(nRx),Docs =n_distinct(`Doctor ID`), NRx_Depth = sum(nRx)/n_distinct(`Doctor ID`))
P_NRx_by_Prod_Mon <- ggplot(data=NRx_by_Product_Mon, aes(x=Month, y=NRx_Total, group=Product)) +
geom_line(aes(color=Product))+ geom_point(aes(color=Product)) + ylim(0, 1500)
P_NRx_by_Prod_Mon
We already saw this in workbook, and can see here that Brand A and and Brand B are performing well overtime where as Brand C isn’t(did it went LOE)? or the increase in sales among Brand A and B have heavily affected the sales of Brand C in the year 2017
adding year column to find year over year growth in NRx sales
PV_Master <- PV_Master %>% mutate(year = substr(PV_Master$Month,1,4))
PV_Master$year <- as.factor(PV_Master$year)
YOY_NRx_Vol_by_Prodcuct <- PV_Master %>% group_by(Product,year) %>% summarise(NRx_Total =sum(nRx))
YOY_NRx_Vol_by_Prodcuct <- YOY_NRx_Vol_by_Prodcuct %>% spread(year,NRx_Total, convert = FALSE) %>% mutate(vol_change = `2017`-`2016`, per_change=((`2017`-`2016`)*100)/`2016`)
YOY_NRx_Vol_by_Prodcuct
## # A tibble: 3 x 5
## # Groups: Product [3]
## Product `2016` `2017` vol_change per_change
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Brand A 8647 14043 5396 62.4
## 2 Brand B 10387 16249 5862 56.4
## 3 Brand C 12928 11103 -1825 -14.1
Bullet chart to visualise change in volume
P_YOY_NRx_by_Prod
Green Bar is 2017 NRx Sales, Black Line is 2016 NRx Sales
YOY_NRx_Depth_by_Product_Role <- PV_Master %>% group_by(Product,year, Role) %>% summarise(NRx_Total =sum(nRx),Docs =n_distinct(`Doctor ID`), NRx_Depth = sum(nRx)/n_distinct(`Doctor ID`))
YOY_NRx_Depth_by_Product_Role
## # A tibble: 12 x 6
## # Groups: Product, year [6]
## Product year Role NRx_Total Docs NRx_Depth
## <fct> <fct> <fct> <dbl> <int> <dbl>
## 1 Brand A 2016 Primary Care 5015 29 173.
## 2 Brand A 2016 Specialist 3632 19 191.
## 3 Brand A 2017 Primary Care 8924 29 308.
## 4 Brand A 2017 Specialist 5119 19 269.
## 5 Brand B 2016 Primary Care 6339 29 219.
## 6 Brand B 2016 Specialist 4048 19 213.
## 7 Brand B 2017 Primary Care 10539 29 363.
## 8 Brand B 2017 Specialist 5710 19 301.
## 9 Brand C 2016 Primary Care 8268 29 285.
## 10 Brand C 2016 Specialist 4660 19 245.
## 11 Brand C 2017 Primary Care 6984 29 241.
## 12 Brand C 2017 Specialist 4119 19 217.
P_YOY_NRx_Depth_by_Product_Role <- ggplot(data=YOY_NRx_Depth_by_Product_Role,
aes(x=factor(year), y=NRx_Depth,
group=Role, shape=Role, color=Role)) +
geom_line() +
geom_point() +
scale_x_discrete("Year") +
scale_y_continuous("NRx Depth (Productivity)") +
facet_grid(.~Product )
P_YOY_NRx_Depth_by_Product_Role
Prod_A_PV <- PV_Master %>% filter(Product=="Brand A")
Prod_A_PV_by_Grp_Year <- Prod_A_PV%>% group_by(Group,year) %>% summarise(nRx=sum(nRx))
Prod_A_PV_by_Grp_Year %>% spread(year, nRx, convert = FALSE) %>% mutate(vol_change =`2017`-`2016` ,Increase_percentage=((`2017`-`2016`)*100)/`2016`)
## # A tibble: 2 x 5
## # Groups: Group [2]
## Group `2016` `2017` vol_change Increase_percentage
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Group 1 4198 6066 1868 44.5
## 2 Group 2 4449 7977 3528 79.3
Group 1 sales increased by 44% from previous year
Group 2 sales increased by ~80% from previous year
45% of the doctors belongs to Group 1
55% of the doctors belongs to Group 2
So ultimately message desgined to test the Group 2 doctors to increase the sales has worked well
Prod_A_PV_by_Grp_Year_Role <- Prod_A_PV %>% group_by(Group,year,Role) %>% summarise(nRx=sum(nRx))
Prod_A_PV_by_Grp_Year_Role <- Prod_A_PV_by_Grp_Year_Role %>% spread(year, nRx, convert = FALSE) %>% mutate(vol_change =`2017`-`2016`, increase_percentage=((`2017`-`2016`)*100)/`2016`)
Prod_A_PV_by_Grp_Year_Role
## # A tibble: 4 x 6
## # Groups: Group [2]
## Group Role `2016` `2017` vol_change increase_percentage
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Group 1 Primary Care 1929 3079 1150 59.6
## 2 Group 1 Specialist 2269 2987 718 31.6
## 3 Group 2 Primary Care 3086 5845 2759 89.4
## 4 Group 2 Specialist 1363 2132 769 56.4
P_Prod_A_PV_by_Grp_Year_Role <- ggplot(Prod_A_PV_by_Grp_Year_Role, aes(x=Group, y=increase_percentage,fill = Role)) + geom_bar(stat="identity", position = "dodge") + ggtitle( 'YoY % change in NRx sales by Group')
P_Prod_A_PV_by_Grp_Year_Role
The result shows that with new message for Group 2 the percentage of increase in sales from previous year is higher than Group 1 sales
Especially PCP’s have contributed more than specialist with higher number of sales due to higher number in count as 60% of the doctors are from primary care
Prod_A_PV_by_Grp_Year_Role_POW
## # A tibble: 12 x 7
## # Groups: Group, Role [4]
## Group Role `Place of work` `2016` `2017` vol_change increase_percent…
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Group… Primary … Group practice 764 1253 489 64.0
## 2 Group… Primary … Hospital 233 332 99 42.5
## 3 Group… Primary … Individual pract… 932 1494 562 60.3
## 4 Group… Speciali… Group practice 704 783 79 11.2
## 5 Group… Speciali… Hospital 862 1238 376 43.6
## 6 Group… Speciali… Individual pract… 703 966 263 37.4
## 7 Group… Primary … Group practice 792 1561 769 97.1
## 8 Group… Primary … Hospital 1350 2409 1059 78.4
## 9 Group… Primary … Individual pract… 944 1875 931 98.6
## 10 Group… Speciali… Group practice 783 1112 329 42.0
## 11 Group… Speciali… Hospital 354 694 340 96.0
## 12 Group… Speciali… Individual pract… 226 326 100 44.2
We can clearly see all the doctors from group 2 have made higher number of sales compared to the doctors in Group 1.
The sales of product A by the PCP doctors in Individual practice workplace reached the maximum of 98% with the new message strategy.
Prod_B_PV <- PV_Master %>% filter(Product=="Brand B")
Prod_B_PV_by_Grp_Year_Role_POW <- Prod_B_PV%>% group_by(Group,year,Role,`Place of work`) %>% summarise(nRx=sum(nRx))
Prod_B_PV_by_Grp_Year_Role_POW %>% spread(year, nRx, convert = FALSE) %>% mutate(vol_change =`2017`-`2016` ,Increase_percentage=((`2017`-`2016`)*100)/`2016`)
## # A tibble: 12 x 7
## # Groups: Group, Role [4]
## Group Role `Place of work` `2016` `2017` vol_change Increase_percent…
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Group… Primary … Group practice 943 1423 480 50.9
## 2 Group… Primary … Hospital 269 353 84 31.2
## 3 Group… Primary … Individual pract… 1051 1741 690 65.7
## 4 Group… Speciali… Group practice 601 843 242 40.3
## 5 Group… Speciali… Hospital 1003 1300 297 29.6
## 6 Group… Speciali… Individual pract… 781 1018 237 30.3
## 7 Group… Primary … Group practice 1089 1896 807 74.1
## 8 Group… Primary … Hospital 1645 2938 1293 78.6
## 9 Group… Primary … Individual pract… 1342 2188 846 63.0
## 10 Group… Speciali… Group practice 1010 1352 342 33.9
## 11 Group… Speciali… Hospital 455 850 395 86.8
## 12 Group… Speciali… Individual pract… 198 347 149 75.3
From the results we can see that Brand B is bigger brand than Brand A in terms of total sales and can see the increase in sale percentage in group 2 doctors is higher compared to group 1.
But the increase percentage is not higher as much as group 2 for Brand A sales, which informs that new message is working for Brand A.
As we have 2 categorical variables and with more than 2 categories , hence we will use ANOVA here
Prod_A_PV <- PV_Master %>% filter(Product=="Brand A")
Prod_A_PV <- Prod_A_PV %>% group_by(Month,Group) %>% summarise(nRx=sum(nRx,na.rm = T))
Prod_A_PV$year <- substr(Prod_A_PV$Month, 1, 4)
Prod_A_PV$level <- ifelse((Prod_A_PV$year == "2016") & (Prod_A_PV$Group == "Group 1"), "Group1_2016",
ifelse((Prod_A_PV$year == "2016") & (Prod_A_PV$Group == "Group 2"), "Group2_2016",
ifelse((Prod_A_PV$year == "2017") & (Prod_A_PV$Group == "Group 1"), "Group1_2017", "Group2_2017")))
Prod_A_PV$level <- ordered(Prod_A_PV$level,
levels = c("Group1_2016", "Group2_2016", "Group1_2017","Group2_2017"))
group_by(Prod_A_PV, level) %>%
summarise(
count = n(),
mean = mean(nRx, na.rm = TRUE),
sd = sd(nRx, na.rm = TRUE)
)
## # A tibble: 4 x 4
## level count mean sd
## <ord> <int> <dbl> <dbl>
## 1 Group1_2016 12 350. 78.1
## 2 Group2_2016 12 371. 116.
## 3 Group1_2017 12 506. 78.8
## 4 Group2_2017 12 665. 74.1
ggboxplot(Prod_A_PV, x = "level", y = "nRx",
color = "level", palette = c("#00AFBB", "#E7B800", "#00AFBB","#E7B800"),
order = c("Group1_2016", "Group2_2016", "Group1_2017","Group2_2017"),
ylab = "nRx(Mean)", xlab = "Group")+ggtitle('Distribution of Mean across groups')
res.aov <- aov(nRx ~ level, data = Prod_A_PV)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## level 3 761389 253796 32.57 3.01e-11 ***
## Residuals 44 342817 7791
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Prod_A_PV <- Prod_A_PV %>% filter(level==c("Group2_2016")|level==c ("Group2_2017"))
res.aov_same_groups <- aov(nRx ~ level, data = Prod_A_PV)
summary(res.aov_same_groups)
## Df Sum Sq Mean Sq F value Pr(>F)
## level 1 518616 518616 55.03 2.02e-07 ***
## Residuals 22 207326 9424
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Gather more data and do predictive modelling at Doctor level to find the important factors among the doctors whose NRx sales has decreased in 2017 from 2016.
Create a binary Target column for each doctor based on 2017 - 2016 Nrx sales