---
title: "Hornet Assessment"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---


Hi, please find final thoughts at the end of analysis.

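```{r}
# Install tidyverse only if it is missing, so the notebook does not reinstall it on every run
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

library(tidyverse)
```
Install and load the necessary packages (in this case, tidyverse should do the trick).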


```{r}
hornet <- read.csv("abc_test_data.csv")

view(hornet)
glimpse(hornet)
```
Load the dataset into the R environment, then take a quick look to check that it loaded properly and which data types the fields are stored as.


```{r}
# Rename all three columns in a single call
hornet <- hornet %>%
  rename(
    date = DATE,
    amt = Total_purchased_amount,
    nr_purchased = Purchases
  )
```
Rename variables for easier use throughout the script (not a fan of capitals).


```{r}
hornet2 <- hornet
```
Keep an untouched copy of the dataset as a backup before applying changes to the main dataset.


```{r}

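# Parse the date column using the year/month/day format
hornet$date <- as.Date(hornet$date, format = "%Y/%m/%d")

# Swap the decimal comma for a point, then convert the amounts to numeric
hornet$amt <- gsub(",", ".", hornet$amt, fixed = TRUE)
hornet$amt <- as.numeric(hornet$amt)

glimpse(hornet)

# Count amounts that failed to convert
sum(is.na(hornet$amt))
```
Convert the columns to the correct data types and confirm the conversion worked: glimpse() shows the new types, and the NA count flags any amounts that failed to parse.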
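
A small sketch of the same conversion on toy values, to show what the NA check catches. The values below are made up, and the base R helper names (`toy_raw`, `toy_num`, `toy_failed`) are only for illustration. Note that a value with a thousands separator as well as a decimal comma (for example "1.234,56") would also end up as NA with this approach.

```{r}
# Toy values (hypothetical), mimicking amounts exported with a decimal comma
toy_raw <- c("12,50", "3,99", "7", "n/a")

# Swap the decimal comma for a point, then convert to numeric
# (suppressWarnings hides the expected "NAs introduced by coercion" warning)
toy_num <- suppressWarnings(as.numeric(gsub(",", ".", toy_raw, fixed = TRUE)))

# Entries that were present before conversion but are NA afterwards failed to parse
toy_failed <- toy_raw[is.na(toy_num) & !is.na(toy_raw)]

print(toy_num)
print(toy_failed)
```

Listing the failed entries, rather than only counting them, makes it easier to decide whether they are genuine missing values or a formatting issue.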


Now, on to answering the questions.



Q3a.i: Is there a difference between the different versions regarding daily
revenue brought in by the feature?


```{r}

daily_revenue <- hornet %>%
  group_by(date, variant) %>%
  # Drop the grouping afterwards so later steps work on a plain table
  summarise(total_revenue = sum(amt, na.rm = TRUE), .groups = "drop")

summary(daily_revenue)
```
Create a table showing total revenue per day, split by version.




```{r}
ggplot(daily_revenue, aes(x = date, y = total_revenue, color = as.factor(variant))) +
  geom_line() +
  labs(title = "Daily Revenue by Version", x = "Date", y = "Total Revenue") +
  theme_minimal()
```
Line graph showing daily revenue, split by version. All versions follow a similar trend, with no notable differences. Version 3 appears to be the only one with purchases on the last day (May 17th), but the amount is negligible (on further inspection in Excel, the total came to approximately 670).


```{r}
ggplot(daily_revenue, aes(x = as.factor(variant), y = total_revenue, fill = as.factor(variant))) +
  geom_boxplot() +
  labs(title = "Daily Revenue By Version", x = "Variant", y = "Total Revenue") +
  theme_minimal()
```
Box plot showing daily revenue, split by version. Version 1 has the highest variability, but the median is very similar across all versions. Version 2 has the lowest outlier revenue.




```{r}
anova_q3ai <- aov(total_revenue ~ as.factor(variant), data = daily_revenue)
summary(anova_q3ai)
```
The ANOVA test shows no statistically significant difference in daily revenue between the 3 versions.
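
To report the test statistic alongside the conclusion, the F value and p-value can be pulled out of the ANOVA summary. A minimal sketch on made-up numbers (the `toy_` names and values are assumptions, not the assessment data):

```{r}
# Toy daily revenue (hypothetical numbers), three days per version
toy_rev <- data.frame(
  variant = factor(rep(1:3, each = 3)),
  total_revenue = c(10, 12, 11, 10, 13, 12, 11, 12, 13)
)
toy_fit <- aov(total_revenue ~ variant, data = toy_rev)

# summary() returns a list; its first element is the ANOVA table
toy_tab <- summary(toy_fit)[[1]]
toy_f <- toy_tab[["F value"]][1]  # F statistic for the variant term
toy_p <- toy_tab[["Pr(>F)"]][1]   # p-value for the variant term

cat(sprintf("F = %.3f, p = %.3f\n", toy_f, toy_p))
```

The same indexing should work on `anova_q3ai`, since its first row is also the variant term.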



User behavior difference between the different versions


Q3aii1: Is there a difference in single purchase values?

Null Hypothesis: The mean single purchase value is the same across the different versions.

Alternative Hypothesis: The mean single purchase value differs in at least one version.



```{r}
hornet$avg_single_purchase_value <- hornet$amt / hornet$nr_purchased
```
Create an average single purchase value column (total amount divided by number of purchases).
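
If any row has zero purchases, the division produces NaN or Inf, and Inf is not removed by `na.rm = TRUE`. Whether such rows exist in this dataset is not checked here, so the following is only a sketch of a guard, on toy values with hypothetical names:

```{r}
# Toy rows (hypothetical); the last two have zero purchases
toy_spend <- c(10, 9, 0, 5)
toy_count <- c(2, 3, 0, 0)

# Plain division gives NaN for 0/0 and Inf for 5/0
toy_ratio <- toy_spend / toy_count

# Set the value to NA wherever there were no purchases
toy_value <- ifelse(toy_count > 0, toy_spend / toy_count, NA_real_)

print(toy_ratio)
print(toy_value)
```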


```{r}
single_purchase_summary <- hornet %>%
  group_by(variant) %>%
  summarise(
    mean_purchase_value = mean(avg_single_purchase_value, na.rm = TRUE),
    median_purchase_value = median(avg_single_purchase_value, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(mean_purchase_value, median_purchase_value), 
               names_to = "Metric", values_to = "Value")
```
Create a summary table showing mean & median single purchase values, pivoted to long format so it can be plotted as a grouped bar graph.


```{r}
ggplot(single_purchase_summary, aes(x = as.factor(variant), y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mean & Median Single Purchase Value by Version",
       x = "Version",
       y = "Purchase Value") +
  theme_minimal()
```
Bar graph showing mean & median single purchase values by version. Version 3 has the highest mean, appearing to be a couple of cents higher than v1 and v2. The median appears similar across all versions, which is consistent with the earlier graphs, where the versions shared a similar median but differed in spread.




```{r}
anova_q3aii1 <- aov(avg_single_purchase_value ~ as.factor(variant), data = hornet)
summary(anova_q3aii1)
```
The ANOVA p-value is not statistically significant, so we fail to reject the null hypothesis: there is no evidence that single purchase values differ between versions.



Q3aii2: Is there a difference in number of purchases?

Null Hypothesis: The mean number of purchases is the same across the different versions.

Alternative Hypothesis: The mean number of purchases differs in at least one version.


```{r}
num_purchases_summary <- hornet %>%
  group_by(variant) %>%
  summarise(
    mean_nr_purchases = mean(nr_purchased, na.rm = TRUE),
    median_nr_purchases = median(nr_purchased, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(mean_nr_purchases, median_nr_purchases),
               names_to = "Metric", values_to = "Value")
```
Creating summary table showing mean & median nr of purchases by version.


```{r}
ggplot(num_purchases_summary, aes(x = as.factor(variant), y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mean & Median Number of Purchases by Version",
       x = "Version",
       y = "Nr Purchases") +
  theme_minimal()
```
Bar graph showing mean & median number of purchases by version. Version 2 has the fewest purchases; v1 & v3 appear to have a similar number of purchases.






```{r}
anova_q3aii2 <- aov(nr_purchased ~ as.factor(variant), data = hornet)
summary(anova_q3aii2)
```
The ANOVA test shows extremely strong evidence that the number of purchases differs between versions (p-value < 0.0001). We therefore reject the null hypothesis and conclude that the number of purchases differs in at least one version.
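
Purchase counts are often skewed, which can strain the normality assumption behind ANOVA. As a possible robustness check (a different technique from the ANOVA used here, not part of the original analysis), a Kruskal-Wallis rank-sum test could be run on the same variables. A sketch on made-up counts:

```{r}
# Toy purchase counts (hypothetical), five rows per version, with a few large values
toy_counts <- data.frame(
  variant = factor(rep(1:3, each = 5)),
  nr_purchased = c(1, 2, 2, 3, 9, 1, 1, 1, 2, 2, 2, 2, 3, 3, 10)
)

# Rank-based test of whether the groups come from the same distribution
toy_kw <- kruskal.test(nr_purchased ~ variant, data = toy_counts)
print(toy_kw)
```

If both tests point the same way on the real data, the conclusion is less likely to be driven by a handful of heavy buyers.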



```{r}
tukey_test <- TukeyHSD(anova_q3aii2)
print(tukey_test)
```
Use Tukey's HSD test to identify which versions differ: versions 1 & 3 perform similarly in number of purchases, and both perform better than version 2.
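
For reference, a sketch of how the pairwise results can be read programmatically. The data are made up so that versions 1 & 3 match and version 2 is lower; the `toy_` names are assumptions. In this notebook the model term is `as.factor(variant)`, so the equivalent element would be `tukey_test[["as.factor(variant)"]]`.

```{r}
# Toy purchase counts (hypothetical), four rows per version
toy_purch <- data.frame(
  variant = factor(rep(1:3, each = 4)),
  nr_purchased = c(3, 4, 3, 4, 1, 2, 1, 2, 3, 4, 4, 3)
)
toy_tukey <- TukeyHSD(aov(nr_purchased ~ variant, data = toy_purch))

# One matrix per model term: rows are pairs such as "2-1",
# columns are diff, lwr, upr and p adj
toy_pairs <- toy_tukey$variant
print(round(toy_pairs, 3))

# Pairs whose adjusted p-value falls below 0.05
toy_sig <- rownames(toy_pairs)[toy_pairs[, "p adj"] < 0.05]
print(toy_sig)
```

A pair is read as "second level minus first level", so a negative `diff` for "2-1" means version 2 is lower than version 1.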


Q3iii: Retention Rates

Null Hypothesis: The retention rate is the same across the different versions.

Alternative Hypothesis: The retention rate differs in at least one version.



```{r}
first_purchase <- hornet %>% 
  group_by(userid) %>% 
  summarise(first_purchase_date = min(date))

```
Create a table showing each userid's first purchase date. The table is grouped by userid only, so this is the user's first purchase across all versions.


```{r}
hornet <- hornet %>% 
  left_join(first_purchase, by = "userid")
```
Left join onto the original table using the key userid, to add first_purchase_date as a column.


```{r}
hornet$retained <- hornet$date > hornet$first_purchase_date
```
Create a logical retained column that flags each purchase row dated after the user's first purchase, i.e. an additional purchase after the 1st one.


```{r}
retention_summary <- hornet %>% 
  group_by(variant) %>% 
  summarise(
    total_users = n_distinct(userid),
    retained_users = sum(retained, na.rm = TRUE),
    retention_rate = retained_users / total_users
  )
print(retention_summary)
```
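Summary table showing total users by version, retained count by version, and each version's retention rate. Version 3 has the highest retention rate (58.7%), followed by version 1 (55.0%) and version 2 (53.7%). Caveat: retained is a row-level flag, so sum(retained) counts purchase rows after the first purchase date rather than distinct users, and a user with several later purchases is counted more than once.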
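
One way to count each user once would be to collapse the row-level flag to one value per user and version before taking the rate. The sketch below uses a made-up purchase log and base R; the dates, ids and `toy_` names are all hypothetical, and it is not the calculation behind the percentages above.

```{r}
# Toy purchase log (hypothetical): one row per user per purchase date
toy_log <- data.frame(
  userid  = c("a", "a", "a", "b", "c", "c"),
  variant = c(1, 1, 1, 1, 2, 2),
  date    = as.Date(c("2024-05-01", "2024-05-03", "2024-05-05",
                      "2024-05-02", "2024-05-01", "2024-05-04"))
)

# First purchase date per user, repeated on each of that user's rows
toy_first <- ave(as.numeric(toy_log$date), toy_log$userid, FUN = min)
toy_log$retained <- as.numeric(toy_log$date) > toy_first

# One row per user and version: TRUE if the user has any later purchase
toy_users <- aggregate(retained ~ userid + variant, data = toy_log, FUN = any)

# Share of retained users per version
toy_rate <- tapply(toy_users$retained, toy_users$variant, mean)
print(toy_rate)

# For comparison: the row-level count of retained rows
print(sum(toy_log$retained))
```

In the toy data user "a" has two later purchases, so the row-level count is 3 even though only 2 users came back.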






```{r}
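# Distinct users overall
n_distinct(hornet$userid)

# Number of versions each user appears in
user_variant_counts <- hornet %>%
  group_by(userid) %>%
  summarise(variant_count = n_distinct(variant))

# Users who appear in more than one version
sum(user_variant_counts$variant_count > 1)
```
Check: some userids appear in multiple versions, which explains why the sum of distinct users across versions (52 281) is greater than the total number of distinct users (50 397).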
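
The gap between the two totals (52 281 - 50 397 = 1 884) is the number of extra user-version pairs. A toy illustration of why the sums differ, using made-up ids and hypothetical `toy_` names:

```{r}
# Toy assignments (hypothetical): user "a" shows up in two versions
toy_assign <- data.frame(
  userid  = c("a", "a", "b", "c", "c", "c"),
  variant = c(1, 2, 1, 3, 3, 3)
)

# Number of distinct versions each user appears in
toy_nvar <- tapply(toy_assign$variant, toy_assign$userid,
                   function(v) length(unique(v)))

# Users seen in more than one version
toy_multi <- names(toy_nvar)[toy_nvar > 1]
print(toy_multi)

# Sum of per-version distinct users minus overall distinct users
toy_gap <- sum(toy_nvar) - length(toy_nvar)
print(toy_gap)
```

Users who saw more than one version blur the comparison, so it may be worth checking whether the results hold when they are excluded.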


```{r}
retention_table <- table(hornet$variant, hornet$retained)

print(retention_table)
```
Create a retention contingency table (version by retained flag) to use in the chi-square test.


```{r}
ggplot(retention_summary, aes(x = as.factor(variant), y = retention_rate, fill = as.factor(variant))) +
  geom_bar(stat = "identity") +
  labs(title = "Retention Rate by Variant",
       x = "Variant",
       y = "Retention Rate") +
  theme_minimal()
```
Bar graph showing retention rates by version (a rate of 1 corresponds to 100%).




```{r}
chisq_result <- chisq.test(retention_table)
print(chisq_result)
```
The chi-square test shows very strong evidence (p-value < 0.0001) that retention rates differ between the versions. We therefore reject the null hypothesis and conclude that the retention rate differs in at least one version. Version 3 has the highest retention rate.
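
The chi-square test only says that at least one version differs. To see which pairs differ, pairwise proportion tests with a multiple-comparison adjustment are one option (a follow-up technique, not part of the original analysis). The counts below are invented for illustration and are not the assessment's totals:

```{r}
# Hypothetical counts of retained users and total users per version
toy_retained <- c(v1 = 55, v2 = 48, v3 = 70)
toy_total    <- c(v1 = 100, v2 = 100, v3 = 100)

# Two-sample proportion tests for every pair, with Holm-adjusted p-values
toy_pw <- pairwise.prop.test(toy_retained, toy_total, p.adjust.method = "holm")
print(toy_pw)
```

The result is a lower-triangular matrix of adjusted p-values, one per pair of versions. The inputs should be user-level counts, with each user counted once.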

Final thoughts:

Is there a difference between the different versions regarding daily
revenue brought in by the feature? No

Is there a difference in single purchase values? No

Is there a difference in number of purchases? Yes

Is there a difference in Retention Rates? Yes

There is no strong evidence of a difference between versions in terms of daily revenue or single purchase values.
However, in number of purchases, there is very strong evidence that v1 & v3 outperform v2.
In retention rates, there is very strong evidence of a difference between versions, with v3 showing the highest rate; the chi-square test on its own does not establish which pairs of versions differ.

Further analysis is needed before concluding that v3 should be rolled out to users, and the decision depends on which metrics Hornet is trying to drive. Provisional analysis nevertheless indicates that v3 performs best.

