For this assignment, we will be working on understanding the behaviors and characteristics of people who use a digital application. The product offers recommendations on nearby attractions, restaurants, and businesses based on the user’s location. This includes a free version for any user along with a subscription model that provides more customized recommendations for users who pay for the service.
With free installation on a mobile device, digital applications have a low barrier to entry. They also experience high rates of attrition, as users may not continue to log in. With this in mind, the company is interested in better understanding the early experience of users with the application. A time point of 30 days was selected as an important milestone. Which factors might impact whether new users remain active beyond 30 days? Who is likely to subscribe within 30 days?
The company would benefit from analyzing the available data to understand the current trends.
To begin to investigate these questions, the company has gathered some simple information about new users of the application. A simple random sample of users was taken by gathering information in the company’s database. The sample was limited only to users who first installed the application in the last 6 months, when a new version of the application was released. The sample was further limited to users who signed up and had enough time for the company to measure its key milestones. To ensure reasonable comparisons, the data were limited to users in Australia, Canada, United Kingdom, and the United States, which were deemed appropriately similar in terms of their linguistic and economic characteristics.
For each user, basic information (age group, gender, and country) was collected from the user’s profile. Then the following characteristics were measured:
daily_sessions: This is the average number of sessions per day in the first 30 days for each user. One session consists of a period of active use as measured by the company’s database. Then the daily sessions for a user is the total number of sessions for the period divided by 30.
subscribed_30: This measure (TRUE/FALSE) indicates whether the user paid for any subscription service within 30 days.
active_30: This measures (TRUE/FALSE) whether the user remained active at 30 days. The company decided to measure this by identifying whether the user had at least one active session in the 7-day period after the first 30 days.
Based upon the information above and the data provided, please answer the following questions. When numeric answers are requested, a few sentences of explanation should also be provided. Please show your code for any calculations performed.
This section of the report is reserved for any work you plan to do ahead of answering the questions – such as loading or exploring the data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(miscTools)
digital.application.user.data <- read.csv("~/Downloads/digital application user data.csv")
We are interested in the question of whether female users have higher rates of daily sessions than other users do. What kind of parameter should we select as our metric for each group?
We want to look at the mean number of daily sessions for each user group as our parameter of interest.
avg_sessions <- mean(digital.application.user.data$daily_sessions)
Use the data to estimate the values of your selected parameter for female users and for other users.
Female users average 1.473 daily sessions, and other users average 1.429.
data_f <- digital.application.user.data %>%
filter(female == TRUE)
avg_sessions_f <- mean(data_f$daily_sessions); avg_sessions_f
## [1] 1.472535
data_m <- digital.application.user.data %>%
filter(female == FALSE)
avg_sessions_m <- mean(data_m$daily_sessions); avg_sessions_m
## [1] 1.42895
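The two group means can also be computed in a single pass rather than by filtering twice. The following is a minimal base-R sketch, assuming a logical `female` column and a numeric `daily_sessions` column as in the data above. The small `toy_users` data frame is made up purely for illustration and is not the real user file.

```r
# Hypothetical toy data standing in for the real user file.
toy_users <- data.frame(
  female = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  daily_sessions = c(1.5, 2.7, 1.0, 2.0, 3.0)
)

# Mean of daily_sessions within each level of `female`.
# The result is a named array with entries "FALSE" and "TRUE".
group_means <- tapply(toy_users$daily_sessions, toy_users$female, mean)
group_means
```

Keeping both means in one object makes it harder to mix up the subsets when computing the difference afterwards.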
Does there appear to be an observed difference between the groups? Without performing statistical tests, would you consider this difference to be meaningful for the business? Explain your answer.
There appears to be a slight difference between the groups. Female users average about 0.044 more daily sessions than other users, a relative difference of roughly 3%. Without performing statistical tests, this difference could be meaningful for the business, because a 3% lift in usage is non-negligible at scale. However, a statistical test is needed to judge whether the difference is larger than what chance alone would produce.
daily_sessions_difference <- avg_sessions_f - avg_sessions_m; daily_sessions_difference
## [1] 0.0435844
difference_pct <- (daily_sessions_difference/avg_sessions_m) * 100; difference_pct
## [1] 3.050099
Which statistical test would be appropriate for testing the two groups for differences in their daily sessions according to your selected metric?
Since we want to compare the means of two independent groups, a two-sample t-test is appropriate. We use Welch’s version, which does not assume that the two groups have equal variances.
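For reference, the statistic behind a Welch two-sample t-test can be written out as follows. The notation is introduced only for this sketch and does not come from the assignment: $\bar{x}_F$, $s_F^2$, and $n_F$ are the sample mean, sample variance, and size of the female group, with the subscript $O$ marking the other users.

```latex
t = \frac{\bar{x}_F - \bar{x}_O}{\sqrt{\dfrac{s_F^2}{n_F} + \dfrac{s_O^2}{n_O}}},
\qquad
\text{df} \approx
\frac{\left(\dfrac{s_F^2}{n_F} + \dfrac{s_O^2}{n_O}\right)^{2}}
     {\dfrac{\left(s_F^2 / n_F\right)^{2}}{n_F - 1} + \dfrac{\left(s_O^2 / n_O\right)^{2}}{n_O - 1}}
```

The degrees of freedom come from the Welch-Satterthwaite approximation, which is why they are generally not a whole number.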
How many samples (groups) are included in your selected statistical test?
Two groups will be included in the statistical test: female users and all other users.
How many tails are considered in your selected statistical test?
This will be a two-tailed test. We are looking for any statistical difference between the two sample groups, whereas we would use a one-tailed test if we were only interested in one specific direction.
Perform your selected statistical test. Report a p-value for the results.
The t-test returns a p-value of .081.
t.test(x = data_f$daily_sessions, y = data_m$daily_sessions, alternative = c("two.sided"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: data_f$daily_sessions and data_m$daily_sessions
## t = 1.7451, df = 4539.1, p-value = 0.08104
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.005380538 0.092549329
## sample estimates:
## mean of x mean of y
## 1.472535 1.428950
How would you interpret this finding for the product’s managers of the digital application? Make sure to frame the result in terms that will be meaningful for their work.
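At the 5% significance level, the test does not provide sufficient evidence of a difference in mean daily sessions between female users and other users (p = 0.081). The 95% confidence interval for the difference, roughly -0.005 to 0.093 sessions per day, includes zero. For the product’s managers, this means the data do not support treating female users as a more engaged segment on the basis of session counts alone.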
The product’s managers are also interested in the age groups that tend to use the product and how they vary by country. Create a table with the following characteristics:
Each row represents an age group.
Each column represents a country.
Each listed value shows the number of users of that age group within that country.
age_country_count <- digital.application.user.data %>%
group_by(age_group, country) %>%
count()
age_country_count_pivot <- age_country_count %>%
pivot_wider(names_from = country, values_from = n)
age_country_count_pivot
## # A tibble: 4 x 5
## # Groups: age_group [4]
## age_group Australia Canada UK USA
## <chr> <int> <int> <int> <int>
## 1 18-34 282 242 439 894
## 2 35-49 219 204 363 792
## 3 50-64 191 128 255 554
## 4 65+ 60 41 101 235
Now convert the previous table of counts by age group and country into percentages. However, we want the percentages to be calculated separately within each country. Show the resulting table as percentages (ranging from 0 to 100) rounded to 1 decimal place.
aus_pct = prop.table(age_country_count_pivot$Australia)
can_pct = prop.table(age_country_count_pivot$Canada)
uk_pct = prop.table(age_country_count_pivot$UK)
usa_pct = prop.table(age_country_count_pivot$USA)
Australia <- round(aus_pct*100, digits=1)
Canada <- round(can_pct*100, digits=1)
UK <- round(uk_pct*100, digits=1)
USA <- round(usa_pct*100, digits=1)
Age_Group <- age_country_count_pivot$age_group
age_country_pct <- data.frame(Age_Group, Australia, Canada, UK, USA)
age_country_pct
## Age_Group Australia Canada UK USA
## 1 18-34 37.5 39.3 37.9 36.1
## 2 35-49 29.1 33.2 31.3 32.0
## 3 50-64 25.4 20.8 22.0 22.4
## 4 65+ 8.0 6.7 8.7 9.5
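The same within-country percentages can be produced in one step from a matrix of counts. The following is a minimal sketch, assuming the counts are typed in from the table of users by age group and country shown earlier. The object names `counts` and `pct_by_country` are chosen here for illustration.

```r
# Counts copied from the age-group-by-country table above
# (rows: age groups, columns: countries).
counts <- matrix(
  c(282, 242, 439, 894,
    219, 204, 363, 792,
    191, 128, 255, 554,
     60,  41, 101, 235),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    age_group = c("18-34", "35-49", "50-64", "65+"),
    country = c("Australia", "Canada", "UK", "USA")
  )
)

# margin = 2 normalizes within each column, so each country's
# unrounded percentages sum to 100.
pct_by_country <- round(100 * prop.table(counts, margin = 2), 1)
pct_by_country
```

Using `margin = 2` avoids one `prop.table()` call per country and scales to any number of countries without extra code.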
Without performing any statistical tests, do you think that each country has a similar distribution of users across the age groups? Explain why or why not.
Without performing any statistical tests, each country appears to have a similar distribution of users across the age groups. The table below shows the difference between each country’s percentages and the aggregate percentages (all countries combined). Every difference is within 3 percentage points. Thus, the age-group breakdown of each country appears similar to that of the data set as a whole, and the countries appear similar to each other.
age_count <- digital.application.user.data %>%
group_by(age_group) %>%
count()
age_pct <- prop.table(age_count$n)
Proportion <- round(age_pct*100, digits=1)
Australia_Diff <- Australia - Proportion
Canada_Diff <- Canada - Proportion
UK_Diff <- UK - Proportion
USA_Diff <- USA - Proportion
age_country_proportion_differences <- data.frame(Age_Group, Australia_Diff, Canada_Diff, UK_Diff, USA_Diff); age_country_proportion_differences
## Age_Group Australia_Diff Canada_Diff UK_Diff USA_Diff
## 1 18-34 0.4 2.2 0.8 -1.0
## 2 35-49 -2.5 1.6 -0.3 0.4
## 3 50-64 2.8 -1.8 -0.6 -0.2
## 4 65+ -0.7 -2.0 0.0 0.8
Which statistical test would help you determine if there are age-based differences across these countries? Explain why you selected this test.
To determine if there are age-based differences across these countries, a chi-square test of independence is appropriate. Both age group and country are categorical, and we want to know whether the distribution of age groups is independent of country. The test compares the observed counts with the counts we would expect if every country shared the same age-group distribution.
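For reference, the chi-square statistic for a table of counts can be written out as follows. The notation is introduced only for this sketch and does not come from the assignment: $O_{ij}$ is the observed count for age group $i$ in country $j$, $R_i$ is the total for age group $i$, $C_j$ is the total for country $j$, and $N$ is the overall sample size.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
\text{df} = (r - 1)(c - 1)
```

With $r = 4$ age groups and $c = 4$ countries, the statistic is compared against a chi-square distribution with 9 degrees of freedom.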
What is the value of the test statistic for your selected test? Calculate this answer independently without using an existing testing function. (You may use such a function to check your answer.) Show your code along with the result.
The value of the chi-square test statistic is 12.64096.
country_totals <- digital.application.user.data %>%
group_by(country) %>%
count()
Age_Group_1_Expected <- age_pct[1] * country_totals$n
Age_Group_2_Expected <- age_pct[2] * country_totals$n
Age_Group_3_Expected <- age_pct[3] * country_totals$n
Age_Group_4_Expected <- age_pct[4] * country_totals$n
expected_values <- t(data.frame(Age_Group_1_Expected, Age_Group_2_Expected, Age_Group_3_Expected, Age_Group_4_Expected))
colnames(expected_values) <- c("Australia", "Canada", "UK", "USA")
E <- c(expected_values)
A <- c(as.data.frame(age_country_count_pivot[,-1]))
A <- c(A$Australia, A$Canada, A$UK, A$USA)
test_stat <- sum(((A-E)^2)/E)
test_stat
## [1] 12.64096
What is the p-value for this test? Calculate this answer independently without using an existing testing function. (You may use such a function to check your answer.) Show your code along with the result.
The p-value for this test is 0.1795. This was calculated using a chi-square distribution with a test statistic of 12.64096 on 9 degrees of freedom.
num_rows = 4
num_cols = 4
df = (num_rows - 1)*(num_cols - 1)
pval <- pchisq(q = test_stat, df=df, lower.tail = F); pval
## [1] 0.1795367
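Since the question allows an existing testing function as a check, the hand calculation can be compared against `chisq.test()`. The following is a minimal sketch, assuming the observed counts are typed in from the age-group-by-country table shown earlier. The names `observed` and `check` are chosen here for illustration.

```r
# Observed counts copied from the age-group-by-country table above
# (rows: age groups, columns: countries).
observed <- matrix(
  c(282, 242, 439, 894,
    219, 204, 363, 792,
    191, 128, 255, 554,
     60,  41, 101, 235),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("18-34", "35-49", "50-64", "65+"),
    c("Australia", "Canada", "UK", "USA")
  )
)

# Built-in test of independence; no continuity correction is
# applied for tables larger than 2x2.
check <- chisq.test(observed)
check$statistic   # chi-square statistic
check$parameter   # degrees of freedom
check$p.value     # p-value
```

These values should agree with the hand-computed statistic of 12.64096 on 9 degrees of freedom and the p-value of 0.1795.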
How would you interpret this finding for the product’s managers of the digital application? Make sure to frame the result in terms that will be meaningful for their work.
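At the 5% significance level, the test does not provide sufficient evidence that the age distribution of users differs across the four countries (p = 0.18). For the product’s managers, this means the data give no reason to treat the age mix of users differently from one of these countries to another.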
Canada and the United States are geographically connected and often have overlapping media markets. We can place them in one group and compare them to a second group with Australia and the United Kingdom. Do these two groups have similar rates of users who remain active at 30 days? Perform a statistical test, explain why you selected it, and interpret the results.
To compare the rates of users who remain active at 30 days between the Canada/USA group and the Australia/UK group, a two-sample test of proportions is appropriate. active_30 is a logical variable, so we compute the proportion of users in each group for whom it is TRUE and test whether those proportions differ. About 37.5% of Australia/UK users (717 of 1,910) and 39.5% of Canada/USA users (1,220 of 3,090) remained active. The test gives a p-value of approximately 0.18, which is above our significance level of 0.05, so we fail to reject the null hypothesis. The data do not provide sufficient evidence of a difference in 30-day activity rates between the two country groups.
country_groups <- digital.application.user.data %>%
mutate(country_group = ifelse(country == "Canada" | country == "USA", "Canada/USA", "Australia/UK")) %>%
group_by(country_group, active_30) %>%
count()
aus_uk_true <- c(as.data.frame(country_groups[2,3]))
aus_uk_tot <- c(as.data.frame(country_groups[1,3] + country_groups[2,3]))
can_usa_true <- c(as.data.frame(country_groups[4,3]))
can_usa_tot <- c(as.data.frame(country_groups[3,3] + country_groups[4,3]))
prop_matrix <- matrix(c(aus_uk_true, aus_uk_tot, can_usa_true, can_usa_tot), ncol=2)
colnames(prop_matrix) <- c('Australia/UK', 'Canada/USA')
rownames(prop_matrix) <- c('True', 'Total')
prop_matrix_t <- t(prop_matrix)
prop_matrix_t
## True Total
## Australia/UK 717 1910
## Canada/USA 1220 3090
# prop.test() takes successes (x) and group totals (n) as separate vectors.
# A two-column matrix would be read as successes and failures, not totals.
active_test <- prop.test(x = c(717, 1220), n = c(1910, 3090))
round(active_test$estimate, 3)
## prop 1 prop 2
## 0.375 0.395
round(active_test$p.value, 2)
## [1] 0.18
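`prop.test()` accepts the counts in two shapes, and the shape determines how the numbers are read: separate vectors of successes and totals, or a two-column matrix of successes and failures. The following is a minimal sketch of both forms, using the active and total counts from the table above. The variable names are chosen here for illustration.

```r
active <- c(717, 1220)   # users active at 30 days (Australia/UK, Canada/USA)
total  <- c(1910, 3090)  # all sampled users in each group

# Form 1: successes (x) and group totals (n) as separate vectors.
by_vectors <- prop.test(x = active, n = total)

# Form 2: a two-column matrix of successes and FAILURES (not totals).
by_matrix <- prop.test(cbind(active, inactive = total - active))

by_vectors$estimate                                # active / total per group
all.equal(by_vectors$p.value, by_matrix$p.value)   # both forms agree
```

Passing totals in the second column of a matrix would make the function treat them as failures, which shrinks the estimated proportions, so the vector form is the less error-prone choice when totals are what is on hand.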
The application’s managers would like to study the relationship between daily sessions and subscriptions. Anecdotally, they think that having at least 1 session per day could be a meaningful indicator. Using the outcome of subscriptions at 30 days, compare the rates of subscriptions for users with at least 1 daily session to those with fewer. Perform a statistical test, explain the reasons for your selection, and interpret the results.
To compare the rates of subscriptions at 30 days between users with at least 1 session per day and users with fewer than 1 session per day, a two-sample test of proportions is appropriate. subscribed_30 is a logical variable, so we compute the proportion of users in each group for whom it is TRUE and test whether those proportions differ. About 6.2% of users with at least 1 daily session subscribed (190 of 3,079), compared with about 4.5% of users with fewer (86 of 1,921). The test gives a p-value of approximately 0.013, which is below our significance level of 0.05, so we reject the null hypothesis. There is evidence that 30-day subscription rates differ between the two groups, with the more frequent users subscribing at a higher rate.
session_groups <- digital.application.user.data %>%
mutate(session_group = ifelse(daily_sessions >= 1, "At Least 1", "Fewer than 1")) %>%
group_by(session_group, subscribed_30) %>%
count()
one_plus_true <- c(as.data.frame(session_groups[2,3]))
one_plus_tot <- c(as.data.frame(session_groups[1,3] + session_groups[2,3]))
fewer_one_true <- c(as.data.frame(session_groups[4,3]))
fewer_one_tot <- c(as.data.frame(session_groups[3,3] + session_groups[4,3]))
prop_matrix2 <- matrix(c(one_plus_true, one_plus_tot, fewer_one_true, fewer_one_tot), ncol=2)
colnames(prop_matrix2) <- c('At Least 1', 'Fewer than 1')
rownames(prop_matrix2) <- c('True', 'Total')
prop_matrix2_t <- t(prop_matrix2)
prop_matrix2_t
## True Total
## At Least 1 190 3079
## Fewer than 1 86 1921
# prop.test() takes successes (x) and group totals (n) as separate vectors.
# A two-column matrix would be read as successes and failures, not totals.
subscribed_test <- prop.test(x = c(190, 86), n = c(3079, 1921))
round(subscribed_test$estimate, 3)
## prop 1 prop 2
## 0.062 0.045
round(subscribed_test$p.value, 3)
## [1] 0.013
What type of study was conducted? Are there any concerns about the analyses based upon the method of research?
The study was observational. Because users were not randomly assigned to any condition, the analyses can identify associations but cannot establish cause and effect. There are also limitations in how the sample was drawn. It included only users who first installed the application in the last 6 months, so it excluded users who started earlier and also users who installed too recently for the company to measure their key milestones. The sample was also taken from only 4 countries (Canada, the USA, the UK, and Australia), so it excludes users in other areas. These exclusions could mean that valuable insights are being overlooked or that the analyses are not representative of the entire user base. Another concern is that the study records the number of daily sessions but not the amount of time spent in those sessions, so while daily sessions indicates how many times users open the app, it may not indicate how much they actually use it.
How actionable are the findings of this analysis? Do the independent variables help us to make choices about how to improve the outcomes of activity and subscription at 30 days?
The analysis treated gender, country, country group, and a daily-session threshold as independent variables, and daily sessions, age-group distribution, activity at 30 days, and subscription at 30 days as outcomes. It did not reveal a significant difference in mean daily sessions between female users and other users, in age-group proportions across countries, or in 30-day activity rates between users in the USA and Canada and users in Australia and the UK. It did show evidence of a significant difference in 30-day subscription rates between users averaging at least 1 session per day and those averaging fewer. Gender, age, and country are not things the company can change, so those comparisons offer little direct guidance for improving outcomes. The session finding is more actionable: users with at least 1 average daily session subscribe at a higher rate, which suggests testing whether features that encourage daily use also raise subscriptions. Because the data are observational, this association alone does not show that increasing sessions would cause more subscriptions. To improve activity at 30 days, additional analysis is needed beyond the comparison of country groups.
What else could you recommend to the managers of the product for improving their preferred outcomes of activity and subscriptions at 30 days? Provide a number of strategic recommendations that are actionable, measurable, and amenable to experimentation.
Some additional analyses that could be performed to improve the company’s preferred outcomes of activity and subscriptions at 30 days are as follows:
- Is age group associated with users’ rates of activity and/or subscription at 30 days?
- Is gender associated with users’ rates of activity and/or subscription at 30 days?
- Is the number of average daily sessions associated with users’ rates of activity at 30 days?
- Is country associated with users’ rates of subscription at 30 days?
- Are users’ rates of activity at 30 days and subscription at 30 days associated with each other?
Beyond the data collected in this observational study, the company could also perform an experiment to see if various aspects of the user experience influence users’ propensity to remain active and/or subscribe at 30 days.
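If the company does run an experiment, one practical planning step is estimating how many users each arm would need. The following is a rough sketch using `power.prop.test()` from base R's stats package. The baseline rate of 5%, the target rate of 6%, and the 80% power level are hypothetical values chosen for illustration, not figures from this analysis.

```r
# Hypothetical planning inputs (not taken from the data above):
baseline_rate <- 0.05   # assumed 30-day subscription rate in the control arm
target_rate   <- 0.06   # assumed rate the new experience would need to reach

# Solve for the per-group sample size of a two-sided, two-sample
# test of proportions at the 5% significance level with 80% power.
plan <- power.prop.test(
  p1 = baseline_rate,
  p2 = target_rate,
  sig.level = 0.05,
  power = 0.80
)
ceiling(plan$n)   # users needed in EACH arm
```

Smaller expected lifts require much larger samples, so it helps to agree on the smallest improvement worth detecting before the experiment starts.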