Run this chunk without editing to install and/or load any necessary packages along with the functions we’ve used in class.
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
if (!require("multcomp")) install.packages("multcomp")
## Loading required package: multcomp
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Attaching package: 'TH.data'
##
## The following object is masked from 'package:MASS':
##
## geyser
library(multcomp)
# Github Gist
source("https://gist.github.com/dankatz00/c50ff0bcb09a6b4f5997adee21e8f92f/raw/smart_t_and_z_tests.R")
source("https://gist.github.com/dankatz00/ec6f1446267a6f28adc76f6c72531cda/raw/summ_glht.R")
Download “Unit_1_HW_Data.Rdata” from Canvas. Load that into your RStudio environment. Use question_1_data for Question 1, question_2_data for Question 2, … Note: Questions 8 through 10 all use question_8_to_10_data.
Read each question and run an appropriate analysis using the data from Canvas. For each question, give your conclusion and use the results of your analysis to support that conclusion. For example, do not simply say “yes they buy more.” You have to say what about your analysis leads you to that conclusion. Also, do not simply say “we reject the null hypothesis.” You have to use your analyses to answer the business questions that are being asked.
A company wants to analyze whether customers in the “North” region have an average spending different from $250. Do they?
load("Unit_1_HW_Data.RData")
north_customers <- question_1_data[question_1_data$region == "North", ]
t_test_result <- t.test(north_customers$total_spent, mu = 250)
mean(north_customers$total_spent, na.rm = TRUE)
## [1] 254.5881
t_test_result
##
## One Sample t-test
##
## data: north_customers$total_spent
## t = 1.3906, df = 248, p-value = 0.1656
## alternative hypothesis: true mean is not equal to 250
## 95 percent confidence interval:
## 248.0899 261.0863
## sample estimates:
## mean of x
## 254.5881
At the 5% significance level, there is not sufficient evidence to conclude that average spending among customers in the “North” region differs from $250 (t = 1.39, df = 248, p = 0.166). The sample mean is $254.59, and the 95% confidence interval ($248.09 to $261.09) includes $250, so the observed difference could be due to random variation rather than a real effect.
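As a rough check on how the pieces of the t-test output fit together, the sketch below rebuilds the standard error, the confidence interval, and the two-sided p-value from the rounded values printed above. It uses only the reported mean, t statistic, and degrees of freedom, so small rounding differences are expected.

```r
# Values copied from the t.test() output above (rounded as printed)
xbar   <- 254.5881   # sample mean
t_stat <- 1.3906     # reported t statistic
df     <- 248        # reported degrees of freedom
mu0    <- 250        # hypothesized mean

# The t statistic is (xbar - mu0) / se, so the standard error can be backed out
se <- (xbar - mu0) / t_stat

# 95% confidence interval: mean plus/minus the critical t value times the standard error
ci <- xbar + c(-1, 1) * qt(0.975, df) * se

# Two-sided p-value: probability of a t statistic at least this extreme in either tail
p_two_sided <- 2 * pt(-abs(t_stat), df)

ci
p_two_sided
```

Because $250 lies inside the interval, the two-sided p-value is above 0.05. The two statements are the same result seen from different angles.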
A company wants to know if customers with a “12 months” subscription renew at a higher rate than other customers. Do they?
question_2_data$renewed_numeric <- ifelse(question_2_data$renewed == "yes", 1, 0)
renew_12 <- question_2_data$renewed_numeric[question_2_data$subscription_length == "12_months"]
renew_other <- question_2_data$renewed_numeric[question_2_data$subscription_length != "12_months"]
mean_12 <- mean(renew_12)
mean_other <- mean(renew_other)
successes <- c(sum(renew_12), sum(renew_other))
totals <- c(length(renew_12), length(renew_other))
test_result <- prop.test(successes, totals, alternative = "greater")
mean_12
## [1] 0.5742188
mean_other
## [1] 0.5955882
test_result$p.value
## [1] 0.6900453
At the 5% significance level, there is not sufficient evidence to conclude that customers with a 12-month subscription renew at a higher rate than other customers (p = 0.69). The sample renewal rate for 12-month subscriptions is about 57.4%, slightly lower than the 59.6% for other subscription lengths, so the data give no indication that 12-month subscribers renew at a higher rate.
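The large p-value follows from the direction of the alternative. A minimal sketch, using hypothetical counts (115 of 200 and 180 of 300, not the homework data), shows that when the first group's sample rate is below the second's, a "greater" test returns a p-value above 0.5:

```r
# Hypothetical counts for illustration only (NOT the homework data)
successes <- c(115, 180)   # renewals in group 1 and group 2
totals    <- c(200, 300)   # customers in group 1 and group 2

# One-sided test: is the renewal rate in group 1 greater than in group 2?
res <- prop.test(successes, totals, alternative = "greater")

res$estimate   # sample proportions for the two groups
res$p.value    # above 0.5 because the sample difference points the other way
```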
A company wants to compare the average purchase amount between “younger” (under 35) and “older” (50+) customers. Do they purchase similar or different amounts?
colnames(question_3_data)
## [1] "customer_id" "age" "purchase_amount"
head(question_3_data)
younger <- question_3_data[question_3_data$age < 35, ]
older <- question_3_data[question_3_data$age >= 50, ]
mean_younger <- mean(younger$purchase_amount, na.rm = TRUE)
mean_older <- mean(older$purchase_amount, na.rm = TRUE)
t_test_result <- t.test(younger$purchase_amount, older$purchase_amount)
mean_younger
## [1] 252.9079
mean_older
## [1] 252.2502
t_test_result
##
## Welch Two Sample t-test
##
## data: younger$purchase_amount and older$purchase_amount
## t = 0.16002, df = 610.7, p-value = 0.8729
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.413927 8.729291
## sample estimates:
## mean of x mean of y
## 252.9079 252.2502
At the 5% significance level, there is no significant difference in average purchase amount between younger (under 35) and older (50+) customers (p = 0.87). The younger group averages $252.91 and the older group $252.25, and the 95% confidence interval for the difference (-$7.41 to $8.73) includes zero, so their purchase amounts are similar.
Sales representatives’ performance is measured by their conversion rate. Salespeople have a target to have a conversion rate over 25%. Does the company’s average salesperson conversion rate hit this target?
colnames(question_4_data)
## [1] "salesperson_id" "conversion_rate"
head(question_4_data)
t_test_result <- t.test(question_4_data$conversion_rate, mu = 0.25, alternative = "greater")
mean_conversion <- mean(question_4_data$conversion_rate, na.rm = TRUE)
mean_conversion
## [1] 0.2590809
t_test_result
##
## One Sample t-test
##
## data: question_4_data$conversion_rate
## t = 0.85916, df = 299, p-value = 0.1955
## alternative hypothesis: true mean is greater than 0.25
## 95 percent confidence interval:
## 0.2416417 Inf
## sample estimates:
## mean of x
## 0.2590809
At the 5% significance level, there is not sufficient evidence that the company’s average salesperson conversion rate exceeds the 25% target (p = 0.20). The sample average conversion rate is 25.9%, which is above 25%, but the difference is small enough to be explained by random variation, so we cannot conclude that the company is meeting its target.
An advertising team tests 500 impressions each for two different ads (A and B). For each impression, they measure whether the ad was clicked. Their goal is for the proportion of impressions that resulted in a click to exceed 25% for each ad. Did they meet their goal?
question_5_data$clicked_numeric <- ifelse(question_5_data$clicked == "yes", 1, 0)
click_rate_A <- mean(question_5_data$clicked_numeric[question_5_data$ad == "A"])
click_rate_B <- mean(question_5_data$clicked_numeric[question_5_data$ad == "B"])
prop_test_A <- prop.test(
sum(question_5_data$clicked_numeric[question_5_data$ad == "A"]),
n = 500, # 500 impressions per ad
p = 0.25,
alternative = "greater"
)
prop_test_B <- prop.test(
sum(question_5_data$clicked_numeric[question_5_data$ad == "B"]),
n = 500,
p = 0.25,
alternative = "greater"
)
click_rate_A
## [1] 0.288
click_rate_B
## [1] 0.24
prop_test_A$p.value
## [1] 0.02802339
prop_test_B$p.value
## [1] 0.6789476
Ad A achieved a statistically significant click-through rate above the 25% target (28.8%, p = 0.028), so the goal was met for Ad A. Ad B did not achieve a statistically significant click-through rate above the 25% target (24.0%, p = 0.679), so the goal wasn’t met for Ad B.
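The click counts can be recovered from the rates above, assuming exactly 500 impressions per ad as the question states (0.288 × 500 = 144 for Ad A and 0.24 × 500 = 120 for Ad B). The sketch below reruns the one-sample proportion tests from those counts and adds an exact binomial test as a cross-check. The exact test is an addition for comparison, not part of the original analysis.

```r
# Click counts implied by the reported rates, assuming 500 impressions per ad
clicks_A <- 144   # 0.288 * 500
clicks_B <- 120   # 0.240 * 500
n        <- 500
target   <- 0.25

# Same test as above: normal approximation with continuity correction
p_A <- prop.test(clicks_A, n, p = target, alternative = "greater")$p.value
p_B <- prop.test(clicks_B, n, p = target, alternative = "greater")$p.value

# Cross-check with the exact binomial test (no normal approximation)
exact_A <- binom.test(clicks_A, n, p = target, alternative = "greater")$p.value
exact_B <- binom.test(clicks_B, n, p = target, alternative = "greater")$p.value

c(prop_A = p_A, exact_A = exact_A)
c(prop_B = p_B, exact_B = exact_B)
```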
Customers are grouped into “Low”, “Mid”, and “High” traffic tiers based on how much they visit a company’s website. The company wants to know if high-traffic customers spend more or less than mid- and low-traffic customers combined (i.e., NOT the average of mid- and low-traffic customers). Do they?
colnames(question_6_data)
## [1] "customer_id" "traffic_tier" "total_spent"
head(question_6_data)
total_high <- sum(question_6_data$total_spent[question_6_data$traffic_tier == "High"], na.rm = TRUE)
total_lowmid <- sum(question_6_data$total_spent[question_6_data$traffic_tier != "High"], na.rm = TRUE)
total_high
## [1] 292833.3
total_lowmid
## [1] 314654.6
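High-traffic customers spent a total of $292,833.30, while mid- and low-traffic customers combined spent $314,654.60, so high-traffic customers spent less in total than mid- and low-traffic customers combined. These totals depend on how many customers fall into each tier, and no significance test was run, so this comparison alone does not show whether a typical high-traffic customer spends more or less than a typical mid- or low-traffic customer.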
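Comparing totals does not account for group sizes or sampling variability. One possible way to test the question is to pool the low and mid tiers into a single group and compare mean spending per customer with a two-sample t-test. The sketch below uses simulated data with the same column names as question_6_data. The group sizes, means, and the `is_high` column are assumptions for illustration, not the homework data or necessarily the method intended in class.

```r
# Simulated stand-in for question_6_data (NOT the real data)
set.seed(42)
sim <- data.frame(
  traffic_tier = rep(c("Low", "Mid", "High"), times = c(400, 350, 250)),
  total_spent  = c(rnorm(400, 300, 60), rnorm(350, 320, 60), rnorm(250, 330, 60))
)

# Pool Low and Mid into one group by flagging High vs. everyone else
sim$is_high <- sim$traffic_tier == "High"

# Mean spending per customer in each group (FALSE = Low + Mid pooled, TRUE = High)
tapply(sim$total_spent, sim$is_high, mean)

# Welch two-sample t-test, two-sided because the question asks "more or less"
tt <- t.test(total_spent ~ is_high, data = sim)
tt
```

Pooling the rows weights the low and mid tiers by their number of customers, which matches the "combined, not the average of" wording in the question.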
For this company, how does their price point (measured in dollars) affect customer retention rate (measured as a percentage)? To answer, run a regression and interpret the output.
colnames(question_7_data)
## [1] "price_point" "retention_rate"
head(question_7_data)
model <- lm(retention_rate ~ price_point, data = question_7_data)
summary(model)
##
## Call:
## lm(formula = retention_rate ~ price_point, data = question_7_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.096653 -0.031468 -0.009649 0.032421 0.142163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.942541 0.024817 37.980 < 2e-16 ***
## price_point -0.017225 0.001769 -9.739 4.47e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04944 on 98 degrees of freedom
## Multiple R-squared: 0.4918, Adjusted R-squared: 0.4866
## F-statistic: 94.84 on 1 and 98 DF, p-value: 4.472e-16
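The regression shows that as the price point increases, customer retention rate decreases. The coefficient for price_point is about -0.017 (p < 0.001), which means that each $1 increase in price is associated with a drop of about 0.017 in the retention rate. The retention rate appears to be stored as a proportion (the intercept is 0.94), so this corresponds to roughly 1.7 percentage points per dollar. This negative relationship is statistically significant, and price explains about 49% of the variation in retention rate (R-squared = 0.49), indicating that higher prices are associated with lower customer retention for this company.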
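To make the size of the effect concrete, the sketch below converts the slope printed in the regression output into percentage points and builds an approximate 95% confidence interval from the reported standard error. It assumes retention_rate is stored as a proportion between 0 and 1, which the intercept of about 0.94 suggests but the output does not confirm.

```r
# Values copied from the summary(model) output above
slope    <- -0.017225   # estimated change in retention rate per $1
se_slope <- 0.001769    # standard error of the slope
df_resid <- 98          # residual degrees of freedom

# Effect per $1 in percentage points (assumes retention is a 0-1 proportion)
slope_pp <- slope * 100

# 95% confidence interval for the slope
ci_slope <- slope + c(-1, 1) * qt(0.975, df_resid) * se_slope

# Implied change in retention for a $5 price increase, in percentage points
change_5 <- 5 * slope_pp

slope_pp
ci_slope
change_5
```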
Calculate the correlation between shark attacks and ice cream sales and determine if that correlation is statistically significant.
question_8_to_10_data$SharkAttacks <- as.numeric(question_8_to_10_data$SharkAttacks)
question_8_to_10_data$IceCreamSales <- as.numeric(question_8_to_10_data$IceCreamSales)
cor_test_result <- cor.test(
question_8_to_10_data$SharkAttacks,
question_8_to_10_data$IceCreamSales
)
cor_test_result$estimate # Correlation coefficient (r)
## cor
## 0.4485975
cor_test_result$p.value # p-value
## [1] 1.872235e-05
There is a statistically significant positive correlation between shark attacks and ice cream sales (r = 0.45, p < 0.001). This means that as ice cream sales increase, shark attacks also tend to increase. However, this correlation does not imply causation. It is more likely that both variables are influenced by a third factor, such as warmer weather or increased beach attendance.
Run a regression to predict the number of shark attacks using the number of ice cream sales. Interpret the slope coefficient.
question_8_to_10_data$SharkAttacks <- as.numeric(question_8_to_10_data$SharkAttacks)
question_8_to_10_data$IceCreamSales <- as.numeric(question_8_to_10_data$IceCreamSales)
model <- lm(SharkAttacks ~ IceCreamSales, data = question_8_to_10_data)
summary(model)
##
## Call:
## lm(formula = SharkAttacks ~ IceCreamSales, data = question_8_to_10_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.6780 -5.1545 -0.4234 6.1811 17.6718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.44069 5.03182 2.274 0.0256 *
## IceCreamSales 0.25587 0.05629 4.545 1.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.585 on 82 degrees of freedom
## Multiple R-squared: 0.2012, Adjusted R-squared: 0.1915
## F-statistic: 20.66 on 1 and 82 DF, p-value: 1.872e-05
The regression slope coefficient for ice cream sales is about 0.26 (p < 0.001). This means that for each additional unit of ice cream sales, the predicted number of shark attacks increases by about 0.26. If ice cream sales increase by 10 units, the predicted number of shark attacks increases by about 2.6. This positive and statistically significant relationship reflects correlation, not causation.
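With a single predictor, the regression and the correlation test describe the same relationship. As a small check using only the numbers printed above, the sketch below squares the correlation from the previous question to recover the R-squared, and converts the correlation into the t statistic reported for the slope.

```r
# Values copied from the outputs above
r        <- 0.4485975   # correlation between shark attacks and ice cream sales
df_resid <- 82          # residual degrees of freedom in the simple regression
slope    <- 0.25587     # estimated slope for IceCreamSales

# In simple regression, R-squared equals the squared correlation
r_squared <- r^2

# The t statistic for the slope can be computed from r alone
t_from_r <- r * sqrt(df_resid) / sqrt(1 - r^2)

# Predicted change in shark attacks for a 10-unit increase in ice cream sales
change_10 <- 10 * slope

r_squared
t_from_r
change_10
```

The p-value is also identical in both outputs (1.872e-05), which is why the slope test adds no new evidence beyond the correlation test here.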
Run the same regression you ran in Q9 but add Temperature as a second predictor variable. Re-interpret the coefficient you interpreted in Q9. What about your interpretation has changed? Why do you think that is the case?
question_8_to_10_data$SharkAttacks <- as.numeric(question_8_to_10_data$SharkAttacks)
question_8_to_10_data$IceCreamSales <- as.numeric(question_8_to_10_data$IceCreamSales)
question_8_to_10_data$Temperature <- as.numeric(question_8_to_10_data$Temperature)
model_q10 <- lm(SharkAttacks ~ IceCreamSales + Temperature, data = question_8_to_10_data)
summary(model_q10)
##
## Call:
## lm(formula = SharkAttacks ~ IceCreamSales + Temperature, data = question_8_to_10_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7397 -3.2470 0.0208 3.1581 16.8927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43421 4.07763 0.842 0.402
## IceCreamSales 0.04853 0.05229 0.928 0.356
## Temperature 1.40345 0.19178 7.318 1.63e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.921 on 81 degrees of freedom
## Multiple R-squared: 0.5192, Adjusted R-squared: 0.5073
## F-statistic: 43.73 on 2 and 81 DF, p-value: 1.322e-13
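When temperature is included in the regression, the coefficient for ice cream sales falls from about 0.26 to about 0.05 and is no longer statistically significant (p = 0.356), while temperature is a strong predictor (coefficient 1.40, p < 0.001). The interpretation also changes: the coefficient now describes the association between ice cream sales and shark attacks holding temperature constant, and that association is close to zero. This shows that the earlier link between ice cream sales and shark attacks was due to both increasing with warmer weather and not because one causes the other.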
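The same pattern can be reproduced with simulated data in which temperature drives both variables and ice cream sales have no direct effect on shark attacks. This is a minimal sketch with made-up parameters and lowercase variable names chosen for illustration. It is not the homework data.

```r
# Simulated confounding example (all parameters are assumptions for illustration)
set.seed(123)
n <- 200
temperature <- rnorm(n, mean = 25, sd = 5)

# Temperature drives both variables; ice cream has no direct effect on sharks
ice_cream <- 20 + 3 * temperature + rnorm(n, sd = 10)
sharks    <- 5 + 1.4 * temperature + rnorm(n, sd = 6)

# Without temperature, ice cream sales pick up the effect of warm weather
m1 <- lm(sharks ~ ice_cream)

# With temperature included, the ice cream coefficient shrinks toward zero
m2 <- lm(sharks ~ ice_cream + temperature)

coef(m1)["ice_cream"]
coef(m2)["ice_cream"]
```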