Load Packages and Functions

Run this chunk without editing to install and/or load any necessary packages along with the functions we’ve used in class.

if (!require("tidyverse")) install.packages("tidyverse")

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyverse)

if (!require("multcomp")) install.packages("multcomp")

## Loading required package: multcomp
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## Attaching package: 'TH.data'
## 
## The following object is masked from 'package:MASS':
## 
##     geyser

library(multcomp)

# Github Gist
source("https://gist.github.com/dankatz00/c50ff0bcb09a6b4f5997adee21e8f92f/raw/smart_t_and_z_tests.R")
source("https://gist.github.com/dankatz00/ec6f1446267a6f28adc76f6c72531cda/raw/summ_glht.R")

Load Data

Download “Unit_1_HW_Data.Rdata” from Canvas. Load that into your RStudio environment. Use question_1_data for Question 1, question_2_data for Question 2, … Note: Questions 8 through 10 all use question_8_to_10_data.

Directions

Read each question and run an appropriate analysis using the data from Canvas. For each question, give your conclusion and use the results of your analysis to support that conclusion. For example, do not simply say “yes they buy more.” You have to say what about your analysis leads you to that conclusion. Also, do not simply say “we reject the null hypothesis.” You have to use your analyses to answer the business questions that are being asked.

Question 1:

A company wants to analyze whether customers in the “North” region have an average spending different from $250. Do they?

Q1 Code

load("Unit_1_HW_Data.RData")

north_customers <- question_1_data[question_1_data$region == "North", ]

t_test_result <- t.test(north_customers$total_spent, mu = 250)

mean(north_customers$total_spent, na.rm = TRUE)

## [1] 254.5881

t_test_result

## 
##  One Sample t-test
## 
## data:  north_customers$total_spent
## t = 1.3906, df = 248, p-value = 0.1656
## alternative hypothesis: true mean is not equal to 250
## 95 percent confidence interval:
##  248.0899 261.0863
## sample estimates:
## mean of x 
##  254.5881

Q1 Conclusion

At the 5% significance level, there isn’t sufficient evidence to conclude that customers in the “North” region have an average spending different from $250 (t = 1.39, df = 248, p = 0.166). The sample mean is $254.59, and the 95% confidence interval ($248.09 to $261.09) includes $250, so the observed $4.59 difference is consistent with random sampling variation rather than a real effect.
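
As a cross-check on how the pieces of the t-test output fit together, the p-value and confidence interval can be rebuilt from the reported summary statistics alone. This is only a sketch: the numbers are copied from the output above, and the variable names are illustrative.

```r
# Summary values copied from the t.test() output above.
x_bar  <- 254.5881   # sample mean
mu_0   <- 250        # hypothesized mean
t_stat <- 1.3906     # reported t statistic
df     <- 248        # reported degrees of freedom (n - 1)

# Standard error implied by t = (x_bar - mu_0) / se.
se <- (x_bar - mu_0) / t_stat

# Two-sided p-value from the t distribution.
p_two_sided <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval: x_bar +/- critical value * se.
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se

p_two_sided
ci
```

Up to rounding, these should agree with the p-value and interval printed by t.test().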

Question 2:

A company wants to know if customers with a “12 months” subscription renew at a higher rate than other customers. Do they?

Q2 Code

question_2_data$renewed_numeric <- ifelse(question_2_data$renewed == "yes", 1, 0)

renew_12 <- question_2_data$renewed_numeric[question_2_data$subscription_length == "12_months"]
renew_other <- question_2_data$renewed_numeric[question_2_data$subscription_length != "12_months"]

mean_12 <- mean(renew_12)
mean_other <- mean(renew_other)

successes <- c(sum(renew_12), sum(renew_other))
totals <- c(length(renew_12), length(renew_other))

test_result <- prop.test(successes, totals, alternative = "greater")

mean_12

## [1] 0.5742188

mean_other

## [1] 0.5955882

test_result$p.value

## [1] 0.6900453

Q2 Conclusion

At the 5% significance level, there is not sufficient evidence to conclude that customers with a 12-month subscription renew at a higher rate than other customers (one-sided p = 0.69). In this sample the renewal rate for 12-month subscriptions (about 57.4%) is actually slightly lower than the rate for other subscription lengths (about 59.6%), so the data give no support to the idea that 12-month subscribers renew more often.

Question 3:

A company wants to compare the average purchase amount between “younger” (under 35) and “older” (50+) customers. Do they purchase similar or different amounts?

Q3 Code

colnames(question_3_data)

## [1] "customer_id"     "age"             "purchase_amount"

head(question_3_data)

younger <- question_3_data[question_3_data$age < 35, ]
older <- question_3_data[question_3_data$age >= 50, ]

mean_younger <- mean(younger$purchase_amount, na.rm = TRUE)
mean_older <- mean(older$purchase_amount, na.rm = TRUE)

t_test_result <- t.test(younger$purchase_amount, older$purchase_amount)

mean_younger

## [1] 252.9079

mean_older

## [1] 252.2502

t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  younger$purchase_amount and older$purchase_amount
## t = 0.16002, df = 610.7, p-value = 0.8729
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.413927  8.729291
## sample estimates:
## mean of x mean of y 
##  252.9079  252.2502

Q3 Conclusion

At the 5% significance level, there is no significant difference in average purchase amount between younger (under 35) and older (50+) customers (p = 0.87). Both groups spend about $253 on average, so their purchase amounts are similar.

Question 4:

Sales representatives’ performance is measured by their conversion rate. Salespeople have a target to have a conversion rate over 25%. Does the company’s average salesperson conversion rate hit this target?

Q4 Code

colnames(question_4_data)

## [1] "salesperson_id"  "conversion_rate"

head(question_4_data)

t_test_result <- t.test(question_4_data$conversion_rate, mu = 0.25, alternative = "greater")

mean_conversion <- mean(question_4_data$conversion_rate, na.rm = TRUE)
mean_conversion

## [1] 0.2590809

t_test_result

## 
##  One Sample t-test
## 
## data:  question_4_data$conversion_rate
## t = 0.85916, df = 299, p-value = 0.1955
## alternative hypothesis: true mean is greater than 0.25
## 95 percent confidence interval:
##  0.2416417       Inf
## sample estimates:
## mean of x 
## 0.2590809

Q4 Conclusion

At the 5% significance level, there is not sufficient evidence that the company’s average salesperson conversion rate exceeds the 25% target (t = 0.86, df = 299, one-sided p = 0.20). The sample average is 25.9%, which is above 25%, but the gap is small enough to be explained by sampling variation, so we cannot conclude that the company is hitting its target.

Question 5:

An advertising team tests 500 impressions each for two different ads (A and B). For each impression, they measure whether the ad was clicked. Their goal is for the proportion of impressions that resulted in a click to exceed 25% for each ad. Did they meet their goal?

Q5 Code

question_5_data$clicked_numeric <- ifelse(question_5_data$clicked == "yes", 1, 0)

click_rate_A <- mean(question_5_data$clicked_numeric[question_5_data$ad == "A"])
click_rate_B <- mean(question_5_data$clicked_numeric[question_5_data$ad == "B"])

prop_test_A <- prop.test(
  sum(question_5_data$clicked_numeric[question_5_data$ad == "A"]),
  n = 500,  # 500 impressions per ad
  p = 0.25,
  alternative = "greater"
)

prop_test_B <- prop.test(
  sum(question_5_data$clicked_numeric[question_5_data$ad == "B"]),
  n = 500,
  p = 0.25,
  alternative = "greater"
)

click_rate_A

## [1] 0.288

click_rate_B

## [1] 0.24

prop_test_A$p.value

## [1] 0.02802339

prop_test_B$p.value

## [1] 0.6789476

Q5 Conclusion

Ad A achieved a statistically significant click-through rate above the 25% target (28.8%, p = 0.028), so the goal was met for Ad A. Ad B did not achieve a statistically significant click-through rate above the 25% target (24.0%, p = 0.679), so the goal wasn’t met for Ad B.
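
prop.test() relies on a normal approximation, so an exact binomial test is one possible cross-check. The sketch below assumes the click counts implied by the reported rates (0.288 × 500 = 144 for Ad A and 0.24 × 500 = 120 for Ad B); the object names are illustrative.

```r
# Click counts implied by the reported click rates out of 500 impressions.
clicks_A <- 144   # 0.288 * 500
clicks_B <- 120   # 0.24 * 500

# Exact one-sided binomial tests of H0: p = 0.25 against H1: p > 0.25.
exact_A <- binom.test(clicks_A, n = 500, p = 0.25, alternative = "greater")
exact_B <- binom.test(clicks_B, n = 500, p = 0.25, alternative = "greater")

exact_A$p.value
exact_B$p.value
```

If the exact p-values fall on the same side of 0.05 as the prop.test() results, the conclusions for each ad do not depend on the approximation.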

Question 6:

Customers are grouped into “Low”, “Mid”, and “High” traffic tiers based on how much they visit a company’s website. The company wants to know if high-traffic customers spend more or less than mid- and low-traffic customers combined (i.e., NOT the average of mid- and low-traffic customers). Do they?

Q6 Code

colnames(question_6_data)

## [1] "customer_id"  "traffic_tier" "total_spent"

head(question_6_data)

total_high <- sum(question_6_data$total_spent[question_6_data$traffic_tier == "High"], na.rm = TRUE)
total_lowmid <- sum(question_6_data$total_spent[question_6_data$traffic_tier != "High"], na.rm = TRUE)

total_high

## [1] 292833.3

total_lowmid

## [1] 314654.6

Q6 Conclusion

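High-traffic customers spent a total of $292,833.30, while mid- and low-traffic customers combined spent $314,654.60. In total, then, high-traffic customers spent less than mid- and low-traffic customers combined. Totals depend on how many customers are in each group, however, and no significance test was run here, so this comparison alone does not show whether a typical high-traffic customer spends more or less than a typical mid- or low-traffic customer.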
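
Since the question is about how much customers spend, one way to approach it is to compare per-customer means between the High tier and the pooled Low and Mid tiers. The sketch below uses simulated data as a stand-in for question_6_data; the values, the sim_data object, and the is_high column are all hypothetical.

```r
# Simulated stand-in for question_6_data; all values are hypothetical.
set.seed(1)
sim_data <- data.frame(
  traffic_tier = rep(c("Low", "Mid", "High"), times = c(40, 40, 30)),
  total_spent  = c(rnorm(40, 200, 30), rnorm(40, 220, 30), rnorm(30, 260, 30))
)

# Pool Low and Mid into one group by flagging High vs. everything else.
sim_data$is_high <- sim_data$traffic_tier == "High"

# Welch two-sample t-test on per-customer spending
# (two-sided, because the question asks "more or less").
q6_test <- t.test(total_spent ~ is_high, data = sim_data)

q6_test$estimate   # group means: FALSE = Low + Mid pooled, TRUE = High
q6_test$p.value
```

Pooling the rows weights every Low and Mid customer equally, which is what "combined" asks for, as opposed to averaging the two tier means.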

Question 7:

For this company, how does their price point (measured in dollars) affect customer retention rate (measured as a percentage)? To answer, run a regression and interpret the output.

Q7 Code

colnames(question_7_data)

## [1] "price_point"    "retention_rate"

head(question_7_data)

model <- lm(retention_rate ~ price_point, data = question_7_data)
summary(model)

## 
## Call:
## lm(formula = retention_rate ~ price_point, data = question_7_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.096653 -0.031468 -0.009649  0.032421  0.142163 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.942541   0.024817  37.980  < 2e-16 ***
## price_point -0.017225   0.001769  -9.739 4.47e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04944 on 98 degrees of freedom
## Multiple R-squared:  0.4918, Adjusted R-squared:  0.4866 
## F-statistic: 94.84 on 1 and 98 DF,  p-value: 4.472e-16

Q7 Conclusion

The regression shows that as the price point increases, customer retention rate decreases. The coefficient for price_point is about -0.017 (p < 0.001). Retention rate appears to be recorded as a proportion (the intercept is 0.94), so each $1 increase in price is associated with a drop of about 0.017 in retention rate, or roughly 1.7 percentage points. This negative relationship is statistically significant, and price point explains about 49% of the variation in retention rate (R-squared = 0.49), indicating that higher prices are associated with lower customer retention for this company.
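
To make the slope concrete, the fitted line can be evaluated by hand. In this sketch the coefficients are copied from the summary(model) output above; the 0 to 1 scale for retention_rate is an assumption based on the size of the intercept, and the $10 and $15 price points are hypothetical and may lie outside the observed price range.

```r
# Coefficients copied from the summary(model) output above.
intercept <- 0.942541
slope     <- -0.017225

# Assuming retention_rate is on a 0-1 scale, multiply by 100
# to express the per-dollar change in percentage points.
slope_pct_points <- slope * 100
slope_pct_points

# Predicted retention at two hypothetical price points ($10 and $15).
prices    <- c(10, 15)
predicted <- intercept + slope * prices
predicted
```

The difference between the two predictions is five times the slope, since the prices differ by $5.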

Question 8:

Calculate the correlation between shark attacks and ice cream sales and determine if that correlation is statistically significant.

Q8 Code

question_8_to_10_data$SharkAttacks <- as.numeric(question_8_to_10_data$SharkAttacks)
question_8_to_10_data$IceCreamSales <- as.numeric(question_8_to_10_data$IceCreamSales)

cor_test_result <- cor.test(
  question_8_to_10_data$SharkAttacks,
  question_8_to_10_data$IceCreamSales
)

cor_test_result$estimate  # Correlation coefficient (r)

##       cor 
## 0.4485975

cor_test_result$p.value   # p-value

## [1] 1.872235e-05

Q8 Conclusion

There is a statistically significant positive correlation between shark attacks and ice cream sales (r = 0.45, p < 0.001). This means that as ice cream sales increase, shark attacks also tend to increase. However, this correlation does not imply causation; it is more likely that both variables are influenced by a third factor, such as warmer weather or increased beach attendance.

Question 9:

Run a regression to predict the number of shark attacks using the number of ice cream sales. Interpret the slope coefficient.

Q9 Code

question_8_to_10_data$SharkAttacks <- as.numeric(question_8_to_10_data$SharkAttacks)
question_8_to_10_data$IceCreamSales <- as.numeric(question_8_to_10_data$IceCreamSales)

model <- lm(SharkAttacks ~ IceCreamSales, data = question_8_to_10_data)
summary(model)

## 
## Call:
## lm(formula = SharkAttacks ~ IceCreamSales, data = question_8_to_10_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.6780  -5.1545  -0.4234   6.1811  17.6718 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.44069    5.03182   2.274   0.0256 *  
## IceCreamSales  0.25587    0.05629   4.545 1.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.585 on 82 degrees of freedom
## Multiple R-squared:  0.2012, Adjusted R-squared:  0.1915 
## F-statistic: 20.66 on 1 and 82 DF,  p-value: 1.872e-05

Q9 Conclusion

The regression slope coefficient for ice cream sales is about 0.26 (p < 0.001). This means that for each additional unit of ice cream sales, the predicted number of shark attacks increases by about 0.26. If ice cream sales increase by 10 units, the predicted number of shark attacks increases by about 2.6. Ice cream sales explain about 20% of the variation in shark attacks (R-squared = 0.20). This positive and statistically significant relationship reflects correlation, not causation.
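
With a single predictor, the regression slope and the correlation carry the same information on different scales: the slope equals r times sd(y) / sd(x), and R-squared equals r squared. The sketch below illustrates this with simulated data; the variables x and y and their parameters are hypothetical and are not the homework data.

```r
# Simulated data; values are hypothetical stand-ins for sales and attacks.
set.seed(42)
x <- rnorm(84, mean = 90, sd = 15)
y <- 10 + 0.25 * x + rnorm(84, sd = 7)

fit <- lm(y ~ x)
r   <- cor(x, y)

# Slope from the fitted model vs. slope rebuilt from the correlation.
slope_from_lm <- unname(coef(fit)["x"])
slope_from_r  <- r * sd(y) / sd(x)

all.equal(slope_from_lm, slope_from_r)
all.equal(summary(fit)$r.squared, r^2)
```

This is why the p-value for the slope in a one-predictor regression matches the p-value from cor.test() on the same two variables.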

Question 10:

Run the same regression you ran in Q9 but add Temperature as a second predictor variable. Re-interpret the coefficient you interpreted in Q9. What about your interpretation has changed? Why do you think that is the case?

Q10 Code

question_8_to_10_data$SharkAttacks <- as.numeric(question_8_to_10_data$SharkAttacks)
question_8_to_10_data$IceCreamSales <- as.numeric(question_8_to_10_data$IceCreamSales)
question_8_to_10_data$Temperature <- as.numeric(question_8_to_10_data$Temperature)

model_q10 <- lm(SharkAttacks ~ IceCreamSales + Temperature, data = question_8_to_10_data)
summary(model_q10)

## 
## Call:
## lm(formula = SharkAttacks ~ IceCreamSales + Temperature, data = question_8_to_10_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7397  -3.2470   0.0208   3.1581  16.8927 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.43421    4.07763   0.842    0.402    
## IceCreamSales  0.04853    0.05229   0.928    0.356    
## Temperature    1.40345    0.19178   7.318 1.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.921 on 81 degrees of freedom
## Multiple R-squared:  0.5192, Adjusted R-squared:  0.5073 
## F-statistic: 43.73 on 2 and 81 DF,  p-value: 1.322e-13

Q10 Conclusion

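When temperature is included in the regression, the coefficient for ice cream sales falls from about 0.26 to about 0.05 and is no longer statistically significant (p = 0.356), while temperature is a strong predictor (coefficient of about 1.40, p < 0.001). The ice cream coefficient is now interpreted as the change in predicted shark attacks per additional unit of ice cream sales, holding temperature constant. This suggests that the earlier link between ice cream sales and shark attacks was largely due to both increasing with warmer weather and not because one causes the other.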
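
The same pattern can be reproduced in a small simulation where temperature drives both variables and ice cream sales have no effect of their own. This is a sketch under stated assumptions: every variable, coefficient, and sample size below is made up for illustration.

```r
# Simulated confounding: temperature drives both outcomes (hypothetical values).
set.seed(123)
n <- 200
temperature <- rnorm(n, mean = 25, sd = 5)
ice_cream   <- 20 + 3 * temperature + rnorm(n, sd = 10)
sharks      <- 5 + 1.4 * temperature + rnorm(n, sd = 6)  # no ice cream term

# Regression without and with the confounder.
simple   <- lm(sharks ~ ice_cream)
adjusted <- lm(sharks ~ ice_cream + temperature)

coef_simple   <- unname(coef(simple)["ice_cream"])
coef_adjusted <- unname(coef(adjusted)["ice_cream"])

coef_simple
coef_adjusted
```

In the simple model the ice cream coefficient absorbs the effect of temperature; once temperature is in the model, that coefficient should shrink toward zero.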
