ISyE 6414 Regression Analysis | Spring 2023 Homework 8 For all of the problems in this assignment, please submit the relevant outputs and your R codes (if you used R). In addition, unless otherwise indicated, assume that = 0:05. For the following problems, refer to the le 6414-HW8-Shopper.csv. This data set con- sists of shopper data from the trial launch of a new product. The company has introduced a new product and they think this one will be the type that people buy multiple times a week, so they’re interested in modeling the amount purchased per customer. PriorWeekPurchase | If the customer bought one of the company’s products in the previous week LastWeekCompPurchase | If the customer bought one of the company’s main competitor’s products in the previous week Age | The age of the customer Region | The region in which the store is located Count | Number of people surveyed PurchaseCount (Yi) | Number of purchases made this week 1. Generate box plots for purchase counts per customer vs. age groups. Do you think they are identical?
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
shopper_data <- read.csv("6414-HW8-Shopper.csv")
shopper_data$Age <- factor(shopper_data$Age)
ggplot(shopper_data, aes(x=Age, y=PurchaseCount)) +
geom_boxplot() +
xlab("Age Group") +
ylab("Purchase Count") +
ggtitle("Purchase Counts per Customer vs. Age Groups")
# Create a data frame with the counts by region
region_counts <- aggregate(Count ~ Region, data = shopper_data, FUN = sum)
# Create a bar chart with different colors for each region
ggplot(region_counts, aes(x = Region, y = Count, fill = Region)) +
geom_bar(stat = "identity", width =0.5) +
labs(title = "Counts by Region", x = "Region", y = "Count") +
scale_fill_manual(values = c("darkred", "darkblue", "darkgreen", "purple"))
model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region, data = shopper_data, family = "poisson", offset = log(Count))
summary(model)
##
## Call:
## glm(formula = PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase +
## Age + Region, family = "poisson", data = shopper_data, offset = log(Count))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.48516 -0.65035 -0.06216 0.63731 2.77040
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.75055 0.04021 18.665 < 2e-16 ***
## PriorWeekPurchase 0.39280 0.02678 14.669 < 2e-16 ***
## LastWeekCompPurchase -0.21569 0.02663 -8.099 5.54e-16 ***
## Age25 to 34 0.06285 0.03586 1.753 0.07967 .
## Age35 to 44 -0.04930 0.03686 -1.337 0.18110
## Age45+ -0.24960 0.03929 -6.352 2.12e-10 ***
## Regionb 0.11974 0.03914 3.060 0.00222 **
## Regionc 0.28095 0.03760 7.472 7.92e-14 ***
## Regiond 0.17495 0.03876 4.513 6.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 492.205 on 63 degrees of freedom
## Residual deviance: 70.202 on 55 degrees of freedom
## AIC: 490.66
##
## Number of Fisher Scoring iterations: 4
anova(model, test = "Chi")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: PurchaseCount
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 63 492.20
## PriorWeekPurchase 1 224.558 62 267.65 < 2.2e-16 ***
## LastWeekCompPurchase 1 68.225 61 199.42 < 2.2e-16 ***
## Age 3 70.702 58 128.72 3.019e-15 ***
## Region 3 58.517 55 70.20 1.219e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# full model with all predictor variables
full_model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region, family = poisson(), data = shopper_data)
# reduced model without Age
reduced_model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Region, family = poisson(), data = shopper_data)
# likelihood ratio test
anova(full_model, reduced_model, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age +
## Region
## Model 2: PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Region
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 55 175.43
## 2 58 267.18 -3 -91.743 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see in the output a chi-squared test statistic of 91.742 and a p-value of 2.2e-16. This indicates strong evidence against the null hypothesis that the reduced model is a better fit than the full model, and therefore we can say that Age is a significant predictor of PurchaseCount.
In the Poisson regression output, the p-values for the Region variables are less than 0.05, which indicates that Region is a significant variable. A p-value less than 0.05 indicates that there is strong evidence against the null hypothesis, which states that the coefficient for that variable is zero. Therefore, we can conclude that the Region variable has a significant effect on the PurchaseCount response variable.
# Fit the Poisson regression model
poisson_model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region + Count, data = shopper_data, family = "poisson")
poisson_model
##
## Call: glm(formula = PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase +
## Age + Region + Count, family = "poisson", data = shopper_data)
##
## Coefficients:
## (Intercept) PriorWeekPurchase LastWeekCompPurchase
## 3.44476 0.37384 -0.20940
## Age25 to 34 Age35 to 44 Age45+
## 0.06600 -0.04325 -0.25071
## Regionb Regionc Regiond
## 0.11675 0.28574 0.17403
## Count
## 0.02432
##
## Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
## Null Deviance: 472.2
## Residual Deviance: 64.78 AIC: 487.2
# Calculate the rate ratio
rate_ratio <- exp(coef(poisson_model)["PriorWeekPurchase"])
rate_ratio
## PriorWeekPurchase
## 1.453306
Therefore, the rate ratio of purchase counts per week for the new product for the case where prior week purchase took place to the case where prior week purchase didn’t take place, holding other variables constant, is 1.453306.
new_data_1 <- data.frame(PriorWeekPurchase = 1, LastWeekCompPurchase = 1, Age = "<25", Region = "a", Count = 37)
log_expected_1 <- predict(poisson_model, new_data_1, type = "response", se.fit = FALSE, interval = "none", terms = NULL, na.action = na.pass)
new_data_2 <- data.frame(PriorWeekPurchase = 0, LastWeekCompPurchase = 0, Age = "<25", Region = "a", Count = 37)
log_expected_2 <- predict(poisson_model, new_data_2, type = "response", se.fit = FALSE, interval = "none", terms = NULL, na.action = na.pass)
expected_1 <- exp(log_expected_1)
expected_2 <- exp(log_expected_2)
# printing the results
cat("Expected purchase counts per week (log scale) for case where prior week purchase from competitor took place: ", round(log_expected_1, 2), "\n")
## Expected purchase counts per week (log scale) for case where prior week purchase from competitor took place: 90.84
cat("Expected purchase counts per week (log scale) for case where prior week purchase from competitor did not take place: ", round(log_expected_2, 2), "\n")
## Expected purchase counts per week (log scale) for case where prior week purchase from competitor did not take place: 77.07
cat("Expected purchase counts per week for case where prior week purchase from competitor took place: ", round(expected_1, 2), "\n")
## Expected purchase counts per week for case where prior week purchase from competitor took place: 2.829794e+39
cat("Expected purchase counts per week for case where prior week purchase from competitor did not take place: ", round(expected_2, 2), "\n")
## Expected purchase counts per week for case where prior week purchase from competitor did not take place: 2.949253e+33
The null hypothesis is that the model fits the data well, and the alternative hypothesis is that the model does not fit the data well.
model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region + Count, data = shopper_data, family = poisson())
# Obtain the deviance residuals
residuals <- residuals(model, type = "deviance")
# Compute the test statistic and the p-value
test_stat <- sum(residuals^2)
df <- nrow(data) - length(model$coefficients)
p_value <- pchisq(test_stat, df, lower.tail = FALSE)
# Print the results
cat("Chi-squared test using deviance residuals:\n")
## Chi-squared test using deviance residuals:
cat("Test statistic:", round(test_stat, 4), "\n")
## Test statistic: 64.7824
cat("Degrees of freedom:", df, "\n")
## Degrees of freedom:
cat("p-value:", p_value, "\n")
## p-value:
# Obtain the Pearson residuals
residuals <- residuals(model, type = "pearson")
# Compute the test statistic and the p-value
test_stat <- sum(residuals^2)
df <- nrow(data) - length(model$coefficients)
p_value <- pchisq(test_stat, df, lower.tail = FALSE)
# Print the results
cat("Chi-squared test using Pearson residuals:\n")
## Chi-squared test using Pearson residuals:
cat("Test statistic:", round(test_stat, 4), "\n")
## Test statistic: 64.7332
cat("Degrees of freedom:", df, "\n")
## Degrees of freedom:
cat("p-value:", p_value, "\n")
## p-value:
Both tests give us a very small p-values, indicating strong evidence against the null hypothesis that the model fits the data well. Therefore, we can conclude that the model does not fit the data well.
Using a significance level of 0.05, we reject the null hypothesis and conclude that the model does not fit the data well. The same conclusion can be reached using a significance level of 0.01.
The amount of extra variation in the data in comparison to ehat would be expected aunder the suumed model. It is estomated as a ratio of residual deviace to residual degree of freedom. In this sceneio, deviance is 64.78 and overdispersion parameter is 64.78 / 54 = 1.2015. As we can see that overdispersion can be an issue if it is greater than the phi value of 1. Since the value is slightly over the required number, we can say that there should not be a major issue.