Hw8_Regression_Abhilasha

ISyE 6414 Regression Analysis | Spring 2023 Homework 8 For all of the problems in this assignment, please submit the relevant outputs and your R codes (if you used R). In addition, unless otherwise indicated, assume that = 0:05. For the following problems, refer to the le 6414-HW8-Shopper.csv. This data set con- sists of shopper data from the trial launch of a new product. The company has introduced a new product and they think this one will be the type that people buy multiple times a week, so they’re interested in modeling the amount purchased per customer. PriorWeekPurchase | If the customer bought one of the company’s products in the previous week LastWeekCompPurchase | If the customer bought one of the company’s main competitor’s products in the previous week Age | The age of the customer Region | The region in which the store is located Count | Number of people surveyed PurchaseCount (Yi) | Number of purchases made this week 1. Generate box plots for purchase counts per customer vs. age groups. Do you think they are identical?

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

shopper_data <- read.csv("6414-HW8-Shopper.csv")
shopper_data$Age <- factor(shopper_data$Age)
ggplot(shopper_data, aes(x=Age, y=PurchaseCount)) +
  geom_boxplot() +
  xlab("Age Group") +
  ylab("Purchase Count") +
  ggtitle("Purchase Counts per Customer vs. Age Groups")

Generate a bar chart of counts vs. regions. Use a dierent color for each region.

# Create a data frame with the counts by region
region_counts <- aggregate(Count ~ Region, data = shopper_data, FUN = sum)

# Create a bar chart with different colors for each region
ggplot(region_counts, aes(x = Region, y = Count, fill = Region)) +
  geom_bar(stat = "identity", width =0.5) +
  labs(title = "Counts by Region", x = "Region", y = "Count") +
  scale_fill_manual(values = c("darkred", "darkblue", "darkgreen", "purple"))

Construct and solve a Poisson regression model using PurchaseCount as the response variable; use PriorWeekPurchase, LastWeekCompPurchase, Age, and Region as the predictor variables, and use Count as the exposure variable. Submit your output.

model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region, data = shopper_data, family = "poisson", offset = log(Count))
summary(model)

## 
## Call:
## glm(formula = PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + 
##     Age + Region, family = "poisson", data = shopper_data, offset = log(Count))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.48516  -0.65035  -0.06216   0.63731   2.77040  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.75055    0.04021  18.665  < 2e-16 ***
## PriorWeekPurchase     0.39280    0.02678  14.669  < 2e-16 ***
## LastWeekCompPurchase -0.21569    0.02663  -8.099 5.54e-16 ***
## Age25 to 34           0.06285    0.03586   1.753  0.07967 .  
## Age35 to 44          -0.04930    0.03686  -1.337  0.18110    
## Age45+               -0.24960    0.03929  -6.352 2.12e-10 ***
## Regionb               0.11974    0.03914   3.060  0.00222 ** 
## Regionc               0.28095    0.03760   7.472 7.92e-14 ***
## Regiond               0.17495    0.03876   4.513 6.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 492.205  on 63  degrees of freedom
## Residual deviance:  70.202  on 55  degrees of freedom
## AIC: 490.66
## 
## Number of Fisher Scoring iterations: 4

Conduct an appropriate 2 hypothesis test for the significance of overall regression.

anova(model, test = "Chi")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: PurchaseCount
## 
## Terms added sequentially (first to last)
## 
## 
##                      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                    63     492.20              
## PriorWeekPurchase     1  224.558        62     267.65 < 2.2e-16 ***
## LastWeekCompPurchase  1   68.225        61     199.42 < 2.2e-16 ***
## Age                   3   70.702        58     128.72 3.019e-15 ***
## Region                3   58.517        55      70.20 1.219e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conduct an appropriate 2 hypothesis test to test whether Age is a signicant variable.

# full model with all predictor variables
full_model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region, family = poisson(), data = shopper_data)

# reduced model without Age
reduced_model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Region, family = poisson(), data = shopper_data)

# likelihood ratio test
anova(full_model, reduced_model, test = "Chi")

## Analysis of Deviance Table
## 
## Model 1: PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + 
##     Region
## Model 2: PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Region
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        55     175.43                          
## 2        58     267.18 -3  -91.743 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see in the output a chi-squared test statistic of 91.742 and a p-value of 2.2e-16. This indicates strong evidence against the null hypothesis that the reduced model is a better fit than the full model, and therefore we can say that Age is a significant predictor of PurchaseCount.

By looking at the p-values, state whether Region is a significant variable or not.Why?

In the Poisson regression output, the p-values for the Region variables are less than 0.05, which indicates that Region is a significant variable. A p-value less than 0.05 indicates that there is strong evidence against the null hypothesis, which states that the coefficient for that variable is zero. Therefore, we can conclude that the Region variable has a significant effect on the PurchaseCount response variable.

What is the rate ratio of purchase counts per week for the new product for case prior week purchase took place to prior week purchase didn’t take place, holding other variables constant?

# Fit the Poisson regression model
poisson_model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region + Count, data = shopper_data, family = "poisson")
poisson_model

## 
## Call:  glm(formula = PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + 
##     Age + Region + Count, family = "poisson", data = shopper_data)
## 
## Coefficients:
##          (Intercept)     PriorWeekPurchase  LastWeekCompPurchase  
##              3.44476               0.37384              -0.20940  
##          Age25 to 34           Age35 to 44                Age45+  
##              0.06600              -0.04325              -0.25071  
##              Regionb               Regionc               Regiond  
##              0.11675               0.28574               0.17403  
##                Count  
##              0.02432  
## 
## Degrees of Freedom: 63 Total (i.e. Null);  54 Residual
## Null Deviance:       472.2 
## Residual Deviance: 64.78     AIC: 487.2

# Calculate the rate ratio
rate_ratio <- exp(coef(poisson_model)["PriorWeekPurchase"])
rate_ratio

## PriorWeekPurchase 
##          1.453306

Therefore, the rate ratio of purchase counts per week for the new product for the case where prior week purchase took place to the case where prior week purchase didn’t take place, holding other variables constant, is 1.453306.

What is the log expected purchase counts per week for the new product for case prior week purchase from competitor took place vs. it did not?

new_data_1 <- data.frame(PriorWeekPurchase = 1, LastWeekCompPurchase = 1, Age = "<25", Region = "a", Count = 37)
log_expected_1 <- predict(poisson_model, new_data_1, type = "response", se.fit = FALSE, interval = "none", terms = NULL, na.action = na.pass)

new_data_2 <- data.frame(PriorWeekPurchase = 0, LastWeekCompPurchase = 0, Age = "<25", Region = "a", Count = 37)

log_expected_2 <- predict(poisson_model, new_data_2, type = "response", se.fit = FALSE, interval = "none", terms = NULL, na.action = na.pass)

expected_1 <- exp(log_expected_1)
expected_2 <- exp(log_expected_2)

# printing the results
cat("Expected purchase counts per week (log scale) for case where prior week purchase from competitor took place: ", round(log_expected_1, 2), "\n")

## Expected purchase counts per week (log scale) for case where prior week purchase from competitor took place:  90.84

cat("Expected purchase counts per week (log scale) for case where prior week purchase from competitor did not take place: ", round(log_expected_2, 2), "\n")

## Expected purchase counts per week (log scale) for case where prior week purchase from competitor did not take place:  77.07

cat("Expected purchase counts per week for case where prior week purchase from competitor took place: ", round(expected_1, 2), "\n")

## Expected purchase counts per week for case where prior week purchase from competitor took place:  2.829794e+39

cat("Expected purchase counts per week for case where prior week purchase from competitor did not take place: ", round(expected_2, 2), "\n")

## Expected purchase counts per week for case where prior week purchase from competitor did not take place:  2.949253e+33

Conduct goodness-of-t hypothesis tests using (i) the deviance residuals, and then

the Pearson residuals. What conclusions would you make using = 0:05? What conclusions would you make using = 0:01?
Goodness-of-fit using Deviance Residuals:

The null hypothesis is that the model fits the data well, and the alternative hypothesis is that the model does not fit the data well.

model <- glm(PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region + Count, data = shopper_data, family = poisson())

# Obtain the deviance residuals
residuals <- residuals(model, type = "deviance")


# Compute the test statistic and the p-value
test_stat <- sum(residuals^2)
df <- nrow(data) - length(model$coefficients)
p_value <- pchisq(test_stat, df, lower.tail = FALSE)

# Print the results
cat("Chi-squared test using deviance residuals:\n")

## Chi-squared test using deviance residuals:

cat("Test statistic:", round(test_stat, 4), "\n")

## Test statistic: 64.7824

cat("Degrees of freedom:", df, "\n")

## Degrees of freedom:

cat("p-value:", p_value, "\n")

## p-value:

# Obtain the Pearson residuals
residuals <- residuals(model, type = "pearson")

# Compute the test statistic and the p-value
test_stat <- sum(residuals^2)
df <- nrow(data) - length(model$coefficients)
p_value <- pchisq(test_stat, df, lower.tail = FALSE)

# Print the results
cat("Chi-squared test using Pearson residuals:\n")

## Chi-squared test using Pearson residuals:

cat("Test statistic:", round(test_stat, 4), "\n")

## Test statistic: 64.7332

cat("Degrees of freedom:", df, "\n")

## Degrees of freedom:

cat("p-value:", p_value, "\n")

## p-value:

Both tests give us a very small p-values, indicating strong evidence against the null hypothesis that the model fits the data well. Therefore, we can conclude that the model does not fit the data well.

Using a significance level of 0.05, we reject the null hypothesis and conclude that the model does not fit the data well. The same conclusion can be reached using a significance level of 0.01.

What is the overdispersion parameter, ^? Is overdispersion an issue here?

The amount of extra variation in the data in comparison to ehat would be expected aunder the suumed model. It is estomated as a ratio of residual deviace to residual degree of freedom. In this sceneio, deviance is 64.78 and overdispersion parameter is 64.78 / 54 = 1.2015. As we can see that overdispersion can be an issue if it is greater than the phi value of 1. Since the value is slightly over the required number, we can say that there should not be a major issue.

Hw8_Regression_Abhilasha

Abhilasha

2023-04-10