Homework 8

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.4.2

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.2

## Warning: package 'ggplot2' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(caret)

## Warning: package 'caret' was built under R version 4.4.2

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(tsibble)

## Warning: package 'tsibble' was built under R version 4.4.3

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(leaps)

## Warning: package 'leaps' was built under R version 4.4.3

Question 5:

setwd("C:\\Users\\srini\\OneDrive\\Desktop\\Regression Analysis\\HW 8")
product_data = read.csv("6414-hw8-s25-ProductData.csv")

product_data$Age <- as.factor(product_data$Age)
product_data$Region <- as.factor(product_data$Region)

logit_model <- glm(
  Purchase / Count ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region,
  family = binomial(),
  weights = Count,
  data = product_data
)

summary(logit_model)

## 
## Call:
## glm(formula = Purchase/Count ~ PriorWeekPurchase + LastWeekCompPurchase + 
##     Age + Region, family = binomial(), data = product_data, weights = Count)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.32746    0.12373  -2.646  0.00813 ** 
## PriorWeekPurchase     0.46121    0.08509   5.421 5.94e-08 ***
## LastWeekCompPurchase -0.23903    0.08544  -2.797  0.00515 ** 
## Age25 to 34           0.13297    0.11715   1.135  0.25637    
## Age35 to 44          -0.18594    0.12070  -1.540  0.12346    
## Age45+               -0.28506    0.12182  -2.340  0.01929 *  
## Regionb               0.17474    0.12030   1.453  0.14634    
## Regionc               0.19731    0.11939   1.653  0.09840 .  
## Regiond               0.23061    0.12101   1.906  0.05669 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 139.720  on 63  degrees of freedom
## Residual deviance:  82.611  on 55  degrees of freedom
## AIC: 353.52
## 
## Number of Fisher Scoring iterations: 4

The logistic regression model shows that PriorWeekPurchase has a strong and significant positive effect on the likelihood of purchasing the new product (p < 0.001). LastWeekCompPurchase has a significant negative effect (p ≈ 0.005), suggesting competitor purchases reduce the likelihood. Age and Region effects are not all significant, though Age45+ shows a significant negative impact (p ≈ 0.013).

Question 6:

null_model <- glm(
  Purchase / Count ~ 1,
  family = binomial(),
  weights = Count,
  data = product_data
)

anova(null_model, logit_model, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Purchase/Count ~ 1
## Model 2: Purchase/Count ~ PriorWeekPurchase + LastWeekCompPurchase + Age + 
##     Region
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        63    139.720                          
## 2        55     82.611  8   57.109 1.714e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null Hypothesis (H0): None of the predictors (PriorWeekPurchase, LastWeekCompPurchase, Age, Region) are associated with the likelihood of purchasing the new product — i.e., the intercept-only model fits as well as the full model. Alternative Hypothesis (H1): At least one predictor is significantly associated with the likelihood of purchasing the new product.

From the likelihood ratio test output:

Deviance difference: 57.109

Degrees of freedom: 8

p-value: 1.714e-09

Since the p-value is far below 0.05, we reject the null hypothesis.

Question 7:

deviance_val <- deviance(logit_model)
df_resid <- df.residual(logit_model)
pval_deviance <- 1 - pchisq(deviance_val, df_resid)
c(Deviance = deviance_val, DF = df_resid, p_Deviance = pval_deviance)

##     Deviance           DF   p_Deviance 
## 82.610828697 55.000000000  0.009396075

The deviance goodness-of-fit test gives a p-value of 0.0094 with 55 degrees of freedom. At α = 0.05, we reject the null hypothesis, indicating some lack of fit. However, at α = 0.01, we fail to reject the null, suggesting the model fits reasonably well at a stricter significance level.

pearson_resid <- residuals(logit_model, type = "pearson")
pearson_chisq <- sum(pearson_resid^2)
pval_pearson <- 1 - pchisq(pearson_chisq, df_resid)
c(Pearson_ChiSq = pearson_chisq, p_Pearson = pval_pearson)

## Pearson_ChiSq     p_Pearson 
##   81.10777954    0.01256744

The Pearson goodness-of-fit test yields a chi-square value of 81.11 with a p-value of 0.0126. At α = 0.05, we reject the null hypothesis, suggesting some evidence of lack of fit. However, at α = 0.01, we fail to reject the null, indicating acceptable model fit under a stricter criterion.

Question 8:

poisson_model <- glm(
  PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + Region,
  family = poisson(),
  offset = log(Count),
  data = product_data
)

summary(poisson_model)

## 
## Call:
## glm(formula = PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + 
##     Age + Region, family = poisson(), data = product_data, offset = log(Count))
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.753456   0.039968  18.852  < 2e-16 ***
## PriorWeekPurchase     0.321762   0.026777  12.016  < 2e-16 ***
## LastWeekCompPurchase -0.242606   0.026709  -9.083  < 2e-16 ***
## Age25 to 34           0.029001   0.035961   0.806  0.41998    
## Age35 to 44           0.007244   0.036836   0.197  0.84410    
## Age45+               -0.193149   0.039293  -4.916 8.85e-07 ***
## Regionb               0.121861   0.039135   3.114  0.00185 ** 
## Regionc               0.255937   0.037592   6.808 9.87e-12 ***
## Regiond               0.175835   0.038765   4.536 5.73e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 438.37  on 63  degrees of freedom
## Residual deviance: 108.55  on 55  degrees of freedom
## AIC: 529.01
## 
## Number of Fisher Scoring iterations: 4

The model indicates that prior company purchases significantly increase purchase rates, while competitor purchases and being older (45+) reduce them. Region also plays a role, with regions b, c, and d showing higher purchase rates than region a.

Intercept = 0.7535

PriorWeekPurchase = 0.3218

LastWeekCompPurchase = -0.2426

Age25 to 34 = 0.0290

Age35 to 44 = 0.0246

Age45+ = -0.1931

Regionb = 0.1219

Regionc = 0.2559

Regiond = 0.1758

Question 9:

poisson_null <- glm(
  PurchaseCount ~ 1,
  family = poisson(),
  offset = log(Count),
  data = product_data
)

anova(poisson_null, poisson_model, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: PurchaseCount ~ 1
## Model 2: PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + 
##     Region
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        63     438.37                          
## 2        55     108.55  8   329.83 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null Hypothesis (H0): None of the predictors (PriorWeekPurchase, LastWeekCompPurchase, Age, Region) are associated with the rate of purchasing the new product.

Alternative Hypothesis (H1): At least one predictor is significantly associated with the purchase rate.

From the likelihood ratio test:

Deviance difference: 329.83

Degrees of freedom: 8

p-value: < 2.2e-16

Since the p-value is far less than 0.05, we reject the null hypothesis.

Question 10:

poisson_deviance <- deviance(poisson_model)
poisson_df_resid <- df.residual(poisson_model)
pval_poisson_deviance <- 1 - pchisq(poisson_deviance, poisson_df_resid)

poisson_pearson <- sum(residuals(poisson_model, type = "pearson")^2)
pval_poisson_pearson <- 1 - pchisq(poisson_pearson, poisson_df_resid)

# Print values
c(Deviance = poisson_deviance, DF = poisson_df_resid, p_Deviance = pval_poisson_deviance,
  Pearson_ChiSq = poisson_pearson, p_Pearson = pval_poisson_pearson)

##      Deviance            DF    p_Deviance Pearson_ChiSq     p_Pearson 
##  1.085466e+02  5.500000e+01  2.262380e-05  1.089460e+02  2.035885e-05

To evaluate the fit of the Poisson regression model, we performed two goodness-of-fit tests. The deviance test produced a deviance value of 108.55 with 55 degrees of freedom, resulting in a p-value of 0.0000226. The Pearson chi-square test yielded a Pearson statistic of 108.95 with 55 degrees of freedom, and a corresponding p-value of 0.0000204. Both tests assess how well the model fits the observed data compared to what would be expected under the model.

Since the p-values from both tests are less than 0.01, we reject the null hypothesis at both the 0.05 and 0.01 significance levels. This indicates that the Poisson model does not provide a good fit for the data.

Question 11:

poisson_no_region <- glm(
  PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age,
  family = poisson(),
  offset = log(Count),
  data = product_data
)

anova(poisson_no_region, poisson_model, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age
## Model 2: PurchaseCount ~ PriorWeekPurchase + LastWeekCompPurchase + Age + 
##     Region
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        58     157.55                          
## 2        55     108.55  3   49.003 1.302e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To test whether the coefficients associated with Region are jointly equal to zero, we performed a likelihood ratio test by comparing the full Poisson model (including Region) with a reduced model that excludes Region. The test yielded a deviance difference of 49.003 with 3 degrees of freedom, and a p-value of 1.302e-10.

Since the p-value is far below both 0.05 and 0.01, we reject the null hypothesis. This indicates that the Region variable significantly improves the model and that at least one Region coefficient is not equal to zero.

Question 12:

In the Poisson regression model, the coefficient for PriorWeekPurchase is 0.32176. This represents the natural logarithm of the rate ratio. By exponentiating the coefficient, we obtain a rate ratio of approximately 1.38. This means that, holding all other variables constant, customers who made a purchase in the prior week are expected to purchase the new product at a rate that is 38% higher than those who did not make a prior-week purchase.