DATA 501 Final Project

Research Questions: 1) Does price has a significant impact on cigarette sales per state? 2) Which variables influence cigarette sales?

Initial findings

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data <- read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt", header = TRUE, sep = "\t")
data_numeric <- data[c(-1)]
cor(data_numeric)

##                Age          HS      Income       Black      Female       Price
## Age     1.00000000 -0.09891626  0.25658098 -0.04033021  0.55303189  0.24775673
## HS     -0.09891626  1.00000000  0.53400534 -0.50171191 -0.41737794  0.05697473
## Income  0.25658098  0.53400534  1.00000000  0.01728756 -0.06882666  0.21455717
## Black  -0.04033021 -0.50171191  0.01728756  1.00000000  0.45089974 -0.14777619
## Female  0.55303189 -0.41737794 -0.06882666  0.45089974  1.00000000  0.02247351
## Price   0.24775673  0.05697473  0.21455717 -0.14777619  0.02247351  1.00000000
## Sales   0.22655492  0.06669476  0.32606789  0.18959037  0.14622124 -0.30062263
##              Sales
## Age     0.22655492
## HS      0.06669476
## Income  0.32606789
## Black   0.18959037
## Female  0.14622124
## Price  -0.30062263
## Sales   1.00000000

Research question 1) Does price have a significant effect on cigarette sales? Null hypothesis: Price does not have a significant impact on cigarette sales. Alternative hypothesis: Price has a significant impact on cigarette sales. Testing level: alpha = 0.05 Conclusion: Since the p-value for the Price variable is less than 0.05, we can reject the null hypothesis that price does not have a significant effect on cigarette sales.

price_model <- lm(data = data_numeric, Sales ~ Price)
summary(price_model)

## 
## Call:
## lm(formula = Sales ~ Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.484 -17.587  -3.217   8.556 134.878 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  210.453     40.528   5.193 3.98e-06 ***
## Price         -2.335      1.058  -2.206   0.0321 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.9 on 49 degrees of freedom
## Multiple R-squared:  0.09037,    Adjusted R-squared:  0.07181 
## F-statistic: 4.868 on 1 and 49 DF,  p-value: 0.03207

Research question 2: Finding which variables have an effect on cigarette sales. Initial regression model

model.lm <- lm(data = data_numeric, Sales ~ .)
summary(model.lm)

## 
## Call:
## lm(formula = Sales ~ ., data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.398 -12.388  -5.367   6.270 133.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 103.34485  245.60719   0.421  0.67597   
## Age           4.52045    3.21977   1.404  0.16735   
## HS           -0.06159    0.81468  -0.076  0.94008   
## Income        0.01895    0.01022   1.855  0.07036 . 
## Black         0.35754    0.48722   0.734  0.46695   
## Female       -1.05286    5.56101  -0.189  0.85071   
## Price        -3.25492    1.03141  -3.156  0.00289 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2282 
## F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

Not a great model, 0.2282 adjusted R^2. P-value suggests the model is significant.

First model diagnostics

plot(model.lm)

The residuals vs fitted and scale-location plot mostly have a random pattern and suggest there is unequal variance in the model. The QQ-plot suggests a couple points stray away from normality. There are no outliers or influential point according to the residuals vs leverage plot.

Backwards selection

# Removing HS because it has the highest p-value
model.lm2 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Female + Price)
summary(model.lm2)

## 
## Call:
## lm(formula = Sales ~ Age + Income + Black + Female + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.136 -12.508  -5.277   6.351 133.141 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 100.133255 239.217507   0.419  0.67751   
## Age           4.593832   3.035858   1.513  0.13722   
## Income        0.018418   0.007371   2.499  0.01618 * 
## Black         0.379049   0.391066   0.969  0.33759   
## Female       -1.067115   5.496066  -0.194  0.84692   
## Price        -3.243835   1.009591  -3.213  0.00243 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.86 on 45 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2453 
## F-statistic:  4.25 on 5 and 45 DF,  p-value: 0.002998

# Removing Female because it has the highest p-value
model.lm3 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Price)
summary(model.lm3)

## 
## Call:
## lm(formula = Sales ~ Age + Income + Black + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.784 -11.810  -5.380   5.758 132.789 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 55.329580  62.395293   0.887   0.3798   
## Age          4.191538   2.195535   1.909   0.0625 . 
## Income       0.018892   0.006882   2.745   0.0086 **
## Black        0.334162   0.312098   1.071   0.2899   
## Price       -3.239941   0.998778  -3.244   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.57 on 46 degrees of freedom
## Multiple R-squared:  0.3202, Adjusted R-squared:  0.2611 
## F-statistic: 5.416 on 4 and 46 DF,  p-value: 0.001168

# Removing black because it has the highest p-value
model.lm4 <- lm(data = data_numeric, Sales ~ Age + Income + Price)
summary(model.lm4)

## 
## Call:
## lm(formula = Sales ~ Age + Income + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.430 -13.853  -4.962   6.691 128.947 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 64.248227  61.933008   1.037  0.30487   
## Age          4.155909   2.198699   1.890  0.06491 . 
## Income       0.019281   0.006883   2.801  0.00737 **
## Price       -3.399234   0.989172  -3.436  0.00124 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.61 on 47 degrees of freedom
## Multiple R-squared:  0.3032, Adjusted R-squared:  0.2588 
## F-statistic: 6.818 on 3 and 47 DF,  p-value: 0.0006565

# Removing Age because it has the highest p-value
model.lm5 <- lm(data = data_numeric, Sales ~ Income + Price)
summary(model.lm5)

## 
## Call:
## lm(formula = Sales ~ Income + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.560 -11.863  -4.597   3.796 132.755 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 153.33841   41.23890   3.718 0.000525 ***
## Income        0.02208    0.00690   3.200 0.002440 ** 
## Price        -3.01756    0.99396  -3.036 0.003868 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.34 on 48 degrees of freedom
## Multiple R-squared:  0.2503, Adjusted R-squared:  0.219 
## F-statistic: 8.012 on 2 and 48 DF,  p-value: 0.0009945

After removing all predictors that have an insignificant p-value, the overall p-value decreased. The adjusted R^2 slightly decreased.

Anova test

# Testing if the reduced model (2 predictors) is a better fit than the full model (all predictors)
anova(model.lm5, model.lm)

## Analysis of Variance Table
## 
## Model 1: Sales ~ Income + Price
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     48 38555                           
## 2     44 34926  4    3628.8 1.1429  0.349

# Fail to reject the null that the reduced model is best


# Testing if the full model is better than a model with no predictors
empty_model <- lm(data = data_numeric, Sales ~ 1)
anova(empty_model, model.lm)

## Analysis of Variance Table
## 
## Model 1: Sales ~ 1
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1     50 51425                                
## 2     44 34926  6     16500 3.4644 0.006857 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Reject the null hypothesis that the reduced model is best

Checking predictors individually

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

avPlots(model.lm)

crPlots(model.lm)

Checking collinearity

library(faraway)

## Warning: package 'faraway' was built under R version 4.5.2

## 
## Attaching package: 'faraway'

## The following objects are masked from 'package:car':
## 
##     logit, vif

vif(model.lm)

##      Age       HS   Income    Black   Female    Price 
## 2.300617 2.676465 2.325164 2.392152 2.406417 1.142181

Going to stick with full model

Transformations

model.log <- lm(data = data_numeric, log(Sales) ~ .)
summary(model.log)

## 
## Call:
## lm(formula = log(Sales) ~ ., data = data_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43999 -0.08543 -0.03525  0.06235  0.73191 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  4.822e+00  1.657e+00   2.910  0.00565 **
## Age          3.883e-02  2.172e-02   1.788  0.08070 . 
## HS          -3.484e-03  5.496e-03  -0.634  0.52944   
## Income       1.702e-04  6.892e-05   2.470  0.01747 * 
## Black        1.831e-03  3.287e-03   0.557  0.58030   
## Female      -1.298e-02  3.752e-02  -0.346  0.73099   
## Price       -2.438e-02  6.958e-03  -3.504  0.00107 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1901 on 44 degrees of freedom
## Multiple R-squared:  0.3914, Adjusted R-squared:  0.3084 
## F-statistic: 4.717 on 6 and 44 DF,  p-value: 0.0008698

# Adjusted r-squared increased

model.sqrt <- lm(data = data_numeric, sqrt(Sales) ~ .)
summary(model.sqrt)

## 
## Call:
## lm(formula = sqrt(Sales) ~ ., data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2955 -0.5091 -0.2194  0.3277  4.8972 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 10.6746959  9.9033499   1.078  0.28696   
## Age          0.2066623  0.1298272   1.592  0.11858   
## HS          -0.0111207  0.0328496  -0.339  0.73657   
## Income       0.0008952  0.0004119   2.173  0.03518 * 
## Black        0.0129864  0.0196456   0.661  0.51204   
## Female      -0.0587536  0.2242304  -0.262  0.79453   
## Price       -0.1397288  0.0415883  -3.360  0.00162 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.136 on 44 degrees of freedom
## Multiple R-squared:  0.3589, Adjusted R-squared:  0.2715 
## F-statistic: 4.105 on 6 and 44 DF,  p-value: 0.002346

# adjusted r aqured increased

boxCox(model.lm)

data_boxcox <- data_numeric
data_boxcox$transformation <- ((data$Sales)^(-1) - 1)/-1
model.boxcox <- lm(data = data_boxcox, transformation ~ Age + HS + Income + Black + Female + Price + Sales)
summary(model.boxcox)

## 
## Call:
## lm(formula = transformation ~ Age + HS + Income + Black + Female + 
##     Price + Sales, data = data_boxcox)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.374e-03 -2.426e-04  7.479e-05  3.310e-04  9.536e-04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.889e-01  6.131e-03 161.279  < 2e-16 ***
## Age          1.576e-04  8.200e-05   1.921   0.0613 .  
## HS          -5.111e-05  2.030e-05  -2.518   0.0156 *  
## Income       6.778e-07  2.643e-07   2.565   0.0139 *  
## Black       -8.920e-06  1.221e-05  -0.730   0.4691    
## Female      -1.136e-04  1.386e-04  -0.820   0.4169    
## Price       -4.067e-05  2.846e-05  -1.429   0.1601    
## Sales        4.706e-05  3.756e-06  12.528 6.05e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0007019 on 43 degrees of freedom
## Multiple R-squared:  0.878,  Adjusted R-squared:  0.8581 
## F-statistic:  44.2 on 7 and 43 DF,  p-value: < 2.2e-16

# Model greatly improved


weights <- 1 / abs(model.lm$residuals)^2
model.wls <- lm(data = data_numeric, Sales ~ ., weights = weights)
summary(model.wls)

## 
## Call:
## lm(formula = Sales ~ ., data = data_numeric, weights = weights)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0737 -0.9272 -0.7750  0.9905  1.4581 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 166.693214  48.929872   3.407  0.00141 ** 
## Age           4.760988   0.439901  10.823 5.51e-14 ***
## HS           -0.092395   0.085004  -1.087  0.28298    
## Income        0.020094   0.002066   9.728 1.55e-12 ***
## Black         0.325405   0.109713   2.966  0.00486 ** 
## Female       -2.382165   1.108647  -2.149  0.03720 *  
## Price        -3.400717   0.236029 -14.408  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.995 on 44 degrees of freedom
## Multiple R-squared:  0.8945, Adjusted R-squared:  0.8801 
## F-statistic: 62.17 on 6 and 44 DF,  p-value: < 2.2e-16

# model greatly improved

Boxcox model diagnostics

plot(model.boxcox)

sqrt model diagnostics

plot(model.sqrt)

log model diagnostics

plot(model.log)

wls model diagnostics

plot(model.wls)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

DATA 501 Final Project

2025-11-18