DATA 501 Final Project

Research Questions: 1) Does price has a significant impact on cigarette sales per state? 2) Which variables influence cigarette sales?

Initial findings

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data <- read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt", header = TRUE, sep = "\t")
data_numeric <- data[c(-1)]
cor(data_numeric)

##                Age          HS      Income       Black      Female       Price
## Age     1.00000000 -0.09891626  0.25658098 -0.04033021  0.55303189  0.24775673
## HS     -0.09891626  1.00000000  0.53400534 -0.50171191 -0.41737794  0.05697473
## Income  0.25658098  0.53400534  1.00000000  0.01728756 -0.06882666  0.21455717
## Black  -0.04033021 -0.50171191  0.01728756  1.00000000  0.45089974 -0.14777619
## Female  0.55303189 -0.41737794 -0.06882666  0.45089974  1.00000000  0.02247351
## Price   0.24775673  0.05697473  0.21455717 -0.14777619  0.02247351  1.00000000
## Sales   0.22655492  0.06669476  0.32606789  0.18959037  0.14622124 -0.30062263
##              Sales
## Age     0.22655492
## HS      0.06669476
## Income  0.32606789
## Black   0.18959037
## Female  0.14622124
## Price  -0.30062263
## Sales   1.00000000

Research question 1) Does price have a significant effect on cigarette sales? Null hypothesis: Price does not have a significant impact on cigarette sales. Alternative hypothesis: Price has a significant impact on cigarette sales. Testing level: alpha = 0.05 Conclusion: Since the p-value for the Price variable is less than 0.05, we can reject the null hypothesis that price does not have a significant effect on cigarette sales.

price_model <- lm(data = data_numeric, Sales ~ Price)
summary(price_model)

## 
## Call:
## lm(formula = Sales ~ Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.484 -17.587  -3.217   8.556 134.878 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  210.453     40.528   5.193 3.98e-06 ***
## Price         -2.335      1.058  -2.206   0.0321 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.9 on 49 degrees of freedom
## Multiple R-squared:  0.09037,    Adjusted R-squared:  0.07181 
## F-statistic: 4.868 on 1 and 49 DF,  p-value: 0.03207

Research question 2: Finding which variables have an effect on cigarette sales. Initial regression model

model.lm <- lm(data = data_numeric, Sales ~ .)
summary(model.lm)

## 
## Call:
## lm(formula = Sales ~ ., data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.398 -12.388  -5.367   6.270 133.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 103.34485  245.60719   0.421  0.67597   
## Age           4.52045    3.21977   1.404  0.16735   
## HS           -0.06159    0.81468  -0.076  0.94008   
## Income        0.01895    0.01022   1.855  0.07036 . 
## Black         0.35754    0.48722   0.734  0.46695   
## Female       -1.05286    5.56101  -0.189  0.85071   
## Price        -3.25492    1.03141  -3.156  0.00289 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2282 
## F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

Not a great model, 0.2282 adjusted R^2. P-value suggests the model is significant.

First model diagnostics

plot(model.lm)

The residuals vs fitted and scale-location plot mostly have a random pattern and suggest there is unequal variance in the model. The QQ-plot suggests a couple points stray away from normality. There are no outliers or influential point according to the residuals vs leverage plot.

Backwards selection

# Removing HS because it has the highest p-value
model.lm2 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Female + Price)
summary(model.lm2)

## 
## Call:
## lm(formula = Sales ~ Age + Income + Black + Female + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.136 -12.508  -5.277   6.351 133.141 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 100.133255 239.217507   0.419  0.67751   
## Age           4.593832   3.035858   1.513  0.13722   
## Income        0.018418   0.007371   2.499  0.01618 * 
## Black         0.379049   0.391066   0.969  0.33759   
## Female       -1.067115   5.496066  -0.194  0.84692   
## Price        -3.243835   1.009591  -3.213  0.00243 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.86 on 45 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2453 
## F-statistic:  4.25 on 5 and 45 DF,  p-value: 0.002998

# Removing Female because it has the highest p-value
model.lm3 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Price)
summary(model.lm3)

## 
## Call:
## lm(formula = Sales ~ Age + Income + Black + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.784 -11.810  -5.380   5.758 132.789 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 55.329580  62.395293   0.887   0.3798   
## Age          4.191538   2.195535   1.909   0.0625 . 
## Income       0.018892   0.006882   2.745   0.0086 **
## Black        0.334162   0.312098   1.071   0.2899   
## Price       -3.239941   0.998778  -3.244   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.57 on 46 degrees of freedom
## Multiple R-squared:  0.3202, Adjusted R-squared:  0.2611 
## F-statistic: 5.416 on 4 and 46 DF,  p-value: 0.001168

# Removing black because it has the highest p-value
model.lm4 <- lm(data = data_numeric, Sales ~ Age + Income + Price)
summary(model.lm4)

## 
## Call:
## lm(formula = Sales ~ Age + Income + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.430 -13.853  -4.962   6.691 128.947 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 64.248227  61.933008   1.037  0.30487   
## Age          4.155909   2.198699   1.890  0.06491 . 
## Income       0.019281   0.006883   2.801  0.00737 **
## Price       -3.399234   0.989172  -3.436  0.00124 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.61 on 47 degrees of freedom
## Multiple R-squared:  0.3032, Adjusted R-squared:  0.2588 
## F-statistic: 6.818 on 3 and 47 DF,  p-value: 0.0006565

# Removing Age because it has the highest p-value
model.lm5 <- lm(data = data_numeric, Sales ~ Income + Price)
summary(model.lm5)

## 
## Call:
## lm(formula = Sales ~ Income + Price, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.560 -11.863  -4.597   3.796 132.755 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 153.33841   41.23890   3.718 0.000525 ***
## Income        0.02208    0.00690   3.200 0.002440 ** 
## Price        -3.01756    0.99396  -3.036 0.003868 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.34 on 48 degrees of freedom
## Multiple R-squared:  0.2503, Adjusted R-squared:  0.219 
## F-statistic: 8.012 on 2 and 48 DF,  p-value: 0.0009945

After removing all predictors that have an insignificant p-value, the overall p-value decreased. The adjusted R^2 slightly decreased.

Anova test

# Testing if the reduced model (2 predictors) is a better fit than the full model (all predictors)
anova(model.lm5, model.lm)

## Analysis of Variance Table
## 
## Model 1: Sales ~ Income + Price
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     48 38555                           
## 2     44 34926  4    3628.8 1.1429  0.349

# Fail to reject the null that the reduced model is best


# Testing if the full model is better than a model with no predictors
empty_model <- lm(data = data_numeric, Sales ~ 1)
anova(empty_model, model.lm5)

## Analysis of Variance Table
## 
## Model 1: Sales ~ 1
## Model 2: Sales ~ Income + Price
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     50 51425                                  
## 2     48 38555  2     12871 8.0119 0.0009945 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Reject the null hypothesis that the reduced model is best

Checking predictors individually

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

avPlots(model.lm5)

crPlots(model.lm5)

Checking collinearity

library(faraway)

## Warning: package 'faraway' was built under R version 4.5.2

## 
## Attaching package: 'faraway'

## The following objects are masked from 'package:car':
## 
##     logit, vif

vif(model.lm5)

##   Income    Price 
## 1.048256 1.048256

Going to stick with full model

Transformations

model.log <- lm(data = data_numeric, log(Sales) ~ Price + Income)
summary(model.log)

## 
## Call:
## lm(formula = log(Sales) ~ Price + Income, data = data_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54413 -0.07959 -0.01604  0.04696  0.72926 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.932e+00  2.879e-01  17.132  < 2e-16 ***
## Price       -2.143e-02  6.938e-03  -3.089 0.003333 ** 
## Income       1.746e-04  4.817e-05   3.625 0.000697 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1978 on 48 degrees of freedom
## Multiple R-squared:  0.2808, Adjusted R-squared:  0.2508 
## F-statistic:  9.37 on 2 and 48 DF,  p-value: 0.0003668

# Adjusted r-squared increased

model.sqrt <- lm(data = data_numeric, sqrt(Sales) ~ Price + Income)
summary(model.sqrt)

## 
## Call:
## lm(formula = sqrt(Sales) ~ Price + Income, data = data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8963 -0.4925 -0.1098  0.1973  4.8785 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.0872813  1.6885148   7.159  4.2e-09 ***
## Price       -0.1263833  0.0406972  -3.105  0.00319 ** 
## Income       0.0009752  0.0002825   3.452  0.00117 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.16 on 48 degrees of freedom
## Multiple R-squared:  0.2703, Adjusted R-squared:  0.2398 
## F-statistic: 8.888 on 2 and 48 DF,  p-value: 0.0005202

# adjusted r squared increased

boxCox(model.lm5)

data_boxcox <- data_numeric
data_boxcox$transformation <- ((data$Sales)^(-.8) - .8)/-.8
model.boxcox <- lm(data = data_boxcox, transformation ~ Income + Price)
summary(model.boxcox)

## 
## Call:
## lm(formula = transformation ~ Income + Price, data = data_boxcox)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0147512 -0.0014242  0.0000876  0.0013650  0.0119469 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.740e-01  6.102e-03 159.616  < 2e-16 ***
## Income       3.772e-06  1.021e-06   3.695 0.000564 ***
## Price       -4.228e-04  1.471e-04  -2.875 0.006006 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004194 on 48 degrees of freedom
## Multiple R-squared:  0.2749, Adjusted R-squared:  0.2447 
## F-statistic: 9.097 on 2 and 48 DF,  p-value: 0.0004467

# Model slightly improved


weights <- 1 / abs(model.lm5$residuals)^2
model.wls <- lm(data = data_numeric, Sales ~ Income + Price, weights = weights)
summary(model.wls)

## 
## Call:
## lm(formula = Sales ~ Income + Price, data = data_numeric, weights = weights)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0526 -0.9836 -0.9558  0.9874  1.1805 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 152.048798   3.985704   38.15   <2e-16 ***
## Income        0.021752   0.000353   61.61   <2e-16 ***
## Price        -2.955722   0.088683  -33.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9913 on 48 degrees of freedom
## Multiple R-squared:  0.9911, Adjusted R-squared:  0.9907 
## F-statistic:  2665 on 2 and 48 DF,  p-value: < 2.2e-16

# model greatly improved

Boxcox model diagnostics

plot(model.boxcox)

sqrt model diagnostics

plot(model.sqrt)

log model diagnostics

plot(model.log)

wls model diagnostics

plot(model.wls)

DATA 501 Final Project

2025-11-18