Research Questions: 1) Does price has a significant impact on cigarette sales per state? 2) Which variables influence cigarette sales?
Initial findings
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt", header = TRUE, sep = "\t")
data_numeric <- data[c(-1)]
cor(data_numeric)
## Age HS Income Black Female Price
## Age 1.00000000 -0.09891626 0.25658098 -0.04033021 0.55303189 0.24775673
## HS -0.09891626 1.00000000 0.53400534 -0.50171191 -0.41737794 0.05697473
## Income 0.25658098 0.53400534 1.00000000 0.01728756 -0.06882666 0.21455717
## Black -0.04033021 -0.50171191 0.01728756 1.00000000 0.45089974 -0.14777619
## Female 0.55303189 -0.41737794 -0.06882666 0.45089974 1.00000000 0.02247351
## Price 0.24775673 0.05697473 0.21455717 -0.14777619 0.02247351 1.00000000
## Sales 0.22655492 0.06669476 0.32606789 0.18959037 0.14622124 -0.30062263
## Sales
## Age 0.22655492
## HS 0.06669476
## Income 0.32606789
## Black 0.18959037
## Female 0.14622124
## Price -0.30062263
## Sales 1.00000000
Research question 1) Does price have a significant effect on cigarette sales? Null hypothesis: Price does not have a significant impact on cigarette sales. Alternative hypothesis: Price has a significant impact on cigarette sales. Testing level: alpha = 0.05 Conclusion: Since the p-value for the Price variable is less than 0.05, we can reject the null hypothesis that price does not have a significant effect on cigarette sales.
price_model <- lm(data = data_numeric, Sales ~ Price)
summary(price_model)
##
## Call:
## lm(formula = Sales ~ Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.484 -17.587 -3.217 8.556 134.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.453 40.528 5.193 3.98e-06 ***
## Price -2.335 1.058 -2.206 0.0321 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.9 on 49 degrees of freedom
## Multiple R-squared: 0.09037, Adjusted R-squared: 0.07181
## F-statistic: 4.868 on 1 and 49 DF, p-value: 0.03207
Research question 2: Finding which variables have an effect on cigarette sales. Initial regression model
model.lm <- lm(data = data_numeric, Sales ~ .)
summary(model.lm)
##
## Call:
## lm(formula = Sales ~ ., data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.398 -12.388 -5.367 6.270 133.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.34485 245.60719 0.421 0.67597
## Age 4.52045 3.21977 1.404 0.16735
## HS -0.06159 0.81468 -0.076 0.94008
## Income 0.01895 0.01022 1.855 0.07036 .
## Black 0.35754 0.48722 0.734 0.46695
## Female -1.05286 5.56101 -0.189 0.85071
## Price -3.25492 1.03141 -3.156 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.2282
## F-statistic: 3.464 on 6 and 44 DF, p-value: 0.006857
Not a great model, 0.2282 adjusted R^2. P-value suggests the model is significant.
First model diagnostics
plot(model.lm)
The residuals vs fitted and scale-location plot mostly have a random pattern and suggest there is unequal variance in the model. The QQ-plot suggests a couple points stray away from normality. There are no outliers or influential point according to the residuals vs leverage plot.
Backwards selection
# Removing HS because it has the highest p-value
model.lm2 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Female + Price)
summary(model.lm2)
##
## Call:
## lm(formula = Sales ~ Age + Income + Black + Female + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.136 -12.508 -5.277 6.351 133.141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.133255 239.217507 0.419 0.67751
## Age 4.593832 3.035858 1.513 0.13722
## Income 0.018418 0.007371 2.499 0.01618 *
## Black 0.379049 0.391066 0.969 0.33759
## Female -1.067115 5.496066 -0.194 0.84692
## Price -3.243835 1.009591 -3.213 0.00243 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.86 on 45 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.2453
## F-statistic: 4.25 on 5 and 45 DF, p-value: 0.002998
# Removing Female because it has the highest p-value
model.lm3 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Price)
summary(model.lm3)
##
## Call:
## lm(formula = Sales ~ Age + Income + Black + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.784 -11.810 -5.380 5.758 132.789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.329580 62.395293 0.887 0.3798
## Age 4.191538 2.195535 1.909 0.0625 .
## Income 0.018892 0.006882 2.745 0.0086 **
## Black 0.334162 0.312098 1.071 0.2899
## Price -3.239941 0.998778 -3.244 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.57 on 46 degrees of freedom
## Multiple R-squared: 0.3202, Adjusted R-squared: 0.2611
## F-statistic: 5.416 on 4 and 46 DF, p-value: 0.001168
# Removing black because it has the highest p-value
model.lm4 <- lm(data = data_numeric, Sales ~ Age + Income + Price)
summary(model.lm4)
##
## Call:
## lm(formula = Sales ~ Age + Income + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.430 -13.853 -4.962 6.691 128.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.248227 61.933008 1.037 0.30487
## Age 4.155909 2.198699 1.890 0.06491 .
## Income 0.019281 0.006883 2.801 0.00737 **
## Price -3.399234 0.989172 -3.436 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.61 on 47 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2588
## F-statistic: 6.818 on 3 and 47 DF, p-value: 0.0006565
# Removing Age because it has the highest p-value
model.lm5 <- lm(data = data_numeric, Sales ~ Income + Price)
summary(model.lm5)
##
## Call:
## lm(formula = Sales ~ Income + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.560 -11.863 -4.597 3.796 132.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.33841 41.23890 3.718 0.000525 ***
## Income 0.02208 0.00690 3.200 0.002440 **
## Price -3.01756 0.99396 -3.036 0.003868 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.34 on 48 degrees of freedom
## Multiple R-squared: 0.2503, Adjusted R-squared: 0.219
## F-statistic: 8.012 on 2 and 48 DF, p-value: 0.0009945
After removing all predictors that have an insignificant p-value, the overall p-value decreased. The adjusted R^2 slightly decreased.
Anova test
# Testing if the reduced model (2 predictors) is a better fit than the full model (all predictors)
anova(model.lm5, model.lm)
## Analysis of Variance Table
##
## Model 1: Sales ~ Income + Price
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 38555
## 2 44 34926 4 3628.8 1.1429 0.349
# Fail to reject the null that the reduced model is best
# Testing if the full model is better than a model with no predictors
empty_model <- lm(data = data_numeric, Sales ~ 1)
anova(empty_model, model.lm)
## Analysis of Variance Table
##
## Model 1: Sales ~ 1
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 50 51425
## 2 44 34926 6 16500 3.4644 0.006857 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Reject the null hypothesis that the reduced model is best
Checking predictors individually
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
avPlots(model.lm)
crPlots(model.lm)
Checking collinearity
library(faraway)
## Warning: package 'faraway' was built under R version 4.5.2
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:car':
##
## logit, vif
vif(model.lm)
## Age HS Income Black Female Price
## 2.300617 2.676465 2.325164 2.392152 2.406417 1.142181
Going to stick with full model
Transformations
model.log <- lm(data = data_numeric, log(Sales) ~ .)
summary(model.log)
##
## Call:
## lm(formula = log(Sales) ~ ., data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43999 -0.08543 -0.03525 0.06235 0.73191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.822e+00 1.657e+00 2.910 0.00565 **
## Age 3.883e-02 2.172e-02 1.788 0.08070 .
## HS -3.484e-03 5.496e-03 -0.634 0.52944
## Income 1.702e-04 6.892e-05 2.470 0.01747 *
## Black 1.831e-03 3.287e-03 0.557 0.58030
## Female -1.298e-02 3.752e-02 -0.346 0.73099
## Price -2.438e-02 6.958e-03 -3.504 0.00107 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1901 on 44 degrees of freedom
## Multiple R-squared: 0.3914, Adjusted R-squared: 0.3084
## F-statistic: 4.717 on 6 and 44 DF, p-value: 0.0008698
# Adjusted r-squared increased
model.sqrt <- lm(data = data_numeric, sqrt(Sales) ~ .)
summary(model.sqrt)
##
## Call:
## lm(formula = sqrt(Sales) ~ ., data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2955 -0.5091 -0.2194 0.3277 4.8972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6746959 9.9033499 1.078 0.28696
## Age 0.2066623 0.1298272 1.592 0.11858
## HS -0.0111207 0.0328496 -0.339 0.73657
## Income 0.0008952 0.0004119 2.173 0.03518 *
## Black 0.0129864 0.0196456 0.661 0.51204
## Female -0.0587536 0.2242304 -0.262 0.79453
## Price -0.1397288 0.0415883 -3.360 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.136 on 44 degrees of freedom
## Multiple R-squared: 0.3589, Adjusted R-squared: 0.2715
## F-statistic: 4.105 on 6 and 44 DF, p-value: 0.002346
# adjusted r aqured increased
boxCox(model.lm)
data_boxcox <- data_numeric
data_boxcox$transformation <- ((data$Sales)^(-1) - 1)/-1
model.boxcox <- lm(data = data_boxcox, transformation ~ Age + HS + Income + Black + Female + Price + Sales)
summary(model.boxcox)
##
## Call:
## lm(formula = transformation ~ Age + HS + Income + Black + Female +
## Price + Sales, data = data_boxcox)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.374e-03 -2.426e-04 7.479e-05 3.310e-04 9.536e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.889e-01 6.131e-03 161.279 < 2e-16 ***
## Age 1.576e-04 8.200e-05 1.921 0.0613 .
## HS -5.111e-05 2.030e-05 -2.518 0.0156 *
## Income 6.778e-07 2.643e-07 2.565 0.0139 *
## Black -8.920e-06 1.221e-05 -0.730 0.4691
## Female -1.136e-04 1.386e-04 -0.820 0.4169
## Price -4.067e-05 2.846e-05 -1.429 0.1601
## Sales 4.706e-05 3.756e-06 12.528 6.05e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0007019 on 43 degrees of freedom
## Multiple R-squared: 0.878, Adjusted R-squared: 0.8581
## F-statistic: 44.2 on 7 and 43 DF, p-value: < 2.2e-16
# Model greatly improved
weights <- 1 / abs(model.lm$residuals)^2
model.wls <- lm(data = data_numeric, Sales ~ ., weights = weights)
summary(model.wls)
##
## Call:
## lm(formula = Sales ~ ., data = data_numeric, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.0737 -0.9272 -0.7750 0.9905 1.4581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 166.693214 48.929872 3.407 0.00141 **
## Age 4.760988 0.439901 10.823 5.51e-14 ***
## HS -0.092395 0.085004 -1.087 0.28298
## Income 0.020094 0.002066 9.728 1.55e-12 ***
## Black 0.325405 0.109713 2.966 0.00486 **
## Female -2.382165 1.108647 -2.149 0.03720 *
## Price -3.400717 0.236029 -14.408 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.995 on 44 degrees of freedom
## Multiple R-squared: 0.8945, Adjusted R-squared: 0.8801
## F-statistic: 62.17 on 6 and 44 DF, p-value: < 2.2e-16
# model greatly improved
Boxcox model diagnostics
plot(model.boxcox)
sqrt model diagnostics
plot(model.sqrt)
log model diagnostics
plot(model.log)
wls model diagnostics
plot(model.wls)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced