Research Questions: 1) Does price has a significant impact on cigarette sales per state? 2) Which variables influence cigarette sales?
Initial findings
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt", header = TRUE, sep = "\t")
data_numeric <- data[c(-1)]
cor(data_numeric)
## Age HS Income Black Female Price
## Age 1.00000000 -0.09891626 0.25658098 -0.04033021 0.55303189 0.24775673
## HS -0.09891626 1.00000000 0.53400534 -0.50171191 -0.41737794 0.05697473
## Income 0.25658098 0.53400534 1.00000000 0.01728756 -0.06882666 0.21455717
## Black -0.04033021 -0.50171191 0.01728756 1.00000000 0.45089974 -0.14777619
## Female 0.55303189 -0.41737794 -0.06882666 0.45089974 1.00000000 0.02247351
## Price 0.24775673 0.05697473 0.21455717 -0.14777619 0.02247351 1.00000000
## Sales 0.22655492 0.06669476 0.32606789 0.18959037 0.14622124 -0.30062263
## Sales
## Age 0.22655492
## HS 0.06669476
## Income 0.32606789
## Black 0.18959037
## Female 0.14622124
## Price -0.30062263
## Sales 1.00000000
Research question 1) Does price have a significant effect on cigarette sales? Null hypothesis: Price does not have a significant impact on cigarette sales. Alternative hypothesis: Price has a significant impact on cigarette sales. Testing level: alpha = 0.05 Conclusion: Since the p-value for the Price variable is less than 0.05, we can reject the null hypothesis that price does not have a significant effect on cigarette sales.
price_model <- lm(data = data_numeric, Sales ~ Price)
summary(price_model)
##
## Call:
## lm(formula = Sales ~ Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.484 -17.587 -3.217 8.556 134.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.453 40.528 5.193 3.98e-06 ***
## Price -2.335 1.058 -2.206 0.0321 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.9 on 49 degrees of freedom
## Multiple R-squared: 0.09037, Adjusted R-squared: 0.07181
## F-statistic: 4.868 on 1 and 49 DF, p-value: 0.03207
Research question 2: Finding which variables have an effect on cigarette sales. Initial regression model
model.lm <- lm(data = data_numeric, Sales ~ .)
summary(model.lm)
##
## Call:
## lm(formula = Sales ~ ., data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.398 -12.388 -5.367 6.270 133.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.34485 245.60719 0.421 0.67597
## Age 4.52045 3.21977 1.404 0.16735
## HS -0.06159 0.81468 -0.076 0.94008
## Income 0.01895 0.01022 1.855 0.07036 .
## Black 0.35754 0.48722 0.734 0.46695
## Female -1.05286 5.56101 -0.189 0.85071
## Price -3.25492 1.03141 -3.156 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.2282
## F-statistic: 3.464 on 6 and 44 DF, p-value: 0.006857
Not a great model, 0.2282 adjusted R^2. P-value suggests the model is significant.
First model diagnostics
plot(model.lm)
The residuals vs fitted and scale-location plot mostly have a random pattern and suggest there is unequal variance in the model. The QQ-plot suggests a couple points stray away from normality. There are no outliers or influential point according to the residuals vs leverage plot.
Backwards selection
# Removing HS because it has the highest p-value
model.lm2 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Female + Price)
summary(model.lm2)
##
## Call:
## lm(formula = Sales ~ Age + Income + Black + Female + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.136 -12.508 -5.277 6.351 133.141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.133255 239.217507 0.419 0.67751
## Age 4.593832 3.035858 1.513 0.13722
## Income 0.018418 0.007371 2.499 0.01618 *
## Black 0.379049 0.391066 0.969 0.33759
## Female -1.067115 5.496066 -0.194 0.84692
## Price -3.243835 1.009591 -3.213 0.00243 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.86 on 45 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.2453
## F-statistic: 4.25 on 5 and 45 DF, p-value: 0.002998
# Removing Female because it has the highest p-value
model.lm3 <- lm(data = data_numeric, Sales ~ Age + Income + Black + Price)
summary(model.lm3)
##
## Call:
## lm(formula = Sales ~ Age + Income + Black + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.784 -11.810 -5.380 5.758 132.789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.329580 62.395293 0.887 0.3798
## Age 4.191538 2.195535 1.909 0.0625 .
## Income 0.018892 0.006882 2.745 0.0086 **
## Black 0.334162 0.312098 1.071 0.2899
## Price -3.239941 0.998778 -3.244 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.57 on 46 degrees of freedom
## Multiple R-squared: 0.3202, Adjusted R-squared: 0.2611
## F-statistic: 5.416 on 4 and 46 DF, p-value: 0.001168
# Removing black because it has the highest p-value
model.lm4 <- lm(data = data_numeric, Sales ~ Age + Income + Price)
summary(model.lm4)
##
## Call:
## lm(formula = Sales ~ Age + Income + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.430 -13.853 -4.962 6.691 128.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.248227 61.933008 1.037 0.30487
## Age 4.155909 2.198699 1.890 0.06491 .
## Income 0.019281 0.006883 2.801 0.00737 **
## Price -3.399234 0.989172 -3.436 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.61 on 47 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2588
## F-statistic: 6.818 on 3 and 47 DF, p-value: 0.0006565
# Removing Age because it has the highest p-value
model.lm5 <- lm(data = data_numeric, Sales ~ Income + Price)
summary(model.lm5)
##
## Call:
## lm(formula = Sales ~ Income + Price, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.560 -11.863 -4.597 3.796 132.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.33841 41.23890 3.718 0.000525 ***
## Income 0.02208 0.00690 3.200 0.002440 **
## Price -3.01756 0.99396 -3.036 0.003868 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.34 on 48 degrees of freedom
## Multiple R-squared: 0.2503, Adjusted R-squared: 0.219
## F-statistic: 8.012 on 2 and 48 DF, p-value: 0.0009945
After removing all predictors that have an insignificant p-value, the overall p-value decreased. The adjusted R^2 slightly decreased.
Anova test
# Testing if the reduced model (2 predictors) is a better fit than the full model (all predictors)
anova(model.lm5, model.lm)
## Analysis of Variance Table
##
## Model 1: Sales ~ Income + Price
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 38555
## 2 44 34926 4 3628.8 1.1429 0.349
# Fail to reject the null that the reduced model is best
# Testing if the full model is better than a model with no predictors
empty_model <- lm(data = data_numeric, Sales ~ 1)
anova(empty_model, model.lm5)
## Analysis of Variance Table
##
## Model 1: Sales ~ 1
## Model 2: Sales ~ Income + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 50 51425
## 2 48 38555 2 12871 8.0119 0.0009945 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Reject the null hypothesis that the reduced model is best
Checking predictors individually
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
avPlots(model.lm5)
crPlots(model.lm5)
Checking collinearity
library(faraway)
## Warning: package 'faraway' was built under R version 4.5.2
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:car':
##
## logit, vif
vif(model.lm5)
## Income Price
## 1.048256 1.048256
Going to stick with full model
Transformations
model.log <- lm(data = data_numeric, log(Sales) ~ Price + Income)
summary(model.log)
##
## Call:
## lm(formula = log(Sales) ~ Price + Income, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54413 -0.07959 -0.01604 0.04696 0.72926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.932e+00 2.879e-01 17.132 < 2e-16 ***
## Price -2.143e-02 6.938e-03 -3.089 0.003333 **
## Income 1.746e-04 4.817e-05 3.625 0.000697 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1978 on 48 degrees of freedom
## Multiple R-squared: 0.2808, Adjusted R-squared: 0.2508
## F-statistic: 9.37 on 2 and 48 DF, p-value: 0.0003668
# Adjusted r-squared increased
model.sqrt <- lm(data = data_numeric, sqrt(Sales) ~ Price + Income)
summary(model.sqrt)
##
## Call:
## lm(formula = sqrt(Sales) ~ Price + Income, data = data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8963 -0.4925 -0.1098 0.1973 4.8785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.0872813 1.6885148 7.159 4.2e-09 ***
## Price -0.1263833 0.0406972 -3.105 0.00319 **
## Income 0.0009752 0.0002825 3.452 0.00117 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.16 on 48 degrees of freedom
## Multiple R-squared: 0.2703, Adjusted R-squared: 0.2398
## F-statistic: 8.888 on 2 and 48 DF, p-value: 0.0005202
# adjusted r squared increased
boxCox(model.lm5)
data_boxcox <- data_numeric
data_boxcox$transformation <- ((data$Sales)^(-.8) - .8)/-.8
model.boxcox <- lm(data = data_boxcox, transformation ~ Income + Price)
summary(model.boxcox)
##
## Call:
## lm(formula = transformation ~ Income + Price, data = data_boxcox)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0147512 -0.0014242 0.0000876 0.0013650 0.0119469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.740e-01 6.102e-03 159.616 < 2e-16 ***
## Income 3.772e-06 1.021e-06 3.695 0.000564 ***
## Price -4.228e-04 1.471e-04 -2.875 0.006006 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.004194 on 48 degrees of freedom
## Multiple R-squared: 0.2749, Adjusted R-squared: 0.2447
## F-statistic: 9.097 on 2 and 48 DF, p-value: 0.0004467
# Model slightly improved
weights <- 1 / abs(model.lm5$residuals)^2
model.wls <- lm(data = data_numeric, Sales ~ Income + Price, weights = weights)
summary(model.wls)
##
## Call:
## lm(formula = Sales ~ Income + Price, data = data_numeric, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.0526 -0.9836 -0.9558 0.9874 1.1805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 152.048798 3.985704 38.15 <2e-16 ***
## Income 0.021752 0.000353 61.61 <2e-16 ***
## Price -2.955722 0.088683 -33.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9913 on 48 degrees of freedom
## Multiple R-squared: 0.9911, Adjusted R-squared: 0.9907
## F-statistic: 2665 on 2 and 48 DF, p-value: < 2.2e-16
# model greatly improved
Boxcox model diagnostics
plot(model.boxcox)
sqrt model diagnostics
plot(model.sqrt)
log model diagnostics
plot(model.log)
wls model diagnostics
plot(model.wls)