June 02, 2023
library(GGally) # for ggpairs()
library(car) # for ncvTest()
library(stats) # for shapiro test
library(leaps) # for regsubsets()
library(DAAG) # for press() (PRESS statistic)
Where needed, a significance level of \(\alpha=5\%\) is used, unless stated otherwise.
In 1981, Procter & Gamble wanted to forecast future sales of its Crest toothpaste. Data on Crest's advertising budget, the advertising ratio (Crest's advertising budget as a proportion of that of Colgate Dental Cream, Crest's closest competitor) and personal disposable income were collected for the years 1967 to 1980.
Our aim is to determine a regression model for Crest's sales using some or all of these 3 factors. We fit a model with all 3 predictor variables and then eliminate insignificant variables by analyzing their coefficients.
# Raw Crest data from pdf
CrestRawdata = data.frame(Year = seq(from=1967, to =1980),
CrestSales = c(105000,105000,121600,113750,113750,128925,142500,126000,162000,191625,189000,210000,224250,245000),
CrestBudget = c(16300,15800,16000,14200,15000,14000,15400,18250,17300,23000,19300,23056,26000,28000),
Ratio = c(1.25,1.34,1.22,1.00,1.15,1.13,1.05,1.27,1.07,1.17,1.07,1.54,1.59,1.56),
Income = c(547.9,593.4,638.9,695.3,751.8,810.3,914.5,998.3,1096.1,1194.4,1311.5,1462.9,1641.7,1821.7))
# We don't need year column
Crest <- CrestRawdata[,c(2,3,4,5)]
# Crest data in Y, X1, X2, X3 format
Crest_data <- data.frame( y = Crest$CrestSales,
x1 = Crest$CrestBudget,
x2 = Crest$Ratio,
x3 = Crest$Income)
\(Crest\) is our base dataset.
Let's first fit a model for CrestSales (\(Y\)) directly using the 3 variables CrestBudget (\(X_1\)), Ratio (\(X_2\)) and Income (\(X_3\)) with lm().
crestfit1= lm(CrestSales~ CrestBudget+Ratio+Income, data=Crest)
summary(crestfit1)
##
## Call:
## lm(formula = CrestSales ~ CrestBudget + Ratio + Income, data = Crest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24088 -2568 1021 3836 10100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34104.559 17654.144 1.932 0.082187 .
## CrestBudget 3.746 1.976 1.896 0.087243 .
## Ratio -30046.343 22859.674 -1.314 0.218066
## Income 85.926 17.911 4.797 0.000727 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9574 on 10 degrees of freedom
## Multiple R-squared: 0.9691, Adjusted R-squared: 0.9598
## F-statistic: 104.5 on 3 and 10 DF, p-value: 7.537e-08
The regression model using all 3 predictors is significant, as the p-value (7.537e-08) is less than the 0.05 significance level. The adjusted R-squared suggests 95.98% of the variability in CrestSales is explained by the model with all 3 predictors: CrestBudget, Ratio and Income.
The fitted model using all 3 variables is
\[\hat y = 34104.559 + 3.746x_1 - 30046.343x_2 + 85.926x_3\]
or equivalently \(E(y) = 34104.559 + 3.746x_1 - 30046.343x_2 + 85.926x_3\). (The error term \(\varepsilon\) belongs to the population model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\), not to the fitted equation.)
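As a quick sanity check (a minimal sketch), the fitted equation above should reproduce the values returned by predict():
# Rebuild the fitted values from the estimated coefficients
b <- coef(crestfit1)
manual_fit <- b[1] + b[2]*Crest$CrestBudget + b[3]*Crest$Ratio + b[4]*Crest$Income
all.equal(unname(manual_fit), unname(predict(crestfit1))) # should be TRUE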
To check whether the linear model for Crest Sales \(Y\) using all the predictors is significant, we perform a significance test on the regression, using the F-statistic (or its p-value) from the ANOVA table. Let's state the hypotheses for the significance test:
Null hypothesis \(H_0\): \(\beta_1 = \beta_2 = \beta_3 = 0\)
Alternative hypothesis \(H_1\): \(\beta_j \not = 0\) for at least one \(j\).
Let's generate the ANOVA table using anova():
CrestANOVA <- anova(crestfit1)
CrestANOVA
## Analysis of Variance Table
##
## Response: CrestSales
## Df Sum Sq Mean Sq F value Pr(>F)
## CrestBudget 1 2.5611e+10 2.5611e+10 279.392 1.23e-08 ***
## Ratio 1 1.0202e+09 1.0202e+09 11.130 0.0075401 **
## Income 1 2.1097e+09 2.1097e+09 23.015 0.0007265 ***
## Residuals 10 9.1667e+08 9.1667e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We cannot rely on this output directly, because anova() tests the regressors sequentially; to test the significance of the regression we need all the regressors together. We sum the predictors' sums of squares and calculate the p-value for them as shown below.
# Manually
regression<- CrestANOVA$`Sum Sq`[1]+ CrestANOVA$`Sum Sq`[2]+ CrestANOVA$`Sum Sq`[3]
Mean_Sq<-regression/(CrestANOVA$Df[1]+CrestANOVA$Df[2]+CrestANOVA$Df[3])
F_value<-Mean_Sq/CrestANOVA$`Mean Sq`[4]
pf(F_value, 3, CrestANOVA$Df[4], lower.tail = FALSE)
## [1] 7.537224e-08
# Using anova()
Crestlm.fit<-lm(CrestSales~1,Crest)
anova(Crestlm.fit,crestfit1)
## Analysis of Variance Table
##
## Model 1: CrestSales ~ 1
## Model 2: CrestSales ~ CrestBudget + Ratio + Income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 13 2.9658e+10
## 2 10 9.1667e+08 3 2.8741e+10 104.51 7.537e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (7.537e-08) is \(<\) \(0.05\), the regression is significant.
To answer the question, "Is the model using all the predictors significant?": yes, the regression is significant, with a p-value of 7.537e-08 and a large F-statistic (104.5).
We have seen from the summary(crestfit1) output above that the coefficient of Income is significant for \(Y\) (Crest Sales) at the 5% significance level, while the coefficients of Ratio and CrestBudget are insignificant. Thus only the Income variable is important per the model summary statistics. Note also that the intercept term is insignificant.
Let's test the significance of these 3 variables by performing t-tests on the coefficients.
We use the t-test to test the significance of the coefficients. The test statistic is \(t = \frac {\hat{\beta_j} - \beta_j} {se\left(\hat{\beta_j}\right)}\), where \(j = 1, \dots, k\) indexes the regressor variables (\(k = 3\) in our case). We reject \(H_0\) if \(|t| > t_{\alpha/2,\, n-p}\) (where \(p = k + 1\) is the number of parameters), or equivalently by the p-value.
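A minimal sketch: the same t statistics can be computed by hand from the summary table and compared to the critical value.
cf <- summary(crestfit1)$coefficients # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_vals <- cf[, "Estimate"] / cf[, "Std. Error"] # t = estimate / standard error under H0: beta_j = 0
t_crit <- qt(0.975, df = df.residual(crestfit1)) # n - p = 14 - 4 = 10 degrees of freedom
cbind(t = round(t_vals, 3), reject_H0 = abs(t_vals) > t_crit)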
For \(\beta_1\) :
For the coefficient \(\beta_1\) of CrestBudget the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_1\) = 0
Alternate hypothesis:
\(H_a\) : \(\beta_1 \not = 0\)
The \(|t|\)-critical value for a two-tailed t-test with \(n - p = n - k - 1 = 10\) degrees of freedom is 2.228139,
qt(p = .025, df = length(Crest$CrestBudget) - 3 - 1)
## [1] -2.228139
From the summary statistics, the t-statistic value (1.896) is closer to 0 than the t-critical value (2.228139), hence the null hypothesis is not rejected. We do not have evidence that \(\beta_1 \not = 0\); the coefficient \(\beta_1\) is insignificant for the regression.
For \(\beta_2\) :
For the coefficient \(\beta_2\) of Ratio the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_2\) = 0
Alternate hypothesis:
\(H_a\) : \(\beta_2 \not = 0\)
The t-statistic value for Ratio is -1.314; its absolute value is closer to 0 than the t-critical value (2.228139), hence the null hypothesis is not rejected. We do not have evidence that \(\beta_2 \not = 0\); the coefficient \(\beta_2\) is insignificant for the regression.
For \(\beta_3\) :
For the coefficient \(\beta_3\) of Income the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_3\) = 0
Alternate hypothesis:
\(H_a\) : \(\beta_3 \not = 0\)
The t-statistic value (4.797) is farther from 0 than the t-critical value (2.228139), hence the null hypothesis is rejected in favour of \(H_a\): the coefficient \(\beta_3\) is not zero. Thus \(\beta_3\) is a significant coefficient for the regression.
For the intercept \(\beta_0\) the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_0\) = 0
Alternate hypothesis:
\(H_a\) : \(\beta_0 \not = 0\)
The t-statistic value for the intercept (1.932) is closer to 0 than the t-critical value (2.228139), hence the null hypothesis is not rejected. Thus the intercept \(\beta_0\) is insignificant for the regression.
To answer the question, "From the output, which variables are important?": from the results of the significance tests on the coefficients, only \(\beta_3\) is significant. That is, only the Income variable is a significant predictor of the response, Crest Sales \(Y\), in the linear model. The other 2 variables, CrestBudget and Ratio, are insignificant in the model.
Let's perform residual analysis to check whether any model assumptions have been violated.
The estimation error (or residual) is defined by \(\hat{\epsilon_i} = Y_i - \hat{Y_i}\) (observed value minus fitted value).
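As a minimal check of this definition, the residuals can be computed by hand and compared against residuals(crestfit1):
# Residuals by hand: observed minus fitted
manual_res <- Crest$CrestSales - fitted(crestfit1)
all.equal(unname(manual_res), unname(residuals(crestfit1))) # should be TRUE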
Residual checks are done by plotting residual plots, which can reveal the following problems:
par(mfrow=c(2,2))
plot(crestfit1)
1. Residuals vs Fitted
This plot shows whether the residuals (\(\epsilon_i\)) have non-linear patterns. We find the residuals spread almost equally around the horizontal line without forming any distinct pattern, a good indication that we don't have non-linear relationships; the residuals support the linearity assumption. The absence of a distinct pattern also indicates randomness of the residuals. And since the horizontal red line sits almost at the 0 level, we can say the mean of the residuals/errors is 0, i.e. \(E(\epsilon_i) = 0\).
2. Normal Q-Q
This plot shows whether the residuals are normally distributed, which is necessary because we are carrying out significance tests. The residuals mostly follow the straight dashed line (except observations 3, 8 and 12, which deviate near the tails). Residuals lining up well on the dashed line indicate normality, so it is fair to say that the residuals are approximately normally distributed.
3. Scale-Location
This plot shows whether the residuals are spread equally along the range of fitted values, indicating equal variance (homoscedasticity). We see an almost horizontal line with equally (randomly) spread points, with the exception of observations 3, 8 and 12, which show higher variation than the other data points. All the other residuals are consistent with homoscedasticity (equal variance).
Although the plots imply linearity, randomness, homoscedasticity, etc. it is difficult to judge the residual’s behavior with certainty from the plots. So, we need to test the assumptions statistically.
1- Non-constant error variance test
\(H_0\): Errors have a constant
variance
\(H_1\): Errors have a
non-constant variance
ncvTest(crestfit1) #library(car)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.07424735, Df = 1, p = 0.78525
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the constant error variance assumption is not violated.
2- Test for Normally Distributed Errors
\(H_0\): Errors are normally
distributed
\(H_1\): Errors are
not normally distributed
shapiro.test(crestfit1$residuals) # library(stats)
##
## Shapiro-Wilk normality test
##
## data: crestfit1$residuals
## W = 0.83777, p-value = 0.01522
Since the p-value is < 0.05, we reject \(H_0\). This implies that the normality assumption on the errors is violated.
3- Test for Autocorrelated Errors
\(H_0\): Errors are uncorrelated
\(H_1\): Errors are correlated
durbinWatsonTest(crestfit1) # library(car)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1129573 2.211283 0.904
## Alternative hypothesis: rho != 0
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the uncorrelated errors assumption is not violated.
acf(crestfit1$residuals)
Since no vertical bars beyond lag 0 exceed the confidence limits (dashed lines), no significant correlations exist between the errors.
4. Residuals vs Leverage
This plot helps us find influential cases, if any. We have noticed that observations 3, 8 and 12 are exceptions, but do they influence the regression results significantly, i.e. are they outliers? We notice no observations situated outside the Cook's distance bands, hence the regression results won't be unduly altered by any outliers.
To answer the question, "Are the assumptions governing the residuals satisfied? Which of these assumptions have been violated?": only the normality assumption on the residuals is not satisfied; the remaining assumptions hold.
Let's check for multicollinearity using scatter plots and correlation coefficients between the predictors.
ggpairs(data = Crest, columns = c(1,2,3,4))
As shown in the matrix above,
- CrestBudget and Ratio are strongly correlated (\(r = 0.771\))
- Income and Ratio are moderately correlated (\(r = 0.615\))
- CrestBudget and Income are strongly correlated (\(r = 0.919\))
Thus there is an indication of multicollinearity.
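These pairwise values can be confirmed numerically with cor() (a quick check):
# Correlation matrix of the response and predictors
round(cor(Crest), 3)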
To test multicollinearity statistically, we also use the Variance Inflation Factor (\(VIF\)). The VIF can be viewed as the factor by which the variance of a coefficient is inflated due to multicollinearity.
vif(crestfit1)
## CrestBudget Ratio Income
## 11.4220 2.8658 7.4432
VIFs above roughly 5 to 10 indicate significant multicollinearity. We can see that CrestBudget, with VIF \(\approx 11.42\), is significantly affected by multicollinearity.
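To see where this number comes from, the VIF for CrestBudget can be computed by hand as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing CrestBudget on the other predictors (a minimal sketch):
# R^2 from regressing CrestBudget on the remaining predictors
r2_budget <- summary(lm(CrestBudget ~ Ratio + Income, data = Crest))$r.squared
1 / (1 - r2_budget) # should match vif(crestfit1)["CrestBudget"], ~11.42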
Let's begin by defining the full model with all 3 variables and the null model with none of the variables, namely Crestfull and Crestnull.
# Full model contains all the 3 variables
Crestfull=lm(CrestSales~., data=Crest)
# null model contains no variable
Crestnull=lm(CrestSales~1, data=Crest)
step(Crestnull, scope=list(lower=Crestnull, upper=Crestfull), direction="forward")
## Start: AIC=302.64
## CrestSales ~ 1
##
## Df Sum of Sq RSS AIC
## + Income 1 2.8411e+10 1.2467e+09 260.27
## + CrestBudget 1 2.5611e+10 4.0466e+09 276.75
## + Ratio 1 1.0637e+10 1.9020e+10 298.42
## <none> 2.9658e+10 302.63
##
## Step: AIC=260.27
## CrestSales ~ Income
##
## Df Sum of Sq RSS AIC
## + CrestBudget 1 171624035 1075039723 260.19
## <none> 1246663758 260.27
## + Ratio 1 574590 1246089168 262.26
##
## Step: AIC=260.19
## CrestSales ~ Income + CrestBudget
##
## Df Sum of Sq RSS AIC
## + Ratio 1 158364757 916674966 259.96
## <none> 1075039723 260.19
##
## Step: AIC=259.96
## CrestSales ~ Income + CrestBudget + Ratio
##
## Call:
## lm(formula = CrestSales ~ Income + CrestBudget + Ratio, data = Crest)
##
## Coefficients:
## (Intercept) Income CrestBudget Ratio
## 34104.559 85.926 3.746 -30046.343
Forward selection using AIC scores starts from the null model and, over 3 iterations, adds Income (AIC = 260.27), CrestBudget (AIC = 260.19) and Ratio (AIC = 259.96) in turn, giving the model with the lowest AIC. This technique suggests all 3 variables are worth keeping, with Income the most important and Ratio the least. The null model has the worst AIC score (302.64) and the full model the best (259.96).
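Note that step() reports AIC via extractAIC(), i.e. \(n\log(RSS/n) + 2p\) up to an additive constant; the endpoint values above can be reproduced directly (a quick check):
extractAIC(Crestnull) # edf and AIC for the null model (~302.64)
extractAIC(Crestfull) # edf and AIC for the full model (~259.96)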
step(Crestfull, data=Crest, direction="backward")
## Start: AIC=259.96
## CrestSales ~ CrestBudget + Ratio + Income
##
## Df Sum of Sq RSS AIC
## <none> 916674966 259.96
## - Ratio 1 158364757 1075039723 260.19
## - CrestBudget 1 329414202 1246089168 262.26
## - Income 1 2109684588 3026359554 274.68
##
## Call:
## lm(formula = CrestSales ~ CrestBudget + Ratio + Income, data = Crest)
##
## Coefficients:
## (Intercept) CrestBudget Ratio Income
## 34104.559 3.746 -30046.343 85.926
Backward elimination using AIC scores starts from the full model with all 3 variables; since dropping any variable increases the AIC, the full model, with the lowest AIC score of 259.96, is retained as the best.
step(Crestnull, scope = list(upper=Crestfull), data=Crest, direction="both")
## Start: AIC=302.64
## CrestSales ~ 1
##
## Df Sum of Sq RSS AIC
## + Income 1 2.8411e+10 1.2467e+09 260.27
## + CrestBudget 1 2.5611e+10 4.0466e+09 276.75
## + Ratio 1 1.0637e+10 1.9020e+10 298.42
## <none> 2.9658e+10 302.63
##
## Step: AIC=260.27
## CrestSales ~ Income
##
## Df Sum of Sq RSS AIC
## + CrestBudget 1 1.7162e+08 1.0750e+09 260.19
## <none> 1.2467e+09 260.27
## + Ratio 1 5.7459e+05 1.2461e+09 262.26
## - Income 1 2.8411e+10 2.9658e+10 302.63
##
## Step: AIC=260.19
## CrestSales ~ Income + CrestBudget
##
## Df Sum of Sq RSS AIC
## + Ratio 1 158364757 916674966 259.96
## <none> 1075039723 260.19
## - CrestBudget 1 171624035 1246663758 260.27
## - Income 1 2971530267 4046569990 276.75
##
## Step: AIC=259.96
## CrestSales ~ Income + CrestBudget + Ratio
##
## Df Sum of Sq RSS AIC
## <none> 916674966 259.96
## - Ratio 1 158364757 1075039723 260.19
## - CrestBudget 1 329414202 1246089168 262.26
## - Income 1 2109684588 3026359554 274.68
##
## Call:
## lm(formula = CrestSales ~ Income + CrestBudget + Ratio, data = Crest)
##
## Coefficients:
## (Intercept) Income CrestBudget Ratio
## 34104.559 85.926 3.746 -30046.343
Stepwise regression, a combination of the forward and backward processes, using AIC after 4 iterations of adding and dropping variables also suggests the full model as the best.
As the best model is the full model, our model remains the same:
\[\hat y = 34104.559 + 3.746x_1 - 30046.343x_2 + 85.926x_3\]
or equivalently \(E(y) = 34104.559 + 3.746x_1 - 30046.343x_2 + 85.926x_3\).
We begin with the null model and add the most significant variable one at a time, now judged by the F-test rather than AIC:
add1(Crestnull, scope =Crestfull, test = "F")
## Single term additions
##
## Model:
## CrestSales ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2.9658e+10 302.63
## CrestBudget 1 2.5611e+10 4.0466e+09 276.75 75.9493 1.549e-06 ***
## Ratio 1 1.0637e+10 1.9020e+10 298.42 6.7111 0.02363 *
## Income 1 2.8411e+10 1.2467e+09 260.27 273.4764 1.268e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Income is the most significant variable, with the highest F-value and lowest p-value (1.268e-09), hence we add Income to the null model:
add1(update(Crestnull, ~ . +Income), scope = Crestfull, test = "F")
## Single term additions
##
## Model:
## CrestSales ~ Income
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1246663758 260.27
## CrestBudget 1 171624035 1075039723 260.19 1.7561 0.2120
## Ratio 1 574590 1246089168 262.26 0.0051 0.9445
Now no variable is significant at p < 0.05, as the F-values are small. Hence forward selection based on the F-test suggests a model with Income as the only predictor variable.
We begin with the full model and remove the least significant variable one at a time, judged by the F-test:
drop1(Crestfull,test="F")
## Single term deletions
##
## Model:
## CrestSales ~ CrestBudget + Ratio + Income
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 916674966 259.96
## CrestBudget 1 329414202 1246089168 262.26 3.5936 0.0872426 .
## Ratio 1 158364757 1075039723 260.19 1.7276 0.2180656
## Income 1 2109684588 3026359554 274.68 23.0145 0.0007265 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ratio is the least significant, with p-value 0.2180656 and the lowest F-value; we remove Ratio from the model.
drop1(update(Crestfull, ~ . -Ratio), test = "F")
## Single term deletions
##
## Model:
## CrestSales ~ CrestBudget + Income
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1075039723 260.19
## CrestBudget 1 171624035 1246663758 260.27 1.7561 0.2119796
## Income 1 2971530267 4046569990 276.75 30.4052 0.0001823 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now CrestBudget is the least significant, with p-value 0.2119796 and the lowest F-value; we remove CrestBudget from the model.
drop1(update(Crestfull, ~ . -Ratio-CrestBudget), test = "F")
## Single term deletions
##
## Model:
## CrestSales ~ Income
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1.2467e+09 260.27
## Income 1 2.8411e+10 2.9658e+10 302.63 273.48 1.268e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one remaining variable, Income, is significant at p < 0.05 with a high F-value.
Both the forward and backward processes using F-tests suggest that only the Income variable is a significant predictor of the response, Crest Sales.
Since the models selected by AIC and by F-tests differ, let's use other statistics to substantiate a model.
regsubsets.out = regsubsets(CrestSales~., data=Crest, nbest=3) # library(leaps)
summary.out <- summary(regsubsets.out)
as.data.frame(summary.out$outmat)
##         CrestBudget Ratio Income
## 1 ( 1 )                        *
## 1 ( 2 )           *
## 1 ( 3 )                 *
## 2 ( 1 )           *            *
## 2 ( 2 )                 *      *
## 2 ( 3 )           *     *
## 3 ( 1 )           *     *      *
par(mfrow=c(1,2))
plot(regsubsets.out, scale="r2")
plot(regsubsets.out, scale="adjr2")
par(mfrow=c(1,2))
plot(regsubsets.out, scale="bic")
plot(regsubsets.out, scale="Cp")
Although forward, backward and stepwise selection using AIC give a model with all 3 variables, the best model for our Crest data by forward and backward selection using the F-test has the Income variable alone. This selection is backed by the models chosen using \(R^2\), adjusted \(R^2\), BIC and Mallow's \(C_p\) statistics in the plots above.
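The winner under each criterion can also be read off programmatically from the regsubsets summary (a sketch; summary.out stores a vector per criterion):
# Index of the best model per criterion (rows of summary.out$outmat)
which.max(summary.out$adjr2) # highest adjusted R^2
which.min(summary.out$bic)   # lowest BIC
which.min(summary.out$cp)    # lowest Mallow's Cp
summary.out$outmat[which.min(summary.out$bic), ] # variables in the lowest-BIC model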
Let's fit the reduced model for Crest Sales (\(Y\)) using the Income (\(X_3\)) variable alone:
crestfit2<-lm(CrestSales~Income,Crest)
summary(crestfit2)
##
## Call:
## lm(formula = CrestSales ~ Income, data = Crest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25451.5 -1606.2 -204.3 3974.9 17508.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36068.352 7724.362 4.669 0.000542 ***
## Income 115.580 6.989 16.537 1.27e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10190 on 12 degrees of freedom
## Multiple R-squared: 0.958, Adjusted R-squared: 0.9545
## F-statistic: 273.5 on 1 and 12 DF, p-value: 1.268e-09
The regression model using the Income (\(X_3\)) variable is significant, as the p-value (1.268e-09) is less than the 0.05 significance level. The adjusted R-squared suggests 95.45% of the variability in \(CrestSales\) is explained by this model.
Compared to the full model, the reduced model has a smaller overall p-value (1.268e-09 vs 7.537e-08) but a slightly lower adjusted R-squared (95.45% vs 95.98%).
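The DAAG package was loaded above for press(); the PRESS statistic (prediction sum of squares, lower is better) offers a prediction-oriented comparison of the two models, a quick sketch:
press(crestfit1) # PRESS for the full model
press(crestfit2) # PRESS for the reduced model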
The reduced model is
\[\hat y = 36068.352 + 115.580x_3\]
or equivalently \(E(y) = 36068.352 + 115.580x_3\).
We use the F-statistic (or p-value) from the ANOVA table to test the significance of the reduced regression model (note the linear model now has 1 variable). The hypotheses are:
Null hypothesis \(H_0\): \(\beta_3 = 0\)
Alternative hypothesis \(H_1\): \(\beta_3 \not = 0\).
CrestANOVA2 <- anova(crestfit2)
CrestANOVA2
## Analysis of Variance Table
##
## Response: CrestSales
## Df Sum Sq Mean Sq F value Pr(>F)
## Income 1 2.8411e+10 2.8411e+10 273.48 1.268e-09 ***
## Residuals 12 1.2467e+09 1.0389e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (1.268e-09) is \(<\) \(0.05\), the regression is significant.
For the coefficient \(\beta_3\) of Income (\(X_3\)) the hypotheses for the test are,
Null hypothesis: \(H_0\): \(\beta_3 = 0\)
Alternative hypothesis: \(H_a\): \(\beta_3 \not = 0\)
And the test statistic: \(t = \frac {\hat{\beta_j} - \beta_j} {se\left(\hat{\beta_j}\right)}\). We reject \(H_0\) if \(|t| > t_{\alpha/2,\, n-p}\), or by the p-value.
The \(|t|\)-critical value for a two-tailed t-test with \(n - p = n - k - 1 = 12\) degrees of freedom is 2.178813,
qt(p = .025, df = length(Crest$CrestBudget) - 1 - 1)
## [1] -2.178813
From summary(crestfit2), the t-statistic value (16.537) is farther from 0 than the t-critical value (2.178813), hence the null hypothesis is rejected in favour of \(H_a\): the coefficient \(\beta_3\) is not zero. Thus \(\beta_3\) is a significant coefficient for the regression.
For the intercept \(\beta_0\) the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_0\) = 0
Alternate hypothesis:
\(H_a\) : \(\beta_0 \not = 0\)
From summary(crestfit2), the t-statistic value (4.669) is farther from 0 than the t-critical value (2.178813), hence the null hypothesis is rejected in favour of \(H_a\). Thus there is a significant intercept \(\beta_0\) in the regression.
par(mfrow=c(2,2))
plot(crestfit2)
1. Residuals vs Fitted
2. Normal Q-Q
3. Scale-Location
Although the plots imply linearity, randomness, homoscedasticity, etc. it is difficult to judge the residual’s behavior with certainty from the plots. So, we need to test the assumptions statistically.
1- Non-constant error variance test
\(H_0\): Errors have a constant
variance
\(H_1\): Errors have a
non-constant variance
ncvTest(crestfit2) #library(car)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0926299, Df = 1, p = 0.76086
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the constant error variance assumption is not violated.
2- Test for Normally Distributed Errors
\(H_0\): Errors are normally
distributed
\(H_1\): Errors are
not normally distributed
shapiro.test(crestfit2$residuals) # library(stats)
##
## Shapiro-Wilk normality test
##
## data: crestfit2$residuals
## W = 0.88389, p-value = 0.06604
Since the p-value is > 0.05, we do not reject \(H_0\). This implies that the normality assumption on the errors is not violated.
3- Test for Autocorrelated Errors
\(H_0\): Errors are uncorrelated
\(H_1\): Errors are correlated
durbinWatsonTest(crestfit2) # library(car)
## lag Autocorrelation D-W Statistic p-value
## 1 0.01486808 1.942954 0.606
## Alternative hypothesis: rho != 0
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the uncorrelated errors assumption is not violated.
acf(crestfit2$residuals)
Since no vertical bars beyond lag 0 exceed the confidence limits (dashed lines), no significant correlations exist between the errors.
4. Residuals vs Leverage
This plot helps us find influential cases, if any. We have noticed that observations 3, 8 and 10 are exceptions, but do they influence the regression results significantly, i.e. are they outliers? We notice no observations situated outside the Cook's distance bands, hence the regression results won't be unduly altered by any outliers.
Thus, all assumptions are satisfied by the errors.
The fitted models were the full model (using CrestBudget, Ratio and Income) and the reduced model (using Income alone). Judged by the \(AIC\), \(BIC\), \(F\), \(R^2\), adjusted \(R^2\) and Mallow's \(C_p\) statistics, the reduced model was selected. The reduced model was found to be significant (p-value = 1.268e-09), with a significant slope \(\beta_3\) (p-value = 1.27e-09) for the Income variable and a significant intercept (p-value = 0.000542). From the residual analysis on the reduced model, all assumptions were satisfied. Thus the model is adequate, and forecasts, if generated, will be reliable.
Since a model of good fit is obtained, forecasts can be generated using this model.
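For illustration, a forecast with a 95% prediction interval could be generated as below; the Income value of 2000 is hypothetical, not an actual 1981 figure.
# Hypothetical forecast of CrestSales at an assumed Income of 2000
predict(crestfit2, newdata = data.frame(Income = 2000),
        interval = "prediction", level = 0.95)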