This exercise considers an example of data that do not satisfy all the standard assumptions of simple regression. In this case, one observation lies far from the others, that is, it is an outlier. This violates assumptions A3 and A4, which state that all error terms \(\epsilon_i\) are drawn from one and the same distribution with mean zero and fixed variance \(\sigma^2\).
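In symbols, the model and the two assumptions mentioned above can be sketched as follows. The labels A3 and A4 follow the numbering used in the text, and a and b denote the coefficients as in the questions below; the exact wording of the assumptions in the course material may differ.

```latex
y_i = a + b\,x_i + \epsilon_i, \qquad
E[\epsilon_i] = 0, \qquad
\operatorname{Var}(\epsilon_i) = \sigma^2
\quad \text{for all } i = 1, \dots, n .
```

Here \(y_i\) is sales and \(x_i\) is advertising in week \(i\). An observation generated under different conditions than the rest is hard to reconcile with a single error distribution shared by all weeks.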
The dataset contains twenty weekly observations on sales and advertising of a department store. The question of interest lies in estimating the effect of advertising on sales. One of the weeks was special, as the store was also open in the evenings during this week, but this aspect will first be ignored in the analysis.
(a) Make the scatter diagram with sales on the vertical axis and advertising on the horizontal axis. What do you expect to find if you would fit a regression line to these data?
#import the dataset
SalesRound1<-read.csv("TestExer1-sales-round1.csv", header = TRUE)
str(SalesRound1)
## 'data.frame': 20 obs. of 3 variables:
## $ Observation: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Advertising: int 12 12 9 11 6 9 15 6 11 16 ...
## $ Sales : int 24 27 25 27 23 25 27 25 26 27 ...
head(SalesRound1)
## Observation Advertising Sales
## 1 1 12 24
## 2 2 12 27
## 3 3 9 25
## 4 4 11 27
## 5 5 6 23
## 6 6 9 25
#scatterplot
plot(SalesRound1$Advertising,SalesRound1$Sales,pch=19,col="darkblue")
#fit a model
fit<-lm(SalesRound1$Sales~SalesRound1$Advertising,data=SalesRound1)
#add the fitted regression line to the scatterplot
abline(fit,lwd=2,col="darkred")
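The scatterplot shows one week with sales far above all the others. Because of this outlier, the fitted line has a slightly negative slope, suggesting a weak negative association between Advertising and Sales, with low explanatory power (Multiple R-squared = 0.02704, see the regression output in part (b)).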
(b) Estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
#fit a model
fit<-lm(SalesRound1$Sales~SalesRound1$Advertising,data=SalesRound1)
summary(fit)
##
## Call:
## lm(formula = SalesRound1$Sales ~ SalesRound1$Advertising, data = SalesRound1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6794 -2.7869 -1.3811 0.6803 22.3206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.6269 4.8815 6.069 9.78e-06 ***
## SalesRound1$Advertising -0.3246 0.4589 -0.707 0.488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.836 on 18 degrees of freedom
## Multiple R-squared: 0.02704, Adjusted R-squared: -0.02701
## F-statistic: 0.5002 on 1 and 18 DF, p-value: 0.4885
From the regression output, the estimated coefficients are a = 29.6269 and b = -0.3246. The standard error of b is 0.4589 and its t-value is -0.707, with p-value 0.488 > 0.05. Hence the slope b is not significantly different from 0 at the 5% level.
Also, Multiple R-squared = 0.02704, which implies that only about 2.7% of the variation in Sales is explained by variation in Advertising, indicating low explanatory power of the model.
The F-statistic (0.5002, p-value 0.4885) is not significant either. In a simple regression this test is equivalent to the t-test on b, so the model as a whole is not significant.
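As a quick arithmetic check on the reported numbers (using the rounded values from the output above and n = 20 observations): the t-value is the estimate divided by its standard error, and in a simple regression the F-statistic equals the squared t-value and can also be written in terms of R-squared.

```latex
t_b = \frac{b}{\mathrm{SE}(b)} = \frac{-0.3246}{0.4589} \approx -0.707,
\qquad
F = t_b^2 = \frac{(n-2)\,R^2}{1 - R^2}
  = \frac{18 \times 0.02704}{1 - 0.02704} \approx 0.500 .
```

Both expressions agree with the printed F-statistic of 0.5002 up to rounding.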
(c) Compute the residuals and draw a histogram of these residuals. What conclusion do you draw from this histogram?
resid<-residuals(fit)
resid
## 1 2 3 4 5 6
## -1.73199382 1.26800618 -1.70571870 0.94343122 -4.67944359 -1.70571870
## 7 8 9 10 11 12
## 2.24173107 -2.67944359 -0.05656878 2.56630603 -1.05656878 22.32055641
## 13 14 15 16 17 18
## 0.59258114 -3.05656878 0.59258114 -4.35486862 -4.03029366 -3.03029366
## 19 20
## 0.26800618 -2.70571870
hist(resid,probability= TRUE,breaks=12,col="steelblue")
lines(density(resid),col="red",lwd=2)
The histogram of the residuals is strongly right-skewed: most residuals lie between -5 and +5, while one extreme value (about 22.3, observation 12) lies far to the right. This single outlier makes the residual distribution clearly non-normal.
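The visual impression can be backed up with a rough numerical check. The sketch below is not part of the original analysis: it re-enters the residuals printed above (rounded to four decimals), scales them by the residual standard error, and flags any value beyond three standard errors. The names `res`, `s` and `z` and the cut-off of 3 are illustrative assumptions; this simple scaling ignores leverage, unlike R's `rstandard()`.

```r
# Residuals copied by hand from the printed output of part (c), rounded
res <- c(-1.7320,  1.2680, -1.7057,  0.9434, -4.6794, -1.7057,
          2.2417, -2.6794, -0.0566,  2.5663, -1.0566, 22.3206,
          0.5926, -3.0566,  0.5926, -4.3549, -4.0303, -3.0303,
          0.2680, -2.7057)

# Residual standard error with n - 2 = 18 degrees of freedom
s <- sqrt(sum(res^2) / (length(res) - 2))

# Scaled residuals (assumption: plain division by s, no leverage correction)
z <- res / s

# Indices of observations more than 3 standard errors from zero
which(abs(z) > 3)
```

The value of `s` should agree with the residual standard error reported in part (b), which is a useful sanity check that the residuals were copied correctly.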
(d) Apparently, the regression result of part (b) is not satisfactory. Once you realize that the large residual corresponds to the week with opening hours during the evening, how would you proceed to get a more satisfactory regression model?
#lets find the outlier
which.max(SalesRound1$Sales)
## [1] 12
#The 12th obs is the outlier
#outlier value in terms of sales
SalesRound1$Sales[12]
## [1] 50
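The large residual belongs to observation 12, the week in which the store was also open in the evenings. This week was generated under different conditions than the other 19 weeks and lies far outside the general cluster of the data, so it seems prudent to remove it and re-estimate the model.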
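An alternative to deleting the week, not used in the original answer, is to keep all 20 observations and add a dummy variable for the evening-opening week. The sketch below rests on assumptions: the data are re-entered by hand rather than read from the CSV file, the Advertising value for week 12 (taken here as 6) is not printed in the text and is inferred from that week's residual in part (c), and the names `sales`, `Evening` and `fit_dummy` are illustrative.

```r
# Data re-entered by hand; week 12's Advertising value (6) is an inferred assumption
sales <- data.frame(
  Advertising = c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                  11, 6, 13, 11, 13, 7, 8, 8, 12, 9),
  Sales       = c(24, 27, 25, 27, 23, 25, 27, 25, 26, 27,
                  25, 50, 26, 23, 26, 23, 23, 24, 26, 24)
)

# Dummy variable: 1 for the week with evening opening (week 12), 0 otherwise
sales$Evening <- as.integer(seq_len(nrow(sales)) == 12)

# Regression with a separate level shift for the special week
fit_dummy <- lm(Sales ~ Advertising + Evening, data = sales)
coef(fit_dummy)
```

Because the dummy fits week 12 exactly, that week no longer influences the Advertising coefficient, so the slope is the same as the one obtained by dropping the week from the sample.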
(e) Delete this special week from the sample and use the remaining 19 weeks to estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
#lets remove the outlier (12th obs)
SaleswithoutNA<-SalesRound1[-12,]
SaleswithoutNA
## Observation Advertising Sales
## 1 1 12 24
## 2 2 12 27
## 3 3 9 25
## 4 4 11 27
## 5 5 6 23
## 6 6 9 25
## 7 7 15 27
## 8 8 6 25
## 9 9 11 26
## 10 10 16 27
## 11 11 11 25
## 13 13 13 26
## 14 14 11 23
## 15 15 13 26
## 16 16 7 23
## 17 17 8 23
## 18 18 8 24
## 19 19 12 26
## 20 20 9 24
#fit the new regression model
fit2<-lm(SaleswithoutNA$Sales~SaleswithoutNA$Advertising,data=SaleswithoutNA)
summary(fit2)
##
## Call:
## lm(formula = SaleswithoutNA$Sales ~ SaleswithoutNA$Advertising,
## data = SaleswithoutNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2500 -0.4375 0.0000 0.5000 1.7500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.1250 0.9548 22.124 5.72e-14 ***
## SaleswithoutNA$Advertising 0.3750 0.0882 4.252 0.000538 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 17 degrees of freedom
## Multiple R-squared: 0.5154, Adjusted R-squared: 0.4869
## F-statistic: 18.08 on 1 and 17 DF, p-value: 0.0005379
From the regression output, the estimated coefficients are a = 21.1250 and b = 0.3750. The standard error of b is 0.0882 and its t-value is 4.252, with p-value 0.000538 < 0.001. Hence we reject the null hypothesis b = 0: the slope coefficient is significantly different from 0.
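The t-value and p-value can be recomputed from the reported estimate and standard error alone. The following sketch uses only the rounded numbers from the output above; the variable names and the added 95% confidence interval are illustrative and not part of the original answer.

```r
# Rounded values taken from the regression output of part (e)
b  <- 0.3750   # slope estimate
se <- 0.0882   # standard error of the slope
df <- 17       # residual degrees of freedom (19 observations - 2 parameters)

# t-value and two-sided p-value for H0: b = 0
t_val <- b / se
p_val <- 2 * pt(-abs(t_val), df)

# 95% confidence interval for b (an addition for illustration)
ci <- b + c(-1, 1) * qt(0.975, df) * se

c(t = t_val, p = p_val, lower = ci[1], upper = ci[2])
```

A confidence interval that excludes zero leads to the same conclusion as the t-test at the 5% level.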
(f) Discuss the differences between your findings in parts (b) and (e). Describe in words what you have learned from these results.
#lets plot the scatter plot and regression line to the data without outlier
#scatterplot
plot(SaleswithoutNA$Advertising,SaleswithoutNA$Sales,pch=19,col="darkblue")
#add the fitted regression line to the scatterplot
abline(fit2,lwd=2,col="darkred")
Comparing the regression outputs of parts (b) and (e), we see that after removing the outlier the slope coefficient changes sign, from -0.3246 to 0.3750, and becomes significant.
The scatterplot and the regression line now show a positive linear association, with Multiple R-squared = 0.5154, which implies that about 52% of the variation in Sales is explained by variation in Advertising, i.e. the explanatory power of the model has improved drastically compared with the original model.
The F-statistic has also become significant (p-value 0.0005379), implying that the model as a whole is significant. A single outlier can thus distort the regression results completely, so it pays to inspect the data and the residuals before interpreting the estimates.
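The two fits can also be put side by side in one small table. The sketch below is a reconstruction under assumptions: the data are re-entered by hand instead of being read from the CSV file, the Advertising value of week 12 (taken as 6) is inferred from the residual in part (c) rather than printed in the text, and the names `sales`, `fit_all`, `fit_drop` and `cmp` are illustrative. It uses the formula form `Sales ~ Advertising` with `data =`, which is equivalent to the `$`-based calls above.

```r
# Data re-entered by hand; week 12's Advertising value (6) is an inferred assumption
sales <- data.frame(
  Advertising = c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                  11, 6, 13, 11, 13, 7, 8, 8, 12, 9),
  Sales       = c(24, 27, 25, 27, 23, 25, 27, 25, 26, 27,
                  25, 50, 26, 23, 26, 23, 23, 24, 26, 24)
)

fit_all  <- lm(Sales ~ Advertising, data = sales)          # all 20 weeks
fit_drop <- lm(Sales ~ Advertising, data = sales[-12, ])   # week 12 removed

# Collect slope, its standard error and R-squared for each fit
cmp <- sapply(list(all20 = fit_all, drop12 = fit_drop), function(m) {
  s <- summary(m)
  c(slope = unname(coef(m)[2]),
    se    = s$coefficients[2, "Std. Error"],
    r2    = s$r.squared)
})
round(cmp, 4)
```

Each column of `cmp` corresponds to one model, which makes the change in sign, precision and fit visible at a glance.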