This document was created with R Markdown, and then printed as pdf for peer-graded evaluation purposes.
Code chunks will not be echoed in the paper.
Simulated data set on weekly sales and advertising of a department store:
- Advertising: index of advertising efforts in current and previous week
- Sales: sales volume in current week
This exercise considers an example of data that do not satisfy all the standard assumptions of simple regression. In the considered case, one particular observation lies far off from the others, that is, it is an outlier. This violates assumptions A3 and A4, which state that all error terms \(ε_i\) are drawn from one and the same distribution with mean zero and fixed variance \(σ^2\). The dataset contains twenty weekly observations on sales and advertising of a department store. The question of interest lies in estimating the effect of advertising on sales. One of the weeks was special, as the store was also open in the evenings during this week, but this aspect will first be ignored in the analysis.
(a) Make the scatter diagram with sales on the vertical axis and advertising on the horizontal axis. What do you expect to find if you would fit a regression line to these data?
## 'data.frame': 20 obs. of 3 variables:
## $ Observ.: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Advert.: int 12 12 9 11 6 9 15 6 11 16 ...
## $ Sales : int 24 27 25 27 23 25 27 25 26 27 ...
## Observ. Advert. Sales
## 1 1 12 24
## 2 2 12 27
## 3 3 9 25
## 4 4 11 27
## 5 5 6 23
## 6 6 9 25
## `geom_smooth()` using formula 'y ~ x'
By eye, the scatter plot suggests a line with a slight positive slope. However, there is one outlier: its leverage is low (its \(x\) value lies within the range of the other observations), but its influence is high because its response is abnormally large.
Least squares therefore produces a regression line with a weak negative association between Advertising and Sales (more advertising, less sales), which does not really help in predicting Sales.
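A scatter diagram of this kind can be sketched in base R as follows. This is only an illustration, not the code used for the figure in this report (the `geom_smooth()` message suggests that one was drawn with ggplot2). The helper name `plot_sales` is hypothetical, and it assumes a data frame with the columns `Advert.` and `Sales` shown in the output above.

```r
# Hypothetical helper: scatter of Sales against Advert. with the OLS line.
# Assumes `df` has numeric columns `Advert.` and `Sales`.
plot_sales <- function(df) {
  fit <- lm(Sales ~ Advert., data = df)   # simple regression of Sales on Advert.
  plot(df$Advert., df$Sales,
       xlab = "Advertising", ylab = "Sales", pch = 19)
  abline(fit, col = "blue")               # add the fitted line a + b * x
  invisible(fit)                          # return the fit for later inspection
}
```

Calling `plot_sales(df)` on the 20-week data frame should reproduce the picture described above: one point far above the cloud and a nearly flat fitted line.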
(b) Estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
##
## Call:
## lm(formula = df$Sales ~ df$Advert., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6794 -2.7869 -1.3811 0.6803 22.3206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.6269 4.8815 6.069 9.78e-06 ***
## df$Advert. -0.3246 0.4589 -0.707 0.488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.836 on 18 degrees of freedom
## Multiple R-squared: 0.02704, Adjusted R-squared: -0.02701
## F-statistic: 0.5002 on 1 and 18 DF, p-value: 0.4885
From the regression output we see that the slope coefficient (-0.3246, with standard error 0.4589 and t-value -0.707) has a very high p-value of 0.488 (> 0.05). It is therefore not significant: the slope b is not significantly different from 0.
R-squared = 0.02704 is extremely low: less than 3% of the variation in Sales is explained by variation in Advertising.
The explanatory power of this model is thus negligible.
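The significance test for b can be checked by hand from the printed coefficient table. The sketch below uses only the rounded numbers from the output above, so its results agree with the printed ones only up to rounding; the variable names are illustrative.

```r
# Values copied from the coefficient table of the full-sample regression.
b        <- -0.3246   # slope estimate
se_b     <-  0.4589   # standard error of b
df_resid <- 18        # residual degrees of freedom (20 observations - 2 coefficients)

t_b    <- b / se_b                          # t-value of b under H0: b = 0
p_b    <- 2 * pt(-abs(t_b), df = df_resid)  # two-sided p-value from the t(18) distribution
t_crit <- qt(0.975, df = df_resid)          # two-sided critical value at the 5% level

print(round(c(t = t_b, p = p_b, t_crit = t_crit), 3))
```

The absolute t-value (about 0.71) is far below the critical value (about 2.10), which is the same conclusion as the p-value gives.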
(c) Compute the residuals and draw a histogram of these residuals. What conclusion do you draw from this histogram?
## 1 2 3 4 5 6
## -1.73199382 1.26800618 -1.70571870 0.94343122 -4.67944359 -1.70571870
## 7 8 9 10 11 12
## 2.24173107 -2.67944359 -0.05656878 2.56630603 -1.05656878 22.32055641
## 13 14 15 16 17 18
## 0.59258114 -3.05656878 0.59258114 -4.35486862 -4.03029366 -3.03029366
## 19 20
## 0.26800618 -2.70571870
The histogram shows that the residuals would look roughly normally distributed were it not for the outlier noticed above: a single residual exceeds 20 (22.32), whereas all the others lie between -5 and +5.
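The same conclusion can be reached numerically from the residuals printed above. The following sketch copies those 20 values, tabulates them in bins of width 5, and expresses the largest residual in units of the residual standard error. The bin boundaries and variable names are choices made for this illustration.

```r
# The 20 residuals of the full-sample regression, copied from the output above.
res <- c(-1.73199382,  1.26800618, -1.70571870,  0.94343122, -4.67944359,
         -1.70571870,  2.24173107, -2.67944359, -0.05656878,  2.56630603,
         -1.05656878, 22.32055641,  0.59258114, -3.05656878,  0.59258114,
         -4.35486862, -4.03029366, -3.03029366,  0.26800618, -2.70571870)

# Bin counts of the histogram; drop `plot = FALSE` to draw it instead.
h <- hist(res, breaks = seq(-5, 25, by = 5), plot = FALSE)
print(h$counts)

# Residual standard error: sqrt(SSR / (n - 2)).
s <- sqrt(sum(res^2) / (length(res) - 2))

# Observation with the largest absolute residual, and its size in units of s.
worst <- which.max(abs(res))
print(c(s = s, observation = worst, ratio = res[worst] / s))
```

The residual standard error recomputed this way agrees with the 5.836 reported in part (b), and the residual of observation 12 is about 3.8 times that size, while every other residual is smaller than s in absolute value.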
(d) Apparently, the regression result of part (b) is not satisfactory. Once you realize that the large residual corresponds to the week with opening hours during the evening, how would you proceed to get a more satisfactory regression model?
Since we have an extreme outlier that falls far outside the general cluster of the data, it is appropriate to subset the data and exclude this observation. The week with evening opening hours would need a different model.
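Two ways of handling the special week can be sketched as follows: dropping the observation, or keeping it and adding a dummy variable that equals 1 only in that week. The dummy-variable variant is not part of the analysis in this report and is shown only as an alternative. The function names, the column name `Evening`, and the argument `special` (the row number of the special week, 12 according to the residuals above) are hypothetical. Both functions assume a data frame with columns `Advert.` and `Sales`.

```r
# Variant 1: estimate the model without the special week.
fit_excluding <- function(df, special) {
  lm(Sales ~ Advert., data = df[-special, ])
}

# Variant 2: keep all weeks and give the special week its own dummy.
fit_with_dummy <- function(df, special) {
  df$Evening <- as.integer(seq_len(nrow(df)) == special)  # 1 for the special week, else 0
  lm(Sales ~ Advert. + Evening, data = df)
}
```

With a dummy for a single observation, that observation is fitted exactly, so the estimates of a and b coincide with those obtained after deleting it. The dummy coefficient then measures how far the special week lies above the line.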
(e) Delete this special week from the sample and use the remaining 19 weeks to estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
##
## Call:
## lm(formula = SaleswithoutNA$Sales ~ SaleswithoutNA$Advert., data = SaleswithoutNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2500 -0.4375 0.0000 0.5000 1.7500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.1250 0.9548 22.124 5.72e-14 ***
## SaleswithoutNA$Advert. 0.3750 0.0882 4.252 0.000538 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 17 degrees of freedom
## Multiple R-squared: 0.5154, Adjusted R-squared: 0.4869
## F-statistic: 18.08 on 1 and 17 DF, p-value: 0.0005379
In this regression output the slope coefficient (0.3750, with standard error 0.0882 and t-value 4.252) is highly significant (p-value < 0.001). Hence we reject the null hypothesis that b equals 0: the slope is significantly different from 0.
(f) Discuss the differences between your findings in parts (b) and (e). Describe in words what you have learned from these results.
## `geom_smooth()` using formula 'y ~ x'
Comparing the regression output of parts (b) and (e), we see that after removing the outlier the slope coefficient has changed sign, from -0.3246 to 0.3750, and has become significant.
The scatter plot and regression line now show a positive linear association, with R-squared = 0.5154: about 52% of the variation in Sales is explained by variation in Advertising, so the explanatory power of the model has improved drastically compared with the original model.
The F-statistic has also become significant (18.08 with p-value 0.0005, against 0.5002 with p-value 0.4885 in the original model), so the model as a whole is now significant.
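In a simple regression the overall F-statistic is the square of the slope's t-value, and R-squared can be recovered from F and the residual degrees of freedom. The sketch below checks this for both regressions using only the rounded numbers printed in the two outputs, so small rounding differences are expected. The function name `summarise_fit` is illustrative.

```r
# Recompute t, F and R-squared from a slope estimate, its standard error,
# and the residual degrees of freedom (valid for simple regression only).
summarise_fit <- function(b, se_b, df_resid) {
  t_b <- b / se_b             # t-value of the slope
  f   <- t_b^2                # F-statistic on 1 and df_resid degrees of freedom
  r2  <- f / (f + df_resid)   # R-squared implied by F
  c(t = t_b, F = f, R2 = r2)
}

full    <- summarise_fit(-0.3246, 0.4589, 18)  # all 20 weeks, part (b)
reduced <- summarise_fit( 0.3750, 0.0882, 17)  # 19 weeks, part (e)
print(round(rbind(full, reduced), 4))
```

Because F and R-squared are functions of the same t-value here, the significance of the slope, of the F-statistic, and of the model as a whole are one and the same finding rather than three separate pieces of evidence.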
We have learned that a single outlier can dominate a regression, and above all that we need tidy data and a real understanding of the phenomenon we want to analyze.