• See website for how to submit your answers and how feedback is organized.
• This exercise uses the datafile TestExer1 and requires a computer.
• The dataset TestExer1 is available on the website.
• Get hands-on experience with performing simple regressions.
• Get a feeling for the consequences of violations of regression assumptions.
• Obtain some experience with how to diagnose that an assumption is violated.
This exercise considers an example of data that do not satisfy all the standard assumptions of simple regression. In the considered case, one particular observation lies far off from the others, that is, it is an outlier. This violates assumptions A3 and A4, which state that all error terms εi are drawn from one and the same distribution with mean zero and fixed variance σ2. The dataset contains twenty weekly observations on sales and advertising of a department store. The question of interest lies in estimating the effect of advertising on sales. One of the weeks was special, as the store was also open in the evenings during this week, but this aspect will first be ignored in the analysis.
We expect to find a positive relationship between those variables.
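The model to be estimated is the simple regression Sales_i = B0 + B1 * Advertising_i + e_i, where B1 captures the effect of one extra unit of advertising on weekly sales.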
library(readxl)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Read the weekly sales and advertising data from the spreadsheet
df <- read_excel("exam1.xls", col_names = TRUE)
df %>% ggplot(aes(Advertising,Sales))+geom_point()
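As an optional illustration (not part of the original answer), the point with the largest sales value can be highlighted to make a potential outlier easier to spot:

# Highlight the week with the largest sales value in red
df %>% ggplot(aes(Advertising, Sales)) +
  geom_point() +
  geom_point(data = filter(df, Sales == max(Sales)), colour = "red")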
As we can see, there is a positive relationship, but there is also one outlier: a week in which advertising spending was 6 yet total sales reached about 50 units.
So, for this question we need to fit a linear model:
fit <- lm(Sales ~ Advertising, data = df)
summary(fit)
Call:
lm(formula = Sales ~ Advertising, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.6794 -2.7869 -1.3811 0.6803 22.3206
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.6269 4.8815 6.069 9.78e-06 ***
Advertising -0.3246 0.4589 -0.707 0.488
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.836 on 18 degrees of freedom
Multiple R-squared: 0.02704, Adjusted R-squared: -0.02701
F-statistic: 0.5002 on 1 and 18 DF, p-value: 0.4885
In this model, B1 (Advertising) is not statistically different from zero, so we fail to reject H0: B1 = 0.
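As a quick check (not part of the original answer), the 95% confidence interval for the Advertising coefficient should contain zero, which is consistent with failing to reject H0:

# 95% confidence interval for the slope estimate
confint(fit, "Advertising", level = 0.95)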
Computing the residuals:
# Extract the residuals from the fitted model
resid_fit <- residuals(fit)
df <- df %>% mutate(Resids = resid_fit)
df %>% ggplot(aes(Resids))+geom_histogram(bins = 30)
The histogram shows that the residuals are not normally distributed; the outlier is distorting the model estimates. To clean the data and obtain better coefficient estimates with smaller errors, we need to drop that observation (one way to locate it is sketched below).
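A small diagnostic sketch, assuming the common rule of thumb that studentized residuals larger than about 3 in absolute value flag potential outliers:

# Studentized residuals of the original fit; report the suspect observation number(s)
stud_res <- rstudent(fit)
df$Observation[abs(stud_res) > 3]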
head(df)
# A tibble: 6 x 4
Observation Advertising Sales Resids
<dbl> <dbl> <dbl> <dbl>
1 1 12 24 -1.73
2 2 12 27 1.27
3 3 9 25 -1.71
4 4 11 27 0.943
5 5 6 23 -4.68
6 6 9 25 -1.71
# Remove the row with the largest residual (the outlying week)
df <- df[-which.max(df$Resids), ]
df
# A tibble: 19 x 4
Observation Advertising Sales Resids
<dbl> <dbl> <dbl> <dbl>
1 1 12 24 -1.73
2 2 12 27 1.27
3 3 9 25 -1.71
4 4 11 27 0.943
5 5 6 23 -4.68
6 6 9 25 -1.71
7 7 15 27 2.24
8 8 6 25 -2.68
9 9 11 26 -0.0566
10 10 16 27 2.57
11 11 11 25 -1.06
12 13 13 26 0.593
13 14 11 23 -3.06
14 15 13 26 0.593
15 16 7 23 -4.35
16 17 8 23 -4.03
17 18 8 24 -3.03
18 19 12 26 0.268
19 20 9 24 -2.71
We have now removed that week, so we re-estimate the coefficients of our model:
fit2 <- lm(Sales ~ Advertising, data= df)
summary(fit2)
Call:
lm(formula = Sales ~ Advertising, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.2500 -0.4375 0.0000 0.5000 1.7500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.1250 0.9548 22.124 5.72e-14 ***
Advertising 0.3750 0.0882 4.252 0.000538 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.054 on 17 degrees of freedom
Multiple R-squared: 0.5154, Adjusted R-squared: 0.4869
F-statistic: 18.08 on 1 and 17 DF, p-value: 0.0005379
Now we can reject H0: B1 = 0, so the Advertising coefficient is statistically significant.
df %>% ggplot(aes(Advertising,Sales))+geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
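To see how much that single week affected the estimates, the coefficients of the two fits can be compared side by side (a small illustrative check, not part of the exercise):

# Coefficients before and after removing the outlying week
rbind(with_outlier = coef(fit), without_outlier = coef(fit2))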
I learned that the presence of an outlier can change the model substantially, and that fixing it requires finding the outlier and fitting a new model. But are all outliers bad? That is a question I expect to answer in the coming classes.