MOOC Econometrics

Test Exercise 1

Notes:

• See website for how to submit your answers and how feedback is organized.
• This exercise uses the datafile TestExer1 and requires a computer.
• The dataset TestExer1 is available on the website.

Goals and skills being used:

• Get hands-on experience with performing simple regressions.
• Get feeling for consequences of violations of regression assumptions.
• Obtain some experience with how to diagnose that an assumption is violated.

Questions

This exercise considers an example of data that do not satisfy all the standard assumptions of simple regression. In the considered case, one particular observation lies far off from the others, that is, it is an outlier. This violates assumptions A3 and A4, which state that all error terms ε_i are drawn from one and the same distribution with mean zero and fixed variance σ². The dataset contains twenty weekly observations on sales and advertising of a department store. The question of interest lies in estimating the effect of advertising on sales. One of the weeks was special, as the store was also open in the evenings during this week, but this aspect will first be ignored in the analysis.

(a) Make the scatter diagram with sales on the vertical axis and advertising on the horizontal axis. What do you expect to find if you would fit a regression line to these data?

We expect to find a positive relationship between these variables: weeks with more advertising should show higher sales, so the fitted line should slope upward.

library(readxl)
library(ggplot2)
library(dplyr)
df <- read_excel("exam1.xls", col_names = TRUE)
df %>% ggplot(aes(Advertising,Sales))+geom_point()

As we can see, there is a positive relationship for most of the weeks, but there is one outlier: in the week where advertising was 6, sales reached 50 (in monetary units), far above every other week.

(b) Estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?

For this question we fit a linear model:

fit <- lm(Sales ~ Advertising, data = df)
summary(fit)

Call:
lm(formula = Sales ~ Advertising, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6794 -2.7869 -1.3811  0.6803 22.3206 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.6269     4.8815   6.069 9.78e-06 ***
Advertising  -0.3246     0.4589  -0.707    0.488    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.836 on 18 degrees of freedom
Multiple R-squared:  0.02704,   Adjusted R-squared:  -0.02701 
F-statistic: 0.5002 on 1 and 18 DF,  p-value: 0.4885

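In this model, the estimates are a = 29.6269 and b = -0.3246 (the coefficient on Advertising), with standard error 0.4589 and t-value -0.707 for b (p = 0.488). So b is not significantly different from zero, and we cannot reject H0: b = 0.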
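The numbers in the summary can also be reproduced from the textbook least-squares formulas. The following is a minimal sketch, assuming the data are typed in from the tables printed in this document rather than read from the Excel file; observation 12, which is missing from the 19-row table, is assumed to be the outlier with Advertising = 6 and Sales = 50. The variable names are illustrative only.

```r
# Data typed in from the printed tables (assumption: observation 12 is the
# outlier with Advertising = 6 and Sales = 50).
advertising <- c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                 11, 6, 13, 11, 13, 7, 8, 8, 12, 9)
sales       <- c(24, 27, 25, 27, 23, 25, 27, 25, 26, 27,
                 25, 50, 26, 23, 26, 23, 23, 24, 26, 24)

n     <- length(sales)
x_bar <- mean(advertising)
y_bar <- mean(sales)
s_xx  <- sum((advertising - x_bar)^2)

# Slope and intercept from the least-squares formulas
b <- sum((advertising - x_bar) * (sales - y_bar)) / s_xx
a <- y_bar - b * x_bar

# Residual variance with n - 2 degrees of freedom (two estimated coefficients)
e    <- sales - a - b * advertising
s2   <- sum(e^2) / (n - 2)

# Standard error and t-value of b, with a two-sided p-value
se_b <- sqrt(s2 / s_xx)
t_b  <- b / se_b
p_b  <- 2 * pt(-abs(t_b), df = n - 2)

round(c(a = a, b = b, se_b = se_b, t_b = t_b, p_b = p_b), 4)
```

Under these assumptions the values should agree with the `summary(fit)` output above up to rounding.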

(c) Compute the residuals and draw a histogram of these residuals. What conclusion do you draw from this histogram?

Computing the residuals:

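# Extract the residuals of the fitted model and store them next to the data
resid_fit <- residuals(fit)
df <- df %>% mutate(Resids = resid_fit)
# Histogram of the residuals
df %>% ggplot(aes(Resids)) + geom_histogram(bins = 30)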

The histogram shows that the residuals are not normally distributed: most of them lie between about -5 and 3, while a single residual of about 22 lies far to the right. This outlier is distorting the estimated model.
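The visual impression from the histogram can be backed up numerically by scaling each residual by the residual standard error. This is only a sketch: the data are typed in from the tables printed in this document (with observation 12 assumed to be Advertising = 6, Sales = 50), and the cutoff of 3 is a common rule of thumb chosen here for illustration, not something prescribed by the exercise.

```r
# Data typed in from the printed tables (assumption: observation 12 is the
# outlier with Advertising = 6 and Sales = 50).
advertising <- c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                 11, 6, 13, 11, 13, 7, 8, 8, 12, 9)
sales       <- c(24, 27, 25, 27, 23, 25, 27, 25, 26, 27,
                 25, 50, 26, 23, 26, 23, 23, 24, 26, 24)

fit_all <- lm(sales ~ advertising)

# Residuals divided by the residual standard error s
s      <- summary(fit_all)$sigma
scaled <- residuals(fit_all) / s

# Flag observations more than 3 standard errors from the line
# (3 is an assumed rule-of-thumb cutoff)
flagged <- unname(which(abs(scaled) > 3))

flagged
round(scaled[flagged], 2)
```

With these data only one week is flagged, and it is the same observation that `which.max(df$Resids)` picks out in part (d).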

(d) Apparently, the regression result of part (b) is not satisfactory. Once you realize that the large residual corresponds to the week with opening hours during the evening, how would you proceed to get a more satisfactory regression model?

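Because the large residual comes from a week with special circumstances (the store was also open in the evenings), that observation is not comparable with the others. We therefore drop it from the sample and re-estimate the model on the remaining 19 weeks, which should give a coefficient estimate with a much smaller error.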

head(df)
# A tibble: 6 x 4
  Observation Advertising Sales Resids
        <dbl>       <dbl> <dbl>  <dbl>
1           1          12    24 -1.73 
2           2          12    27  1.27 
3           3           9    25 -1.71 
4           4          11    27  0.943
5           5           6    23 -4.68 
6           6           9    25 -1.71 
df <- df[-which.max(df$Resids),]
df
# A tibble: 19 x 4
   Observation Advertising Sales  Resids
         <dbl>       <dbl> <dbl>   <dbl>
 1           1          12    24 -1.73  
 2           2          12    27  1.27  
 3           3           9    25 -1.71  
 4           4          11    27  0.943 
 5           5           6    23 -4.68  
 6           6           9    25 -1.71  
 7           7          15    27  2.24  
 8           8           6    25 -2.68  
 9           9          11    26 -0.0566
10          10          16    27  2.57  
11          11          11    25 -1.06  
12          13          13    26  0.593 
13          14          11    23 -3.06  
14          15          13    26  0.593 
15          16           7    23 -4.35  
16          17           8    23 -4.03  
17          18           8    24 -3.03  
18          19          12    26  0.268 
19          20           9    24 -2.71  
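Deleting the week is the approach taken in this answer. A closely related alternative, shown here only as an illustration and not as part of the original solution, keeps all 20 weeks and adds an indicator variable for the special week. The name `evening` is hypothetical, and the data are typed in from the tables printed above, with observation 12 assumed to be Advertising = 6 and Sales = 50.

```r
# Data typed in from the printed tables (assumption: observation 12 is the
# outlier with Advertising = 6 and Sales = 50).
advertising <- c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                 11, 6, 13, 11, 13, 7, 8, 8, 12, 9)
sales       <- c(24, 27, 25, 27, 23, 25, 27, 25, 26, 27,
                 25, 50, 26, 23, 26, 23, 23, 24, 26, 24)

# Hypothetical indicator: 1 for the week with evening opening hours, 0 otherwise
evening <- as.numeric(seq_along(sales) == 12)

# Regression on all 20 weeks with the indicator included
fit_dummy <- lm(sales ~ advertising + evening)

# Regression on the 19 ordinary weeks, for comparison
fit_drop <- lm(sales ~ advertising, subset = evening == 0)

round(coef(fit_dummy), 4)
round(coef(fit_drop), 4)
```

Because the indicator fits the special week exactly, the intercept and the advertising coefficient are the same in both regressions; the coefficient on `evening` measures how far that week's sales lie above the line fitted to the other weeks.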

(e) Delete this special week from the sample and use the remaining 19 weeks to estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?

Having deleted that week in part (d), we now re-estimate the coefficients of the model on the remaining 19 observations:

fit2 <- lm(Sales ~ Advertising, data= df)
summary(fit2)

Call:
lm(formula = Sales ~ Advertising, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2500 -0.4375  0.0000  0.5000  1.7500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.1250     0.9548  22.124 5.72e-14 ***
Advertising   0.3750     0.0882   4.252 0.000538 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.054 on 17 degrees of freedom
Multiple R-squared:  0.5154,    Adjusted R-squared:  0.4869 
F-statistic: 18.08 on 1 and 17 DF,  p-value: 0.0005379

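Now the estimates are a = 21.1250 and b = 0.3750, with standard error 0.0882 and t-value 4.252 for b (p = 0.0005). We can reject H0: b = 0, so the effect of advertising on sales is statistically significant.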

df %>% ggplot(aes(Advertising,Sales))+geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

(f) Discuss the differences between your findings in parts (b) and (e). Describe in words what you have learned from these results.

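I learned that a single outlier can change the regression results a lot. With the special week included, the estimated slope is negative (-0.3246) and not significant; without it, the slope is positive (0.3750) and significant, and R-squared rises from 0.03 to 0.52. To fix the problem, the outlier has to be identified and the model re-estimated without it. This leaves me with a question: are all outliers bad? I hope to learn the answer in the coming classes.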
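The comparison between parts (b) and (e) can be put side by side in one small table. This is a sketch under the same assumptions as before: the data are typed in from the tables printed in this document, observation 12 is taken to be the special week (Advertising = 6, Sales = 50), and `summarise_fit` is a helper name made up for this illustration.

```r
# Data typed in from the printed tables (assumption: observation 12 is the
# outlier with Advertising = 6 and Sales = 50).
advertising <- c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                 11, 6, 13, 11, 13, 7, 8, 8, 12, 9)
sales       <- c(24, 27, 25, 27, 23, 25, 27, 25, 26, 27,
                 25, 50, 26, 23, 26, 23, 23, 24, 26, 24)

# Fit with all 20 weeks and with week 12 excluded
fit_all <- lm(sales ~ advertising)
fit_19  <- lm(sales ~ advertising, subset = seq_along(sales) != 12)

# Hypothetical helper: pull out the slope statistics and R-squared of a fit
summarise_fit <- function(fit) {
  co <- summary(fit)$coefficients
  c(slope     = co["advertising", "Estimate"],
    std_error = co["advertising", "Std. Error"],
    t_value   = co["advertising", "t value"],
    r_squared = summary(fit)$r.squared)
}

comparison <- rbind(all_20_weeks    = summarise_fit(fit_all),
                    without_week_12 = summarise_fit(fit_19))
round(comparison, 4)
```

Each row of `comparison` should match the corresponding `summary()` output shown earlier, which makes the sign change of the slope and the drop in its standard error easy to see at a glance.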