Load all the necessary packages that will be required for this assignment…

library(statsr)
library(dplyr)
library(ggplot2)
library(gdata)
  1. Make the scatter diagram with sales on the vertical axis and advertising on the horizontal axis. What do you expect to find if you would fit a regression line to these data?
  2. Estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
  3. Compute the residuals and draw a histogram of these residuals. What conclusion do you draw from this histogram?
  4. Apparently, the regression result of part (b) is not satisfactory. Once you realize that the large residual corresponds to the week with opening hours during the evening, how would you proceed to get a more satisfactory regression model?
  5. Delete this special week from the sample and use the remaining 19 weeks to estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
  6. Discuss the differences between your findings in parts (b) and (e). Describe in words what you have learned from these results.

The data

# Let's load the data

assign_data <- read.xls("F:/COURSERA COURSES/Econometrics/Week 1/TestExer1-sales-round1.xls", sheet = 1)

Summary of the data

# Let's review the data

summary(assign_data[2:3])
##   Advertising        Sales      
##  Min.   : 6.00   Min.   :23.00  
##  1st Qu.: 8.00   1st Qu.:24.00  
##  Median :11.00   Median :25.00  
##  Mean   :10.25   Mean   :26.30  
##  3rd Qu.:12.00   3rd Qu.:26.25  
##  Max.   :16.00   Max.   :50.00

Both variables in the dataset, the cost of advertising (Advertising) and the sales (Sales), look unremarkable except for a possible outlier in Sales: its maximum (50.00) lies far above its third quartile (26.25), as the summary above shows.
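
One way to make this guess more concrete is Tukey's 1.5 × IQR rule. This is a common rule of thumb and not part of the original assignment. A minimal sketch, using only the quartiles and extremes printed in the summary above (the variable names are illustrative):

```r
# Quartiles and extremes copied from the summary() output above
sales_q1 <- 24.00; sales_q3 <- 26.25; sales_max <- 50.00
adv_q1   <- 8.00;  adv_q3   <- 12.00; adv_max   <- 16.00

# Upper fence of Tukey's rule: Q3 + 1.5 * IQR
sales_upper <- sales_q3 + 1.5 * (sales_q3 - sales_q1)
adv_upper   <- adv_q3   + 1.5 * (adv_q3 - adv_q1)

# TRUE means the maximum lies above the fence and is flagged as a possible outlier
sales_flag <- sales_max > sales_upper
adv_flag   <- adv_max > adv_upper
print(c(Sales = sales_flag, Advertising = adv_flag))
```

Under this rule the maximum of Sales is flagged, while the maximum of Advertising is not.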

We will therefore explore the distribution of the data both visually and with summary statistics. Let’s first create a scatter plot:

p <- ggplot(assign_data, aes(Advertising, Sales))
p + geom_point()

Exercise 1: The plot above clearly shows one outlier in (Sales). To answer the first question of the assignment:

What do you expect to find if you would fit a regression line to these data?

Apart from the outlier, the points rise from left to right, so I expect a rising regression line, that is, a positive slope.

Run Regression in R

We will now run a simple regression to answer the following question:

How much does (Sales) change for every additional unit spent on (Advertising)?

# Parameter for the model
y = assign_data$Sales
x = assign_data$Advertising

# Simple Regression Model 
model <- lm(y ~ x)

# Summary of the model
summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6794 -2.7869 -1.3811  0.6803 22.3206 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.6269     4.8815   6.069 9.78e-06 ***
## x            -0.3246     0.4589  -0.707    0.488    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.836 on 18 degrees of freedom
## Multiple R-squared:  0.02704,    Adjusted R-squared:  -0.02701 
## F-statistic: 0.5002 on 1 and 18 DF,  p-value: 0.4885

Exercise 2: From the model summary above, we can read off the coefficients a and b of the simple regression model with (Sales) as dependent variable and (Advertising) as explanatory factor. To answer the second question of the assignment:

y = a + b * x + Er

Coefficient (`a`) is        :  29.6269
Coefficient (`b`) is        :  -0.3246
Standard Error of (`b`) is  :  0.4589
t-value of (`b`) is         :  -0.707

Also, (`b`) is not significantly different from 0: its p-value is 0.488, which is well above 0.05.
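
The t-value and p-value can be checked by hand from the estimate and its standard error. The sketch below uses only the rounded numbers printed in the summary above; the variable names are illustrative, and the two-sided p-value assumes the usual t-distribution with the residual degrees of freedom.

```r
# Values copied from the summary(model) output above
b    <- -0.3246   # estimated slope
se_b <- 0.4589    # standard error of the slope
df   <- 18        # residual degrees of freedom (20 weeks - 2 coefficients)

# t-value: estimate divided by its standard error
t_b <- b / se_b

# Two-sided p-value from the t-distribution
p_b <- 2 * pt(-abs(t_b), df = df)

print(round(c(t = t_b, p = p_b), 3))
```

Up to rounding, this should reproduce the `t value` and `Pr(>|t|)` entries for `x` in the summary.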

Plot the outcome of the model

par(mfrow=c(2,2))
plot(model)

Plot for residual

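resd <- resid(model)

# Keep the histogram object so the normal curve can be scaled to its counts
h <- hist(resd, breaks="FD", xlab="Residuals", main="Histogram of Residuals")
# Use a separate grid so that the regressor x is not overwritten
grid <- seq(-7, 25, length.out = 200)
# Expected count per bin: n * bin width * normal density
lines(grid, length(resd) * diff(h$breaks)[1] * dnorm(grid, 0, sd(resd)), col="red", lwd=2)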

Exercise 3: The residuals are computed and their histogram is drawn above. The third question of the assignment asks: “What conclusion do you draw from this histogram?”

The residuals are strongly positively skewed. This is due to the outlier in the (Sales) data: one residual (22.32) lies far to the right of all the others.

Exercise 4: The fourth question of the assignment asks: “Apparently, the regression result of part (b), above, is not satisfactory. Once you realize that the large residual corresponds to the week with opening hours during the evening, how would you proceed to get a more satisfactory regression model?”

In this case, I will first remove the outlier and run the regression again to see if there are some significant differences.
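
As a hedged illustration of this idea, the sketch below locates the unusual week by its residual instead of hard-coding a row number, and compares two ways of handling it: dropping the week, or keeping it and adding a dummy variable for it. The data frame `toy` and the column `Evening` are hypothetical and are not the assignment data.

```r
# Hypothetical toy data, NOT the assignment data: week 7 has unusually high sales
toy <- data.frame(
  Advertising = c(6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
  Sales       = c(23, 24, 24, 25, 25, 26, 50, 26, 27, 27)
)

# Locate the week with the largest absolute residual
fit_all     <- lm(Sales ~ Advertising, data = toy)
outlier_row <- unname(which.max(abs(resid(fit_all))))

# Option 1: drop that week and refit
fit_drop <- lm(Sales ~ Advertising, data = toy[-outlier_row, ])

# Option 2: keep all weeks and add a dummy that is 1 only for the special week
toy$Evening <- as.numeric(seq_len(nrow(toy)) == outlier_row)
fit_dummy   <- lm(Sales ~ Advertising + Evening, data = toy)

# Both options give the same intercept and the same slope for Advertising
print(coef(fit_drop))
print(coef(fit_dummy)[c("(Intercept)", "Advertising")])
```

A dummy for a single observation fits that observation exactly, so the remaining coefficients match those obtained by deleting it. The dummy's own coefficient then estimates the extra sales in the special week.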

The steps are shown below.

# Remove outlier from the data and store new data into new variable
assign_data2 <- assign_data[-12, ]

# Plot the Sales vs. Advertising
p <- ggplot(assign_data2, aes(Advertising, Sales))
p + geom_point()

# Parameter for the Regression Model
y = assign_data2$Sales
x = assign_data2$Advertising

# Rebuild Simple Regression Model 
model <- lm(y ~ x)

# Summary of the New model
summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2500 -0.4375  0.0000  0.5000  1.7500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.1250     0.9548  22.124 5.72e-14 ***
## x             0.3750     0.0882   4.252 0.000538 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 17 degrees of freedom
## Multiple R-squared:  0.5154, Adjusted R-squared:  0.4869 
## F-statistic: 18.08 on 1 and 17 DF,  p-value: 0.0005379

Diagnostic plots for the new model:

par(mfrow=c(2,2))
plot(model)

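Histogram of the residuals for the new model:

resd <- resid(model)

# Keep the histogram object so the normal curve can be scaled to its counts
h <- hist(resd, breaks="FD", xlab="Residuals", main="Histogram of Residuals")
# Fine grid over the residual range (-2.25:2.2 yields only 5 points)
grid <- seq(-2.25, 2.25, length.out = 200)
# OLS residuals from a model with an intercept have mean 0, so centre the curve at 0
lines(grid, length(resd) * diff(h$breaks)[1] * dnorm(grid, 0, sd(resd)), col="red", lwd=2)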

Exercise 5: The cleaned data, the plots and the new regression summary above show that the new model fits much better. Its coefficients a and b, again with (Sales) as dependent variable and (Advertising) as explanatory factor, are given below.

y = a + b * x + Er

Coefficient (`a`) is        :  21.1250
Coefficient (`b`) is        :  0.3750
Standard Error of (`b`) is  :  0.0882
t-value of (`b`) is         :  4.252

Also, (`b`) is significantly different from 0 in the new model: its p-value is 0.000538, which is far below 0.05.

Exercise 6: The two models (fitted to the original data and to the data without the outlier) give markedly different results, and the latter is clearly the better one. The slope changes from -0.3246 (not significant) to 0.3750 (significant), the residual standard error falls from 5.836 to 1.054, and R-squared rises from 0.027 to 0.515.

So even a single outlier can badly distort the results, which shows how important data cleaning and preprocessing are.
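
The contrast between the two fits can also be expressed with 95% confidence intervals for b. The assignment does not ask for these, so the following is only a sketch; it uses the rounded estimates from the two summaries above, and the data frame `fits` and its column names are illustrative.

```r
# Slope estimates, standard errors and residual df copied from the two summaries above
fits <- data.frame(
  model = c("all 20 weeks", "19 weeks"),
  b     = c(-0.3246, 0.3750),
  se    = c(0.4589, 0.0882),
  df    = c(18, 17)
)

# 95% confidence interval: b +/- t(0.975, df) * se
crit       <- qt(0.975, df = fits$df)
fits$lower <- fits$b - crit * fits$se
fits$upper <- fits$b + crit * fits$se

# An interval that contains 0 means b is not significant at the 5% level
fits$contains_zero <- fits$lower < 0 & fits$upper > 0
print(fits)
```

The interval for the full sample contains 0, while the interval for the 19-week sample lies entirely above 0, in line with the two t-tests.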

(End of the Assignment)