library(statsr)
library(dplyr)
library(ggplot2)
library(gdata)
# Let's load the data
assign_data <- read.xls("F:/COURSERA COURSES/Econometrics/Week 1/TestExer1-sales-round1.xls", sheet = 1)
# Let's review the data
summary(assign_data[2:3])
##   Advertising        Sales
##  Min.   : 6.00   Min.   :23.00
##  1st Qu.: 8.00   1st Qu.:24.00
##  Median :11.00   Median :25.00
##  Mean   :10.25   Mean   :26.30
##  3rd Qu.:12.00   3rd Qu.:26.25
##  Max.   :16.00   Max.   :50.00
The two variables in the given dataset, the cost of (Advertising) and the (Sales), look well behaved except for a possible outlier in Sales, which is suggested by the Max value in the summary above: 50.00 lies far above the third quartile of 26.25.
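As a rough check of that guess, the usual 1.5 × IQR rule of thumb can be applied to the Sales quartiles printed above. This is only a sketch: the numbers are copied from the summary output, and the variable names are introduced here for illustration.

```r
# Quartiles and maximum of Sales, copied from the summary() output above
q1_sales  <- 24.00
q3_sales  <- 26.25
max_sales <- 50.00

# Upper fence of the 1.5 * IQR rule of thumb
iqr_sales   <- q3_sales - q1_sales
upper_fence <- q3_sales + 1.5 * iqr_sales

# TRUE means the maximum lies beyond the fence and is a candidate outlier
is_candidate_outlier <- max_sales > upper_fence
is_candidate_outlier
```

The rule only flags a candidate; the scatter plot is still needed to see which observation it is.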
We will therefore explore the distribution of the data visually and with summary statistics. Let’s first create a visualization, a scatter plot:
p <- ggplot(assign_data, aes(Advertising, Sales))
p + geom_point()
Exercise: 1. The plot above clearly shows one outlier in (Sales). To answer the first question of the assignment:
What do you expect to find if you would fit a regression line to these data?
I am expecting a rising regression line.
We will now run a simple regression: how much do (Sales) change for every unit spent on (Advertising)?
# Parameter for the model
y = assign_data$Sales
x = assign_data$Advertising
# Simple Regression Model
model <- lm(y ~ x)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.6794 -2.7869 -1.3811  0.6803 22.3206
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  29.6269     4.8815   6.069 9.78e-06 ***
## x            -0.3246     0.4589  -0.707    0.488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.836 on 18 degrees of freedom
## Multiple R-squared: 0.02704, Adjusted R-squared: -0.02701
## F-statistic: 0.5002 on 1 and 18 DF, p-value: 0.4885
Exercise: 2. From the summary of the model above, we can read off the coefficients a and b in the simple regression model with (Sales) as dependent variable and (Advertising) as explanatory factor. To answer the second question of the assignment:
y = a + b * x + Er
Coefficient of (`a`) is : 29.6269
Coefficient of (`b`) is : -0.3246
Standard Error of (`b`) is : 0.4589
t-value of (`b`) is : -0.707
Also, (`b`) is not significantly different from 0 (p-value = 0.488 > 0.05).
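The t-value and p-value reported above can be reproduced by hand from the estimate and its standard error. The sketch below uses only the rounded numbers printed in the summary, so the results agree with the output up to rounding; the variable names are illustrative.

```r
# Values copied from the coefficient table above (rounded as printed)
b      <- -0.3246   # estimated slope
se_b   <- 0.4589    # standard error of the slope
df_res <- 18        # residual degrees of freedom (20 observations - 2 parameters)

# t-statistic for H0: b = 0
t_value <- b / se_b

# Two-sided p-value from the t distribution with 18 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_res)

round(c(t = t_value, p = p_value), 3)
```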
par(mfrow=c(2,2))
plot(model)
resd <- resid(model)
hist(resd, breaks="FD", xlab="Residuals", main="Histogram of Residuals")
x <- -7:25
lines(x, 60*dnorm(x,0,sd(resd)),col="red", lwd=2)
Exercise: 3. Above are the computed residuals and their histogram. To answer the third question of the assignment, “What conclusion do you draw from this histogram?”:
In this case, the residuals are highly positively skewed, which is evidently due to the outlier in the (Sales) data: the largest residual is 22.3206, while all the others lie between -4.6794 and roughly 1.
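The skewness seen in the histogram can also be put into a number. The sketch below is not run on the assignment file; it uses hypothetical data (20 made-up weeks with one extreme Sales value in row 12) and a moment-based skewness formula, purely to illustrate how a single large residual produces strong positive skew.

```r
# Hypothetical data, NOT the assignment file: 20 weeks, one extreme Sales value
advertising <- c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                 8, 6, 10, 8, 13, 12, 7, 11, 14, 9)
noise <- rep(c(-0.5, 0.5), times = 10)       # small deterministic disturbance
sales <- 21 + 0.4 * advertising + noise
sales[12] <- 50                              # the single outlier

fit <- lm(sales ~ advertising)
res <- resid(fit)

# Moment-based sample skewness: third central moment over variance^(3/2)
centered <- res - mean(res)
skewness <- mean(centered^3) / mean(centered^2)^(3 / 2)
skewness
```

A value well above zero indicates a long right tail, which is what the histogram shows.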
Exercise 4: Answer to the fourth question of the assignment, “Apparently, the regression result of part (b), above, is not satisfactory. Once you realize that the large residual corresponds to the week with opening hours during the evening, how would you proceed to get a more satisfactory regression model?”:
In this case, I will first remove the outlier and run the regression again to see whether the results differ significantly.
The process is shown below.
# Remove outlier from the data and store new data into new variable
assign_data2 <- assign_data[-12, ]
# Plot the Sales vs. Advertising
p <- ggplot(assign_data2, aes(Advertising, Sales))
p + geom_point()
# Parameter for the Regression Model
y = assign_data2$Sales
x = assign_data2$Advertising
# Rebuild Simple Regression Model
model <- lm(y ~ x)
# Summary of the New model
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.2500 -0.4375  0.0000  0.5000  1.7500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  21.1250     0.9548  22.124 5.72e-14 ***
## x             0.3750     0.0882   4.252 0.000538 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 17 degrees of freedom
## Multiple R-squared: 0.5154, Adjusted R-squared: 0.4869
## F-statistic: 18.08 on 1 and 17 DF, p-value: 0.0005379
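The row index 12 used above is hard-coded. One way to avoid that is to locate the observation with the largest absolute residual in the first model and drop it. The sketch below shows the idea on hypothetical data (the data frame `sales_data` and its values are made up for illustration, not taken from the assignment file).

```r
# Hypothetical data, NOT the assignment file
sales_data <- data.frame(
  Advertising = c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                  8, 6, 10, 8, 13, 12, 7, 11, 14, 9)
)
sales_data$Sales <- 21 + 0.4 * sales_data$Advertising + rep(c(-0.5, 0.5), times = 10)
sales_data$Sales[12] <- 50   # one extreme week

# Fit on all rows, then find the row with the largest absolute residual
fit_all     <- lm(Sales ~ Advertising, data = sales_data)
outlier_row <- unname(which.max(abs(resid(fit_all))))

# Refit without that row
fit_clean <- lm(Sales ~ Advertising, data = sales_data[-outlier_row, ])

outlier_row
coef(fit_clean)
```

Dropping the largest residual automatically is only reasonable when, as here, there is an external explanation for the unusual week.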
Diagnostic plots for the new model:
par(mfrow=c(2,2))
plot(model)
And the histogram of the residuals:
resd <- resid(model)
hist(resd, breaks="FD", xlab="Residuals", main="Histogram of Residuals")
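# Use a fine grid so the overlaid density curve is smooth (-2.25:2.2 yields only 5 points)
x <- seq(-2.25, 2.25, by = 0.05)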
lines(x, 16*dnorm(x,-0.1,sd(resd)),col="red", lwd=2)
Exercise (Answer no. 5): From the data cleaning, plots and new regression model above, we can infer that the new model gives better results, so we recompute the coefficients a and b in the simple regression model with (Sales) as dependent variable and (Advertising) as explanatory factor.
y = a + b * x + Er
Coefficient of (`a`) is : 21.1250
Coefficient of (`b`) is : 0.3750
Standard Error of (`b`) is : 0.0882
t-value of (`b`) is : 4.252
Also, (`b`) is significantly different from 0 (p-value < 0.05) in the new model.
Exercise (Answer no. 6): The major lesson from the two models (one based on the original data, one fitted after removing the outlier) is that their results differ significantly from each other, the latter being the better one: the slope changes from -0.3246 (not significant) to 0.3750 (significant), and the multiple R-squared rises from 0.02704 to 0.5154.
So even a single outlier in the data can distort the results, which shows how important data cleaning and processing are.
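An alternative to deleting the week with evening opening hours is to keep all observations and add a dummy variable for that week. With a dummy that equals 1 for a single observation, the intercept and slope estimates are the same as when that observation is dropped. The sketch below demonstrates this on hypothetical data; the column name `Evening` and all values are assumptions made for illustration.

```r
# Hypothetical data, NOT the assignment file
sales_data <- data.frame(
  Advertising = c(12, 12, 9, 11, 6, 9, 15, 6, 11, 16,
                  8, 6, 10, 8, 13, 12, 7, 11, 14, 9)
)
sales_data$Sales <- 21 + 0.4 * sales_data$Advertising + rep(c(-0.5, 0.5), times = 10)
sales_data$Sales[12] <- 50   # week assumed to have evening opening hours

# Dummy: 1 for the special week, 0 otherwise
sales_data$Evening <- as.integer(seq_len(nrow(sales_data)) == 12)

fit_dummy <- lm(Sales ~ Advertising + Evening, data = sales_data)
fit_drop  <- lm(Sales ~ Advertising, data = sales_data[-12, ])

# The Advertising slope and the intercept agree between the two approaches
rbind(dummy = coef(fit_dummy)[c("(Intercept)", "Advertising")],
      drop  = coef(fit_drop)[c("(Intercept)", "Advertising")])
```

The dummy coefficient then estimates how much higher Sales were in the special week than the regression line would predict.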
(End of the Assignment)