Advertising - A Linear Regression

Introduction

The dataset contains statistics about the sales of a product in 200 different markets, together with advertising budgets in each of these markets for different media channels: TV, radio and newspaper. The sales are in thousands of units and the budget is in thousands of dollars.
We’ll have here a deeper look at the data and what it means to apply a regression model to it.

data <- readr::read_csv("advertising.csv", col_types = "idddd")
#> Warning: Missing column names filled in: 'X1' [1]
dim <- dim(data)
knitr::kable(head(data), caption = paste("Advertising Data -- Rows:", dim[1], " / Columns:", dim[2]))

Advertising Data – Rows: 200 / Columns: 5
X1	TV	radio	newspaper	sales
1	230.1	37.8	69.2	22.1
2	44.5	39.3	45.1	10.4
3	17.2	45.9	69.3	9.3
4	151.5	41.3	58.5	18.5
5	180.8	10.8	58.4	12.9
6	8.7	48.9	75.0	7.2

Look at: TV and Sales

This is not yet the big picture, but let’s first see how the TV spendings and the Sales numbers look like. In the graph, you’ll see a straight-forward fit of a linear regression to it - very basic.

fit1 <- lm(sales ~ TV, data = data)

require(ggplot2)
require(ggthemes)
ggplot1 <- ggplot() +
        # geom_smooth(data = data, aes(x = TV, y = sales), method = "lm", se = FALSE, color = "lightgrey") +
        geom_point(aes(x = data$TV, y = fit1$fitted.values), shape = 1, alpha = 0.2) +
        geom_segment(aes(x = data$TV, xend = data$TV, y = fit1$fitted.values, yend = data$sales)) +
        geom_point(data = data, aes(x = TV, y = sales), color = "red") +
        theme_tufte()
        
ggplot1

True, the code for this ggplot is not the best layered code :) - but we can nicely see that the linear fit tries to minimize the residuals. The linear fit being very simple here: we approximate the relationship between X and Y with

\[Y \approx \beta_0 + \beta_1 \cdot X.\] For this little plot, we had the following: \[sales \approx \beta_0 + \beta_1 \cdot TV.\]

What $\beta_0$ and $\beta_1$ do we have? Let’s see:

knitr::kable(summary(fit1)$coef)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	7.0325935	0.4578429	15.36028	0
TV	0.0475366	0.0026906	17.66763	0

We definitely can reject the null hypothesis saying that there is no relationship between X and Y, because there is a relationship (it’s a matter of a small p-value). So we conclude that $\beta_0 \neq 0$ and $\beta_1 \neq 0$. Do we have more graphs representing this fitted model?

par(mfrow=c(2,2))
plot(fit1)

In this case, the first two plots are of high interest. We clearly see some evidence that there is more going on that just an additive assumption of the model - which we didn’t cover yet. Well, we only looked at one variable. Before we move on to the whole picture (=all variables), let’s first have a look at the relationship between newspaper and sales.

Look at: Newspaper and Sales

We’ll go through here in a fast pace. Important in this chaper is to note the coefficient for the newspaper in the simple linear regression:

fit2 <- lm(sales ~ newspaper, data = data)
knitr::kable(summary(fit2)$coef)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	12.3514071	0.6214202	19.876096	0.0000000
newspaper	0.0546931	0.0165757	3.299591	0.0011482

From here, take away the hint that while looking only at newspaper and sales, we have a positive coefficient ($\beta_1 = 0.055$) for the newspaper. In other terms: by placing $1’000 in newspaper advertising, this model suggests that we get an average increase in sales by around 55 units. If this is true, we’ll see in the next chapter. Hint: always look at the big picture !!

Linear Regression with all variables

Let’s get all the variables into the play. The first fit we are trying is a simple linear regression fit:

\[sales = \beta_0 + \beta_1 \cdot TV + \beta_2 \cdot radio + \beta_3 \cdot newspaper + \epsilon\], whereas $\epsilon$ is the error term which is being minimized by this regression. Also, I can write $=$ and not $\neq$ when I include the $\epsilon$-term.

data <- data[, -1] # the X1 variable is merely a counting variable
fit3 <- lm(sales ~ ., data = data)
knitr::kable(summary(fit3)$coef)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	2.9388894	0.3119082	9.4222884	0.0000000
TV	0.0457646	0.0013949	32.8086244	0.0000000
radio	0.1885300	0.0086112	21.8934961	0.0000000
newspaper	-0.0010375	0.0058710	-0.1767146	0.8599151

But what happend with the coefficient for the newspaper? It went from 0.055 (see model fit2) to -0.001, so from positive to negative. Do the models imply the opposite? Let’s dive into that.

Correlation

The correlation of the variables should give us some hints to the answer

pairs(data) # with a plot

# it looks nice...but let's stick with the numbers
knitr::kable(cor(data)) # with numbers

	TV	radio	newspaper	sales
TV	1.0000000	0.0548087	0.0566479	0.7822244
radio	0.0548087	1.0000000	0.3541038	0.5762226
newspaper	0.0566479	0.3541038	1.0000000	0.2282990
sales	0.7822244	0.5762226	0.2282990	1.0000000

Let’s stick with the numbers and represent this nicely, before we move on:

require(corrplot)
#> Loading required package: corrplot
#> Warning: package 'corrplot' was built under R version 3.3.3
cordata <- cor(data)
corrplot.mixed(cordata)

That there is a notable correlation between sales and the variables - well - that’s clear. That’s why we are modeling this data and trying to predict the sales figures.
The more interesting part here is that there is a notable correlation between radio and newspaper of 0.35. That means that in markets where we spend on radio advertisement, we automatically spend also money on newspaper ads. In our single linear regression model with only newspaper and sales, we saw that by increasing newspaper, we increase the sales numbers, too. But here is the crazy part: when we increase newspaper ads, we also increase radio ads. Therefore, it doesn’t necessarly mean that an increase in newspaper ads results in higher sales numbers.
That’s why we were looking at the multiple regression model. We saw here that newspaper almost didn’t had any effect on the sales numbers ($\beta \approx -0.001$). The money spend on TV and radio are the main drivers.
In the single regression model, the newspaper detected the increase in sales number, but only by its correlation to the radio.

Extensions of the Linear Model

We mentioned at the beginning that the data suggests, that there is more than just an additive relationship between predictors and the response. We saw that in the plot of the residuls versus the fitted values. Let’s look at it again, but this time for all multi-variable model (fit3)

plot(fit3, which=c(1,1))

A perfect fitted model would have its red line horizontal at zero - meaning that the residuals are randomly distributed over the fitted values and therefore our model would cover the characteristics of the data. So let’s include the interaction effects in a new model:

fit4 <- lm(sales ~ (TV + radio + newspaper)^2, data = data)
knitr::kable(summary(fit4)$coef) # why are there colors in the table?? strange...

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	6.4601585	0.3175706	20.3424348	0.0000000
TV	0.0203271	0.0016090	12.6333046	0.0000000
radio	0.0229292	0.0114149	2.0087043	0.0459622
newspaper	0.0170339	0.0100717	1.6912706	0.0923992
TV:radio	0.0011393	0.0000572	19.9296779	0.0000000
TV:newspaper	-0.0000797	0.0000358	-2.2271949	0.0270898
radio:newspaper	-0.0001096	0.0002363	-0.4638306	0.6432918

First of all, the hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-value associated with their coefficients are not significant.
That having said and looking at the interaction terms, we clearly see the main driver between them: $TV \cdot radio$. His very low p-value indicates that the true relationship of the data is indeed not just additive, but has strong interaction effects.

As we see in the plot below, the residuals still have a characteristic which we don’t yet cover with the current model. So there is still work to be done to improve on the model. However, by including the interaction terms, we reduced the residuals in its value: compare the numbers of the above plot and the plot below and you see that the residuals of fit4 are now about half of fit3. So we imporved the model quite significantly.

plot(fit4, which=c(1,1))

# par(mfrow = c(2,2))
# plot(fit4)


# anova(fit3, fit4)

Model Accuracy

Now we covered which terms should be involved in the model. What about the overall accuracy of the model?
For that, let’s have a comparison between three models:
- fit1 which only included one variable
- fit3 which included all variables but no interaction terms
- fit4 which included all variables and all interaction terms

or:

\[ \begin{aligned} fit1: sales &\approx \beta_0 + \beta_1 \cdot TV \\ fit3: sales &\approx \beta_0 + \beta_1 \cdot TV + \beta_2 \cdot radio + \beta_3 \cdot newspaper \\ fit4: sales &\approx \beta_0 + \beta_1 \cdot TV + \beta_2 \cdot radio + \beta_3 \cdot newspaper + ... \\ ...&\beta_4 \cdot (TV \cdot radio) + \beta_4 \cdot (TV \cdot newspaper) + \beta_5 \cdot (radio \cdot newspaper) \end{aligned} \]

RSE <- c(summary(fit1)$sigma, summary(fit3)$sigma, summary(fit4)$sigma)
Rsquared <- c(summary(fit1)$r.squared, summary(fit3)$r.squared, summary(fit4)$r.squared)
overview <- data.frame(RSE, Rsquared)
rownames(overview) <- c("fit1", "fit3", "fit4")

knitr::kable(t(overview))

	fit1	fit3	fit4
RSE	3.2586564	1.6855104	0.9383316
Rsquared	0.6118751	0.8972106	0.9686311

RSE

The Residual Standard Error (RSE) is an estimate of the standard deviation of $\epsilon$ - it’s an absolute measure. Roughly speaking, it is the average amount that the response will deviate from the regression line - or in other words: it is a measure of lack of fit of the model. Looking at the table, we see a clear decrease of the RSE value going from fit1 to fit4. Still, the model fit4 will deviate approximately by 938 units ($) on average - but significantly better than model fit1 with 3’249 units ($).

$R^2$

The $R^2$ gives an alternative measure to RSE. It is a proportional measure of the variance explained. The closer to 1 it is, the more variance can be explained only by the model (that helps to compare different models). In our case, we went from 61% variance explained by fit1, to 97% variance explained witht he model fit4. Ain’t bad at all.

Conclusion

We achieved with model fit4 a very good regression fit. How good? The model fit4 explains 97% of the variance given by the data. But the resulting output still will have on average a deviation of around 938 units.