Data Source

The Advertising data set consists of the sales (in thousands of units) of a product in 200 different markets, along with the advertising budgets (in thousands of dollars) for the product in each of those markets for three different media: TV, radio and newspaper.

Reading the data

# Read the CSV file 
data = read.csv("Advertising.csv", row.names = 1)

# View the structure of the data
str(data)
## 'data.frame':    200 obs. of  4 variables:
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
# View first few rows
head(data)
##      TV radio newspaper sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2

Objective

To find the relationship between the advertising budgets for the three media and sales, and to develop a model to predict sales on the basis of the three media budgets.

Questions to address

  1. Is there a relationship between advertising budget and sales?
  2. How strong is the relationship between advertising budget and sales?
  3. Which media contribute to sales?
  4. How accurately can we estimate the effect of each medium on sales?
  5. How accurately can we predict future sales?
  6. Is the relationship linear?

Plot displaying sales as a function of TV budget

plot(data$TV, data$sales,
     main = "Sales vs TV Advertising Budget",
     xlab = "TV Budget (in $1000)",
     ylab = "Sales (in 1000s of units)")

Simple Linear Regression

Simple linear regression is an approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is approximately a linear relationship between X and Y.

Mathematically, we can write this linear relationship as \[ Y ≈ β_0 + β_1 X. \tag{1} \]

For example, X may represent TV advertising and Y may represent sales. Then we can regress sales onto TV by fitting the model

\[ sales ≈ β_0 + β_1 × TV. \] \(β_0\) and \(β_1\) are two unknown constants that represent the intercept and slope terms in the linear model. Together, \(β_0\) and \(β_1\) are known as the model coefficients or parameters. Using the training data we can produce estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the model coefficients and predict future sales on the basis of a particular value of TV advertising by computing

\[ \hat{y} =\hat{\beta}_0 + \hat{\beta}_1 x \tag{2} \] where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of \(X = x\).

Estimating the Coefficients Using the Least Squares Method

In order to find the intercept \(\hat{\beta}_0\) and slope \(\hat{\beta}_1\) such that the linear model \(y_i ≈ \hat{\beta}_0 + \hat{\beta}_1 x_i\) fits the available data (n = 200) as closely as possible, the least squares criterion is used.

Let \[e_i = y_i − \hat{y}_i\] represent the \(i^{th}\) residual: the difference between the \(i^{th}\) observed response value and the \(i^{th}\) response value predicted by our linear model. The residual sum of squares (RSS) is defined as

\[RSS = e_1^2 + e_2^2 + \cdots + e_n^2,\]

or, equivalently,

\[ RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \tag{3} \] The least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS.
Using some calculus, one can show that the minimizers are:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \tag{4} \]

where \(\bar{y} \equiv \frac{1}{n} \sum_{i=1}^n y_i\) and \(\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i\) are the sample means.
Equation (4) defines the least squares coefficient estimates for simple linear regression.
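
As a sanity check, the estimates in (4) can be computed directly in R and compared with the coefficients returned by lm() below (a minimal sketch using the data frame loaded earlier):

# Compute the least squares estimates of equation (4) by hand
x = data$TV
y = data$sales
beta1_hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat = mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)  # should match coef(lm(sales ~ TV, data = data))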

# Fit the linear regression model: Sales ~ TV
model = lm(sales ~ TV, data = data)

# Display the summary of the model
summary(model)
## 
## Call:
## lm(formula = sales ~ TV, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
coef(model)
## (Intercept)          TV 
##  7.03259355  0.04753664

According to the simple regression fit, with \(\hat{\beta}_0 = 7.03\) and \(\hat{\beta}_1 = 0.0475\), an additional $1,000 spent on TV advertising is associated with selling approximately 47.5 additional units of the product.
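
To see where 47.5 comes from, recall that TV is measured in thousands of dollars and sales in thousands of units, so a $1,000 (one-unit) increase in the TV budget changes predicted sales by

\[ \hat{\beta}_1 \times 1 = 0.0475 \text{ thousand units} = 47.5 \text{ units}. \]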

Assessing the Accuracy of the Coefficient Estimates and Testing the Accuracy of Our Model

The true relationship between X and Y is assumed to be

\[ Y= β_0 + β_1 X + \epsilon \tag{5} \]

plot(data$TV, data$sales,
     main = "Sales vs TV Advertising Budget",
     xlab = "TV Budget (in $1000)",
     ylab = "Sales (in 1000s of units)")
abline(model)

Hypothesis Testing

Standard errors can be used to perform hypothesis tests on the regression coefficients. The most common hypothesis test would be testing the null hypothesis versus the alternative hypothesis.

Null hypothesis \(H_0 :\) There is no relationship between \(X\) and \(Y\)

Alternative hypothesis \(H_a :\) There is some relationship between \(X\) and \(Y\)

Mathematically, this corresponds to testing

\[ H_0 : \beta_1 = 0 \]

versus

\[ H_a : \beta_1 \ne 0 \]

If \(\beta_1 = 0\), then model (5) reduces to \(Y = \beta_0 + \epsilon\), and \(X\) is not associated with \(Y\).

To test the null hypothesis, we need to determine whether \(\hat{\beta}_1\), our estimate for \(\beta_1\), is sufficiently far from zero that we can be confident that \(\beta_1\) is non-zero. How far is far enough? This of course depends on the accuracy of \(\hat{\beta}_1\), that is, on \(\text{SE}(\hat{\beta}_1)\). If \(\text{SE}(\hat{\beta}_1)\) is small, then even relatively small values of \(\hat{\beta}_1\) may provide strong evidence that \(\beta_1 \ne 0\), and hence that there is a relationship between \(X\) and \(Y\). In practice, we compute a t-statistic:

\[ t = \frac{\hat{\beta}_1-0}{SE(\hat{\beta}_1)} \]
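
The t-value reported by summary() can be reproduced from the estimate and its standard error (a quick check against the summary table above):

# Recompute the t-statistic for the TV slope from the coefficient table
coefs = summary(model)$coefficients
coefs["TV", "Estimate"] / coefs["TV", "Std. Error"]  # approx. 17.67, as reported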

Let’s break down this summary output:

Residuals: Residuals are the differences between the actual and estimated values. The residuals section of the model output breaks them down into five summary points (minimum, 1Q, median, 3Q and maximum). The distribution of our residuals should ideally be symmetrical.

Coefficients: The coefficients \(\hat{\beta}_0\) and \(\hat{\beta}_1\) represent the intercept and slope respectively.

The coefficient standard error: Measures the average amount that the coefficient estimate varies from the true value of the coefficient. Ideally it should be low.

The coefficient t-value: Measures how many standard errors our coefficient estimate is from 0. A large absolute t-value provides evidence against the null hypothesis and indicates that a relationship exists between the predictor and the response variable.

Predictors with low t-statistics can be dropped. As a rule of thumb, the absolute t-value should be greater than 1.96 for the p-value to be less than 0.05.

The coefficient p-value, Pr(>|t|): Represents the probability of observing a value at least as extreme as t under the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis. Typically, a p-value of 5% or less is a good cut-off point. Note the significance codes ('Signif. codes') associated with each estimate in our example.

Three asterisks represent a highly significant p-value. Since the relationship between sales and TV advertising is highly significant, we can reject our null hypothesis.

Residual standard error: This measures the quality of our regression fit. It is the average amount the sales variable will vary from the true regression line.

For our example, the RSE is 3.259, which means that actual sales in each market deviate from the true regression line by approximately 3,259 units, on average.

\[ RSE = \sqrt{\frac{RSS}{n - 2}} \]
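
This can be verified directly from the residuals of the fitted model:

# Verify the residual standard error: sqrt(RSS / (n - 2))
rss = sum(residuals(model)^2)
n = nrow(data)
sqrt(rss / (n - 2))  # approx. 3.259, matching summary(model)$sigma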

Multiple R-squared: Besides the t-statistic and p-value, this is our most important metric for measuring regression model fit. \(R^2\) measures the strength of the linear relationship between our predictor variable (TV advertising) and our response / target variable (sales). It always lies between 0 and 1.

\[ R^2 = 1 - \frac{RSS}{TSS}, \] where \[ \text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \] is the total sum of squares.

A number near 0 represents a regression that does not explain the variance in the response variable well, while a number close to 1 indicates that the model explains most of the observed variance in the response variable.
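
\(R^2\) can likewise be computed by hand from RSS and TSS:

# Verify R-squared: 1 - RSS/TSS
rss = sum(residuals(model)^2)
tss = sum((data$sales - mean(data$sales))^2)
1 - rss / tss  # approx. 0.6119, matching the summary output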

In our example the \(R^2\) value is 0.6119, which means that about 61% (just under two-thirds) of the variability in sales is explained by TV advertising.

In our example, the adjusted \(R^2\) (which adjusts for degrees of freedom) is 0.6099, so about 61% of the variance in sales is explained by TV advertising after adjusting for degrees of freedom. If we perform a multiple regression, we will find that the multiple \(R^2\) increases as predictor variables are added, whereas the adjusted \(R^2\) penalizes predictors that do not improve the fit.

The further our F-statistic is from 1, the better our regression model. In our example, the F-statistic is 312.1, which is much larger than 1 given the size of our data set (200 observations). The F-statistic is more relevant in a multiple regression model.
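
For a simple regression the F-statistic can be reconstructed from RSS and TSS (with p = 1 predictor; it also equals the square of the slope's t-value, 17.67² ≈ 312):

# Verify the F-statistic: ((TSS - RSS)/p) / (RSS/(n - p - 1))
rss = sum(residuals(model)^2)
tss = sum((data$sales - mean(data$sales))^2)
n = nrow(data); p = 1
((tss - rss) / p) / (rss / (n - p - 1))  # approx. 312.1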

plot(data$TV, data$sales, main = "Sales vs TV", xlab = "TV", ylab = "sales")

abline(model, col = "blue", lwd = 2)

Multiple Linear Regression

Multiple regression is used when there are multiple independent variables that may each be related to the dependent variable. We use TV, radio and newspaper to build a linear regression model that can predict sales.

The multiple regression model for sales using the three predictors is

\[ \text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \varepsilon \]

We use a least squares approach similar to that of simple linear regression: we choose \(\beta_0, \beta_1, \ldots, \beta_p\) to minimize the sum of squared residuals: \[ \begin{aligned} \text{RSS} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\ &= \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip} \right)^2 \end{aligned} \]

model2 = lm(sales ~ TV + radio + newspaper, data = data)
summary(model2)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Multiple regression without newspaper

Since the newspaper coefficient is not statistically significant (p = 0.86) in the full model, we refit the model without it.

model3 = lm(sales ~ TV + radio, data = data)
summary(model3)
## 
## Call:
## lm(formula = sales ~ TV + radio, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

Predicting sales on new inputs

# New input for prediction
new_input = data.frame(TV = 150, radio = 20, newspaper = 15)

# Predict sales
predicted_sale = predict(model2, newdata = new_input)

# View prediction
print(predicted_sale)
##        1 
## 13.55862

The predicted sales would be approximately 13.56 thousand units.
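
To quantify the uncertainty of this prediction, predict() can also return a confidence interval for the average sales at these budgets or a (wider) prediction interval for an individual market:

# 95% confidence interval for the mean response
predict(model2, newdata = new_input, interval = "confidence")

# 95% prediction interval for an individual market (wider)
predict(model2, newdata = new_input, interval = "prediction")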

# 95% confidence intervals for the coefficients
confint(model2)
##                   2.5 %     97.5 %
## (Intercept)  2.32376228 3.55401646
## TV           0.04301371 0.04851558
## radio        0.17154745 0.20551259
## newspaper   -0.01261595 0.01054097

With 95% confidence, we can say that increasing TV or radio advertising is associated with increased sales, while newspaper advertising does not show a statistically significant effect in this model (its interval contains zero).

library(corrplot)
## corrplot 0.95 loaded
corrplot(cor(data), method = "number")

pairs(data,
      labels = c("TV", "Radio", "Newspaper", "Sales"),
      main = "Pairwise Scatterplot Matrix - Advertising Dataset")

Assumptions of Linear Regression

  1. The relationship between predictor and dependent variable is linear.

  2. The errors (residuals) are normally distributed with mean zero and common variance (homoscedasticity). The errors are also statistically independent of each other.

  3. Predictors are linearly independent of each other.

# Diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
plot(model2)
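
Assumption 3 (no strong collinearity among the predictors) can be checked with variance inflation factors; a sketch assuming the car package is installed (the correlation matrix above tells a similar story):

# Variance inflation factors: values near 1 indicate little collinearity
library(car)
vif(model2)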

The multiple regression equation is as follows: \[ \text{sales} = 2.94 + 0.046\,\text{TV} + 0.189\,\text{radio} - 0.001\,\text{newspaper} + \epsilon \]

The model is significant based on these statistics.

These results mean that advertising on TV and radio contributes the most to sales, while newspaper advertising has little effect on sales.

Based on these findings, it is recommended that the marketer or business owner allocate more budget to TV and radio advertisements rather than newspaper.

Marketing plan

  1. Is there a relationship between advertising budget and sales?

    The p-value of the F-statistic is far below 0.05, indicating clear evidence of a linear relationship between advertising and sales.

  2. How strong is the relationship between advertising budget and sales?

    The RSE is 1.69 units while the mean value of the response is 14.022, indicating a percentage error of roughly 12%.

    The \(R^2\) statistic indicates that the predictors explain almost 90% of the variance in sales.

  3. Which media contribute to sales?

    TV and radio are significantly related to sales; newspaper is not, once TV and radio are accounted for.

  4. How accurately can we estimate the effect of each medium on sales?

    By constructing confidence intervals for each coefficient (confint(model2) above). Separate simple linear regressions on each predictor show strong associations between TV and sales and between radio and sales, and only a mild association between newspaper and sales when the values of TV and radio are ignored (see the sketch after this list).

  5. How accurately can we predict future sales?

    Using prediction intervals from predict() for individual responses, or confidence intervals for the average response, as illustrated in the prediction section above.

  6. Is the relationship linear?

    Yes; the residual plots from plot(model2) indicate that the relationship is approximately linear.
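
As referenced in question 4, the per-medium associations can be examined with separate simple regressions (a minimal sketch on the same data):

# One simple linear regression per medium
coef(summary(lm(sales ~ TV, data = data)))
coef(summary(lm(sales ~ radio, data = data)))
coef(summary(lm(sales ~ newspaper, data = data)))
# newspaper looks significant on its own, but not once TV and radio are included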