Sameer Mathur
# Read the data
advertising.df <- read.csv("AdvertisingData.csv")
library(car)
some(advertising.df)
       TV Radio Newspaper Sales
6     8.7  48.9      75.0   7.2
10  199.8   2.6      21.2  10.6
21  218.4  27.7      53.4  18.0
79    5.4  29.9       9.4   5.3
104 187.9  17.2      17.9  14.7
106 137.9  46.4      59.0  19.2
113 175.7  15.4       2.4  14.1
140 184.9  43.9       1.7  20.7
188 191.1  28.7      18.2  17.3
190  18.7  12.1      23.4   6.7
# summarize the data
attach(advertising.df)
library(psych)
describe(advertising.df)[,1:9]
          vars   n   mean    sd median trimmed    mad min   max
TV           1 200 147.04 85.85 149.75  147.20 108.82 0.7 296.4
Radio        2 200  23.26 14.85  22.90   23.00  19.79 0.0  49.6
Newspaper    3 200  30.55 21.78  25.75   28.41  23.13 0.3 114.0
Sales        4 200  14.02  5.22  12.90   13.78   4.82 1.6  27.0
# checking data types of the data fields
str(advertising.df)
'data.frame': 200 obs. of 4 variables:
$ TV : num 230.1 44.5 17.2 151.5 180.8 ...
$ Radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
$ Newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
$ Sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
# mean and standard deviation of spending on different promotion strategies
sapply(advertising.df[c("Sales", "TV", "Radio", "Newspaper")], function(x)(c(mean=mean(x),sd=sd(x))))
         Sales        TV    Radio Newspaper
mean 14.022500 147.04250 23.26400  30.55400
sd    5.217457  85.85424 14.84681  21.77862
Whether advertising and sales are related at all can be assessed by fitting a multiple regression model of Sales onto TV, Radio, and Newspaper as follows:
\( Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper + \epsilon \) and testing the null hypothesis \( H_0 : \beta_1 = \beta_2 = \beta_3 = 0 \), that is, that none of the three slope coefficients differs from zero.
Model1 <- Sales ~ TV + Radio + Newspaper
fit1 <- lm(Model1, data = advertising.df)
summary(fit1)
Call:
lm(formula = Model1, data = advertising.df)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
Radio        0.188530   0.008611  21.893   <2e-16 ***
Newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
The F-statistic can be used to determine whether or not we should reject this null hypothesis. In this case the p-value corresponding to the F-statistic in the output above is very low (below 2.2e-16), indicating clear evidence of a relationship between advertising and sales.
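As a rough check on the Model 1 output, the overall F-statistic can be recovered from \( R^2 \) through the standard identity \( F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)} \). The sketch below plugs in the rounded values printed by summary(fit1); the variable names are illustrative, and because \( R^2 \) is rounded to four digits the result only approximately matches the reported 570.3.

```r
# Values copied from the summary(fit1) output (rounded as printed)
r2 <- 0.8972   # Multiple R-squared
n  <- 200      # number of observations
p  <- 3        # number of predictors: TV, Radio, Newspaper

# Overall F-statistic for H0: all slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability under the F distribution with (p, n - p - 1) df
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

With the fitted model at hand, the same quantities are available directly from summary(fit1)$fstatistic.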
# regress `Sales` on `TV`
ModelTV <- Sales ~ TV
fitTV <- lm(ModelTV, data = advertising.df)
summary(fitTV)
Call:
lm(formula = ModelTV, data = advertising.df)

Residuals:
    Min      1Q  Median      3Q     Max
-8.3860 -1.9545 -0.1913  2.0671  7.2124

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
Once we have rejected the null hypothesis \( H_0 \): there is no relationship between X and Y, in favor of the alternative hypothesis \( H_a \): there is some relationship between X and Y, it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the \( R^2 \) statistic.
# R-squared
summary(fitTV)$r.squared
[1] 0.6118751
# F-statistic
summary(fitTV)$fstatistic
  value   numdf   dendf
312.145   1.000 198.000
The output above gives more information about the least squares model for the regression of number of units sold on TV advertising budget in the Advertising data.
The summary of Model 1, shown earlier, gives the same information for the regression of number of units sold on TV, Radio, and Newspaper advertising budgets.
First, the RSE estimates the standard deviation of the error term, that is, the typical amount by which the response deviates from the population regression line.
For Model 1, the RSE is 1.686 while the mean value of the response is 14.022, indicating a percentage error of roughly 12%. If Sales is recorded in thousands of units, this corresponds to about 1,686 units against a mean of about 14,022 units.
Second, the \( R^2 \) statistic records the percentage of variability in the response that is explained by the predictors.
In Model 1 the predictors explain almost 90% of the variance in sales (\( R^2 = 0.8972 \)). The RSE and \( R^2 \) statistics are displayed above.
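The percentage error is simply the RSE divided by the mean response. A minimal sketch of that arithmetic, using the rounded figures printed in the Model 1 summary and in the table of means (the variable names are illustrative):

```r
# Figures copied from the output above (rounded as printed)
rse        <- 1.686     # residual standard error of Model 1
mean_sales <- 14.0225   # mean of Sales
r2         <- 0.8972    # Multiple R-squared of Model 1

# RSE relative to the average response
pct_error <- rse / mean_sales

# Share of the variance in Sales left unexplained by the predictors
unexplained <- 1 - r2

print(round(100 * pct_error, 1))
print(round(100 * unexplained, 1))
```

With the fitted model in the session, summary(fit1)$sigma and summary(fit1)$r.squared return the unrounded RSE and \( R^2 \).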
To determine which media are related to sales, we can examine the p-values associated with each predictor's t-statistic in the multiple linear regression (Model 1). They are extracted below.
# p-values of Model 1
summary(fit1)$coefficients[,4]
 (Intercept)           TV        Radio    Newspaper
1.267295e-17 1.509960e-81 1.505339e-54 8.599151e-01
The p-values for TV and Radio are very low, but the p-value for Newspaper (0.86) is not. This suggests that only TV and Radio are related to sales once the other media are accounted for.
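Each of these p-values comes from a t-statistic, the estimate divided by its standard error, compared against a t distribution with the residual degrees of freedom. The sketch below reproduces them from the rounded entries of the Model 1 coefficient table; the variable names are illustrative, and small differences from the printed values are due to rounding of the inputs.

```r
# Estimates and standard errors copied from the Model 1 coefficient table
est <- c(TV = 0.045765, Radio = 0.188530, Newspaper = -0.001037)
se  <- c(TV = 0.001395, Radio = 0.008611, Newspaper = 0.005871)
df_resid <- 196   # residual degrees of freedom (200 - 3 - 1)

# t-statistic for H0: beta_j = 0
t_stat <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

print(round(t_stat, 3))
print(signif(p_val, 3))
```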
The standard error of \( \hat{\beta}_j \) can be used to construct confidence intervals for \( \beta_j \). For the Advertising data, the 95% confidence intervals are as follows:
# confidence interval
confint(fit1)
                  2.5 %     97.5 %
(Intercept)  2.32376228 3.55401646
TV           0.04301371 0.04851558
Radio        0.17154745 0.20551259
Newspaper   -0.01261595 0.01054097
library(coefplot)
coefplot(fit1, intercept=FALSE)
The confidence intervals for TV and Radio are narrow and far from zero, providing evidence that these media are related to Sales.
But the interval for Newspaper includes zero, indicating that the variable is not statistically significant given the values of TV and Radio.
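The intervals reported by confint(fit1) follow the usual form \( \hat{\beta}_j \pm t_{0.975,\,196} \times SE(\hat{\beta}_j) \). As a hedged illustration, the sketch below rebuilds them from the rounded estimates and standard errors in the Model 1 coefficient table; the variable names are illustrative, and the results agree with the confint output only up to rounding of the inputs.

```r
# Estimates and standard errors copied from the Model 1 coefficient table
est <- c(TV = 0.045765, Radio = 0.188530, Newspaper = -0.001037)
se  <- c(TV = 0.001395, Radio = 0.008611, Newspaper = 0.005871)
df_resid <- 196   # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds, one row per predictor
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# Flag intervals that contain zero
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0

print(round(ci, 5))
print(contains_zero)
```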