This project aims to study linear regression in R. We use the Advertising dataset for the regression analysis and investigate whether sales are associated with three advertising media: TV, radio and newspaper.
Now we load the dataset and study the relationships among its variables.
# read.csv() is part of base R, so the readxl package is not needed here
df <- read.csv("D:/project/Advertising.csv")
head(df)
## TV radio newspaper sales
## 1 230.1 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 151.5 41.3 58.5 18.5
## 5 180.8 10.8 58.4 12.9
## 6 8.7 48.9 75.0 7.2
The dataset provides sales (in thousands of units) of a particular product as a function of the advertising budgets (in thousands of dollars) for TV, radio and newspaper in n = 200 markets.
str(df)
## 'data.frame': 200 obs. of 4 variables:
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
## $ newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
## $ sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
summary(df)
## TV radio newspaper sales
## Min. : 0.70 Min. : 0.000 Min. : 0.30 Min. : 1.60
## 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75 1st Qu.:10.38
## Median :149.75 Median :22.900 Median : 25.75 Median :12.90
## Mean :147.04 Mean :23.264 Mean : 30.55 Mean :14.02
## 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10 3rd Qu.:17.40
## Max. :296.40 Max. :49.600 Max. :114.00 Max. :27.00
The dataset has 200 observations and 4 variables, all of them quantitative. To study the relationships between quantitative variables, we can use linear regression. We start by plotting sales against each advertising medium in turn.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.2
ggplot(data=df,aes(x=TV,y=sales)) + geom_point() + geom_smooth() + labs(x="Expenditure for TV Ads.(in thousands of dollars) ",y="Number of sales (in thousands of units) ",title="The relationship between sales and TV advertising")+stat_cor(method = "pearson", label.x = -5, label.y = 30)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
A strong positive correlation can be seen between TV advertising and sales of the product. In other words, markets with a larger TV advertising budget tend to sell more units.
library(ggplot2)
library(ggpubr)
ggplot(data=df,aes(x=radio,y=sales)) + geom_point() + geom_smooth() + labs(x="Expenditure for radio Ads.(in thousands of dollars) ",y="Number of sales (in thousands of units) ",title="The relationship between sales and radio advertising")+stat_cor(method = "pearson", label.x = -5, label.y = 30)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Similarly, there is a positive correlation between radio advertising and sales, but it is not as strong as the correlation between TV advertising and sales.
library(ggplot2)
library(ggpubr)
ggplot(data=df,aes(x=newspaper,y=sales)) + geom_point() + geom_smooth() + labs(x="Expenditure for newspaper Ads.(in thousands of dollars) ",y="Number of sales (in thousands of units) ",title="The relationship between sales and newspaper advertising")+stat_cor(method = "pearson", label.x = -5, label.y = 30)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
While the other two types of advertising show a clear positive association with sales, the expenditure on newspaper advertising shows only a weak association with sales of the product.
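The visual impressions above can also be checked numerically with a correlation matrix. The sketch below is only an illustration and not part of the original analysis: `df_head` is a hypothetical name for a data frame that holds just the six rows printed by head(df), so its values will differ from the correlations shown on the plots, which use all 200 markets.

```r
# First six rows of the Advertising data, copied from the head(df) output above.
# Six rows are only enough to show the mechanics; use the full df for real results.
df_head <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Pearson correlation between every pair of columns
cor_mat <- cor(df_head, method = "pearson")

# Correlation of each advertising medium with sales
round(cor_mat[c("TV", "radio", "newspaper"), "sales"], 2)
```

On the full dataset, the same call would simply be cor(df).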
After looking at the correlation between each advertising medium and sales, we apply multiple linear regression to model the linear relationship between sales and the three types of advertising.
sales = B0 + B1*TV + B2*radio + B3*newspaper
The “B” values are called the regression weights (or beta coefficients). They measure the association between each predictor variable and the outcome: “B_j” can be interpreted as the average change in sales for a one-unit increase in “x_j”, holding all other predictors fixed. B0 is the intercept term, that is, the expected value of sales when the TV, radio and newspaper advertising budgets are all 0.
Using the lm() function in R, we can estimate the coefficients B0, B1, B2 and B3.
model <- lm(sales ~ TV + radio + newspaper, data = df)
summary(model)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
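First, we look at the F-statistic, which tests whether at least one predictor is associated with the response. When there is no relationship between the predictors and the response, the F-statistic is expected to take a value close to 1; when at least one predictor is associated with the response, it is expected to be greater than 1. How much greater than 1 it needs to be depends on the number of observations and predictors, so the decision is based on the associated p-value. Here the F-statistic is 570.3 with a p-value below 2.2e-16, so there is strong evidence that at least one of the advertising budgets is associated with sales.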
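As a rough check on the reported F-statistic, it can be recomputed from the R-squared value printed in the summary. This is a sketch that assumes only the rounded numbers from the output above; the variable names are illustrative.

```r
# Values copied from the summary(model) output of the three-predictor model
r_squared <- 0.8972   # Multiple R-squared, rounded as printed
n <- 200              # number of markets
p <- 3                # number of predictors: TV, radio, newspaper

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

# Upper-tail probability of an F distribution with p and n - p - 1 degrees of freedom
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

f_stat    # close to the printed 570.3; the small gap comes from rounding R-squared
p_value   # far below 0.05
```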
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889369 0.311908236 9.4222884 1.267295e-17
## TV 0.045764645 0.001394897 32.8086244 1.509960e-81
## radio 0.188530017 0.008611234 21.8934961 1.505339e-54
## newspaper -0.001037493 0.005871010 -0.1767146 8.599151e-01
The estimate of B0 is 2.939: with no advertising budget at all, the model predicts that around 2,939 units would be sold.
For a fixed radio and newspaper advertising budget, every additional $1,000 spent on TV advertising is associated with about 46 additional units sold. Likewise, for a fixed TV and newspaper advertising budget, every additional $1,000 spent on radio advertising is associated with an increase in sales of approximately 189 units.
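The conversion from coefficients to units sold is a matter of scale, because budgets are recorded in thousands of dollars and sales in thousands of units. A minimal sketch, using the estimates printed above and illustrative variable names, rounds each result to the nearest unit:

```r
# Coefficient estimates copied from the coefficient table above
b_tv    <- 0.045764645
b_radio <- 0.188530017

# One unit of budget is $1,000 and one unit of sales is 1,000 units sold
extra_budget <- 1   # one more thousand dollars, other budgets held fixed

units_from_tv    <- b_tv * extra_budget * 1000
units_from_radio <- b_radio * extra_budget * 1000

round(units_from_tv)      # about 46 units
round(units_from_radio)   # about 189 units
```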
The p-value for newspaper, however, is large (0.86), which means that there is no evidence of an association between newspaper advertising and sales once TV and radio are in the model. This agrees with the earlier scatter plot, where the correlation between newspaper advertising and sales was weak; hence, we can drop the newspaper variable from the model.
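The t value and p-value for newspaper can be reproduced from its estimate and standard error, and a 95% confidence interval gives the same message. The sketch below assumes only the numbers printed in the coefficient table; the variable names are illustrative.

```r
# Newspaper row copied from the coefficient table above
estimate  <- -0.001037493
std_error <-  0.005871010
df_resid  <- 196   # residual degrees of freedom of the three-predictor model

# t statistic and two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the newspaper coefficient
half_width <- qt(0.975, df = df_resid) * std_error
ci <- c(lower = estimate - half_width, upper = estimate + half_width)

round(t_value, 3)
round(p_value, 2)
ci   # the interval contains 0
```

Because the interval contains 0, the data are compatible with newspaper advertising having no effect on sales when TV and radio budgets are held fixed.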
model <- lm(sales ~ TV + radio , data = df)
summary(model)
##
## Call:
## lm(formula = sales ~ TV + radio, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7977 -0.8752 0.2422 1.1708 2.8328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.92110 0.29449 9.919 <2e-16 ***
## TV 0.04575 0.00139 32.909 <2e-16 ***
## radio 0.18799 0.00804 23.382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
## F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
Our new model is therefore sales = 2.921 + 0.0458*TV + 0.1880*radio.
After fitting the model, we assess how well it fits the data. For linear regression, we usually look at the residual standard error (RSE) and the R-squared statistic.
In our model, the residual standard error is 1.681:
sigma(model)
## [1] 1.681361
sigma(model)/mean(df$sales)
## [1] 0.1199045
The residual standard error is an estimate of the standard deviation of the error terms. Since RSE = 1.681, even if the model were correct and the true values of the unknown coefficients B0, B1 and B2 were known exactly, any prediction of sales on the basis of TV and radio advertising would still be off by about 1,681 units on average. Relative to the mean sales of about 14,000 units across all markets, this is an error rate of about 12%.
An R-squared statistic close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A value close to 0 indicates that the regression does not explain much of the variability in the response.
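The arithmetic behind these two figures can be sketched as follows, assuming only the values printed above (the mean of sales is taken from summary(df), rounded as shown there):

```r
# Values copied from the output above
rse        <- 1.681361   # sigma(model), in thousands of units
mean_sales <- 14.02      # mean of sales from summary(df), in thousands of units

# Typical prediction error expressed in units sold
rse * 1000

# Error as a percentage of mean sales
round(100 * rse / mean_sales, 1)
```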
summary(model)$r.squared
## [1] 0.8971943
The R-squared of our model is 0.897, so nearly 90% of the variability in sales is explained by a linear regression on TV and radio.
We can also use the model to predict sales for a new advertising budget for TV and radio. Suppose $47,000 is spent on TV advertising and $50,000 on radio advertising. We predict the unit sales for this new budget:
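R-squared never decreases when a predictor is added, so the two summaries above also report an adjusted R-squared that penalizes the number of predictors. The sketch below recomputes it from the printed R-squared values using the standard formula; `adjusted_r2` is a helper name introduced here for illustration.

```r
# Adjusted R-squared for a model with p predictors fitted to n observations
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 200   # number of markets

adj_full    <- adjusted_r2(0.8972, n, p = 3)      # TV + radio + newspaper
adj_reduced <- adjusted_r2(0.8971943, n, p = 2)   # TV + radio

round(adj_full, 4)
round(adj_reduced, 4)
```

The reduced model has the slightly higher adjusted R-squared, which is consistent with the decision to drop newspaper.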
predict(model,data.frame(TV= 47, radio=50))
## 1
## 14.47129
With a budget of $47,000 for TV advertising and $50,000 for radio advertising, the model predicts that about 14,471 units will be sold.
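The same prediction can be reproduced by hand from the fitted coefficients. This sketch uses the rounded estimates printed in the summary of the TV + radio model, so it agrees with predict() only up to rounding.

```r
# Coefficients copied from the summary() output of the TV + radio model
b0      <- 2.92110
b_tv    <- 0.04575
b_radio <- 0.18799

tv_budget    <- 47   # thousands of dollars
radio_budget <- 50   # thousands of dollars

pred <- b0 + b_tv * tv_budget + b_radio * radio_budget

pred          # predicted sales in thousands of units
pred * 1000   # predicted sales in units
```

To quantify the uncertainty of such a prediction, predict() for lm objects also accepts interval = "prediction", which returns lower and upper bounds alongside the fitted value.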