A simple example of multiple linear regression. In this example, Sales is the response variable, and Price and advertising Cost are the predictor variables. We will see which of the two has more influence on product sales.
We load the data first.
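The loading step itself is not shown here. As a minimal sketch, assuming the twelve rows are simply typed in by hand from the printed table (the original may well have read them from a file), the data frame could be built like this:

```r
# Assumption: values are copied by hand from the printed table;
# the original source of the data (file, database, ...) is not shown.
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)

# Quick structural check: 12 observations of 3 numeric variables
str(product)
```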
product
## Sales Price Cost
## 1 489 115 128
## 2 550 110 158
## 3 500 110 170
## 4 670 100 200
## 5 670 100 250
## 6 350 155 72
## 7 360 160 90
## 8 410 145 180
## 9 110 200 82
## 10 275 160 170
## 11 300 165 178
## 12 520 125 200
summary(product)
## Sales Price Cost
## Min. :110 Min. :100 Min. : 72
## 1st Qu.:338 1st Qu.:110 1st Qu.:118
## Median :450 Median :135 Median :170
## Mean :434 Mean :137 Mean :156
## 3rd Qu.:528 3rd Qu.:160 3rd Qu.:185
## Max. :670 Max. :200 Max. :250
We plot each predictor against Sales to check how the variables are related.
library("ggplot2")
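# Scatter plot of Sales against Price with a fitted linear trend line
# (qplot() is deprecated in recent ggplot2 releases, so ggplot() is used)
ggplot(product, aes(x = Price, y = Sales)) + geom_point() + geom_smooth(method = "lm")
# Scatter plot of Sales against advertising Cost with a fitted linear trend line
ggplot(product, aes(x = Cost, y = Sales)) + geom_point() + geom_smooth(method = "lm")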
Here are the correlation matrix and the correlation tests.
cor(product)
## Sales Price Cost
## Sales 1.0000 -0.9696 0.6757
## Price -0.9696 1.0000 -0.6413
## Cost 0.6757 -0.6413 1.0000
cor.test(product$Sales, product$Price)
##
## Pearson's product-moment correlation
##
## data: product$Sales and product$Price
## t = -12.54, df = 10, p-value = 1.934e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9917 -0.8922
## sample estimates:
## cor
## -0.9696
cor.test(product$Sales, product$Cost)
##
## Pearson's product-moment correlation
##
## data: product$Sales and product$Cost
## t = 2.899, df = 10, p-value = 0.01587
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1663 0.9004
## sample estimates:
## cor
## 0.6757
Sales and Price have a strong negative correlation (r = -0.97), while Sales and Cost have a moderate positive correlation (r = 0.68). Both correlations are significant at the 5% level.
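To make the numbers reported by `cor.test()` less of a black box, the sketch below recomputes the Pearson correlation between Sales and Price by hand, along with the t statistic and two-sided p-value. The variable names (`dx`, `dy`, `t_stat`, `p_value`) are illustrative only, and the data frame is re-entered so the snippet runs on its own.

```r
# Data re-entered by hand so the snippet is self-contained
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)

n  <- nrow(product)
dx <- product$Price - mean(product$Price)  # deviations from the mean of Price
dy <- product$Sales - mean(product$Sales)  # deviations from the mean of Sales

# Pearson correlation: covariance scaled by both standard deviations
r <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Test statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# These should agree with the cor.test() output for Sales and Price
c(r = r, t = t_stat, p = p_value)
```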
We build a multiple linear regression model.
product_reg <- lm(Sales ~ Price + Cost, data = product)
summary(product_reg)
##
## Call:
## lm(formula = Sales ~ Price + Cost, data = product)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.37 -20.19 2.08 27.42 54.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1042.083 111.553 9.34 6.3e-06 ***
## Price -4.760 0.532 -8.95 8.9e-06 ***
## Cost 0.281 0.313 0.90 0.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.8 on 9 degrees of freedom
## Multiple R-squared: 0.945, Adjusted R-squared: 0.933
## F-statistic: 77.5 on 2 and 9 DF, p-value: 2.13e-06
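From the model above, Price is statistically significant (p = 8.9e-06), whereas Cost is not (p = 0.39). Cost is positively correlated with Sales on its own, but it adds little once Price is in the model, which is plausible given that Price and Cost are themselves correlated (r = -0.64). R-squared is 0.9451, which means about 94.5% of the variance in Sales is explained by the two predictors together. The residual standard error is 42.84 on 9 degrees of freedom. The p-value of the overall F-test is 2.126e-06, so the model as a whole is statistically significant.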
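The summary figures can be reproduced from the residuals, which may help when reading the output. The sketch below recomputes R-squared and the residual standard error, and then refits the model on standardized variables so that the two coefficients are on a comparable scale. The standardized refit is an addition for illustration, not part of the original analysis, and names such as `rss`, `tss`, and `std_reg` are hypothetical.

```r
# Data re-entered by hand so the snippet is self-contained
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)
product_reg <- lm(Sales ~ Price + Cost, data = product)

rss <- sum(residuals(product_reg)^2)                   # residual sum of squares
tss <- sum((product$Sales - mean(product$Sales))^2)    # total sum of squares

r_squared <- 1 - rss / tss                             # share of variance explained
rse <- sqrt(rss / df.residual(product_reg))            # residual standard error (9 df)

# Standardize every column (mean 0, sd 1) and refit; the slopes are then
# in standard-deviation units and can be compared with each other.
std_product <- as.data.frame(scale(product))
std_reg <- lm(Sales ~ Price + Cost, data = std_product)
std_coef <- coef(std_reg)[c("Price", "Cost")]

r_squared
rse
std_coef
```

On this reading, the predictor with the larger absolute standardized coefficient is the one with more influence on Sales within this model.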
We also plot the residual diagnostics in a 2 x 2 grid.
par(mfrow = c(2,2))
plot(product_reg)
We add the fitted values to the original table.
Prediction <- round(fitted(product_reg), 2)
product$Prediction <- Prediction
product
## Sales Price Cost Prediction
## 1 489 115 128 530.8
## 2 550 110 158 563.0
## 3 500 110 170 566.4
## 4 670 100 200 622.4
## 5 670 100 250 636.5
## 6 350 155 72 324.6
## 7 360 160 90 305.9
## 8 410 145 180 402.6
## 9 110 200 82 113.2
## 10 275 160 170 328.4
## 11 300 165 178 306.8
## 12 520 125 200 503.4
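The fitted model can also be applied to a product that is not in the table. The sketch below uses `predict()` with a made-up price and advertising cost (the values 120 and 150 are hypothetical, chosen to lie inside the observed range) and checks the result against the regression equation written out by hand.

```r
# Data re-entered by hand so the snippet is self-contained
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)
product_reg <- lm(Sales ~ Price + Cost, data = product)

# Hypothetical new product; these values are assumptions for illustration
new_product <- data.frame(Price = 120, Cost = 150)

# Point prediction from the fitted model
predicted_sales <- unname(predict(product_reg, newdata = new_product))

# The same number from the equation: intercept + b_Price * Price + b_Cost * Cost
b <- coef(product_reg)
manual_sales <- unname(b["(Intercept)"] + b["Price"] * 120 + b["Cost"] * 150)

predicted_sales
manual_sales
```

Predictions far outside the observed ranges of Price (100 to 200) and Cost (72 to 250) would be extrapolation and should be treated with caution.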