This is a simple example of multiple linear regression. Sales is the response variable; Price and advertising Cost are the predictor variables. We want to see which of the two has more influence on product sales.

We load the data first.

product
##    Sales Price Cost
## 1    489   115  128
## 2    550   110  158
## 3    500   110  170
## 4    670   100  200
## 5    670   100  250
## 6    350   155   72
## 7    360   160   90
## 8    410   145  180
## 9    110   200   82
## 10   275   160  170
## 11   300   165  178
## 12   520   125  200
summary(product)
##      Sales         Price          Cost    
##  Min.   :110   Min.   :100   Min.   : 72  
##  1st Qu.:338   1st Qu.:110   1st Qu.:118  
##  Median :450   Median :135   Median :170  
##  Mean   :434   Mean   :137   Mean   :156  
##  3rd Qu.:528   3rd Qu.:160   3rd Qu.:185  
##  Max.   :670   Max.   :200   Max.   :250
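
The loading step itself is not shown above. As a minimal sketch, assuming the data are typed in by hand rather than read from a file, the table can be rebuilt with `data.frame()`; the values are copied from the printout.

```r
# Rebuild the example data by hand (values copied from the printed table above).
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)

str(product)       # structure: 12 observations of 3 numeric variables
colMeans(product)  # column means, comparable to the Mean row of summary()
```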

We plot each predictor against Sales to get a first look at how they are related.

library("ggplot2")
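# Scatter plot of Sales against Price with a fitted least-squares line
# (qplot() is deprecated and no longer passes method = "lm" to the smoother)
ggplot(product, aes(x = Price, y = Sales)) +
  geom_point() + geom_smooth(method = "lm")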

[Figure: scatter plot of Sales against Price with fitted regression line]

# Scatter plot of Sales against advertising Cost with a fitted least-squares line
ggplot(product, aes(x = Cost, y = Sales)) +
  geom_point() + geom_smooth(method = "lm")

[Figure: scatter plot of Sales against Cost with fitted regression line]

Next we compute the correlation matrix and test the correlation of each predictor with Sales.

cor(product)
##         Sales   Price    Cost
## Sales  1.0000 -0.9696  0.6757
## Price -0.9696  1.0000 -0.6413
## Cost   0.6757 -0.6413  1.0000
cor.test(product$Sales, product$Price)
## 
##  Pearson's product-moment correlation
## 
## data:  product$Sales and product$Price
## t = -12.54, df = 10, p-value = 1.934e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9917 -0.8922
## sample estimates:
##     cor 
## -0.9696
cor.test(product$Sales, product$Cost)
## 
##  Pearson's product-moment correlation
## 
## data:  product$Sales and product$Cost
## t = 2.899, df = 10, p-value = 0.01587
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1663 0.9004
## sample estimates:
##    cor 
## 0.6757

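Sales and Price have a strong negative correlation (r = -0.97, p = 1.9e-07), while Sales and Cost have a moderate positive correlation (r = 0.68, p = 0.016). Price and Cost are themselves negatively correlated (r = -0.64).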
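
To see where the t values reported by `cor.test()` come from, the sketch below recomputes them from the printed correlations using the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2). The helper name `t_from_r` is made up for this illustration, and because the inputs are the rounded correlations, the results only match the output above approximately.

```r
# t statistic for H0: true correlation = 0, given a sample correlation r
# and sample size n. (t_from_r is a hypothetical helper, not part of base R.)
t_from_r <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

n <- 12                          # number of rows in the product table
t_price <- t_from_r(-0.9696, n)  # Sales vs Price
t_cost  <- t_from_r(0.6757, n)   # Sales vs Cost

# Two-sided p-values from the t distribution with n - 2 degrees of freedom
p_price <- 2 * pt(-abs(t_price), df = n - 2)
p_cost  <- 2 * pt(-abs(t_cost),  df = n - 2)

round(c(t_price = t_price, t_cost = t_cost), 2)
signif(c(p_price = p_price, p_cost = p_cost), 3)
```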

We build the multiple linear regression model.

product_reg <- lm(Sales ~ Price + Cost, data = product)
summary(product_reg)
## 
## Call:
## lm(formula = Sales ~ Price + Cost, data = product)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.37 -20.19   2.08  27.42  54.13 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1042.083    111.553    9.34  6.3e-06 ***
## Price         -4.760      0.532   -8.95  8.9e-06 ***
## Cost           0.281      0.313    0.90     0.39    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.8 on 9 degrees of freedom
## Multiple R-squared:  0.945,  Adjusted R-squared:  0.933 
## F-statistic: 77.5 on 2 and 9 DF,  p-value: 2.13e-06

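From the model above we see that Price is significant (p = 8.9e-06) whereas Cost is not (p = 0.39). Holding Cost fixed, each one-unit increase in Price is associated with a drop of about 4.76 in Sales. Cost is not significant here even though it is correlated with Sales on its own; because Cost and Price are themselves correlated (r = -0.64), Cost adds little once Price is in the model. R-squared is 0.9451, which means about 94.5% of the variance in Sales is explained by the two predictors together. The residual standard error is 42.84 on 9 degrees of freedom. The p-value of the overall F-test is 2.126e-06, so the model as a whole is statistically significant. In this data, Price is the variable with more influence on Sales.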
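
The figures quoted above can also be pulled out of the fitted object directly rather than read off the printout. The following is a sketch, with the data retyped so that it runs on its own; the nested-model comparison at the end is an addition for illustration and not part of the original analysis.

```r
# Data retyped from the table at the top of the post.
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)
product_reg <- lm(Sales ~ Price + Cost, data = product)
fit_summary <- summary(product_reg)

fit_summary$r.squared      # share of the variance in Sales explained
fit_summary$adj.r.squared  # R-squared adjusted for the number of predictors
fit_summary$sigma          # residual standard error

# 95% confidence intervals for the coefficients; an interval that
# contains 0 corresponds to a non-significant term at the 5% level.
ci <- confint(product_reg)
ci

# Does Cost improve on a Price-only model? Partial F-test on nested models.
price_only <- lm(Sales ~ Price, data = product)
cmp <- anova(price_only, product_reg)
cmp
```

With a single added term, the p-value of this partial F-test is the same as the p-value of the Cost coefficient in the summary table.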

We also draw the standard diagnostic plots for the model to check the residuals.

par(mfrow = c(2,2))
plot(product_reg)

[Figure: 2 x 2 panel of regression diagnostic plots for product_reg]
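
By default, `plot()` on an `lm` object draws four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. As a rough sketch, the quantities behind those panels can be tabulated with base R extractor functions, which is handy with only 12 observations. The data are retyped so the block runs on its own, and the column names in `diagnostics` are chosen for this illustration.

```r
# Data retyped from the table at the top of the post.
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)
product_reg <- lm(Sales ~ Price + Cost, data = product)

# One row per observation with the values the diagnostic plots are built from.
diagnostics <- data.frame(
  fitted    = fitted(product_reg),         # x-axis of Residuals vs Fitted
  residual  = residuals(product_reg),      # raw residuals
  std_resid = rstandard(product_reg),      # standardized residuals (Q-Q, Scale-Location)
  leverage  = hatvalues(product_reg),      # x-axis of Residuals vs Leverage
  cooks_d   = cooks.distance(product_reg)  # influence of each observation
)
round(diagnostics, 3)
```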

Finally, we add the fitted values to the original table.

Prediction <- round(fitted(product_reg), 2)
product$Prediction <- Prediction
product
##    Sales Price Cost Prediction
## 1    489   115  128      530.8
## 2    550   110  158      563.0
## 3    500   110  170      566.4
## 4    670   100  200      622.4
## 5    670   100  250      636.5
## 6    350   155   72      324.6
## 7    360   160   90      305.9
## 8    410   145  180      402.6
## 9    110   200   82      113.2
## 10   275   160  170      328.4
## 11   300   165  178      306.8
## 12   520   125  200      503.4
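
The fitted values above are predictions for the rows the model was trained on. As a sketch of how the same model could be applied to a new case, `predict()` accepts a `newdata` data frame; the Price and Cost values below are made up for illustration and do not come from the original data.

```r
# Data retyped from the table at the top of the post.
product <- data.frame(
  Sales = c(489, 550, 500, 670, 670, 350, 360, 410, 110, 275, 300, 520),
  Price = c(115, 110, 110, 100, 100, 155, 160, 145, 200, 160, 165, 125),
  Cost  = c(128, 158, 170, 200, 250,  72,  90, 180,  82, 170, 178, 200)
)
product_reg <- lm(Sales ~ Price + Cost, data = product)

# Hypothetical new product (illustrative values only).
new_product <- data.frame(Price = 120, Cost = 150)

# Point prediction with a 95% prediction interval (columns fit, lwr, upr).
pred <- predict(product_reg, newdata = new_product,
                interval = "prediction", level = 0.95)
pred
```

The chosen values lie inside the observed ranges of Price (100 to 200) and Cost (72 to 250); predictions outside those ranges would be extrapolation and should be treated with more caution.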