About

In a given year, if it rains more, we may see an increase in crop production, since more water may lead to more plants. This is a direct relationship; the number of fruits may be predictable from the amount of rainfall in a certain year. This example illustrates simple linear regression, an extremely useful technique that allows us to predict the values of one variable based on another variable.

This lab will explore the concepts of simple linear regression and multiple linear regression.

Setup

Remember to always set your working directory to the source file location: go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the material below carefully and follow the instructions to complete the tasks and answer the questions. Submit your work to RPubs as detailed in previous notes.

Note

Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their particular placement in the task section.


PART A: SIMPLE LINEAR REGRESSION

First, read in the marketing data that was used in the previous lab. Make sure the file is read in correctly.

#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
##   case_number sales radio paper  tv pos
## 1           1 11125    65    89 250 1.3
## 2           2 16121    73    55 260 1.6
## 3           3 16440    74    58 270 1.7
## 4           4 16876    75    82 270 1.3
## 5           5 13965    69    75 255 1.5
## 6           6 14999    70    71 255 2.1

Next, apply the cor() function to the data to examine the correlations between variables. This is a convenient way to compare the correlations among all pairs of variables at once.

#Correlation matrix of all columns in the data
corr = cor(mydata)
corr
##             case_number      sales       radio       paper          tv
## case_number   1.0000000  0.2402344  0.23586825 -0.36838393  0.22282482
## sales         0.2402344  1.0000000  0.97713807 -0.28306828  0.95797025
## radio         0.2358682  0.9771381  1.00000000 -0.23835848  0.96609579
## paper        -0.3683839 -0.2830683 -0.23835848  1.00000000 -0.24587896
## tv            0.2228248  0.9579703  0.96609579 -0.24587896  1.00000000
## pos           0.0539763  0.0126486  0.06040209 -0.09006241 -0.03602314
##                     pos
## case_number  0.05397630
## sales        0.01264860
## radio        0.06040209
## paper       -0.09006241
## tv          -0.03602314
## pos          1.00000000
# Correlation matrix of columns 2,4, and 6
corr = cor( mydata[ c(2,4,6) ] )
corr
##            sales       paper         pos
## sales  1.0000000 -0.28306828  0.01264860
## paper -0.2830683  1.00000000 -0.09006241
## pos    0.0126486 -0.09006241  1.00000000
#Correlation matrix of all columns except the first column. This is convenient since case_number is only an indicator for the month and should be excluded from the calculations.
corr = cor( mydata[ 2:6 ] )
corr
##            sales       radio       paper          tv         pos
## sales  1.0000000  0.97713807 -0.28306828  0.95797025  0.01264860
## radio  0.9771381  1.00000000 -0.23835848  0.96609579  0.06040209
## paper -0.2830683 -0.23835848  1.00000000 -0.24587896 -0.09006241
## tv     0.9579703  0.96609579 -0.24587896  1.00000000 -0.03602314
## pos    0.0126486  0.06040209 -0.09006241 -0.03602314  1.00000000
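
For easier reading, the correlation matrix can also be rounded before printing; a small optional sketch (not part of the original worksheet):

#Round the correlation matrix to two decimal places for a more compact display
round(corr, 2)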
TASK 1A: Why is the value of “1.0” along the diagonal?

ANSWER:

Because every variable is perfectly correlated with itself, so the correlation along the diagonal is always 1.0.

TASK 1B: Which pair of variables has the strongest correlations and what is the correlation value?

ANSWER:

Radio and sales, with a correlation value of 0.9771381.

TASK 1C: Which pair of variables has the weakest correlations and what is the correlation value?

ANSWER:

Sales and pos, with a correlation value of 0.0126486 (the value closest to zero).

Next, we will create a visual diagram of the correlation matrix called a corrgram, where correlation strength is represented by color intensity. To do this, we first need to install and load two packages in RStudio, as executed by the command lines below.

# install.packages(c("corrplot", "corrgram"))   # run once to install the packages if needed
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
## corrplot 0.84 loaded
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.5.3
# Generates a corrgram of last computed correlation matrix
corrgram(corr)

# Generates a corrplot, similar to a corrgram, but with a different visual display
corrplot(corr)
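
corrplot can also display the correlation values themselves rather than colored circles; a small optional sketch (not part of the original worksheet):

#Display the correlations as numbers instead of colored circles
corrplot(corr, method = "number")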

Let’s create a few scatterplots to visualize the data and trend lines. First, we need to extract the columns from the data file.

#Extract all variables
pos  = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot() function from lab Worksheet 4 and using scatter.smooth() function from lab worksheet 5.
plot(radio, sales)

scatter.smooth(radio,sales)

From this plot, it seems the points are scattered in an almost linear way. So, we will try to fit a simple linear regression model to the graph.

The lm() linear modeling function is a very useful one. The function is set up as lm(y ~ x) where the x variable (the independent variable - the x axis) predicts values of the y variable (the dependent variable - the y axis).

In the simple linear regression model below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model.

#Simple Linear Regression Model using the R function lm() to predict sales from radio, and display the summary statistics. Variable regSR is used to keep the regression model.
regSR <- lm(sales ~ radio)
#Summary statistics of the Model
summary(regSR)
## 
## Call:
## lm(formula = sales ~ radio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1732.85  -198.88    62.64   415.26   637.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9741.92    1362.94  -7.148 1.17e-06 ***
## radio         347.69      17.83  19.499 1.49e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared:  0.9548, Adjusted R-squared:  0.9523 
## F-statistic: 380.2 on 1 and 18 DF,  p-value: 1.492e-13

As indicated by the summary report, the intercept value is -9741.92 and the slope for radio is 347.69. Following the standard linear model ( Y = b + mX ), where Y is the predicted (response, or dependent) variable, b is the intercept, m is the slope for X, and X is the predictor (independent) variable, we can write down the following equation for the above linear regression model predicting sales based on radio: sales_predicted = -9741.92 + 347.69 * Xradio. Given this equation, we can predict the value of sales for any given value of radio, for example 75 (investing $75,000 in radio ads).

#The following is the R code to predict the value of sales for a given radio value of 75 (predict the sales by investing $75,000 in radio ads).
sales_predicted = -9741.92 + 347.69 * (75)
sales_predicted
## [1] 16334.83
### Another way to write the equation is to refer to the coefficients of the equation instead of typing the actual values out. This is demonstrated below.

sales_predicted = coef(regSR)[1] + coef(regSR)[2] * 75
coef(regSR)[1] # intercept
## (Intercept) 
##   -9741.921
coef(regSR)[2] # slope
##    radio 
## 347.6888
sales_predicted
## (Intercept) 
##    16334.74
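
An alternative to typing the equation by hand is R’s built-in predict() function, which computes the prediction directly from the fitted model; a minimal sketch (not part of the original worksheet):

#Predict sales for radio = 75 directly from the fitted model regSR; this should match the value computed above
predict(regSR, newdata = data.frame(radio = 75))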
TASK 2A: Create a Linear Regression Model using the R function lm() for sales and tv. Use variable regST to get the regression model.
#Simple Linear Regression using the R function lm() to predict sales from tv, and display the summary statistics. Variable regST is used to keep the regression model.
regST <- lm(sales ~ tv)
summary(regST)
## 
## Call:
## lm(formula = sales ~ tv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1921.87  -412.24     7.02   581.59  1081.61 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42229.21    4164.12  -10.14 7.19e-09 ***
## tv             221.10      15.61   14.17 3.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 771.3 on 18 degrees of freedom
## Multiple R-squared:  0.9177, Adjusted R-squared:  0.9131 
## F-statistic: 200.7 on 1 and 18 DF,  p-value: 3.336e-11
TASK 2B: Write down the equation for the Linear Regression Model predicting sales based on tv. Note the values for the intercept and the slope.

ANSWER:

sales_predicted = -42229.21 + 221.10 * Xtv

TASK 2C: Write down the R code to predict the value of sales for a given tv value of 252 (predict the sales by investing $252,000 in tv ads).
sales_predicted = -42229.21 + 221.10 * (252)
sales_predicted
## [1] 13487.99
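
Equivalently, the prediction can be written in terms of the model coefficients, mirroring the regSR example earlier; a minimal sketch:

#Predict sales for tv = 252 using the coefficients of regST
sales_predicted = coef(regST)[1] + coef(regST)[2] * 252
sales_predicted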

A high R-squared value indicates that the model is a good fit, though not a perfect one. For the case of sales versus radio, we will overlay the trend line representing the regression equation on the original plot. This will show how far the predictions are from the actual values. The difference between the actual sales (circles) and the predicted sales (solid line) is captured in the residual error calculations reported by the summary function.

#Plot radio and sales 
plot(radio,sales)

#Add a trend line plot using the linear model created above
abline(regSR, col="blue",lwd=2) 
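
The residuals mentioned above can also be extracted directly from the fitted model for inspection; a minimal sketch (not part of the original worksheet):

#Extract the residuals (actual sales minus predicted sales) from model regSR and summarize them
summary(residuals(regSR))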


PART B: MULTIPLE LINEAR REGRESSION

Often, more than one factor or variable affects the prediction of an outcome. While increased rainfall is a good predictor of increased crop supply, fewer herbivores can also result in an increase in crops. This idea is a loose metaphor for multiple linear regression.

In R, multiple linear regression takes the form of lm(y ~ x0 + x1 + x2 + ... ), where y is the value that is being predicted (the dependent variable), and the x variables are the predictors (the independent variables).

Let’s create a multiple linear regression model predicting sales using both independent variables, radio and tv.

#Multiple Linear Regression Model using the R function lm() for sales, radio and tv, and display the summary statistics. Variable mlr is used to keep the regression model.
mlr <-lm(sales ~ radio + tv)

#Summary statistics of the Multiple Linear Regression Model
summary(mlr)
## 
## Call:
## lm(formula = sales ~ radio + tv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1729.58  -205.97    56.95   335.15   759.26 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17150.46    6965.59  -2.462 0.024791 *  
## radio          275.69      68.73   4.011 0.000905 ***
## tv              48.34      44.58   1.084 0.293351    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared:  0.9577, Adjusted R-squared:  0.9527 
## F-statistic: 192.6 on 2 and 17 DF,  p-value: 2.098e-12

Note the values of 0.9577 for R-squared and 0.9527 for the Adj R-squared.
Furthermore, as indicated by the summary report, the intercept value is -17150.46, the slope for radio is 275.69, and the slope for tv is 48.34. We can therefore write down the following equation for the above multiple linear regression model predicting sales based on radio and tv: sales_predicted = -17150.46 + 275.69 * Xradio + 48.34 * Xtv. Given this equation, we can predict the value of sales for any given value of radio, for example 75 (investing $75,000 in radio ads), and any given value of tv, for example 270 (investing $270,000 in tv ads). The following R code calculates the predicted sales for radio = 75 and tv = 270.

# Predicted sales from model mlr for radio = 75 and tv = 270
sales_predicted = -17150.46 + 275.69*(75) + 48.34*(270)
sales_predicted
## [1] 16578.09
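
The same prediction can be obtained with predict() by supplying the new radio and tv values; a minimal sketch (not part of the original worksheet):

#Predict sales for radio = 75 and tv = 270 directly from the fitted model mlr; this should be close to the value computed above
predict(mlr, newdata = data.frame(radio = 75, tv = 270))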

We can compare the above predicted sales value to the actual value obtained from the data file. The actual sales value in the data file when radio = 75 and tv = 270 is 16876. We can quantify the comparison by calculating the error-squared as follows (note: the smaller the error-squared, the better).

# The R code to calculate Error_squared = (actual - predicted) * (actual - predicted) 
error_squared = (16876 - 16578.09) * (16876 - 16578.09)
error_squared
## [1] 88750.37
TASK 3A: Create a Multiple Linear Regression Model using the R function lm() to predict sales from radio, tv, and pos, and display the summary statistics. Use variable mlrSRTP to get the regression model.
#Multiple Linear Regression using the R function lm() to predict sales from radio, tv, and pos, and display the summary statistics. Variable mlrSRTP is used to keep the regression model.
mlrSRTP <- lm(sales ~ radio + tv + pos)
summary(mlrSRTP)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1748.20  -187.42   -61.14   352.07   734.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -15491.23    7697.08  -2.013  0.06130 . 
## radio          291.36      75.48   3.860  0.00139 **
## tv              38.26      48.90   0.782  0.44538   
## pos           -107.62     191.25  -0.563  0.58142   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared:  0.9585, Adjusted R-squared:  0.9508 
## F-statistic: 123.3 on 3 and 16 DF,  p-value: 2.859e-11
TASK 3B: Write down the equation for the Multiple Linear Regression Model predicting sales based on radio, tv, and pos. Note the values for the intercept and the slopes.

ANSWER:

sales_predicted = -15491.23 + 291.36 * Xradio + 38.26 * Xtv - 107.62 * Xpos

TASK 3C: Write down the values for R-Squared and Adj R-Squared for the above regression model.

ANSWER:

R-squared = 0.9585, Adjusted R-squared = 0.9508

TASK 4A: Create a Multiple Linear Regression Model using the R function lm() to predict sales from radio, tv, pos, and paper, and display the summary statistics. Use variable mlrSRTPP to get the regression model.
#Multiple Linear Regression using the R function lm() to predict sales from radio, tv, pos, and paper and display the summary statistics. Variable mlrSRTPP is used to keep the regression model.
mlrSRTPP <- lm(sales ~ radio + tv + pos + paper)

#Summary statistics of the Multiple Linear Regression Model
summary(mlrSRTPP)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1558.13  -239.35     7.25   387.02   728.02 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -13801.015   7865.017  -1.755  0.09970 . 
## radio          294.224     75.442   3.900  0.00142 **
## tv              33.369     49.080   0.680  0.50693   
## pos           -128.875    192.156  -0.671  0.51262   
## paper           -9.159      8.991  -1.019  0.32449   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared:  0.9612, Adjusted R-squared:  0.9509 
## F-statistic: 92.96 on 4 and 15 DF,  p-value: 2.13e-10
TASK 4B: Write down the equation for the Multiple Linear Regression Model predicting sales based on radio, tv, pos, and paper. Note the values for the intercept and the slopes.

ANSWER:

sales_predicted = -13801.015 + 294.224 * Xradio + 33.369 * Xtv - 128.875 * Xpos - 9.159 * Xpaper

TASK 4C: Write down the values for R-Squared and Adj R-Squared for the above regression model.

ANSWER:

R-squared = 0.9612, Adjusted R-squared = 0.9509

TASK 5: Based solely on the values for R-Squared and Adj R-Squared, which of the three multiple linear regression models mlr, mlrSRTP, mlrSRTPP is best in predicting sales? Explain why.

ANSWER:

mlrSRTPP is best because it has the highest R-squared (0.9612) and the second highest Adjusted R-squared (0.9509). It takes into account all four predictor variables (radio, tv, pos, and paper), which gives it the most information for predicting sales.

TASK 6A: Write down the R code to predict the value of sales for model mlr given that Radio = 69 , and TV = 255.
# Predicted sales from model mlr for radio = 69 and tv = 255
sales_predicted1 = -17150.46 + 275.69*(69) + 48.34*(255)
sales_predicted1
## [1] 14198.85
TASK 7A: Write down the R code to predict the value of sales for model mlrSRTP given that Radio = 69 , TV = 255 , and POS = 1.5.
# Predicted sales from model mlrSRTP for radio = 69, tv = 255, and pos = 1.5
sales_predicted2 = -15491.23 + (291.36 * 69) + (38.26 * 255) + (-107.62 * 1.5)
sales_predicted2
## [1] 14207.48
TASK 8A: Write down the R code to predict the value of sales for model mlrSRTPP given that Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75.
# Predicted sales from model mlrSRTPP for radio = 69, tv = 255, pos = 1.5, and paper = 75
sales_predicted3 = -13801.015 + (294.224 * 69) + (33.369 * 255) + (-128.875 * 1.5) + (-9.159 * 75)
sales_predicted3
## [1] 14129.3
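
To support TASK 9, the error-squared of each of the three predictions can be computed against the actual sales value; a minimal sketch, assuming the actual value is 13965 (the row of the data file where radio = 69, tv = 255, pos = 1.5, and paper = 75):

#Actual sales value from the data file for radio = 69, tv = 255, pos = 1.5, paper = 75
actual = 13965
#Error-squared for each model's prediction (the smaller, the better)
error_squared1 = (actual - sales_predicted1) * (actual - sales_predicted1)  # mlr
error_squared2 = (actual - sales_predicted2) * (actual - sales_predicted2)  # mlrSRTP
error_squared3 = (actual - sales_predicted3) * (actual - sales_predicted3)  # mlrSRTPP
error_squared1
error_squared2
error_squared3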
TASK 9: Based on the Error-squared values which of the three models (mlr, mlrSRTP, and mlrSRTPP) is the best at predicting the actual sales value? Why?

ANSWER:

mlrSRTPP is best at predicting the actual sales value because it has the lowest error-squared of the three models, meaning its prediction is the closest to the actual value.

TASK 10: Compare your answer for TASK 9 to your answer for TASK 7 and share any observations.

ANSWER:

mlr and mlrSRTP both have larger error-squared values, likely because they use fewer of the available variables. mlrSRTPP uses all of the predictors and has the lowest error-squared value, so it gives the prediction closest to the actual sales value, which is consistent with the answer for TASK 9.