In a given year, if it rains more, we might see an increase in crop production, because more water can support more plant growth. This is a direct relationship: the amount of produce can be predicted from the amount of rainfall in a given year. This example illustrates simple linear regression, an important concept that allows us to model (predict) the value of an output variable based on one or more input variables.
This lab explores the concepts of simple linear regression and multiple linear regression. Emphasis is placed on distinguishing between a good fitting model and a good predicting model. Before answering any question, take time to study the shared code examples carefully and to review the course material.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer the questions. Submit your work in Sakai as detailed in previous notes.
For your assignment you may be using different data sets than the one included here. Always read the provided instructions carefully before executing any included code chunks and/or adding your own code. For clarity, tasks/questions to be completed/answered are highlighted in red and numbered according to their placement in the task section. The red color is only apparent in Preview mode. Quite often you will need to add your own code chunk.
Execute all code chunks (both the included ones and your own additions), preview, check integrity, and submit the final work (html file) in Sakai.
First, read in the marketing data that was used in the previous lab. Make sure the file is read in correctly by checking the variables in the environment pane of RStudio and also by displaying the first few lines of the file.
#Read the input file
mydata = read.csv(file="marketing.csv")
class(mydata)
## [1] "data.frame"
head(mydata)
## case_number sales radio paper tv pos
## 1 1 11125 65 89 250 1.3
## 2 2 16121 73 55 260 1.6
## 3 3 16440 74 58 270 1.7
## 4 4 16876 75 82 270 1.3
## 5 5 13965 69 75 255 1.5
## 6 6 14999 70 71 255 2.1
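Beyond head(), a couple of quick base R checks (not part of the original chunk) can confirm the file was parsed as expected:
#Optional integrity checks: column types and data dimensions
str(mydata)   # name and type of each column
dim(mydata)   # number of rows and columns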
Next, apply the cor() function to the entire data set. This is a quick way to compare the correlations between all variables.
#Correlation matrix of all variable columns in the data
corr = cor(mydata)
corr
## case_number sales radio paper tv
## case_number 1.0000000 0.2402344 0.23586825 -0.36838393 0.22282482
## sales 0.2402344 1.0000000 0.97713807 -0.28306828 0.95797025
## radio 0.2358682 0.9771381 1.00000000 -0.23835848 0.96609579
## paper -0.3683839 -0.2830683 -0.23835848 1.00000000 -0.24587896
## tv 0.2228248 0.9579703 0.96609579 -0.24587896 1.00000000
## pos 0.0539763 0.0126486 0.06040209 -0.09006241 -0.03602314
## pos
## case_number 0.05397630
## sales 0.01264860
## radio 0.06040209
## paper -0.09006241
## tv -0.03602314
## pos 1.00000000
# Correlation matrix of columns 2,4, and 6
corr = cor( mydata[c(2,4,6) ] )
corr
## sales paper pos
## sales 1.0000000 -0.28306828 0.01264860
## paper -0.2830683 1.00000000 -0.09006241
## pos 0.0126486 -0.09006241 1.00000000
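The same sub-matrix can also be selected by column name rather than position, which is less error-prone if the column order ever changes (an optional alternative, not in the original lab):
#Equivalent selection of the sales, paper, and pos columns by name
cor( mydata[ c("sales", "paper", "pos") ] )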
#Correlation matrix of all columns except the first column. This is convenient since case_number is only an indicator for the month and should be excluded from the calculations.
corr = cor( mydata[ 2:6 ] )
corr
## sales radio paper tv pos
## sales 1.0000000 0.97713807 -0.28306828 0.95797025 0.01264860
## radio 0.9771381 1.00000000 -0.23835848 0.96609579 0.06040209
## paper -0.2830683 -0.23835848 1.00000000 -0.24587896 -0.09006241
## tv 0.9579703 0.96609579 -0.24587896 1.00000000 -0.03602314
## pos 0.0126486 0.06040209 -0.09006241 -0.03602314 1.00000000
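For easier reading, the matrix can be rounded to a couple of decimal places (an optional step using base R's round()):
#Round the correlation matrix to two decimals for readability
round(corr, 2)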
The above matrix is square (# columns = # rows) and symmetric (the lower triangle mirrors the upper triangle), and all diagonal values are equal to one.
##### 1A) Explain why the value of “1.0” appears along the diagonal
Each diagonal entry is 1.0 because every variable is perfectly correlated with itself.
##### 1B) Select two particular entries from the matrix to demonstrate how it is symmetric with respect to the diagonal
Sales to radio = 0.9771381 and radio to sales = 0.9771381: the two mirrored entries are identical, demonstrating symmetry. (The full matrix prints one of them as 0.97713807; this is the same value at a different printed precision.)
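The symmetry can also be verified in code (a small sketch, not required by the lab):
#The mirrored entries are equal, and the matrix equals its own transpose
corr["sales", "radio"]
corr["radio", "sales"]
all.equal(corr, t(corr))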
Next we will create a visual diagram of the correlation matrix, called a corrgram, where correlation strength is represented by color intensity. To do this we first need to install and load two packages in RStudio, as done by the command lines below.
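The package-loading chunk itself is not displayed above; a minimal version would look like the lines below (the install.packages() calls are needed only once and are commented out here):
#Install (once) and load the packages needed for the corrgram and corrplot
#install.packages("corrplot")
#install.packages("corrgram")
library(corrplot)
library(corrgram)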
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
## Warning: package 'corrgram' was built under R version 3.6.3
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
# Generates a corrgram of last computed correlation matrix
corrgram(corr)
# Generates a corrplot, similar to a corrgram, but with a different visual display
corrplot(corr)
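corrplot() also accepts a method argument that changes how the correlations are drawn; for example, method = "number" prints the values themselves (an optional variant, not part of the original lab):
#Display the correlations as numeric values instead of circles
corrplot(corr, method = "number")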
From the matrix, it’s clear that sales, radio, and tv have the strongest correlations. Let’s now create a few scatterplots to visualize the data and trend lines. First we need to extract the columns from the data file.
#Extract all variables
pos = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command
scatter.smooth(radio,sales)
From this plot, the points appear to be scattered in a roughly linear pattern, so we will try to fit a simple linear regression model to the data.
The lm() linear modeling function is a very useful one. The function is set up as lm(y ~ x), where the independent variable x is used to predict values of the dependent variable y.
In the simple linear regression model below, we are using radio ads to predict sales. We print out a summary to view some quantitative facts about the linear model.
#Simple Linear Regression - the R function lm()
reg <- lm(sales ~ radio)
#Summary of Model
summary(reg)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732.85 -198.88 62.64 415.26 637.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9741.92 1362.94 -7.148 1.17e-06 ***
## radio 347.69 17.83 19.499 1.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
## F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
As indicated by the summary report, the intercept is -9741.92 and the slope for radio is 347.69. We can therefore write the following equation for the linear regression model predicting sales from radio: Sales_predicted = -9741.92 + 347.69 * x_radio. Given this equation we can predict the value of sales for any given value of radio, for example 75 (investing $75,000 in radio ads). Always watch for units!
### Linear model ( y = b + m*x )
### Sales_predicted ~ Radio_X1
sales_predicted = -9741.92 + 347.69 * (75)
sales_predicted
## [1] 16334.83
### A more elegant way to write the equation is to refer to the coefficients of the equation instead of typing the actual values out.
### This is demonstrated below.
sales_predicted = coef(reg)[1] + coef(reg)[2] * 75
coef(reg)[1] # intercept
## (Intercept)
## -9741.921
coef(reg)[2] # slope
## radio
## 347.6888
sales_predicted
## (Intercept)
## 16334.74
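Alternatively, base R's predict() function applies the fitted model to new inputs directly; this is equivalent to the coefficient arithmetic above (the new data frame must use the same variable name, radio, that appears in the model formula):
#Equivalent prediction using predict() and a new data frame
predict(reg, newdata = data.frame(radio = 75))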
A high R-squared value indicates that the model is a good fit, though not a perfect one. For the case of sales versus radio we will overlay the trend line representing the regression equation on the original plot. This shows how far the predicted values are from the actual values. The difference between the actual sales (circles) and the predicted sales (solid line) is captured in the residual error calculations reported earlier by the summary function. For the purposes of this exercise and the class we will focus mainly on Multiple R-squared and Adjusted R-squared.
#Plot Radio and Sales
plot(radio,sales)
#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2)
Note that the trend line shown here is the one corresponding to the regression equation. It is different from the smooth curved line used earlier in the scatter plot.
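As a sanity check (not part of the original lab), the residual standard error of 571.6 reported by summary() can be recomputed directly from the residuals:
#Recompute the residual standard error: sqrt(residual sum of squares / residual degrees of freedom)
sqrt( sum( residuals(reg)^2 ) / df.residual(reg) )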
##### 1C) Repeat the above calculations to derive and plot the linear regression model for Sales versus TV
lm_1 = lm(sales~tv)
summary(lm_1)
##
## Call:
## lm(formula = sales ~ tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1921.87 -412.24 7.02 581.59 1081.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42229.21 4164.12 -10.14 7.19e-09 ***
## tv 221.10 15.61 14.17 3.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 771.3 on 18 degrees of freedom
## Multiple R-squared: 0.9177, Adjusted R-squared: 0.9131
## F-statistic: 200.7 on 1 and 18 DF, p-value: 3.336e-11
plot(tv,sales)
abline(lm_1, col="red", lwd=2)
##### 1D) Write down the mathematical representation for the linear regression model. Identify the values for the intercept and the slope
Equation predicting sales based on tv: sales_predicted = -42229.21 + 221.10 * x_tv. Intercept = -42229.21; slope = 221.10.
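As with the radio model, the fitted coefficients can be used to predict sales for a chosen tv value; the sketch below uses tv = 270 purely as an illustration:
#Predicted sales for tv = 270, using the coefficients of lm_1
sales_predicted_tv = coef(lm_1)[1] + coef(lm_1)[2] * 270
sales_predicted_tv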
Often, more than one factor or variable affects the prediction of an outcome. While increased rainfall is a good predictor of increased crop supply, a decrease in herbivores can also lead to an increase in crops. This idea is a loose analogy for multiple linear regression.
In R, multiple linear regression takes the form of lm(y ~ x0 + x1 + x2 + ... ), where y is the value that is being predicted, or the dependent variable, and the x variables are the predictors or the independent variables.
Let’s create a multiple linear regression model predicting sales using the two independent variables radio and tv.
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
#Summary of Multiple Linear Regression Model
summary(mlr1)
##
## Call:
## lm(formula = sales ~ radio + tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1729.58 -205.97 56.95 335.15 759.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17150.46 6965.59 -2.462 0.024791 *
## radio 275.69 68.73 4.011 0.000905 ***
## tv 48.34 44.58 1.084 0.293351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
## F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
Note the values of 0.9577 for Multiple R-squared and 0.9527 for Adjusted R-squared. The predicted sales can again be calculated from the coefficients of the regression model. The example below calculates the predicted sales for radio = 75 and tv = 270.
# predicted sales from mlr1 for radio = 75 and tv = 270
sales_predicted = coef(mlr1)[1] + coef(mlr1)[2]*(75) + coef(mlr1)[3]*(270)
sales_predicted
## (Intercept)
## 16578.3
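The same value can be obtained with predict(), supplying both inputs in one new data frame (an equivalent alternative to typing out the coefficients):
#Equivalent prediction with predict() on the multiple regression model
predict(mlr1, newdata = data.frame(radio = 75, tv = 270))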
##### 2A) Compare the calculated predicted sales, as obtained from the above model, to the actual sales as reported in the data file, by calculating the squared error
Sales_observed_1= 16876
# Now I can calculate the error (the difference between the fitted and observed values)
error_1 = Sales_observed_1 - sales_predicted
error_1
## (Intercept)
## 297.7018
# now I can calculate error squared
error_squared_1 = error_1^2
error_squared_1
## (Intercept)
## 88626.34
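Rather than hard-coding the observed value, it can also be looked up from the data (a small sketch; it assumes the combination radio = 75 and tv = 270 identifies a single row, as it does here with case 4):
#Look up the observed sales for the row where radio = 75 and tv = 270
mydata$sales[ mydata$radio == 75 & mydata$tv == 270 ]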
##### 2B) Fill in the code chunk below to create a multiple linear regression model for each of the two specified use cases, and display the summary statistics. Note that each model is assigned to a different variable for later referencing (4pts)
mlr2=lm(sales~radio+tv+pos)
summary(mlr2)
##
## Call:
## lm(formula = sales ~ radio + tv + pos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1748.20 -187.42 -61.14 352.07 734.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15491.23 7697.08 -2.013 0.06130 .
## radio 291.36 75.48 3.860 0.00139 **
## tv 38.26 48.90 0.782 0.44538
## pos -107.62 191.25 -0.563 0.58142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
## F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
#mlr2 = Sales predicted by radio, tv, and pos
#Summary of Multiple Linear Regression Model
mlr3=lm(sales~radio+tv+pos+paper)
summary(mlr3)
##
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1558.13 -239.35 7.25 387.02 728.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13801.015 7865.017 -1.755 0.09970 .
## radio 294.224 75.442 3.900 0.00142 **
## tv 33.369 49.080 0.680 0.50693
## pos -128.875 192.156 -0.671 0.51262
## paper -9.159 8.991 -1.019 0.32449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
## F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
#mlr3 = Sales predicted by radio, tv, pos, and paper
#Summary of Multiple Linear Regression Model
##### 2C) Write down, on separate lines, the mathematical representation for the three regression models mlr1, mlr2, and mlr3 . For each case note the corresponding values for R-squared and Adjusted R-squared
mlr1: Sales_predicted1 = -17150.46 + 275.69*x_radio + 48.34*x_tv ; Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
mlr2: Sales_predicted2 = -15491.23 + 291.36*x_radio + 38.26*x_tv - 107.62*x_pos ; Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
mlr3: Sales_predicted3 = -13801.015 + 294.224*x_radio + 33.369*x_tv - 128.875*x_pos - 9.159*x_paper ; Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
##### 2D) Based solely on the values of R-squared and Adjusted R-squared, which of the three multiple linear regression models mlr1, mlr2, mlr3, is best in predicting sales, and which is best in fitting sales. Explain your reasoning
The best predicting model is mlr1, because it has the highest Adjusted R-squared (0.9527); Adjusted R-squared penalizes added predictors, so the extra pos and paper terms in mlr2 and mlr3 do not earn their keep. The best fitting model is mlr3, because it has the highest Multiple R-squared (0.9612), which measures how closely the model matches the 20 observations it was trained on.
##### 2E) Calculate the sales value for each of the above three models by substituting the applicable values for Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75. Compare your calculated sales to the actual value as obtained from the data file. Which model is best at fitting the actual sales value? Quantify your answer by calculating the error squared for each case. Compare your findings to your answer from 2D)
-17150.46 + 275.69*(69) + 48.34*(255) = 14198.85
-15491.23 + 291.36*(69) + 38.26*(255) - 107.62*(1.5) = 14207.48
-13801.015 + 294.224*(69) + 33.369*(255) - 128.875*(1.5) - 9.159*(75) = 14129.2985
# we will capture the observed / actual value of sales when radio = 69, tv = 255, pos = 1.5, and paper = 75 from the actual data in mydata
sales_observed_mlr = 13965
# now we will calculate the fitted / predicted sales value from model mlr1; here we only need to plug in the values of radio = 69, tv = 255
#I will use the coef() function to call the values of the intercept and slopes. You can do that or write the values manually
sales_predicted_mlr1 = coef(mlr1)[1]+coef(mlr1)[2]*(69)+coef(mlr1)[3]*(255)
sales_predicted_mlr1
## (Intercept)
## 14199.05
# now we will calculate the error squared (the difference between observed and predicted, squared)
error_squared_mlr1 = (sales_observed_mlr- sales_predicted_mlr1)^2
error_squared_mlr1
## (Intercept)
## 54777.4
# now we will calculate the fitted / predicted sales value from model mlr2; here we only need to plug in the values of radio = 69, tv = 255, pos = 1.5
sales_predicted_mlr2 = coef(mlr2)[1]+coef(mlr2)[2]*(69)+coef(mlr2)[3]*(255)+coef(mlr2)[4]*(1.5)
sales_predicted_mlr2
## (Intercept)
## 14208.44
# now we will calculate the error squared (the difference between observed and predicted, squared)
error_squared_mlr2 = (sales_observed_mlr- sales_predicted_mlr2)^2
error_squared_mlr2
## (Intercept)
## 59262.56
# now we will calculate the fitted / predicted sales value from model mlr3; here we need to plug in the values of radio = 69, tv = 255, pos = 1.5, and paper = 75
sales_predicted_mlr3 = coef(mlr3)[1]+coef(mlr3)[2]*(69)+coef(mlr3)[3]*(255)+coef(mlr3)[4]*(1.5)+coef(mlr3)[5]*(75)
sales_predicted_mlr3
## (Intercept)
## 14129.32
# now we will calculate the error squared (the difference between observed and predicted, squared)
error_squared_mlr3 = (sales_observed_mlr- sales_predicted_mlr3)^2
error_squared_mlr3
## (Intercept)
## 27002.18
# The squared error for mlr3 is less than that of mlr1, but this alone does not make it a better model. mlr3 is the better fitting model, but only for the 20-point data set we have. mlr1, with the higher Adjusted R-squared, is the better predictive model overall and can be expected to predict sales on new data better than mlr3.
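As a closing sketch (not part of the lab template), all three predictions and squared errors can be computed in one pass with predict(); each model uses only the columns named in its own formula, so one data frame with all four inputs serves all three models:
#Compare all three models at once for radio = 69, tv = 255, pos = 1.5, paper = 75
newpoint = data.frame(radio = 69, tv = 255, pos = 1.5, paper = 75)
models = list(mlr1 = mlr1, mlr2 = mlr2, mlr3 = mlr3)
predictions = sapply(models, predict, newdata = newpoint)
predictions
#Squared error of each model against the observed sales value
(sales_observed_mlr - predictions)^2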