In a given year, more rain may lead to higher crop production, because more water can mean more plants. This is a direct relationship: the amount of fruit produced may be predictable from the amount of rainfall in a given year. This example illustrates simple linear regression, a useful technique that lets us predict the values of one variable from another variable.
This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.
Make sure to download the zip folder titled ‘bsad_lab5’ and extract it. Next, we must set this folder as the working directory. To do this, open RStudio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, read in the marketing data that was used in the previous lab. It can be found in the downloaded folder bsad_lab5. Make sure to view the file to ensure it was read in correctly.
#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
Next, apply the cor() function to the data. It returns a matrix of the correlations between every pair of variables, which makes them easy to compare at once.
Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations? Answer and list the pairs below the matrix.
#Correlation matrix of all variables
corr = cor(mydata)
corr
            case_number      sales       radio       paper          tv         pos
case_number   1.0000000  0.2402344  0.23586825 -0.36838393  0.22282482  0.05397630
sales         0.2402344  1.0000000  0.97713807 -0.28306828  0.95797025  0.01264860
radio         0.2358682  0.9771381  1.00000000 -0.23835848  0.96609579  0.06040209
paper        -0.3683839 -0.2830683 -0.23835848  1.00000000 -0.24587896 -0.09006241
tv            0.2228248  0.9579703  0.96609579 -0.24587896  1.00000000 -0.03602314
pos           0.0539763  0.0126486  0.06040209 -0.09006241 -0.03602314  1.00000000
#install.packages("corrplot")
#install.packages("corrgram")
#Load the plotting packages (run the install lines above once if they are missing)
library(corrgram)
library(corrplot)
corrgram(corr)
corrplot(corr)
plot(corr)
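When a correlation matrix has many variables, it can help to rank the pairs instead of scanning the matrix by eye. The following is a minimal sketch on made-up data: the data frame toy and its columns a, b, and c are hypothetical and are not part of the marketing data. It keeps each pair once and sorts the pairs by absolute correlation.

```r
# Hypothetical data: b rises almost linearly with a, c does not
toy <- data.frame(
  a = 1:8,
  b = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1),
  c = c(5, 1, 4, 2, 6, 3, 7, 2)
)

cm <- cor(toy)

# Blank the diagonal and the lower triangle so each pair appears once
cm[lower.tri(cm, diag = TRUE)] <- NA

# Turn the matrix into one row per pair and drop the blanked cells
pairs_tbl <- as.data.frame(as.table(cm))
pairs_tbl <- pairs_tbl[!is.na(pairs_tbl$Freq), ]
names(pairs_tbl) <- c("var1", "var2", "correlation")

# Strongest relationships first, whether positive or negative
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$correlation)), ]
pairs_tbl
```

Applied to the lab data, the same steps would start from corr instead of cor(toy).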
From the matrix, it's clear that sales and radio have the strongest correlation (about 0.977). So, create a scatterplot of the two to visualize the data. Make sure to extract the columns first.
#Extract all variables
pos = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command from Worksheet 4
scatter.smooth(radio,sales)
From this plot, it seems the points are scattered in an almost linear way. So, we will try to fit a simple linear regression model to the graph.
The lm() function fits a linear model. It is set up as lm(y ~ x), where x, the independent variable, is used to predict values of y, the dependent variable.
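To make the y ~ x formula concrete, here is a minimal sketch on made-up numbers. The vectors x and y and the name toy_fit are hypothetical. Because y is constructed as exactly 3 + 2 * x, the fitted intercept and slope are easy to check.

```r
# Hypothetical data: y is exactly 3 + 2 * x
x <- c(1, 2, 3, 4, 5)
y <- 3 + 2 * x

# y is the dependent variable, x is the independent variable
toy_fit <- lm(y ~ x)

# The first coefficient is the intercept, the second is the slope for x
coef(toy_fit)
```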
In the regression below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model.
#Simple Linear Regression - the R function lm()
reg <- lm(sales ~ radio)
#Summary of Model
summary(reg)
Call:
lm(formula = sales ~ radio)
Residuals:
     Min       1Q   Median       3Q      Max
-1732.85  -198.88    62.64   415.26   637.70
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -9741.92    1362.94  -7.148 1.17e-06 ***
radio         347.69      17.83  19.499 1.49e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 571.6 on 18 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
### Linear model ( Y = b + m*x )
### Sales_predicted ~ Radio_X1
#Predicted sales when radio = 75, using the coefficients from the summary above
sales_predicted = -9741.92 + 347.69 * (75)
sales_predicted
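Typing rounded coefficients by hand works, but R can also compute the prediction from the fitted model itself. The sketch below uses made-up data (x, y, and toy_fit are hypothetical names) to show that predict() gives the same answer as the intercept-plus-slope calculation.

```r
# Hypothetical data with a roughly linear trend
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 44, 50)
toy_fit <- lm(y ~ x)

# Prediction by hand: intercept + slope * new x value
b <- coef(toy_fit)
by_hand <- b[["(Intercept)"]] + b[["x"]] * 35

# Same prediction using predict(); the column name must match the predictor
by_predict <- predict(toy_fit, newdata = data.frame(x = 35))

all.equal(by_hand, unname(by_predict))
```

With the lab's model, the analogous call would be predict(reg, newdata = data.frame(radio = 75)). Because predict() uses the unrounded coefficients, its result may differ slightly from the hand calculation.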
Report and interpret the R-Squared value below.
Because the R-Squared value is so high, the model is a good fit, though not a perfect one. We will overlay a trend line on the original plot. This will show how far the predictions are from the actual values. The difference between an actual value and its predicted value is the residual.
#Plot Radio and Sales
plot(radio,sales)
#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2)
List some observations from this plot.
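The residuals and the R-Squared value are closely linked, which the following sketch tries to show on made-up data (x, y, and toy_fit are hypothetical names). It computes the residuals as actual minus predicted, then rebuilds R-Squared as one minus the ratio of the residual sum of squares to the total sum of squares.

```r
# Hypothetical data with a roughly linear trend
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 44, 50)
toy_fit <- lm(y ~ x)

# Residual = actual value minus the value predicted by the model
res <- y - fitted(toy_fit)
all.equal(unname(res), unname(residuals(toy_fit)))

# R-Squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(res^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(r2, summary(toy_fit)$r.squared)
```

The smaller the residuals are relative to the overall spread of y, the closer R-Squared gets to 1.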
#Simple Linear Regression - the R function lm()
reg_tv <- lm(sales ~ tv)
#Summary of Model
summary(reg_tv)
Call:
lm(formula = sales ~ tv)
Residuals:
     Min       1Q   Median       3Q      Max
-1921.87  -412.24     7.02   581.59  1081.61
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42229.21    4164.12  -10.14 7.19e-09 ***
tv             221.10      15.61   14.17 3.34e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 771.3 on 18 degrees of freedom
Multiple R-squared: 0.9177, Adjusted R-squared: 0.9131
F-statistic: 200.7 on 1 and 18 DF, p-value: 3.336e-11
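#Predicted sales when tv = 255, using the coefficients from the summary above
sales_predicted = -42229.21 + 221.10 * (255)
sales_predicted
[1] 14151.29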
#Plot TV and Sales
plot(tv,sales)
#Add a trend line plot using the linear model we created above
abline(reg_tv, col="blue",lwd=2)
Sometimes, one variable is very good at predicting another variable. But most of the time, more than one factor affects the variable being predicted. While increased rainfall is a good predictor of increased crop supply, fewer herbivores can also result in an increase in crops. This idea is a loose metaphor for multiple linear regression.
In R, multiple linear regression takes the form lm(y ~ x0 + x1 + x2 + ... ), where y is the value being predicted, or the dependent variable, and the x variables are the predictors, or the independent variables.
Let's create a multiple linear regression model predicting sales using radio and tv.
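As a small illustration of the multi-predictor formula, the sketch below fits a model with two predictors on made-up data. The data frame toy, its columns, and the names toy_mlr and toy_pred are hypothetical. It also uses the data argument of lm(), an alternative to extracting each column into its own vector first.

```r
# Hypothetical data frame; y is built as exactly 1 + 2 * x1 + 0.5 * x2
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 5)
)
toy$y <- 1 + 2 * toy$x1 + 0.5 * toy$x2

# The data argument tells lm() where to look up the column names
toy_mlr <- lm(y ~ x1 + x2, data = toy)
coef(toy_mlr)   # intercept, then one slope per predictor

# Predicting needs one column per predictor in newdata
toy_pred <- predict(toy_mlr, newdata = data.frame(x1 = 7, x2 = 3))
toy_pred        # close to 1 + 2 * 7 + 0.5 * 3 = 16.5
```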
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
#Summary of Multiple Linear Regression Model
summary(mlr1)
Call:
lm(formula = sales ~ radio + tv)
Residuals:
     Min       1Q   Median       3Q      Max
-1729.58  -205.97    56.95   335.15   759.26
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -17150.46    6965.59  -2.462 0.024791 *
radio          275.69      68.73   4.011 0.000905 ***
tv              48.34      44.58   1.084 0.293351
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 568.9 on 17 degrees of freedom
Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
# sales_predicted = radio + tv
sales_predicted = -17150.46 + 275.69*(radio) + 48.34*(tv)
sales_predicted = -17150.46 + 275.69*(75) + 48.34*(270)
sales_predicted
[1] 16578.09
For mlr1, the R-Squared value is 0.9577 and the Adj R-Squared is 0.9527.
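The two values differ because Adj R-Squared penalizes the model for each predictor it uses. A minimal sketch of the standard formula follows; the helper adj_r2 is a hypothetical name, and n = 20 is an inference from the summary output (17 residual degrees of freedom plus 3 estimated coefficients), not a number stated in the lab.

```r
# Adjusted R-Squared from R-Squared, n observations, and p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# mlr1: R-Squared of 0.9577 with 2 predictors (radio and tv)
# n = 20 is assumed from the degrees of freedom in the summary
mlr1_adj <- round(adj_r2(0.9577, n = 20, p = 2), 4)
mlr1_adj   # 0.9527, matching the summary output
```

Adding a predictor raises p, so Adj R-Squared only improves when the gain in R-Squared outweighs that penalty.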
Create a Multiple Linear Regression Model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared:
#mlr2 = Sales predicted by radio, tv, and pos
#Multiple Linear Regression Model
mlr2 <- lm(sales ~ radio + tv + pos)
#mlr3 = Sales predicted by radio, tv, pos, and paper
#Multiple Linear Regression Model
mlr3 <- lm(sales ~ radio + tv + pos + paper)
#Summary of each Multiple Linear Regression Model
summary(mlr2)
Call:
lm(formula = sales ~ radio + tv + pos)
Residuals:
     Min       1Q   Median       3Q      Max
-1748.20  -187.42   -61.14   352.07   734.20
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -15491.23    7697.08  -2.013  0.06130 .
radio          291.36      75.48   3.860  0.00139 **
tv              38.26      48.90   0.782  0.44538
pos           -107.62     191.25  -0.563  0.58142
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 580.7 on 16 degrees of freedom
Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
summary(mlr3)
Call:
lm(formula = sales ~ radio + tv + pos + paper)
Residuals:
     Min       1Q   Median       3Q      Max
-1558.13  -239.35     7.25   387.02   728.02
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -13801.015   7865.017  -1.755  0.09970 .
radio          294.224     75.442   3.900  0.00142 **
tv              33.369     49.080   0.680  0.50693
pos           -128.875    192.156  -0.671  0.51262
paper           -9.159      8.991  -1.019  0.32449
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 580 on 15 degrees of freedom
Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
Based purely on the values for R-Squared and Adj R-Squared, which linear regression model is best at predicting sales? Explain why.
After deciding which model predicts sales best, we will confirm that it truly is the best model by predicting sales from given values of the independent variables.
Given that Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75, calculate the predicted sales value for each of the three models above.
To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.