In a given year, if it rains more, we may see that there might be an increase in crop production. This is because more water may lead to more plants. This is a direct relationship; the number of fruits may be able to be predicted by amount of waterfall in a certain year. This example represents simple linear regression, which is an extremely useful concept that allows us to predict values of a certain variable based off another variable.
This lab will explore the concepts of simple linear regression, multiple linear regression, and watson analytics.
Make sure to download the folder titled ‘bsad_lab5’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, read in the marketing data that was used in the previous lab. It can be found in the downloaded folder bsad_lab5. Make sure to view the file to ensure it was read in correctly.
#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
## case_number sales radio paper tv pos
## 1 1 11125 65 89 250 1.3
## 2 2 16121 73 55 260 1.6
## 3 3 16440 74 58 270 1.7
## 4 4 16876 75 82 270 1.3
## 5 5 13965 69 75 255 1.5
## 6 6 14999 70 71 255 2.1
Next the cor()
function is used to demonstrate correlations between variables.
#Correlation Matrix
cor(mydata)
## case_number sales radio paper tv
## case_number 1.0000000 0.2402344 0.23586825 -0.36838393 0.22282482
## sales 0.2402344 1.0000000 0.97713807 -0.28306828 0.95797025
## radio 0.2358682 0.9771381 1.00000000 -0.23835848 0.96609579
## paper -0.3683839 -0.2830683 -0.23835848 1.00000000 -0.24587896
## tv 0.2228248 0.9579703 0.96609579 -0.24587896 1.00000000
## pos 0.0539763 0.0126486 0.06040209 -0.09006241 -0.03602314
## pos
## case_number 0.05397630
## sales 0.01264860
## radio 0.06040209
## paper -0.09006241
## tv -0.03602314
## pos 1.00000000
The ‘1.0’ valus is shown diagonal because it is correlation with the variable itself so it will always equal itself- ‘1.0’
The two variables with the highest correlation are ‘Sales’ and ‘Radio’ as it is the closest equal to ‘1’. Sales and Radio have the second highest correlation.
So, I will create a scatterplot between the two to visualize the data.
#Extract all variables
pos = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command from Worksheet 4
plot(sales, radio)
The points are scattered in a linear fashion in the scatter plot.
The lm()
function can be very helpful in creating a linear regression model.
In the regression below, we are using radio ads to predict sales. We will then create a summary to view the quantitative facts about the linear model.
#Simple Linear Regression
reg <- lm(sales ~ radio)
#Summary of Model
summary(reg)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732.85 -198.88 62.64 415.26 637.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9741.92 1362.94 -7.148 1.17e-06 ***
## radio 347.69 17.83 19.499 1.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
## F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
Report and interpret the R-Squared value below.
Since the r-squared value is very high then this means that the model is suitable fit. A trend plot will be overlayed on the scatter plot in order to create a visualization of its accuracy.
#Plot Radio and Sales
plot(radio,sales)
#Add a trend line plot using the linear model we created above
abline(reg,col="green",lwd=2)
List some observations from this plot. This trend plot demonstrates that a majority of the points fall within very close range of the trend line, with only very few straying far from the line.
In this task set, we will complete examples of multiple linear regression models using the following function: ‘lm(y ~ x0 + x1 + x2 + … )’
Here, a multiple linear regression model is created predicting sales using radio and tv.
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
#Summary of Multiple Linear Regression Model
summary(mlr1)
##
## Call:
## lm(formula = sales ~ radio + tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1729.58 -205.97 56.95 335.15 759.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17150.46 6965.59 -2.462 0.024791 *
## radio 275.69 68.73 4.011 0.000905 ***
## tv 48.34 44.58 1.084 0.293351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
## F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
For mlr1, the R-Squared values is 0.9577 and the Adj R-Squared is 0.9527.
Create a Multiple Linear Regression Model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared:
#mlr2 = Sales predicted by radio, tv, and pos
mlr2 = lm(sales ~ radio +tv +pos)
#Summary of mlr2
summary(mlr2)
##
## Call:
## lm(formula = sales ~ radio + tv + pos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1748.20 -187.42 -61.14 352.07 734.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15491.23 7697.08 -2.013 0.06130 .
## radio 291.36 75.48 3.860 0.00139 **
## tv 38.26 48.90 0.782 0.44538
## pos -107.62 191.25 -0.563 0.58142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
## F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 = lm(sales ~ radio +tv +pos +paper)
#Summary of mlr3
summary(mlr3)
##
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1558.13 -239.35 7.25 387.02 728.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13801.015 7865.017 -1.755 0.09970 .
## radio 294.224 75.442 3.900 0.00142 **
## tv 33.369 49.080 0.680 0.50693
## pos -128.875 192.156 -0.671 0.51262
## paper -9.159 8.991 -1.019 0.32449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
## F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
Observations:
Mlr1 had the highest adjusted r-squared value, even though all mlr models had increasing r-squared values.
Therefore, mlr1 is the most accurate mode of these options to predict the dependent variable of sales.
Given that Radio = 69
, TV = 255
, POS = 1.5
, and Paper = 75
, calculate the predicted sales value for each of the three models above.
radiovalue = 69
tvvalue = 255
posvalue = 1.5
papervalue = 75
#mlr1
radiovalue + tvvalue
## [1] 324
#mlr2
radiovalue + tvvalue + posvalue
## [1] 325.5
#mlr3
radiovalue + tvvalue + posvalue + papervalue
## [1] 400.5
For this last task, I uploaded the ‘marketing.csv’ file to Watson Analytics. Here in the predictive module, it confirmed the findings we had come across in r- studio. Findings were consistent that sales are driven primarily by radio followed by tv.