First, read in the marketing data that was used in the previous lab. It can be found in the downloaded folder bsad_lab5. Make sure to view the file to ensure it was read in correctly.
mydata = read.csv(file="data/marketing.csv")
head(mydata)
## case_number sales radio paper tv pos
## 1 1 11125 65 89 250 1.3
## 2 2 16121 73 55 260 1.6
## 3 3 16440 74 58 270 1.7
## 4 4 16876 75 82 270 1.3
## 5 5 13965 69 75 255 1.5
## 6 6 14999 70 71 255 2.1
Next, apply the cor() function to the data to understand the correlations between variables. This is a great way to compare the correlations between all variables.
Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations? Answer and list the pairs below the matrix. >>>The value “1.0” is down the diagonal because the correlation is calculating the value of one varaible against itself at a time. We are finding the correlation.
cor(mydata)
## case_number sales radio paper tv
## case_number 1.0000000 0.2402344 0.23586825 -0.36838393 0.22282482
## sales 0.2402344 1.0000000 0.97713807 -0.28306828 0.95797025
## radio 0.2358682 0.9771381 1.00000000 -0.23835848 0.96609579
## paper -0.3683839 -0.2830683 -0.23835848 1.00000000 -0.24587896
## tv 0.2228248 0.9579703 0.96609579 -0.24587896 1.00000000
## pos 0.0539763 0.0126486 0.06040209 -0.09006241 -0.03602314
## pos
## case_number 0.05397630
## sales 0.01264860
## radio 0.06040209
## paper -0.09006241
## tv -0.03602314
## pos 1.00000000
From the matrix, its clear that Sales and Radio have the strongest correlation. So, create a scatterplot between the two to visualize the data. Make sure to extract the columns.
pos = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
scatter.smooth(sales,radio)
From this plot, it seems the points are scattered in an almost linear way. So, we will try to fit a simple linear regression model to the graph.
The lm() function is a very useful one. The function is set up as lm(y~x) where the x variable, the independent variable, predicts values of the y variable, or the dependent variable.
In the regression below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model.
reg <- lm(sales ~ radio)
summary(reg)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732.85 -198.88 62.64 415.26 637.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9741.92 1362.94 -7.148 1.17e-06 ***
## radio 347.69 17.83 19.499 1.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
## F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
Report and interpret the R-Squared value below. >>>The R-squared value is high, and the plot points are close to the line, and few even tough the line. This indicates a strong correlation, but not a perfect correlation.
Because the R-Squared value is so high, it indicates that the model is a good fit, but not perfect. We will overlay a trend plot over the original plot we had. This will show how far the predictions are from the actual value. The distance from the actual versus the predicted is the residual.
plot(radio,sales)
abline(reg,col="blue",lwd=2)
List some observations from this plot. >>>The plot is positive, increasing uphill from left to right. The points are close to the line, and few even rest on the line.
Sometimes, one variable is very good at predicting another variable. But most times, there are more than one factors that affect the prediction of another variable. While increased rainfall is a good predictor of increased crop supply, decreased herbivores can also result in an increase of crops. This idea is a loose metaphor for multiple linear regression.
In R, multiple linear regression takes the form of lm(y ~ x0 + x1 + x2 + ... ), where y is the value that is being predicted, or the dependent value and the x variables are the predictors or the independent values.
Lets create a multiple linear regression predicting sales using radio and tv.
mlr1 <-lm(sales ~ radio + tv)
summary(mlr1)
##
## Call:
## lm(formula = sales ~ radio + tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1729.58 -205.97 56.95 335.15 759.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17150.46 6965.59 -2.462 0.024791 *
## radio 275.69 68.73 4.011 0.000905 ***
## tv 48.34 44.58 1.084 0.293351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
## F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
predict.sales = -17150.46 + 275.69*(75) + 48.34*(270)
For mlr1, the R-Squared values is 0.9577 and the Adj R-Squared is 0.9527.
Create a Multiple Linear Regression Model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared:
#mlr2 = Sales predicted by radio, tv, and pos
lm1 = lm(sales ~ radio + tv + pos)
lm1
##
## Call:
## lm(formula = sales ~ radio + tv + pos)
##
## Coefficients:
## (Intercept) radio tv pos
## -15491.23 291.36 38.26 -107.62
summary(lm1)
##
## Call:
## lm(formula = sales ~ radio + tv + pos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1748.20 -187.42 -61.14 352.07 734.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15491.23 7697.08 -2.013 0.06130 .
## radio 291.36 75.48 3.860 0.00139 **
## tv 38.26 48.90 0.782 0.44538
## pos -107.62 191.25 -0.563 0.58142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
## F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
lm2 = lm(sales ~ radio + tv + pos + paper)
lm2
##
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
##
## Coefficients:
## (Intercept) radio tv pos paper
## -13801.015 294.224 33.369 -128.875 -9.159
summary(lm2)
##
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1558.13 -239.35 7.25 387.02 728.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13801.015 7865.017 -1.755 0.09970 .
## radio 294.224 75.442 3.900 0.00142 **
## tv 33.369 49.080 0.680 0.50693
## pos -128.875 192.156 -0.671 0.51262
## paper -9.159 8.991 -1.019 0.32449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
## F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
Based purely on the values for R-Squared and Adj R-Squared, which linear regression model is best in predicting sales. Explain why. >>>summar(mlr1) because the adjusted r-squared is higher.
#Linear Regression
sales_predicted = -13801.015 + 294.224*(radio) + 33.369*(tv) + -128.875*(pos) + -9.159*(paper)
sales_predicted = -13801.015 + 294.224*(69) + 33.369*(255) + -128.875*(1.5) + -9.159*(75)
sales_predicted
## [1] 14129.3
After deciding which model predicts sales best, we will confirm the it truly is the best model by predicting sales given independent variables.
Given that Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75, calculate the predicted sales value for each of the three models above.
To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.