First, we will read in the data and check that it loaded correctly.
#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
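Before looking at correlations, it is worth confirming that the file loaded as expected. A minimal check (assuming the columns match those used in the correlation matrix below):
#Check the structure of the data - column names, types, and number of rows
str(mydata)
#Quick numeric summaries of each column
summary(mydata)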
Next, we will apply the cor()
function to the data to understand the correlations between variables. This is a great way to compare the correlations between all variables within the dataset.
#Correlation matrix
corr = cor(mydata[2:6])
corr
sales radio paper tv pos
sales 1.0000000 0.97713807 -0.28306828 0.95797025 0.01264860
radio 0.9771381 1.00000000 -0.23835848 0.96609579 0.06040209
paper -0.2830683 -0.23835848 1.00000000 -0.24587896 -0.09006241
tv 0.9579703 0.96609579 -0.24587896 1.00000000 -0.03602314
pos 0.0126486 0.06040209 -0.09006241 -0.03602314 1.00000000
Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations? Answer and list the pairs below the matrix.
The value 1.0 runs down the diagonal because these are the correlations between each variable and itself (and a variable is always perfectly correlated with itself).
Strong correlations: radio and sales, tv and sales, and radio and tv are all strongly positive (roughly 0.96 to 0.98), so as radio or tv advertising goes up, sales go up. Paper has a weak negative correlation with sales, and pos has essentially no correlation with any of the other variables.
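As a quick cross-check of the pairs listed above, we can sort the correlations of every variable with sales using the corr matrix already computed (a small sketch, not part of the worksheet output):
#Correlations with sales, sorted from strongest positive to strongest negative
sort(corr["sales", ], decreasing = TRUE)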
Here we will install and load the packages needed to draw a corrplot and a corrgram.
install.packages("corrplot")
install.packages("corrgram")
library("corrplot")
library("corrgram")
corrplot(corr)
From the image above we can see that some correlations are strong and others are not:
- Strong: radio and sales; tv and radio
- Weak: pos and radio; paper and sales; paper and pos
- No correlation: pos and sales; pos and tv
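Since the corrgram package is also loaded, a similar picture can be drawn with corrgram(). This is an optional sketch of one common call; the panel choices here are assumptions, not part of the required output:
#Corrgram of the same five variables - shaded cells below the diagonal, pie charts above
corrgram(mydata[2:6], order = TRUE,
         lower.panel = panel.shade, upper.panel = panel.pie,
         text.panel = panel.txt)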
Below we will extract the variables and plot radio and sales, since that pair has the strongest correlation.
#Extract all variables
pos = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command from Worksheet 4
plot(radio,sales)
scatter.smooth(radio,sales)
The image above demonstrates the strong correlation between radio and sales, since the points all lie close to the smoothed trend line.
scatter.smooth(paper,sales)
The graph above shows a negative correlation; the line does not fall steadily, but it does dip down, and the points are scattered all over the place.
scatter.smooth(tv,sales)
In the graph above the values sit on top of each other. I assume there is not much variation in tv spending, so the amount spent does not change sales much; there are slight increases, and they are segmented by spending category.
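To check whether tv spending really falls into a few distinct levels, we can count the unique values (a quick sketch; the exact counts depend on the data):
#How many distinct tv spending values are there, and how often does each occur?
table(tv)
length(unique(tv))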
scatter.smooth(pos,sales)
Here the values are all over the place and there is no real trend; the smoothed line is misleading and does not tell us much.
scatter.smooth(sales,sales)
The above graph is perfectly linear because a variable compared with itself always matches up perfectly.
From this plot, it seems the points are scattered in an almost linear way. So, we will try to fit a simple linear regression model to the graph.
The lm() function is a very useful one. The function is set up as lm(y~x), where the x variable, the independent variable, predicts values of the y variable, or the dependent variable.
In the regression below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model.
#Simple Linear Regression - function lm(predicted variable~indep var)
reg <- lm(sales ~ radio)
#Summary of Model
summary(reg)
Call:
lm(formula = sales ~ radio)
Residuals:
Min 1Q Median 3Q Max
-1732.85 -198.88 62.64 415.26 637.70
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9741.92 1362.94 -7.148 1.17e-06 ***
radio 347.69 17.83 19.499 1.49e-13 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 571.6 on 18 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
The Estimate column gives the intercept and the radio coefficient; when writing the equation of the line, use these values. Also check the adjusted R-squared: as long as it stays high while we keep adding variables, the model remains a good and useful one.
#linear model (Y = a + mx)
#sales- predicted var ~ radio_X1
#Sales_predicted = -9741.92 + 347.69 * (radio)
Sales_predicted = -9741.92 + 347.69 * (75)
Sales_predicted
[1] 16334.83
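Instead of plugging the coefficients in by hand, the same prediction can be made with R's built-in predict() function (a minimal sketch, assuming reg is still the sales ~ radio model fit above):
#Same prediction as above, using predict() on the fitted model
predict(reg, newdata = data.frame(radio = 75))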
Report and interpret the R-Squared value below.
- Multiple R-squared: 0.9548
- Adjusted R-squared: 0.9523
- Our multiple R-squared and our adjusted R-squared are relatively close.
Because the R-Squared value is so high, it indicates that the model is a good fit, though not a perfect one. We will overlay a trend line on the original plot we made. This will show how far the predictions are from the actual values. The distance between the actual and the predicted value is the residual.
#Plot Radio and Sales
plot(radio,sales)
#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2)
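To make the residuals described above visible, we can draw a vertical segment from each fitted value up or down to the actual sales value (a small sketch; the red colour is just an assumed styling choice):
#Draw the residuals as vertical segments between the trend line and the actual points
segments(radio, fitted(reg), radio, sales, col = "red")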
The trend line above was drawn using the coefficients from the reg model summarized above.
Next, we follow the same steps as above, this time using tv to predict radio.
#Simple Linear Regression - function lm(predicted variable~indep var)
reg <- lm(radio ~ tv)
#Summary of Model
summary(reg)
Call:
lm(formula = radio ~ tv)
Residuals:
Min 1Q Median 3Q Max
-4.2306 -0.5473 -0.0973 1.0393 4.5028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -90.96701 10.53261 -8.637 8.10e-08 ***
tv 0.62666 0.03947 15.875 4.97e-12 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.951 on 18 degrees of freedom
Multiple R-squared: 0.9333, Adjusted R-squared: 0.9296
F-statistic: 252 on 1 and 18 DF, p-value: 4.971e-12
#radio_predicted = -90.96701 + 0.62666 * (tv)
radio_predicted = -90.96701 + 0.62666 * (250)
radio_predicted
[1] 65.69799
#Plot Radio and tv
plot(radio,tv)
#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2)
From the plot above we can conclude that radio and tv are positively correlated; the points follow a clear upward trend.
Sometimes, one variable is very good at predicting another variable. But most of the time, more than one factor affects the prediction of another variable. While increased rainfall is a good predictor of increased crop supply, a decrease in herbivores can also result in an increase of crops. This idea is a loose metaphor for multiple linear regression.
In R, multiple linear regression takes the form of lm(y ~ x0 + x1 + x2 + ... )
, where y is the value that is being predicted, or the dependent value and the x variables are the predictors or the independent values.
Let's create a multiple linear regression predicting sales using radio and tv.
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
#Summary of Multiple Linear Regression Model
summary(mlr1)
Call:
lm(formula = sales ~ radio + tv)
Residuals:
Min 1Q Median 3Q Max
-1729.58 -205.97 56.95 335.15 759.26
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17150.46 6965.59 -2.462 0.024791 *
radio 275.69 68.73 4.011 0.000905 ***
tv 48.34 44.58 1.084 0.293351
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 568.9 on 17 degrees of freedom
Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
#sales_predicted = -17150.46 + 275.69 * (radio) + 48.34 * (tv)
sales_predicted = -17150.46 + 275.69*(75) + 48.34*(270)
sales_predicted
[1] 16578.09
R-Squared: 0.9577, Adj R-Squared: 0.9527
For mlr1, the R-Squared value is 0.9577 and the Adj R-Squared is 0.9527.
Create a Multiple Linear Regression Model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared:
#mlr2 = Sales predicted by radio, tv, and pos
mlr2<- lm(sales ~ radio + tv + pos)
summary(mlr2)
Call:
lm(formula = sales ~ radio + tv + pos)
Residuals:
Min 1Q Median 3Q Max
-1748.20 -187.42 -61.14 352.07 734.20
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15491.23 7697.08 -2.013 0.06130 .
radio 291.36 75.48 3.860 0.00139 **
tv 38.26 48.90 0.782 0.44538
pos -107.62 191.25 -0.563 0.58142
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 580.7 on 16 degrees of freedom
Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
#sales_predicted = -15491.23 + 291.36 * (radio) + 38.26 * (tv) - 107.62 * (pos)
sales_predicted = -15491.23 + 291.36*(75) + 38.26*(270) - 107.62* (1.3)
sales_predicted
[1] 16551.06
R-Squared: 0.9585, Adj R-Squared: 0.9508
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3<- lm(sales ~ radio + tv + pos + paper)
summary(mlr3)
Call:
lm(formula = sales ~ radio + tv + pos + paper)
Residuals:
Min 1Q Median 3Q Max
-1558.13 -239.35 7.25 387.02 728.02
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13801.015 7865.017 -1.755 0.09970 .
radio 294.224 75.442 3.900 0.00142 **
tv 33.369 49.080 0.680 0.50693
pos -128.875 192.156 -0.671 0.51262
paper -9.159 8.991 -1.019 0.32449
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 580 on 15 degrees of freedom
Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
#sales_predicted = -13801.015 + 294.224 * (radio) + 33.369 * (tv) - 128.875*(pos) -9.159*(paper)
sales_predicted = -13801.015 + 294.224 * (75) + 33.369 * (270) - 128.875*(1.3) -9.159*(82)
sales_predicted
[1] 16356.84
R-Squared: 0.9612, Adj R-Squared: 0.9509
Based purely on the values for R-Squared and Adj R-Squared, which linear regression model is best at predicting sales? Explain why.
Based purely on the adjusted R-squared, the best linear regression model is mlr1, because it has the largest adjusted R-squared of the three (0.9527, versus 0.9508 for mlr2 and 0.9509 for mlr3). To confirm this, we will use each model to calculate predicted sales with the values below.
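A quick way to compare the three models side by side is to pull the adjusted R-squared out of each summary (a minimal sketch using the mlr1, mlr2, and mlr3 models fit above):
#Adjusted R-squared for each multiple regression model
sapply(list(mlr1 = mlr1, mlr2 = mlr2, mlr3 = mlr3),
       function(m) summary(m)$adj.r.squared)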
After deciding which model predicts sales best, we will confirm that it truly is the best model by predicting sales for given values of the independent variables.
Given that Radio = 69, TV = 255, POS = 1.5, and Paper = 75, calculate the predicted sales value for each of the three models above.
#sales_predicted = -17150.46 + 275.69 * (radio) + 48.34 * (tv)
sales_predicted = -17150.46 + 275.69*(69) + 48.34*(255)
sales_predicted
[1] 14198.85
#sales_predicted = -15491.23 + 291.36 * (radio) + 38.26 * (tv) - 107.62 * (pos)
sales_predicted = -15491.23 + 291.36*(69) + 38.26*(255) - 107.62* (1.5)
sales_predicted
[1] 14207.48
#sales_predicted = -13801.015 + 294.224 * (radio) + 33.369 * (tv) - 128.875*(pos) -9.159*(paper)
sales_predicted = -13801.015 + 294.224 * (69) + 33.369 * (255) - 128.875*(1.5) -9.159*(75)
sales_predicted
[1] 14129.3
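All three predictions above could also be computed at once with predict() and a one-row data frame of the given values (a sketch; the column names are assumed to match those in marketing.csv):
#New advertising values for which we want predicted sales
newvals = data.frame(radio = 69, tv = 255, pos = 1.5, paper = 75)
#Predicted sales from each of the three models
predict(mlr1, newvals)
predict(mlr2, newvals)
predict(mlr3, newvals)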
Based on our results above, we conclude that the best model is mlr3, because it gives us the sales prediction that accounts for the given values of all four variables: radio, tv, pos, and paper.
To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.
Based on the one-field predictive model from Watson, it was concluded that radio does in fact drive sales, with a predictive strength of about 93%. In other words, roughly 93 percent of the variation in sales is explained by radio. Therefore, the more we use radio to our advantage (on a promotional basis), the more our sales will increase. The images below show this information. From these results and those we calculated in task two, it can be concluded that radio and sales have a strong correlation, given that this is where the highest adjusted R-squared resulted.