In a given year, if it rains more, we might see an increase in crop production, because more water may lead to more plant growth. This is a direct relationship: the amount of produce may be predicted from the amount of rainfall in a given year. This example illustrates simple linear regression, an important concept that allows us to model (predict) the output value of a variable based on the input of one or more variables.
This lab explores the concepts of simple linear regression, multiple linear regression, and a revisit of Watson Analytics.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the material below carefully and follow the instructions to complete the tasks and answer the questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets than those included here. Always read the instructions on Sakai carefully. For clarity, tasks/questions to be completed/answered are highlighted in red and numbered according to their placement in the task section. Quite often you will need to add your own code chunk.
Execute all code chunks, preview, publish, and submit link on Sakai.
First, read in the marketing data that was used in the previous lab. Make sure the file is read in correctly by checking the variables in the environment pane of RStudio and also displaying the first few header lines of the file.
#Read the input file
mydata = read.csv(file="data/marketing.csv")
head(mydata)
## case_number sales radio paper tv pos
## 1 1 11125 65 89 250 1.3
## 2 2 16121 73 55 260 1.6
## 3 3 16440 74 58 270 1.7
## 4 4 16876 75 82 270 1.3
## 5 5 13965 69 75 255 1.5
## 6 6 14999 70 71 255 2.1
Next, apply the cor() function to the entire data set. This is a quick way to compare the correlations between all variables.
#Correlation matrix of all variable columns in the data
corr = cor(mydata)
corr
## case_number sales radio paper tv
## case_number 1.0000000 0.2402344 0.23586825 -0.36838393 0.22282482
## sales 0.2402344 1.0000000 0.97713807 -0.28306828 0.95797025
## radio 0.2358682 0.9771381 1.00000000 -0.23835848 0.96609579
## paper -0.3683839 -0.2830683 -0.23835848 1.00000000 -0.24587896
## tv 0.2228248 0.9579703 0.96609579 -0.24587896 1.00000000
## pos 0.0539763 0.0126486 0.06040209 -0.09006241 -0.03602314
## pos
## case_number 0.05397630
## sales 0.01264860
## radio 0.06040209
## paper -0.09006241
## tv -0.03602314
## pos 1.00000000
# Correlation matrix of columns 2, 4, and 6 (sales, paper, pos)
corr = cor( mydata[ c(2,4,6) ] )
corr
## sales paper pos
## sales 1.0000000 -0.28306828 0.01264860
## paper -0.2830683 1.00000000 -0.09006241
## pos 0.0126486 -0.09006241 1.00000000
#Correlation matrix of all columns except the first column. This is convenient since case_number is only an indicator for the month and should be excluded from the calculations.
corr = cor( mydata[ 2:6 ] )
corr
## sales radio paper tv pos
## sales 1.0000000 0.97713807 -0.28306828 0.95797025 0.01264860
## radio 0.9771381 1.00000000 -0.23835848 0.96609579 0.06040209
## paper -0.2830683 -0.23835848 1.00000000 -0.24587896 -0.09006241
## tv 0.9579703 0.96609579 -0.24587896 1.00000000 -0.03602314
## pos 0.0126486 0.06040209 -0.09006241 -0.03602314 1.00000000
The above matrix is square and symmetric.
##### 1A) Explain why the values along the diagonal are all “1.0”.
As stated in class, the correlation of a variable with itself is always 1, a perfect positive correlation. In a correlation matrix, each diagonal entry is the intersection of a variable with itself, hence the values of 1.0.
##### 1B) Explain with an example how the entries of the matrix are symmetric with respect to the diagonal.
Also covered in class, only half of the values in the matrix are distinct correlation coefficients: those below (or above) the diagonal of 1.000 values. In practice it is common to display only one triangle of the matrix, since the values above the diagonal mirror those below, which makes for a cleaner, more visually appealing display. For example, consider sales and radio. Locating sales on the rows and radio on the columns, the coefficient is 0.97713807; locating radio on the rows and sales on the columns, the coefficient is also 0.97713807. The same pattern holds for every other pair of variables in the matrix.
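This symmetry can be verified directly in R, using the corr matrix computed above:
# A correlation matrix equals its own transpose
all.equal(corr, t(corr))
# Mirror entries across the diagonal are identical
corr["sales", "radio"]
corr["radio", "sales"]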
Next we will create a visual diagram of the correlation matrix, called a corrgram, where correlation strength is represented by color intensity. To do this we first need to install and load two packages in RStudio, as executed by the command lines below.
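The install/load chunk itself is not echoed in the rendered output; a minimal version, assuming the corrgram and corrplot packages (inferred from the function calls that follow), is:
# Install once, then load each session
# install.packages("corrgram")
# install.packages("corrplot")
library(corrgram)
library(corrplot)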
## corrplot 0.84 loaded
# Generates a corrgram of last computed correlation matrix
corrgram(corr)
# Generates a corrplot, similar to a corrgram, but with a different visual display
corrplot(corr)
From the matrix, it’s clear that Sales, Radio, and TV have the strongest correlations. Let’s now create a few scatterplots to visualize the data and trend lines. First we need to extract the columns from the data file.
#Extract all variables
pos = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command from Worksheet 4
scatter.smooth(radio,sales)
From this plot, it seems the points are scattered in a roughly linear way, so we will try to fit a simple linear regression model to the data.
The lm() linear modeling function is a very useful one. It is set up as lm(y ~ x), where the x variable, the independent variable, predicts values of the y variable, the dependent variable.
In the simple linear regression model below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model.
#Simple Linear Regression - the R function lm()
reg <- lm(sales ~ radio)
#Summary of Model
summary(reg)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732.85 -198.88 62.64 415.26 637.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9741.92 1362.94 -7.148 1.17e-06 ***
## radio 347.69 17.83 19.499 1.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
## F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
As indicated by the summary report, the intercept value is -9741.92 and the slope for radio is 347.69. We can therefore write the following equation for the linear regression model predicting Sales based on Radio: Sales_predicted = -9741.92 + 347.69 * Xradio. Given this equation we can predict the value of sales for any given value of radio, for example 75 (investing $75,000 in radio ads). Always watch for units!
### Linear model ( Y = b + m*X )
### Sales_predicted ~ Radio_X1
sales_predicted = -9741.92 + 347.69 * (75)
sales_predicted
## [1] 16334.83
### Another way to write the equation is to refer to the coefficients of the equation instead of typing the actual values out.
### This is demonstrated below.
sales_predicted = coef(reg)[1] + coef(reg)[2] * 75
coef(reg)[1] # intercept
## (Intercept)
## -9741.921
coef(reg)[2] # slope
## radio
## 347.6888
sales_predicted
## (Intercept)
## 16334.74
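The same prediction can also be obtained with R’s built-in predict() function, which evaluates a fitted model on a data frame of new input values; this scales better when predicting at many values at once:
# Same prediction using predict() instead of manual coefficient arithmetic
predict(reg, newdata = data.frame(radio = 75))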
A high R-squared value indicates that the model is a good fit, though not a perfect one. For the case of Sales versus Radio, we will overlay the trend line representing the regression equation on the original plot. This shows how far the predicted values are from the actual values. The difference between the actual sales (circles) and the predicted sales (solid line) is captured in the residual error calculations reported earlier by the summary() function.
#Plot Radio and Sales
plot(radio,sales)
#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2)
Note that the trend line shown here is the one corresponding to the regression equation. It is different from the smooth curved line used earlier in the scatter plot.
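If desired, the two lines can be overlaid on a single plot for comparison; a minimal sketch, assuming a version of R whose scatter.smooth() accepts the lpars styling argument:
# Loess smooth (red, dashed) versus the linear regression trend line (blue)
scatter.smooth(radio, sales, lpars = list(col = "red", lty = 2))
abline(reg, col = "blue", lwd = 2)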
##### 1C) Repeat the above calculations to derive the linear regression model for Sales versus TV.
scatter.smooth(tv,sales)
reg <- lm(sales ~ tv)
summary(reg)
##
## Call:
## lm(formula = sales ~ tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1921.87 -412.24 7.02 581.59 1081.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42229.21 4164.12 -10.14 7.19e-09 ***
## tv 221.10 15.61 14.17 3.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 771.3 on 18 degrees of freedom
## Multiple R-squared: 0.9177, Adjusted R-squared: 0.9131
## F-statistic: 200.7 on 1 and 18 DF, p-value: 3.336e-11
# Predict sales at an example value, tv = 270 (illustrative; any tv value could be used)
sales_predicted = coef(reg)[1] + coef(reg)[2] * 270
coef(reg)[1]
## (Intercept)
## -42229.21
coef(reg)[2]
## tv
## 221.1043
##### 1D) Write down the equation for the linear regression model. Identify the values for the intercept and the slope.
The formula for the linear regression model is: Ysales = -42229.21 + 221.10 * Xtv
Intercept = -42229.21, Slope = 221.10
Often, more than one factor or variable affects the prediction of an outcome. While increased rainfall is a good predictor of increased crop supply, a decrease in herbivores can also result in an increase in crops. This idea is a loose metaphor for multiple linear regression.
In R, multiple linear regression takes the form lm(y ~ x0 + x1 + x2 + ...), where y is the value being predicted, the dependent variable, and the x variables are the predictors, or independent variables.
Let’s create a multiple linear regression model predicting sales using both independent variables, radio and tv.
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
#Summary of Multiple Linear Regression Model
summary(mlr1)
##
## Call:
## lm(formula = sales ~ radio + tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1729.58 -205.97 56.95 335.15 759.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17150.46 6965.59 -2.462 0.024791 *
## radio 275.69 68.73 4.011 0.000905 ***
## tv 48.34 44.58 1.084 0.293351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
## F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
Note the values of 0.9577 for R-squared and 0.9527 for the Adjusted R-squared. The predicted sales can again be calculated given the coefficients of the regression model. The example below calculates the predicted sales for TV = 270 and Radio = 75.
# sales_predicted = intercept + radio coefficient * 75 + tv coefficient * 270
sales_predicted = coef(mlr1)[1] + coef(mlr1)[2]*(75) + coef(mlr1)[3]*(270)
sales_predicted
## (Intercept)
## 16578.3
##### 2A) Compare the calculated predicted sales to the actual sales found in the file, by calculating the difference error squared.
The squared difference error formula is (Actual value - Predicted value)^2. The actual value here is sales = 16876 (case 4 in the file, where radio = 75 and tv = 270).
(16876-16578.3)^2
## [1] 88625.29
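The comparison can also be scripted so the actual value is looked up from the data rather than typed in by hand; a small sketch using the mydata frame loaded earlier:
# Actual sales for the row where radio = 75 and tv = 270 (case 4)
actual = mydata$sales[mydata$radio == 75 & mydata$tv == 270]
(actual - sales_predicted)^2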
##### 2B) Fill in the code chunk below to create multiple linear regression model for each of the specified two use cases, and display the summary statistics.
#mlr2 = Sales predicted by radio, tv, and pos
mlr2 <-lm(sales ~ radio + tv + pos)
#Summary of Multiple Linear Regression Model
summary(mlr2)
##
## Call:
## lm(formula = sales ~ radio + tv + pos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1748.20 -187.42 -61.14 352.07 734.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15491.23 7697.08 -2.013 0.06130 .
## radio 291.36 75.48 3.860 0.00139 **
## tv 38.26 48.90 0.782 0.44538
## pos -107.62 191.25 -0.563 0.58142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
## F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 <-lm(sales ~ radio + tv + pos + paper)
#Summary of Multiple Linear Regression Model
summary(mlr3)
##
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1558.13 -239.35 7.25 387.02 728.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13801.015 7865.017 -1.755 0.09970 .
## radio 294.224 75.442 3.900 0.00142 **
## tv 33.369 49.080 0.680 0.50693
## pos -128.875 192.156 -0.671 0.51262
## paper -9.159 8.991 -1.019 0.32449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
## F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
##### 2C) Write down a summary paragraph of the regression equations and corresponding values for R-squared and Adjusted R-squared for each of the three cases: mlr1, mlr2, and mlr3
mlr1: Ysales = -17150.46 + 275.69 * Xradio + 48.34 * Xtv
R-squared = 0.9577, Adjusted R-squared = 0.9527
mlr2: Ysales = -15491.23 + 291.36 * Xradio + 38.26 * Xtv - 107.62 * Xpos
R-squared = 0.9585, Adjusted R-squared = 0.9508
mlr3: Ysales = -13801.015 + 294.224 * Xradio + 33.369 * Xtv - 128.875 * Xpos - 9.159 * Xpaper
R-squared = 0.9612, Adjusted R-squared = 0.9509
##### 2D) Based solely on the values for R-squared and Adjusted R-squared, which of the three multiple linear regression models (mlr1, mlr2, mlr3) is best at predicting sales? Explain why.
As covered in class discussions, the best-fitting model is not necessarily the best-predicting model. For that reason we rely on the adjusted R-squared, which increases only if an added variable actually improves the prediction. By this measure, mlr1 is the best model for predicting sales, as it has the highest adjusted R-squared of the three models (0.9527).
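For reference, adjusted R-squared applies a penalty for each added predictor. A quick sanity check for mlr1, with n = 20 observations and p = 2 predictors:
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
1 - (1 - 0.9577) * (20 - 1) / (20 - 2 - 1)
# = 0.9527 (to 4 decimal places), matching the summary(mlr1) output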
##### 2E) Calculate the sales value for each of the above three models by substituting the values Radio = 69, TV = 255, POS = 1.5, and Paper = 75. Compare your calculated sales to the actual value as obtained from the data file. Which model is best at fitting the actual sales value? Quantify your answer by calculating the error squared for each case. Compare your findings to your answer in 2D) and explain any discrepancies. Hint: best fitting is not the same as best predicting!
#mlr1:
sales_predicted = -17150.46 + 275.69 * (69) + 48.34 * (255)
sales_predicted
## [1] 14198.85
#mlr2:
sales_predicted = -15491.23 + 291.36 * (69) + 38.26 * (255) - 107.62 * (1.5)
sales_predicted
## [1] 14207.48
#mlr3:
sales_predicted= -13801.015 + 294.224 * (69) + 33.369 * (255) - 128.875 * (1.5) - 9.159 * (75)
sales_predicted
## [1] 14129.3
#Error^2 = (Actual value - Predicted value)^2 for each multiple linear regression model
#mlr1
(13965-14198.85)^2
## [1] 54685.82
#mlr2
(13965-14207.48)^2
## [1] 58796.55
#mlr3
(13965-14129.3)^2
## [1] 26994.49
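As a cross-check, the same three predictions can be computed with predict(), using one data frame that holds all four predictor values (each model picks out only the columns its formula names):
# New predictor values shared by all three models
newx = data.frame(radio = 69, tv = 255, pos = 1.5, paper = 75)
predict(mlr1, newx)
predict(mlr2, newx)
predict(mlr3, newx)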
The model best at fitting the actual sales value is mlr3, because its predicted value is closest to the actual sales value. The error squared calculation supports this finding: the lower the result, the less error the model has, meaning it is the better predictor of sales in this instance. This finding differs from our answer in 2D), where we concluded that mlr1 was the best model because of its higher adjusted R-squared. However, R-squared and adjusted R-squared measure only how well the model fits the variables in scope. As the hint suggests, best fitting is not the same as best predicting: mlr3 gave the best prediction here even though it was not the best-fitting model in the earlier tasks.
Watson Analytics contains a Predict module, in addition to the Explore and Assemble modules. Although Watson is a useful tool, it should never be treated as a black box. In other words, there are many circumstances where Watson can fail to make the proper diagnosis and suggest the correct remedies. This is why it is critical, as more cognitive analytics and advanced machine learning tools enter our environment, that we allow room for human intervention and learn how best to interact with them.
There are many details in the Predict module; we will keep it relatively simple for this introductory phase. To complete this task, follow the directions below. Make sure to capture any screenshots and to answer any questions where requested.
You will be guided to a new window where you can go to the Predictive Analysis menu and select the Spiral option. You will be presented with a new discovery window where you will have to select your target variable.
##### 3A) Include a screenshot of the top predictors of sales. Consider the one-field (driver) predictive model only and note the predictive power strength for each driver.
Watson shows that the radio driver has the highest strength, 93%, with the tv driver at 86%. Watson also displayed two two-field drivers, pos with radio and pos with tv (omitted from the screenshot, since the task asks for the one-field model only). The pos-and-radio pairing has 99% strength, and the pos-and-tv pairing has 98% strength.
Note that Watson follows a different approach to identify and quantify the predictive strength of each variable. Such details are outside the scope of this exercise. Watson provides a lot of on-line help, and the interested reader is encouraged to independently check the on-line documentation and explore the many capabilities of Watson.
##### 3B) How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain.
Watson’s results reconcile with our findings from the R regression analysis in task 2 in that radio and tv are shown to be the strongest drivers of sales. Radio had the greater correlation, and Watson reflected that with a 93% strength, followed by tv with an 86% strength. This matches task 2: with mlr1 we modeled radio and tv, and found, as Watson did, that those two variables are more strongly related to sales than the others. The only difference appeared when Watson suggested that pos-and-radio and pos-and-tv were also strong drivers. As discussed in class, technology can help us, but human judgment is needed to evaluate whether the outputs are correct. In this scenario, we already know that pos is not closely related to sales; in fact, we saw in mlr2 that adding pos to radio and tv actually lowered the adjusted R-squared, meaning it did not improve the prediction of sales. More analysis would be needed for those two-field drivers, but this is also an example of how human input remains essential in data analytics.