In a year with more rain, we may see an increase in crop production, because more water may lead to more plants. This is a direct relationship: the number of fruits may be predictable from the amount of rainfall in a given year. This example illustrates simple linear regression, an extremely useful technique that lets us predict the values of one variable from another variable.
This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.
Remember to always set your working directory to the source file location: go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the material below carefully and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets from the ones included here. Always read the instructions on Sakai carefully. Tasks and questions to be completed or answered are highlighted in larger bolded fonts and numbered according to their placement in the task section.
First, read in the marketing data that was used in the previous lab. Make sure the file is read in correctly.
#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
## case_number sales radio paper tv pos
## 1 1 11125 65 89 250 1.3
## 2 2 16121 73 55 260 1.6
## 3 3 16440 74 58 270 1.7
## 4 4 16876 75 82 270 1.3
## 5 5 13965 69 75 255 1.5
## 6 6 14999 70 71 255 2.1
Next, apply the cor() function to the data to understand the correlations between variables. Passing the whole data frame returns a correlation matrix, which makes it easy to compare every pair of variables at once.
#Correlation matrix of all columns in the data
corr = cor(mydata)
corr
## case_number sales radio paper tv
## case_number 1.0000000 0.2402344 0.23586825 -0.36838393 0.22282482
## sales 0.2402344 1.0000000 0.97713807 -0.28306828 0.95797025
## radio 0.2358682 0.9771381 1.00000000 -0.23835848 0.96609579
## paper -0.3683839 -0.2830683 -0.23835848 1.00000000 -0.24587896
## tv 0.2228248 0.9579703 0.96609579 -0.24587896 1.00000000
## pos 0.0539763 0.0126486 0.06040209 -0.09006241 -0.03602314
## pos
## case_number 0.05397630
## sales 0.01264860
## radio 0.06040209
## paper -0.09006241
## tv -0.03602314
## pos 1.00000000
# Correlation matrix of columns 2,4, and 6
corr = cor( mydata[ c(2,4,6) ] )
corr
## sales paper pos
## sales 1.0000000 -0.28306828 0.01264860
## paper -0.2830683 1.00000000 -0.09006241
## pos 0.0126486 -0.09006241 1.00000000
#Correlation matrix of all columns except the first column. This is convenient since case_number is only an indicator for the month and should be excluded from the calculations.
corr = cor( mydata[ 2:6 ] )
corr
## sales radio paper tv pos
## sales 1.0000000 0.97713807 -0.28306828 0.95797025 0.01264860
## radio 0.9771381 1.00000000 -0.23835848 0.96609579 0.06040209
## paper -0.2830683 -0.23835848 1.00000000 -0.24587896 -0.09006241
## tv 0.9579703 0.96609579 -0.24587896 1.00000000 -0.03602314
## pos 0.0126486 0.06040209 -0.09006241 -0.03602314 1.00000000
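Selecting columns by position (2:6) works, but it depends on the column order in the file. As a minimal sketch, the same exclusion can be written by name. The data frame `first6` is an illustrative stand-in that retypes only the six rows printed by head() above, so its correlations will not match the full-data matrix.

```r
# Illustrative stand-in: the six rows printed by head(mydata) above
first6 <- data.frame(
  case_number = 1:6,
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  paper = c(89, 55, 58, 82, 75, 71),
  tv    = c(250, 260, 270, 270, 255, 255),
  pos   = c(1.3, 1.6, 1.7, 1.3, 1.5, 2.1)
)

# Drop case_number by name instead of by column index
keep <- setdiff(names(first6), "case_number")
corr_by_name <- cor(first6[, keep])
corr_by_name
```

With the real data, replacing `first6` by `mydata` should give the same matrix as cor( mydata[ 2:6 ] ).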
Along the diagonal of the matrix every value is 1.0. Each diagonal entry compares a variable with itself, and a variable is perfectly correlated with itself.
Setting the diagonal aside, two pairs stand out as strongly correlated. The strongest is radio and sales, with a correlation of 0.97713807. The second is tv and radio, with a similar correlation of 0.96609579. These are the off-diagonal values closest to 1.0 in the matrix.
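One way to confirm this reading without scanning the matrix by eye is to rank the off-diagonal entries by absolute value. The sketch below retypes the 5 x 5 matrix printed above; the names `corr_printed`, `pair_list`, and `ranked` are illustrative and not part of the lab.

```r
# Correlation matrix retyped from the printed output of cor( mydata[ 2:6 ] )
vars <- c("sales", "radio", "paper", "tv", "pos")
corr_printed <- matrix(
  c( 1.0000000,  0.97713807, -0.28306828,  0.95797025,  0.01264860,
     0.9771381,  1.00000000, -0.23835848,  0.96609579,  0.06040209,
    -0.2830683, -0.23835848,  1.00000000, -0.24587896, -0.09006241,
     0.9579703,  0.96609579, -0.24587896,  1.00000000, -0.03602314,
     0.0126486,  0.06040209, -0.09006241, -0.03602314,  1.00000000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Blank out the diagonal and lower triangle so each pair appears exactly once
upper <- corr_printed
upper[lower.tri(upper, diag = TRUE)] <- NA

# Reshape to one row per pair and sort by absolute correlation
pair_list <- as.data.frame(as.table(upper), stringsAsFactors = FALSE)
names(pair_list) <- c("var1", "var2", "r")
pair_list <- pair_list[!is.na(pair_list$r), ]
ranked <- pair_list[order(-abs(pair_list$r)), ]
head(ranked, 3)
```

Sorting on the absolute value means a strong negative correlation would also rank highly, which a plain sort on r would miss.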
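Next we will create a visual diagram of the correlation matrix, called a corrgram, in which the strength of each correlation is represented by color intensity. To do this we first need to install two packages in RStudio and load them, as in the commands below.
#Load the two packages (run install.packages() for each one first if it is not installed yet)
library(corrgram)
library(corrplot)
## corrplot 0.84 loaded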
# Generates a corrgram of last computed correlation matrix
corrgram(corr)
# Generates a corrplot, similar to a corrgram, but with a different visual display
corrplot(corr)
From the matrix, it's clear that sales, radio, and tv have the strongest correlations. Let’s now create a few scatterplots to visualize the data and trend lines. First we need to extract the columns from the data file.
#Extract all variables
pos = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command from Worksheet 4
scatter.smooth(radio,sales)
From this plot, it seems the points are scattered in an almost linear way. So, we will try to fit a simple linear regression model to the graph.
The lm() function fits linear models. It is called as lm(y ~ x), where x, the independent variable, is used to predict values of y, the dependent variable.
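To make concrete what lm(y ~ x) estimates, the sketch below computes the least-squares slope and intercept by hand and compares them with lm(). It uses only the six rows printed by head() earlier, so the numbers will differ from a model fitted to the full data; `radio6`, `sales6`, and `fit6` are illustrative names.

```r
# Six rows of radio and sales, retyped from the head(mydata) output
radio6 <- c(65, 73, 74, 75, 69, 70)
sales6 <- c(11125, 16121, 16440, 16876, 13965, 14999)

# Least-squares estimates for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope_manual     <- cov(radio6, sales6) / var(radio6)
intercept_manual <- mean(sales6) - slope_manual * mean(radio6)

# The same estimates from lm()
fit6 <- lm(sales6 ~ radio6)

c(intercept = intercept_manual, slope = slope_manual)
coef(fit6)
```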
In the simple linear regression model below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model.
#Simple Linear Regression - the R function lm()
reg <- lm(sales ~ radio)
#Summary of Model
summary(reg)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732.85 -198.88 62.64 415.26 637.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9741.92 1362.94 -7.148 1.17e-06 ***
## radio 347.69 17.83 19.499 1.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
## F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
As indicated by the summary report, the intercept is -9741.92 and the slope for radio is 347.69. We can therefore write the linear regression model predicting sales from radio as Sales_predicted = -9741.92 + 347.69 * Xradio. Given this equation, we can predict sales for any value of radio, for example 75 (investing $75,000 in radio ads).
### Linear model ( Y = b + mx )
### Sales_predicted ~ Radio_X1
sales_predicted = -9741.92 + 347.69 * (75)
sales_predicted
## [1] 16334.83
### Another way to write the equation is to refer to the coefficients of the equation instead of typing the actual values out.
### This is demonstrated below.
sales_predicted = coef(reg)[1] + coef(reg)[2] * 75
coef(reg)[1] # intercept
## (Intercept)
## -9741.921
coef(reg)[2] # slope
## radio
## 347.6888
sales_predicted
## (Intercept)
## 16334.74
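The coefficient arithmetic above can also be delegated to R's predict() function, which takes a fitted model and a data frame of new predictor values. This is a minimal sketch on the six rows printed by head(), not on the full data set, so the predicted value will differ from the one above; `first6` and `fit6` are illustrative names.

```r
# Six rows of radio and sales, retyped from the head(mydata) output
first6 <- data.frame(
  radio = c(65, 73, 74, 75, 69, 70),
  sales = c(11125, 16121, 16440, 16876, 13965, 14999)
)
fit6 <- lm(sales ~ radio, data = first6)

# Manual prediction from the coefficients, as in the lab
manual <- coef(fit6)[1] + coef(fit6)[2] * 75

# Same prediction via predict(); the newdata column name must match the predictor
via_predict <- predict(fit6, newdata = data.frame(radio = 75))

unname(manual)
unname(via_predict)
```

Fitting with the `data =` argument, rather than with free-standing vectors, is what lets predict() match the `radio` column in `newdata` to the model term.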
# Simple linear regression using the R function 'lm', this time with tv as the predictor
reg1 <- lm (sales ~ tv)
#Summary of new Model
summary(reg1)
##
## Call:
## lm(formula = sales ~ tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1921.87 -412.24 7.02 581.59 1081.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42229.21 4164.12 -10.14 7.19e-09 ***
## tv 221.10 15.61 14.17 3.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 771.3 on 18 degrees of freedom
## Multiple R-squared: 0.9177, Adjusted R-squared: 0.9131
## F-statistic: 200.7 on 1 and 18 DF, p-value: 3.336e-11
sales_predicted1 = -42229.21 + 221.10 * (75)
sales_predicted1
## [1] -25646.71
sales_predicted1 = coef(reg1)[1] + coef(reg1)[2]*(75)
coef(reg1)[1]
## (Intercept)
## -42229.21
coef (reg1)[2]
## tv
## 221.1043
sales_predicted1
## (Intercept)
## -25646.39
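The negative prediction comes from evaluating the tv model at tv = 75, which is far below the tv values in the data (the rows printed by head() range from 250 to 270). A regression line is only supported within the range of the observed predictor. The sketch below shows one way to check this before predicting; `is_within_range` is a hypothetical helper, and `tv_first6` holds only the six printed rows.

```r
# tv values from the six rows printed by head(mydata)
tv_first6 <- c(250, 260, 270, 270, 255, 255)

# Hypothetical helper: TRUE if new_value lies inside the observed range
is_within_range <- function(new_value, observed) {
  new_value >= min(observed) && new_value <= max(observed)
}

range(tv_first6)
is_within_range(75, tv_first6)   # the value used in the prediction above
is_within_range(260, tv_first6)  # a value inside the observed range
```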
A high R-Squared value indicates that the model is a good fit, but not perfect. For the case of Sales versus Radio we will overlay the trend line representing the regression equation over the original plot. This will show how far the predictions are from the actual value. The difference between the actual sales (circles) and the predicted sales (solid line) is captured in the residual error calculations as reported by the summary function.
#Plot Radio and Sales
plot(radio,sales)
#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2)
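The residuals and the R-squared value discussed above are linked: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces it by hand on the six rows printed by head(), using illustrative names. It also squares the printed radio-sales correlation; for a model with one predictor, R-squared equals the squared correlation, so the result should round to the 0.9548 reported by summary(reg).

```r
# Six rows of radio and sales, retyped from the head(mydata) output
radio6 <- c(65, 73, 74, 75, 69, 70)
sales6 <- c(11125, 16121, 16440, 16876, 13965, 14999)
fit6 <- lm(sales6 ~ radio6)

# Residual = actual value minus the value on the fitted line
res <- sales6 - fitted(fit6)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(res^2)
ss_tot <- sum((sales6 - mean(sales6))^2)
r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(fit6)$r.squared
c(manual = r2_manual, summary = r2_summary)

# Squared correlation between radio and sales, taken from the printed matrix
r2_from_printed_cor <- 0.97713807^2
round(r2_from_printed_cor, 4)
```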
Often more than one factor or variable affects the prediction of an outcome. While increased rainfall is a good predictor of increased crop supply, a decrease in herbivores can also result in an increase in crops. This idea is a loose metaphor for multiple linear regression.
In R, multiple linear regression takes the form lm(y ~ x0 + x1 + x2 + ... ), where y is the value being predicted (the dependent variable) and the x variables are the predictors (the independent variables).
Let’s create a multiple linear regression predicting sales using both independent variables, radio and tv.
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
#Summary of Multiple Linear Regression Model
summary(mlr1)
##
## Call:
## lm(formula = sales ~ radio + tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1729.58 -205.97 56.95 335.15 759.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17150.46 6965.59 -2.462 0.024791 *
## radio 275.69 68.73 4.011 0.000905 ***
## tv 48.34 44.58 1.084 0.293351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
## F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
Note the values of 0.9577 for R-squared and 0.9527 for the Adj R-squared. The predicted sales can again be calculated given the coefficients of the regression model. The example below shows the predicted sales for TV = 270 and Radio = 75.
# sales_predicted = intercept + b1 * radio + b2 * tv
sales_predicted = coef(mlr1)[1] + coef(mlr1)[2]*(75) + coef(mlr1)[3]*(270)
sales_predicted
## (Intercept)
## 16578.3
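With several predictors, the prediction is the sum of each coefficient times its input, with 1 as the input for the intercept. The sketch below checks that this sum agrees with predict() on a model fitted to the six rows printed by head() (illustrative names, not the full data). It also repeats the calculation with the rounded coefficients from summary(mlr1), which gives 16578.09 rather than 16578.3 because of rounding.

```r
# Six rows retyped from the head(mydata) output
first6 <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  tv    = c(250, 260, 270, 270, 255, 255)
)
fit6 <- lm(sales ~ radio + tv, data = first6)

# Inputs in coefficient order: 1 for the intercept, then radio, then tv
new_x <- c(1, 75, 270)
manual <- sum(coef(fit6) * new_x)
via_predict <- predict(fit6, newdata = data.frame(radio = 75, tv = 270))
c(manual = manual, predict = unname(via_predict))

# Same calculation with the rounded coefficients printed by summary(mlr1)
rounded <- -17150.46 + 275.69 * 75 + 48.34 * 270
rounded
```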
#mlr2 = Sales predicted by radio, tv, and pos
mlr2 <-lm (sales ~ radio + tv + pos)
#Summary of Multiple Linear Regression Model
summary(mlr2)
##
## Call:
## lm(formula = sales ~ radio + tv + pos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1748.20 -187.42 -61.14 352.07 734.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15491.23 7697.08 -2.013 0.06130 .
## radio 291.36 75.48 3.860 0.00139 **
## tv 38.26 48.90 0.782 0.44538
## pos -107.62 191.25 -0.563 0.58142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
## F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 <-lm (sales ~ radio + tv + pos + paper)
#Summary of Multiple Linear Regression Model
summary(mlr3)
##
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1558.13 -239.35 7.25 387.02 728.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13801.015 7865.017 -1.755 0.09970 .
## radio 294.224 75.442 3.900 0.00142 **
## tv 33.369 49.080 0.680 0.50693
## pos -128.875 192.156 -0.671 0.51262
## paper -9.159 8.991 -1.019 0.32449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
## F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
mlr1: sales_predicted2 = -17150.46 + 275.69 * radio + 48.34 * tv
mlr1 R-squared = 0.9577; mlr1 Adj R-squared = 0.9527
mlr2: sales_predicted3 = -15491.23 + 291.36 * radio + 38.26 * tv - 107.62 * pos
mlr2 R-squared = 0.9585; mlr2 Adj R-squared = 0.9508
mlr3: sales_predicted4 = -13801.015 + 294.224 * radio + 33.369 * tv - 128.875 * pos - 9.159 * paper
mlr3 R-squared = 0.9612; mlr3 Adj R-squared = 0.9509
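The adjusted R-squared values can be reproduced from the plain R-squared values with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors. In the sketch below, n = 20 is inferred from the summaries (18 residual degrees of freedom plus 2 estimated coefficients in the one-predictor models), and `adjusted_r2` is an illustrative helper. Because the inputs are already rounded to four decimals, the results can differ from the printed values in the last digit.

```r
# Illustrative helper: adjusted R-squared from R-squared, sample size n, and p predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 20  # inferred from the residual degrees of freedom in the summaries
r2 <- c(mlr1 = 0.9577, mlr2 = 0.9585, mlr3 = 0.9612)  # printed R-squared values
p  <- c(mlr1 = 2,      mlr2 = 3,      mlr3 = 4)       # predictors in each model

adj <- adjusted_r2(r2, n, p)
round(adj, 4)
```

R-squared rises from mlr1 to mlr3 as predictors are added, but the adjusted value penalizes the extra terms, which is why mlr2 and mlr3 do not improve on mlr1's adjusted R-squared.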
Watson Analytics contains a Predict module in addition to the Explore and Assemble modules. Although Watson is a useful tool, it should never be treated as a black box: there are many circumstances in which Watson can fail to make the proper diagnosis or suggest the correct remedies. This is why it is critical, as more cognitive analytics and advanced machine learning tools enter our environment, that we leave room for human intervention and learn how best to interact with these tools.
There are many details in the Predict module. We will keep it relatively simple for this introductory phase. To complete this task, follow the directions below. Make sure to capture any screenshots and to answer any questions where requested.
When observing the results from Watson, there is a 99% correlation between pos and radio, which indicates that they are the top drivers of sales. Another high correlation, 98%, can be seen for pos and tv. The results from Watson clearly show which variables are the strong drivers of sales, which can help a management team decide where marketing money should be invested.
The results from Watson reconcile with my findings very well. Watson is undoubtedly a great tool for looking at data in a simple visual form. Through Watson I was able to see the correlations among the variables in a simple visual representation. Watson gives simple, easy-to-understand results across many different variables, and you can run a comparison on a single variable or on two variables. This flexibility is why I believe it is a great tool to use when you want to explore data.