About

In a year with more rain, we might expect an increase in crop production, because more water can lead to more plant growth. This is a direct relationship: the number of fruits may be predictable from the amount of rainfall in a given year. This example illustrates simple linear regression, a useful technique that allows us to predict the values of one variable from another variable.

This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than those included here. Always read the instructions on Sakai carefully. Tasks and questions to be completed or answered are highlighted in larger bolded fonts and numbered according to their placement in the task section.


Task 1: Simple Linear Regression

First, read in the marketing data that was used in the previous lab. Make sure the file is read in correctly.

#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
##   case_number sales radio paper  tv pos
## 1           1 11125    65    89 250 1.3
## 2           2 16121    73    55 260 1.6
## 3           3 16440    74    58 270 1.7
## 4           4 16876    75    82 270 1.3
## 5           5 13965    69    75 255 1.5
## 6           6 14999    70    71 255 2.1

Next, apply the cor() function to the data to understand the correlations between variables. This is a great way to compare the correlations between all variables.

#Correlation matrix of all columns in the data
corr = cor(mydata)
corr
##             case_number      sales       radio       paper          tv
## case_number   1.0000000  0.2402344  0.23586825 -0.36838393  0.22282482
## sales         0.2402344  1.0000000  0.97713807 -0.28306828  0.95797025
## radio         0.2358682  0.9771381  1.00000000 -0.23835848  0.96609579
## paper        -0.3683839 -0.2830683 -0.23835848  1.00000000 -0.24587896
## tv            0.2228248  0.9579703  0.96609579 -0.24587896  1.00000000
## pos           0.0539763  0.0126486  0.06040209 -0.09006241 -0.03602314
##                     pos
## case_number  0.05397630
## sales        0.01264860
## radio        0.06040209
## paper       -0.09006241
## tv          -0.03602314
## pos          1.00000000
# Correlation matrix of columns 2,4, and 6
corr = cor(mydata[ c(2,4,6) ] )
corr
##            sales       paper         pos
## sales  1.0000000 -0.28306828  0.01264860
## paper -0.2830683  1.00000000 -0.09006241
## pos    0.0126486 -0.09006241  1.00000000
#Correlation matrix of all columns except the first column. This is convenient since case_number is only an indicator for the month and should be excluded from the calculations.
corr = cor( mydata[ 2:6 ] )
corr
##            sales       radio       paper          tv         pos
## sales  1.0000000  0.97713807 -0.28306828  0.95797025  0.01264860
## radio  0.9771381  1.00000000 -0.23835848  0.96609579  0.06040209
## paper -0.2830683 -0.23835848  1.00000000 -0.24587896 -0.09006241
## tv     0.9579703  0.96609579 -0.24587896  1.00000000 -0.03602314
## pos    0.0126486  0.06040209 -0.09006241 -0.03602314  1.00000000
1A) Why is the value “1.0” along the diagonal?

The value of 1.0 indicates a perfect positive correlation. It appears along the diagonal because each diagonal entry compares a variable with itself. For instance, the 1 in the first column results from comparing sales to sales, the 1 in the next column from comparing radio to radio, the 1 in the third column from comparing paper to paper, and so on. Any variable is perfectly correlated with itself.
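The diagonal can be checked directly. The sketch below is only an illustration: it rebuilds a small data frame from the six rows printed by head(mydata) above (the name sample_rows is an assumption, and six rows are not the full data set), then confirms that each variable correlates perfectly with itself.

```r
# First six rows of the marketing data, copied from the head(mydata) output above
sample_rows <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  paper = c(89, 55, 58, 82, 75, 71),
  tv    = c(250, 260, 270, 270, 255, 255),
  pos   = c(1.3, 1.6, 1.7, 1.3, 1.5, 2.1)
)

# A variable compared with itself always has correlation 1
cor(sample_rows$sales, sample_rows$sales)

# So every diagonal entry of the correlation matrix is 1 ...
sample_corr <- cor(sample_rows)
diag(sample_corr)

# ... and the matrix is symmetric, since cor(x, y) equals cor(y, x)
isSymmetric(sample_corr)
```

The off-diagonal values from this six-row subset will differ from the full-data matrix shown earlier; only the diagonal and the symmetry carry over.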

1B) Which pairs have the strongest correlations?

Apart from the values of 1 along the diagonal, the pair sales and radio shows the strongest correlation, with a value of 0.97713807, followed by radio and tv, with a value of 0.96609579. Both are very close to a perfect correlation of 1. Sales and tv are also strongly correlated, with a value of 0.95797025.
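The same ranking can be obtained programmatically rather than by reading the matrix. The following is a minimal sketch that retypes the correlation values printed above into a matrix; the names corr_values, pair_table, and ranked_pairs are assumptions made for illustration. With the full data loaded, cor(mydata[2:6]) could be used in place of the retyped values.

```r
# Correlation values copied from the cor(mydata[2:6]) output above
vars <- c("sales", "radio", "paper", "tv", "pos")
corr_values <- matrix(
  c( 1.0000000,  0.97713807, -0.28306828,  0.95797025,  0.01264860,
     0.9771381,  1.00000000, -0.23835848,  0.96609579,  0.06040209,
    -0.2830683, -0.23835848,  1.00000000, -0.24587896, -0.09006241,
     0.9579703,  0.96609579, -0.24587896,  1.00000000, -0.03602314,
     0.0126486,  0.06040209, -0.09006241, -0.03602314,  1.00000000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by looking only at the upper triangle (this skips the diagonal)
upper <- which(upper.tri(corr_values), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = vars[upper[, "row"]],
  var2 = vars[upper[, "col"]],
  correlation = corr_values[upper]
)

# Rank by absolute value so that strong negative correlations would also surface
ranked_pairs <- pair_table[order(-abs(pair_table$correlation)), ]
head(ranked_pairs, 3)
```

The top three rows should be sales and radio, radio and tv, and sales and tv, matching the answer above.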

Next we will create a visual diagram of the correlation matrix, called a corrgram, in which the strength of each correlation is represented by color intensity. To do this, we first need to install two packages in RStudio: corrgram and corrplot, which provide the two plotting functions used afterwards. The output of that installation step is shown below.
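The installation commands themselves are not shown in this report. One possible way to do it, sketched here as an assumption rather than the commands actually used, is to check which packages are missing and install only those. The helper missing_packages is hypothetical and not part of any package; the install and load calls are left commented out so the snippet has no side effects.

```r
# Return the subset of `pkgs` that is not installed yet
# (missing_packages is a hypothetical helper written for this sketch)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("corrgram", "corrplot")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing, then load both packages:
# if (length(to_install) > 0) install.packages(to_install)
# library(corrgram)
# library(corrplot)
```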

## 
## The downloaded binary packages are in
##  /var/folders/cs/mq017scs4tz9wkpz9y8xgv0c0000gn/T//RtmpAbx1XA/downloaded_packages
## 
## The downloaded binary packages are in
##  /var/folders/cs/mq017scs4tz9wkpz9y8xgv0c0000gn/T//RtmpAbx1XA/downloaded_packages
## corrplot 0.84 loaded
# Generates a corrgram of last computed correlation matrix
corrgram(corr)

# Generates a corrplot, similar to a corrgram, but with a different visual display
corrplot(corr)

From the matrix, it’s clear that sales, radio, and tv have the strongest correlations. Let’s now create a few scatterplots to visualize the data and trend lines. First we need to extract the columns from the data file.

#Extract all variables
pos  = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command from Worksheet 4
scatter.smooth(radio,sales)

From this plot, it seems the points are scattered in an almost linear way. So, we will try to fit a simple linear regression model to the graph.

The lm() function fits linear models. It is called as lm(y ~ x), where x, the independent variable, is used to predict values of y, the dependent variable.
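To see the formula interface in isolation, here is a minimal sketch on made-up data (x, y, and toy_fit are hypothetical and unrelated to the marketing file). Because y is constructed as exactly 2 * x + 1, lm() should recover an intercept of 1 and a slope of 2.

```r
# Hypothetical data in which y is exactly 2 * x + 1
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1

# Dependent variable on the left of ~, independent variable on the right
toy_fit <- lm(y ~ x)

# The fitted intercept and slope should be (numerically) 1 and 2
coef(toy_fit)
```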

In the simple linear regression model below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model.

#Simple Linear Regression - the R function lm()
reg <- lm(sales ~ radio)

#Summary of Model
summary(reg)
## 
## Call:
## lm(formula = sales ~ radio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1732.85  -198.88    62.64   415.26   637.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9741.92    1362.94  -7.148 1.17e-06 ***
## radio         347.69      17.83  19.499 1.49e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared:  0.9548, Adjusted R-squared:  0.9523 
## F-statistic: 380.2 on 1 and 18 DF,  p-value: 1.492e-13

As indicated by the summary report, the intercept is -9741.92 and the slope for radio is 347.69. We can therefore write the linear regression model predicting sales from radio as Sales_predicted = -9741.92 + 347.69 * Xradio. Given this equation, we can predict sales for any given value of radio, for example 75 (investing $75,000 in radio ads).

### Linear model  ( Y = b + m*x ) 
### Sales_predicted  ~ Radio_X1

sales_predicted = -9741.92 + 347.69 * (75)
sales_predicted
## [1] 16334.83
### Another way to write the equation is to refer to the coefficients of the equation instead of typing the actual values out.
### This is demonstrated below.

sales_predicted = coef(reg)[1] + coef(reg)[2] * 75
coef(reg)[1] # intercept
## (Intercept) 
##   -9741.921
coef(reg)[2] # slope
##    radio 
## 347.6888
sales_predicted
## (Intercept) 
##    16334.74
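The small gap between 16334.83 and 16334.74 comes from typing rounded coefficients in the first calculation, and the "(Intercept)" label on the second result is simply inherited from coef(reg)[1]. R's predict() function is an alternative that avoids both issues. The sketch below is an illustration on the six rows shown by head(mydata) only (sample_rows and sample_fit are assumed names), so its numbers will not match the full-data model; the point is that predict() and the manual formula agree.

```r
# Six rows copied from the head(mydata) output; not the full data set
sample_rows <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70)
)
sample_fit <- lm(sales ~ radio, data = sample_rows)

# Manual prediction: intercept + slope * 75, with unname() dropping the "(Intercept)" label
manual <- unname(coef(sample_fit)[1] + coef(sample_fit)[2] * 75)

# predict() takes new predictor values in a data frame with matching column names
via_predict <- unname(predict(sample_fit, newdata = data.frame(radio = 75)))

# Both routes give the same number
all.equal(manual, via_predict)
```

Note that predict() needs the model to be fitted with a data argument (or with variables it can find by name), which is why the sketch uses lm(sales ~ radio, data = sample_rows).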
1C) Repeat the above calculations to derive the linear regression model for Sales versus TV.
#Plot of TV and Sales
scatter.smooth(tv,sales)

#Simple Linear Regression (Sales vs. TV)
lregm <- lm(sales ~ tv)

#Summary of Model
summary(lregm)
## 
## Call:
## lm(formula = sales ~ tv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1921.87  -412.24     7.02   581.59  1081.61 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42229.21    4164.12  -10.14 7.19e-09 ***
## tv             221.10      15.61   14.17 3.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 771.3 on 18 degrees of freedom
## Multiple R-squared:  0.9177, Adjusted R-squared:  0.9131 
## F-statistic: 200.7 on 1 and 18 DF,  p-value: 3.336e-11
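### Linear model  ( Y = b + m*x ) 
### Note: tv = 75 is reused from the radio example. It lies far below the tv values in the rows shown above (250 to 270), so the negative prediction below is an extrapolation.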

sales_predicted = -42229.21 + 221.10 * (75)
sales_predicted
## [1] -25646.71
sales_predicted = coef(lregm)[1] + coef(lregm)[2] * 75
coef(lregm)[1] # intercept
## (Intercept) 
##   -42229.21
coef(lregm)[2] # slope
##       tv 
## 221.1043
sales_predicted
## (Intercept) 
##   -25646.39
1D) Write down the equation for the linear regression model. Note the values for the intercept and the slope.

Linear regression model: Y = -42229.21 + 221.1043 * Xtv

Intercept: -42229.21

Slope: 221.1043

A high R-squared value indicates that the model fits the data well, though not perfectly. For the case of sales versus radio, we will overlay the trend line representing the regression equation on the original plot. This shows how far the predictions are from the actual values. The difference between the actual sales (circles) and the predicted sales (solid line) is captured in the residuals reported by the summary function.

#Plot Radio and Sales 
plot(radio,sales)

#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2) 
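The link between the residuals and R-squared can be made explicit: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on the six rows shown by head(mydata) (sample_rows and sample_fit are assumed names, and the values will differ from the full-data model).

```r
# Six rows copied from the head(mydata) output; not the full data set
sample_rows <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70)
)
sample_fit <- lm(sales ~ radio, data = sample_rows)

# Residual = actual sales minus the value on the fitted line
res <- residuals(sample_fit)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(res^2)
ss_tot <- sum((sample_rows$sales - mean(sample_rows$sales))^2)
r_squared <- 1 - ss_res / ss_tot

# Matches the value reported by summary()
all.equal(r_squared, summary(sample_fit)$r.squared)
```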


Task 2: Multiple Linear Regression

Often more than one factor or variable affects the prediction of an outcome. While increased rainfall is a good predictor of increased crop supply, a decrease in herbivores can also result in an increase in crops. This idea is a loose metaphor for multiple linear regression.

In R, multiple linear regression takes the form lm(y ~ x0 + x1 + x2 + ... ), where y is the value being predicted (the dependent variable) and the x variables are the predictors (the independent variables).

Let’s create a multiple linear regression model predicting sales using both independent variables, radio and tv.

#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)

#Summary of Multiple Linear Regression Model
summary(mlr1)
## 
## Call:
## lm(formula = sales ~ radio + tv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1729.58  -205.97    56.95   335.15   759.26 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17150.46    6965.59  -2.462 0.024791 *  
## radio          275.69      68.73   4.011 0.000905 ***
## tv              48.34      44.58   1.084 0.293351    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared:  0.9577, Adjusted R-squared:  0.9527 
## F-statistic: 192.6 on 2 and 17 DF,  p-value: 2.098e-12

Note the values of 0.9577 for R-squared and 0.9527 for the Adj R-squared. The predicted sales can again be calculated given the coefficients of the regression model. The example below shows the predicted sales for TV = 270 and Radio = 75.

# Predicted sales from mlr1 for radio = 75 and tv = 270
sales_predicted = coef(mlr1)[1] + coef(mlr1)[2]*(75) + coef(mlr1)[3]*(270)
sales_predicted
## (Intercept) 
##     16578.3
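The same calculation generalizes to any number of predictors: multiply each coefficient by its predictor value, sum, and add the intercept. Below is a minimal sketch using the rounded mlr1 coefficients from the summary above; predict_from_coefs and mlr1_coefs are hypothetical names introduced for illustration. Because the coefficients are rounded, the result (16578.09) differs slightly from the 16578.3 obtained with full-precision coefficients.

```r
# Rounded coefficients copied from summary(mlr1) above
mlr1_coefs <- c(intercept = -17150.46, radio = 275.69, tv = 48.34)

# Intercept plus the sum of coefficient * value over all predictors.
# `values` must be in the same order as the non-intercept coefficients.
predict_from_coefs <- function(coefs, values) {
  unname(coefs[1] + sum(coefs[-1] * values))
}

predict_from_coefs(mlr1_coefs, c(radio = 75, tv = 270))
```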
2A) Create a multiple linear regression model for each of the following, and display the summary statistics.
#mlr2 = Sales predicted by radio, tv, and pos
mlr2 <-lm(sales ~ radio + tv + pos)

#Summary of Multiple Linear Regression Model
summary(mlr2)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1748.20  -187.42   -61.14   352.07   734.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -15491.23    7697.08  -2.013  0.06130 . 
## radio          291.36      75.48   3.860  0.00139 **
## tv              38.26      48.90   0.782  0.44538   
## pos           -107.62     191.25  -0.563  0.58142   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared:  0.9585, Adjusted R-squared:  0.9508 
## F-statistic: 123.3 on 3 and 16 DF,  p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 <-lm(sales ~ radio + tv + pos + paper)
  
#Summary of Multiple Linear Regression Model
summary(mlr3)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1558.13  -239.35     7.25   387.02   728.02 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -13801.015   7865.017  -1.755  0.09970 . 
## radio          294.224     75.442   3.900  0.00142 **
## tv              33.369     49.080   0.680  0.50693   
## pos           -128.875    192.156  -0.671  0.51262   
## paper           -9.159      8.991  -1.019  0.32449   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared:  0.9612, Adjusted R-squared:  0.9509 
## F-statistic: 92.96 on 4 and 15 DF,  p-value: 2.13e-10
2B) Write down the regression equation and the values for R-Squared and Adj R-Squared for each of the three cases mlr1, mlr2, and mlr3.

mlr1

Equation: Ysales = -17150.46 + 275.69(Xradio) + 48.34(Xtv)

R-squared: 0.9577 Adjusted R-squared: 0.9527

mlr2

Equation: Ysales = -15491.23 + 291.36(Xradio) + 38.26(Xtv) - 107.62(Xpos)

R-squared: 0.9585 Adjusted R-squared: 0.9508

mlr3

Equation: Ysales = -13801.015 + 294.224(Xradio) + 33.369(Xtv) - 128.875(Xpos) - 9.159(Xpaper)

R-squared: 0.9612 Adjusted R-squared: 0.9509

2C) Based solely on the values for R-Squared and Adj R-Squared, which of the three multiple linear regression models mlr1, mlr2, mlr3 is best at predicting sales? Explain why.

Model mlr1 is the best at predicting sales because it has the highest adjusted R-squared, 0.9527, compared with 0.9508 for mlr2 and 0.9509 for mlr3. The R-squared values are less helpful here because R-squared never decreases when a variable is added, regardless of whether that variable genuinely improves the model. For instance, even though we know from previous labs that paper is not a very efficient predictor of sales, the R-squared still increases when it is added to the model. The adjusted R-squared, on the other hand, penalizes each additional predictor, so it rises only when a new variable improves the fit by more than would be expected by chance. It is therefore more revealing about which model is best at prediction.
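The penalty can be seen in the standard formula, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies it to the R-squared values reported above. It assumes n = 20, which is inferred from the 18 residual degrees of freedom of the one-predictor models rather than stated in the lab; adjusted_r_squared is a hypothetical helper name.

```r
# Standard adjusted R-squared formula
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# Assumption: n = 20 observations (18 residual df + intercept + 1 predictor)
n_obs <- 20

# R-squared values and predictor counts copied from the three summaries above
r2 <- c(mlr1 = 0.9577, mlr2 = 0.9585, mlr3 = 0.9612)
n_predictors <- c(mlr1 = 2, mlr2 = 3, mlr3 = 4)

adj <- adjusted_r_squared(r2, n_obs, n_predictors)
round(adj, 4)

# Model with the highest adjusted R-squared
names(which.max(adj))
```

The computed values agree with the reported 0.9527, 0.9508, and 0.9509 to within the rounding of the R-squared inputs, and mlr1 comes out highest even though its plain R-squared is the lowest of the three.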

2D) Calculate the predicted sales for each of the three models given that Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75. Compare your predicted sales to the actual value as obtained from the data file. Which model is best at predicting the actual sales value? Quantify your answer by calculating the error squared. Compare your results to 2C) and share any observations.
#mlr1
sales_predicted = coef(mlr1)[1] + coef(mlr1)[2] * (69) + coef(mlr1)[3] * (255)
sales_predicted
## (Intercept) 
##    14199.05
#mlr2
sales_predicted = coef(mlr2)[1] + coef(mlr2)[2]*(69) + coef(mlr2)[3]*(255) + coef(mlr2)[4]*(1.5)
sales_predicted
## (Intercept) 
##    14208.44
#mlr3
sales_predicted = coef(mlr3)[1] + coef(mlr3)[2]*(69) + coef(mlr3)[3]*(255) + coef(mlr3)[4]*(1.5) + coef(mlr3)[5]*(75)
sales_predicted
## (Intercept) 
##    14129.32
# Error squared: (predicted sales - actual sales)^2, with actual sales = 13965 (row 5 of the data)
#mlr1
error_squared_mlr1 = (14199.05 - 13965)^2
error_squared_mlr1
## [1] 54779.4
#mlr2
error_squared_mlr2 = (14208.44 - 13965)^2
error_squared_mlr2
## [1] 59263.03
#mlr3
error_squared_mlr3 = (14129.32 - 13965)^2
error_squared_mlr3
## [1] 27001.06

The model that is best at predicting the actual sales value is mlr3. Its predicted value of 14129.32 is closer to the actual value of 13965 (from the marketing.csv file) than the predictions of the other models (mlr1 and mlr2), so its squared error of 27001.06 is also the lowest. This differs from 2C), where mlr1 had the highest adjusted R-squared. Because this comparison rests on a single observation, our decision should not be based on the squared error alone, but should consider other measures as well.
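The three squared errors can also be computed in one vectorized step, which makes the comparison easier to read. This is a small sketch using the predicted values reported above; the variable names predicted, actual, and squared_error are assumptions for illustration.

```r
# Predicted sales copied from the three model outputs above
predicted <- c(mlr1 = 14199.05, mlr2 = 14208.44, mlr3 = 14129.32)

# Actual sales for radio = 69, tv = 255, pos = 1.5, paper = 75 (row 5 of the data)
actual <- 13965

# Squared error for every model at once
squared_error <- (predicted - actual)^2
squared_error

# Model with the smallest squared error
names(which.min(squared_error))
```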


Task 3: Watson Analytics Predict

Watson Analytics contains a Predict module, in addition to the Explore and Assemble modules. Although Watson is a useful tool, it should never be treated as a black box: there are many circumstances where Watson can fail to make the proper diagnosis or suggest the correct remedies. This is why it is critical, as more cognitive analytics and advanced machine learning tools enter our environment, that we also allow room for human intervention and learn how best to interact with them.

There are many details in the Predict module. We will keep it relatively simple for this introductory phase. To complete this task, follow the directions below. Make sure to capture any screenshots and to answer any questions where requested.

  1. Log on to your Watson Analytics account at http://watsonanalytics.com
  2. Upload the file marketing.csv unless already in your folder
  3. Select the Predict module to analyze the data

You will be prompted to select the data set and to enter a name for your workbook. By default the target to predict is correctly set to Sales.

3A) Capture a screenshot of the top predictors of sales. Consider the one-field predictive model only. Note the predictive power strengths.

Top predictors of sales: the predictive strength of radio is 93%. The predictive strength of tv is 86%.

Note that Watson follows a different approach to identify and quantify the predictive strength of each variable. Such details are outside the scope of this exercise. Watson provides a lot of on-line help, and the interested reader is welcome to explore the many capabilities of Watson on their own.

3B) How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain.

The results from Watson’s predictive model are consistent with my findings from the R regression analysis in Task 2. According to Watson, the strongest predictor of sales is radio, with a predictive strength of 93%, followed by tv, with a predictive strength of 86%. In other words, sales are driven by radio and tv. Similarly, the best model in R was the first model, mlr1, which had the highest adjusted R-squared and contained only radio and tv, the best predictors. Thus, Watson reconciles well with my findings from the R regression analysis.