Task 1

First, read in the marketing data that was used in the previous lab. It can be found in the downloaded folder bsad_lab5. Make sure to view the first few rows to confirm the data was read in correctly.

mydata = read.csv(file="data/marketing.csv")
head(mydata)
##   case_number sales radio paper  tv pos
## 1           1 11125    65    89 250 1.3
## 2           2 16121    73    55 260 1.6
## 3           3 16440    74    58 270 1.7
## 4           4 16876    75    82 270 1.3
## 5           5 13965    69    75 255 1.5
## 6           6 14999    70    71 255 2.1
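
If anything looks off, the structure of the data frame can also be inspected directly; this is an optional sanity check, not a required part of the lab:

str(mydata)      # column names, types, and number of rows
summary(mydata)  # basic descriptive statistics for each column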

Next, apply the cor() function to the data to produce a correlation matrix. This is a quick way to compare the pairwise correlations between all variables.

Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations? Answer and list the pairs below the matrix. >>>The value “1.0” runs down the diagonal because each diagonal entry is the correlation of a variable with itself, which is always exactly 1. The strongest pairs are sales and radio (0.977), radio and tv (0.966), and sales and tv (0.958).

cor(mydata)
##             case_number      sales       radio       paper          tv
## case_number   1.0000000  0.2402344  0.23586825 -0.36838393  0.22282482
## sales         0.2402344  1.0000000  0.97713807 -0.28306828  0.95797025
## radio         0.2358682  0.9771381  1.00000000 -0.23835848  0.96609579
## paper        -0.3683839 -0.2830683 -0.23835848  1.00000000 -0.24587896
## tv            0.2228248  0.9579703  0.96609579 -0.24587896  1.00000000
## pos           0.0539763  0.0126486  0.06040209 -0.09006241 -0.03602314
##                     pos
## case_number  0.05397630
## sales        0.01264860
## radio        0.06040209
## paper       -0.09006241
## tv          -0.03602314
## pos          1.00000000
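
The strongest pairs can also be ranked programmatically rather than read off by eye; a brief optional sketch that keeps only the lower triangle of the matrix and sorts by absolute correlation:

cc <- cor(mydata)
cc[upper.tri(cc, diag = TRUE)] <- NA          # keep each pair only once
idx <- which(!is.na(cc), arr.ind = TRUE)
ranked <- data.frame(var1 = rownames(cc)[idx[, 1]],
                     var2 = colnames(cc)[idx[, 2]],
                     r    = cc[idx])
ranked[order(-abs(ranked$r)), ][1:3, ]        # top three pairs by |r|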

From the matrix, it is clear that sales and radio have the strongest correlation. So, create a scatterplot of the two to visualize the relationship. Make sure to extract the columns into separate vectors first.

pos  = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
scatter.smooth(radio, sales)

From this plot, the points appear to follow an almost linear pattern. So, we will try to fit a simple linear regression model to the data.

The lm() function is a very useful one. It is called as lm(y ~ x), where the x variable, the independent variable, is used to predict values of the y variable, the dependent variable.

In the regression below, we are using radio ads to predict sales. We print out a summary to view the key statistics for the fitted linear model.

reg <- lm(sales ~ radio)
summary(reg)
## 
## Call:
## lm(formula = sales ~ radio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1732.85  -198.88    62.64   415.26   637.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9741.92    1362.94  -7.148 1.17e-06 ***
## radio         347.69      17.83  19.499 1.49e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared:  0.9548, Adjusted R-squared:  0.9523 
## F-statistic: 380.2 on 1 and 18 DF,  p-value: 1.492e-13

Report and interpret the R-Squared value below. >>>The Multiple R-Squared value is 0.9548, meaning about 95% of the variation in sales is explained by radio. The points lie close to the fitted line, and a few even touch it, which indicates a strong, but not perfect, linear relationship.
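
The reported value can also be pulled directly from the fitted model object rather than read off the printed summary; a brief optional check:

summary(reg)$r.squared       # multiple R-squared
summary(reg)$adj.r.squared   # adjusted R-squared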

Because the R-Squared value is so high, it indicates that the model is a good fit, though not a perfect one. We will overlay a trend line on the original plot. This will show how far the predictions are from the actual values. The distance between the actual value and the predicted value is the residual.

plot(radio,sales)
abline(reg,col="blue",lwd=2) 
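
The fitted values and residuals described above can also be inspected directly from the model object; a minimal optional sketch:

fitted(reg)                  # predicted sales for each observation
residuals(reg)               # actual sales minus predicted sales
plot(radio, residuals(reg))  # residuals should scatter randomly around zero
abline(h = 0, lty = 2)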

List some observations from this plot. >>>The relationship is positive: the points trend uphill from left to right. The points lie close to the line, and a few even rest on it.


Task 2

Sometimes one variable is very good at predicting another. More often, though, more than one factor affects the variable being predicted. While increased rainfall is a good predictor of increased crop supply, a decrease in herbivores can also result in an increase in crops. This idea is a loose analogy for multiple linear regression.

In R, multiple linear regression takes the form lm(y ~ x1 + x2 + ...), where y is the value being predicted, or the dependent variable, and the x variables are the predictors, or the independent variables.

Let's create a multiple linear regression model predicting sales using radio and tv.

mlr1 <- lm(sales ~ radio + tv)

summary(mlr1)
## 
## Call:
## lm(formula = sales ~ radio + tv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1729.58  -205.97    56.95   335.15   759.26 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17150.46    6965.59  -2.462 0.024791 *  
## radio          275.69      68.73   4.011 0.000905 ***
## tv              48.34      44.58   1.084 0.293351    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared:  0.9577, Adjusted R-squared:  0.9527 
## F-statistic: 192.6 on 2 and 17 DF,  p-value: 2.098e-12
# Hand prediction from the mlr1 coefficients for radio = 75 and tv = 270
predict.sales = -17150.46 + 275.69*(75) + 48.34*(270)
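
The same prediction can be obtained without retyping the coefficients by passing the new values to predict(); a brief optional sketch using the fitted mlr1 object:

predict(mlr1, newdata = data.frame(radio = 75, tv = 270))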

For mlr1, the R-Squared value is 0.9577 and the Adjusted R-Squared is 0.9527.

Create a Multiple Linear Regression Model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared:

#mlr2 = Sales predicted by radio, tv, and pos
mlr2 = lm(sales ~ radio + tv + pos) 
mlr2 
## 
## Call:
## lm(formula = sales ~ radio + tv + pos)
## 
## Coefficients:
## (Intercept)        radio           tv          pos  
##   -15491.23       291.36        38.26      -107.62
summary(mlr2)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1748.20  -187.42   -61.14   352.07   734.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -15491.23    7697.08  -2.013  0.06130 . 
## radio          291.36      75.48   3.860  0.00139 **
## tv              38.26      48.90   0.782  0.44538   
## pos           -107.62     191.25  -0.563  0.58142   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared:  0.9585, Adjusted R-squared:  0.9508 
## F-statistic: 123.3 on 3 and 16 DF,  p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 = lm(sales ~ radio + tv + pos + paper)
mlr3
## 
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
## 
## Coefficients:
## (Intercept)        radio           tv          pos        paper  
##  -13801.015      294.224       33.369     -128.875       -9.159
summary(mlr3)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1558.13  -239.35     7.25   387.02   728.02 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -13801.015   7865.017  -1.755  0.09970 . 
## radio          294.224     75.442   3.900  0.00142 **
## tv              33.369     49.080   0.680  0.50693   
## pos           -128.875    192.156  -0.671  0.51262   
## paper           -9.159      8.991  -1.019  0.32449   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared:  0.9612, Adjusted R-squared:  0.9509 
## F-statistic: 92.96 on 4 and 15 DF,  p-value: 2.13e-10

Based purely on the values for R-Squared and Adj R-Squared, which linear regression model is best at predicting sales? Explain why. >>>mlr1 (sales ~ radio + tv) is best, because its Adjusted R-Squared (0.9527) is the highest of the three models (mlr2: 0.9508, mlr3: 0.9509). Multiple R-Squared always increases when predictors are added, so the adjusted value is the fairer basis for comparison.
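
The three adjusted R-Squared values can also be collected side by side for comparison; a brief optional sketch assuming the model objects fitted above:

c(mlr1 = summary(mlr1)$adj.r.squared,
  mlr2 = summary(mlr2)$adj.r.squared,
  mlr3 = summary(mlr3)$adj.r.squared)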

# Linear regression prediction using the mlr3 coefficients
# Applied to the full columns, this returns a predicted value for every observation
sales_predicted = -13801.015 + 294.224*radio + 33.369*tv - 128.875*pos - 9.159*paper

# Applied to the single given values radio = 69, tv = 255, pos = 1.5, paper = 75
sales_predicted = -13801.015 + 294.224*69 + 33.369*255 - 128.875*1.5 - 9.159*75

sales_predicted
## [1] 14129.3

After deciding which model predicts sales best, we will confirm that it truly is the best model by predicting sales from given values of the independent variables.

Given that Radio = 69, TV = 255, POS = 1.5, and Paper = 75, calculate the predicted sales value for each of the three models above.
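
A compact way to check all three predictions is to pass the given values to predict() for each model; a brief sketch assuming the fitted objects mlr1, mlr2, and mlr3 from Task 2 (columns a model does not use are simply ignored):

new_obs <- data.frame(radio = 69, tv = 255, pos = 1.5, paper = 75)
predict(mlr1, newdata = new_obs)   # sales ~ radio + tv
predict(mlr2, newdata = new_obs)   # sales ~ radio + tv + pos
predict(mlr3, newdata = new_obs)   # sales ~ radio + tv + pos + paper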


Task 3

To complete the last task, follow the directions below. Make sure to take screenshots of the results obtained and of any questions asked, and attach them.

  1. Log on to your Watson Analytics account at watsonanalytics.com
  2. Upload the file marketing.csv unless already in your folder
  3. Use the Predictive module to analyze the data
  4. Note the predictive power strength of reported variables. Consider the one field predictive model only. >>>The one-field predictive model has a predictive strength score of 97.
  5. How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how. >>>According to Watson, TV and Radio had the strongest predictive power among all the variables. This agrees with the R regression analysis, because the model using TV and Radio (mlr1) has the highest adjusted R-squared.