---
title: "Business Analytics Lab Worksheet 05"
author: "Justice Lawson"
date: "8/2/2017"
output:
  html_notebook: default
  html_document: default
  pdf_document: default
subtitle: CME Group Foundation Business Analytics Lab
---

### About
In a given year, more rainfall may lead to higher crop production, since more water supports more plant growth. This is a direct relationship: crop yield might be predicted from the amount of rainfall in a given year. This example illustrates simple linear regression, an extremely useful technique that lets us predict the values of one variable based on another.

This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.

### Setup

Download the zip folder titled 'bsad_lab5' and extract it. Next, we must set this folder as the working directory: open RStudio, go to 'Session', scroll down to 'Set Working Directory', and click 'To Source File Location'. Now, follow the directions to complete the lab.
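
The working directory can also be set from the R console; the path below is only an example and should be replaced with wherever the bsad_lab5 folder was extracted:

```{r}
#Example only -- replace the path with the location of your extracted bsad_lab5 folder
#setwd("~/Downloads/bsad_lab5")
getwd()
```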

----------

### Task 1

First, read in the marketing data that was used in the previous lab. It can be found in the downloaded folder bsad_lab5. Make sure to view the file to ensure it was read in correctly. 

```{r}
#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
```
Next, apply the `cor()` function to the data to understand the correlations between variables. This is a great way to compare the correlations between all variables.

Why is the value "1.0" down the diagonal? Which pairs seem to have the strongest correlations? Answer and list the pairs below the matrix. 
```{r}
#Correlation Matrix
cor(mydata)
#Correlation matrix of the marketing variables only (drop case_number in column 1)
corr = cor( mydata[ 2:6 ] )
corr
```
The diagonal is 1.0 because each variable there is being correlated with itself, and a variable is always perfectly correlated with itself. The pair with the strongest correlation is radio and sales (about 0.977), followed by tv and radio (about 0.966) and tv and sales (about 0.958).


From the matrix, it's clear that sales and radio have the strongest correlation. So, create a scatterplot of the two to visualize the data. Make sure to extract the columns first.

```{r}
#Extract all variables
pos  = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio

pos
paper
tv
sales
radio

#Plot of Radio and Sales using plot command from Worksheet 4
plot(radio,sales)

#Smoothed scatterplots of each predictor against sales
scatter.smooth(radio,sales)
scatter.smooth(tv,sales)
scatter.smooth(paper,sales)
scatter.smooth(pos,sales)
```
From the radio and sales plot, the points fall in an almost linear pattern. So, we will try to fit a simple linear regression model to the data. 

The `lm()` function is a very useful one. It is set up as `lm(y ~ x)`, where the x variable (the independent variable) is used to predict values of the y variable (the dependent variable).

In the regression below, we are using radio ads to predict sales. We print out a summary to view the quantitative facts about the linear model. 

```{r}
#Simple Linear Regression: predict sales from radio
reg <- lm(sales ~ radio)

#Summary of Model
summary(reg)
```
Report and interpret the R-Squared value below. 

R-squared value: 0.9548
This means about 95% of the variation in sales is explained by radio spending, so the model fits the data very well, though not perfectly.
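
For a simple linear regression with one predictor, the R-squared is just the squared correlation between the two variables, so it can be sanity-checked directly:

```{r}
#R-squared for simple linear regression equals the squared correlation
cor(radio, sales)^2
```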

Because the R-Squared value is so high, it indicates that the model is a good fit, but not a perfect one. We will overlay a trend line on the original plot. This shows how far the predictions are from the actual values; the distance between an actual value and its predicted value is the residual.

```{r}
#Plot Radio and Sales 
plot(radio,sales)

#Add a trend line plot using the linear model we created above
abline(reg,col="blue",lwd=2)
```
List some observations from this plot. 

All of the points seem to be very close to the line, meaning there are no huge deviations or outliers in the data. 

(Note: the trend line only shows up when the model is fit as `lm(sales ~ radio)`, matching the plot's axes. If the formula is reversed to `lm(radio ~ sales)`, the intercept and slope are on the wrong scale for this plot and `abline()` draws the line far below the plotted range.)
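
The residuals themselves can be pulled straight from the fitted model; a minimal sketch:

```{r}
#Residuals: actual sales minus predicted sales
head(resid(reg))

#Plot residuals against radio; points far from the zero line are the largest prediction errors
plot(radio, resid(reg))
abline(h=0, col="red")
```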

----------

### Task 2

Sometimes, one variable is very good at predicting another variable. More often, though, several factors affect the variable we want to predict. While increased rainfall is a good predictor of increased crop supply, fewer herbivores can also result in more crops. This idea is a loose metaphor for multiple linear regression. 

In R, multiple linear regression takes the form `lm(y ~ x0 + x1 + x2 + ... )`, where y is the dependent variable being predicted and the x variables are the independent variables, or predictors.

Let's create a multiple linear regression model predicting sales from radio and tv. 

```{r}
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)

#Summary of Multiple Linear Regression Model
summary(mlr1)
```
For mlr1, the R-Squared value is 0.9577 and the Adj R-Squared is 0.9527. 

Create a Multiple Linear Regression Model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared: 
```{r}
#mlr2 = Sales predicted by radio, tv, and pos
mlr2 <-lm(sales ~ radio + tv + pos)
summary(mlr2)

#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 <-lm(sales ~ radio + tv + pos + paper)
summary(mlr3)

```
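The R-squared and Adjusted R-squared values can also be pulled straight out of each model's summary object rather than read off the printout; a small sketch:

```{r}
#Extract R-squared and Adjusted R-squared from each model summary
sapply(list(mlr1 = mlr1, mlr2 = mlr2, mlr3 = mlr3),
       function(m) c(r.squared = summary(m)$r.squared,
                     adj.r.squared = summary(m)$adj.r.squared))
```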
mlr2: Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508

mlr3: Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509

Based purely on the values for R-Squared and Adj R-Squared, which linear regression model is best at predicting sales? Explain why. 

Based on these values, mlr1 is arguably the best model for predicting sales. Its Adjusted R-squared (0.9527) is the highest of the three, and adding pos (mlr2) and paper (mlr3) raises the Multiple R-squared only slightly while the Adjusted R-squared, which penalizes extra predictors, does not improve. The extra variables add little predictive value.
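
These adjusted values can be checked by hand from the Multiple R-squared, the sample size (n = 20), and the number of predictors p in each model; a quick sketch using a small helper function (results match the summary output up to rounding of the R-squared inputs):

```{r}
#Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2(0.9577, 20, 2)  #mlr1, 2 predictors
adj_r2(0.9585, 20, 3)  #mlr2, 3 predictors
adj_r2(0.9612, 20, 4)  #mlr3, 4 predictors
```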

After deciding which model predicts sales best, we will confirm that it truly is the best model by predicting sales for a given set of values of the independent variables. 

Given that `Radio = 69` , `TV = 255` , `POS = 1.5`, and `Paper = 75`, calculate the predicted sales value for each of the three models above.

```{r}
#Predicted sales from mlr1 (coefficients from summary(mlr1))
smlr1 = -17150.46 + (69*275.69) + (255*48.34)
smlr1

#Predicted sales from mlr2 (coefficients from summary(mlr2))
smlr2 = -15491.23 + (69*291.36) + (255*38.26) + (1.5*-107.62)
smlr2

#Predicted sales from mlr3 (coefficients from summary(mlr3))
smlr3 = -13801.015 + (69*294.224) + (255*33.369) + (1.5*-128.875) + (75*-9.159)
smlr3
```
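
A less error-prone way to get the same numbers is to let R compute the predictions from the fitted models with `predict()`; a quick cross-check (columns a model does not use are simply ignored):

```{r}
#Cross-check the hand calculations with predict() on each fitted model
newvals <- data.frame(radio = 69, tv = 255, pos = 1.5, paper = 75)
predict(mlr1, newdata = newvals)
predict(mlr2, newdata = newvals)
predict(mlr3, newdata = newvals)
```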

----------

### Task 3

To complete the last task, follow the directions found below. Make sure to take screenshots and attach pictures of the results obtained and of any questions asked. 

  1. Log on to your Watson Analytics account at watsonanalytics.com
  2. Upload the file marketing.csv unless it is already in your folder
  3. Use the Predictive module to analyze the data
  4. Note the predictive power strength of reported variables. Consider the one field predictive model only.
  5. How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how.
  
![](data/wat1.png)
![](data/wat2.png)
![](data/wat3.png)

It is clear, just as we saw before in the R analysis, that the tv and radio inputs have a strong correlation with the sales results.
