Task 1: Correlation Analysis
1A) Read the csv file into R Studio and display the dataset.
1B) Create a correlation table to compare the correlations between all variables. Remove any variables for which the correlation is irrelevant or inaccurate.
- Commands: cor() mydata[ -c("COLUMN_NAME OR COLUMN_NUMBER") ]
#corr = cor( MYDATA )
#corr
corr = cor(mydata [ -c (1) ])
corr
                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000
1C) Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations, list the pairs.
The value “1.0” runs down the diagonal because each diagonal cell pairs a variable with itself. For example, the first “1.0” appears where TV is matched with TV, and any variable is perfectly correlated with itself. The pairs with the strongest correlations are TV and sales (0.78) and radio and sales (0.58).
1D-a) Identify the dependent variable (y) and one independent variable (x_i), using the correlation table to pick a variable whose coefficient is greater than 0.20 and lower than 0.60. Use those two variables to create a scatterplot to visualize the data. Note any patterns or relationship between the two variables.
- Commands: qplot( x = VARIABLE, y = VARIABLE, data = mydata)
qplot( x = radio, y = sales, data = mydata)

Task 2: Regression Analysis
- To create a regression model we use the function lm(), such as lm( y ~ x )
- Where the independent variable or variables [ x_1, x_2, … x_i ], predict the values of the dependent variable y.
2A) Create a linear regression model. Identify the dependent variable (y) and, for the independent variable (x_i), use the correlation table to pick a variable with a coefficient greater than 0.20 and lower than 0.60 (the same variables as in 1D-a).
#Simple Linear Regression Model
#reg <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE )
reg = lm( sales ~ radio )
2B) Use the regression model to create a report. Note the R-Squared and Adjusted R-Squared values, determine if this is a good or bad fit for your data?
- Commands: Use the summary() function to create a report for the linear model
#Summary of Simple Linear Regression Model
#summary(MODEL)
summary(reg)
Call:
lm(formula = sales ~ radio)
Residuals:
Min 1Q Median 3Q Max
-15.7305 -2.1324 0.7707 2.7775 8.1810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.31164 0.56290 16.542 <2e-16 ***
radio 0.20250 0.02041 9.921 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
The R-squared value for this model is 0.332 and the adjusted R-squared is 0.3287, so radio alone explains only about 33% of the variation in sales. As a rough rule of thumb, R-squared values below 0.5 indicate a weak fit, which is the case here, even though the radio coefficient itself is highly significant (p < 2e-16).
2C) Create a plot for the dependent (y) and independent (x) variables. Note any patterns or relationship between the two variables and describe the trend line.
- The trend line will show how far the predictions are from the actual value
- The distance from the actual versus the predicted is the residual
#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()
p <- qplot( x = radio, y = sales, data = mydata) + geom_point()
#Add a trend line to the plot using a linear model
#p + geom_smooth(method = "lm", formula = y ~ x)
p + geom_smooth(method = "lm", formula = y ~ x)

We can see from the trend line that we just created that there is a positive correlation or relationship between radio and sales. From the positive slope of the line, we know that as one variable increases, so does the other.
2D-b) Create a Multiple Linear Regression Model using all relevant independent variables and the dependent variable. Note the R-Squared and Adjusted R-Squared values, determine if this is a good or bad fit for your data?
mlr2 = lm(sales ~ radio + TV + newspaper, data=mydata)
summary(mlr2)
Call:
lm(formula = sales ~ radio + TV + newspaper, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
This model gives us an R-squared value of 0.8972 and an adjusted R-squared of 0.8956. These values are much higher than those of the first model, so this model is also a good fit for the data.
Based purely on the values for R-Squared and Adjusted R-Squared, which linear regression model is best in predicting the dependent variable? Explain why
The second and third models are extremely close to each other; however, the second linear regression model is slightly better at predicting the dependent variable. Both have the same R-squared value of 0.8972, but the second model has a slightly higher adjusted R-squared: 0.8962 compared to 0.8956. Adjusted R-squared penalizes each additional predictor, and newspaper adds almost no explanatory power (its coefficient is not significant, p = 0.86), so the third model scores lower. The difference is very small, but it is enough to favor the second model.
2E) Use the three different models to predict the dependent variable for the given values of the independent variables.
- Variable: Radio = 69
- Variable: TV = 255
- Variable: newspaper = 75
MODEL 1
sales_predicted1 = 9.31164 + (0.20250) * (radio)
sales_predicted1
[1] 23.28414
MODEL 2
sales_predicted2 = 2.92110 + 0.18799 * (radio) + 0.04575 * (TV)
sales_predicted2
[1] 27.55866
MODEL 3
sales_predicted3 = 2.938889 + 0.188530 * (radio) + 0.045765 * (TV) - 0.001037 * (newspaper)
sales_predicted3
[1] 27.53976
Task 3: Watson Analysis
To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.
3A) Use the Predictive module to analyze the given data. Note any interesting patterns and add a screenshot of what you found.
knitr::include_graphics("imgs/screenshot1.png")

knitr::include_graphics("imgs/screenshot2.png")

These predictive modules show that TV and radio are both strong predictors of sales. I also found it interesting that the predictive strength of this model is 72%.
3B) Note the predictive power strength of the reported variables. Consider the one-field predictive model only, describe your findings, and add a screenshot.
knitr::include_graphics("imgs/screenshot3.png")

This graphic once again shows that TV and radio are the strongest drivers of sales; combined, they account for 94% of sales. TV drives sales by 59% and radio drives sales by 32%. This aligns with the regression models we used earlier, which also show that TV was the strongest driver of sales.
3C) How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how.
As mentioned above, the Watson results support our findings from the R regression analysis. Watson Analytics showed that TV and radio are the strongest drivers of sales, and the regression model built on TV and radio produced the highest adjusted R-squared value, indicating that these two variables are the best predictors of sales.
---
title: "Predictive Analytics"
author: "Cameron Gerhart"
date: "March 6, 2018"
output:
  html_notebook: default
  html_document: default
subtitle: CME Group Foundation Business Analytics Lab
---

-------------

## Notebook Instructions

-------------

* For your assignment you may be using different dataset than what is included here. 

* Always read carefully the instructions on Sakai.  

* Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.

### About

* In a given year, if it rains more, we may see an increase in crop production, because more water may lead to more plants.

* This is a direct relationship; the number of fruits may be predicted from the amount of rainfall in a certain year.

* This example represents simple linear regression, an extremely useful concept that allows us to predict the values of one variable based on another variable.

* This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.


### Load Packages in R/RStudio 

We are going to use tidyverse, a collection of R packages designed for data science.

* Info: https://www.tidyverse.org/

```{r, echo = FALSE}

# Here we are checking if the package is installed
if(!require("tidyverse")){
  
  # If the package is not in the system then it will be installed
  install.packages("tidyverse", dependencies = TRUE)
  
  # Here we are loading the package
  library("tidyverse")
}

# Here we are checking if the package is installed
if(!require("plotly")){
  
  # If the package is not in the system then it will be installed
  install.packages("plotly", dependencies = TRUE)
  
  # Here we are loading the package
  library("plotly")
}

```

-------------

## Task 1: Correlation Analysis

-------------

### 1A) Read the csv file into R Studio and display the dataset. 

* Name your dataset 'mydata' so it is easy to work with.

* Commands: read_csv() rename() head()

#### Extract the assigned features (columns) to perform some analytics. 

```{r}

mydata = read.csv(file="data/Advertising.csv")
head(mydata)
```

```{r}
sales = mydata$sales
radio = mydata$radio
newspaper = mydata$newspaper
TV = mydata$TV
```

### 1B) Create a correlation table to compare the correlations between all variables. Remove any variables for which the correlation is irrelevant or inaccurate.

* Commands: cor() mydata[ -c("COLUMN_NAME OR COLUMN_NUMBER") ]

```{r}

#corr = cor( MYDATA )
#corr

# Drop the first column (assumed here to be a row index, not a real variable) before correlating
corr = cor(mydata[-1])
corr
```

### 1C) Why is the value "1.0" down the diagonal? Which pairs seem to have the strongest correlations, list the pairs. 
The value "1.0" runs down the diagonal because each diagonal cell pairs a variable with itself. For example, the first "1.0" appears where TV is matched with TV, and any variable is perfectly correlated with itself. The pairs with the strongest correlations are TV and sales (0.78) and radio and sales (0.58).
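
The ranking of pairs can also be checked programmatically. The following is a minimal sketch that rebuilds the correlation matrix from the values printed in 1B (so it does not depend on the csv file) and sorts the off-diagonal pairs; the names `corr_printed`, `x_demo`, and `pairs_ranked` are illustrative and not part of the lab template.

```{r}
# Correlation values copied from the table printed in 1B
vars <- c("TV", "radio", "newspaper", "sales")
corr_printed <- matrix(
  c(1.00000000, 0.05480866, 0.05664787, 0.78222442,
    0.05480866, 1.00000000, 0.35410375, 0.57622257,
    0.05664787, 0.35410375, 1.00000000, 0.22829903,
    0.78222442, 0.57622257, 0.22829903, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Any variable is perfectly correlated with itself (made-up numbers)
x_demo <- c(2, 4, 7, 11)
cor(x_demo, x_demo)

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(corr_printed), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(corr_printed)[idx[, "row"]],
  var2 = colnames(corr_printed)[idx[, "col"]],
  r    = corr_printed[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
pairs_ranked
```

With the live dataset, the same ranking could be applied to `corr` directly instead of the hand-copied matrix.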


### 1D-a) Identify the dependent variable (y) and one independent variable (x_i), using the correlation table to pick a variable whose coefficient is greater than 0.20 and lower than 0.60. Use those two variables to create a scatterplot to visualize the data. Note any patterns or relationship between the two variables.

* Commands: qplot( x = VARIABLE, y = VARIABLE, data = mydata)

```{r}
qplot( x = radio, y = sales, data = mydata)
```


### 1D-b) Create a 3D scatterplot of the dependent variable and the two variables most strongly correlated with it. Note any patterns and the coordinates (x,y,z) of the three points with the highest values.

```{r}

#p <- plot_ly(mydata, x = ~VARIABLE_1, y = ~VARIABLE_2, z = ~VARIABLE_3, marker = list(size = 5)) %>%
#  add_markers()
#p


p = plot_ly(mydata, x = ~radio, y = ~TV, z = ~sales, marker = list(size = 5)) %>%
  add_markers() 
p
```


-------------

## Task 2: Regression Analysis

-------------

* To create a regression model we use the function lm(), such as lm( y ~ x )
* Where the independent variable or variables [ x_1, x_2, ... x_i ], predict the values of the dependent variable y.

### 2A) Create a linear regression model. Identify the dependent variable (y) and, for the independent variable (x_i), use the correlation table to pick a variable with a coefficient greater than 0.20 and lower than 0.60 (the same variables as in 1D-a).

* Commands: lm( y ~ x ) 

```{r}
#Simple Linear Regression Model

#reg <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE )

reg = lm( sales ~ radio )
```


### 2B) Use the regression model to create a report. Note the R-Squared and Adjusted R-Squared values, determine if this is a good or bad fit for your data?

* Commands: Use the summary() function to create a report for the linear model

```{r}
#Summary of Simple Linear Regression Model

#summary(MODEL)

summary(reg)
```
The R-squared value for this model is 0.332 and the adjusted R-squared is 0.3287, so radio alone explains only about 33% of the variation in sales. As a rough rule of thumb, R-squared values below 0.5 indicate a weak fit, which is the case here, even though the radio coefficient itself is highly significant (p < 2e-16).
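
For a regression with a single predictor, R-squared is simply the squared correlation between x and y, which ties this result back to the table in 1B. The sketch below checks that with the printed radio/sales correlation and then confirms the identity on R's built-in `cars` dataset, used here only as a stand-in because it ships with R; `r_radio_sales` and `cars_fit` are illustrative names.

```{r}
# Correlation between radio and sales, copied from the table in 1B
r_radio_sales <- 0.57622257

# Squaring it gives approximately the Multiple R-squared reported by summary(reg)
r_radio_sales^2

# Same identity on a built-in dataset (stand-in data, not the advertising data)
cars_fit <- lm(dist ~ speed, data = cars)
all.equal(summary(cars_fit)$r.squared, cor(cars$speed, cars$dist)^2)
```

This identity holds only for simple linear regression; with several predictors, R-squared is no longer the square of any single pairwise correlation.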

### 2C) Create a plot for the dependent (y) and independent (x) variables. Note any patterns or relationship between the two variables and describe the trend line.

* The trend line will show how far the predictions are from the actual value
* The distance from the actual versus the predicted is the residual

```{r}

#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()

p <- qplot( x = radio, y = sales, data = mydata) + geom_point()

#Add a trend line to the plot using a linear model
#p + geom_smooth(method = "lm", formula = y ~ x)

p + geom_smooth(method = "lm", formula = y ~ x)
```
We can see from the trend line that we just created that there is a positive correlation or relationship between radio and sales. From the positive slope of the line, we know that as one variable increases, so does the other. 
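
To make the idea of a residual concrete, here is a small sketch on made-up numbers (the `toy` data frame is hypothetical and unrelated to the advertising data). It fits a line, then computes each residual as the actual value minus the value predicted by the trend line.

```{r}
# Hypothetical data: five (x, y) points that lie roughly on a line
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

toy_fit <- lm(y ~ x, data = toy)

# Predicted values on the trend line, and the residuals (actual - predicted)
toy$predicted <- unname(fitted(toy_fit))
toy$residual  <- toy$y - toy$predicted
toy

# The hand-computed residuals match the ones stored in the model
all.equal(unname(residuals(toy_fit)), toy$residual)

# With an intercept in the model, least-squares residuals sum to (numerically) zero
sum(toy$residual)
```

In the sales plot, each vertical gap between a point and the fitted line is a residual of this kind.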

### 2D-a) Create a Multiple linear regression model and summary report using the two strongest correlated variables and the dependent variable. Note the R-Squared and Adjusted R-Squared values, determine if this is a good or bad fit for your data? Compare this model to the previous model, which model is better? 

* Sometimes, one variable is very good at predicting another variable. But most times, more than one factor affects the prediction of another variable.
* While increased rainfall is a good predictor of increased crop supply, decreased herbivores can also result in an increase of crops. 
* This idea is a loose metaphor for multiple linear regression. 

* Multiple linear regression lm(y ~ x_0 + x_1 + x_2 + ... x_i )
* Where y is the predicted/dependent variable and the x variables are the predictors/independent variable 

* commands: lm( y ~ x_1 + x_2 ) summary( reg_model )

```{r}

#Multiple Linear Regression Model
#mlr1 <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE1 + INDEPENDENT_VARIABLE2 )
mlr1 = lm( sales ~ radio + TV, data = mydata )

#Summary of Multiple Linear Regression Model
summary(mlr1)
```
The R-squared value for this model is 0.8972 and the adjusted R-squared value is 0.8962. These values are higher and much closer to 1 than the previous values, which indicates that this model fits the data better than the previous one.

### 2D-b) Create a Multiple Linear Regression Model using all relevant independent variables and the dependent variable. Note the R-Squared and Adjusted R-Squared values, determine if this is a good or bad fit for your data?
```{r}
mlr2 = lm(sales ~ radio + TV + newspaper, data=mydata)
summary(mlr2)
```
This model gives us an R-squared value of 0.8972 and an adjusted R-squared of 0.8956. These values are much higher than those of the first model, so this model is also a good fit for the data.

#### Based purely on the values for R-Squared and Adjusted R-Squared, which linear regression model is best in predicting the dependent variable? Explain why
The second and third models are extremely close to each other; however, the second linear regression model is slightly better at predicting the dependent variable. Both have the same R-squared value of 0.8972, but the second model has a slightly higher adjusted R-squared: 0.8962 compared to 0.8956. Adjusted R-squared penalizes each additional predictor, and newspaper adds almost no explanatory power (its coefficient is not significant, p = 0.86), so the third model scores lower. The difference is very small, but it is enough to favor the second model.
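
The gap between the two adjusted values follows from the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies it to the rounded R-squared of 0.8972 with n = 200 observations (196 residual degrees of freedom plus 4 estimated coefficients in the third model); `adj_r_squared` is a helper written for this illustration, not a built-in R function.

```{r}
# Adjusted R-squared for a model with p predictors fitted to n observations
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 200  # 196 residual df + 4 coefficients, from summary(mlr2)

# Same (rounded) R-squared, different number of predictors
adj_two   <- adj_r_squared(0.8972, n_obs, p = 2)  # radio + TV
adj_three <- adj_r_squared(0.8972, n_obs, p = 3)  # radio + TV + newspaper

round(c(two_predictors = adj_two, three_predictors = adj_three), 4)
```

Because the input R-squared is rounded to four decimals, the results are approximate, but they come out at about 0.8962 and 0.8956, matching the two summary reports.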

### 2E) Use the three different models to predict the dependent variable for the given values of the independent variables.

* Variable: Radio = 69
* Variable: TV = 255
* Variable: newspaper = 75

**MODEL 1**
```{r}
# Fitted equation from summary(reg), evaluated at radio = 69
sales_predicted1 = 9.31164 + 0.20250 * 69
sales_predicted1
```

**MODEL 2**
```{r}
# Fitted equation from summary(mlr1), evaluated at radio = 69 and TV = 255
sales_predicted2 = 2.92110 + 0.18799 * 69 + 0.04575 * 255
sales_predicted2
```

**MODEL 3**
```{r}
# Fitted equation from summary(mlr2), evaluated at radio = 69, TV = 255, newspaper = 75
sales_predicted3 = 2.938889 + 0.188530 * 69 + 0.045765 * 255 - 0.001037 * 75
sales_predicted3
```
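
Typing coefficients by hand invites transcription errors, so it can be safer to let R evaluate the model. The sketch below shows the general pattern with `predict()` on the built-in `mtcars` data, used only as a stand-in so the example runs without the csv file; `demo_fit` and `new_obs` are illustrative names.

```{r}
# Stand-in model on built-in data (not the advertising data)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# New observation; column names must match the predictor names in the formula
new_obs <- data.frame(wt = 3, hp = 120)

# Prediction computed by R from the stored coefficients
pred_auto <- predict(demo_fit, newdata = new_obs)

# The same prediction written out by hand from coef()
b <- coef(demo_fit)
pred_manual <- b["(Intercept)"] + b["wt"] * 3 + b["hp"] * 120

all.equal(unname(pred_auto), unname(pred_manual))
```

Applied to this lab, the equivalent call for the third model would presumably be `predict(mlr2, newdata = data.frame(radio = 69, TV = 255, newspaper = 75))`, which uses the full-precision coefficients rather than the rounded ones from the summary.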


-------------

## Task 3: Watson Analysis

-------------

To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked. 

### 3A) Use the Predictive module to analyze the given data. Note any interesting patterns and add a screenshot of what you found.

```{r}
knitr::include_graphics("imgs/screenshot1.png") 
```
```{r}
knitr::include_graphics("imgs/screenshot2.png")
```
These predictive modules show that TV and radio are both strong predictors of sales. I also found it interesting that the predictive strength of this model is 72%. 

### 3B) Note the predictive power strength of the reported variables. Consider the one-field predictive model only, describe your findings, and add a screenshot.

```{r}
knitr::include_graphics("imgs/screenshot3.png")
```
This graphic once again shows that TV and radio are the strongest drivers of sales; combined, they account for 94% of sales. TV drives sales by 59% and radio drives sales by 32%. This aligns with the regression models we used earlier, which also show that TV was the strongest driver of sales.

### 3C) How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how.
As mentioned above, the Watson results support our findings from the R regression analysis. Watson Analytics showed that TV and radio are the strongest drivers of sales, and the regression model built on TV and radio produced the highest adjusted R-squared value, indicating that these two variables are the best predictors of sales.
