Notebook Instructions


About

  • In a given year, if it rains more, we may see an increase in crop production, since more water may support more plant growth.

  • This is a direct relationship; the number of fruits may be predictable from the amount of rainfall in a given year.

  • This example represents simple linear regression, an extremely useful technique that allows us to predict the values of one variable based on another variable.

  • This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.

Load Packages in R/RStudio

We are going to use tidyverse, a collection of R packages designed for data science.
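The loading messages below suggest the packages were brought in with require(); a minimal sketch (versions on your machine may differ):

#Load the packages used throughout this lab
require(tidyverse) #data wrangling and plotting (ggplot2, dplyr, readr, ...)
require(plotly)    #interactive 3D scatterplots used later in this lab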

## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task 1: Correlation Analysis


1A) Read the CSV file into RStudio and display the dataset.

  • Name your dataset ‘mydata’ so it is easy to work with.

  • Commands: read_csv() rename() head()

Extract the assigned features (columns) to perform some analytics.

mydata = read.csv(file="data/Advertising.csv")
head(mydata)
##   X    TV radio newspaper sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2
mydata <- rename(mydata, "case_number" = "X")
head(mydata)
##   case_number    TV radio newspaper sales
## 1           1 230.1  37.8      69.2  22.1
## 2           2  44.5  39.3      45.1  10.4
## 3           3  17.2  45.9      69.3   9.3
## 4           4 151.5  41.3      58.5  18.5
## 5           5 180.8  10.8      58.4  12.9
## 6           6   8.7  48.9      75.0   7.2
sales = mydata$sales
TV = mydata$TV
radio = mydata$radio
news = mydata$newspaper

1B) Create a correlation table to compare the correlations between all variables. Remove any variables whose correlations are irrelevant or inaccurate.

  • Commands: cor() mydata[ -c(COLUMN_NUMBER) ]
#corr = cor( MYDATA )
#corr

corr = cor(mydata[ -c(1)])
corr
##                   TV      radio  newspaper     sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## radio     0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales     0.78222442 0.57622257 0.22829903 1.0000000

1C) Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations? List the pairs.

The value 1.0 runs down the diagonal because each diagonal entry is the correlation between a variable and itself, and every variable is perfectly correlated with itself. The strongest correlation is between TV and sales (0.78); the second strongest is between radio and sales (0.58).
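A quick sanity check in R:

#The correlation of a variable with itself is exactly 1
cor(sales, sales)
## [1] 1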

1D-a) Identify the dependent variable (y) and one independent variable (x_i): use the correlation table to find a variable with a coefficient greater than 0.20 and lower than 0.60. Use those two variables to create a scatterplot to visualize the data. Note any patterns or relation between the two variables.

  • Commands: qplot( x = VARIABLE, y = VARIABLE, data = mydata)
#Use radio because TV is above 0.60. Radio is 0.576.
qplot( x = radio, y = sales, data = mydata)

The upper edge of the point cloud is more sharply defined, where sales are highest for each level of radio spending. Overall it is a positive correlation, although the points are quite scattered.

1D-b) Create a 3D scatterplot between the two strongest correlated variables and the dependent variable. Note any patterns and the coordinates of the three points with the highest values (x, y, z).

#p <- plot_ly(mydata, x = ~VARIABLE_1, y = ~VARIABLE_2, z = ~VARIABLE_3, marker = list(size = 5)) %>%
#  add_markers()
#p

p <- plot_ly(mydata, x = ~TV, y = ~sales, z = ~radio, marker = list(size = 5)) %>%
  add_markers()

p

The highest-valued point is (276.9, 27, 48.9): TV = 276.9, radio = 48.9, sales = 27. While sales, TV, and radio do not always rise together, the highest and lowest sales tend to coincide with the highest and lowest radio and TV values. TV appears to have a large effect on sales: when TV is low, sales tend to be low even if the radio value is high. For example, see (0.7, 1.6, 39.6).
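One way to locate that highest point programmatically, a quick sketch using the columns already in mydata:

#Row with the maximum sales value, to verify the coordinates noted above
mydata[which.max(sales), c("TV", "radio", "sales")]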


Task 2: Regression Analysis


2A) Create a linear regression model by identifying the dependent variable (y); for the independent variable (x_i), use the correlation table to identify a variable with a coefficient greater than 0.20 and lower than 0.60 (the same variables as 1D-a).

  • Commands: lm( y ~ x )
#Simple Linear Regression Model

#reg <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE )
reg <- lm(sales ~ radio)

2B) Use the regression model to create a report. Note the R-Squared and Adjusted R-Squared values and determine whether this is a good or bad fit for your data.

  • Commands: Use the summary() function to create a report for the linear model
#Summary of Simple Linear Regression Model

#summary(MODEL)
summary(reg)
## 
## Call:
## lm(formula = sales ~ radio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7305  -2.1324   0.7707   2.7775   8.1810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.31164    0.56290  16.542   <2e-16 ***
## radio        0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

The R-Squared is 0.332 and the Adjusted R-Squared is 0.3287. These are not promising values: the model explains only about a third of the variance in sales, indicating a poor fit to the data.
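Both statistics can also be pulled directly from the summary object rather than read off the report:

#Extract the fit statistics programmatically
summary(reg)$r.squared     #0.332
summary(reg)$adj.r.squared #0.3287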

2C) Create a plot for the dependent (y) and independent (x) variables. Note any patterns or relation between the two variables and describe the trend line.

  • The trend line will show how far the predictions are from the actual value
  • The distance from the actual versus the predicted is the residual
#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()

p <- qplot( x = radio, y = sales, data = mydata) + geom_point()

#Add a trend line to the plot using a linear model
#p + geom_smooth(method = "lm", formula = y ~ x)
p + geom_smooth(method ="lm" , formula = y ~ x)

This does not look like a reliable model, as most of the points do not fall along the line. There does appear to be a positive relationship, even though the data does not fit the line well.
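Since the residual is the distance between the actual and predicted values, the residuals themselves can be inspected directly; a quick sketch:

#Residuals: actual sales minus the model's fitted values
res <- sales - fitted(reg)
head(res) #same values as head(residuals(reg))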

2D-a) Create a multiple linear regression model and summary report using the two strongest correlated variables and the dependent variable. Note the R-Squared and Adjusted R-Squared values and determine whether this is a good or bad fit for your data. Compare this model to the previous model: which model is better?

  • Sometimes, one variable is very good at predicting another variable. But most of the time, more than one factor affects the prediction of another variable.
  • While increased rainfall is a good predictor of increased crop supply, decreased herbivores can also result in an increase of crops.
  • This idea is a loose metaphor for multiple linear regression.

  • Multiple linear regression lm(y ~ x_0 + x_1 + x_2 + … x_i )
  • Where y is the predicted/dependent variable and the x variables are the predictors/independent variables

  • commands: lm( y ~ x_1 + x_2 ) summary( reg_model )

#Multiple Linear Regression Model
#mlr1 <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE1 + INDEPENDENT_VARIABLE2 )
mlrl <- lm(sales ~ radio + TV)
#Summary of Multiple Linear Regression Model
summary(mlrl)
## 
## Call:
## lm(formula = sales ~ radio + TV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## radio        0.18799    0.00804  23.382   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

The Multiple R-Squared and Adjusted R-Squared more than double, meaning that this model is likely more accurate and would be preferred, as it fits the data well.
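Because the radio-only model is nested inside this one, the improvement can also be tested formally with an F-test; a sketch:

#Does adding TV significantly improve on the radio-only model?
anova(reg, mlrl)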

2D-b) Create a Multiple Linear Regression Model using all relevant independent variables and the dependent variable. Note the R-Squared and Adjusted R-Squared values, determine if this is a good or bad fit for your data?

mlrl2 <- lm(sales ~ radio + TV + news)
summary(mlrl2)
## 
## Call:
## lm(formula = sales ~ radio + TV + news)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## news        -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

The accuracy of this model goes down slightly from the previous one, but it is still better than the first. While the R-Squared stays the same, the Adjusted R-Squared decreases slightly because newspaper adds a predictor without adding explanatory power. Overall this is a good fit for the data, just not the best fit.

Based purely on the values for R-Squared and Adjusted R-Squared, which linear regression model is best at predicting the dependent variable? Explain why.

The second model, which considers radio and TV, is the best model for predicting the dependent variable. It has the highest values for R-Squared and Adjusted R-Squared and uses the strongest indicators of sales.
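A side-by-side comparison of the three Adjusted R-Squared values, pulled from the summaries:

#Adjusted R-Squared for each model fit above
c(model1 = summary(reg)$adj.r.squared,
  model2 = summary(mlrl)$adj.r.squared,
  model3 = summary(mlrl2)$adj.r.squared)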

2E) Use the three different models to predict the dependent variable for the given values of the independent variables.

  • Variable: Radio = 69
  • Variable: TV = 255
  • Variable: newspaper = 75

MODEL 1

9.31164 + 0.20250*69
## [1] 23.28414

MODEL 2

2.9211 + 0.18799*69 + 0.04575*255
## [1] 27.55866

MODEL 3

0.188530*69 + 0.045765*255 - 0.001037*75 + 2.938889
## [1] 27.53976
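The same predictions can be reproduced with predict(), which applies each fitted model to a new data frame; the column names must match the predictor names used in the fits (results differ slightly from the hand calculations above, which use rounded coefficients):

#Predict sales for the given budgets with each fitted model
newvals <- data.frame(radio = 69, TV = 255, news = 75)
predict(reg, newvals)   #Model 1: about 23.28
predict(mlrl, newvals)  #Model 2: about 27.56
predict(mlrl2, newvals) #Model 3: about 27.54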

Task 3: Watson Analysis


To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.

3A) Use the Predictive module to analyze the given data. Note any interesting patterns and add a screenshot of what you found.

Sales are best predicted using the radio and TV values in both models. This is interesting because newspaper was not even considered to be a driver, which could be due to the 78% data accuracy for that variable. I also find it interesting that the model is 72% accurate. It is nice to see visually how much closer the 94% radio-and-TV combination is to sales than the single-driver indicators.

knitr::include_graphics("img/PredictiveModel1.PNG")

knitr::include_graphics("img/PredictiveModel2.PNG")

3B) Note the predictive power strength of the reported variables. Consider the one-field predictive model only; describe your findings and add a screenshot.

The predictive strength of TV is 59% and of radio is 32%. This is not surprising, as the model run in R also indicated that TV was a stronger predictor of sales than radio.

knitr::include_graphics("img/DriverStrength.PNG")

3C) How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how.

The Watson results validate the findings of the R regression analysis in Task 2: both show that using TV and radio as predictors yields the best model of sales. I find it interesting that newspaper is not used at all in Watson, but both Watson and R found that it is not a reliable indicator, due to the outliers.