Notebook Instructions


About

  • In a year with more rainfall, we may see an increase in crop production, because more water can lead to more plants.

  • This is a direct relationship: the amount of fruit produced may be predictable from the amount of rainfall in a given year.

  • This example represents simple linear regression, an extremely useful concept that allows us to predict the values of one variable based on another variable.

  • This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.

Load Packages in R/RStudio

We are going to use tidyverse, a collection of R packages designed for data science, together with plotly for interactive plots.
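The chunk that loaded the packages is not shown in this notebook; only its startup messages are. A minimal sketch of what such a chunk could look like follows. It is an assumption, not the original code (the "Loading required package" messages suggest the original may have used require() instead), and the names pkgs and installed are illustrative only.

```r
# Assumed package list, taken from the startup messages in this notebook
pkgs <- c("tidyverse", "plotly")

# Check which packages are installed without attaching them yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach only the packages that are actually available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Packages reported as not installed can be added with install.packages() before knitting.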

## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task 1: Correlation Analysis


1A) Read the csv file into R Studio and display the dataset.

  • Name your dataset ‘mydata’ so it is easy to work with.

  • Commands: read_csv() rename() head()

Extract the assigned features (columns) to perform some analytics.

mydata = read.csv(file="data/Advertising.csv")
head(mydata)
##   X    TV radio newspaper sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2
mydata <- rename(mydata, "case_number" = "X")
sales <- mydata$sales
newspaper <- mydata$newspaper
radio <- mydata$radio
TV <- mydata$TV
case_number <- mydata$case_number
head(mydata)
##   case_number    TV radio newspaper sales
## 1           1 230.1  37.8      69.2  22.1
## 2           2  44.5  39.3      45.1  10.4
## 3           3  17.2  45.9      69.3   9.3
## 4           4 151.5  41.3      58.5  18.5
## 5           5 180.8  10.8      58.4  12.9
## 6           6   8.7  48.9      75.0   7.2

Sales is the dependent variable, whereas TV, radio, and newspaper are the independent variables.

1B) Create a correlation table to compare the correlations between all variables. Remove any variables for which a correlation would be irrelevant or inaccurate.

  • Commands: cor(mydata[ -c(COLUMN_NUMBER) ])
#corr = cor(mydata)
#corr
corr = cor(mydata[ -c(1) ])  # drop column 1 (case_number), a row index rather than a measurement
corr
##                   TV      radio  newspaper     sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## radio     0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales     0.78222442 0.57622257 0.22829903 1.0000000

1C) Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations? List the pairs.

The value 1.0 runs down the diagonal of the matrix because each diagonal entry is the correlation of a variable with itself. The strongest correlation is between sales and TV (r ≈ 0.78), followed by sales and radio (r ≈ 0.58).
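As a sketch, the ranking of pairs can also be read off programmatically. The matrix below is typed in from the cor() output in 1B rather than recomputed from the CSV, and the names vars, idx, and ranked are illustrative only.

```r
# Correlation matrix copied from the cor() printout in 1B
vars <- c("TV", "radio", "newspaper", "sales")
corr <- matrix(c(
  1.00000000, 0.05480866, 0.05664787, 0.7822244,
  0.05480866, 1.00000000, 0.35410375, 0.5762226,
  0.05664787, 0.35410375, 1.00000000, 0.2282990,
  0.78222442, 0.57622257, 0.22829903, 1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Every variable correlates perfectly with itself
all(diag(corr) == 1)

# Keep each pair once: upper triangle only, diagonal excluded
idx <- which(upper.tri(corr), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

# Strongest pairs first
ranked <- ranked[order(-abs(ranked$r)), ]
ranked
```

With the full dataset loaded, the typed-in matrix can be replaced by the corr object computed in 1B.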

1D-a) Identify the dependent variable (y) and one independent variable (x_i), using the correlation table to find a variable with a coefficient greater than 0.20 and lower than 0.60. Use those two variables to create a scatterplot to visualize the data. Note any patterns or relation between the two variables.

  • Commands: qplot( x = VARIABLE, y = VARIABLE, data = mydata)
qplot( x = radio, y = sales, data = mydata)

Radio and sales have a correlation between 0.2 and 0.6: their correlation coefficient is about 0.576.
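The selection rule in 1D-a can be sketched as a filter on the sales column of the correlation table. The values are copied from the 1B output; sales_cor and candidates are illustrative names.

```r
# Correlation of each advertising channel with sales, from the 1B table
sales_cor <- c(TV = 0.7822244, radio = 0.5762226, newspaper = 0.2282990)

# Keep channels whose coefficient lies strictly between 0.20 and 0.60
candidates <- sales_cor[sales_cor > 0.20 & sales_cor < 0.60]
candidates
```

Both radio and newspaper fall inside the range; this lab continues with radio, the stronger of the two.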

1D-b) Create a 3D scatterplot of the dependent variable against the two variables most strongly correlated with it. Note any patterns and the coordinates (x, y, z) of the three points with the highest values.

#p <- plot_ly(mydata, x = ~VARIABLE_1, y = ~VARIABLE_2, z = ~VARIABLE_3, marker = list(size = 5)) %>%
#  add_markers()
#p
p <- plot_ly(mydata, x = ~radio, y = ~TV, z = ~sales, marker = list(size = 5)) 
p
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

This plot indicates that both radio and TV are positively correlated with sales. As radio and TV advertising go up, sales go up!

Task 2: Regression Analysis


2A) Create a linear regression model. Identify the dependent variable (y) and, for the independent variable (x_i), use the correlation table to find a variable with a coefficient greater than 0.20 and lower than 0.60 (the same variables as in 1D-a).

  • Commands: lm( y ~ x )
  • Commands: Use the summary() function to create a report for the linear model
#Simple Linear Regression Model
#reg <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE )
reg <- lm( sales ~ radio, data = mydata)
reg
## 
## Call:
## lm(formula = sales ~ radio, data = mydata)
## 
## Coefficients:
## (Intercept)        radio  
##      9.3116       0.2025
#summary(MODEL)
summary(reg)
## 
## Call:
## lm(formula = sales ~ radio, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7305  -2.1324   0.7707   2.7775   8.1810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.31164    0.56290  16.542   <2e-16 ***
## radio        0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

2B) Use the regression model to create a report. Note the R-squared and adjusted R-squared values and determine whether this is a good or bad fit for your data.

The R-squared is 0.332 and the adjusted R-squared is 0.329; the two are nearly equal. This means radio advertising explains about 33% of the variance in sales. The radio coefficient is statistically significant (p < 2e-16), so the relationship is real, but as a rule of thumb an R-squared below 0.5 indicates a weak fit, and anything above 0.75 would be much better. On its own, this line is not a good fit for the data.
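In a simple regression with a single predictor, R-squared equals the square of the correlation between the two variables. A quick check, using the radio and sales correlation copied from the 1B table (the variable names are illustrative):

```r
# Correlation between radio and sales, from the cor() output in 1B
r_radio_sales <- 0.5762226

# For one predictor, R-squared is the squared correlation coefficient
r_squared <- r_radio_sales^2
round(r_squared, 3)
```

The rounded result matches the Multiple R-squared of 0.332 reported by summary(reg).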

Note: sales_predicted = intercept estimate + radio coefficient estimate * (radio)

2C) Create a plot of the dependent (y) and independent (x) variables. Note any patterns or relation between the two variables, and describe the trend line.

  • The trend line will show how far the predictions are from the actual values
  • The vertical distance between an actual value and its predicted value is the residual
#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()

p <- qplot( x = radio, y = sales, data = mydata) + geom_point()

p + geom_smooth(method = "lm", formula = y ~ x)

The pattern is that the more radio advertising there is, the higher the sales tend to be. The trend line has a positive slope (about 0.20), although the points are scattered widely around it.
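The residual idea from 2C can be sketched numerically. The six observations below are the rows shown by head(mydata) in 1A, and the coefficients are copied from summary(reg); the variable names are illustrative.

```r
# First six rows of mydata, copied from the head() output in 1A
radio_obs <- c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9)
sales_obs <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)

# Coefficients copied from summary(reg)
intercept <- 9.31164
slope     <- 0.20250

# Point on the trend line for each observation
sales_fit <- intercept + slope * radio_obs

# Residual = actual value minus predicted value
residual <- sales_obs - sales_fit

data.frame(radio = radio_obs, sales = sales_obs,
           fitted = round(sales_fit, 2), residual = round(residual, 2))
```

With the fitted model object available, fitted(reg) and residuals(reg) return the same quantities for every row of the dataset.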

2D-a) Create a multiple linear regression model and summary report using the two most strongly correlated variables and the dependent variable.

  • Sometimes, one variable is very good at predicting another variable. But most of the time, more than one factor affects the prediction of another variable.
  • While increased rainfall is a good predictor of increased crop supply, a decrease in herbivores can also result in an increase in crops.
  • This idea is a loose metaphor for multiple linear regression.

  • Multiple linear regression: lm(y ~ x_0 + x_1 + x_2 + … + x_i )
  • Where y is the predicted/dependent variable and the x variables are the predictors/independent variables

  • commands: lm( y ~ x_1 + x_2 ) summary( reg_model )
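As a minimal sketch of the syntax, the model formula can be paired with a data argument so that lm() looks the variables up by column name. The data frame below holds only the six rows shown by head(mydata) in 1A, so its coefficients will differ from those of a model fitted to the full dataset; six_rows and fit_small are illustrative names.

```r
# Six rows copied from the head() output in 1A (not the full dataset)
six_rows <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Two predictors on the right-hand side; columns are resolved inside six_rows
fit_small <- lm(sales ~ radio + TV, data = six_rows)

# One intercept plus one coefficient per predictor
coef(fit_small)
```

Passing data = keeps the model independent of any loose vectors in the workspace that happen to share the column names.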

#Multiple Linear Regression Model
#mlr1 <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE1 + INDEPENDENT_VARIABLE2 )
#Summary of Multiple Linear Regression Model

mlrm_sales <- lm( sales ~ radio + TV )
mlrm_sales
## 
## Call:
## lm(formula = sales ~ radio + TV)
## 
## Coefficients:
## (Intercept)        radio           TV  
##     2.92110      0.18799      0.04575
summary(mlrm_sales)
## 
## Call:
## lm(formula = sales ~ radio + TV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## radio        0.18799    0.00804  23.382   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

sales_predicted = intercept + radio_coefficient * (radio) + TV_coefficient * (TV)

Note the R-squared and adjusted R-squared values and determine whether this is a good or bad fit for your data. Compare this model to the previous model: which model is better?

The R-squared and the adjusted R-squared are 0.897 and 0.896. This is a great fit for our data because it is over 0.75: radio and TV together explain about 90% of the variance in sales. It is a much better fit than the radio-only model (R-squared 0.332), and the residual standard error drops from 4.275 to 1.681.

2D-b) Create a multiple linear regression model using all relevant independent variables and the dependent variable. Note the R-squared and adjusted R-squared values and determine whether this is a good or bad fit for your data.

mlrm_sales2 <- lm( mydata$sales ~ mydata$radio + mydata$TV + mydata$newspaper)
mlrm_sales2
## 
## Call:
## lm(formula = mydata$sales ~ mydata$radio + mydata$TV + mydata$newspaper)
## 
## Coefficients:
##      (Intercept)      mydata$radio         mydata$TV  mydata$newspaper  
##         2.938889          0.188530          0.045765         -0.001037
summary(mlrm_sales2)
## 
## Call:
## lm(formula = mydata$sales ~ mydata$radio + mydata$TV + mydata$newspaper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.938889   0.311908   9.422   <2e-16 ***
## mydata$radio      0.188530   0.008611  21.893   <2e-16 ***
## mydata$TV         0.045765   0.001395  32.809   <2e-16 ***
## mydata$newspaper -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

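Based purely on the values for R-squared and adjusted R-squared, which linear regression model is best at predicting the dependent variable? Explain why.

For the three-variable model, the R-squared is 0.897 and the adjusted R-squared is 0.896, so it is also a great fit. Its R-squared (0.8972) is the same as that of the radio + TV model, but its adjusted R-squared is slightly lower (0.8956 versus 0.8962), because newspaper adds essentially no explanatory power (p = 0.86). Based on these values, the radio + TV model is the best of the three, and the radio-only model (0.332) is by far the weakest.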
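Adjusted R-squared penalizes each additional predictor, which is why it can fall even when R-squared does not. A sketch of the standard formula, applied to the R-squared values printed by summary() above; the helper adj_r2 is a hypothetical name, and n = 200 is inferred from the reported degrees of freedom (197 residual degrees of freedom with two predictors plus an intercept).

```r
# Adjusted R-squared for a model with n observations and p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 200  # assumption: 197 residual df + 2 predictors + 1 intercept

adj_two   <- adj_r2(0.8972, n_obs, p = 2)  # sales ~ radio + TV
adj_three <- adj_r2(0.8972, n_obs, p = 3)  # sales ~ radio + TV + newspaper

round(c(two_predictors = adj_two, three_predictors = adj_three), 4)
```

The same R-squared yields a lower adjusted value once the third predictor is counted, matching the 0.8962 and 0.8956 shown in the two summaries.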

2E) Use the three different models to predict the dependent variable for the given values of the independent variables.

  • Variable: Radio = 69
  • Variable: TV = 255
  • Variable: newspaper = 75

MODEL 1

# lm( sales ~ radio )
radio = 69
predicted_sales_radio =  9.3116 + (0.20250) * (radio)
predicted_sales_radio
## [1] 23.2841

Through this model, 23.28 is the predicted sales value when using the radio data point of 69.

MODEL 2

# lm( sales ~ radio + TV )
radio = 69
TV = 255
predicted_sales_TVradio = 2.92110 + 0.18799 * (radio) + 0.04575 * (TV)
predicted_sales_TVradio
## [1] 27.55866

Through model 2, 27.56 is the predicted sales value when using both the TV and radio data in a multiple linear regression.

MODEL 3

#lm( sales ~ radio + TV + newspaper)
radio = 69
TV = 255
newspaper = 75
predicted_sales_all = 2.938889 + 0.188530 * (radio) + 0.045765 * (TV) - 0.001037 * (newspaper)
predicted_sales_all
## [1] 27.53976

Through model 3’s analysis, the predicted sales utilizing newspaper, TV, and radio is 27.54.
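The three hand calculations can be collected into one helper, sketched below. Note that assigning radio = 69 and TV = 255 at the top level overwrites the radio and TV vectors extracted in 1A, so this sketch keeps the new values in a separate named vector. The coefficients are copied from the printouts in Task 2; predict_sales, new_obs, and the coef_m* names are hypothetical.

```r
# Coefficients copied from the lm() printouts in Task 2
coef_m1 <- c(intercept = 9.3116, radio = 0.20250)
coef_m2 <- c(intercept = 2.92110, radio = 0.18799, TV = 0.04575)
coef_m3 <- c(intercept = 2.938889, radio = 0.188530, TV = 0.045765,
             newspaper = -0.001037)

# Values given in 2E, kept apart from the data columns of the same name
new_obs <- c(radio = 69, TV = 255, newspaper = 75)

# Intercept plus the sum of coefficient * value over the model's predictors
predict_sales <- function(coefs, obs) {
  predictors <- setdiff(names(coefs), "intercept")
  unname(coefs["intercept"] + sum(coefs[predictors] * obs[predictors]))
}

predictions <- c(
  model1 = predict_sales(coef_m1, new_obs),
  model2 = predict_sales(coef_m2, new_obs),
  model3 = predict_sales(coef_m3, new_obs)
)
predictions
```

When a model was fitted with a data argument, as reg was, predict(reg, newdata = data.frame(radio = 69)) avoids retyping rounded coefficients altogether.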


Task 3: Watson Analysis


To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.

3A) Use the Predictive module to analyze the given data. Note any interesting patterns and add a screenshot of what you found.

The advertising data has 91% quality according to Watson.

knitr::include_graphics('images/lab 05 watson 1.png')

The predictive strength of the variables according to Watson is 72%. TV is the strongest predictor, but both radio and TV combined are the best predictors for sales!

knitr::include_graphics('images/lab 05 watson 2.png')

Above is the decision tree that shows the same results as the decision rules.

3B) Note the predictive power strength of the reported variables. Consider the one-field predictive model only, describe your findings, and add a screenshot.

The predicted value strength of TV and Radio is 20.75.

knitr::include_graphics('images/spiral.png')


Above is a spiral visualization with radio as the farthest predictor and TV and radio combined as the best. This shows that sales are not driven by any single factor; the relationship between advertising and sales is more complicated than one input producing one output. Consumer behavior is complex and every customer is different, so companies should use a variety of advertising forms in order to maximize sales.

3C) How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how.

The Watson Analytics results are fully consistent with our findings from the R regression analysis. Both analyses show that TV is the strongest single predictor. In Task 2, however, we used radio for the simple regression, so radio’s analysis in Watson is shown below.

knitr::include_graphics('images/radio.png')

The predictive strength of radio is 32%, which is close to the R-squared of 0.332 from the radio-only regression in Task 2.

As noted above, sales are not driven by just one variable. Sales are best driven by multiple advertising media because all consumers are different.