About

In a year with more rain, we may see an increase in crop production, because more water can lead to more plant growth. This is a direct relationship: the amount of fruit produced may be predictable from the amount of rainfall in a given year. This example illustrates simple linear regression, an extremely useful concept that lets us predict the values of one variable from another variable.

This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.

Setup

Download the zip file titled ‘bsad_lab5’ and extract it. Next, set the extracted folder as the working directory: open RStudio, go to ‘Session’, select ‘Set Working Directory’, and click ‘To Source File Location’. Then follow the directions below to complete the lab.


Task 1

First, read in the marketing data that was used in the previous lab. It can be found in the downloaded folder bsad_lab5. Make sure to view the file to ensure it was read in correctly.

Every project starts with reading in the data. The data has two columns for case number, so we only need to work with one of them.

#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)

Next, apply the cor() function to the data to understand the correlations between variables. This is a great way to compare the correlations between all variables.

Why is the value “1.0” down the diagonal? Which pairs seem to have the strongest correlations? Answer and list the pairs below the matrix.

The value “1.0” appears down the diagonal because each diagonal entry compares a variable with itself, and a variable is always perfectly correlated with itself. The case number is only an identifier, so we restrict the matrix to columns 2:6. Radio and sales have the strongest correlation (0.977), followed by radio and tv (0.966) and sales and tv (0.958). Paper has a weak negative correlation with sales (-0.283), tv (-0.246), and radio (-0.238), and pos is essentially uncorrelated with every other variable (all values within ±0.1). The same patterns show up in the scatterplots below.

#Correlation Matrix
#Store the results under new names so the cor() function is not masked
cor_all = cor(mydata)
corr = cor(mydata[2:6])
corr
           sales       radio       paper          tv         pos
sales  1.0000000  0.97713807 -0.28306828  0.95797025  0.01264860
radio  0.9771381  1.00000000 -0.23835848  0.96609579  0.06040209
paper -0.2830683 -0.23835848  1.00000000 -0.24587896 -0.09006241
tv     0.9579703  0.96609579 -0.24587896  1.00000000 -0.03602314
pos    0.0126486  0.06040209 -0.09006241 -0.03602314  1.00000000
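
As a small illustration of why the diagonal is always 1 and how the strongest pair can be picked out, the sketch below uses a made-up data frame. The name `toy` and its values are hypothetical and are not taken from marketing.csv.

```r
# Hypothetical toy data, used only to illustrate properties of cor()
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 6),
  c = c(9, 7, 6, 4, 1)
)
toy_corr <- cor(toy)

# Every variable is perfectly correlated with itself
diag(toy_corr)

# The matrix is symmetric: cor(a, b) equals cor(b, a)
isSymmetric(toy_corr)

# Find the strongest off-diagonal pair by absolute correlation
off_diag <- abs(toy_corr)
diag(off_diag) <- 0
strongest <- which(off_diag == max(off_diag), arr.ind = TRUE)[1, ]
rownames(toy_corr)[strongest]
```

Taking the absolute value first means a strong negative relationship counts as strong, which is how the correlations should be ranked.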

From the matrix, it's clear that sales and radio have the strongest correlation. So, create a scatterplot of the two to visualize the data. Make sure to extract the columns first.

First we extract all the variables so we can use them to explore trends in the data. With the variables extracted, we can make scatterplots to visualize the stronger correlations. Several of these plots show a clearly linear relationship.

#Extract all variables
pos  = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio
#Plot of Radio and Sales using plot command from Worksheet 4
plot(radio,sales)

scatter.smooth(radio,sales)

scatter.smooth(paper,sales)

scatter.smooth(radio,paper)

scatter.smooth(tv,sales)

scatter.smooth(radio,tv)

scatter.smooth(pos,sales)

scatter.smooth(radio,pos)

From the plot of radio against sales, the points appear to fall along an almost straight line. So, we will try to fit a simple linear regression model to the data.

The lm() function is a very useful one. The function is set up as lm(y~x) where the x variable, the independent variable, predicts values of the y variable, or the dependent variable.
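
To make the lm(y~x) setup concrete, here is a minimal sketch on made-up numbers. The vectors `x` and `y` are hypothetical and are not the marketing data. It fits the model, pulls out the intercept and slope, and shows that predict() gives the same answer as plugging into the fitted equation by hand.

```r
# Hypothetical data: y is roughly 3 + 2 * x
x <- c(1, 2, 3, 4, 5, 6)
y <- c(5.1, 6.9, 9.2, 10.8, 13.1, 14.9)

# Fit y as a linear function of x
fit <- lm(y ~ x)

# Extract the intercept (b) and slope (m) of the fitted line
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["x"]]

# Predict y at a new x value, first by hand and then with predict()
y_hat_manual <- b + m * 7
y_hat_predict <- predict(fit, newdata = data.frame(x = 7))
```

The column name in `newdata` has to match the predictor name used in the formula, here `x`.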

In the regressions below, we use radio ads and then tv ads to predict sales. We print a summary of the tv model to view the quantitative details of a fitted linear model.

We use regression lines to predict where other values will fall. These simple equations let us predict the value of one variable when we input a value of another.

#Simple Linear Regression
#Keep both models under separate names so neither overwrites the other
reg <- lm(sales ~ radio)
reg_tv <- lm(sales ~ tv)
#Summary of the tv model
summary(reg_tv)

Call:
lm(formula = sales ~ tv)

Residuals:
     Min       1Q   Median       3Q      Max 
-1921.87  -412.24     7.02   581.59  1081.61 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42229.21    4164.12  -10.14 7.19e-09
tv             221.10      15.61   14.17 3.34e-11
               
(Intercept) ***
tv          ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 771.3 on 18 degrees of freedom
Multiple R-squared:  0.9177,    Adjusted R-squared:  0.9131 
F-statistic: 200.7 on 1 and 18 DF,  p-value: 3.336e-11
###Linear Model (y = b + m*x)
###Predicted sales at radio = 72, using the intercept and slope of the radio model
sales_predicted = -9741.92 + 347.69 * (72)
sales_predicted
[1] 15291.76
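
The same calculation can be wrapped in a small helper so it is easy to try other radio values. This is only a sketch: the function name `predict_sales` is hypothetical, and the intercept and slope are copied from the prediction line above.

```r
# Intercept and slope copied from the prediction line above (radio model)
intercept <- -9741.92
slope <- 347.69

# Hypothetical helper: predicted sales for one or more radio values
predict_sales <- function(radio_value) {
  intercept + slope * radio_value
}

predict_sales(72)             # 15291.76, matching the value above
predict_sales(c(60, 70, 80))  # works on a whole vector at once
```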

Report and interpret the R-Squared value below.

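The R-squared value reported in the summary above, for the model that predicts sales from tv, is 0.9177. This means that about 91.8% of the variation in sales is explained by tv. The value 15291.76 is not an R-squared; it is the predicted sales when radio is 72. For the radio model, R-squared equals the squared correlation between sales and radio, 0.9771² ≈ 0.955, so radio explains about 95.5% of the variation in sales.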
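
For a simple linear regression, R-squared equals the square of the correlation between the two variables, so it can be cross-checked against the correlation matrix from earlier. The sketch below does this with the two correlations copied from that matrix, and then confirms the identity on made-up data (`toy_x` and `toy_y` are hypothetical).

```r
# Correlations copied from the correlation matrix above
r_sales_radio <- 0.97713807
r_sales_tv <- 0.95797025

# Squared correlation = R-squared of the one-predictor model
round(r_sales_radio^2, 4)  # 0.9548
round(r_sales_tv^2, 4)     # 0.9177, as in the tv summary

# Check the identity on hypothetical data
toy_x <- c(1, 2, 3, 4, 5, 6)
toy_y <- c(5.1, 6.9, 9.2, 10.8, 13.1, 14.9)
toy_fit <- lm(toy_y ~ toy_x)
all.equal(summary(toy_fit)$r.squared, cor(toy_x, toy_y)^2)
```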

Because the R-Squared value is so high, the model is a good fit, though not a perfect one. We will overlay a trend line on the original plot to show how far the predictions are from the actual values. The distance between an actual value and its predicted value is the residual.

#Plot Radio and Sales
plot(radio,sales)
#Add a trend line from the linear model of sales on radio
abline(lm(sales ~ radio),col="blue",lwd=2)

We can see here that the correlation isn’t perfect, because the points don’t line up exactly on the fitted line. That said, it is a very strong correlation. The points also follow a slightly arched path around the line.
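
The residuals described above can also be inspected numerically. A minimal sketch on made-up data (the vectors `x` and `y` are hypothetical, not the marketing data): each residual is the observed value minus the fitted value, and with an intercept in the model the residuals sum to roughly zero.

```r
# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(5.1, 6.9, 9.2, 10.8, 13.1, 14.9)
fit <- lm(y ~ x)

# Residual = actual value - value predicted by the line
res_manual <- y - fitted(fit)

# residuals() returns the same numbers
all.equal(unname(res_manual), unname(residuals(fit)))

# Positive and negative residuals cancel out when the model has an intercept
sum(residuals(fit))
```

Plotting residuals against the predictor, for example with plot(x, residuals(fit)), is a common way to spot curvature like the arch noted above.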


Task 2

In R, multiple linear regression takes the form lm(y ~ x0 + x1 + x2 + ... ), where y is the value being predicted (the dependent variable) and the x variables are the predictors (the independent variables).

Using multiple regression can increase or decrease the accuracy depending on the variables used. Each added variable makes the process and the equation more complex, but in many cases it gives a more accurate prediction.

#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
mlr1

Call:
lm(formula = sales ~ radio + tv)

Coefficients:
(Intercept)        radio           tv  
  -17150.46       275.69        48.34  
#Summary of Multiple Linear Regression Model
summary(mlr1)

Call:
lm(formula = sales ~ radio + tv)

Residuals:
     Min       1Q   Median       3Q      Max 
-1729.58  -205.97    56.95   335.15   759.26 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -17150.46    6965.59  -2.462 0.024791
radio          275.69      68.73   4.011 0.000905
tv              48.34      44.58   1.084 0.293351
               
(Intercept) *  
radio       ***
tv             
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 568.9 on 17 degrees of freedom
Multiple R-squared:  0.9577,    Adjusted R-squared:  0.9527 
F-statistic: 192.6 on 2 and 17 DF,  p-value: 2.098e-12

For mlr1, the R-Squared value is 0.9577 and the Adj R-Squared is 0.9527.

Create a multiple linear regression model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared.

These multiple regression models all predict sales, but they produce different numbers, so we must decide which equation is most accurate.

#mlr2 = Sales predicted by radio, tv, and pos
mlr2 <-lm(sales ~ radio + tv + pos)
mlr2

Call:
lm(formula = sales ~ radio + tv + pos)

Coefficients:
(Intercept)        radio           tv          pos  
  -15491.23       291.36        38.26      -107.62  
summary(mlr2)

Call:
lm(formula = sales ~ radio + tv + pos)

Residuals:
     Min       1Q   Median       3Q      Max 
-1748.20  -187.42   -61.14   352.07   734.20 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -15491.23    7697.08  -2.013  0.06130 . 
radio          291.36      75.48   3.860  0.00139 **
tv              38.26      48.90   0.782  0.44538   
pos           -107.62     191.25  -0.563  0.58142   
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 580.7 on 16 degrees of freedom
Multiple R-squared:  0.9585,    Adjusted R-squared:  0.9508 
F-statistic: 123.3 on 3 and 16 DF,  p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 <-lm(sales ~ radio + tv + pos + paper)
mlr3

Call:
lm(formula = sales ~ radio + tv + pos + paper)

Coefficients:
(Intercept)        radio           tv          pos  
 -13801.015      294.224       33.369     -128.875  
      paper  
     -9.159  
summary(mlr3)

Call:
lm(formula = sales ~ radio + tv + pos + paper)

Residuals:
     Min       1Q   Median       3Q      Max 
-1558.13  -239.35     7.25   387.02   728.02 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -13801.015   7865.017  -1.755  0.09970
radio          294.224     75.442   3.900  0.00142
tv              33.369     49.080   0.680  0.50693
pos           -128.875    192.156  -0.671  0.51262
paper           -9.159      8.991  -1.019  0.32449
              
(Intercept) . 
radio       **
tv            
pos           
paper         
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 580 on 15 degrees of freedom
Multiple R-squared:  0.9612,    Adjusted R-squared:  0.9509 
F-statistic: 92.96 on 4 and 15 DF,  p-value: 2.13e-10

Based purely on the values for R-Squared and Adj R-Squared, which linear regression model is best in predicting sales? Explain why.

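mlr3 has the highest R-Squared (0.9612, versus 0.9585 for mlr2 and 0.9577 for mlr1), but R-Squared never decreases when a predictor is added, so it favors the largest model by construction. Adj R-Squared corrects for the number of predictors, and on that measure mlr1 is best (0.9527, versus 0.9509 for mlr3 and 0.9508 for mlr2). Adding pos and paper does not improve the fit enough to justify the extra variables, so mlr1 is the preferred model.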
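
The Adj R-Squared values in the summaries can be reproduced from R-Squared, the sample size, and the number of predictors. The sketch below assumes the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) and n = 20, which is inferred from the residual degrees of freedom (for example, 17 plus 3 coefficients for mlr1). The helper name `adj_r2` is hypothetical. Because the inputs are the rounded R-Squared values, the results agree with the printed ones only to about three decimal places.

```r
# Adjusted R-squared from R-squared (r2), sample size (n), and predictor count (p)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 20  # inferred: residual degrees of freedom + number of coefficients

# R-squared values copied from the three summaries above
adj <- c(
  mlr1 = adj_r2(0.9577, n, 2),
  mlr2 = adj_r2(0.9585, n, 3),
  mlr3 = adj_r2(0.9612, n, 4)
)
round(adj, 4)

# Model with the highest adjusted R-squared
names(which.max(adj))
```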

After deciding which model predicts sales best, we will check that it truly is the best model by predicting sales from given values of the independent variables.

Given that Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75, calculate the predicted sales value for each of the three models above.


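#Predict sales for the given values with each fitted model
#(each model uses only the columns that appear in its formula)
newdata <- data.frame(radio = 69, tv = 255, pos = 1.5, paper = 75)
predict(mlr1, newdata)
predict(mlr2, newdata)
predict(mlr3, newdata)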
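
The three predictions can also be worked out by hand from the coefficients printed in the summaries above. This is a sketch, not the lab's prescribed method: the helper `predict_from_coefs` is hypothetical, and because the coefficients are the rounded values from the output, the results are approximate.

```r
# Given values of the independent variables
new_values <- c(radio = 69, tv = 255, pos = 1.5, paper = 75)

# Coefficients copied (rounded) from the printed model output above
coef_mlr1 <- c(intercept = -17150.46, radio = 275.69, tv = 48.34)
coef_mlr2 <- c(intercept = -15491.23, radio = 291.36, tv = 38.26, pos = -107.62)
coef_mlr3 <- c(intercept = -13801.015, radio = 294.224, tv = 33.369,
               pos = -128.875, paper = -9.159)

# Hypothetical helper: intercept + sum of (coefficient * value) over the predictors
predict_from_coefs <- function(coefs, values) {
  predictors <- setdiff(names(coefs), "intercept")
  unname(coefs[["intercept"]] + sum(coefs[predictors] * values[predictors]))
}

pred <- c(
  mlr1 = predict_from_coefs(coef_mlr1, new_values),
  mlr2 = predict_from_coefs(coef_mlr2, new_values),
  mlr3 = predict_from_coefs(coef_mlr3, new_values)
)
round(pred, 2)  # approximately 14198.85, 14207.48, 14129.30
```

All three models predict sales of roughly 14,100 to 14,200 for these inputs, so the extra predictors in mlr2 and mlr3 change the prediction only slightly.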

Task 3

To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.

  1. Log on to your Watson Analytics account at watsonanalytics.com
  2. Upload the file marketing.csv unless already in your folder
  3. Use the Predictive module to analyze the data
  4. Note the predictive power strength of reported variables. Consider the one field predictive model only.
  5. How do Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how.

The results I found in Watson supported what I found in Tasks 1 and 2. Watson produced clear visual representations and displayed the data in a very readable way, which made the whole process much easier. I tried taking screenshots of my results, but an error message kept coming up.

