About

In a given year, if it rains more, we may see that there might be an increase in crop production. This is because more water may lead to more plants. This is a direct relationship; the number of fruits may be able to be predicted by amount of waterfall in a certain year. This example represents simple linear regression, which is an extremely useful concept that allows us to predict values of a certain variable based off another variable.

This lab will explore the concepts of simple linear regression, multiple linear regression, and watson analytics.

Setup

Make sure to download the folder titled ‘bsad_lab5’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.


Task 1

First, read in the marketing data that was used in the previous lab. It can be found in the downloaded folder bsad_lab5. Make sure to view the file to ensure it was read in correctly.

#Read data correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)
##   case_number sales radio paper  tv pos
## 1           1 11125    65    89 250 1.3
## 2           2 16121    73    55 260 1.6
## 3           3 16440    74    58 270 1.7
## 4           4 16876    75    82 270 1.3
## 5           5 13965    69    75 255 1.5
## 6           6 14999    70    71 255 2.1

Next the cor() function is used to demonstrate correlations between variables.

#Correlation Matrix
cor(mydata)
##             case_number      sales       radio       paper          tv
## case_number   1.0000000  0.2402344  0.23586825 -0.36838393  0.22282482
## sales         0.2402344  1.0000000  0.97713807 -0.28306828  0.95797025
## radio         0.2358682  0.9771381  1.00000000 -0.23835848  0.96609579
## paper        -0.3683839 -0.2830683 -0.23835848  1.00000000 -0.24587896
## tv            0.2228248  0.9579703  0.96609579 -0.24587896  1.00000000
## pos           0.0539763  0.0126486  0.06040209 -0.09006241 -0.03602314
##                     pos
## case_number  0.05397630
## sales        0.01264860
## radio        0.06040209
## paper       -0.09006241
## tv          -0.03602314
## pos          1.00000000

The ‘1.0’ valus is shown diagonal because it is correlation with the variable itself so it will always equal itself- ‘1.0’

The two variables with the highest correlation are ‘Sales’ and ‘Radio’ as it is the closest equal to ‘1’. Sales and Radio have the second highest correlation.

So, I will create a scatterplot between the two to visualize the data.

#Extract all variables
pos  = mydata$pos
paper = mydata$paper
tv = mydata$tv
sales = mydata$sales
radio = mydata$radio

#Plot of Radio and Sales using plot command from Worksheet 4
plot(sales, radio)

The points are scattered in a linear fashion in the scatter plot.

The lm() function can be very helpful in creating a linear regression model.

In the regression below, we are using radio ads to predict sales. We will then create a summary to view the quantitative facts about the linear model.

#Simple Linear Regression
reg <- lm(sales ~ radio)

#Summary of Model
summary(reg)
## 
## Call:
## lm(formula = sales ~ radio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1732.85  -198.88    62.64   415.26   637.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9741.92    1362.94  -7.148 1.17e-06 ***
## radio         347.69      17.83  19.499 1.49e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared:  0.9548, Adjusted R-squared:  0.9523 
## F-statistic: 380.2 on 1 and 18 DF,  p-value: 1.492e-13

Report and interpret the R-Squared value below.

Since the r-squared value is very high then this means that the model is suitable fit. A trend plot will be overlayed on the scatter plot in order to create a visualization of its accuracy.

#Plot Radio and Sales 
plot(radio,sales)

#Add a trend line plot using the linear model we created above
abline(reg,col="green",lwd=2) 

List some observations from this plot. This trend plot demonstrates that a majority of the points fall within very close range of the trend line, with only very few straying far from the line.


Task2

In this task set, we will complete examples of multiple linear regression models using the following function: ‘lm(y ~ x0 + x1 + x2 + … )’

Here, a multiple linear regression model is created predicting sales using radio and tv.

#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)

#Summary of Multiple Linear Regression Model
summary(mlr1)
## 
## Call:
## lm(formula = sales ~ radio + tv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1729.58  -205.97    56.95   335.15   759.26 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17150.46    6965.59  -2.462 0.024791 *  
## radio          275.69      68.73   4.011 0.000905 ***
## tv              48.34      44.58   1.084 0.293351    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared:  0.9577, Adjusted R-squared:  0.9527 
## F-statistic: 192.6 on 2 and 17 DF,  p-value: 2.098e-12

For mlr1, the R-Squared values is 0.9577 and the Adj R-Squared is 0.9527.

Create a Multiple Linear Regression Model for each of the following, display the summary statistics, and write the values for R-Squared and Adj R-Squared:

#mlr2 = Sales predicted by radio, tv, and pos
mlr2 = lm(sales ~ radio +tv +pos)

#Summary of mlr2
summary(mlr2)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1748.20  -187.42   -61.14   352.07   734.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -15491.23    7697.08  -2.013  0.06130 . 
## radio          291.36      75.48   3.860  0.00139 **
## tv              38.26      48.90   0.782  0.44538   
## pos           -107.62     191.25  -0.563  0.58142   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared:  0.9585, Adjusted R-squared:  0.9508 
## F-statistic: 123.3 on 3 and 16 DF,  p-value: 2.859e-11
#mlr3 = Sales predicted by radio, tv, pos, and paper
mlr3 = lm(sales ~ radio +tv +pos +paper)

#Summary of mlr3
summary(mlr3)
## 
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1558.13  -239.35     7.25   387.02   728.02 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -13801.015   7865.017  -1.755  0.09970 . 
## radio          294.224     75.442   3.900  0.00142 **
## tv              33.369     49.080   0.680  0.50693   
## pos           -128.875    192.156  -0.671  0.51262   
## paper           -9.159      8.991  -1.019  0.32449   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared:  0.9612, Adjusted R-squared:  0.9509 
## F-statistic: 92.96 on 4 and 15 DF,  p-value: 2.13e-10

Observations:

Mlr1 had the highest adjusted r-squared value, even though all mlr models had increasing r-squared values.

Therefore, mlr1 is the most accurate mode of these options to predict the dependent variable of sales.

Given that Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75, calculate the predicted sales value for each of the three models above.

radiovalue = 69
tvvalue = 255
posvalue = 1.5
papervalue = 75
#mlr1
radiovalue + tvvalue
## [1] 324
#mlr2
radiovalue + tvvalue + posvalue
## [1] 325.5
#mlr3
radiovalue + tvvalue + posvalue + papervalue
## [1] 400.5

Task 3

For this last task, I uploaded the ‘marketing.csv’ file to Watson Analytics. Here in the predictive module, it confirmed the findings we had come across in r- studio. Findings were consistent that sales are driven primarily by radio followed by tv.