This project will explore the concepts of simple linear regression, multiple linear regression and finding the optimum solution with possible business limitations.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
First, read in the marketing data that was used in the previous lab. Make sure the file is read in correctly. (0.25 points)
#Read data correctly
marketing = read.csv(file="marketing.csv")
head(marketing)
## case_number sales radio paper tv pos
## 1 1 11125 65 89 250 1.3
## 2 2 16121 73 55 260 1.6
## 3 3 16440 74 58 270 1.7
## 4 4 16876 75 82 270 1.3
## 5 5 13965 69 75 255 1.5
## 6 6 14999 70 71 255 2.1
Next, apply the cor() function to the data to understand the correlations between variables. This is a great way to compare the correlations between all variables.(0.25 points)
#Correlation matrix of all columns in the data
corr = cor(marketing)
corr
## case_number sales radio paper tv
## case_number 1.0000000 0.2402344 0.23586825 -0.36838393 0.22282482
## sales 0.2402344 1.0000000 0.97713807 -0.28306828 0.95797025
## radio 0.2358682 0.9771381 1.00000000 -0.23835848 0.96609579
## paper -0.3683839 -0.2830683 -0.23835848 1.00000000 -0.24587896
## tv 0.2228248 0.9579703 0.96609579 -0.24587896 1.00000000
## pos 0.0539763 0.0126486 0.06040209 -0.09006241 -0.03602314
## pos
## case_number 0.05397630
## sales 0.01264860
## radio 0.06040209
## paper -0.09006241
## tv -0.03602314
## pos 1.00000000
#Correlation matrix of all columns except the first column. This is convenient since case_number is only an indicator for the month and should be excluded from the calculations.
corr = cor( marketing[ 2:6 ] )
corr
## sales radio paper tv pos
## sales 1.0000000 0.97713807 -0.28306828 0.95797025 0.01264860
## radio 0.9771381 1.00000000 -0.23835848 0.96609579 0.06040209
## paper -0.2830683 -0.23835848 1.00000000 -0.24587896 -0.09006241
## tv 0.9579703 0.96609579 -0.24587896 1.00000000 -0.03602314
## pos 0.0126486 0.06040209 -0.09006241 -0.03602314 1.00000000
The value of “1.0” is along the diagonal because the values that are being correlated with each other are the same. Radio correlated against Radio will be 1.0 because they are the same value. ##### 1B) Which pairs has the strongest correlations? (0.25 points) Radio:Sales, tv:sales, radio:tv
Next, create a visual diagram of the correlation matrix called a corrgram where the correlations strength are represented by colors intensity. To do this you need first to install two packages in R-Studio as executed by the command lines below
## corrplot 0.91 loaded
Generate a corrgram and a corrplot for the computed correlaation matrix (0.25 points)
# Generates a corrgram of last computed correlation matrix
corrgram(corr)
# Generates a corrplot, similar a corrgram, but with a different visual display
corrplot(corr)
Now extract all the variables and use the scatter.smooth() function to plot radio expences and sales. (0.25 points)
#Extract all variables
pos = marketing$pos
paper = marketing$paper
tv = marketing$tv
sales = marketing$sales
radio = marketing$radio
#Use scatter.smooth() function to plot Radio and Sales.
scatter.smooth(radio,sales)
Using the trend line in this plot, we can see a quite linear trend. As radio expences increase, sales increase.
Now, try to find a simple linear regression model between sales and marketing expences for radio commercials.(0.25 points)
#Simple Linear Regression
reg <- lm(sales ~ radio)
#Summary of Model
summary(reg)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732.85 -198.88 62.64 415.26 637.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9741.92 1362.94 -7.148 1.17e-06 ***
## radio 347.69 17.83 19.499 1.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.6 on 18 degrees of freedom
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9523
## F-statistic: 380.2 on 1 and 18 DF, p-value: 1.492e-13
The intercept for radio is -9741.92 and the slope for radio is 347.69. The equation for the linear regression model that predicts the sales, based on the radio, is sales_predicted = -9741.92 + 347.69 * (75)
The interpretation for the slope: for ever “x” radios sold, sales wil increase by about $347.69. Interpretation for the intercept: at 0 radios, the amount of sales is -$9741.92.
Given this equation we can predict the value of sales for any given value of radio ads. Use your equation to calculate the predicted sales value for 75 (investing $75,000 in radio ads).(0.25 points)
### Sales_predicted ~ Radio expences
sales_predicted = -9741.92 + 347.69 * (75)
sales_predicted
## [1] 16334.83
A high R-Squared value indicates that the model is a good fit, but not perfect. For the case of Sales versus Radio we will overlay the trend line representing the regression equation over the original plot. This will show how far the calculated results are from the actual value. The difference between the actual sales (circles) and the fitted sales (solid line) is captured in the residual error calculations.
#Plot Radio and Sales
plot(radio,sales)
scatter.smooth(radio,sales)
#Add a trend line plot using the linear model we created above
abline(reg, col="blue",lwd=2)
Now you can derive the linear regression model for Sales versus TV (0.25 points)
#Simple Linear Regression
regtv <- lm(sales ~ tv)
#Summary of Model
summary(regtv)
##
## Call:
## lm(formula = sales ~ tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1921.87 -412.24 7.02 581.59 1081.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42229.21 4164.12 -10.14 7.19e-09 ***
## tv 221.10 15.61 14.17 3.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 771.3 on 18 degrees of freedom
## Multiple R-squared: 0.9177, Adjusted R-squared: 0.9131
## F-statistic: 200.7 on 1 and 18 DF, p-value: 3.336e-11
sales_prediction(2) = -42229.21 + 221.10 * x
Create a multiple linear regression predicting sales using both independent variables radio and tv. (0.25 points)
#Multiple Linear Regression Model
mlr1 <-lm(sales ~ radio + tv)
#Summary of Multiple Linear Regression Model
summary(mlr1)
##
## Call:
## lm(formula = sales ~ radio + tv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1729.58 -205.97 56.95 335.15 759.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17150.46 6965.59 -2.462 0.024791 *
## radio 275.69 68.73 4.011 0.000905 ***
## tv 48.34 44.58 1.084 0.293351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 568.9 on 17 degrees of freedom
## Multiple R-squared: 0.9577, Adjusted R-squared: 0.9527
## F-statistic: 192.6 on 2 and 17 DF, p-value: 2.098e-12
sales_predicted = -17150.46 + 275.69(radio) + 48.34(tv) Interpretation of the intercept: The average sales is -$17150.46 when radio and tv sales are at 0. Interpretation of slopes: for every radio sold, sales will increase by $275.69. For every tv sold, sales will increase by $48.34.
The predicted sales can again be calculated given the coefficients of the regression model.
Calculate the predicted sales for TV = 270 and Radio = 75 (0.25 points)
# sales_predicted = radio + tv
sales_predicted = coef(mlr1)[1] + coef(mlr1)[2]*(75) + coef(mlr1)[3]*(270)
sales_predicted
## (Intercept)
## 16578.3
Create a multiple linear regression model for each of the following, and display the summary statistics (0.25 points)
#Multiple regression model for sales predicted by radio, tv, and pos
m1r2 = lm(sales ~ radio+tv+pos)
#Summary of Multiple Linear Regression Model
summary(m1r2)
##
## Call:
## lm(formula = sales ~ radio + tv + pos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1748.20 -187.42 -61.14 352.07 734.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15491.23 7697.08 -2.013 0.06130 .
## radio 291.36 75.48 3.860 0.00139 **
## tv 38.26 48.90 0.782 0.44538
## pos -107.62 191.25 -0.563 0.58142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580.7 on 16 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9508
## F-statistic: 123.3 on 3 and 16 DF, p-value: 2.859e-11
#Multiple regression model for sales predicted by radio, tv, pos, and paper
m1r3 = lm(sales ~ radio+tv+pos+paper)
#Summary of Multiple Linear Regression Model
summary(m1r3)
##
## Call:
## lm(formula = sales ~ radio + tv + pos + paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1558.13 -239.35 7.25 387.02 728.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13801.015 7865.017 -1.755 0.09970 .
## radio 294.224 75.442 3.900 0.00142 **
## tv 33.369 49.080 0.680 0.50693
## pos -128.875 192.156 -0.671 0.51262
## paper -9.159 8.991 -1.019 0.32449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580 on 15 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.9509
## F-statistic: 92.96 on 4 and 15 DF, p-value: 2.13e-10
By looking at R-squared and and adjusted R-squared, we can find the model that best predicts sales: m1r1: r-squared: 0.9577, adjusted r-squared: 0.9527 m1r2: r-squared: 0.9585, adjusted r-squared: 0.9508 m1r3: r-squared: 0.9612, adjusted r-squared: 0.9509
From this, we can see that model m1r1 best predicts the sales because of the high adjusted r-squared because it is closest to 1. Although m1r3 has a higher r-squared, it has a lower adjusted r-squared.
Calculate the sales for each of the three models given that Radio = 69 , TV = 255 , POS = 1.5, and Paper = 75. (0.25 points)
# sales_predicted = radio + tv + pos + paper
m1r1= -17150.46 + 275.49*69 + 48.34*255
m1r1
## [1] 14185.05
m1r2 = -15491.23 + 291.36*69 + 38.26*255 - 107.62*1.5
m1r2
## [1] 14207.48
m1r3 = -13801.015 + 294.224*69 + 33.369*255 - 128.875*1.5 -9.159*75
m1r3
## [1] 14129.3
###Task 3: Linear Programming & Optimization (total 3 points) For this task, we need to install an optimization package in R.
if(!require("lpSolveAPI",quietly = TRUE))
install.packages("lpSolveAPI",dependencies = TRUE, repos = "https://cloud.r-project.org")
#install special package required for the solver
#install.packages("lpSolveAPI", repos = "https://cran.us.r-project.org")
#install.packages("lpSolveAPI", repos = "https://cloud.r-project.org")
# load package library
library("lpSolveAPI")
First create the linear programming model object in R. This is the starting point. The object will eventually contain all definitions and results.
# Start creating the linear programming model
lpmark <- make.lp(0, 2)
Next, define the type of optimization, set the objective function, and add the constraints to our model object.
# Define type of optimization,and dump the screen output into a variable `dump`
dump = lp.control(lpmark, sense="max")
# Set the objective function
set.objfn(lpmark, c(275.691, 48.341))
# add constraints
add.constraint(lpmark, c(1, 1), "<=", 350000)
add.constraint(lpmark, c(1, 0), ">=", 15000)
add.constraint(lpmark, c(0, 1), ">=", 75000)
add.constraint(lpmark, c(2, -1), "=", 0)
add.constraint(lpmark, c(1, 0), ">=", 0)
add.constraint(lpmark, c(0, 1), ">=", 0)
Finally solve the model.
# View the problem formulation in tabular/matrix form
lpmark
## Model name:
## C1 C2
## Maximize 275.691 48.341
## R1 1 1 <= 350000
## R2 1 0 >= 15000
## R3 0 1 >= 75000
## R4 2 -1 = 0
## R5 1 0 >= 0
## R6 0 1 >= 0
## Kind Std Std
## Type Real Real
## Upper Inf Inf
## Lower 0 0
# Solve
solve(lpmark)
## [1] 0
# Display the objective function optimum value
get.objective(lpmark)
## [1] 43443517
# Display the decision variables optimum values
get.variables(lpmark)
## [1] 116666.7 233333.3
Optimal Sales: 43,443,517 Optimal Radio: 116,666.7 Optimal TV ads: 233,333.3 These all satisfy the constraints because the radio and tv budgets must be >= 15,000 and 75,000 respectively. Radio and tv are 116,666.7 and 233,333,3 (respectively). They are all greater than the constraints so thy are all satisfies. The tv ads must be <= 350,000. The sum of the radio and tv values is 116,666.7 + 233,333.7 = 350,000. This satisfies the constraint because it equals exactly 350,000. According to the model, these are the optimal allocations of resources to maximize sales.