About

In this lab, we will focus on linear programming and non-linear regression modeling. Linear regression modeling, as discussed in the previous lab, works with simple and multiple linear regression models. Sometimes the relationships are not best represented by linear models, and a non-linear regression modeling is required. The general concept remains the same; minimizing the error between the observed/actual values and the values predicted by the model. Linear programming on the other hands seek to find the optimal solution to a problem with multivariables and multiconstraints described by linear relationships.

In this lab, we will perform a non-linear regression modeling on the cost of servers, and setup a linear programming model to solve the marketing use case discussed in class.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions on Sakai. Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their particular placement in the task section.

Task 1: Non-Linear Regression Modeling

First, we must read the file ‘ServersCost.csv’ into R, and extract the two columns of interest.

mydata <- read.csv("data/serverscost.csv", header=TRUE)
head(mydata)
##   servers  cost
## 1       1 27654
## 2       2 24789
## 3       3 21890
## 4       4 21633
## 5       5 15843
## 6       6 12567
servers = mydata$servers
cost = mydata$cost

We start by creating a simple linear regression model. Next, we plot the points to visually inspect the data and unravel any potential relationships.

linear_model = lm(cost ~ servers)
summary(linear_model)
## 
## Call:
## lm(formula = cost ~ servers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10646.2  -8646.2   -544.7   7066.0  12858.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  14747.2     4035.5   3.654  0.00181 **
## servers         48.0      336.9   0.142  0.88828   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8687 on 18 degrees of freedom
## Multiple R-squared:  0.001127,   Adjusted R-squared:  -0.05437 
## F-statistic: 0.0203 on 1 and 18 DF,  p-value: 0.8883
# add linear line based on regression model
plot(servers,cost, pch=16) # the pch option is to accentuate the points
abline(linear_model, col="blue", lwd=2)

The blue line here represents the model based predicted data, and the black dots are the actual data points. Clearly the line model is far from being a good predictor. Next we will use a nonlinear quadratic model to see how the model can be improved.

A Multiple linear regression model is of the form y = a+ b x1 + c x2 +…..+ n xn where y is the dependent variable and the x’s are the independent variables.

For the non-linear quadratic regression model the equation takes the form y = a+ b x + c x^2

# First it is best to define a new variable which is the squared value of servers

servers2 = servers^2

# see the first few values of servers2 and compare it to first few values of servers

head(servers2)
## [1]  1  4  9 16 25 36
head(servers)
## [1] 1 2 3 4 5 6
# The model formula is based on the form y = a+ b*x + c*x^2. This can be translated to: predicted Cost = a+ b*Number of servers + c*Number of servers squared. We need to know what is a,b, and c !

quad_model = lm(cost ~ servers + servers2)
summary(quad_model)
## 
## Call:
## lm(formula = cost ~ servers + servers2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2897.8 -1553.4  -513.2  1152.4  4752.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35417.77    1742.64   20.32 2.30e-13 ***
## servers     -5589.43     382.19  -14.62 4.62e-11 ***
## servers2      268.45      17.68   15.19 2.55e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2342 on 17 degrees of freedom
## Multiple R-squared:  0.9314, Adjusted R-squared:  0.9233 
## F-statistic: 115.4 on 2 and 17 DF,  p-value: 1.282e-10

The equation of the quadratic regression model is Y = 35417.77 - 5589.43 x + 268.45 x^2. This can be translated to: predicted cost = 35417.77 - 5589.43 number of servers + 268.45 number of servers squared. Using the variables we created, this can be translated to: predicted cost = 35417.77 - 5589.43 servers + 268.45 servers2

From the summary, we see the R-squared value has increased dramatically from the linear model, indicating a big improvement. That is not the only value we must check though. Let’s inspect visually how the model based data compare to the actual data.

First, we must calculate the predicted values. Then, we can plot the points.

# compute the predicted values based on the quad model. We will use the function "predict" which is will calculate the predicted cost for each number of servers based on the quadratic regression model that we created above. 

predicted2 = predict(quad_model,data=mydata)

# let us see the first few values in "predicted2", "servers", and "servers2"

head(predicted2)
##        1        2        3        4        5        6 
## 30096.79 25312.71 21065.52 17355.23 14181.84 11545.35
head(servers)
## [1] 1 2 3 4 5 6
head(servers2)
## [1]  1  4  9 16 25 36
# Notice that 35417.77 - 5589.43* 1 + 268.45 *1^2 = 30096.79 (i.e. predicted2 = 35417.77 - 5589.43* servers + 268.45 *servers2 ). This is equation of the quadratic regression model :) 

# now let us see the first few values of actual cost 

head(cost)
## [1] 27654 24789 21890 21633 15843 12567
#this means that when number of servers = 1, the predcited cost based on quadratic regression model is 30096.79, but the actual cost is 27654 

#Now we will see visually how the predcited costs based on quadratic regression model compare to actual costs

# Plot cost versus servers based on actual values
plot(servers,cost, pch=16) 

#Plot cost versus servers based on predicted values from quadratic regression model
plot(servers,predicted2, pch=16) 

#It is better to have actual cost and predicted cost on one graph to compare
# Plot cost versus servers based on actual values again
plot(servers,cost, pch=16)

# The par (parameter setting) command is needed to overlay the predicted model based values without the labels and annotations

par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE) 

# We will use the red color for the quadratic model
plot(predicted2, col="red", pch=16) 

It is easy to observe now that the predicted values are more in line with the observed values.

A common misconception is that the higher order the non-linear model is the better it is. Lets try a cubic model and see how it performs. This model is of the form y = a+ b x + c x^2 + d x^3.

1A) Create a cubic non-linear regression model, and display the summary statistics.
#First define the additional new variable. the cubic of servers, needed for your model
servers3 = servers^3

# The model formula is of the form y = a+ b * x + c * x^2 + d * x^3.  For consistency best to call your new model cubic_model.
cubic_model = lm(cost ~ servers + servers2 + servers3)
summary(cubic_model)
## 
## Call:
## lm(formula = cost ~ servers + servers2 + servers3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2871.0 -1435.1  -473.6  1271.8  4600.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36133.696   2625.976  13.760 2.77e-10 ***
## servers     -5954.738   1056.596  -5.636 3.72e-05 ***
## servers2      310.895    115.431   2.693    0.016 *  
## servers3       -1.347      3.619  -0.372    0.715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2404 on 16 degrees of freedom
## Multiple R-squared:  0.932,  Adjusted R-squared:  0.9193 
## F-statistic: 73.11 on 3 and 16 DF,  p-value: 1.478e-09

We can now visually inspect the goodness of the quadratic model and the cubic model.

1B) Create a plot of cost of versus servers based on actual data. Overlay the predicted values based on the cubic model, and on the quadratic model. Follow the sequence of commands described below.
# compute the predicted values based on the cubic model. For consistency best to call predicted3.
predicted3 = predict(cubic_model,data=mydata)
head(predicted3) 
##        1        2        3        4        5        6 
## 30488.51 25457.02 21031.16 17202.83 13963.95 11306.44
# Plot cost versus servers based on actual values
plot(cost, servers, pch=16) 

# The par (parameter setting) command is needed to overlay the predicted model based values without the labels and annotations
plot(cost,predicted3, pch=16)

# Use the green color for the cubic model
plot(predicted3, col="green", pch=16)
# The par (parameter setting) command is needed to overlay the predicted model based values without the labels and annotations
par(new=TRUE, xaxt= "n", yaxt= "n", ann=FALSE)

# Use the red color for the quadratic model
plot(predicted3, col="red", pch=16)

From the graphs it should be hard to tell which of the two models, the quadratic or the cubic, is a better fit and predictor. A better way to quantify which model is best is to look at the Adjusted-R2.

quadratic_model = lm(cost ~ servers + servers2)
summary(quadratic_model)
## 
## Call:
## lm(formula = cost ~ servers + servers2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2897.8 -1553.4  -513.2  1152.4  4752.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35417.77    1742.64   20.32 2.30e-13 ***
## servers     -5589.43     382.19  -14.62 4.62e-11 ***
## servers2      268.45      17.68   15.19 2.55e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2342 on 17 degrees of freedom
## Multiple R-squared:  0.9314, Adjusted R-squared:  0.9233 
## F-statistic: 115.4 on 2 and 17 DF,  p-value: 1.282e-10
1C) List here the R2 and Adjusted R2 for all three models linear, quandratic, and cubic. Identify which model is a better predictor and why

The R2 for linear model is 0.001127, adjusted R is -0.05437. Quadratic model R2 is 0.9314 and adjusted R is 0.9233. Cubic model R2 0.932 and adjusted R is 0.9193.

Task 2: Linear Programming & Optimization

For this task, we need to install an optimization package in R.

# Require will load the package only if not installed 
# Dependencies = TRUE makes sure that dependencies are install
if(!require("lpSolveAPI",quietly = TRUE))
  install.packages("lpSolveAPI",dependencies = TRUE, repos = "https://cloud.r-project.org")

Solving Marketing Model

First create the linear programming model object in R. This is the starting point. The object will eventually contain all definitions and results.

# We start with `0` constraint and `2` decision variables. The object name `lpmark` is discretionary.
lpmark <- make.lp(0, 2) 

Next we need to define the type of optimization, set the objective function, and add the constraints to our model object.

# Define type of optimization as maximum and dump the screen output into `dump`
dump = lp.control(lpmark, sense="max")  
# Set the objective function
set.objfn(lpmark, c(275.691, 48.341))

# add constraints
add.constraint(lpmark, c(1, 1), "<=", 350000)
add.constraint(lpmark, c(1, 0), ">=", 15000)
add.constraint(lpmark, c(0, 1), ">=", 75000)
add.constraint(lpmark, c(2, -1), "=", 0)
add.constraint(lpmark, c(1, 0), ">=", 0)
add.constraint(lpmark, c(0, 1), ">=", 0)
2A) Add the remaining missing five constraints to the model above. Follow the guidelines as described in the lectures.

Finally we can explore and solve the model using the lpSolveAPI package.

# View the problem formulation in tabular/matrix form
lpmark
## Model name: 
##                C1       C2            
## Maximize  275.691   48.341            
## R1              1        1  <=  350000
## R2              1        0  >=   15000
## R3              0        1  >=   75000
## R4              2       -1   =       0
## R5              1        0  >=       0
## R6              0        1  >=       0
## Kind          Std      Std            
## Type         Real     Real            
## Upper         Inf      Inf            
## Lower           0        0
# Solve 
solve(lpmark) 
## [1] 0
# Display the objective function optimum value
get.objective(lpmark)
## [1] 43443517
# Display the decision variables optimum values
get.variables(lpmark) 
## [1] 116666.7 233333.3
2B) Write down and clearly mark the optimum values for sales, radio, and tv ads. Show how the optimum decision values satisfy all constraints.

Sales:$43,443,517 Optimum Value of Radio: $116,666.7 Optimum value of TV: $233,333.3

The radio value of 116,666.7 and is above 15,000. The TV value of 233,333.3 and is above 75,000. It is exactly double the amount spent on radio. The sum of the two numbers is 350,000. The sales value is maximized at a value of 43,443,517.

43,443,517 ~ (275.691 x (116,666.7) + 48.341 x (233,333.3))