In this lab, we will focus on linear programming and non-linear regression modeling. Linear regression modeling, as discussed in the previous lab, works with simple and multiple linear regression models. Sometimes the relationships are not best represented by linear models, and a non-linear regression modeling is required. The general concept remains the same; minimizing the error between the observed/actual values and the values predicted by the model. Linear programming on the other hands seek to find the optimal solution to a problem with multivariables and multiconstraints described by linear relationships.
In this lab, we will perform a non-linear regression modeling on the cost of servers, and setup a linear programming model to solve the marketing use case discussed in class.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets than what is included here. Always read carefully the instructions on Sakai. Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their particular placement in the task section.
First, we must read the file ‘ServersCost.csv’ into R, and extract the two columns of interest.
mydata <- read.csv("data/serverscost.csv", header=TRUE)
head(mydata)
## servers cost
## 1 1 27654
## 2 2 24789
## 3 3 21890
## 4 4 21633
## 5 5 15843
## 6 6 12567
The data consists of 20 rows and 2 variables: servers and cost. It represents the cost of running its servers. From observing the data, one could tell that the cost with one server is high and then decreases as servers are added up until around 10 servers, where after that point, the more servers, the higher the cost.
servers = mydata$servers
cost = mydata$cost
We start by creating a simple linear regression model. Next, we plot the points to visually inspect the data and unravel any potential relationships.
linear_model = lm(cost ~ servers)
summary(linear_model)
##
## Call:
## lm(formula = cost ~ servers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10646.2 -8646.2 -544.7 7066.0 12858.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14747.2 4035.5 3.654 0.00181 **
## servers 48.0 336.9 0.142 0.88828
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8687 on 18 degrees of freedom
## Multiple R-squared: 0.001127, Adjusted R-squared: -0.05437
## F-statistic: 0.0203 on 1 and 18 DF, p-value: 0.8883
The important values to look at are the intercept (14747.2), the server “slope” (48), the R-squared, and the Adj R-squared. Both R-squared values are low, and Adj R-squared is even negative.
# add linear line based on regression model
plot(servers,cost, pch=16) # the pch option is to accentuate the points
abline(linear_model, col="blue", lwd=2)
Through the graph above, it is clear that the linear regression model is not an accurate model to use for the data, given that the points are more parabolic than linear. This makes sense because if there are too few servers, then customer service may not be good and the company could potentially be performing poorly. However, if there are too many servers, then resources may be being wasted and the cost may be high as well. In this case, we want to be minimizing cost. The blue line here represents the model based predicted data, and the black dots are the actual data points. Clearly the line model is far from being a good predictor. Next we will use a nonlinear quadratic model to see how the model can be improved.
A linear model is of the form y ~ x0 + x1 +….xn where y is the dependent variable and the xn are the independent variables. For the non-linear quadratic regression model the equation takes the form y ~ x + x^2
# First it is best to define a new variable which is the squared value of servers
servers2 = servers^2
# The model formula is based on the form y = x + x^2
quad_model = lm(cost ~ servers + servers2)
summary(quad_model)
##
## Call:
## lm(formula = cost ~ servers + servers2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2897.8 -1553.4 -513.2 1152.4 4752.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35417.77 1742.64 20.32 2.30e-13 ***
## servers -5589.43 382.19 -14.62 4.62e-11 ***
## servers2 268.45 17.68 15.19 2.55e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2342 on 17 degrees of freedom
## Multiple R-squared: 0.9314, Adjusted R-squared: 0.9233
## F-statistic: 115.4 on 2 and 17 DF, p-value: 1.282e-10
From the summary, we see the R-squared value has increased dramatically from the linear model, indicating a big improvement. That is not the only value we must check though. Let’s inspect visually how the model based data compare to the actual data.
First, we must calculate the predicted values. Then, we can plot the points.
# compute the predicted values based on the quad model
predicted2 = predict(quad_model,data=mydata)
# Plot cost versus servers based on actual values
plot(servers,cost, pch=16)
# The par (parameter setting) command is needed to overlay the predicted model based values without the labels and annotations
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# Use the red color for the quadratic model
plot(predicted2, col="red", pch=16)
It is easy to observe now that the predicted values are more in line with the observed values. The quadratic model portrays a parabola, which is more representative of the data.
A common misconception is that the higher order the non-linear model is the better it is. Lets try a cubic model and see how it performs. This model is of the form y = x + x^2 + x^3.
#First define the additional new variable. the cubic of servers, needed for your model
servers3 = servers^3
# The model formula is of the form x + x^2 + x^3. For consistency best to call your new model cubic_model.
cubic_model = lm(cost ~ servers + servers2 + servers3)
summary(cubic_model)
##
## Call:
## lm(formula = cost ~ servers + servers2 + servers3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2871.0 -1435.1 -473.6 1271.8 4600.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36133.696 2625.976 13.760 2.77e-10 ***
## servers -5954.738 1056.596 -5.636 3.72e-05 ***
## servers2 310.895 115.431 2.693 0.016 *
## servers3 -1.347 3.619 -0.372 0.715
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2404 on 16 degrees of freedom
## Multiple R-squared: 0.932, Adjusted R-squared: 0.9193
## F-statistic: 73.11 on 3 and 16 DF, p-value: 1.478e-09
We can now visually inspect the goodness of the quadratic model and the cubic model. Though the cubic model does have a high R2 and Adj R2, the Adj R2 for the the quadratic model is higher and therefore better than the Adj R2 for the cubic, lining up with the statement above that a high order does not necessarily mean the model is better.
# compute the predicted values based on the cubic model. For consistency best to call predicted3.
predicted3 = predict(cubic_model,data=mydata)
# Plot cost versus servers based on actual values
plot(servers,cost, pch=16)
# The par (parameter setting) command is needed to overlay the predicted model based values without the labels and annotations
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# Use the green color for the cubic model
plot(predicted3, col="green", pch=16)
# The par (parameter setting) command is needed to overlay the predicted model based values without the labels and annotations
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# Use the red color for the quadratic model
plot(predicted2, col="red", pch=16)
From the graphs it should be hard to tell which of the two models, the quadratic or the cubic, is a better fit and predictor. A better way to quantify which model is best is to look at the Adjusted-R2.
Linear Model: R2=0.001127, Adj R2=-0.05437
Quadratic Model: R2=0.9314, Adj R2=0.9233
Cubic Model: R2=0.932, Adj R2=0.9193
For this task, we need to install an optimization package in R.
# Require will load the package only if not installed
# Dependencies = TRUE makes sure that dependencies are install
if(!require("lpSolveAPI",quietly = TRUE))
install.packages("lpSolveAPI",dependencies = TRUE, repos = "https://cloud.r-project.org")
First create the linear programming model object in R. This is the starting point. The object will eventually contain all definitions and results.
# We start with `0` constraint and `2` decision variables. The object name `lpmark` is discretionary.
lpmark <- make.lp(0, 2)
Next we need to define the type of optimization, set the objective function, and add the constraints to our model object.
# Define type of optimization as maximum and dump the screen output into `dump`
dump = lp.control(lpmark, sense="max")
# Set the objective function
set.objfn(lpmark, c(275.691, 48.341))
# add constraints
add.constraint(lpmark, c(1, 1), "<=", 350000)
add.constraint(lpmark, c(1, 0), ">=", 15000)
add.constraint(lpmark, c(0, 1), ">=", 75000)
add.constraint(lpmark, c(2, -1), "=", 0)
add.constraint(lpmark, c(1, 0), ">=", 0)
add.constraint(lpmark, c(0, 1), ">=", 0)
Finally we can explore and solve the model using the lpSolveAPI package.
# View the problem formulation in tabular/matrix form
lpmark
## Model name:
## C1 C2
## Maximize 275.691 48.341
## R1 1 1 <= 350000
## R2 1 0 >= 15000
## R3 0 1 >= 75000
## R4 2 -1 = 0
## R5 1 0 >= 0
## R6 0 1 >= 0
## Kind Std Std
## Type Real Real
## Upper Inf Inf
## Lower 0 0
# Solve
solve(lpmark)
## [1] 0
# Display the objective function optimum value
get.objective(lpmark)
## [1] 43443517
# Display the decision variables optimum values
get.variables(lpmark)
## [1] 116666.7 233333.3
Optimum Sales = $43,443,517
Optimum Amount Spent on Radio Ads ~ $116,666
Optimum Amount Spent on Tv Ads ~ $233,333
The optimum values above satisfy all the constraints from above: - both optimum radio ads and tv ads together are below the budget of $350,000 1) the optimum amount spent on radio ads is above $15,000 2) the optimum amount spent on tv ads is above $75,000 3) 116,666 is around half 233,333 satisfying the 4th constraint 4) the optimum amount spent on radio ads is greater than 0 5) the optimum amount spent on tv ads is greater than 0