In this lab, we will focus on non-linear regression modeling and linear programming. Linear regression modeling, as discussed in the previous lab, covers simple and multiple linear regression models. Sometimes relationships are not well represented by linear models, and non-linear regression modeling is required. The general concept remains the same: minimize the error between the observed/actual values and the values predicted/fitted by the model. Linear programming, on the other hand, seeks the optimal solution to a problem with multiple variables and multiple constraints described by linear relationships, as opposed to non-linear relationships. The latter case would be non-linear programming, which is not covered in this class.
In this lab, we will perform non-linear regression modeling on the cost of servers, and set up a linear programming model to solve the marketing use case discussed in class.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets than the ones included here. Always read the instructions on Sakai carefully. For clarity, tasks/questions to be completed/answered are highlighted in red (visible only in preview mode) and numbered according to their placement in the task section. Quite often you will need to add your own code chunk.
Execute all code chunks, preview, publish, and submit link on Sakai.
First, we must read the file ‘ServersCost.csv’ into R, and extract the two columns of interest.
# Read the CSV file; header=TRUE treats the first row as column names
mydata <- read.csv("data/serverscost.csv", header=TRUE)
head(mydata)
# Extract the two columns of interest into vectors
servers = mydata$servers
cost = mydata$cost
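Optionally, a quick structural check confirms the data imported as expected (an extra step, not required by the lab):
# Optional sanity checks on the imported data
str(mydata)      # column names and types
summary(mydata)  # value ranges for servers and cost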
We start by creating a simple linear regression model. Next, we plot the points to visually inspect the data and reveal any potential relationships.
linear_model = lm(cost ~ servers)
summary(linear_model)
Call:
lm(formula = cost ~ servers)
Residuals:
Min 1Q Median 3Q Max
-10646.2 -8646.2 -544.7 7066.0 12858.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14747.2 4035.5 3.654 0.00181 **
servers 48.0 336.9 0.142 0.88828
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8687 on 18 degrees of freedom
Multiple R-squared: 0.001127, Adjusted R-squared: -0.05437
F-statistic: 0.0203 on 1 and 18 DF, p-value: 0.8883
# Plot the data, then add the fitted line from the linear regression model
plot(servers,cost, pch=16) # the pch option accentuates the points
abline(linear_model, col="blue", lwd=2)
The blue line here represents the model-based predictions, and the black dots are the actual data points. Clearly, from both the qualitative visual inspection and the quantitative \(R^2\) and adjusted \(R^2\), the linear model is far from a good fit or predictor. Next we will use a non-linear quadratic model to see how the model can be improved.
A linear model is of the form \(y \sim x_1 + x_2 + \dots + x_n\), where \(y\) is the dependent variable and the \(x_i\) are the independent variables. For the non-linear quadratic regression model we are looking for an equation of the form \(y \sim x + x^2\).
# First it is best to define a new variable which is the squared value of servers
servers2 = servers^2
# The model formula is based on the form y ~ x + x^2
quad_model = lm(cost ~ servers + servers2)
summary(quad_model)
Call:
lm(formula = cost ~ servers + servers2)
Residuals:
Min 1Q Median 3Q Max
-2897.8 -1553.4 -513.2 1152.4 4752.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35417.77 1742.64 20.32 2.30e-13 ***
servers -5589.43 382.19 -14.62 4.62e-11 ***
servers2 268.45 17.68 15.19 2.55e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2342 on 17 degrees of freedom
Multiple R-squared: 0.9314, Adjusted R-squared: 0.9233
F-statistic: 115.4 on 2 and 17 DF, p-value: 1.282e-10
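As an aside, R can fit the same quadratic model without a helper variable by wrapping the squared term in I() inside the formula; I() prevents ^ from being interpreted as formula syntax. This is an equivalent alternative, not a required step:
# Equivalent quadratic fit using I() inside the formula, no helper variable needed
quad_model_alt = lm(cost ~ servers + I(servers^2))
summary(quad_model_alt)  # same coefficients as quad_model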
From the summary, we see the R-squared value has increased dramatically relative to the linear model, indicating a big improvement. That is not the only value we must check, though. Let’s inspect visually how the model-based data compare to the actual data.
First, we must calculate the predicted/fitted values. Then, we can plot the predicted points next to the actual values.
# Compute the predicted/fitted values from the quad model using the R-function predict();
# with no newdata argument, predict() returns the fitted values for the training data
predicted2 = predict(quad_model)
# Plot cost versus servers based on actual values using a filled symbol
plot(servers,cost, pch=16)
# The par (parameter setting) command lets us overlay the predicted values without repeating the labels and annotations
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# Use the red color for the quadratic model
plot(predicted2, col="red", pch=16)
It is easy to observe now that the predicted values are more in line with the observed values.
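Note that the par(new=TRUE) overlay only lines up when both plots happen to share the same axis ranges. A more robust alternative, sketched below, draws the predictions directly onto the existing axes with points() or lines():
# More robust overlay: draw the predictions onto the same axes as the data
plot(servers, cost, pch = 16)
points(servers, predicted2, col = "red", pch = 16)
# Or draw the fitted curve as a line (sorted by servers)
lines(sort(servers), predicted2[order(servers)], col = "red", lwd = 2)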
A common misconception is that the higher the order of a non-linear model, the better it predicts. Remember from the previous lab and class sessions the need to distinguish between \(R^2\), which measures how well the model fits, and adjusted \(R^2\), which measures how well the model predicts. Let’s try a cubic model and see how it performs. The model now takes the form \(y \sim x + x^2 + x^3\).
##### 1A) Fill in the code chunk below to derive a cubic non-linear regression model, and display the summary statistics.
# First define the additional new variable, the cube of servers, needed for your model
servers3 = servers^3
# The model formula is of the form y ~ x + x^2 + x^3. For consistency, it is best to call your new model cubic_model.
cubic_model = lm(cost ~ servers + servers2 + servers3)
summary(cubic_model)
Call:
lm(formula = cost ~ servers + servers2 + servers3)
Residuals:
Min 1Q Median 3Q Max
-2871.0 -1435.1 -473.6 1271.8 4600.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36133.696 2625.976 13.760 2.77e-10 ***
servers -5954.738 1056.596 -5.636 3.72e-05 ***
servers2 310.895 115.431 2.693 0.016 *
servers3 -1.347 3.619 -0.372 0.715
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2404 on 16 degrees of freedom
Multiple R-squared: 0.932, Adjusted R-squared: 0.9193
F-statistic: 73.11 on 3 and 16 DF, p-value: 1.478e-09
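As an optional extra check (not part of the lab tasks), nested models such as the quadratic and cubic can be compared with a partial F-test using anova(); a large p-value suggests the cubic term adds little:
# Optional: partial F-test comparing the nested quadratic and cubic models
anova(quad_model, cubic_model)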
We can next visually inspect the goodness of fit of the quadratic model and the cubic model.
##### 1B) Graph a plot of cost versus servers based on actual data. Overlay the predicted values based on the cubic model, and on the quadratic model. Follow the sequence of commands described in the code chunk below.
# Compute the predicted values from the cubic model. For consistency with the previous example, it is best to call your predictions predicted3.
predicted3 = predict(cubic_model)
# Plot cost versus servers based on actual values
plot(servers, cost, pch=16)
# The par (parameter setting) command lets us overlay the predicted values without repeating the labels and annotations
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# Use the color `green` to plot the predicted points for the cubic model
plot(predicted3, col= "green", pch=16)
# par is needed again to overlay the next set of predicted values
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# Use the color `red` to plot the predicted points for the quadratic model
plot(predicted2, col="red", pch=16)
From the graphs it should be hard to tell which of the two models, the quadratic or the cubic, is the better fit and predictor. A good way to quantify which model predicts best is to look at the adjusted \(R^2\).
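For reference, the adjusted \(R^2\) penalizes each additional predictor: \[R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}\] where \(n\) is the number of observations and \(p\) the number of predictors. With \(n = 20\), this reproduces the values in the summaries above; for the quadratic model, \(1 - (1 - 0.9314)\cdot 19/17 \approx 0.9233\).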
##### 1C) List here the R2 and Adjusted R2 for all three models: linear, quadratic, and cubic. Identify which model is a better predictor and which model is a better fit. Explain why.
Linear: R-squared = 0.001127, Adj R-squared = -0.05437
Quadratic: R-squared = 0.9314, Adj R-squared = 0.9233
Cubic: R-squared = 0.932, Adj R-squared = 0.9193
The best predictor of the data is the quadratic model, because it has the highest Adj R-squared value. The cubic model is a better fit because it has the highest value of R-squared.
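These values can also be pulled directly from the model summaries rather than read off the printed output; a small sketch:
# Extract R-squared and adjusted R-squared from each model summary
for (m in list(linear_model, quad_model, cubic_model)) {
  s = summary(m)
  print(c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared))
}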
For this task, we need to install an optimization package in R.
# require() loads the package and returns FALSE if it is not installed
# dependencies = TRUE makes sure that dependencies are installed as well
if(!require("lpSolveAPI", quietly = TRUE)) {
  install.packages("lpSolveAPI", dependencies = TRUE, repos = "https://cloud.r-project.org")
  library(lpSolveAPI)
}
We will solve the marketing use case discussed in class. First, create the linear programming model object in R. This is the starting point. The object is like a container and will eventually hold all the definitions for the objective function, the constraints, and the optimized results.
# We start with `0` constraints and `2` decision variables. The object name `lpmark` is arbitrary.
lpmark <- make.lp(0, 2)
Next we need to define the type of optimization, set the objective function, and add the constraints to our model object. This is done by using different commands applicable only to the created linear programming model object.
# Define the type of optimization as maximization; to avoid unnecessary screen output in the worksheet, capture the output of lp.control() in a variable called `dump`
dump = lp.control(lpmark, sense="max")
# Set the objective function with the proper coefficients associated with the decision variables
set.objfn(lpmark, c(275.691, 48.341))
# Add the constraints: maximum allowed budget of $350K
add.constraint(lpmark, c(1, 1), "<=", 350000)
# Minimum spend on radio ads
add.constraint(lpmark, c(1, 0), ">=", 15000)
# Minimum spend on TV ads
add.constraint(lpmark, c(0, 1), ">=", 75000)
# TV spend must be twice the radio spend (2r - t = 0)
add.constraint(lpmark, c(2, -1), "=", 0)
# Non-negativity constraints on both decision variables
add.constraint(lpmark, c(1, 0), ">=", 0)
add.constraint(lpmark, c(0, 1), ">=", 0)
##### 2A) Insert in the above code chunk the remaining five constraints corresponding to the marketing model. Follow the guidelines as described in the class PPT slides.
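For reference, written out algebraically with \(r\) the radio budget and \(t\) the TV budget (both in dollars), the model assembled above is: \[\max\; 275.691\,r + 48.341\,t\] subject to \[r + t \le 350000,\qquad r \ge 15000,\qquad t \ge 75000,\qquad 2r - t = 0,\qquad r \ge 0,\; t \ge 0.\]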
Finally, we can explore and solve the model, and report results using additional commands from the lpSolveAPI package that apply only to our created lpmark linear programming object.
# View the problem formulation in tabular/matrix form
lpmark
Model name:
C1 C2
Maximize 275.691 48.341
R1 1 1 <= 350000
R2 1 0 >= 15000
R3 0 1 >= 75000
R4 2 -1 = 0
R5 1 0 >= 0
R6 0 1 >= 0
Kind Std Std
Type Real Real
Upper Inf Inf
Lower 0 0
# Solve the model; a return status of 0 means an optimal solution was found
solve(lpmark)
[1] 0
# Display the objective function optimum value
get.objective(lpmark)
[1] 43443517
# Display the decision variables optimum values
get.variables(lpmark)
[1] 116666.7 233333.3
##### 2B) Clearly mark the optimum values for sales, radio, and tv ads. Show how the optimum solution satisfies all six constraints by substituting for the decision variables and numerically validating each case.
Optimum values: sales = 43,443,517; radio = 116,666.7; TV = 233,333.3.
All constraints are satisfied because:
The sum of the optimum radio and TV values, 116,666.7 + 233,333.3, equals 350,000, satisfying the budget constraint (≤ 350,000).
The optimum radio value, 116,666.7, is greater than 15,000.
The optimum TV value, 233,333.3, is greater than 75,000.
The optimum TV value is exactly twice the optimum radio value, so 2 × 116,666.7 − 233,333.3 = 0.
Both optimum values are positive, so the two non-negativity constraints are also satisfied.
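As a numeric cross-check, the solver result can be reproduced by hand: with the budget and ratio constraints binding, \(r + t = 350000\) and \(t = 2r\) give \(r = 350000/3\). A small sketch:
# Binding constraints: r + t = 350000 and t = 2r, hence r = 350000/3
r = 350000 / 3
t = 2 * r
c(radio = r, tv = t, sales = 275.691 * r + 48.341 * t)
# radio ~ 116666.7, tv ~ 233333.3, sales ~ 43443517 (matches the solver output)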