In this lab, we will focus on linear and non-linear programming.
The linear techniques discussed in the previous lab, simple and multiple linear regression, model variables whose relationship is a straight line, whether direct (one rises as the other rises) or inverse (one falls as the other rises).
Sometimes, however, the variables do not predict each other in a linear way. For example, looking at the stock market over time, we know that the market was generally booming before the crash, then it crashed and the Great Depression hit, and afterwards it slowly started to rise again.
This pattern is not linear, but a non-linear technique can model it and predict the value of the market based on the year.
In this lab, we will explore topics like optimization, solve a marketing model, and perform linear and non-linear regression on the cost of servers.
We are going to use tidyverse, a collection of R packages designed for data science.
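library(lpSolveAPI)  # provides make.lp(), lp.control(), set.objfn(), add.constraint()
lprec <- make.lp(0, 2)  # empty model: 0 constraints, 2 decision variables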
lp.control(lprec, sense="max")
$anti.degen
[1] "fixedvars" "stalling"
$basis.crash
[1] "none"
$bb.depthlimit
[1] -50
$bb.floorfirst
[1] "automatic"
$bb.rule
[1] "pseudononint" "greedy" "dynamic" "rcostfixing"
$break.at.first
[1] FALSE
$break.at.value
[1] 1e+30
$epsilon
epsb epsd epsel epsint epsperturb epspivot
1e-10 1e-09 1e-12 1e-07 1e-05 2e-07
$improve
[1] "dualfeas" "thetagap"
$infinite
[1] 1e+30
$maxpivot
[1] 250
$mip.gap
absolute relative
1e-11 1e-11
$negrange
[1] -1e+06
$obj.in.basis
[1] TRUE
$pivoting
[1] "devex" "adaptive"
$presolve
[1] "none"
$scalelimit
[1] 5
$scaling
[1] "geometric" "equilibrate" "integers"
$sense
[1] "maximize"
$simplextype
[1] "dual" "primal"
$timeout
[1] 0
$verbose
[1] "neutral"
set.objfn(lprec, c(275.691, 48.341))
add.constraint(lprec, c(1, 1), "<=", 350000)
add.constraint(lprec, c(1, 0), ">=", 15000)
add.constraint(lprec, c(0, 1), ">=", 75000)
add.constraint(lprec, c(2, -1), "=", 0)
lprec
Model name:
               C1       C2
Maximize  275.691   48.341
R1              1        1  <=  350000
R2              1        0  >=   15000
R3              0        1  >=   75000
R4              2       -1   =       0
Kind          Std      Std
Type         Real     Real
Upper         Inf      Inf
Lower           0        0
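Written out algebraically, the model printed above corresponds to the following linear program. The symbols x_1 and x_2 are labels assumed here for the columns C1 and C2; the lab output does not name them.

```latex
\begin{aligned}
\max_{x_1,\,x_2}\quad & 275.691\,x_1 + 48.341\,x_2 \\
\text{subject to}\quad & x_1 + x_2 \le 350000 \\
& x_1 \ge 15000 \\
& x_2 \ge 75000 \\
& 2x_1 - x_2 = 0 \\
& x_1,\ x_2 \ge 0
\end{aligned}
```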
# solve
solve(lprec)
[1] 0
get.objective(lprec)
[1] 43443517
get.variables(lprec)
[1] 116666.7 233333.3
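The value 0 returned by solve() is the status code for "optimal solution found". As a sanity check, the optimum can also be reproduced by hand in base R. This is only a sketch: x1 and x2 are labels assumed here for columns C1 and C2, and the shortcut relies on the equality constraint R4 reducing the problem to a single variable.

```r
# Hand check of the LP optimum using base R only (no lpSolveAPI needed).
# x1 and x2 are assumed labels for the model columns C1 and C2.
obj <- c(275.691, 48.341)        # objective coefficients from set.objfn()

# R4: 2*x1 - x2 = 0  =>  x2 = 2*x1, so R1 becomes 3*x1 <= 350000.
# Along that line the objective is (275.691 + 2*48.341)*x1, which grows with x1,
# so the maximum sits where R1 is tight.
x1 <- 350000 / 3
x2 <- 2 * x1
x  <- c(x1, x2)

# Confirm that every constraint holds at this point (small tolerance for rounding)
feasible <- (x1 + x2 <= 350000 + 1e-6) && (x1 >= 15000) && (x2 >= 75000) &&
  (abs(2 * x1 - x2) < 1e-6)

objective <- sum(obj * x)

print(feasible)
print(round(x, 1))
print(round(objective))
```

The printed point and objective should agree with get.variables() and get.objective() above.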
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() head()
mydata = read.csv("data/ServersCost.csv")
head(mydata)
servers = mydata$servers
cost = mydata$cost
cor(mydata)
           servers       cost
servers 1.00000000 0.03356606
cost    0.03356606 1.00000000
There is a positive correlation between servers and cost (r ≈ 0.034); however, it is very weak.
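A weak correlation does not by itself mean the two variables are unrelated, because cor() measures only linear association. The sketch below uses made-up data (not the values from ServersCost.csv) to illustrate how a perfectly U-shaped relationship can still produce a correlation of essentially zero.

```r
# Synthetic example (made-up data, NOT the lab's dataset):
# y is fully determined by x, but the relationship is U-shaped rather than linear.
x <- 1:20
y <- (x - mean(x))^2

r_linear <- cor(x, y)                  # Pearson correlation of x with y
r_curve  <- cor((x - mean(x))^2, y)    # correlation of the squared term with y

print(round(r_linear, 4))
print(r_curve)
```

The first value is essentially zero even though x determines y exactly, while the correlation with the squared term is 1.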
Commands: p <- qplot( x = INDEPENDENT, y = DEPENDENT, data = mydata) + geom_point()
library("plotly")
p = qplot( x = servers, y = cost, data = mydata) + geom_point()
p
Command: p + geom_smooth(method = "lm")
p1 = p + geom_smooth(method = "lm")
p1
The points do not follow a linear pattern. They appear to follow a quadratic (U-shaped) curve instead, so the slightly positive trend line of the linear model does not fit the data at all.
linear_model = lm( cost ~ servers )
linear_model
Call:
lm(formula = cost ~ servers)
Coefficients:
(Intercept)      servers
      14747           48
summary(linear_model)
Call:
lm(formula = cost ~ servers)
Residuals:
     Min       1Q   Median       3Q      Max
-10646.2  -8646.2   -544.7   7066.0  12858.8
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  14747.2     4035.5   3.654  0.00181 **
servers         48.0      336.9   0.142  0.88828
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8687 on 18 degrees of freedom
Multiple R-squared: 0.001127, Adjusted R-squared: -0.05437
F-statistic: 0.0203 on 1 and 18 DF, p-value: 0.8883
The R-squared value is 0.001127 and the adjusted R-squared is -0.05437. The extremely low R-squared and negative adjusted R-squared values indicate that this is a very poor model for the data.
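The adjusted R-squared can be negative because it penalizes R-squared for the number of predictors. As a rough check, the standard adjustment formula can be applied to the numbers reported above; the function name adjusted_r2 is a hypothetical helper introduced for illustration, and n = 20 is inferred from the 18 residual degrees of freedom with one predictor.

```r
# Adjusted R-squared from plain R-squared.
#   r2: multiple R-squared, n: number of observations, p: number of predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Linear model above: R-squared 0.001127, 18 residual df with 1 predictor => n = 20
adj_linear <- adjusted_r2(0.001127, n = 20, p = 1)
print(round(adj_linear, 5))
```

The result should agree, up to rounding, with the adjusted R-squared in the summary output.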
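Next, we add a squared term as a transformed predictor and fit a quadratic model to see how well it fits the data. The model is non-linear in servers but still linear in its coefficients, so lm() can fit it.
Quadratic Model: y = b0 + b1*x + b2*x^2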
servers = mydata$servers
servers_squared = mydata$servers^2
quad_model = lm(cost ~ servers + servers_squared)
quad_model
Call:
lm(formula = cost ~ servers + servers_squared)
Coefficients:
    (Intercept)          servers  servers_squared
        35417.8          -5589.4            268.4
summary(quad_model)
Call:
lm(formula = cost ~ servers + servers_squared)
Residuals:
    Min      1Q  Median      3Q     Max
-2897.8 -1553.4  -513.2  1152.4  4752.7
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     35417.77    1742.64   20.32 2.30e-13 ***
servers         -5589.43     382.19  -14.62 4.62e-11 ***
servers_squared   268.45      17.68   15.19 2.55e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2342 on 17 degrees of freedom
Multiple R-squared: 0.9314, Adjusted R-squared: 0.9233
F-statistic: 115.4 on 2 and 17 DF, p-value: 1.282e-10
The R-squared value is 0.9314 and the adjusted R-squared is 0.9233. Both values are close to 1, which indicates that this model is a good fit for the data.
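An alternative to building the squared column by hand is to square the predictor inside the formula with I(). The sketch below uses made-up data (the data frame demo and its values are assumptions for illustration, not the lab's ServersCost.csv) and suggests that both approaches give the same coefficients, while the I() form makes it easier to predict for new inputs through the newdata argument.

```r
# Made-up illustration data (NOT the lab's ServersCost.csv): a noisy U-shaped cost curve
set.seed(1)
demo <- data.frame(servers = 1:20)
demo$cost <- 35000 - 5500 * demo$servers + 270 * demo$servers^2 + rnorm(20, sd = 2000)

# Approach used in this lab: build the squared column by hand
demo$servers_squared <- demo$servers^2
fit_manual <- lm(cost ~ servers + servers_squared, data = demo)

# Equivalent: let the formula square the predictor via I()
fit_inline <- lm(cost ~ servers + I(servers^2), data = demo)

# Because fit_inline depends only on 'servers', predicting for new inputs is simple
new_servers <- data.frame(servers = c(21, 22))
pred_new <- predict(fit_inline, newdata = new_servers)

print(coef(fit_inline))
print(pred_new)
```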
Commands: predicted_2 <- predict( quad_model, data = mydata )
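servers2 = servers^2  # same values as servers_squared above
quad_model = lm(cost ~ servers + servers2 )
# predict.lm() takes 'newdata', not 'data'; an unrecognised 'data' argument is ignored,
# so calling predict() without new data returns the fitted values for the original rows
predicted2 = predict(quad_model)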
predicted2
        1         2         3         4         5         6         7         8
30096.790 25312.706 21065.520 17355.233 14181.844 11545.354  9445.762  7883.068
        9        10        11        12        13        14        15        16
 6857.273  6368.376  6416.377  7001.277  8123.076  9781.772 11977.367 14709.861
       17        18        19        20
17979.252 21785.543 26128.731 31008.818
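The fitted values can be cross-checked against the coefficients printed by summary(quad_model). The sketch below plugs the rounded coefficients into the quadratic equation and also locates the vertex of the parabola, which is where the fitted cost is lowest. The function name quad_cost is a hypothetical helper, and because the coefficients are rounded, results may differ slightly from predict().

```r
# Coefficients as printed by summary(quad_model) above (rounded to two decimals)
b0 <- 35417.77
b1 <- -5589.43
b2 <- 268.45

# Fitted quadratic: cost = b0 + b1*servers + b2*servers^2
quad_cost <- function(servers) b0 + b1 * servers + b2 * servers^2

# Vertex of the parabola: where the fitted cost is lowest (b2 > 0, so it opens upward)
servers_at_min <- -b1 / (2 * b2)

print(quad_cost(1))        # compare with the first value of predicted2
print(servers_at_min)
```

The vertex falls between 10 and 11 servers, which is consistent with the smallest fitted values appearing at positions 10 and 11 of predicted2.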
Commands: qplot( x = INDEPENDENT, y = PREDICTED, colour = I("red") )
qplot( x = servers, y = predicted2, colour = I("red"))  # I() draws the points in red instead of mapping "red" to a legend entry
The predicted points trace a U-shaped curve that closely matches the actual points, so this model appears to be a good fit for the data.
servers_cubed = mydata$servers^3
cubic_model = lm(cost ~ servers + servers_squared + servers_cubed)
cubic_model
Call:
lm(formula = cost ~ servers + servers_squared + servers_cubed)
Coefficients:
    (Intercept)          servers  servers_squared    servers_cubed
      36133.696        -5954.738          310.895           -1.347
summary(cubic_model)
Call:
lm(formula = cost ~ servers + servers_squared + servers_cubed)
Residuals:
    Min      1Q  Median      3Q     Max
-2871.0 -1435.1  -473.6  1271.8  4600.3
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     36133.696   2625.976  13.760 2.77e-10 ***
servers         -5954.738   1056.596  -5.636 3.72e-05 ***
servers_squared   310.895    115.431   2.693    0.016 *
servers_cubed      -1.347      3.619  -0.372    0.715
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2404 on 16 degrees of freedom
Multiple R-squared: 0.932, Adjusted R-squared: 0.9193
F-statistic: 73.11 on 3 and 16 DF, p-value: 1.478e-09
The R-squared value for this model is 0.932 and the adjusted R-squared is 0.9193. Based on these values, the model is also a good fit for the data.
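When comparing the cubic model with the quadratic one, it may help to look at the servers_cubed row of the coefficient table and at the adjusted R-squared rather than R-squared alone. The sketch below recomputes the two-sided p-value from the t value and residual degrees of freedom reported above, then compares the two adjusted R-squared values. The numbers are copied from the summaries; nothing is refit.

```r
# Values copied from the servers_cubed row of summary(cubic_model) above
t_value  <- -0.372
resid_df <- 16                     # residual degrees of freedom of the cubic model

# Two-sided p-value for H0: the coefficient of servers_cubed is zero
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# Adjusted R-squared reported above for each model
adj_r2 <- c(quadratic = 0.9233, cubic = 0.9193)

print(round(p_value, 3))
print(names(which.max(adj_r2)))
```

A large p-value for the cubic term together with a lower adjusted R-squared suggests that the extra term adds little beyond the quadratic fit.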
Commands: predicted3 <- predict( cubic_model, data = mydata )
predicted3 = predict(cubic_model)  # fitted values; predict.lm() has no 'data' argument (new data goes in 'newdata')
Commands: qplot( x = INDEPENDENT, y = PREDICTED, colour = I("red") )
qplot( x = servers, y = predicted3, colour = I("red"))  # I() draws the points in red instead of mapping "red" to a legend entry
Based on this plot, the cubic model also appears to be a good fit for the data. Its R-squared (0.932) is marginally higher than the quadratic model's (0.9314), but its adjusted R-squared is slightly lower (0.9193 vs. 0.9233) and the servers_cubed term is not statistically significant (p = 0.715), so the extra term adds very little. The difference between the two models is so small that it would be okay to use either one for this data.
Variables: LINEAR_MODEL, PREDICTED_QUADRATIC, PREDICTED_CUBIC
# Black = Actual Data
plot(servers, cost, pch = 16)
# Blue = Linear Line based on Linear Regression Model
abline(linear_model, col = "blue", lwd = 2)
# Red = Quadratic Model based on Quadratic Regression found above
# points() draws on the existing axes, so the overlay shares the scale of the actual data
# (par(new = TRUE) followed by plot() would rescale the axes and misalign the points)
points(servers, predicted2, col = "red", pch = 16)
# Green = Cubic Model based on Cubic Regression found above
points(servers, predicted3, col = "green", pch = 16)
In my opinion, the cubic model (predicted3) is the best fit for the data because its predicted values are the closest to the actual values, although only marginally: the quadratic model is nearly identical and has the slightly higher adjusted R-squared, so it would be okay to use that one as well. The linear model is clearly the worst fit and should not be used at all.