In this lab, we will focus on linear programming and on linear and non-linear regression.
Linear regression, as discussed in the previous lab, covers simple and multiple linear regression; when variables move together in a roughly straight-line (direct or inverse) way, these techniques can model the relationship well.
Sometimes, however, the variables do not predict each other in a linear way. For example, looking at the stock market over time, the market was generally booming before the crash, then it collapsed as the Great Depression hit, and only slowly did it begin to rise again.
This pattern is not linear, and a non-linear regression technique is needed to model it and predict the value of the market from the year.
In this lab, we will explore optimization, solve a marketing model with linear programming, and perform linear and non-linear regression on the cost of servers.
We are going to use the tidyverse, a collection of R packages designed for data science.
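If the collection is not already installed, it can be installed and loaded as below (a minimal sketch; only the ggplot2 piece is used directly later in this lab):
# install.packages("tidyverse")  # run once if the collection is not yet installed
library(tidyverse)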
require(lpSolveAPI)
## Loading required package: lpSolveAPI
# Create an empty linear program with 0 constraints and 2 decision variables
lprec <- make.lp(0, 2)
# Set the optimization direction to maximization
lp.control(lprec, sense="max")
## $anti.degen
## [1] "fixedvars" "stalling"
##
## $basis.crash
## [1] "none"
##
## $bb.depthlimit
## [1] -50
##
## $bb.floorfirst
## [1] "automatic"
##
## $bb.rule
## [1] "pseudononint" "greedy" "dynamic" "rcostfixing"
##
## $break.at.first
## [1] FALSE
##
## $break.at.value
## [1] 1e+30
##
## $epsilon
## epsb epsd epsel epsint epsperturb epspivot
## 1e-10 1e-09 1e-12 1e-07 1e-05 2e-07
##
## $improve
## [1] "dualfeas" "thetagap"
##
## $infinite
## [1] 1e+30
##
## $maxpivot
## [1] 250
##
## $mip.gap
## absolute relative
## 1e-11 1e-11
##
## $negrange
## [1] -1e+06
##
## $obj.in.basis
## [1] TRUE
##
## $pivoting
## [1] "devex" "adaptive"
##
## $presolve
## [1] "none"
##
## $scalelimit
## [1] 5
##
## $scaling
## [1] "geometric" "equilibrate" "integers"
##
## $sense
## [1] "maximize"
##
## $simplextype
## [1] "dual" "primal"
##
## $timeout
## [1] 0
##
## $verbose
## [1] "neutral"
# Objective function: maximize 275.691*C1 + 48.341*C2
set.objfn(lprec, c(275.691, 48.341))
# Constraints
add.constraint(lprec, c(1, 1), "<=", 350000)  # C1 + C2 <= 350000
add.constraint(lprec, c(1, 0), ">=", 15000)   # C1 >= 15000
add.constraint(lprec, c(0, 1), ">=", 75000)   # C2 >= 75000
add.constraint(lprec, c(2, -1), "=", 0)       # 2*C1 - C2 = 0, i.e. C2 = 2*C1
lprec
## Model name:
## C1 C2
## Maximize 275.691 48.341
## R1 1 1 <= 350000
## R2 1 0 >= 15000
## R3 0 1 >= 75000
## R4 2 -1 = 0
## Kind Std Std
## Type Real Real
## Upper Inf Inf
## Lower 0 0
# solve
solve(lprec)
## [1] 0
A return value of 0 means the solver terminated successfully with an optimal solution.
get.objective(lprec)
## [1] 43443517
The maximum value of the objective function at the optimum is approximately 43,443,517.
get.variables(lprec)
## [1] 116666.7 233333.3
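The optimal solution sets the first decision variable (C1) to about 116,666.7 and the second (C2) to about 233,333.3. As a quick sanity check (a sketch using lpSolveAPI's get.constraints accessor; the arithmetic in the comments simply restates the output above), we can confirm that the optimum uses the full 350,000 and satisfies the equality constraint 2*C1 = C2:
sol <- get.variables(lprec)    # c(116666.7, 233333.3)
get.constraints(lprec)         # left-hand-side value of each constraint at the optimum
sum(sol)                       # 116666.7 + 233333.3 = 350000, so R1 is binding
2 * sol[1] - sol[2]            # approximately 0, so R4 holds
sum(c(275.691, 48.341) * sol)  # approximately 43443517, matching get.objective(lprec)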
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() head()
mydata = read.csv(file = "data/ServersCost.csv")
head(mydata)
## servers cost
## 1 1 27654
## 2 2 24789
## 3 3 21890
## 4 4 21633
## 5 5 15843
## 6 6 12567
servers = mydata$servers
cost = mydata$cost
summary(mydata)
## servers cost
## Min. : 1.00 Min. : 4533
## 1st Qu.: 5.75 1st Qu.: 6533
## Median :10.50 Median :14682
## Mean :10.50 Mean :15251
## 3rd Qu.:15.25 3rd Qu.:22665
## Max. :20.00 Max. :28111
cor(mydata)
## servers cost
## servers 1.00000000 0.03356606
## cost 0.03356606 1.00000000
The correlation between servers and cost (about 0.034) is positive but extremely weak, which already hints that a straight-line model will not capture the relationship.
Commands: p <- qplot( x = INDEPENDENT, y = DEPENDENT, data = mydata) + geom_point()
# plotly was installed previously with install.packages("plotly")
library("plotly")
## Warning: package 'plotly' was built under R version 3.4.4
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- qplot( x = servers, y = cost, data = mydata) + geom_point()
p
Command: p + geom_smooth(method = "lm")
p2 <- p + geom_smooth(method = "lm")
p2
The fitted trend line is a poor fit: it misses most of the data points because the underlying relationship is not linear; cost first falls and then rises as the number of servers increases.
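As a quick preview of the non-linear fit built below, ggplot2's geom_smooth can overlay a quadratic trend directly (a sketch; the formula argument is an assumption about how the curvature will be modelled, and p3 is just an illustrative name):
# Overlay a quadratic smoother instead of a straight line
p3 <- p + geom_smooth(method = "lm", formula = y ~ poly(x, 2))
p3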
linear_model <- lm(cost ~ servers)
linear_model
##
## Call:
## lm(formula = cost ~ servers)
##
## Coefficients:
## (Intercept) servers
## 14747 48
summary(linear_model)
##
## Call:
## lm(formula = cost ~ servers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10646.2 -8646.2 -544.7 7066.0 12858.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14747.2 4035.5 3.654 0.00181 **
## servers 48.0 336.9 0.142 0.88828
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8687 on 18 degrees of freedom
## Multiple R-squared: 0.001127, Adjusted R-squared: -0.05437
## F-statistic: 0.0203 on 1 and 18 DF, p-value: 0.8883
This model fits very poorly: the R-squared (0.0011) is essentially zero, the adjusted R-squared (-0.054) is negative, and the servers coefficient is not significant (p = 0.888).
We therefore apply a transformation and fit a non-linear quadratic model to see how well it fits the data.
Quadratic Model: cost = b0 + b1*servers + b2*servers^2
servers_squared = servers^2
quad_model <- lm(cost ~ servers + servers_squared)
quad_model
##
## Call:
## lm(formula = cost ~ servers + servers_squared)
##
## Coefficients:
## (Intercept) servers servers_squared
## 35417.8 -5589.4 268.4
summary(quad_model)
##
## Call:
## lm(formula = cost ~ servers + servers_squared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2897.8 -1553.4 -513.2 1152.4 4752.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35417.77 1742.64 20.32 2.30e-13 ***
## servers -5589.43 382.19 -14.62 4.62e-11 ***
## servers_squared 268.45 17.68 15.19 2.55e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2342 on 17 degrees of freedom
## Multiple R-squared: 0.9314, Adjusted R-squared: 0.9233
## F-statistic: 115.4 on 2 and 17 DF, p-value: 1.282e-10
The R-squared and adjusted R-squared values of 0.9314 and 0.9233, respectively, are very high, meaning this is an excellent fit for the data, especially compared to the linear model.
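Note that the same quadratic fit can be specified without building the servers_squared column by hand, using I() or poly() inside the formula (a sketch of an equivalent, more idiomatic formulation; quad_model_alt and quad_model_poly are just illustrative names, and their coefficients should match quad_model above). A practical benefit is that predict() can then be called with a newdata frame containing only a servers column:
# Equivalent quadratic fits without a separate squared column
quad_model_alt  <- lm(cost ~ servers + I(servers^2), data = mydata)
quad_model_poly <- lm(cost ~ poly(servers, 2, raw = TRUE), data = mydata)
summary(quad_model_alt)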
Commands: predicted_2 <- predict( quad_model, data = mydata )
predicted2 = predict(quad_model,data=mydata)
predicted2
## 1 2 3 4 5 6 7
## 30096.790 25312.706 21065.520 17355.233 14181.844 11545.354 9445.762
## 8 9 10 11 12 13 14
## 7883.068 6857.273 6368.376 6416.377 7001.277 8123.076 9781.772
## 15 16 17 18 19 20
## 11977.367 14709.861 17979.252 21785.543 26128.731 31008.818
Commands: qplot( x = DEPENDENT, y = INDEPENDENT/PREDICTED, colour = “red” )
qplot( x = servers, y = predicted2, colour = "red" )
The predicted values track the actual values closely, confirming that the quadratic model fits the data well.
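One way to see this directly is to overlay the quadratic predictions on the raw data (a sketch using ggplot2, assuming the predicted2 vector computed above):
# Actual costs as points, quadratic predictions as a red line
ggplot(mydata, aes(x = servers, y = cost)) +
  geom_point() +
  geom_line(aes(y = predicted2), colour = "red")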
servers_cubed = servers^3
cubic_model <- lm(cost ~ servers + servers_squared + servers_cubed)
cubic_model
##
## Call:
## lm(formula = cost ~ servers + servers_squared + servers_cubed)
##
## Coefficients:
## (Intercept) servers servers_squared servers_cubed
## 36133.696 -5954.738 310.895 -1.347
summary(cubic_model)
##
## Call:
## lm(formula = cost ~ servers + servers_squared + servers_cubed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2871.0 -1435.1 -473.6 1271.8 4600.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36133.696 2625.976 13.760 2.77e-10 ***
## servers -5954.738 1056.596 -5.636 3.72e-05 ***
## servers_squared 310.895 115.431 2.693 0.016 *
## servers_cubed -1.347 3.619 -0.372 0.715
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2404 on 16 degrees of freedom
## Multiple R-squared: 0.932, Adjusted R-squared: 0.9193
## F-statistic: 73.11 on 3 and 16 DF, p-value: 1.478e-09
The cubic model is also a good fit, with an R-squared of 0.932 and an adjusted R-squared of 0.9193. However, the cubic term is not significant (p = 0.715) and the adjusted R-squared is slightly lower than the quadratic model's, so the quadratic model remains the better choice.
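Because the quadratic model is nested inside the cubic one, an F-test can check whether the cubic term adds anything (a sketch using anova(); given the p-value of 0.715 on servers_cubed above, we would expect it not to):
# Compare the nested models: does the cubic term significantly improve the fit?
anova(quad_model, cubic_model)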
Commands: predicted3 <- predict( cubic_model, data = mydata )
predicted3 <- predict( cubic_model, data = mydata )
Commands: qplot( x = DEPENDENT, y = INDEPENDENT/PREDICTED, colour = “red” )
qplot( x = servers, y = predicted3, colour = "green")
This plot is also a good fit for the data: the predicted values look similar to the actual values, and the cubic predictions appear nearly identical to the quadratic ones. That is because the R-squared and adjusted R-squared are nearly the same for both models, with the quadratic slightly better overall. The difference is barely noticeable to the eye, though the cubic curve is slightly narrower.
Variables: LINEAR_MODEL, PREDICTED_QUADRATIC, PREDICTED_CUBIC
# Black = Actual Data
plot(servers, cost, pch = 16, ylim = range(cost, predicted2, predicted3))
# Blue = Linear Line based on Linear Regression Model
abline(linear_model, col = "blue", lwd = 2)
# Red = Quadratic Model based on Quadratic Regression found above
# Overlay new points on the same axes, without labels and annotations
par(new = TRUE, xaxt = "n", yaxt = "n", ann = FALSE)
plot(servers, predicted2, col = "red", pch = 16, ylim = range(cost, predicted2, predicted3))
# Green = Cubic Model based on Cubic Regression found above
# Overlay new points on the same axes, without labels and annotations
par(new = TRUE, xaxt = "n", yaxt = "n", ann = FALSE)
plot(servers, predicted3, col = "green", pch = 16, ylim = range(cost, predicted2, predicted3))
There is not much difference between the cubic and quadratic models. The linear model is clearly the worst fit, but it would be difficult to distinguish between the cubic and quadratic models by eye.
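Finally, the chosen quadratic model can be used to estimate the cost for a server count that was not observed (a sketch; 25 servers is just an illustrative input, and the squared column must be supplied because quad_model was fit with servers_squared as a separate variable):
# Predict the cost of running 25 servers with the quadratic model
new_point <- data.frame(servers = 25, servers_squared = 25^2)
predict(quad_model, newdata = new_point)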