Notebook Instructions


About

  • In this lab, we will focus on linear and non-linear programming.

  • Linear regression, as discussed in the previous lab, covers simple and multiple linear regression techniques; when the variables have a direct (positive) or an inverse (negative) linear relationship, these techniques can model it well.

  • Sometimes, however, the variables do not predict each other in a linear way. For example, looking at the stock market vs. time, we know that generally the market was booming before the crash, then the market crashed and the Great Depression hit, and slowly the market started to rise again.

  • This pattern is not linear, and in fact a non-linear regression technique can be used to model it and predict the value of the market based on the year.

  • In this lab, we will explore topics like optimization, solve a marketing model, and perform linear and non-linear regression on the cost of servers.

Load Packages in R/RStudio

We are going to use tidyverse, a collection of R packages designed for data science, along with lpSolveAPI for the linear programming task.
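The loading chunk itself is not echoed in this document; judging from the message below, it likely contained something like the following (a sketch, assuming require() produced the message):

require(lpSolveAPI)   # prints "Loading required package: lpSolveAPI"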

## Loading required package: lpSolveAPI

Task 1: Linear Programming - Solving Marketing Model


1A) Create the model object in R.

# an LP model with 0 constraints (added later) and 2 decision variables
lprec <- make.lp(0, 2) 

Set the constraints and objective function for the model.

  • Set the objective sense to maximize. Note that lp.control() returns the model's full list of control settings, shown below.
lp.control(lprec, sense="max")  
## $anti.degen
## [1] "fixedvars" "stalling" 
## 
## $basis.crash
## [1] "none"
## 
## $bb.depthlimit
## [1] -50
## 
## $bb.floorfirst
## [1] "automatic"
## 
## $bb.rule
## [1] "pseudononint" "greedy"       "dynamic"      "rcostfixing" 
## 
## $break.at.first
## [1] FALSE
## 
## $break.at.value
## [1] 1e+30
## 
## $epsilon
##       epsb       epsd      epsel     epsint epsperturb   epspivot 
##      1e-10      1e-09      1e-12      1e-07      1e-05      2e-07 
## 
## $improve
## [1] "dualfeas" "thetagap"
## 
## $infinite
## [1] 1e+30
## 
## $maxpivot
## [1] 250
## 
## $mip.gap
## absolute relative 
##    1e-11    1e-11 
## 
## $negrange
## [1] -1e+06
## 
## $obj.in.basis
## [1] TRUE
## 
## $pivoting
## [1] "devex"    "adaptive"
## 
## $presolve
## [1] "none"
## 
## $scalelimit
## [1] 5
## 
## $scaling
## [1] "geometric"   "equilibrate" "integers"   
## 
## $sense
## [1] "maximize"
## 
## $simplextype
## [1] "dual"   "primal"
## 
## $timeout
## [1] 0
## 
## $verbose
## [1] "neutral"
# objective: maximize 275.691*x1 + 48.341*x2
set.objfn(lprec, c(275.691, 48.341))

1B) Add constraints

add.constraint(lprec, c(1, 1), "<=", 350000)   # x1 + x2 <= 350000
add.constraint(lprec, c(1, 0), ">=", 15000)    # x1 >= 15000
add.constraint(lprec, c(0, 1), ">=", 75000)    # x2 >= 75000
add.constraint(lprec, c(2, -1), "=", 0)        # 2*x1 - x2 = 0, i.e. x2 = 2*x1

View the problem formulation in tabular/matrix form to confirm that the model was created correctly.

lprec
## Model name: 
##                C1       C2            
## Maximize  275.691   48.341            
## R1              1        1  <=  350000
## R2              1        0  >=   15000
## R3              0        1  >=   75000
## R4              2       -1   =       0
## Kind          Std      Std            
## Type         Real     Real            
## Upper         Inf      Inf            
## Lower           0        0

1C) Solve the optimization problem

# solve the LP; the return value is a status code
solve(lprec) 
## [1] 0

A status code of 0 means the solver found an optimal solution.

Display the objective function's optimum value

get.objective(lprec)
## [1] 43443517

The optimum (maximum) value of the objective function is approximately 43,443,517.

Display the variables' optimum values

get.variables(lprec) 
## [1] 116666.7 233333.3
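As a quick sanity check (not part of the lab template), the solution can be verified against the constraints and the objective:

sol <- get.variables(lprec)
sum(sol)                        # 350000: constraint R1 holds with equality
2 * sol[1] - sol[2]             # ~0: constraint R4 (x2 = 2*x1) holds
sum(c(275.691, 48.341) * sol)   # ~43443517: matches get.objective()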

Task 2: Regression Analysis - Linear Regression


2A) Read the csv file into R Studio and display the dataset.

  • Name your dataset ‘mydata’ so it is easy to work with.

  • Commands: read_csv(), head()

mydata = read.csv(file = "data/ServersCost.csv")

head(mydata)
##   servers  cost
## 1       1 27654
## 2       2 24789
## 3       3 21890
## 4       4 21633
## 5       5 15843
## 6       6 12567

Extract the assigned features (columns) to perform some analytics.

servers = mydata$servers
cost = mydata$cost

summary(mydata)
##     servers           cost      
##  Min.   : 1.00   Min.   : 4533  
##  1st Qu.: 5.75   1st Qu.: 6533  
##  Median :10.50   Median :14682  
##  Mean   :10.50   Mean   :15251  
##  3rd Qu.:15.25   3rd Qu.:22665  
##  Max.   :20.00   Max.   :28111

2B) Create a correlation table to compare the correlations between all variables. What can you tell about the correlation between the variables?

cor(mydata)
##            servers       cost
## servers 1.00000000 0.03356606
## cost    0.03356606 1.00000000

The correlation between servers and cost is positive but very weak (r ≈ 0.034). Pearson correlation only measures linear association, though, so a weak correlation does not rule out a strong non-linear relationship, as Task 3 will show.

2C) Create a plot of the dependent (y) and independent (x) variables. Note any patterns or relations between the two variables and describe the trend line.

  • The blue line here represents the linear model we created and the black dots are the data points.

Commands: p <- qplot( x = INDEPENDENT, y = DEPENDENT, data = mydata) + geom_point()

# load the plotly package (this also attaches ggplot2, as the messages below show)
library("plotly")
## Warning: package 'plotly' was built under R version 3.4.4
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p <- qplot( x = servers, y = cost, data = mydata) + geom_point()

p

Command: p + geom_smooth(method = "lm")

Add a trend line to the plot using a linear model

p2 <- p + geom_smooth(method = "lm")
p2

The linear trend line is a poor fit: most of the data points lie far from the line because the underlying relationship is not linear.
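As a preview of Task 3 (not part of the lab template), geom_smooth() also accepts a model formula, so a curved trend can be fitted directly; a sketch of a quadratic smooth:

p + geom_smooth(method = "lm", formula = y ~ poly(x, 2))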

2D) Create a linear regression model by identifying the dependent variable (y) and independent variable (x_n)

  • Commands: linear_model <- lm( DEPENDENT ~ INDEPENDENT )
linear_model <- lm(cost ~ servers)
linear_model
## 
## Call:
## lm(formula = cost ~ servers)
## 
## Coefficients:
## (Intercept)      servers  
##       14747           48

Use the regression model to create a report. Note the R-squared and adjusted R-squared values; is this a good or bad fit for your data?

  • Commands: summary( linear_model )
summary(linear_model)
## 
## Call:
## lm(formula = cost ~ servers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10646.2  -8646.2   -544.7   7066.0  12858.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  14747.2     4035.5   3.654  0.00181 **
## servers         48.0      336.9   0.142  0.88828   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8687 on 18 degrees of freedom
## Multiple R-squared:  0.001127,   Adjusted R-squared:  -0.05437 
## F-statistic: 0.0203 on 1 and 18 DF,  p-value: 0.8883

This model fits very poorly: the R-squared (0.0011) and adjusted R-squared (-0.054) are extremely low, and the servers coefficient is not statistically significant (p = 0.888).


Task 3: Regression Analysis - Non-linear Regression


3A) Create a non-linear quadratic regression model by identifying the dependent variable (y) and independent variable (x). Transform the independent variable by squaring it and adding the squared term to the model.

  • The quadratic model is: y = b0 + b1*x + b2*x^2
  • Commands: quad_model <- lm(y ~ x + x_squared)
  • Commands: To square a variable, use the exponent operator (^), e.g. x^2
# create the squared term used in the quadratic model
servers_squared = servers^2
quad_model <- lm(cost ~ servers + servers_squared)
quad_model
## 
## Call:
## lm(formula = cost ~ servers + servers_squared)
## 
## Coefficients:
##     (Intercept)          servers  servers_squared  
##         35417.8          -5589.4            268.4
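As a side note, the same quadratic model can be fitted without a helper column by wrapping the transformation in I(); a sketch that should produce identical coefficients:

quad_model_alt <- lm(cost ~ servers + I(servers^2))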

Use the quadratic model to create a report. Note the R-squared and adjusted R-squared values; is this a good or bad fit for your data?

  • Commands: summary( quad_model )
summary(quad_model)
## 
## Call:
## lm(formula = cost ~ servers + servers_squared)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2897.8 -1553.4  -513.2  1152.4  4752.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     35417.77    1742.64   20.32 2.30e-13 ***
## servers         -5589.43     382.19  -14.62 4.62e-11 ***
## servers_squared   268.45      17.68   15.19 2.55e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2342 on 17 degrees of freedom
## Multiple R-squared:  0.9314, Adjusted R-squared:  0.9233 
## F-statistic: 115.4 on 2 and 17 DF,  p-value: 1.282e-10

The R-squared and adjusted R-squared values of 0.9314 and 0.9233, respectively, are very high, meaning that this is a great fit for the data, especially when compared to the linear model.

3B) Compute the predicted values based on the quadratic model.

Commands: predicted_2 <- predict( quad_model, data = mydata )

# note: 'data' is not an argument of predict(), so it is silently ignored;
# this call returns the fitted values for the training data
predicted2 = predict(quad_model,data=mydata)
predicted2
##         1         2         3         4         5         6         7 
## 30096.790 25312.706 21065.520 17355.233 14181.844 11545.354  9445.762 
##         8         9        10        11        12        13        14 
##  7883.068  6857.273  6368.376  6416.377  7001.277  8123.076  9781.772 
##        15        16        17        18        19        20 
## 11977.367 14709.861 17979.252 21785.543 26128.731 31008.818
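Because 'data' is ignored by predict(), predicting the cost for genuinely new inputs requires the 'newdata' argument with the model's variable names; a hypothetical sketch for 25 servers:

predict(quad_model, newdata = data.frame(servers = 25, servers_squared = 25^2))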

Create a plot using the quadratic model's predicted values in red. Note the shape; looking at the plot, is this a good or bad fit for your data?

Commands: qplot( x = INDEPENDENT, y = PREDICTED, colour = "red" )

qplot( x = servers, y = predicted2, colour = "red" )


The plot shows a good fit: the predicted values follow the actual values closely.
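Beyond the visual check, a quick numeric check (not part of the template) is the correlation between the predicted and actual values; for an OLS fit this equals the square root of R-squared:

cor(predicted2, cost)   # sqrt(0.9314), roughly 0.965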

3C) Create a non-linear cubic regression model by identifying the dependent variable (y) and independent variable (x). Transform the independent variable by raising it to the second (x^2) and third (x^3) powers and adding both terms to the model.

  • The cubic model is: y = b0 + b1*x + b2*x^2 + b3*x^3
  • Commands: cubic_model <- lm(y ~ x + x_squared + x_cubed)
  • Commands: To raise a variable to a power, use (^), e.g. x^2, x^3
servers_cubed = servers^3
cubic_model <- lm(cost ~ servers + servers_squared + servers_cubed)
cubic_model
## 
## Call:
## lm(formula = cost ~ servers + servers_squared + servers_cubed)
## 
## Coefficients:
##     (Intercept)          servers  servers_squared    servers_cubed  
##       36133.696        -5954.738          310.895           -1.347

Use the cubic model to create a report. Note the R-squared and adjusted R-squared values; is this a good or bad fit for your data?

  • Commands: summary( cubic_model )
summary(cubic_model)
## 
## Call:
## lm(formula = cost ~ servers + servers_squared + servers_cubed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2871.0 -1435.1  -473.6  1271.8  4600.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     36133.696   2625.976  13.760 2.77e-10 ***
## servers         -5954.738   1056.596  -5.636 3.72e-05 ***
## servers_squared   310.895    115.431   2.693    0.016 *  
## servers_cubed      -1.347      3.619  -0.372    0.715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2404 on 16 degrees of freedom
## Multiple R-squared:  0.932,  Adjusted R-squared:  0.9193 
## F-statistic: 73.11 on 3 and 16 DF,  p-value: 1.478e-09

This is a good fit for the data based on the R-squared and adjusted R-squared values of 0.932 and 0.9193, respectively. However, the quadratic model was slightly better: its adjusted R-squared is higher, and the cubic term here is not statistically significant (p = 0.715).
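A more formal way to compare the nested quadratic and cubic models is an F-test via anova(); a sketch (its conclusion should mirror the non-significant cubic term above):

anova(quad_model, cubic_model)   # tests whether the cubic term improves the fit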

3D) Compute the predicted values based on the cubic model.

Commands: predicted3 <- predict( cubic_model, data = mydata )

# again, 'data' is ignored by predict(); these are training-set fitted values
predicted3 <- predict( cubic_model, data = mydata )

Create a plot using the cubic model's predicted values in green. Note the shape; looking at the plot, is this a good or bad fit for your data? Is this model better than the previous one?

Commands: qplot( x = INDEPENDENT, y = PREDICTED, colour = "green" )

qplot( x = servers, y = predicted3, colour = "green")

This plot is also a good fit: the predicted values track the actual values closely. The cubic predictions are nearly identical to the quadratic ones, which makes sense because the R-squared and adjusted R-squared values of the two models are nearly the same, with the quadratic slightly better overall. The difference is barely noticeable to the eye, though the cubic curve is slightly narrower.

3E) Overlay all the models on top of the data. Which model seems to fit best in your opinion? Justify your answer.

variables: LINEAR_MODEL , PREDICTED_QUADRATIC, PREDICTED_CUBIC

# Black = Actual Data
plot(servers, cost, pch = 16) 
# Blue = Linear line based on the linear regression model
abline(linear_model, col = "blue", lwd = 2) 

# Red = Quadratic model predictions found above
# points() draws on the existing axes, so the overlays stay aligned;
# a second plot() with par(new = TRUE) would rescale the y-axis and
# shift the overlaid points relative to the data
points(servers, predicted2, col = "red", pch = 16) 

# Green = Cubic model predictions found above
points(servers, predicted3, col = "green", pch = 16)

There is not much difference between the cubic and quadratic models. The linear model is clearly the worst fit, and it would be difficult to distinguish the cubic from the quadratic by eye; their adjusted R-squared values (0.9233 vs. 0.9193) are nearly identical.
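For reference, the same overlay can also be built with ggplot2 layers, which keeps all series on shared axes automatically; a sketch using the objects defined above:

ggplot(mydata, aes(x = servers, y = cost)) +
  geom_point() +                                             # black: actual data
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +  # linear fit
  geom_line(aes(y = predicted2), colour = "red") +           # quadratic predictions
  geom_line(aes(y = predicted3), colour = "green")           # cubic predictions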