About

In this lab, we will focus on linear programming and non-linear regression modeling. Linear regression modeling, as discussed in the previous lab, works with simple and multiple linear regression models. Sometimes the relationships are not best represented by linear models, and a non-linear regression modeling is required. The general concept remains the same; minimizing the error between the observed/actual values and the values predicted by the model. Linear programming on the other hands seek to find the optimal solution to a problem with multivariables and multiconstraints described by linear relationships.

In this lab, we will perform a non-linear regression modeling on the cost of servers, and setup a linear programming model to solve the marketing use case discussed in class.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

PART A: NON-LINEAR REGRESSION MODELING

First, we must read the file ‘ServersCost.csv’ into R, and extract the two columns of interest.

mydata <- read.csv("data/serverscost.csv", header=TRUE) #Read the file
head(mydata)

##   servers  cost
## 1       1 27654
## 2       2 24789
## 3       3 21890
## 4       4 21633
## 5       5 15843
## 6       6 12567

servers = mydata$servers #Extract column servers
cost = mydata$cost       #Extract column cost

We start by creating a simple linear regression model. Next, we plot the points to visually inspect the data and unravel any potential relationships.

# Create a simple linear regression model.
linear_model = lm(cost ~ servers)   

# Display the summary statistics of the linear_model.
summary(linear_model)

## 
## Call:
## lm(formula = cost ~ servers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10646.2  -8646.2   -544.7   7066.0  12858.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  14747.2     4035.5   3.654  0.00181 **
## servers         48.0      336.9   0.142  0.88828   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8687 on 18 degrees of freedom
## Multiple R-squared:  0.001127,   Adjusted R-squared:  -0.05437 
## F-statistic: 0.0203 on 1 and 18 DF,  p-value: 0.8883

# add linear line based on regression model
plot(servers,cost, pch=16) # the pch option is to accentuate the points
abline(linear_model, col="blue", lwd=2)

The blue line here represents the model based predicted data, and the black dots are the actual data points. Clearly the line model is far from being a good predictor. Next we will use a nonlinear quadratic model to see how the model can be improved.

A linear model is of the form y ~ x0 + x1 +….xn where y is the dependent variable (DV) and the xn are the independent variables (IV). For the non-linear quadratic regression model the equation takes the form y ~ x + x^2. To create a quadratic non-linear regression model follow these two steps:

# Firstly, define a new variable called servers2 which is the squared value of servers
servers2 = servers^2

# Secondly, create a quadratic non-linear regression model.
# The model formula is of the form y = x + x^2 For consistency, name your new model quad_model.
quad_model = lm(cost ~ servers + servers2)

# Display the summary statistics of the quad_model.
summary(quad_model)

## 
## Call:
## lm(formula = cost ~ servers + servers2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2897.8 -1553.4  -513.2  1152.4  4752.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35417.77    1742.64   20.32 2.30e-13 ***
## servers     -5589.43     382.19  -14.62 4.62e-11 ***
## servers2      268.45      17.68   15.19 2.55e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2342 on 17 degrees of freedom
## Multiple R-squared:  0.9314, Adjusted R-squared:  0.9233 
## F-statistic: 115.4 on 2 and 17 DF,  p-value: 1.282e-10

From the summary, we see the R-squared value has increased dramatically from the linear model, indicating a big improvement. That is not the only value we must check though. Let’s inspect visually how the model based data compare to the actual data. To visualize the model data compared to the actual data follow these four steps:

# Firstly, compute the predicted values based on the quad_model, name the variable predicted2.
predicted2 = predict(quad_model,data=mydata)

# Secondly, plot cost versus servers based on actual values.
plot(servers,cost, pch=16)

# Thirdly, use the par() function to overlay the predicted2 values based on quad_model without the labels and annotations.
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE) 

# Fourthly, plot the predicted2 values based on the quad_model and use the red color for the plotting.
plot(predicted2, col="RED", pch=16)

It is easy to observe now that the predicted values are more in line with the observed values.

A common misconception is that the higher order the non-linear model is the better it is. Lets try a cubic model and see how it performs. This model is of the form y = x + x^2 + x^3. Complete the following two tasks to create a cubic non-linear regression model.

TASK 1: Firstly, define an additional new variable called servers3 which is the cubic of servers.

servers3 = servers^3

TASK 2: Secondly, Create a cubic non-linear regression model, and display the summary statistics.

The model formula is of the form x + x^2 + x^3. For consistency, name your new model cubic_model.

cubic_model = lm(cost ~ servers + servers2 + servers3)

summary(cubic_model)

## 
## Call:
## lm(formula = cost ~ servers + servers2 + servers3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2871.0 -1435.1  -473.6  1271.8  4600.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36133.696   2625.976  13.760 2.77e-10 ***
## servers     -5954.738   1056.596  -5.636 3.72e-05 ***
## servers2      310.895    115.431   2.693    0.016 *  
## servers3       -1.347      3.619  -0.372    0.715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2404 on 16 degrees of freedom
## Multiple R-squared:  0.932,  Adjusted R-squared:  0.9193 
## F-statistic: 73.11 on 3 and 16 DF,  p-value: 1.478e-09

We can now visually inspect the goodness of the quadratic model and the cubic model. To visualize the cubic model data compared to the actual data, and compared to quadratic model, follow the following steps:

TASK 3: Firstly, compute the predicted values based on cubic_model, name the variable predicted3.

predicted3 = predict(cubic_model,data=mydata)

Secondly, plot cost versus servers based on actual data. Thirdly, overlay the predicted values based on the cubic model, and on the quadratic model. Fourthly, plot the data. (Note: write all codes for TASK 4 and TASK 5 in the same chunk.)

# Below is the code to plot cost versus servers based on actual values.
 plot(servers,cost, pch=16)

# Below is the code to overlay the predicted3 values based on the cubic_model without the labels and annotations using the par() function.
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE) 

##### TASK 4: Write the code to plot the predicted2 values based on the quadratic_model and use the RED color for the plotting.
plot(predicted2, col="RED", pch=16) 


# Below is the code to overlay the predicted2 values based on the quadratic_model without the labels and annotations using the par() function.
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE) 

##### TASK 5: Write the code to plot the predicted3 values based on the cubic_model and use the GREEN color for the plotting.
plot(predicted3, col="GREEN", pch=16)

From the graphs it should be hard to tell which of the two models, the quadratic or the cubic, is a better fit and predictor. A better way to quantify which model is best is to look at the Adjusted-R2.

TASK 6: List here the R2 and Adjusted R2 for all three models linear, quadratic, and cubic.

ANSWER:

For the linear model, Multiple R-squared is 0.001127 and Adjusted R-squared is -0.05437. For the quadratic model, Multiple R-squared is 0.9314 and Adjusted R-squared is 0.9233. For the cubic model, Multiple R-squared is 0.932 and Adjusted R-squared is 0.9193.

TASK 7: Identify which model is a better predictor and explain why.

ANSWER:

When using the linear model, it is sufficient to use the R-Square value. However, with models such as the quadratic or cubic models, you need to use the Adjusted R-Squared value instead. The quadratic model is the best predictor because it has the best Adjusted R-Squared value at 0.9233. The cubic model also has the largest R-squared value at 0.932, however the Adjusted R-squared values are the ones we use in this scenario. 0.9233 is the closest value to 1 (100%) which means it has the lowest margin for error.

PART B: LINEAR PROGRAMMING & OPTIMIZATION

For this task, we need to install an optimization package in R.

# Require will load the package only if not installed 
# Dependencies = TRUE makes sure that dependencies are install
if(!require("lpSolveAPI",quietly = TRUE))
  install.packages("lpSolveAPI",dependencies = TRUE, repos = "https://cloud.r-project.org")

Solving Marketing Model

First create the linear programming model object in R. This is the starting point. The object will eventually contain all definitions and results.

# We start with `0` constraint and `2` decision variables. The object name `lpmark` is discretionary.
lpmark <- make.lp(0, 2)

Next we need to define the type of optimization, set the objective function, and add the constraints to our model object.

# Define type of optimization as maximum and dump the screen output into `dump`
dump = lp.control(lpmark, sense="max")  
# Set the objective function
set.objfn(lpmark, c(275.691, 48.341))

# add constraints
add.constraint(lpmark, c(1, 1), "<=", 350000)

TASK 8: Add the remaining missing five constraints to the linear programming model above. Follow the guidelines as described in the lectures.

add.constraint(lpmark, c(1, 0), ">=", 15000)
add.constraint(lpmark, c(0, 1), ">=", 75000)
add.constraint(lpmark, c(2, -1), "=", 0)
add.constraint(lpmark, c(1, 0), ">=", 0)
add.constraint(lpmark, c(0, 1), ">=", 0)

Finally we can explore and solve the model using the lpSolveAPI package.

# View the problem formulation in tabular/matrix form
lpmark

## Model name: 
##                C1       C2            
## Maximize  275.691   48.341            
## R1              1        1  <=  350000
## R2              1        0  >=   15000
## R3              0        1  >=   75000
## R4              2       -1   =       0
## R5              1        0  >=       0
## R6              0        1  >=       0
## Kind          Std      Std            
## Type         Real     Real            
## Upper         Inf      Inf            
## Lower           0        0

# Solve 
solve(lpmark)

## [1] 0

# Display the objective function optimum value i.e. the optimum sales value.
get.objective(lpmark)

## [1] 43443517

# Display the decision variables optimum values i.e. the optimum values for radio and tv ads.
get.variables(lpmark)

## [1] 116666.7 233333.3

TASK 9: Write down and clearly mark the optimum values for sales, radio, and tv ads.

ANSWER:

The optimum value for radio is 116666.7. The optimum value for tv is 233333.3. The optimum value of sales is 43443517.

TASK 10: Show how the optimum decision values satisfy all constraints.

ANSWER:

Both decision variables of Xradio and Xtv sum up to $350,000 (satisfying the constraint that the sum must be less than or equal to $350,000 total). The Optimum Xradio value of $116,666.70 satisfies its constaints by being greater than both the values of $15,000 and $0. The optimum Xradio value is half of the optimum Xtv value, satisfying the constraint 2Xradio - 1Xtv = 0. The Optimum Xtv value of $233,333.30 satisfies both of its constraints by being being greater than both $75,000 and $0.

Lab 07

CME Group Foundation Business Analytics Lab

Mona Tariq

3/27/2019

About

Setup

PART A: NON-LINEAR REGRESSION MODELING

TASK 1: Firstly, define an additional new variable called servers3 which is the cubic of servers.

TASK 2: Secondly, Create a cubic non-linear regression model, and display the summary statistics.

The model formula is of the form x + x^2 + x^3. For consistency, name your new model cubic_model.

TASK 3: Firstly, compute the predicted values based on cubic_model, name the variable predicted3.

Secondly, plot cost versus servers based on actual data. Thirdly, overlay the predicted values based on the cubic model, and on the quadratic model. Fourthly, plot the data. (Note: write all codes for TASK 4 and TASK 5 in the same chunk.)

TASK 6: List here the R2 and Adjusted R2 for all three models linear, quadratic, and cubic.

ANSWER:

TASK 7: Identify which model is a better predictor and explain why.

ANSWER:

PART B: LINEAR PROGRAMMING & OPTIMIZATION

Solving Marketing Model

TASK 8: Add the remaining missing five constraints to the linear programming model above. Follow the guidelines as described in the lectures.

TASK 9: Write down and clearly mark the optimum values for sales, radio, and tv ads.

ANSWER:

TASK 10: Show how the optimum decision values satisfy all constraints.

ANSWER: