09 12 2019

Introduction

Business analytics - An overview

Business Intelligence (BI) software

Software packages covering most analytics task in a customized and modular way. Always contains data warehouse concepts as well as reporting and data analysis routines. Famous firms are IBM, Oracle, Microsoft, SAP, SAS, …

BI software market

Market volume (worldwide)

Gartner (2018)

Gartner (2018)

BI component rating

BARC (2019)

BARC (2019)

What is R?

Definition

R is an open-source, interpreted (statistical) programming language and environment available at (www.r-project.org).

Characteristics:

  • object-oriented & function-oriented \(\rightarrow\) everything in R is an object
  • modular & interactive \(\rightarrow\) extensible by packages, immediate output
  • platform idependent \(\rightarrow\) R scripts are easily portable
  • designed for statistical computing \(\rightarrow\) not fast, but easier to learn
  • R is for anybody how works with data
  • biggest enemy: Python :-)

Evolution of R packages

R working environment

Frontend(s)

R is the name of a programming language (like Java) as well as programming environment (like Eclipse). Traditionally, R is just console and an editor for writing scripts. However, most users work with Rstudio as a more powerful evironment for R (www.rstudio.com)

Case study data: Brazilian E-Commerce data

Case study: The data (I)

Order data

Ordered items

Case study: The data (II)

Product data

Geoloc data

Case study: Obtaining the data

  • Create an account for kaggle & login
  • Search for the Brazilian E-Commerce project
  • donwload zip-file including 9 csv-files \(\rightarrow\) unzip
  • import csv-files in Rstudio via “import dataset” (upper-right panel) \(\rightarrow\) “From text”

Case study: The data in R

Order data

head(orders.data, 4)
>                           order_id                      customer_id
> 1 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d
> 2 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef
> 3 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089
> 4 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82
>   order_status order_purchase_timestamp   order_approved_at
> 1    delivered      2017-10-02 10:56:33 2017-10-02 11:07:15
> 2    delivered      2018-07-24 20:41:37 2018-07-26 03:24:27
> 3    delivered      2018-08-08 08:38:49 2018-08-08 08:55:23
> 4    delivered      2017-11-18 19:28:06 2017-11-18 19:45:59
>   order_delivered_carrier_date order_delivered_customer_date
> 1          2017-10-04 19:55:00           2017-10-10 21:25:13
> 2          2018-07-26 14:31:00           2018-08-07 15:27:45
> 3          2018-08-08 13:50:00           2018-08-17 18:06:29
> 4          2017-11-22 13:39:59           2017-12-02 00:28:42
>   order_estimated_delivery_date
> 1           2017-10-18 00:00:00
> 2           2018-08-13 00:00:00
> 3           2018-09-04 00:00:00
> 4           2017-12-15 00:00:00

Packages required in the following

Installation via menu or install.packages() functions. Be aware, this may take a while. This presentation can be found here

install.packages(c("tmap","sf", "tmaptools", "brazilmaps", "png", "plotly", "geosphere", "htmlwidgets","metR",
            "tidyverse", "lubridate", "widgetframe", "ROI","ROI.plugin.glpk", "ompr", "ompr.roi", "vis.network"))
library(tmap)
library(sf)
library(tmaptools)
library(brazilmaps)
library(png)
library(plotly)
library(geosphere)
library(htmlwidgets)
library(metR)
library(tidyverse)
library(lubridate)
library(widgetframe)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
library(visNetwork)

Objects & Functions

Data objects

Objects & workspace

Everything in R is an object. Objects are always in a environment and are categorized in numerous ways, e.g. by storage mode, or class. The most important is the global environment, also called workspace. We will distinguish data objects and functions. Functions are objects which create environemnts.

# shows current environement
environment()
# show all active environments
search()
# assign an object called 'da.1'
da.1 <- 5
# define a function
f.1 <- function() print("Hello world!")  
# function printing the environment
f.2 <- function(){
  print(environment()) 
} 
# objects in the workspace
objects()
environment()
> <environment: R_GlobalEnv>

f.1()
> [1] "Hello world!"

f.2() 
> <environment: 0x00000000378a3d68>

Functions in R (I)

Definition

Functions are code-wrappers. They consist of argument, body, and environment. The environment determines where function resides (i.e., from where it can be called). The body describes what happens to the arguments.

# show arguments of a function
formals(f.1) 
> NULL
# show body of a function
body(f.1) 
> print("Hello world!")
# show environment of a function
environment(f.1) 
> <environment: R_GlobalEnv>
# function with arguments
f.3 <- function(x,y) x+y
formals(f.3) # more to be seen
> $x
> 
> 
> $y
# most built-in functions have neither 
# attribute nor body --> external code
formals(sum);body(sum);environment(sum)
> NULL
> NULL
> NULL
# Dynamic lookup:
# In case an object is missing, R looks 
# in the functions' environment before
# reporting an error
x <- 5
y <- 7
f.4 <- function() x+y 
f.4()
> [1] 12

Functions in R (II)

# Name masking:
# What happens in a function, stays there
x <- 4
y <- 5
f.5 <- function() {
   x <- 3; y <- 2;   x*y  
}
f.5()
> [1] 6
x; y
> [1] 4
> [1] 5
# unless we say otherwise
f.6 <- function(){
 x <<- 3; y <<- 2;  x*y   
}
f.6()
> [1] 6
x; y
> [1] 3
> [1] 2


# arguments have to be passed
f.3(x=4,y=7); f.3( 4, 7)
> [1] 11
> [1] 11
# be careful with assignments 
x = 9; y = 3
x; y
> [1] 9
> [1] 3
f.3( x <- 4, y <- 7)
> [1] 11
x; y
> [1] 4
> [1] 7
# Lazyness: you can define useless arguments
f.7 <- function(x) y^2
f.7("test")
> [1] 49
# you can define defaults
f.8 <- function(x=2, y=3) x^2+y^2
f.8(); f.8(3, 4)
> [1] 13
> [1] 25

Functions in R (III) - Examples

Density of normal distribution:

\[f(x,\mu,\sigma) = \frac{1}{\sqrt{2 \cdot \pi \cdot \sigma^2 }} \cdot e^{-\frac{(x-\mu)^2}{2 \cdot \sigma^2}}\]

dnorm.1 <- function(x, mu, sigma){
  tmp.1 <- 1/sqrt(2*pi*sigma^2) 
  tmp.2 <- exp(-(x-mu)^2/(2*sigma^2))
  tmp.1 * tmp.2
} 
# evaluation at 1
dnorm.1(1, mu = 3, sigma = 1)
> [1] 0.05399097
# result with built-in function:
dnorm(1, mean = 3, sd = 1)
> [1] 0.05399097



Loss function:

\[ \scriptsize \begin{align*} V_n(x, \mu, \sigma) = & \int_{x}^{\infty} (t-x) \cdot f(t,\mu, \sigma)\; \text{d} t\\ = & \sigma \cdot \left(\varphi\left(\frac{x-\mu}{\sigma}\right) - \frac{x-\mu}{\sigma} \cdot \left(1-\Phi\left(\frac{x-\mu}{\sigma} \right)\right)\right) \end{align*} \]

# or more complex (loss function):
v.fun <- function(x, mu = 0, sigma = 1) {
  tmp.1 <- dnorm((x-mu)/sigma)
  tmp.2 <- (x-mu) * (1-pnorm((x-mu)/sigma))
  return(sigma * tmp.1 - tmp.2)
}
# value for st. loss function at 0.5
v.fun(.5)
> [1] 0.1977966
# the same value obtained directly by integration
integrate(lower=0.5, upper = 10, 
          f= function(x) (x-.5)*dnorm(x))
> 0.1977966 with absolute error < 1e-04

Some useful functions

names description
Functions for functions
help(func) show help page of func
return(x) returns x and exits function
stop(‘mess’) stops execution of function and shows warning ‘mess’
warning(‘mess’) shows warning ‘mess’
Plotting
plot(x,y)/ggplot() open graphic device and creates scatter plotat x and y coordinates
lines(x,y)/geom_lines() adds lines to an existing plot
points(x,y)/geom_points() adds points toan existing plot
curve(expr)/stat_function() plos function expr
hist(x)/geom_histogram() plots histogram of x
barplot(x)/geom_bar() creates a barplot
boxplot(x)/geom_boxplot() creates a boxplot
Basic math
prod(x)/sum(x) product/sum of x
log(x) logarithm of x
exp(x) e to the power of x
sqrt(x) square root of x
names description
Object attributes
length(x) length of x (1-dim.)
dim(x) dimensions of x (at 2-dim.)
nrow(x) number of rows
ncol(x) number of columns
Programming
apply(x, m, expr) applies expr to all columns (m=2) or rows (m=1) of matrix x
lappy(x,expr) applies xpr to all elements of list x
next stops execution of an iteration and increases increment
break stops execution of a loop or fucntion
repeat{code} executes code until a stop or break occors
for(i in x){code} executes code with for each value of x (names as i)
while(cond.){code} executes code as long as cond. is true

Plotting functions

# plot standard normal distribution
curve(dnorm, -3, 3, 
      ylab="density")  


# plot standard loss function with ggplot2
library(ggplot2)
ggplot(data=data.frame(x=-3:3), aes(x=x)) +
  stat_function(fun = v.fun )  +
  ylab("expected loss")

Case study: Basic plots - Products per category

ggplotly(ggplot(products.data, aes(x=product_category_name)) + geom_bar() +  
  xlab("product category") + ylab("number of products") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)))

Basic data objects

Definition

Basic data objects (vectors,matrices and arrays) consist of scalars of the same storage type only (i.e. only numbers, characters, logicals) while advanced data objects (data frames and lists) can contain different types.


Name Dim. Built function
vector 1 c(),numeric()
matrix 2 matrix()
array n array()
data frame 2 data.frame()
list 1 list()
devopedia.org (2019)

devopedia.org (2019)

Data objects in R - Basic elements

# Vectors
## an empty numeric vector
v.1 <- numeric(5) 
## connecting scalars
v.2 <- c(5, 6, 7)   
## shortcut for integers
v.3 <- 10:18        

# Matrices
## matrix with 2 columns
m.1 <- matrix(1:6, ncol=2)  

## constructing matrices from vectors 
### bind column-wise
m.2 <- cbind(1:3, 2:4, 3:5) 
### row-wise binding
m.3 <- rbind(1:3, 2:4, 3:5) 

# arrays
## basic elements can contain only one data type 
a.1 <- array(c(LETTERS[1:6],7:12), dim=c(2,3,2))



v.3
> [1] 10 11 12 13 14 15 16 17 18

m.1 
>      [,1] [,2]
> [1,]    1    4
> [2,]    2    5
> [3,]    3    6

a.1 
> , , 1
> 
>      [,1] [,2] [,3]
> [1,] "A"  "C"  "E" 
> [2,] "B"  "D"  "F" 
> 
> , , 2
> 
>      [,1] [,2] [,3]
> [1,] "7"  "9"  "11"
> [2,] "8"  "10" "12"

Data objects in R - Advanced elements

# Data frames
## technically: 
##    a list with vectors of same length
## structurally: 
##    a matrix with different data types 
d.1 <- data.frame(a = 1:2, 
                  b = c("rest", "test"), 
                  c = c(T,F))
d.2 <- data.frame(a = 1:3, 
                  b = seq(10, 20, length.out = 3), 
                  c = rnorm(3))

# Lists
## a quite simple list
l.1 <- list(a = 1:3, b = "nest", c = TRUE)  
## a more complicated list
l.2 <- list(a = 1:3, b = "nest", 
            c = list(
              d = "test", 
              e = rep(x = c(TRUE,FALSE), each = 2)
              )
            )  


d.1
>   a    b     c
> 1 1 rest  TRUE
> 2 2 test FALSE

head(d.2, 1) # first element 
>   a  b         c
> 1 1 10 0.8478022

l.1
> $a
> [1] 1 2 3
> 
> $b
> [1] "nest"
> 
> $c
> [1] TRUE

tail(l.2, 1) # last element
> $c
> $c$d
> [1] "test"
> 
> $c$e
> [1]  TRUE  TRUE FALSE FALSE

Indexing of vectors and matrices

Data objects are indexed by position, name, or logical identifier. Vectors are indexed by <vector>[<index>]. You can also negate the selection, i.e. extract all entries except the indexed entries by <vector>[-<index>]. Matrices are 2-dimensional and, thus, require 2 indices, i.e. <matrix>[<index row>, <index col.>]

x <- c(5, 6, 7)
# extract 1st entry 
x[1]              
> [1] 5
# extract 1st & 3rd entry
x[c(1,3)]         
> [1] 5 7
# logical index have same length as indexed vector 
x[c(T,F,T)]       
> [1] 5 7
# generate an indexing vector by logic expr.
y <- x <= 6       
# retain inices of those entries 
z <- which(y)     
# subvector of all entries of x which are > 6
x[!y]              
> [1] 7
x <- cbind(1:6, 2:7, 3:8) # column-wise binding
# first row, first column (just one element)
x[1,1]
> [1] 1
# first row (a vector)
x[1,]
> [1] 1 2 3
# logical index vector
y <- rep( x = c(T,F,T), times = 2) 
# mix row and column selections
x[y,c(1,3)]               
>      [,1] [,2]
> [1,]    1    3
> [2,]    3    5
> [3,]    4    6
> [4,]    6    8

Indexing of lists, data frames and tibbles

All advanced data objects can have named elements. These namescan be used to extract the corresponding element by <object>$<element name>. Alternatively, elements can be accessed by <object>[[<element index>]] (single element) or <object>[<element index>] (multiple elements). Additionally, tibbles and data fames can be indexed like matrices

df1 <- data.frame(one = 1:2, two = c("nest","test") )
# access column one
df1$one           
> [1] 1 2
# a tibble (basically the same as data frame)
tb1 <- tibble(one = 1:2, two = c("nest","test")) 
tb1$two           # works too
> [1] "nest" "test"
# a list
l1 <- list(a = 1:3, b = "nest", c = TRUE)
#  for multiple components
l1[c(1,3)]
> $a
> [1] 1 2 3
> 
> $c
> [1] TRUE
# for multiple components by name
l1[c("a","c")]
> $a
> [1] 1 2 3
> 
> $c
> [1] TRUE
# names can be retrieved 
names(l1)
> [1] "a" "b" "c"
# ... and changed
names(l1) <- c("x","y","z")
l1$z
> [1] TRUE

Some more useful functions

Name Description
which(x) returns indices of TRUE entries of ´x´
sort(x) sorts x ascendingly (unless decreasing=TRUE)
order(x) returns indices to order x ascendingly (unless decreasing=TRUE)
table(x) returns frequencies of values of x
aggregate(y~x, FUN) splits data into subsets and applies ‘FUN’ to subsets
lm(y~x) estimates a linear regression model
merge(x,y,by) merge data frames x and y by variable by


# random vector of length 4
x <- rnorm(4); x
> [1]  0.6549547 -1.0846396 -0.4841962  0.7884190
# sort vector
sort(x)
> [1] -1.0846396 -0.4841962  0.6549547  0.7884190
# ordering vector
order(x)
> [1] 2 3 1 4
# ... which also sorts x
x[order(x)]
> [1] -1.0846396 -0.4841962  0.6549547  0.7884190
# create data frame
df <- data.frame(x = x, y = 1-10*x)
# reorder data frame
df[order(df$y),]
>            x         y
> 4  0.7884190 -6.884190
> 1  0.6549547 -5.549547
> 3 -0.4841962  5.841962
> 2 -1.0846396 11.846396

Case study: Frequncy of products in orders

Pareto chart or ABC analysis on frequency per product orders

# determine frequency of ordered products
x <- table(orders.items$product_id)
# sort products products descendingly
x <- sort(x, decreasing = T)
# compose a data frame with ...
df.freq.prod <- data.frame(
  names = names(x),         # names of products
  names_int = 1:length(x),  # product ID
  freq = as.numeric(x/sum(x)),          # rel. frequency
  cfreq = cumsum(x/sum(x))) # cumulated rel. freq.
# create plot
ggplot(df.freq.prod, aes(x=names_int, y= cfreq)) + 
        geom_line() + ylab("cum. frequency") + 
        xlab("Product ID") +  theme(axis.text = 
          element_text(size = 14), 
          axis.title=element_text(size=16))

Case study: Revenue of products

Pareto chart on total revenue per product

# sum revenues (prices) per product over all orders 
# (result is a data frame)
df.rev.prod <- aggregate(price ~ product_id, 
                FUN = sum, data = orders.items)
# sort product revenues descendingly
df.rev.prod <- df.rev.prod[order(df.rev.prod$price, decreasing = T),]
# enlarge data frame with ...
df.rev.prod <- data.frame(df.rev.prod, 
          names_int = 1:nrow(df.rev.prod),  
          share=df.rev.prod$price/sum(df.rev.prod$price), 
          cshare = cumsum(df.rev.prod$price/sum(df.rev.prod$price)))
# create plot
ggplot(df.rev.prod, aes(x=names_int, y=cshare)) + 
  geom_line() + ylab("cum. relative revenue") + 
  xlab("Product ID") +  theme(axis.text=
    element_text(size=14), 
    axis.title=element_text(size=16))


Case study: Revenues per product categories

# sum revenues (prices) per product over all orders 
df.prod.cat <- aggregate(price ~ product_id, FUN = sum, data = orders.items)
# add corresponding product categories
df.prod.cat <- merge(df.prod.cat, products.data, by = "product_id")
# create plot
ggplot(df.prod.cat, aes(x=product_category_name, y=log(price)) ) + geom_boxplot(outlier.size = 0.01) +   ylab("log. prices") +
  xlab("Product category") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))

Stochastic Simulation

Computer simulation

Definition

Computer simulation means to model a real-world system by a computer model. Thereby, the modelled entities and their relations among each other as well as external factors are usually described by mathematical expressions.

Some useful functions for simulation

Naturally, R is often useful for stochastic simulation as it allows one to easily model stochastic processes. One core element are distribution functions. For a given distribution R offers usually the density (d), probability (p), quantile (q) and random number generator (r). The syntax for the normal distribution is e.g. d/p/q/rnorm(x). See ?Distributions

Name Description
sample(x, size) sample size elements from x
d/p/q/rnorm(x) normal distribution
d/p/q/rbinom(x) binomial distribution
d/p/q/runif(x) uniform distribution
d/p/q/rpois(x) Poisson distribution
replicate(x, FUN) FUN is repeated x times


# sample from a vector
sample(1:10, 5)
> [1]  6 10  2  7  5
# .. also with relacements
sample(1:5, 5, replace = TRUE)
> [1] 4 1 2 5 2
# 95%-qunatile of the st. normal dist.
qnorm(0.95)
> [1] 1.644854
# 5 Bernoulli trial with prob. 0.5
rbinom(n=5, size=1 , prob=0.5)
> [1] 0 0 0 1 0
# ... same with replicate
replicate(5, rbinom(n=1, size=1 , prob=0.5))
> [1] 1 0 0 1 0

Case study: Inventory simulation

(\(s\),\(q\)) inventory policy

Case study: Demand of a product (I)

# most often ordered product
x.max <- rownames(df.freq.prod)[1]
# identify orders with x.max
orders.items.x.max <- orders.items[which(orders.items$product_id == x.max), ]
# add order information
orders.x.max <- merge(orders.items.x.max, orders.data, by="order_id")
# reorder according to date
orders.x.max <- orders.x.max[order(orders.x.max$order_purchase_timestamp), ]
# convert dates 
orders.x.max$order_purchase_timestamp <- as_datetime(orders.x.max$order_purchase_timestamp)
# add date components and order quantity column
orders.x.max <- data.frame(orders.x.max, 
          year.order = year(orders.x.max$order_purchase_timestamp),
          day.order = yday(orders.x.max$order_purchase_timestamp),
          ord.quant = rep(1,nrow(orders.x.max)))
# convert days in factor 
orders.x.max$day.order <- as.factor(orders.x.max$day.order)
# add missing factor levels
levels(orders.x.max$day.order) <- c(
  levels(orders.x.max$day.order),
  as.character(1:365)[!as.character(1:365) %in% levels(orders.x.max$day.order)])
# sort factor levels
orders.x.max$day.order <- factor(orders.x.max$day.order,levels(orders.x.max$day.order)[order(as.numeric(levels(orders.x.max$day.order)))])

Case study: Demand of a product (II)

# calculate daily demand
day.dem.x.max <- aggregate(ord.quant ~ day.order + year.order, orders.x.max, FUN=sum, drop = F)
# cut off days before product entered portfolio
id.nan <- which( !is.na(day.dem.x.max$ord.quant) )
day.dem.x.max <- day.dem.x.max[min(id.nan):max(id.nan), ]
# each day without entry has  demand of 0
day.dem.x.max$ord.quant[is.na(day.dem.x.max$ord.quant)] <- 0
# recompose date
tmp <- make_date(year = day.dem.x.max$year.order)
yday(tmp) <- as.numeric(day.dem.x.max$day.order)

# add to data frame
day.dem.x.max <- data.frame(day.dem.x.max,
                            date = tmp)

# make time series plot
ggplot(day.dem.x.max, aes(x=date, y=ord.quant)) + 
  geom_point() + geom_bar(stat='identity') +
  ylab("order quantity")


Case study: Demand of a product (III)

library(fitdistrplus)
# find best fitting distribution
descdist(day.dem.x.max$ord.quant, discrete = TRUE)

fit.nbino <- fitdist(day.dem.x.max$ord.quant,
                     "nbinom")
denscomp(fit.nbino)

Inventory simulation - (\(s\),\(q\))-policy (I)

sq.inv <- function(d, s, q, l.ini, L = 1  ){
        # number of periods
        n <- length(d)
       # initialize result matrix
        tmp.mat <- matrix(0, nrow = n + 1 + L, ncol = 5 )
        colnames(tmp.mat) <- c("demand","inventory","disp.inv", "order", "delivery")
        # initialize matrix
        tmp.mat[,"demand"] <- c(0, d, rep(0,L))
        tmp.mat[1,"inventory"] <- l.ini
        tmp.mat[1,"disp.inv"] <- l.ini
        # inventory calculation
        for(i in 2:(n+1)){
            # check whether reorder level is reached
            if(tmp.mat[i-1, "disp.inv"] <= s){
                tmp.mat[i, "order"] <- q
                tmp.mat[i+L, "delivery"] <- q
            }
        # inventory balances  
        tmp.mat[i, "disp.inv"] <- tmp.mat[i-1, "disp.inv"] - tmp.mat[i, "demand"] + tmp.mat[i, "order"]
        tmp.mat[i, "inventory"] <- tmp.mat[i-1, "inventory"] - tmp.mat[i, "demand"] + tmp.mat[i, "delivery"]
        }
        # return results
        return(as.data.frame(tmp.mat[2:(n+1),]))
    }

Inventory simulation - (\(s\),\(q\))-policy (II)

# create Poisson distributed demand with 
# mean = 10 for 100 periods 
d.vec <- rpois(100, 10)

# set parameters and 
# simulate stock levels
res.mat <- sq.inv(d = d.vec, # demand
                  s = 12,    # reorder point
                  q = 70,    # order quantity 
                  l.ini = 70, # initial stock
                  L = 1)     # lead time

# plot inventory levels
ggplot(res.mat, 
       aes(x=1:nrow(res.mat), y=inventory)) +
  geom_step() + 
  geom_hline(yintercept = 0) + 
  xlab("periods")


Case study: Simulate inventory

# create Poisson distributed demand with 
# mean = 10 for 100 periods 
d.vec <- day.dem.x.max[,"ord.quant"]

# set parameters and 
# simulate stock levels
inv.x.max <- sq.inv(d = d.vec, # demand
                  s = 10,    # reorder point
                  q = 20,    # order quantity 
                  l.ini = 20, # initial stock
                  L = 3)     # lead time

# plot inventory levels
ggplot(inv.x.max, 
       aes(x=1:nrow(inv.x.max), y=inventory)) +
  geom_step() + 
  geom_hline(yintercept = 0) + 
  xlab("periods")


Inventory simulation - Evaluate solutions

eval.inv <- function(result, c.order, c.inv, c.stockout = 0){
  # determine periods with deliveries 
  tmp.id <- result[,"delivery"] > 0
  #  number of orders
  nb.orders <- sum(tmp.id)
  # total stock
  tot.stock <- result %>% filter(inventory > 0) %>% pull(inventory) %>% sum
  # total backlog
  tot.backlog <- abs(result %>% filter(inventory < 0) %>% pull(inventory) %>% sum)
  # total cost per period
  total.cost <- (nb.orders * c.order +  tot.stock * c.inv + tot.backlog * c.stockout)/nrow(result)
  # retain stock levels befor delivery arrives
  tmp.min.stock <- result[which(tmp.id) - 1,"inventory"]
  # alpha-service level (per period, not order cycle)
  alpha.sl <- 100 * ( 1 - sum( tmp.min.stock < 0 ) / nrow(result) )
  # Beta-service level
  beta.sl <- (100 * (1 - sum( -tmp.min.stock[ tmp.min.stock < 0]) / sum(result[,"demand"])))
  # return results
  return(list(  total.per.cost = total.cost, alpha.sl = alpha.sl, beta.sl = beta.sl,
    nb.orders = nb.orders, total.stock = tot.stock, total.backlog = tot.backlog
  ))}

Case study: Parameters \(s\) and \(q\) (I)

For the order quantity, Harris-Andlers formula can be used: \[ q = \sqrt{\frac{2 \cdot \mu_d \cdot c_o}{c_i}}\] For \(s\) holds that it controlls the stock-out risk. To assess the risk, the demand in the lead time is the sum of \(L\) negative binomially distributed variables \(d_t \sim NB(r,p)\), for which holds \[D^L = \sum_{t=1}^L d_t \sim NB(L \cdot r, p)\] For controlling the \(\alpha\) service level, the \(\alpha\) quantile of \(D^L\) can be used. For the \(\beta\) service level, the corresponding loss function \(V_{NB}(x, r, p)\) can be used as follows

\[ s = V^{-1}_{NB}\left((1-\beta)\cdot \mu_{d} \cdot L ,r,p\right)\]

Case study: Parameters \(s\) and \(q\) (II)

# formulate loss function of neg. binomial dist.
loss.nb <- function(x, r, p){
  sum( ((x:1000)-x) * 
         dnbinom((x:1000), size=r, prob=p ))
} 

# formulate inverse loss function
inv.loss.nb <- function (yval, ...){
  round(uniroot(f = function (x)  
    (yval - loss.nb(round(x), ...)) , 
                interval = c(0, 1000))$root)
  }
# uniroot looks for the roots of a cont. function
# thus, a bit crude to round the argument

# fit best parameters of negative binomial
fit.nbino <- fitdist(day.dem.x.max[,"ord.quant"], 
                     "nbinom")

# fitdist returns size (a.k.a. r) and mean
fit.r <- fit.nbino$estimate["size"]

# to apply convolution, p is calculated
fit.p <- fit.nbino$estimate["size"]/
  (fit.nbino$estimate["size"] + fit.nbino$estimate["mu"])


With \(L=3\) days, \(r=\) 0.41, \(p=\) 0.23 as well as \(\mu_d=\) 1.35, \(s\) depending on \(\alpha\) and \(\beta\) service level can be determined as

Case study: Empirical evaluation for \(\beta=95\%\)

# set parameters and  simulate stock levels
inv.b95 <- sq.inv(d = day.dem.x.max[,"ord.quant"], 
                  s = 17,    # reorder point
                  q = 20,    # order quantity 
                  l.ini = 20, # initial stock
                  L = 3)     # lead time

head(eval.inv(inv.b95, 
              c.inv=0.01, 
              c.order=100, 
              c.stockout = 0.15),5)
> $total.per.cost
> [1] 7.146564
> 
> $alpha.sl
> [1] 98.71795
> 
> $beta.sl
> [1] 95.44592
> 
> $nb.orders
> [1] 27
> 
> $total.stock
> [1] 8056

Continuous optimization

Overview routines

Most multi-purpose optimization routines in R are dedicated to continuous optimization as this is most often encoutered in statistics. Additionally, also constrained optimization problems appear which are notoriously difficult to solve in their most general form. A comprehensive overview can be found here.

type of objective func. variables constraints package function
general 1, cont. none built-in optimize()
general \(n\), cont. boxed built-in optim()
general \(n\), cont. boxed optimx optimx()
general \(n\), cont. linear built-in constrOptim()
linear \(n\), cont. linear lpSolve lp()
quad. \(n\), cont. linear quadprog solve.QP()
general \(n\), cont. general nloptr nloptr()
general \(n\), cont. general Rsolnp solnp()

Continuous optimization with optim

For unconstraint (or at most box-constraint) general-purpose optimization, R offers the function optim(). The syntax of both functions is identical: optim(par = <initial parameter>, fn = <obj. function>, method = <opt. routine>). Additionally, upper and lower values for the parameters can be set by option lower = <vector of lower bounds> and upper = <vector of upper bounds>. In this case, method = "L-BFGS-B" must be selected.

Example: Continuous location planning

\[\min\limits_{(u,v) \in \mathbb R^2} \rightarrow \sum_{i \in \mathcal{C} } a_i \cdot (|x_i - u| + |y_i - v|)\]

loc.mh <- function(loc, a, x, y) sum(a * (abs(x - loc[1]) + abs(y - loc[2]) ) ) 
n <- 100
df.mh.loc <- data.frame(
  weight = sample(1:100, size = n),    # sample weights for each point/customer
  x = rnorm(n),                   # sample x coordinates
  y = rnorm(n))                   # sample y coordinates
res <- optim(par = c(0,0), fn = loc.mh, method = "BFGS", a = df.mh.loc$weight, x = df.mh.loc$x , y= df.mh.loc$y) 
res$par   
> [1] -0.1303003 -0.1399569

Example: Optimal solution

# generate grid of potential locations
emat <- expand.grid(x=seq(-3,3, length.out = 100), y=seq(-3,3, length.out = 100))
emat <- emat %>% add_column(objective = apply(emat, 1, function(x) 
  loc.mh(x, a = df.mh.loc$weight, x = df.mh.loc$x, y = df.mh.loc$y)))
# generate plot
ggplot(NULL, aes(x, y))  + geom_point(data = df.mh.loc , aes(size=weight)) + 
  geom_point(aes(x=res$par[1], y=res$par[2]), colour="blue", shape = 17, size= 3) + 
  geom_contour(data = emat, aes(z = objective))  +  geom_text_contour(data = emat, aes(z = objective), stroke=.15)

Case study: Inventory optimization

# wrapper function for simulating & evaluating (s,q) setting
sim.inv <- function(x, d.vec = day.dem.x.max[,"ord.quant"]){
  tmp <- sq.inv(d = d.vec, s = x[1],  q = x[2], l.ini = 10, L = 3)
eval.inv(tmp, c.inv=0.01, c.order=100, c.stockout = 0.5)$total.per.cost
} 

For finding the cost-optimal setting of \(s\) and \(q\), the following constraints are modelled: \[ \begin{align*} -q + s & \geq 0\\ q & > 0\\ s & \geq 0 \end{align*} \]

c.mat <- cbind( c(-1,1,0), c(1,0,1))
c.vec <- c(0,-100,0)
res <- constrOptim(theta = c(12,151), f = sim.inv, ui = c.mat, ci=c.vec, grad=NULL) 
head(res,2)
> $par
> [1]   8.000324 131.000082
> 
> $value
> [1] 1.727231

Case study: Optimal solution

# generate grid of parameter constellations
emat <- expand.grid(s=seq(6,10, length.out = 50), q=seq(110,150, length.out = 50))
emat <- emat %>% add_column(objective = apply(emat, 1, function(x) sim.inv(x)))
# generate plot
ggplot(emat, aes(x=s, y=q))  + geom_contour(aes(z=objective))  +  geom_text_contour(aes(z=objective), stroke=.15) +
geom_point(aes(x=res$par[1], y=res$par[2]), colour="red", shape = 17, size= 3)

Mixed-integer optimization

Generic formulation of MILP models

Mixed-integer linear optimization problems (MILP) are characterized by linear objective functions and constraints w.r.t. the decision variables. However, some or all decision variables are integer and/or binary variables. In general, the canonical form of a MILP can be written as

\[ \min\limits_{ \mathbf{x}, \mathbf{y}, \mathbf{z}} \rightarrow \mathbf{c}_x^T \cdot \mathbf{x} + \mathbf{c}_y^T \cdot \mathbf{y} + \mathbf{c}_z^T \cdot \mathbf{z}\] \[ \begin{align} \mathbf{A}_x \cdot \mathbf{x} + \mathbf{A}_y \cdot \mathbf{y} + \mathbf{A}_z \cdot \mathbf{z} & \leq \mathbf{b} \\ \mathbf{x} & \in \mathbb{R} \\ \mathbf{y} & \in \mathbb{Z} \\ \mathbf{z} & \in \{0,1\} \\ \end{align} \]

Hence, a MILP basically consists of five parts:

  1. coefficient vector \(c = (c_x, c_y, c_z)\)
  2. constraint matrix \(A = (A_x,A_y,A_z)\)
  3. right-hand side vector \(\mathbf{b}\)
  4. direction of the constraints
  5. the domain declarations of the decision variables.

The ROI package: R Optimization Infrastructure

Although any optimization problem consists of objective function, variables, and constraints, there are numerous ways to formalize these components. The package ROI provides a unified framework for seting up and solving optimization problems. An extensive description of the package can be found here.

library(ROI)            # load package
ROI_available_solvers() # check available solvers
ROI_installed_solvers() # check installed solvers
Theussl et al. (2017)

Theussl et al. (2017)

The ROI package: Structure

To formulate an optimization problem, the function OP(objective, constraints, types, bounds). All components are generated by creator functions. The types of variables are defined by a character vector of length \(n\) consisting of the characters ‘C’, ‘I’ or ‘B’ indicating continuous, integer or binary variables, respectively. Corresponsindgly, upper and lower bounds of the variables can be given as a named list consisting of two vectors of length \(n\) and named “lower” and “upper”.

For the objective function these creator functions are defined as follows:

type creator description
function F_objective(F,n,G,H,names) \(F(\mathbf x)\)…function to optimize, \(\mathbf x\)…vector of decision variables, \(G(\mathbf x)\)…gradient function (\(F'(\mathbf x)\), optional), \(H(\mathbf x)\)…Hessian function (\(F''(\mathbf x)\), optional), names…naming of variables
quadratic Q_objective(Q,L,names) \(F(\mathbf x)= \frac{1}{2}\cdot \mathbf x^T \cdot \mathbf Q\cdot \mathbf x+ \mathbf x^T\cdot \mathbf l\), \(\mathbf Q\)…matrix (\(n\times n\)) with coefficients of quadratic terms, \(\mathbf l\)… vector of coefficients for linear terms, names…naming of variables
linear L_objective(L,names) \(F(\mathbf x)= \mathbf x^T\cdot \mathbf l\), \(\mathbf l\)… vector of coefficients for linear terms, names…naming of variables

The ROI package: Constraints

The second component of optimization problems are the constraints. In general, the constraints consist of the left-hand side which depends on the decision variables \(x\), the right-hand side vector (\(\mathbf{b}\), rhs), and the direction vector (dir), e.g. \(lhs(\mathbf{x})\leq \mathbf{b}\). The left-hand side is created in a similar way as the objective function using the following creators

type creator description
function F_constraint(F,dir,rhs,names) \(F\)…list of functions with \(F=\left\{f_i(\mathbf x) \right\}\) such that \(f_i(\mathbf x) \leq b_i\)
quadratic Q_constraint(Q,L,dir,rhs,names) left-hand side: \(\frac{1}{2}\cdot \mathbf x^T \cdot \mathbf Q\cdot \mathbf x+ \mathbf x^T\cdot \mathbf L\) with \(\mathbf Q\)…list of \(n\times n\)-matrices with coefficients of quadratic terms, \(\mathbf L\)…matrix of coefficients for linear terms
linear L_constraint(L,dir,rhs,names) left-hand side: \(\mathbf x^T\cdot \mathbf L\), \(\mathbf L\)…matrix of linear coefficients

Example: Linear assignmnet problem (LAP)

\[min \rightarrow \sum_{i,j =1}^n x_{ij} \cdot c_{ij}\] \[\begin{align}\small \sum_{i=1}^n x_{ij} &= 1 & \forall j = 1,...,n\\ \sum_{j=1}^n x_{ij} &= 1 & \forall i = 1,...,n\\ x_{ij} & \in \left\{0,1\right\} \end{align}\]

For a linear assignment problem with 3 rooms follows \(\small \mathbf{x} = (x_{11},x_{12},...,x_{1n},x_{21},...,x_{2n},...,x_{nn})\) and \[ \small \mathbf A = \left[ \begin{array}{*{10}{r}} 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ \end{array} \right] \]

Example: LAP in R

# create problem dimension
n <- 10
# create cost vector
c.vec <- rpois(n^2, 100)    # sample costs from Poisson distribution with mean 100
# create empty constraint matrix
L.mat <- matrix(0, ncol = n^2, nrow = 2*n)
# first n rows
L.mat[1:n,] <- do.call(cbind, lapply(1:n, function(x) diag(n)))
# last n rows
L.mat[(n+1):(2*n),] <- t(sapply(1:n, function(x) c(rep(0,(x-1)*n), rep(1,n),rep(0,(n-x)*n) )))

# create optimization problem
copt <- OP(
  objective = L_objective(L = c.vec),
  constraints = L_constraint(L = L.mat, dir = rep("==", 2*n), rhs = rep(1, 2*n)),
  types = rep("B", n^2)
)

# solve the problem
copt_sol <- ROI_solve(copt)        

Example: Solution LAP

# display solution in matrix form
sol.mat <- matrix(copt_sol$solution, ncol = n)
# nodes
nodes <- data.frame(id = 1:20,  label = c(paste("room", 1:10), paste("element", 1:10)),
  group = rep(c("rooms", "elements"), each=10),  title = c(paste0("<p><b>room ", 1:10,"</b><br></p>"),
            paste0("<p><b>element ", 1:10,"</b><br></p>")),  color = rep(c("darkred","darkblue"), each=10))
# edges
edges <- data.frame(from = 1:10, to = 10+apply(sol.mat, 1, function(x) which(x>0)), label = matrix(c.vec, ncol = n, byrow=T)[which(t(sol.mat)>0)] )
# visNetwork(nodes, edges, height = "200px", width = "99.999%") %>%
#  visLayout(hierarchical = TRUE) 

Algebraic model formulation with ompr

To formulate large (mixed-)integer optimization problems. algebraic modelling languages are usually used. Commercial solvers often provide a tailored algebraic programming language like ILOG OPL. The package ompr provides a similar access for modelling MILPs in a algebraic way. The package uses the ROI framework for accessing MILP solvers.

Models in ompr are set up by adding pats of the model via piping with the following functions:

function description
MIPModel() model creator (without arguments)
set_objective(expr, quantifier) create objective function
add_variable(name, quantifier) add variable group
add_constraint(expr, quantifier) add constraint group
sum_expr(expr, quantifier) create sums

Note that all expressions must be vectorized, i.e., should be able to process vectors and return results as a vector.

Example: The Travelling Salesman Problem (TSP).

Given a set of nodes \(\mathcal{N}\) with \(n = |\mathcal{N}|\) and pairwise distances \(d_{ij}\) between any pair of nodes (\(i,j \in \mathcal{N}\)), the TSP is to find the shortest tour which visits all nodes and returns to the origin. To avoid subtours the Miller-Tucker-Zemlin formulation is used:

\[\min \rightarrow \sum_{i=1}^n \sum_{j=1}^n d_{ij} \cdot x_{ij}\]

\[ \begin{align} \sum_{i \in \mathcal{N}: i\neq j} x_{ij} & = 1 & \forall j \in \mathcal{N} \\ \sum_{j \in \mathcal{N}: j\neq i} x_{ij} & = 1 & \forall i \in \mathcal{N} \\ u_j & \geq u_i + 1 - n \cdot (1-x_{ij}) & \forall i,j \in \mathcal{N}, i \neq j, i,j>1 \\ u_1 & = 1 & \\ u_i & \in \left[2,n\right] & \forall i \in \mathcal{N}, i>1\\ x_{ii} & = 0 & \forall i \in \mathcal{N} \\ x_{ij} & \in \left\{0,1\right\} & \forall i,j \in \mathcal{N} \end{align} \]

Example TSP: The model in ompr

# set number of nodes  & sample node coordinates
n <- 10; cities <- data.frame(id = 1:n, x = runif(n), y = runif(n))
# calculate distances
distance <- as.matrix(dist(cities[,c("x","y")]))       
# create vectorized function for accessing distance values
dist_fun <- function(i, j)  vapply(seq_along(i), function(k) distance[i[k], j[k]], numeric(1L))
# remove variables with same name as your decision var.
rm(x); rm(u)
# create model
model <- MIPModel() %>%
  # create  variables x and u
  add_variable(x[i, j], i = 1:n, j = 1:n, type = "integer", lb = 0, ub = 1) %>%
  add_variable(u[i], i = 1:n, lb = 1, ub = n) %>% 
  # minimize travel distance
  set_objective(sum_expr(dist_fun(i, j) * x[i, j], i = 1:n, j = 1:n), "min") %>%
  # leave each node
  add_constraint(sum_expr(x[i, j], j = 1:n) == 1, i = 1:n) %>%
  # visit each node
  add_constraint(sum_expr(x[i, j], i = 1:n) == 1, j = 1:n) %>%
  # ensure no subtours (arc constraints)
  add_constraint(u[1] == 1) %>% 
  add_constraint(u[j] >= u[i] + 1 - n * (1 - x[i, j]), i = 1:n, j = 2:n) %>%
  # exclude self-cycles
  set_bounds(x[i, i], ub = 0, i = 1:n)

Example TSP: Results

# solve model with GLPK
result <- solve_model(model, 
        with_ROI(solver = "glpk", verbose = TRUE))

# obtain solution
solution <- get_solution(result, x[i, j]) %>% 
  filter(value > 0)

# use variable u to reorder the data frame of nodes
ggplot(cities[
  c(order(get_solution(result, u[i])$value),1),], 
  aes(x, y)) +  geom_point() + geom_path() + 
  geom_text(aes(label=id),hjust=0, vjust=0) +  
  ggtitle(paste0("Total cost: ",
        round(objective_value(result), 2)))

Case study: A warehouse location problem

For delivering the customers in Brazil, 5 distribution centers (DC) shall be installed such that the total weighted distance between the customers and nearest DC is minimized. The customers’ weights are the number of orders in the dat set. The optimization problem can be set up as follows \[ \min \rightarrow \sum_{i \in \mathcal C} w_i \cdot \sum_{j \in \mathcal D} x_{ij} \cdot d_{ij}\] \[ \begin{align*} \sum_{j \in \mathcal D} x_{ij} & = 1 & \forall i \in \mathcal C\\ x_{ij} & \leq y_j & \forall i \in \mathcal C, j \in \mathcal D\\ \sum_{j \in \mathcal D} y_j & \leq K & \\ x_{ij}, y_{j} & \in \{0,1\} & \forall i \in \mathcal C, j \in \mathcal D \end{align*} \]

Case study: Display customer on map

Customer locations are identified by the first 5 digits (out of 8) of the postal zip number.

# change name of zip code name for subsequent merging
colnames(customer.data)[3] <- "zip_code"
colnames(geo.data)[1:3] <- c("zip_code","lat","lng")
# calculate median coordinates of zip areas
geo.data.sub <- data.frame(
  aggregate(lat ~ zip_code, FUN=median, 
            data = geo.data), 
  lng = aggregate(lng ~ zip_code, FUN=median, 
                  data = geo.data)[,2])
# merge geolocation data to customer data 
cust.data.add <- merge(customer.data, geo.data.sub, 
                       by="zip_code")
# remove outlier
cust.data.add <- filter(cust.data.add, lng < -30)
# generate plot
data(World)
# obtain brazil map
bra_map <- get_brmap(geo = "State", class = "sf")
# convert customer coordinates
cust.points <- st_as_sf(cust.data.add, 
                        coords = c("lng", "lat"))
# plot brazil map
 tm_shape(bra_map) + tm_borders() +
   tm_shape(cust.points) + tm_dots()

Conclusion

Summary

Avantages Disadvantages
easy to learn heterogenous coding
very flexible unknown quality of packages
simple integration with 3rd party code slow code for traditional coding
data structures easy to manipulate problems with big data \(\rightarrow\) tailor-made obejcts and memory handling needed