09 12 2019
Software packages covering most analytics tasks in a customized and modular way. They always contain data warehouse concepts as well as reporting and data analysis routines. Well-known vendors are IBM, Oracle, Microsoft, SAP, SAS, …
Gartner (2018)
BARC (2019)
R is an open-source, interpreted (statistical) programming language and environment, available at www.r-project.org.
R is the name of a programming language (like Java) as well as of a programming environment (like Eclipse). Traditionally, R is just a console and an editor for writing scripts. However, most users work with RStudio as a more powerful environment for R (www.rstudio.com).
Data obtained from https://www.kaggle.com/olistbr/brazilian-ecommerce
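The data objects used throughout this presentation (orders.data, orders.items, products.data, customer.data, geo.data) are assumed to be read from the downloaded CSV files; a minimal sketch, with file names as in the Kaggle download (an assumption here):
# read the Olist CSV files (assumed to be in the working directory)
orders.data   <- read.csv("olist_orders_dataset.csv")
orders.items  <- read.csv("olist_order_items_dataset.csv")
products.data <- read.csv("olist_products_dataset.csv")
customer.data <- read.csv("olist_customers_dataset.csv")
geo.data      <- read.csv("olist_geolocation_dataset.csv")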
head(orders.data, 4)
>                           order_id                      customer_id
> 1 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d
> 2 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef
> 3 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089
> 4 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82
>   order_status order_purchase_timestamp   order_approved_at
> 1    delivered      2017-10-02 10:56:33 2017-10-02 11:07:15
> 2    delivered      2018-07-24 20:41:37 2018-07-26 03:24:27
> 3    delivered      2018-08-08 08:38:49 2018-08-08 08:55:23
> 4    delivered      2017-11-18 19:28:06 2017-11-18 19:45:59
>   order_delivered_carrier_date order_delivered_customer_date
> 1          2017-10-04 19:55:00           2017-10-10 21:25:13
> 2          2018-07-26 14:31:00           2018-08-07 15:27:45
> 3          2018-08-08 13:50:00           2018-08-17 18:06:29
> 4          2017-11-22 13:39:59           2017-12-02 00:28:42
>   order_estimated_delivery_date
> 1           2017-10-18 00:00:00
> 2           2018-08-13 00:00:00
> 3           2018-09-04 00:00:00
> 4           2017-12-15 00:00:00
Installation via the menu or the install.packages() function. Be aware that this may take a while. This presentation can be found here
install.packages(c("tmap", "sf", "tmaptools", "brazilmaps", "png", "plotly", "geosphere", "htmlwidgets", "metR",
                   "tidyverse", "lubridate", "widgetframe", "ROI", "ROI.plugin.glpk", "ompr", "ompr.roi", "visNetwork"))
library(tmap)
library(sf)
library(tmaptools)
library(brazilmaps)
library(png)
library(plotly)
library(geosphere)
library(htmlwidgets)
library(metR)
library(tidyverse)
library(lubridate)
library(widgetframe)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
library(visNetwork)
Everything in R is an object. Objects always live in an environment and are categorized in numerous ways, e.g. by storage mode or class. The most important environment is the global environment, also called the workspace. We will distinguish data objects and functions. Functions are objects which create environments.
# shows current environment
environment()
# show all active environments
search()
# assign an object called 'da.1'
da.1 <- 5
# define a function
f.1 <- function() print("Hello world!")
# function printing the environment
f.2 <- function(){
print(environment())
}
# objects in the workspace
objects()
environment()
> <environment: R_GlobalEnv>
f.1()
> [1] "Hello world!"
f.2()
> <environment: 0x00000000378a3d68>
Functions are code wrappers. They consist of arguments (formals), a body, and an environment. The environment determines where the function resides (i.e., from where it can be called). The body describes what happens to the arguments.
# show arguments of a function
formals(f.1)
> NULL
# show body of a function
body(f.1)
> print("Hello world!")
# show environment of a function
environment(f.1)
> <environment: R_GlobalEnv>
# function with arguments
f.3 <- function(x,y) x+y
formals(f.3) # more to be seen
> $x
>
>
> $y
# most built-in functions have neither
# formals nor body --> external code
formals(sum); body(sum); environment(sum)
> NULL
> NULL
> NULL
# Dynamic lookup:
# in case an object is missing, R looks
# in the function's enclosing environment
# before reporting an error
x <- 5
y <- 7
f.4 <- function() x+y
f.4()
> [1] 12
# Name masking:
# What happens in a function, stays there
x <- 4
y <- 5
f.5 <- function() {
x <- 3; y <- 2; x*y
}
f.5()
> [1] 6
x; y
> [1] 4
> [1] 5
# unless we say otherwise
f.6 <- function(){
x <<- 3; y <<- 2; x*y
}
f.6()
> [1] 6
x; y
> [1] 3
> [1] 2
# arguments have to be passed
f.3(x=4,y=7); f.3( 4, 7)
> [1] 11
> [1] 11
# be careful with assignments
x = 9; y = 3
x; y
> [1] 9
> [1] 3
f.3( x <- 4, y <- 7)
> [1] 11
x; y
> [1] 4
> [1] 7
# Laziness: unused arguments are never evaluated
f.7 <- function(x) y^2
f.7("test")
> [1] 49
# you can define defaults
f.8 <- function(x=2, y=3) x^2+y^2
f.8(); f.8(3, 4)
> [1] 13
> [1] 25
\[f(x,\mu,\sigma) = \frac{1}{\sqrt{2 \cdot \pi \cdot \sigma^2 }} \cdot e^{-\frac{(x-\mu)^2}{2 \cdot \sigma^2}}\]
dnorm.1 <- function(x, mu, sigma){
tmp.1 <- 1/sqrt(2*pi*sigma^2)
tmp.2 <- exp(-(x-mu)^2/(2*sigma^2))
tmp.1 * tmp.2
}
# evaluation at 1
dnorm.1(1, mu = 3, sigma = 1)
> [1] 0.05399097
# result with built-in function:
dnorm(1, mean = 3, sd = 1)
> [1] 0.05399097
\[ \scriptsize \begin{align*} V_n(x, \mu, \sigma) = & \int_{x}^{\infty} (t-x) \cdot f(t,\mu, \sigma)\; \text{d} t\\ = & \sigma \cdot \left(\varphi\left(\frac{x-\mu}{\sigma}\right) - \frac{x-\mu}{\sigma} \cdot \left(1-\Phi\left(\frac{x-\mu}{\sigma} \right)\right)\right) \end{align*} \]
# or more complex (loss function):
v.fun <- function(x, mu = 0, sigma = 1) {
tmp.1 <- dnorm((x-mu)/sigma)
tmp.2 <- (x-mu) * (1-pnorm((x-mu)/sigma))
return(sigma * tmp.1 - tmp.2)
}
# value for st. loss function at 0.5
v.fun(.5)
> [1] 0.1977966
# the same value obtained directly by integration
integrate(lower=0.5, upper = 10,
f= function(x) (x-.5)*dnorm(x))
> 0.1977966 with absolute error < 1e-04
| names | description |
|---|---|
| Functions for functions | |
| help(func) | show help page of func |
| return(x) | returns x and exits function |
| stop(‘mess’) | stops execution of function and shows warning ‘mess’ |
| warning(‘mess’) | shows warning ‘mess’ |
| Plotting | |
| plot(x,y)/ggplot() | opens a graphic device and creates a scatter plot at x and y coordinates |
| lines(x,y)/geom_line() | adds lines to an existing plot |
| points(x,y)/geom_point() | adds points to an existing plot |
| curve(expr)/stat_function() | plots function expr |
| hist(x)/geom_histogram() | plots histogram of x |
| barplot(x)/geom_bar() | creates a barplot |
| boxplot(x)/geom_boxplot() | creates a boxplot |
| Basic math | |
| prod(x)/sum(x) | product/sum of x |
| log(x) | logarithm of x |
| exp(x) | e to the power of x |
| sqrt(x) | square root of x |
| names | description |
|---|---|
| Object attributes | |
| length(x) | length of x (1-dim.) |
| dim(x) | dimensions of x (at least 2-dim.) |
| nrow(x) | number of rows |
| ncol(x) | number of columns |
| Programming | |
| apply(x, m, expr) | applies expr to all columns (m=2) or rows (m=1) of matrix x |
| lapply(x,expr) | applies expr to all elements of list x |
| next | skips the remainder of the current iteration and continues with the next one |
| break | stops execution of a loop |
| repeat{code} | executes code until a stop or break occurs |
| for(i in x){code} | executes code once for each value of x (named i) |
| while(cond.){code} | executes code as long as cond. is true |
# plot standard normal distribution
curve(dnorm, -3, 3,
ylab="density")
# plot standard loss function with ggplot2
library(ggplot2)
ggplot(data=data.frame(x=-3:3), aes(x=x)) +
stat_function(fun = v.fun ) +
ylab("expected loss")
ggplotly(ggplot(products.data, aes(x=product_category_name)) + geom_bar() +
xlab("product category") + ylab("number of products") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)))
Basic data objects (vectors, matrices, and arrays) consist of scalars of the same storage type only (i.e., only numbers, characters, or logicals), while advanced data objects (data frames and lists) can contain different types.
| Name | Dim. | Creator function |
|---|---|---|
| vector | 1 | c(),numeric() |
| matrix | 2 | matrix() |
| array | n | array() |
| data frame | 2 | data.frame() |
| list | 1 | list() |
devopedia.org (2019)
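A short illustration of this difference (a sketch: c() silently coerces mixed input to one common storage type, while list() keeps each element's type):
# vectors coerce all elements to one storage type
c(1, "a", TRUE)
> [1] "1"    "a"    "TRUE"
# lists keep each element's type
str(list(1, "a", TRUE))
> List of 3
>  $ : num 1
>  $ : chr "a"
>  $ : logi TRUE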
# Vectors
## an empty numeric vector
v.1 <- numeric(5)
## connecting scalars
v.2 <- c(5, 6, 7)
## shortcut for integers
v.3 <- 10:18
# Matrices
## matrix with 2 columns
m.1 <- matrix(1:6, ncol=2)
## constructing matrices from vectors
### bind column-wise
m.2 <- cbind(1:3, 2:4, 3:5)
### row-wise binding
m.3 <- rbind(1:3, 2:4, 3:5)
# Arrays
## basic elements can contain only one data type
a.1 <- array(c(LETTERS[1:6], 7:12), dim=c(2,3,2))
v.3
> [1] 10 11 12 13 14 15 16 17 18
m.1
>      [,1] [,2]
> [1,]    1    4
> [2,]    2    5
> [3,]    3    6
a.1
> , , 1
>
>      [,1] [,2] [,3]
> [1,] "A"  "C"  "E"
> [2,] "B"  "D"  "F"
>
> , , 2
>
>      [,1] [,2] [,3]
> [1,] "7"  "9"  "11"
> [2,] "8"  "10" "12"
# Data frames
## technically:
## a list with vectors of same length
## structurally:
## a matrix with different data types
d.1 <- data.frame(a = 1:2,
b = c("rest", "test"),
c = c(T,F))
d.2 <- data.frame(a = 1:3,
b = seq(10, 20, length.out = 3),
c = rnorm(3))
# Lists
## a quite simple list
l.1 <- list(a = 1:3, b = "nest", c = TRUE)
## a more complicated list
l.2 <- list(a = 1:3, b = "nest",
c = list(
d = "test",
e = rep(x = c(TRUE,FALSE), each = 2)
)
)
d.1
>   a    b     c
> 1 1 rest  TRUE
> 2 2 test FALSE
head(d.2, 1) # first element
>   a  b         c
> 1 1 10 0.8478022
l.1
> $a
> [1] 1 2 3
>
> $b
> [1] "nest"
>
> $c
> [1] TRUE
tail(l.2, 1) # last element
> $c
> $c$d
> [1] "test"
>
> $c$e
> [1]  TRUE  TRUE FALSE FALSE
Data objects are indexed by position, name, or logical identifier. Vectors are indexed by <vector>[<index>]. You can also negate the selection, i.e. extract all entries except the indexed entries by <vector>[-<index>]. Matrices are 2-dimensional and, thus, require 2 indices, i.e. <matrix>[<index row>, <index col.>]
x <- c(5, 6, 7)
# extract 1st entry
x[1]
> [1] 5
# extract 1st & 3rd entry
x[c(1,3)]
> [1] 5 7
# logical indices have the same length as the indexed vector
x[c(T,F,T)]
> [1] 5 7
# generate an indexing vector by a logical expression
y <- x <= 6
# retain indices of those entries
z <- which(y)
# subvector of all entries of x which are > 6
x[!y]
> [1] 7
x <- cbind(1:6, 2:7, 3:8) # column-wise binding
# first row, first column (just one element)
x[1,1]
> [1] 1
# first row (a vector)
x[1,]
> [1] 1 2 3
# logical index vector
y <- rep(x = c(T,F,T), times = 2)
# mix row and column selections
x[y, c(1,3)]
>      [,1] [,2]
> [1,]    1    3
> [2,]    3    5
> [3,]    4    6
> [4,]    6    8
All advanced data objects can have named elements. These names can be used to extract the corresponding element by <object>$<element name>. Alternatively, elements can be accessed by <object>[[<element index>]] (single element) or <object>[<element index>] (multiple elements). Additionally, tibbles and data frames can be indexed like matrices.
df1 <- data.frame(one = 1:2, two = c("nest","test") )
# access column one
df1$one
> [1] 1 2
# a tibble (basically the same as data frame)
tb1 <- tibble(one = 1:2, two = c("nest","test"))
tb1$two # works too
> [1] "nest" "test"
# a list
l1 <- list(a = 1:3, b = "nest", c = TRUE)
# for multiple components
l1[c(1,3)]
> $a
> [1] 1 2 3
>
> $c
> [1] TRUE
# for multiple components by name
l1[c("a","c")]
> $a
> [1] 1 2 3
>
> $c
> [1] TRUE
# names can be retrieved
names(l1)
> [1] "a" "b" "c"
# ... and changed
names(l1) <- c("x","y","z")
l1$z
> [1] TRUE
| Name | Description |
|---|---|
| which(x) | returns indices of TRUE entries of x |
| sort(x) | sorts x ascendingly (unless decreasing=TRUE) |
| order(x) | returns indices to order x ascendingly (unless decreasing=TRUE) |
| table(x) | returns frequencies of values of x |
| aggregate(y~x, FUN) | splits data into subsets and applies FUN to each subset |
| lm(y~x) | estimates a linear regression model |
| merge(x,y,by) | merges data frames x and y by variable by |
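Most of these functions appear in the examples below; lm() does not, so here is a minimal sketch with simulated data:
# simulate data with intercept 2 and slope 3
x <- rnorm(50)
y <- 2 + 3*x + rnorm(50)
# estimate the linear regression model
fit <- lm(y ~ x)
coef(fit) # estimates should be close to 2 and 3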
# random vector of length 4
x <- rnorm(4); x
> [1]  0.6549547 -1.0846396 -0.4841962  0.7884190
# sort vector
sort(x)
> [1] -1.0846396 -0.4841962  0.6549547  0.7884190
# ordering vector
order(x)
> [1] 2 3 1 4
# ... which also sorts x
x[order(x)]
> [1] -1.0846396 -0.4841962  0.6549547  0.7884190
# create data frame
df <- data.frame(x = x, y = 1-10*x)
# reorder data frame
df[order(df$y),]
>            x         y
> 4  0.7884190 -6.884190
> 1  0.6549547 -5.549547
> 3 -0.4841962  5.841962
> 2 -1.0846396 11.846396
# determine frequency of ordered products
x <- table(orders.items$product_id)
# sort products in descending order
x <- sort(x, decreasing = T)
# compose a data frame with ...
df.freq.prod <- data.frame(
names = names(x), # names of products
names_int = 1:length(x), # product ID
freq = as.numeric(x/sum(x)), # rel. frequency
cfreq = cumsum(x/sum(x))) # cumulated rel. freq.
# create plot
ggplot(df.freq.prod, aes(x=names_int, y= cfreq)) +
geom_line() + ylab("cum. frequency") +
xlab("Product ID") + theme(axis.text =
element_text(size = 14),
axis.title=element_text(size=16))
# sum revenues (prices) per product over all orders
# (result is a data frame)
df.rev.prod <- aggregate(price ~ product_id,
FUN = sum, data = orders.items)
# sort product revenues descendingly
df.rev.prod <- df.rev.prod[order(df.rev.prod$price, decreasing = T),]
# enlarge data frame with ...
df.rev.prod <- data.frame(df.rev.prod,
names_int = 1:nrow(df.rev.prod),
share=df.rev.prod$price/sum(df.rev.prod$price),
cshare = cumsum(df.rev.prod$price/sum(df.rev.prod$price)))
# create plot
ggplot(df.rev.prod, aes(x=names_int, y=cshare)) +
geom_line() + ylab("cum. relative revenue") +
xlab("Product ID") + theme(axis.text=
element_text(size=14),
axis.title=element_text(size=16))
# sum revenues (prices) per product over all orders
df.prod.cat <- aggregate(price ~ product_id, FUN = sum, data = orders.items)
# add corresponding product categories
df.prod.cat <- merge(df.prod.cat, products.data, by = "product_id")
# create plot
ggplot(df.prod.cat, aes(x=product_category_name, y=log(price)) ) + geom_boxplot(outlier.size = 0.01) + ylab("log. prices") +
xlab("Product category") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))
Computer simulation means modelling a real-world system with a computer model. The modelled entities, their relations among each other, and external factors are usually described by mathematical expressions.
Naturally, R is well suited for stochastic simulation as it allows one to easily model stochastic processes. One core element are the distribution functions: for a given distribution, R usually offers the density (d), the probability (p), and the quantile (q) function as well as a random number generator (r). The syntax for the normal distribution is, e.g., d/p/q/rnorm(x). See ?Distributions
| Name | Description |
|---|---|
| sample(x, size) | samples size elements from x |
| d/p/q/rnorm(x) | normal distribution |
| d/p/q/rbinom(x) | binomial distribution |
| d/p/q/runif(x) | uniform distribution |
| d/p/q/rpois(x) | Poisson distribution |
| replicate(x, FUN) | repeats FUN x times |
# sample from a vector
sample(1:10, 5)
> [1]  6 10  2  7  5
# ... also with replacement
sample(1:5, 5, replace = TRUE)
> [1] 4 1 2 5 2
# 95%-quantile of the st. normal dist.
qnorm(0.95)
> [1] 1.644854
# 5 Bernoulli trials with prob. 0.5
rbinom(n=5, size=1, prob=0.5)
> [1] 0 0 0 1 0
# ... same with replicate
replicate(5, rbinom(n=1, size=1, prob=0.5))
> [1] 1 0 0 1 0
# most often ordered product
x.max <- rownames(df.freq.prod)[1]
# identify orders with x.max
orders.items.x.max <- orders.items[which(orders.items$product_id == x.max), ]
# add order information
orders.x.max <- merge(orders.items.x.max, orders.data, by="order_id")
# reorder according to date
orders.x.max <- orders.x.max[order(orders.x.max$order_purchase_timestamp), ]
# convert dates
orders.x.max$order_purchase_timestamp <- as_datetime(orders.x.max$order_purchase_timestamp)
# add date components and order quantity column
orders.x.max <- data.frame(orders.x.max,
year.order = year(orders.x.max$order_purchase_timestamp),
day.order = yday(orders.x.max$order_purchase_timestamp),
ord.quant = rep(1,nrow(orders.x.max)))
# convert days to factor
orders.x.max$day.order <- as.factor(orders.x.max$day.order)
# add missing factor levels
levels(orders.x.max$day.order) <- c(
levels(orders.x.max$day.order),
as.character(1:365)[!as.character(1:365) %in% levels(orders.x.max$day.order)])
# sort factor levels
orders.x.max$day.order <- factor(orders.x.max$day.order,levels(orders.x.max$day.order)[order(as.numeric(levels(orders.x.max$day.order)))])
# calculate daily demand
day.dem.x.max <- aggregate(ord.quant ~ day.order + year.order,
                           orders.x.max, FUN=sum, drop = F)
# cut off days before product entered portfolio
id.nan <- which(!is.na(day.dem.x.max$ord.quant))
day.dem.x.max <- day.dem.x.max[min(id.nan):max(id.nan), ]
# each day without entry has demand of 0
day.dem.x.max$ord.quant[is.na(day.dem.x.max$ord.quant)] <- 0
# recompose date
tmp <- make_date(year = day.dem.x.max$year.order)
yday(tmp) <- as.numeric(day.dem.x.max$day.order)
# add to data frame
day.dem.x.max <- data.frame(day.dem.x.max,
date = tmp)
# make time series plot
ggplot(day.dem.x.max, aes(x=date, y=ord.quant)) +
geom_point() + geom_bar(stat='identity') +
ylab("order quantity")
library(fitdistrplus)
# find best fitting distribution
descdist(day.dem.x.max$ord.quant, discrete = TRUE)
fit.nbino <- fitdist(day.dem.x.max$ord.quant,
"nbinom")
denscomp(fit.nbino)
sq.inv <- function(d, s, q, l.ini, L = 1 ){
# number of periods
n <- length(d)
# initialize result matrix
tmp.mat <- matrix(0, nrow = n + 1 + L, ncol = 5 )
colnames(tmp.mat) <- c("demand","inventory","disp.inv", "order", "delivery")
# initialize matrix
tmp.mat[,"demand"] <- c(0, d, rep(0,L))
tmp.mat[1,"inventory"] <- l.ini
tmp.mat[1,"disp.inv"] <- l.ini
# inventory calculation
for(i in 2:(n+1)){
# check whether reorder level is reached
if(tmp.mat[i-1, "disp.inv"] <= s){
tmp.mat[i, "order"] <- q
tmp.mat[i+L, "delivery"] <- q
}
# inventory balances
tmp.mat[i, "disp.inv"] <- tmp.mat[i-1, "disp.inv"] - tmp.mat[i, "demand"] + tmp.mat[i, "order"]
tmp.mat[i, "inventory"] <- tmp.mat[i-1, "inventory"] - tmp.mat[i, "demand"] + tmp.mat[i, "delivery"]
}
# return results
return(as.data.frame(tmp.mat[2:(n+1),]))
}
# create Poisson distributed demand with
# mean = 10 for 100 periods
d.vec <- rpois(100, 10)
# set parameters and
# simulate stock levels
res.mat <- sq.inv(d = d.vec, # demand
s = 12, # reorder point
q = 70, # order quantity
l.ini = 70, # initial stock
L = 1) # lead time
# plot inventory levels
ggplot(res.mat,
aes(x=1:nrow(res.mat), y=inventory)) +
geom_step() +
geom_hline(yintercept = 0) +
xlab("periods")
# use the empirical daily demand
# of the top-selling product
d.vec <- day.dem.x.max[,"ord.quant"]
# set parameters and
# simulate stock levels
inv.x.max <- sq.inv(d = d.vec, # demand
s = 10, # reorder point
q = 20, # order quantity
l.ini = 20, # initial stock
L = 3) # lead time
# plot inventory levels
ggplot(inv.x.max,
aes(x=1:nrow(inv.x.max), y=inventory)) +
geom_step() +
geom_hline(yintercept = 0) +
xlab("periods")
eval.inv <- function(result, c.order, c.inv, c.stockout = 0){
# determine periods with deliveries
tmp.id <- result[,"delivery"] > 0
# number of orders
nb.orders <- sum(tmp.id)
# total stock
tot.stock <- result %>% filter(inventory > 0) %>% pull(inventory) %>% sum
# total backlog
tot.backlog <- abs(result %>% filter(inventory < 0) %>% pull(inventory) %>% sum)
# total cost per period
total.cost <- (nb.orders * c.order + tot.stock * c.inv + tot.backlog * c.stockout)/nrow(result)
# retain stock levels before delivery arrives
tmp.min.stock <- result[which(tmp.id) - 1,"inventory"]
# alpha-service level (per period, not order cycle)
alpha.sl <- 100 * ( 1 - sum( tmp.min.stock < 0 ) / nrow(result) )
# Beta-service level
beta.sl <- (100 * (1 - sum( -tmp.min.stock[ tmp.min.stock < 0]) / sum(result[,"demand"])))
# return results
return(list( total.per.cost = total.cost, alpha.sl = alpha.sl, beta.sl = beta.sl,
nb.orders = nb.orders, total.stock = tot.stock, total.backlog = tot.backlog
))}
For the order quantity, the Harris-Andler formula can be used: \[ q = \sqrt{\frac{2 \cdot \mu_d \cdot c_o}{c_i}}\] The reorder point \(s\) controls the stock-out risk. To assess this risk, note that the demand in the lead time is the sum of \(L\) negative binomially distributed variables \(d_t \sim NB(r,p)\), for which \[D^L = \sum_{t=1}^L d_t \sim NB(L \cdot r, p)\] holds. For controlling the \(\alpha\) service level, the \(\alpha\) quantile of \(D^L\) can be used. For the \(\beta\) service level, the corresponding loss function \(V_{NB}(x, r, p)\) can be used as follows
\[ s = V^{-1}_{NB}\left((1-\beta)\cdot \mu_{d} \cdot L ,r,p\right)\]
# formulate loss function of neg. binomial dist.
loss.nb <- function(x, r, p){
sum( ((x:1000)-x) *
dnbinom((x:1000), size=r, prob=p ))
}
# formulate inverse loss function
inv.loss.nb <- function (yval, ...){
round(uniroot(f = function (x)
(yval - loss.nb(round(x), ...)) ,
interval = c(0, 1000))$root)
}
# uniroot looks for the roots of a cont. function
# thus, a bit crude to round the argument
# fit best parameters of negative binomial
fit.nbino <- fitdist(day.dem.x.max[,"ord.quant"],
"nbinom")
# fitdist returns size (a.k.a. r) and mean
fit.r <- fit.nbino$estimate["size"]
# to apply convolution, p is calculated
fit.p <- fit.nbino$estimate["size"]/
(fit.nbino$estimate["size"] + fit.nbino$estimate["mu"])
With \(L=3\) days, \(r=\) 0.41, \(p=\) 0.23, and \(\mu_d=\) 1.35, the reorder point \(s\) for a given \(\alpha\) or \(\beta\) service level can be determined as follows.
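A sketch of this computation, reusing fit.r, fit.p, and inv.loss.nb from above (target levels of 0.95 are assumptions for illustration):
# lead time and mean daily demand
L <- 3
mu.d <- fit.nbino$estimate["mu"]
# alpha service level: 95% quantile of lead-time demand D^L ~ NB(L*r, p)
s.alpha <- qnbinom(0.95, size = L * fit.r, prob = fit.p)
# beta service level: invert the loss function of D^L at (1-beta)*mu_d*L
s.beta <- inv.loss.nb(yval = (1 - 0.95) * mu.d * L,
                      r = L * fit.r, p = fit.p)
s.alpha; s.beta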
# set parameters and simulate stock levels
inv.b95 <- sq.inv(d = day.dem.x.max[,"ord.quant"],
s = 17, # reorder point
q = 20, # order quantity
l.ini = 20, # initial stock
L = 3) # lead time
head(eval.inv(inv.b95,
c.inv=0.01,
c.order=100,
c.stockout = 0.15),5)
> $total.per.cost
> [1] 7.146564
>
> $alpha.sl
> [1] 98.71795
>
> $beta.sl
> [1] 95.44592
>
> $nb.orders
> [1] 27
>
> $total.stock
> [1] 8056
Most multi-purpose optimization routines in R are dedicated to continuous optimization, as this is most often encountered in statistics. Additionally, constrained optimization problems appear, which are notoriously difficult to solve in their most general form. A comprehensive overview can be found here.
| type of objective func. | variables | constraints | package | function |
|---|---|---|---|---|
| general | 1, cont. | none | built-in | optimize() |
| general | \(n\), cont. | boxed | built-in | optim() |
| general | \(n\), cont. | boxed | optimx | optimx() |
| general | \(n\), cont. | linear | built-in | constrOptim() |
| linear | \(n\), cont. | linear | lpSolve | lp() |
| quadratic | \(n\), cont. | linear | quadprog | solve.QP() |
| general | \(n\), cont. | general | nloptr | nloptr() |
| general | \(n\), cont. | general | Rsolnp | solnp() |
optim
For unconstrained (or at most box-constrained) general-purpose optimization, R offers the function optim(); the package optimx provides optimx() with an identical syntax: optim(par = <initial parameters>, fn = <obj. function>, method = <opt. routine>). Additionally, lower and upper bounds for the parameters can be set via the options lower = <vector of lower bounds> and upper = <vector of upper bounds>. In this case, method = "L-BFGS-B" must be selected.
\[\min\limits_{(u,v) \in \mathbb R^2} \rightarrow \sum_{i \in \mathcal{C} } a_i \cdot (|x_i - u| + |y_i - v|)\]
# objective: weighted Manhattan distance to location loc = (u,v)
loc.mh <- function(loc, a, x, y) sum(a * (abs(x - loc[1]) + abs(y - loc[2])))
n <- 100
df.mh.loc <- data.frame(
  weight = sample(1:100, size = n), # sample weights for each point/customer
  x = rnorm(n),                     # sample x coordinates
  y = rnorm(n))                     # sample y coordinates
res <- optim(par = c(0,0), fn = loc.mh, method = "BFGS",
             a = df.mh.loc$weight, x = df.mh.loc$x, y = df.mh.loc$y)
res$par
> [1] -0.1303003 -0.1399569
# generate grid of potential locations
emat <- expand.grid(x = seq(-3, 3, length.out = 100),
                    y = seq(-3, 3, length.out = 100))
# evaluate the objective on the grid
emat <- emat %>% add_column(objective = apply(emat, 1, function(z)
  loc.mh(z, a = df.mh.loc$weight, x = df.mh.loc$x, y = df.mh.loc$y)))
# generate plot
ggplot(NULL, aes(x, y)) +
  geom_point(data = df.mh.loc, aes(size = weight)) +
  geom_point(aes(x = res$par[1], y = res$par[2]),
             colour = "blue", shape = 17, size = 3) +
  geom_contour(data = emat, aes(z = objective)) +
  geom_text_contour(data = emat, aes(z = objective), stroke = .15)
# wrapper function for simulating & evaluating (s,q) setting
sim.inv <- function(x, d.vec = day.dem.x.max[,"ord.quant"]){
tmp <- sq.inv(d = d.vec, s = x[1], q = x[2], l.ini = 10, L = 3)
eval.inv(tmp, c.inv=0.01, c.order=100, c.stockout = 0.5)$total.per.cost
}
For finding the cost-optimal setting of \(s\) and \(q\), the following constraints are modelled (matching the matrix c.mat and vector c.vec below): \[ \begin{align*} q - s & \geq 0\\ s & \geq -100\\ q & \geq 0 \end{align*} \]
c.mat <- cbind(c(-1,1,0), c(1,0,1))
c.vec <- c(0,-100,0)
res <- constrOptim(theta = c(12,151), f = sim.inv,
                   ui = c.mat, ci = c.vec, grad = NULL)
head(res, 2)
> $par
> [1]   8.000324 131.000082
>
> $value
> [1] 1.727231
# generate grid of parameter constellations
emat <- expand.grid(s = seq(6, 10, length.out = 50),
                    q = seq(110, 150, length.out = 50))
emat <- emat %>% add_column(objective = apply(emat, 1, function(x) sim.inv(x)))
# generate plot
ggplot(emat, aes(x=s, y=q)) +
  geom_contour(aes(z=objective)) +
  geom_text_contour(aes(z=objective), stroke=.15) +
  geom_point(aes(x=res$par[1], y=res$par[2]), colour="red", shape = 17, size = 3)
Mixed-integer linear optimization problems (MILP) are characterized by objective functions and constraints that are linear w.r.t. the decision variables. However, some or all decision variables are integer and/or binary. In general, the canonical form of a MILP can be written as
\[ \min\limits_{ \mathbf{x}, \mathbf{y}, \mathbf{z}} \rightarrow \mathbf{c}_x^T \cdot \mathbf{x} + \mathbf{c}_y^T \cdot \mathbf{y} + \mathbf{c}_z^T \cdot \mathbf{z}\] \[ \begin{align} \mathbf{A}_x \cdot \mathbf{x} + \mathbf{A}_y \cdot \mathbf{y} + \mathbf{A}_z \cdot \mathbf{z} & \leq \mathbf{b} \\ \mathbf{x} & \in \mathbb{R} \\ \mathbf{y} & \in \mathbb{Z} \\ \mathbf{z} & \in \{0,1\} \\ \end{align} \]
Hence, a MILP basically consists of five parts: the objective coefficients \(\mathbf c\), the constraint matrix \(\mathbf A\), the constraint directions, the right-hand side \(\mathbf b\), and the types (and bounds) of the decision variables.
ROI package: R Optimization Infrastructure
Although any optimization problem consists of an objective function, variables, and constraints, there are numerous ways to formalize these components. The package ROI provides a unified framework for setting up and solving optimization problems. An extensive description of the package can be found here.
library(ROI)             # load package
ROI_available_solvers()  # check available solvers
ROI_installed_solvers()  # check installed solvers
Theussl et al. (2017)
ROI package: Structure
To formulate an optimization problem, the function OP(objective, constraints, types, bounds) is used. All components are generated by creator functions. The types of variables are defined by a character vector of length \(n\) consisting of the characters 'C', 'I', or 'B' indicating continuous, integer, or binary variables, respectively. Correspondingly, upper and lower bounds of the variables can be given as a named list of two vectors of length \(n\) named "lower" and "upper".
For the objective function these creator functions are defined as follows:
| type | creator | description |
|---|---|---|
| function | F_objective(F,n,G,H,names) | \(F(\mathbf x)\)…function to optimize, \(\mathbf x\)…vector of decision variables, \(G(\mathbf x)\)…gradient function (\(F'(\mathbf x)\), optional), \(H(\mathbf x)\)…Hessian function (\(F''(\mathbf x)\), optional), names…naming of variables |
| quadratic | Q_objective(Q,L,names) | \(F(\mathbf x)= \frac{1}{2}\cdot \mathbf x^T \cdot \mathbf Q\cdot \mathbf x+ \mathbf x^T\cdot \mathbf l\), \(\mathbf Q\)…matrix (\(n\times n\)) with coefficients of quadratic terms, \(\mathbf l\)…vector of coefficients for linear terms, names…naming of variables |
| linear | L_objective(L,names) | \(F(\mathbf x)= \mathbf x^T\cdot \mathbf l\), \(\mathbf l\)…vector of coefficients for linear terms, names…naming of variables |
ROI package: Constraints
The second component of an optimization problem are the constraints. In general, a constraint consists of a left-hand side which depends on the decision variables \(\mathbf{x}\), the right-hand side vector (\(\mathbf{b}\), rhs), and the direction vector (dir), e.g. \(lhs(\mathbf{x})\leq \mathbf{b}\). The left-hand side is created in a similar way as the objective function using the following creators
| type | creator | description |
|---|---|---|
| function | F_constraint(F,dir,rhs,names) | \(F\)…list of functions with \(F=\left\{f_i(\mathbf x) \right\}\) such that \(f_i(\mathbf x) \leq b_i\) |
| quadratic | Q_constraint(Q,L,dir,rhs,names) | left-hand side: \(\frac{1}{2}\cdot \mathbf x^T \cdot \mathbf Q\cdot \mathbf x+ \mathbf x^T\cdot \mathbf L\) with \(\mathbf Q\)…list of \(n\times n\)-matrices with coefficients of quadratic terms, \(\mathbf L\)…matrix of coefficients for linear terms |
| linear | L_constraint(L,dir,rhs,names) | left-hand side: \(\mathbf x^T\cdot \mathbf L\), \(\mathbf L\)…matrix of linear coefficients |
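Putting the creators together, a minimal sketch of a small linear program (the coefficients are made up for illustration):
# maximize 3*x + 7*y subject to x + y <= 10 and x, y >= 0 (default bounds)
lp <- OP(objective = L_objective(c(3, 7), names = c("x", "y")),
         constraints = L_constraint(L = matrix(c(1, 1), nrow = 1),
                                    dir = "<=", rhs = 10),
         maximum = TRUE)
lp.sol <- ROI_solve(lp)
solution(lp.sol) # optimum: x = 0, y = 10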
\[\min \rightarrow \sum_{i,j =1}^n x_{ij} \cdot c_{ij}\] \[\begin{align}\small \sum_{i=1}^n x_{ij} &= 1 & \forall j = 1,...,n\\ \sum_{j=1}^n x_{ij} &= 1 & \forall i = 1,...,n\\ x_{ij} & \in \left\{0,1\right\} \end{align}\]
For a linear assignment problem with \(n=3\) rooms, \(\small \mathbf{x} = (x_{11},x_{12},...,x_{1n},x_{21},...,x_{2n},...,x_{nn})\) and \[ \small \mathbf A = \left[ \begin{array}{*{10}{r}} 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ \end{array} \right] \]
# create problem dimension
n <- 10
# create cost vector
c.vec <- rpois(n^2, 100) # sample costs from Poisson distribution with mean 100
# create empty constraint matrix
L.mat <- matrix(0, ncol = n^2, nrow = 2*n)
# first n rows
L.mat[1:n,] <- do.call(cbind, lapply(1:n, function(x) diag(n)))
# last n rows
L.mat[(n+1):(2*n),] <- t(sapply(1:n, function(x) c(rep(0,(x-1)*n), rep(1,n),rep(0,(n-x)*n) )))
# create optimization problem
copt <- OP(
objective = L_objective(L = c.vec),
constraints = L_constraint(L = L.mat, dir = rep("==", 2*n), rhs = rep(1, 2*n)),
types = rep("B", n^2)
)
# solve the problem
copt_sol <- ROI_solve(copt)
# display solution in matrix form
sol.mat <- matrix(copt_sol$solution, ncol = n)
# nodes
nodes <- data.frame(id = 1:20, label = c(paste("room", 1:10), paste("element", 1:10)),
group = rep(c("rooms", "elements"), each=10), title = c(paste0("<p><b>room ", 1:10,"</b><br></p>"),
paste0("<p><b>element ", 1:10,"</b><br></p>")), color = rep(c("darkred","darkblue"), each=10))
# edges
edges <- data.frame(from = 1:10, to = 10+apply(sol.mat, 1, function(x) which(x>0)), label = matrix(c.vec, ncol = n, byrow=T)[which(t(sol.mat)>0)] )
# visNetwork(nodes, edges, height = "200px", width = "99.999%") %>%
# visLayout(hierarchical = TRUE)
ompr
To formulate large (mixed-)integer optimization problems, algebraic modelling languages are usually used. Commercial solvers often provide a tailored algebraic programming language like ILOG OPL. The package ompr provides similar functionality for modelling MILPs in an algebraic way. The package uses the ROI framework for accessing MILP solvers.
Models in ompr are set up by adding parts of the model via piping with the following functions (a small stand-alone example follows below):
| function | description |
|---|---|
| MIPModel() | model creator (without arguments) |
| set_objective(expr, sense) | create objective function (sense: "min" or "max") |
| add_variable(name, quantifier) | add variable group |
| add_constraint(expr, quantifier) | add constraint group |
| sum_expr(expr, quantifier) | create sums |
Note that all expressions must be vectorized, i.e., they should be able to process vectors and return the results as a vector.
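A minimal sketch of this piping pattern (a small knapsack problem; the item values, weights, and capacity are made up):
library(ompr); library(ompr.roi); library(ROI.plugin.glpk)
v <- c(10, 13, 7)  # item values (assumed)
w <- c(3, 4, 2)    # item weights (assumed)
cap <- 6           # knapsack capacity (assumed)
knap <- MIPModel() %>%
  add_variable(x[i], i = 1:3, type = "binary") %>%
  set_objective(sum_expr(v[i] * x[i], i = 1:3), "max") %>%
  add_constraint(sum_expr(w[i] * x[i], i = 1:3) <= cap) %>%
  solve_model(with_ROI(solver = "glpk"))
get_solution(knap, x[i])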
Given a set of nodes \(\mathcal{N}\) with \(n = |\mathcal{N}|\) and pairwise distances \(d_{ij}\) between any pair of nodes (\(i,j \in \mathcal{N}\)), the TSP is to find the shortest tour which visits all nodes and returns to the origin. To avoid subtours the Miller-Tucker-Zemlin formulation is used:
\[\min \rightarrow \sum_{i=1}^n \sum_{j=1}^n d_{ij} \cdot x_{ij}\]
\[ \begin{align} \sum_{i \in \mathcal{N}: i\neq j} x_{ij} & = 1 & \forall j \in \mathcal{N} \\ \sum_{j \in \mathcal{N}: j\neq i} x_{ij} & = 1 & \forall i \in \mathcal{N} \\ u_j & \geq u_i + 1 - n \cdot (1-x_{ij}) & \forall i,j \in \mathcal{N}, i \neq j, i,j>1 \\ u_1 & = 1 & \\ u_i & \in \left[2,n\right] & \forall i \in \mathcal{N}, i>1\\ x_{ii} & = 0 & \forall i \in \mathcal{N} \\ x_{ij} & \in \left\{0,1\right\} & \forall i,j \in \mathcal{N} \end{align} \]
ompr: TSP example
# set number of nodes & sample node coordinates
n <- 10; cities <- data.frame(id = 1:n, x = runif(n), y = runif(n))
# calculate distances
distance <- as.matrix(dist(cities[,c("x","y")]))
# create vectorized function for accessing distance values
dist_fun <- function(i, j) vapply(seq_along(i), function(k) distance[i[k], j[k]], numeric(1L))
# remove variables with same name as your decision var.
rm(x); rm(u)
# create model
model <- MIPModel() %>%
# create variables x and u
add_variable(x[i, j], i = 1:n, j = 1:n, type = "integer", lb = 0, ub = 1) %>%
add_variable(u[i], i = 1:n, lb = 1, ub = n) %>%
# minimize travel distance
set_objective(sum_expr(dist_fun(i, j) * x[i, j], i = 1:n, j = 1:n), "min") %>%
# leave each node
add_constraint(sum_expr(x[i, j], j = 1:n) == 1, i = 1:n) %>%
# visit each node
add_constraint(sum_expr(x[i, j], i = 1:n) == 1, j = 1:n) %>%
# ensure no subtours (arc constraints)
add_constraint(u[1] == 1) %>%
add_constraint(u[j] >= u[i] + 1 - n * (1 - x[i, j]), i = 1:n, j = 2:n) %>%
# exclude self-cycles
set_bounds(x[i, i], ub = 0, i = 1:n)
# solve model with GLPK
result <- solve_model(model,
with_ROI(solver = "glpk", verbose = TRUE))
# obtain solution
solution <- get_solution(result, x[i, j]) %>%
filter(value > 0)
# use variable u to reorder the data frame of nodes
ggplot(cities[
c(order(get_solution(result, u[i])$value),1),],
aes(x, y)) + geom_point() + geom_path() +
geom_text(aes(label=id),hjust=0, vjust=0) +
ggtitle(paste0("Total cost: ",
round(objective_value(result), 2)))
For delivering to the customers in Brazil, 5 distribution centers (DCs) shall be installed such that the total weighted distance between the customers and their nearest DC is minimized. The customers' weights are the numbers of orders in the data set. The optimization problem can be set up as follows \[ \min \rightarrow \sum_{i \in \mathcal C} w_i \cdot \sum_{j \in \mathcal D} x_{ij} \cdot d_{ij}\] \[ \begin{align*} \sum_{j \in \mathcal D} x_{ij} & = 1 & \forall i \in \mathcal C\\ x_{ij} & \leq y_j & \forall i \in \mathcal C, j \in \mathcal D\\ \sum_{j \in \mathcal D} y_j & \leq K & \\ x_{ij}, y_{j} & \in \{0,1\} & \forall i \in \mathcal C, j \in \mathcal D \end{align*} \]
Customer locations are identified by the first 5 digits (out of 8) of the zip code.
# change name of zip code name for subsequent merging
colnames(customer.data)[3] <- "zip_code"
colnames(geo.data)[1:3] <- c("zip_code","lat","lng")
# calculate median coordinates of zip areas
geo.data.sub <- data.frame(
aggregate(lat ~ zip_code, FUN=median,
data = geo.data),
lng = aggregate(lng ~ zip_code, FUN=median,
data = geo.data)[,2])
# merge geolocation data to customer data
cust.data.add <- merge(customer.data, geo.data.sub,
by="zip_code")
# remove outlier
cust.data.add <- filter(cust.data.add, lng < -30)
# generate plot
data(World)
# obtain brazil map
bra_map <- get_brmap(geo = "State", class = "sf")
# convert customer coordinates
cust.points <- st_as_sf(cust.data.add,
coords = c("lng", "lat"))
# plot brazil map
tm_shape(bra_map) + tm_borders() +
tm_shape(cust.points) + tm_dots()
| Advantages | Disadvantages |
|---|---|
| easy to learn | heterogeneous coding styles |
| very flexible | unknown quality of packages |
| simple integration with 3rd-party code | slow code for traditional (loop-based) coding |
| data structures easy to manipulate | problems with big data \(\rightarrow\) tailor-made objects and memory handling needed |