library(RCurl)
library(tidyverse)
library(fitdistrplus)
set.seed(1)
Below, I define my URLs and fetch the data for local analysis:
# All problem datasets were converted to csv and uploaded to Github
# for convenience.
p1_url <- "https://raw.githubusercontent.com/jbrnbrg/Wk8_604_Input_Analysis_Var_Reduction/master/prob_data_06_01.csv"
p2_url <- "https://raw.githubusercontent.com/jbrnbrg/Wk8_604_Input_Analysis_Var_Reduction/master/prob_data_06_02.csv"
p3_url <- "https://raw.githubusercontent.com/jbrnbrg/Wk8_604_Input_Analysis_Var_Reduction/master/prob_data_06_03.csv"
p1 <- read.csv(text = getURL(p1_url), header = F)
p2 <- read.csv(text = getURL(p2_url), header = F)
p3 <- read.csv(text = getURL(p3_url), header = F)
p1_data <- p1$V1
p3_data <- p3$V1
Problem data in csv format: Inter-arrival times (in minutes) of calls into a call center.
Reviewing the histogram below, the Problem.Dataset_06_01.xls
data looks like it could be modeled well by an exponential, Weibull, log-normal, or gamma distribution.
hist(p1_data)
The fitdist
function in the fitdistrplus
library provides a wide variety of tools for fitting distributions and testing the fits. Highly recommended.
I'll use descdist to review different possibilities by employing the Cullen and Frey graph with 500 bootstrapped samples
- the resulting chart plots the kurtosis against the squared skewness of the data to help me narrow down the candidate distributions - but not select one (note that the Weibull isn't shown, but it lies close to the gamma and log-normal).
descdist(p1_data, boot = 500, method = "sample")
## summary statistics
## ------
## min: 2.06 max: 48.65
## median: 9.015
## mean: 11.92048
## sample sd: 9.714686
## sample skewness: 1.725842
## sample kurtosis: 6.307556
After reviewing the plot above, I've settled on comparing three distribution fits on this data: Weibull, gamma, and log-normal. Below, I create a generic helper function around the fitdistrplus
functions that plot and compare the fits:
lil_fit_func <- function(x, fits2try){
  # x is a list of fitdist objects
  # fits2try is a list of names
  cdfcomp(x, legendtext = fits2try)
  denscomp(x, legendtext = fits2try)
  qqcomp(x, legendtext = fits2try)
  ppcomp(x, legendtext = fits2try)
  gofstat(x, fitnames = fits2try)
}
fitW1 <- fitdist(p1_data, "weibull")
fitg1 <- fitdist(p1_data, "gamma")
fitL1 <- fitdist(p1_data, "lnorm")
fits_compare_P1 <- list(fitW1, fitg1, fitL1)
fit_names_P1 <- c("weibull", "gamma", "lognorm")
Comparison of Weibull, Gamma, and log-normal fits on the problem 1 data:
lil_fit_func(fits_compare_P1, fit_names_P1)
## Goodness-of-fit statistics
## weibull gamma lognorm
## Kolmogorov-Smirnov statistic 0.07934941 0.07461236 0.05044292
## Cramer-von Mises statistic 0.05847634 0.05174037 0.01825121
## Anderson-Darling statistic 0.44537170 0.35230811 0.14142579
##
## Goodness-of-fit criteria
## weibull gamma lognorm
## Akaike's Information Criterion 290.3611 288.2076 284.9146
## Bayesian Information Criterion 293.8364 291.6830 288.3900
Generally, the model with the lowest BIC is preferred. Via Wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
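Concretely, for a model with \(k\) estimated parameters, sample size \(n\), and maximized likelihood \(\hat{L}\): \(\text{AIC} = 2k - 2\ln\hat{L}\) and \(\text{BIC} = k\ln(n) - 2\ln\hat{L}\), so for \(n \geq 8\) the BIC penalty is the larger of the two.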
Interpreting the results, I am confident that the log-normal distribution is the appropriate fit:
fitL1$estimate
## meanlog sdlog
## 2.1850979 0.7711947
Then the Simio code would be: Random.Lognormal(2.1850979, 0.7711947)
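As a quick sanity check (an added aside, not part of the original analysis), the fitted log-normal's theoretical mean, \(e^{\mu + \sigma^2/2}\), should land near the sample mean of 11.92 reported by descdist above:
exp(2.1850979 + 0.7711947^2 / 2)  # ~ 11.97, close to the sample mean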
Reviewing the histogram below, the combined Problem.Dataset_06_01.xls
and Problem.Dataset_06_02.xls
data looks like it could be modeled well by an exponential, Weibull, log-normal, or gamma distribution.
# note that since I've gotten more data, I'm going to try and fit a
# single model to the combined data of p1 and p2:
p2_data <- rbind(p2, p1)$V1
hist(p2_data)
I'll again turn to the descdist
function to review the Cullen and Frey graph of kurtosis versus squared skewness.
descdist(p2_data, boot = 500, method = "sample")
## summary statistics
## ------
## min: 2.06 max: 48.65
## median: 9.14
## mean: 11.37753
## sample sd: 8.107968
## sample skewness: 1.953007
## sample kurtosis: 7.823871
After reviewing the combined data from problems 1 and 2 above, I'll try out four distributions: Weibull, gamma, exponential, and log-normal:
fitW2 <- fitdist(p2_data, "weibull")
fitg2 <- fitdist(p2_data, "gamma")
fitE2 <- fitdist(p2_data, "exp")
fitL2 <- fitdist(p2_data, "lnorm")
fits_compare_P2 <- list(fitW2, fitg2, fitE2, fitL2)
fit_names_P2 <- c("weibull", "gamma", "expo", "lognorm")
lil_fit_func(fits_compare_P2, fit_names_P2)
## Goodness-of-fit statistics
## weibull gamma expo lognorm
## Kolmogorov-Smirnov statistic 0.08964109 0.07770454 0.2201968 0.03229403
## Cramer-von Mises statistic 0.21344699 0.12427091 1.1178276 0.01270974
## Anderson-Darling statistic 1.38478074 0.72525507 6.3539379 0.08760997
##
## Goodness-of-fit criteria
## weibull gamma expo lognorm
## Akaike's Information Criterion 588.7543 580.2897 612.8320 572.8128
## Bayesian Information Criterion 593.7315 585.2670 615.3206 577.7901
As shown above, log-normal again seems to be the most appropriate fit.
fitL2$estimate
## meanlog sdlog
## 2.2235537 0.6395984
Then the Simio code would be: Random.Lognormal(2.2235537, 0.6395984)
Again, I review the data:
hist(p3_data, freq = F)
After reviewing the data directly, I noted that it appears to be discrete.
descdist(p3_data, boot = 500, method = "sample", discrete = T)
## summary statistics
## ------
## min: 1 max: 8
## median: 3
## mean: 2.955556
## sample sd: 1.519584
## sample skewness: 0.9870622
## sample kurtosis: 4.144604
As I had difficulty getting the binomial and Poisson fits to work in R, I resorted to Stat::Fit, per the text, which worked very well:
Stat::Fit for Problem 3: Binomial Fit
Turning to the auto-fit, I see that Stat::Fit selected the same distributions that I had identified from the Cullen and Frey graph:
Stat::Fit AutoFit for Problem 3
And here is the goodness of fit test that shows that Binomial is the most appropriate fit according to the results and my observations:
Stat::Fit Goodness-of-Fit Problem 3
Per the Stat::Fit program, the Simio code for the binomial is: Random.Binomial(0.185, 16.)
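For completeness, here is one way these discrete fits could be attempted in R with fitdistrplus - a minimal sketch, not what I ran above. The fixed size = 16 is borrowed from the Stat::Fit result, and the starting value for prob is an assumption:
# Sketch only (assumptions noted): discrete fits in fitdistrplus
# generally need a starting value via start, and the binomial size
# must be fixed via fix.arg (size = 16 taken from Stat::Fit above).
fitP3 <- fitdist(p3_data, "pois")
fitB3 <- fitdist(p3_data, "binom",
                 fix.arg = list(size = 16),
                 start = list(prob = mean(p3_data) / 16))
gofstat(list(fitP3, fitB3), fitnames = c("poisson", "binomial"))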
By setting the CDF \(F(x)\) of a given distribution equal to a random variable \(R \sim U[0,1]\), I can generate random values from the uniform distribution between \(a\) and \(b\), i.e., \(U[a,b]\).
The CDF for continuous values on \([a, b]\) is given by:
\(F_U(x) = \begin{cases} 0, & x < a \\ \frac{x-a}{b-a}, & a \leq x\leq b \\ 1, & x > b \end{cases}\)
Setting \(F_U(x)=R\) and solving for \(x\) gives an equation I can use to generate random values between \(a\) and \(b\):
\(\frac{x-a}{b-a}=R \rightarrow x-a = R(b-a) \rightarrow x = R(b-a)+a\):
my_uniAB <- function(a, b){
  R <- runif(1)
  return(R*(b-a) + a)
}
my_uniAB(5,6) # generate a random number between 5 and 6
## [1] 5.200175
my_uniAB(10,15) # generate a random number between 10 and 15
## [1] 11.06465
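As a quick check of the generator (an added aside), the empirical mean of many draws should approach the theoretical mean \((a+b)/2\):
mean(replicate(10000, my_uniAB(5, 6)))  # should be close to 5.5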
This is essentially the same problem as problem 4, but with the Weibull distribution.
First I take the CDF:
\(F_W(x) = \begin{cases} 1-e^{-\big(\frac{x}{\lambda}\big)^k}, & x \geq 0 \\ 0, & x < 0 \end{cases}\)
And then I'll set it equal to my uniform random variable \(R \sim U[0,1]\):
\(F_W(x)=R \rightarrow 1-e^{-\big(\frac{x}{\lambda}\big)^k} = R \rightarrow\)
\(e^{-\big(\frac{x}{\lambda}\big)^k} = 1-R \rightarrow -\big(\frac{x}{\lambda}\big)^k = \ln(1-R) \rightarrow\)
\(\therefore x = {\lambda\big({-\ln(1-R)}\big)^\frac{1}{k}}\)
my_weibull <- function(lam, k){
  R <- runif(1)
  output <- lam * (-log(1-R))^(1/k)
  return(output)
}
lam <- 3
k <- 4
my_weibull(lam, k)
## [1] 3.208739
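To spot-check the generator (an added aside), one could compare a large sample against R's built-in rweibull, which uses the same shape/scale parameterization, via a QQ plot:
mine <- replicate(5000, my_weibull(lam, k))
theirs <- rweibull(5000, shape = k, scale = lam)
qqplot(mine, theirs, main = "my_weibull vs rweibull")  # points should hug the diagonal
abline(0, 1)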
To generate random values from a specific, discrete set of values with given probabilities, I'll eventually use R's sample
function, which handles this problem nicely.
First, I define the demand and given problem data:
# generating RVs based on specific set of discrete data:
oat_D <- c(0, 0.5, 1, 1.5, 2, 3, 4, 5, 7.5, 10)
oat_prob <- c(.05, .07, .09, .11, .15, .25, .10, .09, .06, .03)
pea_D <- c(0, .5, 1, 1.5, 2, 3)
pea_prob <- c(.1, .2, .2, .3, .1, .1)
bean_D <- c(0, 1, 3, 4.5)
bean_prob <- c(.2, .4, .3, .1)
bar_D <- c(0, .5, 1, 3.5)
bar_prob <- c(.2, .4, .3, .1)
# via HW3
nm_code <- c(1:4)
produce <- c("oats", "peas", "beans", "barley")
w_price <- c(1.05, 3.17, 1.99, 0.95) # per pound wholesale $
r_price <- c(1.29, 3.76, 2.23, 1.65) # per pound retail $
my_df <- data.frame(produce = c("oats", "peas", "beans", "barley"),
                    w_price = c(1.05, 3.17, 1.99, 0.95),
                    r_price = c(1.29, 3.76, 2.23, 1.65),
                    stringsAsFactors = F)
Here I present the given data (prices in $ per pound):
produce | w_price | r_price |
---|---|---|
oats | 1.05 | 1.29 |
peas | 3.17 | 3.76 |
beans | 1.99 | 2.23 |
barley | 0.95 | 1.65 |
Then I create a thin wrapper around R's sample
function to generate the demand for each product, and a second function that computes the total cost, total revenue, and total profit for \(n\) business days:
get_demand <- function(d, probs){
  # d is a vector of demand values
  # probs corresponds to d
  return(sample(d, size = 1, replace = F, prob = probs))
}
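For instance, a single call draws one day's demand for one product (an added aside; the value varies with the RNG state):
get_demand(bean_D, bean_prob)  # e.g., one day's bean demand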
get_bizday <- function(n, a_df){
  # a_df is df of produce costs and retail prices
  # construct a business day's cost, rev, profit:
  result_container <- NULL
  for(i in 1:n){
    temp <- cbind(a_df,
                  D = c(get_demand(oat_D, oat_prob),
                        get_demand(pea_D, pea_prob),
                        get_demand(bean_D, bean_prob),
                        get_demand(bar_D, bar_prob))) %>%
      mutate(revenue = D * r_price,
             cost = D * w_price,
             profit = D * (r_price - w_price),
             day_num = i)
    biz_results <- data.frame(day = i,
                              t_rev = sum(temp$revenue),
                              t_cost = sum(temp$cost),
                              t_profit = sum(temp$profit))
    result_container <- rbind(result_container, biz_results)
  }
  min_results <- data.frame(t_rev = min(result_container$t_rev),
                            t_cost = min(result_container$t_cost),
                            t_profit = min(result_container$t_profit))
  val_container <- list(raw_data = temp,
                        biz_results = result_container,
                        min_results = min_results)
  return(val_container)
}
For example, here’s the output for 1 business day:
get_bizday(1, my_df)$biz_results
## day t_rev t_cost t_profit
## 1 1 5.285 4.16 1.125
To get the total cost, total revenue, and total profit for 90 days based on the respective discrete demand distributions, I create the data set:
s90 <- get_bizday(90, my_df)$biz_results
Here’s a look at the generated data:
day | t_rev | t_cost | t_profit |
---|---|---|---|
1 | 24.020 | 18.755 | 5.265 |
2 | 26.465 | 21.990 | 4.475 |
3 | 9.395 | 8.030 | 1.365 |
4 | 21.195 | 17.810 | 3.385 |
5 | 16.225 | 12.170 | 4.055 |
6 | 13.980 | 11.675 | 2.305 |
The summary values for daily revenue, costs, and profit over a 90-day season are then:
gg <- s90[, 2:4]
summary(gg)
## t_rev t_cost t_profit
## Min. : 3.530 Min. : 2.535 Min. :0.600
## 1st Qu.: 9.395 1st Qu.: 7.836 1st Qu.:1.542
## Median :13.918 Median :11.515 Median :2.310
## Mean :14.407 Mean :11.871 Mean :2.536
## 3rd Qu.:17.594 3rd Qu.:14.781 3rd Qu.:2.998
## Max. :32.640 Max. :27.290 Max. :6.860
While the total revenue, costs, and profit over a 90-day season are:
my_totals <- gg %>% summarise_all(sum)
t_rev | t_cost | t_profit |
---|---|---|
1296.64 | 1068.43 | 228.21 |
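As a sanity check, the daily means above multiplied by 90 reproduce these totals (e.g., \(14.407 \times 90 \approx 1296.6\)).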
What if we want to know the distribution of the minimum daily profit from a 90-day run, across 100 such runs?
contain <- NULL
for(i in 1:100){
  tmp <- get_bizday(90, my_df)$min_results
  contain <- rbind(contain, tmp)
}
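As an aside, the accumulation loop above could be written more idiomatically with replicate - a sketch that does the same thing:
contain <- do.call(rbind, replicate(100, get_bizday(90, my_df)$min_results,
                                    simplify = FALSE))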
hist(contain$t_profit)
Next, I'll again employ fitdistrplus
to see if I can find a good distribution:
descdist(contain$t_profit, boot = 500, method = "sample", discrete = F)
## summary statistics
## ------
## min: 0 max: 1.115
## median: 0.535
## mean: 0.5582
## sample sd: 0.233515
## sample skewness: -0.1155128
## sample kurtosis: 2.925607
As shown above, a good choice of distribution would be the normal, or even the gamma. Below, I perform the curve-fitting on the minimum profit obtained over 100 90-day runs:
fitg8 <- fitdist(contain$t_profit, "gamma", method = "mme")
fitN8 <- fitdist(contain$t_profit, "norm")
fits_compare_P8 <- list(fitN8, fitg8)
fit_names_P8 <- c("norm", "gamma")
lil_fit_func(fits_compare_P8, fit_names_P8)
## Goodness-of-fit statistics
## norm gamma
## Kolmogorov-Smirnov statistic 0.1004297 0.1004297
## Cramer-von Mises statistic 0.1070331 0.1070331
## Anderson-Darling statistic 0.5885457 0.5885457
##
## Goodness-of-fit criteria
## norm gamma
## Akaike's Information Criterion -3.114077 -3.114077
## Bayesian Information Criterion 2.096263 2.096263
As shown above, either the normal or the gamma distribution would work. I'd probably use the normal, as it's the most common.
References: Vito Ricci's Fitting Distributions with R; Mages' blog post on fitting distributions with R; and the help examples found under ?fitdist.
According to Ricci, there are 4 steps in fitting distributions: