library(RCurl)
library(tidyverse)
library(fitdistrplus)
set.seed(1)
Below, I define my URLs and fetch the data for local analysis:
# All problem datasets were converted to csv and uploaded to Github
# for convenience.
p1_url <- "https://raw.githubusercontent.com/jbrnbrg/Wk8_604_Input_Analysis_Var_Reduction/master/prob_data_06_01.csv"
p2_url <- "https://raw.githubusercontent.com/jbrnbrg/Wk8_604_Input_Analysis_Var_Reduction/master/prob_data_06_02.csv"
p3_url <- "https://raw.githubusercontent.com/jbrnbrg/Wk8_604_Input_Analysis_Var_Reduction/master/prob_data_06_03.csv"
p1 <- read.csv(text = getURL(p1_url), header = F)
p2 <- read.csv(text = getURL(p2_url), header = F)
p3 <- read.csv(text = getURL(p3_url), header = F)
p1_data <- p1$V1
p3_data <- p3$V1
Problem data in csv format: Inter-arrival times (in minutes) of calls into a call center.
Reviewing the histogram below, the Problem.Dataset_06_01.xls
data looks like it could be modeled well by an exponential, Weibull, log-normal, or gamma distribution.
hist(p1_data)
The fitdist
function in the fitdistrplus
library provides a wide variety of tools for fitting distributions and testing the fits. Highly recommended.
I'll use descdist to review different possibilities by employing the Cullen and Frey graph with 500 bootstrapped samples
- the resulting chart plots the kurtosis against the squared skewness of the data to help me narrow down the candidate distributions - but not select one (note that the Weibull isn't shown, but it lies close to the gamma and log-normal).
descdist(p1_data, boot = 500, method = "sample")
## summary statistics
## ------
## min: 2.06 max: 48.65
## median: 9.015
## mean: 11.92048
## sample sd: 9.714686
## sample skewness: 1.725842
## sample kurtosis: 6.307556
After reviewing the plot above, I've settled on comparing three distribution fits on this data: Weibull, gamma, and log-normal. Below, I create a generic helper function around the fitdistrplus
functions that plot and compare the fits:
lil_fit_func <- function(x, fits2try){
  # x is a list of fitdist objects
  # fits2try is a list of names
  cdfcomp(x, legendtext = fits2try)
  denscomp(x, legendtext = fits2try)
  qqcomp(x, legendtext = fits2try)
  ppcomp(x, legendtext = fits2try)
  gofstat(x, fitnames = fits2try)
}
fitW1 <- fitdist(p1_data, "weibull")
fitg1 <- fitdist(p1_data, "gamma")
fitL1 <- fitdist(p1_data, "lnorm")
fits_compare_P1 <- list(fitW1, fitg1, fitL1)
fit_names_P1 <- c("weibull", "gamma", "lognorm")
Comparison of Weibull, Gamma, and log-normal fits on the problem 1 data:
lil_fit_func(fits_compare_P1, fit_names_P1)
## Goodness-of-fit statistics
## weibull gamma lognorm
## Kolmogorov-Smirnov statistic 0.07934941 0.07461236 0.05044292
## Cramer-von Mises statistic 0.05847634 0.05174037 0.01825121
## Anderson-Darling statistic 0.44537170 0.35230811 0.14142579
##
## Goodness-of-fit criteria
## weibull gamma lognorm
## Akaike's Information Criterion 290.3611 288.2076 284.9146
## Bayesian Information Criterion 293.8364 291.6830 288.3900
Generally, the model with the lowest BIC is preferred. Via Wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
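Concretely, for a model with \(k\) estimated parameters, sample size \(n\), and maximized likelihood \(\hat{L}\): \(\text{AIC} = 2k - 2\ln\hat{L}\) and \(\text{BIC} = k\ln(n) - 2\ln\hat{L}\), so for \(n \geq 8\) the BIC penalty is the larger of the two.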
Interpreting the results, I am confident that the log-normal distribution is the appropriate fit:
fitL1$estimate
## meanlog sdlog
## 2.1850979 0.7711947
Then the Simio code would be: Random.Lognormal(2.1850979, 0.7711947)
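As a quick sanity check (an added aside, not part of the original analysis), the fitted log-normal's theoretical mean, \(e^{\mu + \sigma^2/2}\), should land near the sample mean of 11.92 reported by descdist above:
exp(2.1850979 + 0.7711947^2 / 2)  # ~ 11.97, close to the sample mean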
Reviewing the histogram below, the combined Problem.Dataset_06_01.xls
and Problem.Dataset_06_02.xls
data looks like it could be modeled well by an exponential, Weibull, log-normal, or gamma distribution.
# note that since I've gotten more data, I'm going to try and fit a
# single model to the combined data of p1 and p2:
p2_data <- rbind(p2, p1)$V1
hist(p2_data)
I'll again turn to the descdist
function to review the Cullen and Frey graph of kurtosis versus squared skewness.
descdist(p2_data, boot = 500, method = "sample")
## summary statistics
## ------
## min: 2.06 max: 48.65
## median: 9.14
## mean: 11.37753
## sample sd: 8.107968
## sample skewness: 1.953007
## sample kurtosis: 7.823871
After reviewing the combined data from problems 1 and 2 above, I'll try out four distributions: Weibull, gamma, exponential, and log-normal:
fitW2 <- fitdist(p2_data, "weibull")
fitg2 <- fitdist(p2_data, "gamma")
fitE2 <- fitdist(p2_data, "exp")
fitL2 <- fitdist(p2_data, "lnorm")
fits_compare_P2 <- list(fitW2, fitg2, fitE2, fitL2)
fit_names_P2 <- c("weibull", "gamma", "expo", "lognorm")
lil_fit_func(fits_compare_P2, fit_names_P2)
## Goodness-of-fit statistics
## weibull gamma expo lognorm
## Kolmogorov-Smirnov statistic 0.08964109 0.07770454 0.2201968 0.03229403
## Cramer-von Mises statistic 0.21344699 0.12427091 1.1178276 0.01270974
## Anderson-Darling statistic 1.38478074 0.72525507 6.3539379 0.08760997
##
## Goodness-of-fit criteria
## weibull gamma expo lognorm
## Akaike's Information Criterion 588.7543 580.2897 612.8320 572.8128
## Bayesian Information Criterion 593.7315 585.2670 615.3206 577.7901
As shown above, log-normal again seems to be the most appropriate fit.
fitL2$estimate
## meanlog sdlog
## 2.2235537 0.6395984
Then the Simio code would be: Random.Lognormal(2.2235537, 0.6395984)
Again, I review the data:
hist(p3_data, freq = F)
After reviewing the data directly, I noted that it appears to be discrete.
descdist(p3_data, boot = 500, method = "sample", discrete = T)
## summary statistics
## ------
## min: 1 max: 8
## median: 3
## mean: 2.955556
## sample sd: 1.519584
## sample skewness: 0.9870622
## sample kurtosis: 4.144604
As I had difficulty getting the binomial and Poisson fits to work in R, I resorted to Stat::Fit, per the text, which worked very well:
Stat::Fit for Problem 3: Binomial Fit
Turning to the auto-fit, I see that Stat::Fit selected the same distributions that I had identified from the Cullen and Frey graph:
Stat::Fit AutoFit for Problem 3
And here is the goodness of fit test that shows that Binomial is the most appropriate fit according to the results and my observations:
Stat::Fit Goodness-of-Fit Problem 3
Per the Stat::Fit program, the Simio code for the binomial is: Random.Binomial(0.185, 16.)
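For completeness, here is one way these discrete fits could be attempted in R with fitdistrplus - a minimal sketch, not what I ran above. The fixed size = 16 is borrowed from the Stat::Fit result, and the starting value for prob is an assumption:
# Sketch only (assumptions noted): discrete fits in fitdistrplus
# generally need a starting value via start, and the binomial size
# must be fixed via fix.arg (size = 16 taken from Stat::Fit above).
fitP3 <- fitdist(p3_data, "pois")
fitB3 <- fitdist(p3_data, "binom",
                 fix.arg = list(size = 16),
                 start = list(prob = mean(p3_data) / 16))
gofstat(list(fitP3, fitB3), fitnames = c("poisson", "binomial"))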
By setting the CDF \(F(x)\) of a given distribution equal to a random variable \(R \sim U[0,1]\), I can generate random values from the uniform distribution between \(a\) and \(b\), i.e., \(U[a,b]\).
The CDF for continuous values on \([a, b]\) is given by:
\(F_U(x) = \begin{cases} 0, & x < a \\ \frac{x-a}{b-a}, & a \leq x\leq b \\ 1, & x > b \end{cases}\)
Setting \(F_U(x)=R\) and solving for \(x\) gives an equation I can use to generate random values between \(a\) and \(b\):
\(\frac{x-a}{b-a}=R \rightarrow x-a = R(b-a) \rightarrow x = R(b-a)+a\):
my_uniAB <- function(a, b){
  R <- runif(1)
  return(R*(b-a) + a)
}
my_uniAB(5,6) # generate a random number between 5 and 6
## [1] 5.200175
my_uniAB(10,15) # generate a random number between 10 and 15
## [1] 11.06465
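As a quick check of the generator (an added aside), the empirical mean of many draws should approach the theoretical mean \((a+b)/2\):
mean(replicate(10000, my_uniAB(5, 6)))  # should be close to 5.5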
This is essentially the same problem as problem 4, but with the Weibull distribution.
First I take the CDF:
\(F_W(x) = \begin{cases} 1-e^{-\big(\frac{x}{\lambda}\big)^k}, & x \geq 0 \\ 0, & x < 0 \end{cases}\)
And then I'll set it equal to my uniform random variable \(R \sim U[0,1]\):
\(F_W(x)=R \rightarrow 1-e^{-\big(\frac{x}{\lambda}\big)^k} = R \rightarrow\)
\(e^{-\big(\frac{x}{\lambda}\big)^k} = 1-R \rightarrow -\big(\frac{x}{\lambda}\big)^k = \ln(1-R) \rightarrow\)
\(\therefore x = {\lambda\big({-\ln(1-R)}\big)^\frac{1}{k}}\)
my_weibull <- function(lam, k){
  R <- runif(1)
  output <- lam * (-log(1-R))^(1/k)
  return(output)
}
lam <- 3
k <- 4
my_weibull(lam, k)
## [1] 3.208739
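To spot-check the generator (an added aside), one could compare a large sample against R's built-in rweibull, which uses the same shape/scale parameterization, via a QQ plot:
mine <- replicate(5000, my_weibull(lam, k))
theirs <- rweibull(5000, shape = k, scale = lam)
qqplot(mine, theirs, main = "my_weibull vs rweibull")  # points should hug the diagonal
abline(0, 1)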
To generate random values from a specific, discrete set of values with given probabilities, I'll eventually use R's sample
function, which handles this problem nicely.
First, I define the demand and given problem data:
# generating RVs based on specific set of discrete data:
oat_D <- c(0, 0.5, 1, 1.5, 2, 3, 4, 5, 7.5, 10)
oat_prob <- c(.05, .07, .09, .11, .15, .25, .10, .09, .06, .03)
pea_D <- c(0, .5, 1, 1.5, 2, 3)
pea_prob <- c(.1, .2, .2, .3, .1, .1)
bean_D <- c(0, 1, 3, 4.5)
bean_prob <- c(.2, .4, .3, .1)
bar_D <- c(0, .5, 1, 3.5)
bar_prob <- c(.2, .4, .3, .1)
# via HW3
nm_code <- c(1:4)
produce <- c("oats", "peas", "beans", "barley")
w_price <- c(1.05, 3.17, 1.99, 0.95) # per pound wholesale $
r_price <- c(1.29, 3.76, 2.23, 1.65) # per pound retail $
my_df <- data.frame(produce = c("oats", "peas", "beans", "barley"),
                    w_price = c(1.05, 3.17, 1.99, 0.95),
                    r_price = c(1.29, 3.76, 2.23, 1.65),
                    stringsAsFactors = F)
Here I present the given data (prices in $ per pound):
produce | w_price | r_price |
---|---|---|
oats | 1.05 | 1.29 |
peas | 3.17 | 3.76 |
beans | 1.99 | 2.23 |
barley | 0.95 | 1.65 |
Then I create a thin wrapper around R's sample
function to generate the demand for each product, and a second function that computes the total cost, total revenue, and total profit for \(n\) business days:
get_demand <- function(d, probs){
  # d is a vector of demand values
  # probs corresponds to d
  return(sample(d, size = 1, replace = F, prob = probs))
}
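For instance, a single call draws one day's demand for one product (an added aside; the value varies with the RNG state):
get_demand(bean_D, bean_prob)  # e.g., one day's bean demand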
get_bizday <- function(n, a_df){
  # a_df is df of produce costs and retail prices
  # construct a business day's cost, rev, profit:
  result_container <- NULL
  for(i in 1:n){
    temp <- cbind(a_df,
                  D = c(get_demand(oat_D, oat_prob),
                        get_demand(pea_D, pea_prob),
                        get_demand(bean_D, bean_prob),
                        get_demand(bar_D, bar_prob))) %>%
      mutate(revenue = D * r_price,
             cost = D * w_price,
             profit = D * (r_price - w_price),
             day_num = i)
    biz_results <- data.frame(day = i,
                              t_rev = sum(temp$revenue),
                              t_cost = sum(temp$cost),
                              t_profit = sum(temp$profit))
    result_container <- rbind(result_container, biz_results)
  }
  min_results <- data.frame(t_rev = min(result_container$t_rev),
                            t_cost = min(result_container$t_cost),
                            t_profit = min(result_container$t_profit))
  val_container <- list(raw_data = temp,
                        biz_results = result_container,
                        min_results = min_results)
  return(val_container)
}
For example, here’s the output for 1 business day:
get_bizday(1, my_df)$biz_results
## day t_rev t_cost t_profit
## 1 1 5.285 4.16 1.125
To get the total cost, total revenue, and total profit for 90 days based on the respective discrete demand distributions, I create the data set:
s90 <- get_bizday(90, my_df)$biz_results
Here’s a look at the generated data:
day | t_rev | t_cost | t_profit |
---|---|---|---|
1 | 24.020 | 18.755 | 5.265 |
2 | 26.465 | 21.990 | 4.475 |
3 | 9.395 | 8.030 | 1.365 |
4 | 21.195 | 17.810 | 3.385 |
5 | 16.225 | 12.170 | 4.055 |
6 | 13.980 | 11.675 | 2.305 |
The summary values for daily revenue, costs, and profit over a 90-day season are then:
gg <- s90[, 2:4]
summary(gg)
## t_rev t_cost t_profit
## Min. : 3.530 Min. : 2.535 Min. :0.600
## 1st Qu.: 9.395 1st Qu.: 7.836 1st Qu.:1.542
## Median :13.918 Median :11.515 Median :2.310
## Mean :14.407 Mean :11.871 Mean :2.536
## 3rd Qu.:17.594 3rd Qu.:14.781 3rd Qu.:2.998
## Max. :32.640 Max. :27.290 Max. :6.860
While the total revenue, costs, and profit over a 90-day season are:
my_totals <- gg %>% summarise_all(sum)
t_rev | t_cost | t_profit |
---|---|---|
1296.64 | 1068.43 | 228.21 |
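As a sanity check, the daily means above multiplied by 90 reproduce these totals (e.g., \(14.407 \times 90 \approx 1296.6\)).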
What if we want to know the distribution of the minimum daily profit from a 90-day run, across 100 such runs?
contain <- NULL
for(i in 1:100){
  tmp <- get_bizday(90, my_df)$min_results
  contain <- rbind(contain, tmp)
}
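As an aside, the accumulation loop above could be written more idiomatically with replicate - a sketch that does the same thing:
contain <- do.call(rbind, replicate(100, get_bizday(90, my_df)$min_results,
                                    simplify = FALSE))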
hist(contain$t_profit)
Next, I'll again employ fitdistrplus
to see if I can find a good distribution:
descdist(contain$t_profit, boot = 500, method = "sample", discrete = F)
## summary statistics
## ------
## min: 0 max: 1.115
## median: 0.535
## mean: 0.5582
## sample sd: 0.233515
## sample skewness: -0.1155128
## sample kurtosis: 2.925607
As shown above, a good choice of distribution would be the normal, or even the gamma. Below, I perform the curve-fitting on the minimum profit obtained over 100 90-day runs:
fitg8 <- fitdist(contain$t_profit, "gamma", method = "mme")
fitN8 <- fitdist(contain$t_profit, "norm")
fits_compare_P8 <- list(fitN8, fitg8)
fit_names_P8 <- c("norm", "gamma")
lil_fit_func(fits_compare_P8, fit_names_P8)
## Goodness-of-fit statistics
## norm gamma
## Kolmogorov-Smirnov statistic 0.1004297 0.1004297
## Cramer-von Mises statistic 0.1070331 0.1070331
## Anderson-Darling statistic 0.5885457 0.5885457
##
## Goodness-of-fit criteria
## norm gamma
## Akaike's Information Criterion -3.114077 -3.114077
## Bayesian Information Criterion 2.096263 2.096263
As shown above, either the normal or the gamma distribution would work. I'd probably use the normal, as it's the most common.
References: Vito Ricci's Fitting Distributions with R; Mages' blog post on fitting distributions with R; and the help examples found under ?fitdist.
According to Ricci, there are 4 steps in fitting distributions: