NOTE: All files were converted to .csv for easier manipulaiton in R.
Using the file Problem_Dataset_06.01.xls, fit the best model(s)
p1 <- read.csv(Prob1_dir, header = FALSE)
plotdist(p1$V1, histo = TRUE, demp = TRUE)
From the empirical density above, our distribution is right skewed and appears to be an exponential type of distribution. The Cullen and Frey Graph below is a good way to exempt some distributions by the parameters of skewness and kurtosis using the descdist function; the bootstrapped values are from random samples (with replacement) of the data. From this Cullen and Frey Graph and the empirical graphs above, our choices for good fits would seem to be limited to the available distributions in the fitdistrplus package:
descdist(p1$V1, boot = 500)
## summary statistics
## ------
## min: 2.06 max: 48.65
## median: 9.015
## mean: 11.92048
## estimated sd: 9.832445
## estimated skewness: 1.790431
## estimated kurtosis: 6.895655
These 3 aforementioned distributions are fit against 4 classical fit parameters, most important being the density and CDF plot.
fw <- fitdist(p1$V1, "weibull")
fg <- fitdist(p1$V1, "gamma")
fe <- fitdist(p1$V1, "exp")
par(mfrow = c(2, 2))
plot.legend <- c("Weibull", "gamma", "expo")
denscomp(list(fw, fg, fe), legendtext = plot.legend)
qqcomp(list(fw, fg, fe), legendtext = plot.legend)
cdfcomp(list(fw, fg, fe), legendtext = plot.legend)
ppcomp(list(fw, fg, fe), legendtext = plot.legend)
From the plotted fitting metrics above, it appears that Weibull and gamma are the best contenders, but it seems a little unclear who is the clear winner form the density plot. Let’s take a closer look with a larger density plot for these two distributions.
denscomp(list(fw, fg), legendtext = c("Weibull", "gamma"))
It seems that the gamma distribution fits this data the best. Let us confirm this against the Akaline’s and Bayesian Information Criterion (AIC and BIC), which will give a sort of rank to the goodness of fit models passed to it using the gofstat function as well as the Goodness-of-fit statistics, which give distances between the fits and the empirical data.
gofstat(list(fw, fg))
## Goodness-of-fit statistics
## 1-mle-weibull 2-mle-gamma
## Kolmogorov-Smirnov statistic 0.07934941 0.07461236
## Cramer-von Mises statistic 0.05847634 0.05174037
## Anderson-Darling statistic 0.44537170 0.35230811
##
## Goodness-of-fit criteria
## 1-mle-weibull 2-mle-gamma
## Akaike's Information Criterion 290.3611 288.2076
## Bayesian Information Criterion 293.8364 291.6830
Since the gamma distribution has the min AIC, BIC, and minimum goodness-of-fit statistics, we will choose the gamma distribution.
The coefficients for the gamma distribution are 1.8546097, 0.1555721, the shape and rate respectively. The Simio function to model this distribution would be Random.Gamma(1.8546097, 0.1555721).
Using the file Problem_Dataset_06.1.xls, fit the best model(s)
A similar process will be used as that outlined above. The explanations will be kept to a minimum due to the same thought processes used previously.
p2 <- read.csv(Prob2_dir, header = FALSE)
plotdist(p2$V1, histo = TRUE, demp = TRUE)
descdist(p2$V1, boot = 500)
## summary statistics
## ------
## min: 3.8 max: 36.13
## median: 9.14
## mean: 10.89234
## estimated sd: 6.364957
## estimated skewness: 1.97635
## estimated kurtosis: 7.933002
fg <- fitdist(p2$V1, "gamma")
fln <- fitdist(p2$V1, "lnorm")
fl <- fitdist(p2$V1, "logis")
f_list <- list(fg, fln, fl)
par(mfrow = c(2, 2))
plot.legend <- sapply(c(1:length(f_list)), function(x) f_list[[x]]$distname)
denscomp(f_list, legendtext = plot.legend)
qqcomp(f_list, legendtext = plot.legend)
cdfcomp(f_list, legendtext = plot.legend)
ppcomp(f_list, legendtext = plot.legend)
Gamma and lnorm seem to be the best contenders
denscomp(list(fg, fln), legendtext = c("gamma", "lognorm"))
gofstat(list(fg, fln))
## Goodness-of-fit statistics
## 1-mle-gamma 2-mle-lnorm
## Kolmogorov-Smirnov statistic 0.09941791 0.06764782
## Cramer-von Mises statistic 0.11444195 0.03973639
## Anderson-Darling statistic 0.71667740 0.26739444
##
## Goodness-of-fit criteria
## 1-mle-gamma 2-mle-lnorm
## Akaike's Information Criterion 288.3562 282.6829
## Bayesian Information Criterion 292.0565 286.3832
Choose the lnorm distribution. From the help section in Simio, the normal mean and normal standard deviations are needed as parameters. We can get these from the summary statistics from the descdist output above of the non-transformed data. The function in Simio would therefore take the function Random.Lognormal(10.89234, 6.364957).
Using the file Problem_Dataset_06.3.xls, fit the best model(s)
A similar process will be used as that outlined above. The data is discrete, however, and new distributions will be under scrutiny due to the different type of distributions available for this data type. The explanations will be kept to a minimum due to the same thought processes used previously.
p3 <- read.csv(Prob3_dir, header = FALSE)
plotdist(p3$V1, histo = TRUE, demp = TRUE)
descdist(p3$V1, discrete = TRUE, boot = 500)
## summary statistics
## ------
## min: 1 max: 8
## median: 3
## mean: 2.955556
## estimated sd: 1.536755
## estimated skewness: 1.02143
## estimated kurtosis: 4.428947
CDFs have the most meaning for visualizing discrete fits. The negative binomial and Poisson are under scrutiny.
fnb <- fitdist(p3$V1, distr = "nbinom", discrete = TRUE)
fp <- fitdist(p3$V1, distr = "pois", discrete = TRUE)
f_list <- list(fnb, fp)
plot.legend <- sapply(c(1:length(f_list)), function(x) f_list[[x]]$distname)
cdfcomp(f_list, legendtext = plot.legend)
The negative binomial and Poisson are indistinguishable from each other in regards to the empirical CDF.
gofstat(f_list)
## Chi-squared statistic: 1.919848 1.920083
## Degree of freedom of the Chi-squared distribution: 2 3
## Chi-squared p-value: 0.3829221 0.5891583
## Chi-squared table:
## obscounts theo 1-mle-nbinom theo 2-mle-pois
## <= 1 7 9.265752 9.264853
## <= 2 13 10.230512 10.230086
## <= 3 11 10.078506 10.078530
## <= 4 8 7.446571 7.446914
## > 4 6 7.978659 7.979617
##
## Goodness-of-fit criteria
## 1-mle-nbinom 2-mle-pois
## Akaike's Information Criterion 166.2799 164.2799
## Bayesian Information Criterion 169.8932 166.0866
Since the Chi-squared stat is so close between the two, both distributions could be modeled.
Modelling the Negative Binomial Simio wants the variables of probabilityOfSuccess and numberOfSuccesses to model the distribution for the negative binomial. The fitdist function return coefficients size and \(\mu\).
fnb
## Fitting of the distribution ' nbinom ' by maximum likelihood
## Parameters:
## estimate Std. Error
## size 4.136931e+06 70.798134
## mu 2.955426e+00 0.256268
According to R, probability of success \(p\) is = \(\frac{size}{size+\mu} = 0.9999993\) and size is the number of successes. Therefore the Simio function will take the form Random.NegativeBinomial(0.9999993, 4136931)
Modelling Poisson Simio wants the variable of mean to model the distribution for Poisson. The fitdist function return the coefficient \(\lambda\).
fp
## Fitting of the distribution ' pois ' by maximum likelihood
## Parameters:
## estimate Std. Error
## lambda 2.955556 0.2562791
According to R, the expected value \(E(x)\) is \(\lambda\). Therefore the Simio function will take the form Random.Poisson(2.9555556)
Derive the inverse-CDF formula for generating random variables from a continuous uniform distribution between real numbers \(a\) and \(b\) \((a < b)\)
The CDF will have the form:
\[F_X(x) =\begin{cases}0 & x < a \\ \frac{x-a}{b-a} & a \leq x \leq b \\1 & x > b\end{cases} \\ \therefore F_{X}^{-1}(U) = U(b-a)+ a \space for \space 0 \leq U \leq 1\]
Derive the inverse-CDF formula for generating random variables from a Weibull distribution
The expected value \(E(x)\) of the Weibull distribution is given by:
\[E(x)=\frac{\eta}{\beta}\times\Gamma\left ( \frac{1}{\beta}+1 \right )\]
Where:
NOTE: In Simio, there is no + 1 in the Gamma function. Everywhere else I looked for \(E(x)\), it was in there. This may be due to a mistake in the Simio documentation. In addition, there is an additional parameter \(\gamma\), which were in all the funcitons. I saw a note in the mean documentation on realwiki that it could be dropped (I believe) if one would only want a 2-parameter Weibull. The following CDF drops this variable \(\gamma\).
The CDF of a Weibull distribution is:
\[F_X(x) =\begin{cases}1-e^{-(x/\eta)^{\beta}} & x \geq 0\\ 0 & x <0\end{cases}\\ \rightarrow U = 1-e^{-(x/\eta)^{\beta}} \rightarrow ln(1-U)=ln(e^{-(x/\eta)^{\beta}}) \rightarrow -ln(1-U)=(x/\eta)^{\beta} \rightarrow \eta(-ln(1-U))^{1/\beta} = x\\ \therefore F_{X}^{-1}(U) = \eta(-ln(1-U))^{1/\beta} \space\space for \space -ln(1-U) \geq 0\]
It is important to note that the natural log is undefined for \(U = 1\); it will produce \(\infty\), which is still in the range of the original inequality of \(x \geq0\). The formula was not reduced any further so it holds the structure of the original CDF inequality for \(x \geq0\)
Excel was used to model the sale of goods simulation. The .xls is attached with the assignment. Using the new demand PDF the theoretical and simulation mean and stand deviation were compared using 90 days of sales. The error in comparing the simulated to theoretical seems within a reasonable margin.
theory_mu <- c(2.94, 1.25, 1.75, 0.85)
sim_mu <- c(3.0944, 1.2611, 1.9778, 0.8278)
mu_error <- theory_mu - sim_mu
theory_sd <- c(2.1878, 0.8139, 1.4361, 0.95)
sim_sd <- c(2.2221, 0.8907, 1.5523, 0.9633)
sd_error <- theory_sd-sim_sd
good_names <- c("oats", "peas", "beans", "barley")
(data.frame(theory_mu,sim_mu, mu_error, theory_sd, sim_sd, sd_error, row.names = good_names))
## theory_mu sim_mu mu_error theory_sd sim_sd sd_error
## oats 2.94 3.0944 -0.1544 2.1878 2.2221 -0.0343
## peas 1.25 1.2611 -0.0111 0.8139 0.8907 -0.0768
## beans 1.75 1.9778 -0.2278 1.4361 1.5523 -0.1162
## barley 0.85 0.8278 0.0222 0.9500 0.9633 -0.0133