Data 604 - Week 8

6.5 Problems

NOTE: All files were converted to .csv for easier manipulaiton in R.

Problem 1

Using the file Problem_Dataset_06.01.xls, fit the best model(s)

p1 <- read.csv(Prob1_dir, header = FALSE)

plotdist(p1$V1, histo = TRUE, demp = TRUE)

From the empirical density above, our distribution is right skewed and appears to be an exponential type of distribution. The Cullen and Frey Graph below is a good way to exempt some distributions by the parameters of skewness and kurtosis using the descdist function; the bootstrapped values are from random samples (with replacement) of the data. From this Cullen and Frey Graph and the empirical graphs above, our choices for good fits would seem to be limited to the available distributions in the fitdistrplus package:

Weibull
gamma
exponential

descdist(p1$V1, boot = 500)

## summary statistics
## ------
## min:  2.06   max:  48.65 
## median:  9.015 
## mean:  11.92048 
## estimated sd:  9.832445 
## estimated skewness:  1.790431 
## estimated kurtosis:  6.895655

These 3 aforementioned distributions are fit against 4 classical fit parameters, most important being the density and CDF plot.

fw <- fitdist(p1$V1, "weibull")
fg <- fitdist(p1$V1, "gamma")
fe <- fitdist(p1$V1, "exp")
par(mfrow = c(2, 2))
plot.legend <- c("Weibull", "gamma", "expo")
denscomp(list(fw, fg, fe), legendtext = plot.legend)
qqcomp(list(fw, fg, fe), legendtext = plot.legend)
cdfcomp(list(fw, fg, fe), legendtext = plot.legend)
ppcomp(list(fw, fg, fe), legendtext = plot.legend)

From the plotted fitting metrics above, it appears that Weibull and gamma are the best contenders, but it seems a little unclear who is the clear winner form the density plot. Let’s take a closer look with a larger density plot for these two distributions.

denscomp(list(fw, fg), legendtext = c("Weibull", "gamma"))

It seems that the gamma distribution fits this data the best. Let us confirm this against the Akaline’s and Bayesian Information Criterion (AIC and BIC), which will give a sort of rank to the goodness of fit models passed to it using the gofstat function as well as the Goodness-of-fit statistics, which give distances between the fits and the empirical data.

gofstat(list(fw, fg))

## Goodness-of-fit statistics
##                              1-mle-weibull 2-mle-gamma
## Kolmogorov-Smirnov statistic    0.07934941  0.07461236
## Cramer-von Mises statistic      0.05847634  0.05174037
## Anderson-Darling statistic      0.44537170  0.35230811
## 
## Goodness-of-fit criteria
##                                1-mle-weibull 2-mle-gamma
## Akaike's Information Criterion      290.3611    288.2076
## Bayesian Information Criterion      293.8364    291.6830

Since the gamma distribution has the min AIC, BIC, and minimum goodness-of-fit statistics, we will choose the gamma distribution.

The coefficients for the gamma distribution are 1.8546097, 0.1555721, the shape and rate respectively. The Simio function to model this distribution would be Random.Gamma(1.8546097, 0.1555721).

Problem 2

Using the file Problem_Dataset_06.1.xls, fit the best model(s)

A similar process will be used as that outlined above. The explanations will be kept to a minimum due to the same thought processes used previously.

p2 <- read.csv(Prob2_dir, header = FALSE)

plotdist(p2$V1, histo = TRUE, demp = TRUE)

descdist(p2$V1, boot = 500)

## summary statistics
## ------
## min:  3.8   max:  36.13 
## median:  9.14 
## mean:  10.89234 
## estimated sd:  6.364957 
## estimated skewness:  1.97635 
## estimated kurtosis:  7.933002

fg <- fitdist(p2$V1, "gamma")
fln <- fitdist(p2$V1, "lnorm")
fl <- fitdist(p2$V1, "logis")
f_list <- list(fg, fln, fl)
par(mfrow = c(2, 2))
plot.legend <- sapply(c(1:length(f_list)), function(x) f_list[[x]]$distname)
denscomp(f_list, legendtext = plot.legend)
qqcomp(f_list, legendtext = plot.legend)
cdfcomp(f_list, legendtext = plot.legend)
ppcomp(f_list, legendtext = plot.legend)

Gamma and lnorm seem to be the best contenders

denscomp(list(fg, fln), legendtext = c("gamma", "lognorm"))

gofstat(list(fg, fln))

## Goodness-of-fit statistics
##                              1-mle-gamma 2-mle-lnorm
## Kolmogorov-Smirnov statistic  0.09941791  0.06764782
## Cramer-von Mises statistic    0.11444195  0.03973639
## Anderson-Darling statistic    0.71667740  0.26739444
## 
## Goodness-of-fit criteria
##                                1-mle-gamma 2-mle-lnorm
## Akaike's Information Criterion    288.3562    282.6829
## Bayesian Information Criterion    292.0565    286.3832

Choose the lnorm distribution. From the help section in Simio, the normal mean and normal standard deviations are needed as parameters. We can get these from the summary statistics from the descdist output above of the non-transformed data. The function in Simio would therefore take the function Random.Lognormal(10.89234, 6.364957).

Problem 3

Using the file Problem_Dataset_06.3.xls, fit the best model(s)

A similar process will be used as that outlined above. The data is discrete, however, and new distributions will be under scrutiny due to the different type of distributions available for this data type. The explanations will be kept to a minimum due to the same thought processes used previously.

p3 <- read.csv(Prob3_dir, header = FALSE)

plotdist(p3$V1, histo = TRUE, demp = TRUE)

descdist(p3$V1, discrete = TRUE, boot = 500)

## summary statistics
## ------
## min:  1   max:  8 
## median:  3 
## mean:  2.955556 
## estimated sd:  1.536755 
## estimated skewness:  1.02143 
## estimated kurtosis:  4.428947

CDFs have the most meaning for visualizing discrete fits. The negative binomial and Poisson are under scrutiny.

fnb <- fitdist(p3$V1, distr = "nbinom", discrete = TRUE)
fp <- fitdist(p3$V1, distr = "pois", discrete = TRUE)
f_list <- list(fnb, fp)
plot.legend <- sapply(c(1:length(f_list)), function(x) f_list[[x]]$distname)
cdfcomp(f_list, legendtext = plot.legend)

The negative binomial and Poisson are indistinguishable from each other in regards to the empirical CDF.

gofstat(f_list)

## Chi-squared statistic:  1.919848 1.920083 
## Degree of freedom of the Chi-squared distribution:  2 3 
## Chi-squared p-value:  0.3829221 0.5891583 
## Chi-squared table:
##      obscounts theo 1-mle-nbinom theo 2-mle-pois
## <= 1         7          9.265752        9.264853
## <= 2        13         10.230512       10.230086
## <= 3        11         10.078506       10.078530
## <= 4         8          7.446571        7.446914
## > 4          6          7.978659        7.979617
## 
## Goodness-of-fit criteria
##                                1-mle-nbinom 2-mle-pois
## Akaike's Information Criterion     166.2799   164.2799
## Bayesian Information Criterion     169.8932   166.0866

Since the Chi-squared stat is so close between the two, both distributions could be modeled.

Modelling the Negative Binomial Simio wants the variables of probabilityOfSuccess and numberOfSuccesses to model the distribution for the negative binomial. The fitdist function return coefficients size and \(\mu\).

fnb

## Fitting of the distribution ' nbinom ' by maximum likelihood 
## Parameters:
##          estimate Std. Error
## size 4.136931e+06  70.798134
## mu   2.955426e+00   0.256268

According to R, probability of success \(p\) is = \(\frac{size}{size+\mu} = 0.9999993\) and size is the number of successes. Therefore the Simio function will take the form Random.NegativeBinomial(0.9999993, 4136931)

Modelling Poisson Simio wants the variable of mean to model the distribution for Poisson. The fitdist function return the coefficient \(\lambda\).

fp

## Fitting of the distribution ' pois ' by maximum likelihood 
## Parameters:
##        estimate Std. Error
## lambda 2.955556  0.2562791

According to R, the expected value \(E(x)\) is \(\lambda\). Therefore the Simio function will take the form Random.Poisson(2.9555556)

Problem 4

Derive the inverse-CDF formula for generating random variables from a continuous uniform distribution between real numbers \(a\) and \(b\) \((a < b)\)

The CDF will have the form:

\[F_X(x) =\begin{cases}0 & x < a \\ \frac{x-a}{b-a} & a \leq x \leq b \\1 & x > b\end{cases} \\ \therefore F_{X}^{-1}(U) = U(b-a)+ a \space for \space 0 \leq U \leq 1\]

Problem 5

Derive the inverse-CDF formula for generating random variables from a Weibull distribution

The expected value \(E(x)\) of the Weibull distribution is given by:

\[E(x)=\frac{\eta}{\beta}\times\Gamma\left ( \frac{1}{\beta}+1 \right )\]

Where:

\(\eta\): scale parameter
\(\beta\): shape parameter
\(\Gamma\): Gamma function

NOTE: In Simio, there is no + 1 in the Gamma function. Everywhere else I looked for \(E(x)\), it was in there. This may be due to a mistake in the Simio documentation. In addition, there is an additional parameter \(\gamma\), which were in all the funcitons. I saw a note in the mean documentation on realwiki that it could be dropped (I believe) if one would only want a 2-parameter Weibull. The following CDF drops this variable \(\gamma\).

The CDF of a Weibull distribution is:

\[F_X(x) =\begin{cases}1-e^{-(x/\eta)^{\beta}} & x \geq 0\\ 0 & x <0\end{cases}\\ \rightarrow U = 1-e^{-(x/\eta)^{\beta}} \rightarrow ln(1-U)=ln(e^{-(x/\eta)^{\beta}}) \rightarrow -ln(1-U)=(x/\eta)^{\beta} \rightarrow \eta(-ln(1-U))^{1/\beta} = x\\ \therefore F_{X}^{-1}(U) = \eta(-ln(1-U))^{1/\beta} \space\space for \space -ln(1-U) \geq 0\]

It is important to note that the natural log is undefined for \(U = 1\); it will produce \(\infty\), which is still in the range of the original inequality of \(x \geq0\). The formula was not reduced any further so it holds the structure of the original CDF inequality for \(x \geq0\)

Problem 8

Excel was used to model the sale of goods simulation. The .xls is attached with the assignment. Using the new demand PDF the theoretical and simulation mean and stand deviation were compared using 90 days of sales. The error in comparing the simulated to theoretical seems within a reasonable margin.

theory_mu <- c(2.94, 1.25, 1.75, 0.85)
sim_mu <- c(3.0944, 1.2611, 1.9778, 0.8278)
mu_error <- theory_mu - sim_mu
theory_sd <- c(2.1878, 0.8139, 1.4361, 0.95)
sim_sd <- c(2.2221, 0.8907, 1.5523, 0.9633)
sd_error <- theory_sd-sim_sd

good_names <- c("oats", "peas", "beans", "barley")

(data.frame(theory_mu,sim_mu, mu_error, theory_sd, sim_sd, sd_error, row.names = good_names))

##        theory_mu sim_mu mu_error theory_sd sim_sd sd_error
## oats        2.94 3.0944  -0.1544    2.1878 2.2221  -0.0343
## peas        1.25 1.2611  -0.0111    0.8139 0.8907  -0.0768
## beans       1.75 1.9778  -0.2278    1.4361 1.5523  -0.1162
## barley      0.85 0.8278   0.0222    0.9500 0.9633  -0.0133