library(tidyverse)
L <- \(lam) {
y <- c(6,8,10)
likelihood <- exp(-3*lam)*lam^sum(y)/prod(factorial(y))
return(likelihood)
}
L(7)[1] 0.001378964
[1] 0.001423158
R.A. Fisher
Image source: University of Adelaide
[Pearson’s “Method of Moments”] is without question of great practical utility, [wherein] different forms of frequency curves are fitted by calculating as many moments of the sample as there are parameters to be evaluated. The parameters chosen are those of an infinite population of the specified type having the same moments as those calculated from the sample.
[F]or that class of distribution to which the method can be applied, it has not been shown, except in the case of the normal curve, that the best values will be obtained by the method of moments. The method will, in these cases, certainly be serviceable in yielding an approximation, but to discover whether this approximation is a good or a bad one, and to improve it, if necessary, a more adequate criterion is required.
\[L(\theta) = p(y_1,y_2,...,y_n;\theta) = \prod_{i=1}^n p(y_i;\theta)\]
\[L(\theta) = f(y_1,y_2,...,y_n;\theta) = \prod_{i=1}^n f_Y(y_i;\theta)\]
\[\hat\theta_{MLE} = \arg\!\max_{\theta}L(\theta)\]
\[L(\lambda) = p(6;\lambda)p(8;\lambda)p(10;\lambda) = \frac{e^{-\lambda}\lambda^6}{6!}\times \frac{e^{-\lambda}\lambda^8}{8!} \times \frac{e^{-\lambda}\lambda^{10}}{10!} = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]
\[L(7) = \frac{e^{-3\cdot 7}7^{24}}{6!8!10!} = 0.00138\]
\[L(9) = \frac{e^{-3\cdot 9}9^{24}}{6!8!10!} = 0.00142\]
Of course, rather than just evaluating \(L(\lambda)\) for two values we could consider it over all possible values:
Where does this likelihood have its maximum?
\[L(\lambda) = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]
\[\log(a\cdot b\cdot c) = \log(a)+\log(b)+\log(c)\] \[\log(a^b) = b\log(a)\]
\[\ln(e^x) = x \Leftarrow \mbox{ simplifies a lot of well-known pdfs!}\]
Using calculus to find the \(\lambda\) that maximizes
\[L(\lambda) = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]
\[\ln(L(\lambda)) = \ln(e^{-3\lambda}) + \ln(\lambda^{24}) -\ln(6!8!10!)=-3\lambda + 24 \ln (\lambda) -\ln(6!8!10!)\]
\[\frac{d}{d\lambda}\ln(L(\lambda)) = -3+\frac{24}{\lambda}\stackrel{set}{=}0 \Rightarrow \hat\lambda_{MLE} = 8\]
\[L(\lambda) = \prod_{i=1}^n p(y_i;\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} = \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^ny_i}}{\prod_{i=1}^ny_i!}\]
\[\ln(L(\lambda)) = \ln\left(e^{-n\lambda}\right) + \ln\left(\lambda^{\sum_{i=1}^ny_i}\right)-\ln\left(\prod_{i=1}^ny_i!\right)\]
\[ = -n\lambda + \sum_{i=1}^ny_i\ln(\lambda)-\ln\left(\prod_{i=1}^ny_i!\right)\]
\[\frac{d}{d\lambda}\ln(L(\lambda)) = -n+ \frac{\sum_{i=1}^ny_i}{\lambda} \stackrel{set}{=}0\]
\[\Rightarrow \hat\lambda_{MLE} = \frac{\sum_{i=1}^ny_i}{n}\]
optimize() to perform single-parameter optimization. Arguments:
f: name of your log-likelihood function...: additional arguments to your log-likelihood function, most likely a sample vectorinterval: interval over which to search for the maximummaximum=TRUE: by default, optimize() minimizes. Toggle this to maximize.Writing the log-likelihood function:
Simulating a sample:
What should the MLE be?
Using optimize() to find MLE:
\[L(\lambda) = \prod_{i=1}^n f(y_i;\lambda) = \prod_{i=1}^n \lambda e^{-\lambda y_i}= \lambda^n e^{-\lambda\sum_{i=1}^ny_i}\]
\[\ln(L(\lambda)) = \ln\left(\lambda^n\right) + \ln\left(e^{-\lambda\sum_{i=1}^ny_i}\right)\]
\[ = n \ln(\lambda) - \lambda\sum_{i=1}^ny_i\]
\[\frac{d}{d\lambda}\ln(L(\lambda)) = \frac{n}{\lambda}- \sum_{i=1}^ny_i \stackrel{set}{=}0\]
\[\Rightarrow \hat\lambda_{MLE} = \frac{n}{\sum_{i=1}^ny_i}\]
\[L(\beta) = \prod_{i=1}^n f(y_i;\beta) = \prod_{i=1}^n \frac{1}{\beta} e^{-y_i/\beta}= \beta^{-n} e^{-\sum_{i=1}^ny_i/\beta}\]
\[\ln(L(\beta)) = -n \ln(\beta) - \frac{\sum_{i=1}^ny_i}{\beta}\]
\[\frac{d}{d\beta}\ln(L(\beta)) = -\frac{n}{\beta}+ \frac{\sum_{i=1}^ny_i}{\beta^2} \stackrel{set}{=}0\]
\[\Rightarrow \hat\beta_{MLE} = \frac{\sum_{i=1}^ny_i}{n}\]
\[\hat\lambda = \frac{1}{\bar Y} \mbox{ is the MLE of } \lambda\]
\[\tau(\hat\lambda) = \frac{1}{\hat\lambda} = \frac{1}{1/\bar Y} \mbox{ is the MLE of } \frac{1}{\lambda} = \tau(\lambda) = \beta\]
Rote approach:
\[L(\theta) = \prod_{i=1}^n f(y_i;\theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n}\]
\[\ln(L(\theta)) = -n\ln(\theta)\]
\[\frac{d}{d\theta}\ln(L(\theta)) = -\frac{n}{\theta}\stackrel{set}{=}0 \Rightarrow \hat\theta_{MLE} = -\frac{n}{0} = -\infty...?\]
\[L(\theta) = \prod_{i=1}^n f(y_i;\theta) = \prod_{i=1}^n \frac{1}{\theta} 1(y_i >0) 1(y_i < \theta)\]
\[= \frac{1}{\theta^n}\cdot 1(y_{(1)}>0)\cdot 1(y_{(n)}< \theta)\]
\[L(\theta) = \frac{1}{\theta^n}\cdot 1(\theta>y_{(n)})\]
\[L(y_{(n)}) = y_{(n)}^{-n}\]
\[\therefore \hat\theta_{MLE} = Y_{(n)}\]
Make sure to sketch the likelihood function and identify its nonzero form (monotone increasing? decreasing?) and where it achieves its maximum when you have non-concave likelihoods (when \(\theta\) parameterizes the support)
\[L(\mu,\sigma^2) = \prod_{i=1}^n f(y_i;\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(y_i-\mu)^2/2\sigma^2}= (2\pi\sigma^2)^{-n/2} e^{-\sum_{i=1}^n(y_i-\mu)^2/2\sigma^2}\]
\[\ln(L(\mu,\sigma^2)) = -\frac{n}{2}\ln(\sigma^2)- \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^2}-\frac{n}{2}\ln(2\pi)\]
\[\mbox{Eq. 1: } \frac{\partial}{\partial \mu}\ln(L(\mu,\sigma^2))=\frac{\sum_{i=1}^n(y_i-\mu)}{\sigma^2} \stackrel{set}{=}0 \]
\[\mbox{Eq. 2: }\frac{\partial}{\partial \sigma^2}\ln(L(\mu,\sigma^2))=-\frac{n}{2\sigma^2}+ \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^4}\stackrel{set}{=}0\]
\[\mbox{Eq. 1: } \frac{\sum_{i=1}^n(y_i-\mu)}{\sigma^2} =0\] \[ \Rightarrow \sum_{i=1}^ny_i-n\mu=0\] \[\Rightarrow n\bar y -n\mu =0\]
\[\Rightarrow \hat\mu_{MLE} = \bar y\]
Plug into:
\[\mbox{Eq. 2: }-\frac{n}{2\sigma^2}+ \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^4}=0\]
\[\Rightarrow \frac{\sum_{i=1}^n(y_i-\hat\mu)^2}{2\sigma^4} = \frac{n}{2\sigma^2}\] \[\Rightarrow \hat\sigma^2_{MLE} = \frac{\sum_{i=1}^n(y_i-\bar y)^2}{n}\]
\[\small L(\alpha,\beta) = \prod_{i=1}^n f(y_i;\alpha,\beta) = \prod_{i=1}^n \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} y_i^{\alpha-1}(1-y_i)^{\beta-1} = \left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)^n\prod_{i=1}^ny_i^{\alpha-1}\prod_{i=1}^n(1-y_i)^{\beta-1}\]
\[\small \ln(L(\alpha,\beta)) = n\ln(\Gamma(\alpha+\beta)) -n\ln(\Gamma(\alpha))-n\ln(\Gamma(\beta)) +(\alpha-1)\sum_{i=1}^n\ln(y_i)+ (\beta-1)\sum_{i=1}^n\ln(1-y_i)\]
\[\small \frac{\partial}{\partial \alpha} \ln(L(\alpha,\beta)) = \frac{n}{\Gamma(\alpha+\beta)}\frac{\partial}{\partial \alpha} \Gamma(\alpha+\beta)-\frac{n}{\Gamma(\alpha)}\frac{\partial}{\partial \alpha} \Gamma(\alpha) + \sum_{i=1}^n\ln(y_i)...?\]
\[\small \frac{\partial}{\partial \beta} \ln(L(\alpha,\beta)) = \frac{n}{\Gamma(\alpha+\beta)}\frac{\partial}{\partial \beta} \Gamma(\alpha+\beta)-\frac{n}{\Gamma(\beta)}\frac{\partial}{\partial \beta} \Gamma(\beta) + \sum_{i=1}^n\ln(1-y_i)...?\]
dbeta(, log=TRUE) to evaluate the log-likelihood contributions from each data point:[1] 0.8993290 0.6549274 0.4466458 0.9780859 0.6196458 0.6009996
alphaseq <- seq(2,8,l=100)
betaseq <- seq(1,6,l=100)
(loglik_grid <- expand.grid(alpha = alphaseq, beta = betaseq)
%>% mutate(y = list(mysamp))
%>% as_tibble()
%>% mutate(loglik = pmap_dbl(list(alpha,beta,y), \(a,b,y)
beta_loglik(c(a,b), y)
)
)
) %>% head()# A tibble: 6 × 4
alpha beta y loglik
<dbl> <dbl> <list> <dbl>
1 2 1 <dbl [200]> 66.8
2 2.06 1 <dbl [200]> 68.4
3 2.12 1 <dbl [200]> 69.9
4 2.18 1 <dbl [200]> 71.1
5 2.24 1 <dbl [200]> 72.3
6 2.30 1 <dbl [200]> 73.2
optim() to find MLEsoptim() is the workhorse R function for multi-dimensional optimization. It does minimization only.par: initial values. Usually we just make a guess. A good idea to try several starting points!fn: the name of the function to be optimized. The first argument must be a VECTOR of parameters over which to optimize....: Additional arguments to the log-likelihood function, including the data.control: a list of arguments to control optim. One important control parameter is fnscale=-1, indicating we want to minimize the negative log-likelihood.optim() for simulated data\[\hat\alpha_{MOM} = \left( \frac{\bar Y_1 - \bar Y_2}{S^2}\right)\bar Y_1\]
\[ \hat\beta_{MOM} = \left( \frac{\bar Y_1 - \bar Y_2}{S^2}\right) (1-\bar Y_1)\]
Method-of-moments estimators:
ggplot()+
geom_histogram(aes(x = mysamp, y = after_stat(density)),
binwidth=0.02,
center=0.01,
fill='grey',
col='black'
) +
geom_function(fun = \(y)
dbeta(y, shape1 = 4.827151, shape2 = 1.876892),
aes(color='Fitted beta with MLEs')
) +
geom_function(fun = \(y)
dbeta(y, shape1 = 4.685153, shape2 = 1.809314),
aes(color='Fitted beta with MoMs')
)+
theme_classic(base_size = 18) +
labs(x='y', color='')Karl Pearson
Image source: Wikipedia
Wasting your time fitting curves by moments, eh?
To my astonishment, [Fisher’s method] depends on first working out the constants of the frequency curve by the Method of Moments, and then superposing on it, by what Fisher terms the “Method of Maximum Likelihood,” a further approximation to obtain, what he holds he will thus get, “more efficient values” of the curve constants. The additional computing work after the determination by moments is as long as, I think longer than, the first fitting by moments. The practical value therefore turns on the gain in accuracy obtained by the increased labour.