1 Question 1: Derive the Gradient of the Likelihood function

We assume that \(x_1, x_2, \dots, x_n\) are i.i.d. from a Gamma distribution with shape parameter \(\alpha > 0\) and scale parameter \(\beta > 0\).

The density function is:

\[ f(x \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha} x^{\alpha-1} e^{-x/\beta}, \quad x > 0. \]


1.1 (a) Full Log-Likelihood Function

First, we write the likelihood function by multiplying the density across all observations:

\[ L(\alpha, \beta) = \prod_{i=1}^{n} \frac{1}{\Gamma(\alpha)\beta^\alpha} x_i^{\alpha-1} e^{-x_i/\beta}. \]

Now we take the natural logarithm to simplify the product into a sum.

Using log rules: - \(\log(ab) = \log a + \log b\) - \(\log(a^b) = b \log a\)

We obtain the full log-likelihood:

\[ \ell(\alpha, \beta) = -n \log \Gamma(\alpha) - n\alpha \log \beta + (\alpha - 1)\sum_{i=1}^{n} \log x_i - \frac{1}{\beta}\sum_{i=1}^{n} x_i. \]

This function measures how plausible \((\alpha, \beta)\) are given the observed data.


1.2 (b) Derivation of the Score Functions

The score functions are the first partial derivatives of the log-likelihood with respect to \(\alpha\) and \(\beta\).


1.2.1 Derivative with Respect to \(\beta\)

We differentiate \(\ell(\alpha, \beta)\) with respect to \(\beta\).

Only the terms involving \(\beta\) are:

\[ - n\alpha \log \beta - \frac{1}{\beta}\sum_{i=1}^{n} x_i. \]

Taking derivatives term-by-term:

  • \(\frac{d}{d\beta}(-n\alpha \log \beta) = -\frac{n\alpha}{\beta}\)

  • \(\frac{d}{d\beta}\left(-\frac{1}{\beta}\sum x_i\right) = \frac{1}{\beta^2}\sum x_i\)

Therefore, the score function for \(\beta\) is:

\[ \frac{\partial \ell}{\partial \beta} = -\frac{n\alpha}{\beta} + \frac{1}{\beta^2} \sum_{i=1}^{n} x_i. \]


1.2.2 Derivative with Respect to \(\alpha\)

Now differentiate with respect to \(\alpha\).

The terms involving \(\alpha\) are:

\[ - n \log \Gamma(\alpha) - n\alpha \log \beta + (\alpha - 1)\sum_{i=1}^{n} \log x_i. \]

We use the identity:

\[ \frac{d}{d\alpha}\log \Gamma(\alpha) = \psi(\alpha), \]

where \(\psi(\alpha)\) is the digamma function.

Taking derivatives:

  • \(\frac{d}{d\alpha}(-n \log \Gamma(\alpha)) = -n\psi(\alpha)\)

  • \(\frac{d}{d\alpha}(-n\alpha \log \beta) = -n\log \beta\)

  • \(\frac{d}{d\alpha}((\alpha - 1)\sum \log x_i) = \sum \log x_i\)

Combining these results:

\[ \frac{\partial \ell}{\partial \alpha} = - n\psi(\alpha) - n\log \beta + \sum_{i=1}^{n} \log x_i. \]


2 Question 2

2.1 (a) :MLE using Gradient-Based Optimization

# -----------------------------
# Step 1: Enter the Birth Data
# -----------------------------

# observed delivery times 
birth_data <- c(
2.1, 3.4, 4.25, 5.6, 6.4, 7.3, 8.5, 8.75, 8.9, 9.5, 9.75, 10, 10.4, 10.4, 16, 19,
4, 4.1, 5, 5.5, 5.7, 6.5, 7.25, 7.3, 7.5, 8.2, 8.5, 9.75, 11, 11.2, 15, 16.5, 2.6,
3.6, 3.6, 6.4, 6.8, 7.5, 7.5, 8.25, 8.5, 10.4, 10.75, 14.25, 14.5, 1.5, 4.7, 4.7,
7.2, 7.25, 8.1, 8.5, 9.2, 9.5, 10.7, 11.5, 2.5, 2.5, 3.4, 4.2, 5.9, 6.25, 7.3, 7.5,
7.8, 8.3, 8.3, 10.25, 12.9, 14.3, 4, 4, 5.25, 6.1, 6.5, 6.9, 7, 8.45, 9.25, 10.1,
10.2, 12.75, 14.6, 2, 2.7, 2.75, 3.4, 4.2, 4.3, 4.9, 6.25, 7, 9, 9.25, 10.7
)

n <- length(birth_data)   # sample size


# -----------------------------------
# Step 2: Log-Likelihood Function
# -----------------------------------

# This function returns the *negative* log-likelihood.
# We use negative because optim() performs minimization by default.

gamma_loglik <- function(params, data) {
  
  alpha <- params[1]   
  beta  <- params[2]
  
  # Enforce positivity constraint (Gamma parameters must be > 0)
  if(alpha <= 0 || beta <= 0) return(Inf)
  
  n <- length(data)
  
  loglik <- -n * log(gamma(alpha)) -
            n * alpha * log(beta) +
            (alpha - 1) * sum(log(data)) -
            (1 / beta) * sum(data)
  
  return(-loglik)  # negative for minimization
}


# -----------------------------------
# Step 3: Gradient 
# -----------------------------------

# This function returns the gradient as a numeric vector.
# It contains the two partial derivatives derived in Question 1.

gamma_score <- function(params, data) {
  
  alpha <- params[1]
  beta  <- params[2]
  
  if(alpha <= 0 || beta <= 0) return(c(Inf, Inf))
  
  n <- length(data)
  
  # Partial derivative with respect to alpha
  grad_alpha <- -n * digamma(alpha) -
                n * log(beta) +
                sum(log(data))
  
  # Partial derivative with respect to beta
  grad_beta <- - (n * alpha) / beta +
               (1 / beta^2) * sum(data)
  
  # Return gradient vector
  return(-c(grad_alpha, grad_beta))
}


# -----------------------------------
# Step 4: Choose Initial Values
# -----------------------------------

# Reasonable starting values based on method of moments intuition
initial_values <- c(alpha = 2, beta = 2)


# -----------------------------------
# Step 5: Run Optimization
# -----------------------------------

mle_result <- optim(
  par = initial_values,
  fn = gamma_loglik,
  gr = gamma_score,
  data = birth_data,
  method = "L-BFGS-B",         # allows bounds
  lower = c(1e-6, 1e-6)        # enforce positivity
)

mle_result$par   # Estimated (alpha_hat, beta_hat)
   alpha     beta 
4.387850 1.760123 

2.2 (b) Method of Moments:

# Compute sample mean
x_bar <- mean(birth_data)

# Compute sample variance 
s2 <- var(birth_data)

# Method of Moments estimator for alpha
alpha_mme <- x_bar^2 / s2

# Method of Moments estimator for beta
beta_mme <- s2 / x_bar

c(alpha_mme = alpha_mme, beta_mme = beta_mme)
alpha_mme  beta_mme 
 4.685073  1.648460 

2.3 2(c) Comparison of Method of Moments and Maximum Likelihood

We compare the Method of Moments (MME) and Maximum Likelihood Estimation (MLE) both theoretically and visually.

# ---------------------------------------
# Plot histogram with fitted densities
# ---------------------------------------

# Create histogram scaled to density
hist(birth_data,
     probability = TRUE,           # scale histogram to density
     breaks = 15,
     col = "lightgray",
     border = "white",
     main = "Gamma Fit: MLE vs MME",
     xlab = "Delivery Time (hours)")

# Create grid of x values for smooth curves
x_grid <- seq(min(birth_data),
              max(birth_data),
              length.out = 500)

# Add MLE fitted density
lines(x_grid,
      dgamma(x_grid,
             shape = mle_result$par[1],
             scale = mle_result$par[2]),
      col = "blue",
      lwd = 2)

# Add MME fitted density
lines(x_grid,
      dgamma(x_grid,
             shape = alpha_mme,
             scale = beta_mme),
      col = "red",
      lwd = 2,
      lty = 2)

# Add legend
legend("topright",
       legend = c("MLE", "MME"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1,2))

The Method of Moments (MME) and Maximum Likelihood Estimation (MLE) are two general approaches for estimating unknown parameters in a probability distribution. The Method of Moments determines parameter values by equating sample moments, such as the sample mean and variance, to their corresponding theoretical moments. In contrast, Maximum Likelihood Estimation selects parameter values that maximize the likelihood function, meaning it chooses the values under which the observed data are most probable.

In this application, both methods were used to estimate the parameters of a Gamma model for delivery times. Although the numerical estimates differed slightly, both methods produced very similar fitted curves when overlaid on the histogram. The data display moderate right skewness, and both fitted distributions capture the central concentration of observations as well as the gradual decline in the right tail. The visual difference between the two fitted curves is small, indicating that both methods provide a reasonable description of the data.

The Method of Moments has the advantage of conceptual simplicity. It is straightforward to compute and often leads to algebraically simple solutions. Because it relies only on summary statistics, it is computationally efficient and easy to implement. However, since it uses only selected characteristics of the data rather than the full probability structure, it may be less statistically efficient and may not capture subtle features of the distribution as effectively.

Maximum Likelihood Estimation typically requires more computational effort, especially when closed-form solutions do not exist and numerical optimization is necessary. Despite this, MLE has stronger theoretical properties under standard conditions. It is generally consistent, asymptotically normal, and asymptotically efficient, meaning that in large samples it tends to produce estimates with desirable statistical properties. Because it is based on the full likelihood function, it often provides a slightly better overall fit.

In conclusion, both methods perform similarly for this dataset, as shown by the nearly overlapping fitted curves. However, Maximum Likelihood Estimation is generally preferred in practice due to its stronger theoretical justification and its use of the full distributional information. The Method of Moments remains useful for its simplicity and as a practical starting point for more advanced estimation procedures.

---
title: 'STA 506 HOMEWORK 5'
author: 'Gerard Ike'
date: "2026-03-02"
output:
  html_document:              # output document format
    toc: yes                  # add table contents
    toc_float: yes            # toc_property: floating
    toc_depth: 4              # depth of TOC headings
    fig_width: 6              # global figure width
    fig_height: 4             # global figure height
    fig_caption: yes          # add figure caption
    number_sections: yes      # numbering section headings
    toc_collapsed: yes        # TOC subheading clapsing
    code_folding: hide        # folding/showing code
    code_download: yes        # allow to download complete RMarkdown source code
    smooth_scroll: yes        # scrolling text of the document
    theme: lumen              # visual theme for HTML document only
    highlight: tango          # code syntax highlighting styles
  pdf_document:
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
  word_document:
    toc: yes
    toc_depth: '4'
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.
#
knitr::opts_chunk$set(echo = TRUE,            # include code chunk in the output file
                      warning = FALSE,        # sometimes, you code may produce warning messages,
                                              # you can choose to include the warning messages in
                                              # the output file. 
                      results = TRUE,         # you can also decide whether to include the output
                                              # in the output file.
                      message = FALSE,        # suppress messages 
                      comment = NA            # remove the default leading hash tags in the output
                      )   
```

# Question 1: Derive the Gradient of the Likelihood function

We assume that $x_1, x_2, \dots, x_n$ are i.i.d. from a Gamma distribution with shape parameter $\alpha > 0$ and scale parameter $\beta > 0$.

The density function is:

$$
f(x \mid \alpha, \beta)
=
\frac{1}{\Gamma(\alpha)\beta^\alpha}
x^{\alpha-1} e^{-x/\beta},
\quad x > 0.
$$

---

## (a) Full Log-Likelihood Function

First, we write the likelihood function by multiplying the density across all observations:

$$
L(\alpha, \beta)
=
\prod_{i=1}^{n}
\frac{1}{\Gamma(\alpha)\beta^\alpha}
x_i^{\alpha-1}
e^{-x_i/\beta}.
$$

Now we take the natural logarithm to simplify the product into a sum.

Using log rules:
- $\log(ab) = \log a + \log b$
- $\log(a^b) = b \log a$

We obtain the full log-likelihood:

$$
\ell(\alpha, \beta)
=
-n \log \Gamma(\alpha)
- n\alpha \log \beta
+ (\alpha - 1)\sum_{i=1}^{n} \log x_i
- \frac{1}{\beta}\sum_{i=1}^{n} x_i.
$$

This function measures how plausible $(\alpha, \beta)$ are given the observed data.

---

## (b) Derivation of the Score Functions 

The score functions are the first partial derivatives of the log-likelihood with respect to $\alpha$ and $\beta$.


---

### Derivative with Respect to $\beta$

We differentiate $\ell(\alpha, \beta)$ with respect to $\beta$.

Only the terms involving $\beta$ are:

$$
- n\alpha \log \beta
- \frac{1}{\beta}\sum_{i=1}^{n} x_i.
$$

Taking derivatives term-by-term:

- $\frac{d}{d\beta}(-n\alpha \log \beta) = -\frac{n\alpha}{\beta}$

- $\frac{d}{d\beta}\left(-\frac{1}{\beta}\sum x_i\right)
=
\frac{1}{\beta^2}\sum x_i$

Therefore, the score function for $\beta$ is:

$$
\frac{\partial \ell}{\partial \beta}
=
-\frac{n\alpha}{\beta}
+
\frac{1}{\beta^2}
\sum_{i=1}^{n} x_i.
$$

---

### Derivative with Respect to $\alpha$

Now differentiate with respect to $\alpha$.

The terms involving $\alpha$ are:

$$
- n \log \Gamma(\alpha)
- n\alpha \log \beta
+ (\alpha - 1)\sum_{i=1}^{n} \log x_i.
$$

We use the identity:

$$
\frac{d}{d\alpha}\log \Gamma(\alpha)
=
\psi(\alpha),
$$

where $\psi(\alpha)$ is the digamma function.

Taking derivatives:

- $\frac{d}{d\alpha}(-n \log \Gamma(\alpha)) = -n\psi(\alpha)$

- $\frac{d}{d\alpha}(-n\alpha \log \beta) = -n\log \beta$

- $\frac{d}{d\alpha}((\alpha - 1)\sum \log x_i)
=
\sum \log x_i$

Combining these results:

$$
\frac{\partial \ell}{\partial \alpha}
=
- n\psi(\alpha)
- n\log \beta
+ \sum_{i=1}^{n} \log x_i.
$$

---

# Question 2
## (a) :MLE using Gradient-Based Optimization
```{r}
# -----------------------------
# Step 1: Enter the Birth Data
# -----------------------------

# observed delivery times 
birth_data <- c(
2.1, 3.4, 4.25, 5.6, 6.4, 7.3, 8.5, 8.75, 8.9, 9.5, 9.75, 10, 10.4, 10.4, 16, 19,
4, 4.1, 5, 5.5, 5.7, 6.5, 7.25, 7.3, 7.5, 8.2, 8.5, 9.75, 11, 11.2, 15, 16.5, 2.6,
3.6, 3.6, 6.4, 6.8, 7.5, 7.5, 8.25, 8.5, 10.4, 10.75, 14.25, 14.5, 1.5, 4.7, 4.7,
7.2, 7.25, 8.1, 8.5, 9.2, 9.5, 10.7, 11.5, 2.5, 2.5, 3.4, 4.2, 5.9, 6.25, 7.3, 7.5,
7.8, 8.3, 8.3, 10.25, 12.9, 14.3, 4, 4, 5.25, 6.1, 6.5, 6.9, 7, 8.45, 9.25, 10.1,
10.2, 12.75, 14.6, 2, 2.7, 2.75, 3.4, 4.2, 4.3, 4.9, 6.25, 7, 9, 9.25, 10.7
)

n <- length(birth_data)   # sample size


# -----------------------------------
# Step 2: Log-Likelihood Function
# -----------------------------------

# This function returns the *negative* log-likelihood.
# We use negative because optim() performs minimization by default.

gamma_loglik <- function(params, data) {
  
  alpha <- params[1]   
  beta  <- params[2]
  
  # Enforce positivity constraint (Gamma parameters must be > 0)
  if(alpha <= 0 || beta <= 0) return(Inf)
  
  n <- length(data)
  
  loglik <- -n * log(gamma(alpha)) -
            n * alpha * log(beta) +
            (alpha - 1) * sum(log(data)) -
            (1 / beta) * sum(data)
  
  return(-loglik)  # negative for minimization
}


# -----------------------------------
# Step 3: Gradient 
# -----------------------------------

# This function returns the gradient as a numeric vector.
# It contains the two partial derivatives derived in Question 1.

gamma_score <- function(params, data) {
  
  alpha <- params[1]
  beta  <- params[2]
  
  if(alpha <= 0 || beta <= 0) return(c(Inf, Inf))
  
  n <- length(data)
  
  # Partial derivative with respect to alpha
  grad_alpha <- -n * digamma(alpha) -
                n * log(beta) +
                sum(log(data))
  
  # Partial derivative with respect to beta
  grad_beta <- - (n * alpha) / beta +
               (1 / beta^2) * sum(data)
  
  # Return gradient vector
  return(-c(grad_alpha, grad_beta))
}


# -----------------------------------
# Step 4: Choose Initial Values
# -----------------------------------

# Reasonable starting values based on method of moments intuition
initial_values <- c(alpha = 2, beta = 2)


# -----------------------------------
# Step 5: Run Optimization
# -----------------------------------

mle_result <- optim(
  par = initial_values,
  fn = gamma_loglik,
  gr = gamma_score,
  data = birth_data,
  method = "L-BFGS-B",         # allows bounds
  lower = c(1e-6, 1e-6)        # enforce positivity
)

mle_result$par   # Estimated (alpha_hat, beta_hat)
```

## (b) Method of Moments:
```{r}

# Compute sample mean
x_bar <- mean(birth_data)

# Compute sample variance 
s2 <- var(birth_data)

# Method of Moments estimator for alpha
alpha_mme <- x_bar^2 / s2

# Method of Moments estimator for beta
beta_mme <- s2 / x_bar

c(alpha_mme = alpha_mme, beta_mme = beta_mme)
```

## 2(c) Comparison of Method of Moments and Maximum Likelihood

We compare the Method of Moments (MME) and Maximum Likelihood Estimation (MLE) both theoretically and visually.

```{r}
# ---------------------------------------
# Plot histogram with fitted densities
# ---------------------------------------

# Create histogram scaled to density
hist(birth_data,
     probability = TRUE,           # scale histogram to density
     breaks = 15,
     col = "lightgray",
     border = "white",
     main = "Gamma Fit: MLE vs MME",
     xlab = "Delivery Time (hours)")

# Create grid of x values for smooth curves
x_grid <- seq(min(birth_data),
              max(birth_data),
              length.out = 500)

# Add MLE fitted density
lines(x_grid,
      dgamma(x_grid,
             shape = mle_result$par[1],
             scale = mle_result$par[2]),
      col = "blue",
      lwd = 2)

# Add MME fitted density
lines(x_grid,
      dgamma(x_grid,
             shape = alpha_mme,
             scale = beta_mme),
      col = "red",
      lwd = 2,
      lty = 2)

# Add legend
legend("topright",
       legend = c("MLE", "MME"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1,2))
```


The Method of Moments (MME) and Maximum Likelihood Estimation (MLE) are two general approaches for estimating unknown parameters in a probability distribution. The Method of Moments determines parameter values by equating sample moments, such as the sample mean and variance, to their corresponding theoretical moments. In contrast, Maximum Likelihood Estimation selects parameter values that maximize the likelihood function, meaning it chooses the values under which the observed data are most probable.

In this application, both methods were used to estimate the parameters of a Gamma model for delivery times. Although the numerical estimates differed slightly, both methods produced very similar fitted curves when overlaid on the histogram. The data display moderate right skewness, and both fitted distributions capture the central concentration of observations as well as the gradual decline in the right tail. The visual difference between the two fitted curves is small, indicating that both methods provide a reasonable description of the data.

The Method of Moments has the advantage of conceptual simplicity. It is straightforward to compute and often leads to algebraically simple solutions. Because it relies only on summary statistics, it is computationally efficient and easy to implement. However, since it uses only selected characteristics of the data rather than the full probability structure, it may be less statistically efficient and may not capture subtle features of the distribution as effectively.

Maximum Likelihood Estimation typically requires more computational effort, especially when closed-form solutions do not exist and numerical optimization is necessary. Despite this, MLE has stronger theoretical properties under standard conditions. It is generally consistent, asymptotically normal, and asymptotically efficient, meaning that in large samples it tends to produce estimates with desirable statistical properties. Because it is based on the full likelihood function, it often provides a slightly better overall fit.

In conclusion, both methods perform similarly for this dataset, as shown by the nearly overlapping fitted curves. However, Maximum Likelihood Estimation is generally preferred in practice due to its stronger theoretical justification and its use of the full distributional information. The Method of Moments remains useful for its simplicity and as a practical starting point for more advanced estimation procedures.