Problem 5

The Petal.Lengths (the third variable) in the iris dataset exhibit a bimodal pattern. We are going to fit them using a mixture of two normal distributions. For simplicity, we only study the first 100 rows in the dataset, which correspond to setosa and versicolor (50 for each). After loading iris in R, you can select the data by executing S = iris$Petal.Length[1:100] or S = iris[1:100, 3]. Now S becomes the dataset under investigation. Suppose that the entries of S are random samples \(\{X_i\}^n_{i=1}\) from the mixture distribution in Problem 4 above. The goal is to estimate the parameters \(μ_1\), \(μ_2\), \(σ_1\) and \(σ_2\) by the Method of Moments (MoM).

# Cleaning data
data(iris)
n = 100
S = iris$Petal.Length[1:100]
S
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1
  1. Compute the 1st-4th sample moments \(\frac{1}{n} \sum^n_{i=1} X^k_i\)for \(k\) = 1 to 4 using R. Define a 4-dimensional vector M whose \(k\)-th entry is the \(k\)-th sample moment.
find_k_moment <- function(k,x){
  mean(x^k)
}

M = sapply(c(1,2,3,4), find_k_moment, S)
M
## [1]   2.86100  10.26550  41.63513 178.49022
  1. Define an R function momentsmixture that computes \(\mathbb{E} X^k (k = 1, 2, 3, 4)\) for a random variable \(X\) from the mixture distribution in Problem 4, with parameters mu1, mu2, sigma1 and sigma2. The output of momentsmixture(mu1, mu2, sigma1, sigma2) should be a vector of length 4. Then, define an R function differences that satisfies the followings:
momentmixture <- function(mu1, mu2, sigma1, sigma2){
  c(1/2*(mu1+mu2), 
    1/2*(mu1^2 + mu2^2 + sigma1^2 + sigma2^2),
    1/2*(3*sigma1^2*mu1+mu1^3+
         3*sigma2^2*mu2+mu2^3),
    1/2*(3*sigma1^4+6*sigma1^2*mu1^2+mu1^4+
         3*sigma2^4+6*sigma2^2*mu2^2+mu2^4))
}
differences <- function(theta){
  momentmixture(theta[1],theta[2],theta[3],theta[4]) - M
}
  1. MoM amounts to solving the system of equations differences(theta) = 0. To that end, we will apply the function multiroot in the R package rootSolve to find an approximate solution. This function requires an initial guess for the solutions. Use the 1st and 3rd quartiles

mu10 = quantile(S, 0.25, names = FALSE) mu20 = quantile(S, 0.75, names = FALSE)

for \(\mu_1\) and \(\mu_2\); use 1 as the initial guesses for \(\sigma_1\) and \(\sigma_2\). That is, the initial guess for theta is c(mu10, mu20, 1, 1). Report the estimated parameters.

mu10 = quantile(S, 0.25, names = FALSE)
mu20 = quantile(S, 0.75, names = FALSE)
init = c(mu10, mu20, 1, 1)

library(rootSolve)
res = multiroot(differences, init)
res
## $root
## [1] 1.4545495 4.2674505 0.1268971 0.4336460
## 
## $f.root
## [1] 0.000000e+00 2.273737e-13 1.016076e-12 3.780087e-12
## 
## $iter
## [1] 8
## 
## $estim.precis
## [1] 1.255884e-12
mu1_h = res$root[1]
mu2_h = res$root[2]
sigma1_h = res$root[3]
sigma2_h = res$root[4]
mu1_h
## [1] 1.454549
mu2_h
## [1] 4.267451
sigma1_h
## [1] 0.1268971
sigma2_h
## [1] 0.433646
  1. Run hist(S, freq = FALSE, breaks = 20) to get a histogram of S showing relative fre- quencies. On the same figure, plot the PDF of the mixture distribution with estimated parameters. How is the fit?
hist(S, freq = FALSE, breaks = 20, 
     xlim = c(0, 6), ylim = c(0, 2))
x = seq(0, 6, length = 100)
y1 = sapply(x, dnorm, mu1_h, sigma1_h)
y2 = sapply(x, dnorm, mu2_h, sigma2_h)
lines(x,(y1+y2)/2,col = 'blue')

# The fit is fairly good.