Dear teaching team, I think I am learning R markdown a bit, and it seems a game changer for sure. Happy to have you throughout the PHS! I learned wiki or chatgpt could give the latex formatted formulas at ease. “\” - sign seems to give the Greek alphabet.

Question 1

  1. Define a random variable representing our outcome of interest.

The random variable is the count of exoneration among departments as discrete values.

  1. Assume a “model” for your random variable, i.e. assume a population distribution. Under this assumption write down the probability distribution function (i.e. the pdf or pmf) for one observation of the outcome of interest. Make sure to define the parameters in the pdf or pmf.

For this random variable, let us assume a Poisson variable. Hence the probability mass function is as followed.

\(P(Y=yi) = \frac{e^{-\lambda}\lambda^{yi}}{yi!}\)

Y- random variable,

yi- specific observation of the random variable,

\(\lambda\)-mean per draw

Let us demonstrate this in a scenario for a value of lambda=2.

(exp(-2*10)*2^(2+1+2+1+1+3+0+1+3+0))/(factorial(2)*factorial(1)*factorial(2)*factorial(1)*factorial(1)*factorial(3)*factorial(0)*factorial(1)*factorial(3)*factorial(0))
## [1] 2.345135e-07

Therefore the likelihood (as I understand this is probability for exact point probability for 2.00 not for 2.0001) of having 2 is 2.345135e-07.

  1. Write down the joint probability distribution for all 10 observations. What two key assumptions are you making in construction this joint distribution? Describe these assumptions in non-statistical terms (e.g. that someone unfamiliar with statistics might understand). Are they reasonable in this case?

\(P(Y=y_1, Y=y_2, \dots, Y=y_{10} | \lambda) = \prod_{i=1}^{10} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}\)

Observations are independent (One occurrence does not affect the other) Identically distributed (Distributions are identical) It is plausible in this case to assume joint probability distribution.

  1. Using the joint probability distribution and the data, show that the likelihood function for λ - the mean number of exonerations in a year in French courts - is:

\(L(\lambda | Y1,Y2,dots, Y10)=\frac{e^{-10\lambda}\lambda^{14}}{144}\)

\(log(\lambda)=log(\frac{e^{-10\lambda}\lambda^{14}}{144})\)

\(log(\lambda)=-10\lambda+14log(\lambda))\)

Differentiate now (Although I haven’t recalled this math function, I have done all the latex typing myself)

\(\frac{dl(\lambda)}{d\lambda}=-10+\frac{14}{\lambda}\)

\(-10+\frac{14}{\lambda}=0\)

\(\lambda=1.4\)

  1. Create a graph of the likelihood function above with λ on the x-axis and the value of the likelihood function on the y-axis. Using the graph, identify the parameter estimate that maximizes the likelihood.
curve(
  expr=(exp(-10*x)*x^(14))/(144),
  from=0,
  to=4,
  main="Likelihood function",
  ylab="likelihood",
  xlab="lambda"
)
abline(v=1.4, col="brown")

  1. Create a graph of the log-likelihood function with λ on the x-axis and the value of the log-likelihood function on the y-axis. Identify the parameter estimate that maximizes the log-likelihood. Compare this with the previous value that maximized the likelihood function. What do you notice?
curve(
  expr=log((exp(-10*x)*x^(14))/(144)),
  from=0,
  to=4,
  main="Log likelihood function",
  ylab="log likelihood",
  xlab="lambda"
)
abline(v=1.4, col="brown")

The maximum point of the log-likelihood function corresponds to the same value of λ=1.4 that maximized the likelihood function.

  1. Calculate the simple sample mean of the number of exonerations, i.e. ¯Y=1/n ∑_(i=1)^n▒Y_i . How does this compare to the results obtained in (e) and (f)?
sum(2,1,2,1,1,3,0,1,3,0)/10
## [1] 1.4

It results the same value.

  1. Using the property that the sampling distribution of the maximum likelihood estimator is asymptotically normal, derive and interpret a 95% confidence interval for the lambda. To help, your friend who is really into taking derivatives tells you that the observed information evaluated at the MLE for your estimator is n^2/(∑_i▒Y_i ).

\((\frac{n^{2}}{\sum Yi})^{-1}=\frac{14}{100}=0.14\)

\(\bar{\lambda}=Z_{1-\alpha/2}=1.4\pm1.96*\sqrt{0.14}\)

1.96*sqrt(0.14)
## [1] 0.7333648
1.4+1.96*sqrt(0.14)
## [1] 2.133365
1.4-1.96*sqrt(0.14)
## [1] 0.6666352

1.4 95%CI(0.67,2.13)

Across repeated samples with same methods, 95 per cent of the times the true population mean would fall within confidence intervals between 0.67 to 2.13.

  1. What concern might you have about the validity of this confidence interval given the particular data with which you are working?

We have a large confidence interval due small sample size, making out estimate less precise. Secondly, we assumed identical dispersion, which may not be explicitly true about social contexts.

  1. Someone else in the Paris philosophical society argues vehemently that your assumed distribution (your “model”) is wrong. If this were the case would your results in (h) still be valid? Why or why not? What other theorem that we learned previously might you use to derive an asymptotically-valid 95% confidence interval for λ in this case?

Central limit theorem can be used to construct the confidence interval instead. However, our confidence intervals would be valid as we have proven as the estimate was equal to mean.

  1. BONUS QUESTION: Using calculus, find the maximum likelihood estimate by maximizing the log-likelihood function analytically. Note this is optional!

Question 2 Estimation

  1. After intense debate during the next meeting of the Paris philosophical society, you manage to convince your colleagues that numerical investigations of human behavior, a ``social physics’’ if you will, is a worthwhile endeavor and that specifically we can use math to understand the mean number of exonerations in French courts. However, in typical fashion, none of the members can agree on the best estimator of the mean number of exonerations. You all agree to define Y_i as the number of exonerations in year i in France. You further agree that Y_i∼Poisson(λ) since it is a count, and that λ is the mean number of exonerations per year. You assume the data collected each year is an independent random sample. The group does not agree on how to estimate λ, however. Three different estimators have been proposed: ˆλ_1=1/n ∑(i=1)^n▒Y_i ˆλ_2=1/n Y((1)) ˆλ_3=1/n+1/n ∑(i=1)^n▒Y_i Where Y((1)) is the first observation in a set of n. For each estimator, calculate the mean, variance, bias, and mean square error. Show your work, and ignore the finite population correction as there are numerous courts in France. Estimator E[ˆλ] Var[ˆλ] Bias[ˆλ] MSE[ˆλ] ˆλ_1
    ˆλ_2
    ˆλ_3
#Lambda1
"mean"; sum(2,1,2,1,1,3,0,1,3,0)/10; "variance"; var(c(2,1,2,1,1,3,0,1,3,0)); "bias"; sum(2,1,2,1,1,3,0,1,3,0)/(10)-1.4; "MSE"; var(c(2,1,2,1,1,3,0,1,3,0))+(sum(2,1,2,1,1,3,0,1,3,0)/(10)-1.4)^2
## [1] "mean"
## [1] 1.4
## [1] "variance"
## [1] 1.155556
## [1] "bias"
## [1] 0
## [1] "MSE"
## [1] 1.155556
#Lambda2
"mean"; 2; "var"; var(c(1.4,2)); "bias"; 2-1.4; "MSE"; var(c(1.4,2))+(1.4-2)^2
## [1] "mean"
## [1] 2
## [1] "var"
## [1] 0.18
## [1] "bias"
## [1] 0.6
## [1] "MSE"
## [1] 0.54
#Lambda3
"mean"; 1/10+sum(2,1,2,1,1,3,0,1,3,0)/10; "var"; var(c(2,1,2,1,1,3,0,1,3,0)); "bias"; (1/10+sum(2,1,2,1,1,3,0,1,3,0)/10)-1.4; "MSE"; var(c(2,1,2,1,1,3,0,1,3,0))+((1/10+sum(2,1,2,1,1,3,0,1,3,0)/10)-1.4)^2
## [1] "mean"
## [1] 1.5
## [1] "var"
## [1] 1.155556
## [1] "bias"
## [1] 0.1
## [1] "MSE"
## [1] 1.165556

Let us arrange a table for the values we computed.

matrix(c(
  "Lambda", "Mean", "Variance", "Bias", "MSE",
  "L1", 1.4, 1.155556, 0, 1.155556,
  "L2", 2, 0.18, 0.6, 0.54,
  "L3", 1.5, 1.155556, 0.1, 1.165556
), nrow = 4, byrow = TRUE)
##      [,1]     [,2]   [,3]       [,4]   [,5]      
## [1,] "Lambda" "Mean" "Variance" "Bias" "MSE"     
## [2,] "L1"     "1.4"  "1.155556" "0"    "1.155556"
## [3,] "L2"     "2"    "0.18"     "0.6"  "0.54"    
## [4,] "L3"     "1.5"  "1.155556" "0.1"  "1.165556"

According to above estimations, the R code “var” provide n-1 correction for samples. However I couldn’t intuitively decide which formula is appropriate for variance, as it would provide 1.04 when I do hand calculation sigma^2/n VS the 1.156 as shown. I wasn’t sure as we knew all 10 values in this setting, we may go with 1.04, or should we go with the variance. Since we never truly find the sampling variance in life, I have chosen sample variance for this exercise. I greatly appreciate if the teaching team enlighten me on this subject.

My final answer to 2a is here sorry.

\(\bar\lambda_1=\lambda\) \(var=\frac{\lambda}{n}\) \(bias=0\) \(MSE=\frac{\lambda}{n}\) \(\bar\lambda_2=\frac{\lambda}{n}\) \(var=\frac{\lambda}{n^2}\) \(bias=\frac{\lambda}{n}-\lambda\) \(MSE=\frac{\lambda}{n^2}+(\frac{\lambda}{n}-\lambda)^2\) \(\bar\lambda_3=\lambda+\frac{1}{n}\) \(var=\frac{\lambda}{n}\) \(bias=\frac{1}{n}\) \(MSE=\frac{\lambda}{n}+\frac{1}{n^2}\)

  1. Compare and contrast these estimators in terms of bias, efficiency and mean square error. Please also comment on the cost of collecting data for the different estimators in terms of interview time.

The Lambda 1 is a unbiased estimator of the true population mean in this example, and this setting also is the most efficient as we sample all the units of analysis for the study. However, It is a resource and labor intensive approach.

The Lambda 2 is very biased estimator of the true population mean, and this setting is also less efficient due to low sample size. As we sample only one department, we could save enormous resource and time, but yields a very biased estimate.

The lambda 3 is a less biased estimator of the true population mean compared to Lambda 2, and this setting also provide more efficient than Lambda 2. It is a resource and labor intensive approach than lambda 2. (Although I don’t understand fully why 1/n+lambda overcame to the formula math wise, I conceptualize it as, as long as the sample size grew, we care less about the bias in our study and our estimate gets closer to the truth)