use ```{engine='python', engine.path='/usr/bin/python3'} ``` for inlining python 3 code
Homework Problems: 5, 6, 10, 11, 15, 29 (do bootstrap on part c as well)
Let \(X_1, X_2, \cdots ,X_n\) be an i.i.d. sample from a distribution function \(F\), and let \(F_n\) denote the ecdf. Show that \[Cov[F_n(u), F_n(v)] = \frac{1}{n}[F(m) - F(u)F(v)]\] where \(m=\min(u,v)\). Conclude that \(F_n(u)\) and \(F_n(v)\) are positively correlated.
Solution
\[Cov(F_n(u), F_n(v)) = E[F_n(u)F_n(v)] - E[F_n(u)]E[F_n(v)]\] Writing out the ecdf in terms of indicator functions, where \(\mathbb{I}_u(X_i) = 1\) if \(X_i \leq u\) and \(0\) otherwise, we have \[Cov(F_n(u), F_n(v)) = E[\frac{1}{n}\sum_i\mathbb{I}_u(X_i)\frac{1}{n}\sum_j\mathbb{I}_v(X_j)] - E[\frac{1}{n}\sum_i\mathbb{I}_u(X_i)]E[\frac{1}{n}\sum_j\mathbb{I}_v(X_j)]\] The two expectations in the second term are \(E[F_n(u)] = P(X \leq u) = F(u)\) and \(E[F_n(v)] = P(X\leq v) = F(v)\), so \[Cov(F_n(u), F_n(v)) = E[\frac{1}{n^2}\sum_i\sum_j\mathbb{I}_u(X_i)\mathbb{I}_v(X_j)] - F(u)F(v)\] \[= \frac{1}{n^2}\sum_{i = j}[P(X_i \leq u, X_j \leq v) - F(u)F(v)] + \frac{1}{n^2}\sum_{i \neq j}[P(X_i \leq u, X_j \leq v) - F(u)F(v)]\] For \(i \neq j\), independence gives \(P(X_i \leq u, X_j \leq v) = F(u)F(v)\), so the second sum is zero. For \(i = j\), \(P(X_i \leq u, X_i \leq v) = P(X_i \leq m) = F(m)\), and there are \(n\) such terms, so \[Cov(F_n(u), F_n(v)) = \frac{1}{n^2}\sum_{i=1}^{n}[F(m) - F(u)F(v)] = \frac{1}{n}[F(m) - F(u)F(v)]\]
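Moreover, this value is non-negative, and strictly positive whenever \(F(u) > 0\) and \(F(v) < 1\), by the following reasoning:
Without loss of generality assume that \(u \leq v\), so that \(m = \min(u,v) = u\). Since a cdf is non-decreasing and bounded by 1, it follows that \(0 \leq F(u) \leq F(v) \leq 1\). Multiplying through by \(F(u)\) gives \[0 \leq F^2(u) \leq F(u)F(v) \leq F(u) = F(m)\] so \(F(m) - F(u)F(v) = F(u)[1 - F(v)] \geq 0\), with strict inequality whenever \(F(u) > 0\) and \(F(v) < 1\). In that case the covariance is positive, and \(F_n(u)\) and \(F_n(v)\) are positively correlated.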
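The covariance identity can be checked numerically. The following is a minimal simulation sketch, assuming a standard normal \(F\) and arbitrary illustrative values of `n`, `u`, `v`, and the number of replications `reps` (none of these come from the problem):

```r
set.seed(1)

n    <- 20       # sample size (illustrative choice)
reps <- 20000    # number of simulated samples (illustrative choice)
u    <- -0.5
v    <-  0.5

# For each simulated sample, evaluate the ecdf at u and at v
sims <- replicate(reps, {
  x <- rnorm(n)
  c(mean(x <= u), mean(x <= v))
})

# Empirical covariance of F_n(u) and F_n(v) across the simulated samples
cov_sim <- cov(sims[1, ], sims[2, ])

# Theoretical value (1/n) [F(m) - F(u) F(v)] with m = min(u, v)
cov_theory <- (pnorm(min(u, v)) - pnorm(u) * pnorm(v)) / n

print(c(simulated = cov_sim, theoretical = cov_theory))
```

The two printed values should agree up to simulation error, and both should be positive.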
Various chemical tests were conducted on beeswax by White, Riethof, and Kushnir. In particular, the percentage of hydrocarbons in each sample of wax was determined.
Plot the ecdf, a histogram, and a normal probability plot of the percentages of hydrocarbons given in the following table. Find the .90, .75, .50, .25, and .10 quantiles. Does this distribution appear Gaussian?
The average percentage of hydrocarbons in microcrystalline wax (a synthetic commercial wax) is 85%. Suppose that beeswax was diluted with 1% microcrystalline wax. Could this be detected? What about 3% or a 5% dilution?
Solutions
library(ggplot2)
p6_data <- c(14.27,15.15,13.98,15.40,14.04,14.10,13.75,14.23,14.80,
13.98,14.47,14.68,13.68,15.47,14.87,14.44,12.28,
14.90,14.65,13.33,15.31,13.73,15.28,14.57,17.09,15.91,
14.73,14.41,14.32,13.65,14.43,15.10,14.52,15.18,
14.19,13.64,15.02,13.96,12.92,15.63,14.49,15.21,14.77,
14.01,14.57,15.56,13.83,14.56,14.75,14.30,14.92,15.49,
15.38,13.66,15.03,14.41,14.62,15.47,15.13)
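# Build the empirical cdf once and reuse it for plotting
Fn <- ecdf(p6_data)
# Distinct observed values, i.e. the points where the ecdf jumps
Fn_values <- knots(Fn)
plot(Fn, main = "ecdf")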
qplot(p6_data)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(p6_data); qqline(p6_data)
# the quantiles are given here
quantile(p6_data, c(.10, .25, .5, .75, .9))
## 10% 25% 50% 75% 90%
## 13.676 14.070 14.570 15.115 15.470
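From the plots above, the data appear approximately Normal. The quantiles support this: the quartiles (14.070 and 15.115) and the .10 and .90 quantiles (13.676 and 15.470) are nearly symmetric about the median (14.570).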
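For the dilution question, one rough way to gauge detectability is to see how far the mean hydrocarbon percentage would shift relative to the spread of pure beeswax. The sketch below assumes a simple linear mixing model, in which a sample diluted by a fraction `frac` has hydrocarbon percentage `(1 - frac) * x + frac * 85`. The helper names `dilute` and `shift_in_sd` are hypothetical, and this is only one possible approach:

```r
# Hydrocarbon percentages from the table (same values as p6_data)
beeswax <- c(14.27,15.15,13.98,15.40,14.04,14.10,13.75,14.23,14.80,
             13.98,14.47,14.68,13.68,15.47,14.87,14.44,12.28,
             14.90,14.65,13.33,15.31,13.73,15.28,14.57,17.09,15.91,
             14.73,14.41,14.32,13.65,14.43,15.10,14.52,15.18,
             14.19,13.64,15.02,13.96,12.92,15.63,14.49,15.21,14.77,
             14.01,14.57,15.56,13.83,14.56,14.75,14.30,14.92,15.49,
             15.38,13.66,15.03,14.41,14.62,15.47,15.13)

# Assumed linear mixing model: a fraction `frac` of the sample is replaced
# by microcrystalline wax containing `wax_pct` percent hydrocarbons
dilute <- function(x, frac, wax_pct = 85) {
  (1 - frac) * x + frac * wax_pct
}

# Shift of the mean caused by dilution, in units of the pure-sample sd
shift_in_sd <- function(x, frac) {
  (mean(dilute(x, frac)) - mean(x)) / sd(x)
}

fracs  <- c(0.01, 0.03, 0.05)
shifts <- sapply(fracs, function(f) shift_in_sd(beeswax, f))
print(data.frame(dilution = fracs, shift_in_sd = shifts))
```

Under this model, a shift of well under one standard deviation would be hard to see in a single sample, while a shift of several standard deviations would place a diluted sample outside the range typical of pure beeswax.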
Let \(X_1, X_2, \ldots, X_n\) be a sample from cdf \(F\) and denote the order statistics by \(X_{(1)}, \ldots, X_{(n)}\). We will assume that \(F\) is continuous with density \(f\). From a theorem we have that the density function of \(X_{(k)}\) is \[f_k(x) = n\dbinom{n-1}{k-1}[F(x)]^{k-1}[1 - F(x)]^{n-k}f(x)\]
Find the mean and variance of \(X_{(k)}\) when the sample is from the uniform distribution on \([0,1]\).
Find the approximate mean and variance of \(Y_{(k)}\), the \(k\)-th order statistic of a sample of size \(n\) from \(F\).
Solutions
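A sketch of the mean computation, assuming the sample is from the uniform distribution on \([0,1]\) (so \(F(x) = x\) and \(f(x) = 1\) on \([0,1]\)), which is the setting in which the variance formula below holds. It uses the fact that the density of an order statistic integrates to 1, applied with the substitutions \(l = k+1\) and \(s = n+1\): \[\int_0^1 s\dbinom{s-1}{l-1}x^{l-1}(1-x)^{s-l}\,dx = 1 \;\;\Rightarrow\;\; \int_0^1 x^{k}(1-x)^{n-k}\,dx = \frac{1}{(n+1)\dbinom{n}{k}}\] and therefore \[E(X_{(k)}) = n\dbinom{n-1}{k-1}\int_0^1 x^{k}(1-x)^{n-k}\,dx = \frac{n\dbinom{n-1}{k-1}}{(n+1)\dbinom{n}{k}} = \frac{k}{n+1}\]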
To find the variance we use \(Var(X_{(k)}) = E(X_{(k)}^2) - [E(X_{(k)})]^2\). The mean is \(E(X_{(k)}) = \frac{k}{n+1}\), so it remains to find \(E(X_{(k)}^2)\). The same argument as for the mean, now with the substitutions \(l = k+2\) and \(s = n+2\), yields \[E(X_{(k)}^2) = \frac{k(k+1)}{(n+1)(n+2)}\] and thus \[Var(X_{(k)}) = \frac{k(k+1)}{(n+1)(n+2)} - \frac{k^2}{(n+1)^2} = \frac{1}{n+2}\Big(\frac{k}{n+1}\Big)\Big(1 - \frac{k}{n+1}\Big)\]
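These formulas can be sanity-checked by simulation. The sketch below assumes a uniform sample on \([0,1]\) and uses arbitrary illustrative values of `n`, `k`, and the number of replications `reps`:

```r
set.seed(1)

n    <- 10      # sample size (illustrative choice)
k    <- 3       # which order statistic (illustrative choice)
reps <- 50000   # number of simulated samples (illustrative choice)

# Simulate the k-th order statistic of n Unif(0, 1) draws
xk <- replicate(reps, sort(runif(n))[k])

# Theoretical mean and variance from the formulas above
mean_theory <- k / (n + 1)
var_theory  <- (1 / (n + 2)) * (k / (n + 1)) * (1 - k / (n + 1))

print(c(mean_sim = mean(xk), mean_theory = mean_theory))
print(c(var_sim  = var(xk),  var_theory  = var_theory))
```

The simulated and theoretical values should agree up to simulation error.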
Calculate the hazard function for \[F(t) = 1 - e^{-\alpha t^\beta} \;\; t\geq 0\]
Solution
Differentiating the cdf gives the density function \[f(t) = \frac{d}{dt}F(t) = \alpha\beta t^{\beta-1}e^{-\alpha t^\beta}\] and so, by the definition of the hazard function \(h(t) = f(t)/[1 - F(t)]\), we have \[h(t) = \frac{\alpha\beta t^{\beta-1}e^{-\alpha t^\beta}}{e^{-\alpha t^\beta}} = \alpha\beta t^{\beta - 1}\]
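The closed form can be compared against R's built-in Weibull functions. This is a sketch under the assumption that R's parameterization, \(F(t) = 1 - e^{-(t/\text{scale})^{\text{shape}}}\), is matched by taking `shape = beta` and `scale = alpha^(-1/beta)`; the function name `weibull_hazard` and the parameter values are illustrative only:

```r
# Closed-form hazard derived above: h(t) = alpha * beta * t^(beta - 1)
weibull_hazard <- function(t, alpha, beta) {
  alpha * beta * t^(beta - 1)
}

alpha <- 2      # illustrative value
beta  <- 1.5    # illustrative value
t     <- c(0.5, 1, 2)

# Map (alpha, beta) to R's (shape, scale) parameterization
shape <- beta
scale <- alpha^(-1 / beta)

# Hazard computed directly as f(t) / (1 - F(t))
h_ref <- dweibull(t, shape = shape, scale = scale) /
  pweibull(t, shape = shape, scale = scale, lower.tail = FALSE)

print(cbind(t, closed_form = weibull_hazard(t, alpha, beta), reference = h_ref))
```

With \(\beta > 1\) the hazard increases in \(t\), with \(\beta < 1\) it decreases, and with \(\beta = 1\) it is the constant \(\alpha\).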
A prisoner is told that he will be released at a time chosen uniformly at random within the next 24 hours. Let T denote the time that he is released. What is the hazard function for T? For what values of t is it smallest and largest? If he has been waiting for 5 hours, is it more likely that he will be released in the next few minutes than if he has been waiting for 1 hour?
Solution
From the information given above we have that \(T\sim \text{Unif}(0,24)\), and so for \(0 \leq t < 24\) \[f(t) = 1/24 \;\;\text{ and }\;\; F(t) = t/24\] It follows that the hazard function is given by \[h(t) = \frac{f(t)}{1 - F(t)} = \frac{1/24}{1 - t/24} = \frac{1}{24 - t}\]
library(ggplot2)
# Hazard function of T ~ Unif(0, 24), valid for 0 <= t < 24
haz <- function(t) {
  1 / (24 - t)
}
t <- seq(0, 23, by = .01)
qplot(t, haz(t), geom = 'line')
We see that the hazard function is increasing in \(t\): it is smallest at \(t = 0\), where \(h(0) = 1/24\), and it grows without bound as \(t\) approaches 24. Since \(h(5) = 1/19 > h(1) = 1/23\), release in the next few minutes is more likely after he has been waiting for 5 hours than after he has been waiting for only 1 hour.
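The comparison can also be made with exact conditional probabilities rather than the hazard alone. The sketch below takes "the next few minutes" to mean 5 minutes, which is an assumption, and the helper name `p_release_soon` is hypothetical:

```r
# P(released within the next `delta` hours | still waiting after t hours)
# for T ~ Unif(0, 24); algebraically this equals delta / (24 - t)
p_release_soon <- function(t, delta) {
  (punif(t + delta, 0, 24) - punif(t, 0, 24)) / (1 - punif(t, 0, 24))
}

delta <- 5 / 60   # 5 minutes, expressed in hours (assumed value)

p1 <- p_release_soon(1, delta)   # after waiting 1 hour
p5 <- p_release_soon(5, delta)   # after waiting 5 hours

print(c(after_1_hour = p1, after_5_hours = p5))
```

For small `delta` the conditional probability is approximately `delta * h(t)`, which is why comparing hazards gives the same ordering.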
Of the 26 measurements of the heat of sublimation of platinum, 5 are outliers (see Figure 10.10). Let N denote the number of these outliers that occur in a bootstrap sample (sample with replacement) of the 26 measurements.