Stat 6502 Homework 1

Use ```{engine='python', engine.path='/usr/bin/python3'} ``` to inline Python 3 code.

Homework Problems: 5, 6, 10, 11, 15, 29 (do bootstrap on part c as well)

5.

Let \(X_1, X_2, \cdots ,X_n\) be a sample (i.i.d.) from a distribution function \(F\), and let \(F_n\) denote the ecdf. Show that \[Cov[F_n(u), F_n(v)] = \frac{1}{n}[F(m) - F(u)F(v)]\] where \(m=\min(u,v)\). Conclude that \(F_n(u)\) and \(F_n(v)\) are positively correlated.

Solution

By definition, \[Cov(F_n(u), F_n(v)) = E[F_n(u)F_n(v)] - E[F_n(u)]E[F_n(v)]\] Writing out the ecdf in terms of indicator functions, where \(\mathbb{I}_u(X_i) = 1\) if \(X_i \leq u\) and \(0\) otherwise, we have \[Cov(F_n(u), F_n(v)) = E[\frac{1}{n}\sum_i\mathbb{I}_u(X_i)\frac{1}{n}\sum_j\mathbb{I}_v(X_j)] - E[\frac{1}{n}\sum_i\mathbb{I}_u(X_i)]E[\frac{1}{n}\sum_j\mathbb{I}_v(X_j)]\] Since \(E[\mathbb{I}_u(X_i)] = P(X_i \leq u) = F(u)\), the two expectations in the second term are \(F(u)\) and \(F(v)\), and so \[Cov(F_n(u), F_n(v)) = E[\frac{1}{n^2}\sum_i\sum_j\mathbb{I}_u(X_i)\mathbb{I}_v(X_j)] - F(u)F(v)\] \[= \frac{1}{n^2}\sum_{i = j}[P(X_i \leq u, X_j \leq v) - F(u)F(v)] + \frac{1}{n^2}\sum_{i \neq j}[P(X_i \leq u, X_j \leq v) - F(u)F(v)]\] For \(i \neq j\), independence gives \(P(X_i \leq u, X_j \leq v) = F(u)F(v)\), so every term of the second sum is zero. For \(i = j\), \(P(X_i \leq u, X_i \leq v) = P(X_i \leq m) = F(m)\). The first sum has \(n\) terms, so \[Cov(F_n(u), F_n(v)) = \frac{1}{n^2}\sum_{i=1}^{n}[F(m) - F(u)F(v)] = \frac{1}{n}[F(m) - F(u)F(v)]\]

Moreover, this value is nonnegative by the following reasoning:

Without loss of generality assume that \(u \leq v\), so that \(m = \min(u,v) = u\) and \(F(m) = F(u)\). Since \(F\) is a cdf, \(0 \leq F(u) \leq F(v) \leq 1\), and therefore \[F(m) - F(u)F(v) = F(u)[1 - F(v)] \geq 0\] with strict inequality whenever \(F(u) > 0\) and \(F(v) < 1\). In that case the covariance is positive, and so \(F_n(u)\) and \(F_n(v)\) are positively correlated.
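The covariance identity can also be checked numerically. The following is a minimal simulation sketch, not part of the assigned solution. It assumes standard normal data and arbitrarily chosen values \(n = 20\), \(u = -0.5\), \(v = 0.8\) (none of these come from the problem) and compares a Monte Carlo estimate of the covariance with the closed form.

```r
# Monte Carlo check of Cov[F_n(u), F_n(v)] = (F(min(u, v)) - F(u) F(v)) / n.
# Assumptions for illustration only: standard normal data, n = 20,
# u = -0.5, v = 0.8, and 20000 replications.
set.seed(1)
n <- 20
u <- -0.5
v <- 0.8
reps <- 20000

# Each column of x is one simulated sample of size n.
x <- matrix(rnorm(n * reps), nrow = n)
Fn_u <- colMeans(x <= u)  # F_n(u) in every replication
Fn_v <- colMeans(x <= v)  # F_n(v) in every replication

cov_sim <- cov(Fn_u, Fn_v)
cov_theory <- (pnorm(min(u, v)) - pnorm(u) * pnorm(v)) / n

print(c(simulated = cov_sim, theoretical = cov_theory))
```

The two printed values should agree up to Monte Carlo error, and both should be positive.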

6.

Various chemical tests were conducted on Beeswax by White, Riethof, and Kushmir. In particular, the percentage of hydrocarbons in each sample of wax was determined.

Plot the ecdf, a histogram, and a normal probability plot of the percentages of hydrocarbons given in the following table. Find the .90, .75, .50, .25, and .10 quantiles. Does this distribution appear Gaussian?
The average percentage of hydrocarbons in microcrystalline wax (a synthetic commercial wax) is 85%. Suppose that beeswax was diluted with 1% microcrystalline wax. Could this be detected? What about 3% or a 5% dilution?

Solutions

library(ggplot2)
p6_data <- c(14.27,15.15,13.98,15.40,14.04,14.10,13.75,14.23,14.80,
             13.98,14.47,14.68,13.68,15.47,14.87,14.44,12.28,
             14.90,14.65,13.33,15.31,13.73,15.28,14.57,17.09,15.91,
             14.73,14.41,14.32,13.65,14.43,15.10,14.52,15.18,
             14.19,13.64,15.02,13.96,12.92,15.63,14.49,15.21,14.77,
             14.01,14.57,15.56,13.83,14.56,14.75,14.30,14.92,15.49,
             15.38,13.66,15.03,14.41,14.62,15.47,15.13)
# Empirical cdf of the hydrocarbon percentages
Fn <- ecdf(p6_data)
Fn_values <- knots(Fn)  # jump points of the ecdf (distinct observed values)
plot(Fn, main = "ecdf")

qplot(p6_data)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(p6_data); qqline(p6_data)

# the quantiles are given here
quantile(p6_data, c(.10, .25, .5, .75, .9))

##    10%    25%    50%    75%    90% 
## 13.676 14.070 14.570 15.115 15.470

From the histogram and the normal probability plot above, the data appear to be approximately normal.
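For the dilution question, one rough way to judge detectability is to compare the shift in the mean hydrocarbon percentage with the spread of pure beeswax. The sketch below is a heuristic under stated assumptions, not the textbook's method. It assumes that hydrocarbon percentages mix linearly, so a fraction \(d\) of 85% wax moves a sample value \(x\) to \((1-d)x + 85d\), and it expresses the resulting shift in units of the sample standard deviation. The names `beeswax`, `dilution`, and `shift_in_sd` are illustrative.

```r
# Hydrocarbon percentages from the table above (59 observations).
beeswax <- c(14.27,15.15,13.98,15.40,14.04,14.10,13.75,14.23,14.80,
             13.98,14.47,14.68,13.68,15.47,14.87,14.44,12.28,
             14.90,14.65,13.33,15.31,13.73,15.28,14.57,17.09,15.91,
             14.73,14.41,14.32,13.65,14.43,15.10,14.52,15.18,
             14.19,13.64,15.02,13.96,12.92,15.63,14.49,15.21,14.77,
             14.01,14.57,15.56,13.83,14.56,14.75,14.30,14.92,15.49,
             15.38,13.66,15.03,14.41,14.62,15.47,15.13)

m <- mean(beeswax)
s <- sd(beeswax)

# Assumption: linear mixing with microcrystalline wax at 85% hydrocarbons.
dilution <- c(0.01, 0.03, 0.05)
diluted_mean <- (1 - dilution) * m + dilution * 85

# Shift of the mean, measured in standard deviations of pure beeswax.
shift_in_sd <- (diluted_mean - m) / s

print(data.frame(dilution, diluted_mean, shift_in_sd))
```

Under these assumptions the shift grows linearly with the dilution, so the 5% case moves the mean five times as far as the 1% case. A shift of several standard deviations would be easy to see, while a shift of less than one standard deviation would be hard to detect from a single sample.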

10.

Let \(X_1, X_2, ..., X_n\) be a sample from cdf \(F\) and denote the order statistics by \(X_{(1)}, ..., X_{(n)}\). We will assume that \(F\) is continuous with density \(f\). From a theorem we have that the density function of \(X_{(k)}\) is \[f_k(x) = n\dbinom{n-1}{k-1}[F(x)]^{k-1}[1 - F(x)]^{n-k}f(x)\]

Find the mean and variance of \(X_{(k)}\) for a sample from the uniform distribution on \([0,1]\).
Find the approximate mean and variance of \(Y_{(k)}\), the \(k\)-th order statistic of a sample of size \(n\) from \(F\).

Solutions

Since the \(X_i\) are \(Unif(0,1)\), it follows that \(F(x) = x\) and \(f(x) = 1\) on \([0,1]\), and so the density of \(X_{(k)}\) simplifies to \[f_k(x) = n\dbinom{n-1}{k-1}x^{k-1}(1-x)^{n-k}, \;\; 0 \leq x \leq 1\] and we wish to find \[E[X_{(k)}] = \int_{0}^{1} xn\dbinom{n-1}{k-1}x^{k-1}(1-x)^{n-k}dx = \int_{0}^{1} n\dbinom{n-1}{k-1}x^{k}(1-x)^{n-k}dx\] To make the integrand look like a density of the form \(f_k(x)\), let \(l = k + 1\) and \(s = n+1\). Then we have \[E[X_{(k)}] = \int_{0}^{1} (s-1)\dbinom{s-2}{l-2}x^{l-1}(1-x)^{s-l}dx\] Using the identity \((s-1)\dbinom{s-2}{l-2} = (l-1)\dbinom{s-1}{l-1}\) and then multiplying and dividing by \(s\), we get \[E[X_{(k)}] = \frac{l-1}{s}\int_{0}^{1}s\dbinom{s-1}{l-1}x^{l-1}(1-x)^{s-l}dx\] The integrand is the density of the \(l\)-th order statistic of a uniform sample of size \(s\), which integrates to 1, and so \[E[X_{(k)}] = \frac{l-1}{s} = \frac{k + 1 - 1}{n + 1} = \frac{k}{n+1}\]

To find the variance we use \(Var(X_{(k)}) = E(X_{(k)}^2) - [E(X_{(k)})]^2\). We already have \(E(X_{(k)})\) above, so it remains to find \(E(X_{(k)}^2)\). The same argument, now with the substitutions \(l = k+2\) and \(s = n+2\), yields \[E(X_{(k)}^2) = \frac{k(k+1)}{(n+1)(n+2)}\] and thus \[Var(X_{(k)}) = \frac{k(k+1)}{(n+1)(n+2)} - \frac{k^2}{(n+1)^2} = \frac{1}{n+2}\Big(\frac{k}{n+1}\Big)\Big(1 - \frac{k}{n+1}\Big)\]
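As a sanity check on these two formulas, the following simulation sketch draws uniform samples and compares the empirical mean and variance of the \(k\)-th order statistic with the closed forms. The values \(n = 10\), \(k = 3\), and the number of replications are arbitrary choices for illustration and are not part of the problem.

```r
# Simulation check of E[X_(k)] = k / (n + 1) and
# Var[X_(k)] = (1 / (n + 2)) * (k / (n + 1)) * (1 - k / (n + 1))
# for Unif(0, 1) samples.
# Assumptions for illustration only: n = 10, k = 3, 20000 replications.
set.seed(1)
n <- 10
k <- 3
reps <- 20000

# k-th smallest value in each simulated sample
x_k <- replicate(reps, sort(runif(n))[k])

p <- k / (n + 1)
mean_theory <- p
var_theory <- p * (1 - p) / (n + 2)

print(c(mean_sim = mean(x_k), mean_theory = mean_theory))
print(c(var_sim = var(x_k), var_theory = var_theory))
```

The simulated and theoretical values should agree up to Monte Carlo error.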

11.

Calculate the hazard function for \[F(t) = 1 - e^{-\alpha t^\beta} \;\; t\geq 0\]

Solution

Differentiating the cdf gives the density function \[f(t) = \frac{d}{dt}F(t) = \alpha\beta e^{-\alpha t^\beta}t^{\beta-1}\] The hazard function is defined as \(h(t) = f(t)/[1 - F(t)]\), and here \(1 - F(t) = e^{-\alpha t^\beta}\), so \[h(t) = \frac{\alpha\beta e^{-\alpha t^\beta}t^{\beta-1}}{e^{-\alpha t^\beta}} = \alpha\beta t^{\beta - 1}\]
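The closed form can be checked against the definition \(h(t) = f(t)/[1 - F(t)]\) on a grid of time points. This is a small sketch. The parameter values \(\alpha = 2\) and \(\beta = 1.5\) and the function names are arbitrary illustrative choices.

```r
# Hazard h(t) = f(t) / (1 - F(t)) for F(t) = 1 - exp(-alpha * t^beta).
# alpha = 2 and beta = 1.5 are arbitrary illustrative values.
alpha <- 2
beta <- 1.5

cdf_fn <- function(t) 1 - exp(-alpha * t^beta)
dens_fn <- function(t) alpha * beta * t^(beta - 1) * exp(-alpha * t^beta)

# Hazard computed directly from the definition
hazard_ratio <- function(t) dens_fn(t) / (1 - cdf_fn(t))

# Hazard from the derived closed form
hazard_closed <- function(t) alpha * beta * t^(beta - 1)

t_grid <- seq(0.1, 2, by = 0.1)
max_abs_diff <- max(abs(hazard_ratio(t_grid) - hazard_closed(t_grid)))
print(max_abs_diff)
```

The printed difference should be zero up to floating-point rounding.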

15.

A prisoner is told that he will be released at a time chosen uniformly at random within the next 24 hours. Let T denote the time that he is released. What is the hazard function for T? For what values of t is it smallest and largest? If he has been waiting for 5 hours, is it more likely that he will be released in the next few minutes than if he has been waiting for 1 hour?

Solution

From the information given above we have that \(T\sim Unif(0,24)\), and so for \(0 \leq t \leq 24\) \[f(t) = 1/24 \;\;\text{ and }\;\; F(t) = t/24\] It follows that the hazard function is given by \[h(t) = \frac{f(t)}{1 - F(t)} = \frac{1/24}{1 - t/24} = \frac{1}{24 - t}, \;\; 0 \leq t < 24\]

library(ggplot2)
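# Hazard function of T ~ Unif(0, 24)
haz <- function(t) {
  return(1 / (24 - t))
}

# Stop at t = 23 because the hazard diverges as t approaches 24
t <- seq(0, 23, by = .01)
qplot(t, haz(t), geom = 'line')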

We see that the hazard function is increasing in \(t\): it is smallest at \(t = 0\), where \(h(0) = 1/24\), and it grows without bound as \(t\) approaches 24. Since \(h(5) = 1/19 > h(1) = 1/23\), if he has been waiting for 5 hours it is more likely that he will be released in the next few minutes than if he had been waiting for only 1 hour.
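The comparison can be made concrete by computing the conditional probability of release within a short window, \(P(t < T \leq t + \delta \mid T > t)\), directly from the uniform cdf. The sketch below assumes a window of \(\delta = 5\) minutes as one reading of "the next few minutes". The function name `release_prob` is illustrative.

```r
# P(released within the next delta hours | still waiting at time t)
# for T ~ Unif(0, 24). delta = 5 minutes is an illustrative choice.
release_prob <- function(t, delta) {
  (punif(t + delta, 0, 24) - punif(t, 0, 24)) / (1 - punif(t, 0, 24))
}

delta <- 5 / 60
p_after_1h <- release_prob(1, delta)
p_after_5h <- release_prob(5, delta)

print(c(after_1h = p_after_1h, after_5h = p_after_5h))
```

For the uniform distribution this probability equals \(\delta/(24 - t)\), which is the hazard \(h(t)\) multiplied by the window length, so it is larger after 5 hours than after 1 hour.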

29.

Of the 26 measurements of the heat of sublimation of platinum, 5 are outliers (see Figure 10.10). Let N denote the number of these outliers that occur in a bootstrap sample (sample with replacement) of the 26 measurements.

Explain why the distribution of \(N\) is binomial
Find \(P(N \leq 10)\)
In 1000 bootstrap samples, how many would you expect to contain 10 or more of these outliers?
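One way to approach these parts numerically is sketched below. It is not a worked solution from the source. It assumes that each of the 26 draws in a bootstrap sample independently picks one of the 5 outliers with probability \(5/26\), which is what makes \(N\) binomial with parameters 26 and \(5/26\). The platinum measurements themselves are not needed, so the simulation uses a hypothetical 0/1 indicator vector (`is_outlier`) in their place.

```r
# N ~ Binomial(26, 5/26): each of the 26 draws with replacement is
# independently one of the 5 outliers with probability 5/26.
n_draws <- 26
p_outlier <- 5 / 26

p_le_10 <- pbinom(10, n_draws, p_outlier)     # P(N <= 10)
p_ge_10 <- 1 - pbinom(9, n_draws, p_outlier)  # P(N >= 10)
expected_count <- 1000 * p_ge_10              # expected samples with N >= 10

# Bootstrap check with a stand-in indicator vector:
# 1 marks an outlier, 0 marks any other measurement.
set.seed(1)
is_outlier <- rep(c(1, 0), times = c(5, 21))
n_sim <- replicate(1000, sum(sample(is_outlier, replace = TRUE)))
observed_count <- sum(n_sim >= 10)

print(c(p_le_10 = p_le_10, p_ge_10 = p_ge_10))
print(c(expected = expected_count, observed = observed_count))
```

The observed count from 1000 simulated bootstrap samples should be close to the binomial expectation, with some sampling variation from run to run.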

Enrique Rodriguez

January 12, 2016