Going over Imbens and Kalyanaraman 2012 - MSE Approximation and Optimal Bandwidth

Author

Dor Leventer

Published

September 18, 2022

This document goes over the technical details in Imbens and Kalyanaraman (2012). All errors are my own.

# load the packages used throughout (dplyr, purrr, tibble and ggplot2 are all part of the tidyverse)
library(tidyverse)

# Set parameters for number of iterations and observations in simulations
N = 1000
num_it = 100

Setup and Estimand

Setup

Consider a sample of N units with running variable X and outcome Y. Define potential outcomes of treatment and control as Y(1) and Y(0). Define the treatment assignment mechanism W=\mathbf{1}(X\geq c) for some known cutoff c. The function \mathbf{1}(.) is the indicator function. This defines how treatment, potential and observed outcomes are connected.

\begin{equation} Y=\begin{cases} Y(1) & W=1\leftrightarrow X\geq c\\ Y(0) & W=0\leftrightarrow X<c \end{cases} \end{equation}

We also define the expectation of the observed outcome conditional on the running variable, m(x)=\mathbb{E}[Y_i \mid X_i=x].

Estimand

The causal parameter of interest in the sharp RD design, i.e. the defined treatment assignment mechanism, is the average treatment effect local to the cutoff

\tau = \mathbb{E}[Y_i(1)-Y_i(0) \mid X_i = c]

Identification and Estimation

Identification

If we assume that:

  1. The conditional distribution functions F_{Y(w) \mid X}(y \mid x) for w\in\{0,1\} are continuous in x around c for all y,
  2. The conditional first moments \mathbb{E}[Y_i(w) \mid X_i = x] for w\in\{0,1\} exist and are continuous in x around c,

then \tau=\mu_{+}-\mu_{-}\quad\text{where}\quad \mu_{+}=\lim_{x\searrow c}m\left(x\right)\quad\text{and}\quad\mu_{-}=\lim_{x\nearrow c}m\left(x\right)

Estimation

We estimate the limits of m(x) from above and from below c, i.e. \mu_{+} and \mu_{-}, via local weighted least squares.

  • Local, as in we use only observations around the cutoff, defined via the bandwidth h. The choice of the bandwidth h is the focus of this paper.
  • Weighted, as in we give these observations weights as a function of their distance from the cutoff, defined via the kernel K(.).

Kernels

Definition (Kernel)

A kernel is a real-valued function K:\mathbb{R}\rightarrow\mathbb{R} that satisfies the following three properties.

  1. Non-negativity, i.e., K(u)\geq 0 for all u\in\mathbb{R}.
  2. Symmetry, i.e., K(u)=K(-u) for all u\in\mathbb{R}.
  3. Normalization, i.e., \int_{-\infty}^{\infty}K(u)du=1.

Three examples of often used kernels are

  • The uniform kernel gives the same weight to all observations in the support: for u\in[-1,1], K_{uniform}(u)=0.5, and otherwise it gives weight zero.
    • One can check that this satisfies the above definition. Obviously, K_{uniform}(u)=0.5 satisfies non-negativity and symmetry. Let's check normalization:

\int_{-1}^{1}0.5du = \left[0.5u\right]_{-1}^{1} = 0.5-\left(-0.5\right) = 1

  • The edge or triangular kernel gives more weight to observations closer to the cutoff: for u\in[-1,1], K_{triangular}(u) = 1 - |u|, and otherwise it gives weight zero.
  • The Epanechnikov kernel also gives more weight to observations closer to the cutoff, but with a more "bell" shaped curve than the triangular kernel: for u\in[-1,1], K_{Epanechnikov}(u) = \frac{3}{4}(1-u^2), and otherwise it gives weight zero.
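
The same check works for the other two kernels. For instance, for the triangular kernel, non-negativity and symmetry are again immediate, and normalization follows from symmetry:

\int_{-1}^{1}\left(1-\left|u\right|\right)du=2\int_{0}^{1}\left(1-u\right)du=2\left[u-\frac{u^{2}}{2}\right]_{0}^{1}=2\times\frac{1}{2}=1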

Let's visualize these three examples.

# define each kernel
k_uniform <- function(u) {1*(u >= -1 & u <= 1) * 0.5}
k_triangular <- function(u) {1*(u >= -1 & u <= 1) * (1 - abs(u))}
k_epanechnikov <- function(u) {1*(u >= -1 & u <= 1) * (3/4) * (1 - u^2)}

ggplot() +
  # define support limits
  xlim(-1.5, 1.5) +
  # add each function, use colors for legend
  geom_function(aes(color = "Uniform"), fun = k_uniform) +
  geom_function(aes(color = "Triangular"), fun = k_triangular) +
  geom_function(aes(color = "Epanechnikov"), fun = k_epanechnikov) +
  # some cosmetics
  scale_color_brewer(name = NULL, palette = "Dark2") + 
  theme_bw() + 
  labs(x = "Support", y = "Weight")

Local Weighted Least Squares Regression

Given a choice of bandwidth h and a kernel K(.), so that the weight K\left(\frac{X_{i}-c}{h}\right) is positive only for observations within h of the cutoff, we estimate two weighted regressions, one from each side of the cutoff

\begin{align*} \left(\widehat{\alpha}_{-},\widehat{\beta}_{-}\right) & =\arg\min_{\alpha,\beta}\sum_{i=1}^{N}\mathbf{1}\left(X_{i}<c\right)\times\left(Y_{i}-\alpha-\beta\left(X_{i}-c\right)\right)^{2}\times K\left(\frac{X_{i}-c}{h}\right)\\ \left(\widehat{\alpha}_{+},\widehat{\beta}_{+}\right) & =\arg\min_{\alpha,\beta}\sum_{i=1}^{N}\mathbf{1}\left(X_{i}\geq c\right)\times\left(Y_{i}-\alpha-\beta\left(X_{i}-c\right)\right)^{2}\times K\left(\frac{X_{i}-c}{h}\right) \end{align*}

We set the estimated expectation function as \widehat{\mu}_{-} = \widehat{\alpha}_{-} and \widehat{\mu}_{+} = \widehat{\alpha}_{+}, which gives us our estimator \widehat{\tau}=\widehat{\mu}_{+}-\widehat{\mu}_{-}.
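
To make this concrete, here is a minimal sketch (not the code used in the simulations below) that runs the two kernel-weighted regressions with lm() and returns \widehat{\tau}; the data frame with columns X and Y, the cutoff c and the bandwidth h are assumed inputs, and k_triangular is the kernel defined above.

# sketch: sharp RD estimate via two kernel-weighted local linear regressions
# (assumes a data frame with columns X and Y, a cutoff c, a bandwidth h and a kernel)
rd_estimate <- function(df, c = 0, h, kernel = k_triangular) {
  df$w <- kernel((df$X - c) / h)
  below <- df[df$X <  c & df$w > 0, ]
  above <- df[df$X >= c & df$w > 0, ]
  fit_minus <- lm(Y ~ I(X - c), data = below, weights = w)
  fit_plus  <- lm(Y ~ I(X - c), data = above, weights = w)
  # the two intercepts estimate the boundary values mu_minus and mu_plus
  unname(coef(fit_plus)[1] - coef(fit_minus)[1])
}

# toy usage with a made-up DGP: a constant jump of 1 at the cutoff c = 0
set.seed(1)
toy <- data.frame(X = rnorm(1000))
toy$Y <- 2 + toy$X + 1 * (toy$X >= 0) + rnorm(1000, sd = 0.5)
rd_estimate(toy, c = 0, h = 0.5)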

MSE Optimization

Definition (MSE)

Define the MSE for the estimand and estimator as a function of the bandwidth, for a given kernel, as follows.

MSE(h) = \mathbb{E}[(\widehat{\tau} - \tau)^2] = \mathbb{E}[((\widehat{\mu}_{+}-\mu_{+}) - (\widehat{\mu}_{-} - \mu_{-}))^2]

Our goal is to find a solution for the bandwidth that minimizes the MSE. To do this, we will consider a first order approximation. Before stating the final proposition, we begin with some helpful lemmas.

First Lemma

Suppose we are looking at observations only above the cutoff c. Without loss of generality, normalize the cutoff such that c=0.

Assumptions

  1. We have N pairs (Y_i, X_i) i.i.d with X_i\geq 0.
  2. m(x) = \mathbb{E}[Y_i \mid X_i=x] is three times continuously differentiable.
  3. The density of X, f(x), is continuously differentiable at x=0, with f(0)>0.
  4. The conditional variance \sigma^2(x)=Var(Y \mid X=x)>0 is bounded and continuous at x=0.
  5. We have a kernel K:\mathbb{R}_+\rightarrow \mathbb{R}, with K(u)=0 for u\geq\bar{u}, and \int_{0}^{\bar{u}}K(u)du=1.

Define K_h(u)=K(u/h)/h, and \mu=m(0).

Lemma Statement

Lemma 1 (Kernel Weighted Sample Moments)

Define F_j = N^{-1}\sum_{i=1}^NK_h(X_i)X_i^j. Under the above assumptions 1-5 F_j=h^jf(0)v_j+o_p(h^j) where v_j=\int_{0}^{\infty} t^j K(t) dt.

O-notation

Reminder (little-o and big-O notation)

We remind that the notation o_p(.) describes the rate of convergence of a function. We read f(n)=o_p(g_n) as "the function f is dominated by g at the limit". A little more technically, we say f(n)=o_p(g_n) if \lim_{n\rightarrow\infty}f(n)/g(n)=0. As examples, the function f(n)=2n^3 is o_p(n^4), as well as o_p(n^5).

In what follows, we will also consider convergence as the bandwidth h tends towards zero. That is, for some function f(h), we say f(h)=o_p(g_h) if \lim_{h\rightarrow 0}f(h)/g(h)=0. As an example, the function f(h)=2h^3 is o_p(h^2).

Once we have little-o notation down, big-O notation is very similar: it captures the highest order of the function. As an example, the function f(n)=2n^3+n-6=O_p(n^3). We read f(n)=O_p(g_n) as "the function f is of the order of g at the limit". Technically, f(n)=O_p(g_n) means that there exist a large N and some constant M such that |f(n)|\leq M\times g(n) for all n\geq N.
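
As a quick numerical illustration of these definitions (a small sketch using made-up sequences, not quantities from the paper):

# little-o: f(n) = 2n^3 is o(n^4), so the ratio f(n) / n^4 vanishes as n grows
n_seq <- 10^(1:6)
round((2 * n_seq^3) / n_seq^4, 6)
# as h -> 0: f(h) = 2h^3 is o(h^2), since the ratio 2h^3 / h^2 = 2h vanishes
h_seq <- 10^(-(1:6))
round((2 * h_seq^3) / h_seq^2, 8)
# big-O: f(n) = 2n^3 + n - 6 is O(n^3), so the ratio stays bounded (it tends to 2)
round((2 * n_seq^3 + n_seq - 6) / n_seq^3, 6)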

Proof

Proof. We begin by breaking F_j apart into an expectation and a variance component. Since the X_i, and hence the K_h(X_i), are i.i.d. random variables, we get F_j = \mathbb{E}[F_j] + O_p(\mathbb{V}(F_j)^{1/2}) (one can take this to mean that a realization of a random variable is typically within a bounded number of standard deviations of its expectation). We begin by considering the expectation. Some algebra and the definition of K_h(.) produce

\begin{align*} \mathbb{E}\left[F_{j}\right] & =\int_{0}^{\infty}N^{-1}\sum_{i=1}^{N}K_{h}\left(X_{i}\right)X_{i}^{j}f(x)dx\\ & =N^{-1}\int_{0}^{\infty}\sum_{i=1}^{N}K_{h}\left(X_{i}\right)X_{i}^{j}f(x)dx\\ & =N^{-1}\sum_{i=1}^{N}\int_{0}^{\infty}K_{h}\left(x\right)x^{j}f(x)dx\\ & =\int_{0}^{\infty}K_{h}\left(x\right)x^{j}f(x)dx\\ & =\int_{0}^{\infty}h^{-1}K\left(x/h\right)x^{j}f(x)dx \end{align*}

Using a change of variables z=x/h, which produces x=hz and dx=hdz, gives us

\begin{align*} \mathbb{E}\left[F_{j}\right] & =\int_{0}^{\infty}h^{-1}K\left(z\right)z^{j}h^{j}f(zh)hdz\\ & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f(zh)dz \end{align*}

Adding and subtracting f(0) we get

\begin{align*} \mathbb{E}\left[F_{j}\right] & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}\left(f(zh)+f\left(0\right)-f\left(0\right)\right)dz\\ & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f\left(0\right)dz+h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}\left(f(zh)-f\left(0\right)\right)dz\\ & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f\left(0\right)dz+h^{j+1}\int_{0}^{\infty}K\left(z\right)z^{j+1}\frac{\left(f(zh)-f\left(0\right)\right)}{zh}dz \end{align*}

We can simplify the left element as

\begin{align*} h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f\left(0\right)dz & =h^{j}f\left(0\right)\int_{0}^{\infty}K\left(z\right)z^{j}dz\\ & =h^{j}f\left(0\right)v_{j} \end{align*}

The right element is of order h^{j+1}: since f is continuously differentiable at zero, the difference quotient \frac{f(zh)-f(0)}{zh} is bounded for small h, so the second integral is O_{p}\left(h^{j+1}\right) and hence o_{p}\left(h^{j}\right). Hence we get \mathbb{E}\left[F_{j}\right]=h^{j}f\left(0\right)v_{j}+O_{p}\left(h^{j+1}\right)=h^jf(0)v_j + o_p(h^j)

We continue with the variance of F_j.

Since the X_i are i.i.d., the variance of the sample average is the average of the individual variances, and each variance is bounded by the corresponding second moment: \begin{align*} \mathbb{V}\left(F_{j}\right) & =N^{-2}\sum_{i=1}^{N}\mathbb{V}\left(K_{h}\left(X_{i}\right)X_{i}^{j}\right)\\ & \leq N^{-2}\sum_{i=1}^{N}\mathbb{E}\left[\left(K_{h}\left(X_{i}\right)\right)^{2}X_{i}^{2j}\right]\\ & =N^{-1}\mathbb{E}\left[\left(K_{h}\left(X_{i}\right)\right)^{2}X_{i}^{2j}\right] \end{align*}

Using the definition of K_h(.) we get \begin{align*} \mathbb{V}\left(F_{j}\right) & \leq N^{-1}\mathbb{E}\left[\left(h^{-1}K\left(X_{i}/h\right)\right)^{2}X_{i}^{2j}\right]\\ & =\frac{1}{Nh^{2}}\mathbb{E}\left[\left(K\left(X_{i}/h\right)\right)^{2}X_{i}^{2j}\right]\\ & =\frac{1}{Nh^{2}}\int_{0}^{\infty}\left(K\left(x/h\right)\right)^{2}x^{2j}f(x)dx \end{align*}

Again using a change of variables z=x/h we get

\begin{align*} \mathbb{V}\left(F_{j}\right) & \leq\frac{1}{Nh^{2}}\int_{0}^{\infty}\left(K\left(z\right)\right)^{2}z^{2j}h^{2j}f(zh)hdz\\ & =\frac{h^{2j-1}}{N}\int_{0}^{\infty}\left(K\left(z\right)\right)^{2}z^{2j}f(zh)dz\\ & =O_{p}\left(\frac{h^{2j-1}}{N}\right)=o_{p}\left(h^{2j}\right) \end{align*}

where the last equality uses that f is bounded around zero and that Nh\rightarrow\infty.

Putting everything together, and noting that \mathbb{V}\left(F_{j}\right)^{1/2}=O_{p}\left(h^{j-1/2}N^{-1/2}\right)=o_{p}\left(h^{j}\right), we get \begin{align*} F_{j} & =\mathbb{E}\left[F_{j}\right]+O_{p}\left(\mathbb{V}\left(F_{j}\right)^{1/2}\right)\\ & =h^{j}f\left(0\right)v_{j}+o_{p}\left(h^{j}\right)+o_{p}\left(h^{j}\right)\\ & =h^{j}f\left(0\right)v_{j}+o_{p}\left(h^{j}\right)\\ & =h^{j}\left(f\left(0\right)v_{j}+o_{p}\left(1\right)\right) \end{align*}

Completing the proof.

What did we just prove?

Let's pick a distribution for the running variable and examine the proven statement in a simulated sample.

set.seed(1)
# distribution parameters
N <- 1000
# sample from the distribution
X <- rnorm(n = N, mean = 0, sd = 1)
# show distribution in histogram
hist(X)

We continue with observing the assigned kernel weights.

# define a kernel
k_triangular <- function(u) {
  1*(u >= -1 & u <= 1) * (1 - abs(u))
}
# define K_h
kernel_h <- function(u, h, kernel) {
  (1/h) * kernel(u/h)
}
# put this into a data-frame
df <- X |> 
  as.data.frame() |> 
  mutate(W = 1*(X >= 0))
# define bandwidth
h <- df$X |> sd()
# add kernel weights
df <- df |> 
  mutate(K_h = kernel_h(u = X, h = h, kernel = k_triangular))
# plot the weights by X
df |> 
  mutate(treatment = case_when(W == 0 ~ "Control", W == 1 ~ "Treatment")) |> 
  ggplot() + 
  # add points by X and treatment status
  geom_point(aes(x=X, y = treatment, color = K_h), 
             # need some jitter
             position = position_jitter(width = 0, height = 0.4), 
             # some cosmetics
             stroke = 2, shape = 16, alpha = .75) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  # add a diverging color series
  scale_color_gradientn(name = "Weights", 
                        colours = wesanderson::wes_palette(name = "Zissou1", type = "continuous")) + 
  theme_bw()

Finally, we consider the weighted moments.

### we remind that the lemma was proven only for observations above the cutoff
# keep only observations above the cutoff
df <- df |> filter(X >= 0)

# function to calculate the weighted moments
weighted_moments <- function(X, j, kernel, h) {
  N = length(X)
  K_h_X = kernel_h(u = X, h = h, kernel = kernel)
  X_j = X^j
  F_j = N^-1 * sum(K_h_X * X_j)
  return(F_j)
}
# lets calculate different moments
F_j_df <- map(0:5, weighted_moments, X = df$X, h = h, kernel = k_triangular) |> 
  unlist() |> 
  as.data.frame() |> 
  rename(F_j = 1) |> 
  mutate(j = row_number()-1)
F_j_df
       F_j j
1 0.342195 0
2 0.119008 1
3 0.060413 2
4 0.036576 3
5 0.024625 4
6 0.017816 5

Let's play around with this. We saw that F_j = h^j\left( f(0)v_j + o_p(1) \right), which can also be written as F_j / h^j = f(0)v_j + o_p(1), and rearranged to F_j / h^j - f(0)v_j = o_p(1). Let's see this.

# for a normal distribution with mean zero and sd 1, which is the underlying DGP in this example
f_0 = dnorm(x = 0)
# by definition
v_j_fun <- function(t, j, kernel) {t^j * kernel(t)}
v_j_int = function(j, kernel) {
  integrate(f = v_j_fun, lower = 0, upper = Inf, j = j, kernel = kernel)$value
}
F_j_df |> 
  # calculate other elements
  mutate(f_0 = f_0,
         v_j = map_dbl(0:5, v_j_int, kernel = k_triangular)) |> 
  # calculate final equation
  mutate(result = (F_j/h^j) - (f_0*v_j))
       F_j j     f_0      v_j    result
1 0.342195 0 0.39894 0.500000 0.1427239
2 0.119008 1 0.39894 0.166667 0.0485021
3 0.060413 2 0.39894 0.083333 0.0231600
4 0.036576 3 0.39894 0.050000 0.0130506
5 0.024625 4 0.39894 0.033333 0.0081686
6 0.017816 5 0.39894 0.023810 0.0055084

We can see that the calculated difference shrinks with j.

Second Lemma

Lemma Statement

Lemma 2 (Variance and Kernel Weighted Sample Moments)

Define G_j = N^{-1}\sum_{i=1}^N\left[(K_h(X_i))^2\times \sigma^2(X_i) \times X_i^j\right]. Under the above assumptions 1-5 G_{j}=h^{j-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{j}\left(1+o_{p}\left(1\right)\right) where \pi_{j}=\int_{0}^{\infty}t^{j}\left(K(t)\right)^{2}dt.

We leave the proof of this lemma aside and continue towards the main proposition. As a quick sanity check, the sketch below compares G_j with its first-order approximation, in the spirit of the Lemma 1 illustration above.
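
A minimal sketch of that check, reusing the half-normal sample df$X, the bandwidth h and the functions kernel_h and k_triangular from above, and assuming a constant conditional variance \sigma^2(x)=1 (so \sigma^2(0)=1); as in the Lemma 1 illustration, we plug in dnorm(0) for f(0).

# sketch: compute G_j and its first-order approximation h^(j-1) * sigma^2(0) * f(0) * pi_j
sigma2_0 <- 1  # assumed constant conditional variance
G_j_fun <- function(j, X, h, kernel) {
  K_h_X <- kernel_h(u = X, h = h, kernel = kernel)
  mean(K_h_X^2 * sigma2_0 * X^j)
}
pi_j_fun <- function(t, j, kernel) {t^j * kernel(t)^2}
pi_j_int <- function(j, kernel) {
  integrate(f = pi_j_fun, lower = 0, upper = Inf, j = j, kernel = kernel)$value
}
sapply(0:3, function(j) {
  c(j = j,
    G_j = G_j_fun(j, X = df$X, h = h, kernel = k_triangular),
    approx = h^(j - 1) * sigma2_0 * dnorm(0) * pi_j_int(j, kernel = k_triangular))
})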

Third Lemma: Bias, Variance, and MSE of Boundary Kernel Estimators

Lemma Statement

Lemma 3 (One-Sided MSE) We remind that we are considering only observations where the running variable X is above the cutoff, which we normalized to c=0. We take as inputs a bandwidth h and a kernel K. Estimate the expectation at the boundary by \left(\widehat{\mu},\widehat{\beta}\right)=\arg\min_{\mu,\beta}\sum_{i}\left(Y_{i}-\mu-\beta X_{i}\right)^{2}\times K_{h}\left(X_{i}\right)

The bias and variance of the estimator of the boundary are given by \begin{align*} Bias\left(\widehat{\mu}\right) & =\mathbb{E}\left[\widehat{\mu}\mid X\right]-\mu=C_{1}^{1/2}m^{\left(2\right)}\left(0\right)h^{2}+o_{p}\left(h^{2}\right)\\ \mathbb{V}\left(\widehat{\mu}\mid X\right) & =C_{2}\frac{\sigma^{2}\left(0\right)}{f\left(0\right)Nh}+o_{p}\left(\frac{1}{Nh}\right) \end{align*}

And this gives us the MSE \begin{align*} MSE\left(\widehat{\mu}\right) & =\mathbb{E}\left[\left(\widehat{\mu}-\mu\right)^{2}\mid X\right]=\left(Bias\left(\widehat{\mu}\right)\right)^{2}+\mathbb{V}\left(\widehat{\mu}\mid X\right)\\ & =\left(C_{1}^{1/2}m^{\left(2\right)}\left(0\right)h^{2}+o_{p}\left(h^{2}\right)\right)^{2}+C_{2}\frac{\sigma^{2}\left(0\right)}{f\left(0\right)Nh}+o_{p}\left(\frac{1}{Nh}\right)\\ & =C_{1}\left(m^{\left(2\right)}\left(0\right)\right)^{2}h^{4}+C_{2}\frac{\sigma^{2}\left(0\right)}{f\left(0\right)Nh}+o_{p}\left(h^{4}+\frac{1}{Nh}\right) \end{align*}
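
Here C_1 and C_2 are constants that depend only on the kernel. As the proof below makes explicit, C_{1}^{1/2}=\frac{1}{2}\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}, so that C_{1}=\frac{1}{4}\left(\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}\right)^{2}, and C_{2}=\frac{v_{2}^{2}\pi_{0}-2v_{1}v_{2}\pi_{1}+v_{1}^{2}\pi_{2}}{\left(v_{0}v_{2}-v_{1}^{2}\right)^{2}}, with v_j and \pi_j as defined in Lemmas 1 and 2.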

Proof: Setup

Define by R the N\times 2 matrix of covariates, which is made up of a constant 1 and the running variable X \begin{equation*} R=\left(\begin{array}{cc} 1 & X_{1}\\ \vdots & \vdots\\ 1 & X_{N} \end{array}\right)_{N\times2}=\left[\iota\;X\right] \end{equation*} where \iota is an N-length vector of ones. Also define by W the N\times N weighting matrix, with typical element \begin{equation*} W_{ij}=\begin{cases} K_{h}\left(X_{i}\right) & i=j\\ 0 & o.w. \end{cases} \end{equation*}

Lastly, we define the vector e_1 = (1, 0)^\prime. We can now write the estimator \widehat{\mu} as \widehat{m}\left(0\right)=\widehat{\mu}=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WY For intuition, consider the following. If we had wanted an unweighted estimation, i.e. OLS, we would calculate our estimators via \widehat{\beta}_{unweighted} = \left(R^{\prime}R\right)^{-1}R^{\prime}Y. In order to incorporate the weights, we add the W matrix, \widehat{\beta}_{weighted} = \left(R^{\prime}WR\right)^{-1}R^{\prime}WY. And lastly, in order to pick out the first coefficient of \widehat{\beta}_{weighted}, we multiply by the "choosing vector" e_1.
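
As a small sanity check of this formula, here is a sketch on made-up data (using kernel_h and k_triangular from above) showing that e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WY is exactly the intercept of a kernel-weighted least squares fit; the toy variable names are hypothetical.

# sketch: the matrix formula equals the intercept of a weighted lm() fit (cutoff normalized to 0)
set.seed(2)
x_toy <- abs(rnorm(500))
y_toy <- 1 + 2 * x_toy + rnorm(500, sd = 0.3)
h_toy <- 0.5
w_toy <- kernel_h(u = x_toy, h = h_toy, kernel = k_triangular)
R_toy <- cbind(1, x_toy)
W_toy <- diag(w_toy)
e_1   <- c(1, 0)
drop(e_1 %*% solve(t(R_toy) %*% W_toy %*% R_toy) %*% t(R_toy) %*% W_toy %*% y_toy)
coef(lm(y_toy ~ x_toy, weights = w_toy))[1]  # same value, up to numerical precision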

Let's consider the bias of our estimator, which we write as Bias\left(\widehat{m}\left(0\right)\right)=\mathbb{E}\left[\widehat{m}\left(0\right)\mid X\right]-m\left(0\right)

To calculate this bias, we begin with writing out \mathbb{E}\left[\widehat{m}\left(0\right)\mid X\right]=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WM

where M=\left(m\left(X_{1}\right),...,m\left(X_{N}\right)\right). To approximate this expression, we use the assumption that m(x)=\mathbb{E}[Y \mid X=x] is thrice differentiable, and hence a Taylor expansion can be written as

m\left(X_{i}\right)=m\left(0\right)+m^{\left(1\right)}\left(0\right)X_{i}+m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2}+T_i

where we denote by T_i the Taylor expansion remainder, so that we can write an equality (and not an approximation). This element is bounded by \left|T_{i}\right| \leq \max_{c\in supp(X)}\left|m^{\left(3\right)}\left(c\right)\right|\times\left|X_{i}^{3}\right|

Taylor Expansion (and remainder bounds)

For those who don't feel at home with Taylor polynomials and bounding their errors, we provide a little more depth on the above argument. Define the expectation of a variable Y conditional on a variable X as m\left(x\right)=\mathbb{E}\left[Y\mid X=x\right]. Assuming this function is thrice differentiable, we can approximate it by a Taylor expansion using the first two derivatives as follows.

m\left(X_{i}\right)\approx m\left(0\right)+m^{\left(1\right)}\left(0\right)X_{i}+m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2}

For equality, add in a remainder term m\left(X_{i}\right)=m\left(0\right) + m^{\left(1\right)}\left(0\right)X_{i} + m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2} + T_{i}

Now say we want to bound T_{i}. For a degree-two Taylor expansion, one can use the Lagrange form of the remainder, T_{i}=\frac{m^{\left(3\right)}\left(c\right)}{3!}X_{i}^{3} for some c between 0 and X_{i}.

If we try to bound this, we can set \begin{align*} \left|T_{i}\right| & =\frac{1}{6}\left|m^{\left(3\right)}\left(c\right)X_{i}^{3}\right|\\ & =\frac{1}{6}\left|m^{\left(3\right)}\left(c\right)\right|\times\left|X_{i}^{3}\right|\\ & \leq\frac{1}{6}\times\max_{c\in supp(X)}\left|m^{\left(3\right)}\left(c\right)\right|\times\left|X_{i}^{3}\right|\\ & \leq\max_{c\in supp(X)}\left|m^{\left(3\right)}\left(c\right)\right|\times\left|X_{i}^{3}\right| \end{align*}
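
As a small numerical check of this bound, here is a sketch using the cubic outcome function that reappears in the simulations below, for which m^{(3)}(x)=-1.2 everywhere (so the 1/6 version of the bound holds with equality):

# degree-two Taylor expansion of m(x) = 10 + x + 0.2x^2 - 0.2x^3 around 0,
# and the remainder bound |T_i| <= max|m'''| * |x|^3 / 6 (here |m'''| = 1.2)
m_fun    <- function(x) 10 + x + 0.2 * x^2 - 0.2 * x^3
taylor_2 <- function(x) m_fun(0) + 1 * x + 0.4 * x^2 / 2   # m'(0) = 1, m''(0) = 0.4
x_vals <- c(0.1, 0.5, 1)
cbind(x         = x_vals,
      remainder = m_fun(x_vals) - taylor_2(x_vals),  # the actual T_i = -0.2 * x^3
      bound     = 1.2 * abs(x_vals)^3 / 6)           # equals 0.2 * |x|^3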

Let's continue with the proof. Denote by S an N\times 1 vector with typical element S_i=m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2}, and by T the N\times 1 vector with typical element i equal to T_i as defined above. Using this notation, we can re-write the Taylor expansion for a single X_i as

\begin{equation*} m\left(X_{i}\right)=(1\;X_{i})\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right) + S_{i}+T_{i} \end{equation*}

and for the whole matrix M we can write

\begin{equation*} M=R\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right)+S+T \end{equation*}

Returning to the conditional expectation of our estimate, we can now write \begin{align*} \mathbb{E}\left[\widehat{m}\left(0\right)\mid X\right] & =e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WM\\ & =e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(R\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right)+S+T\right)\\ & =e_{1}^{\prime}\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right)+e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right)\\ & =m\left(0\right)+e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right) \end{align*}

Substituting this into the bias equation gives us Bias\left(\widehat{m}\left(0\right)\right)=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right)

Proof: Bias First Element

Let's consider each of these in turn. We'll start with e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WT. Consider the inverse element separately.

\begin{align*} R^{\prime}WR & =\left(\begin{array}{ccc} 1 & \cdots & 1\\ X_{1} & \cdots & X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & K_{h}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{cc} 1 & X_{1}\\ \vdots & \vdots\\ 1 & X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{cc} 1 & X_{1}\\ \vdots & \vdots\\ 1 & X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{cc} \sum K_{h}\left(X_{i}\right) & \sum K_{h}\left(X_{i}\right)X_{i}\\ \sum K_{h}\left(X_{i}\right)X_{i} & \sum K_{h}\left(X_{i}\right)X_{i}^{2} \end{array}\right)_{2\times2} \end{align*}

Dividing this by N gives us

\begin{align*} \frac{1}{N}R^{\prime}WR & =\left(\begin{array}{cc} N^{-1}\sum K_{h}\left(X_{i}\right) & N^{-1}\sum K_{h}\left(X_{i}\right)X_{i}\\ N^{-1}\sum K_{h}\left(X_{i}\right)X_{i} & N^{-1}\sum K_{h}\left(X_{i}\right)X_{i}^{2} \end{array}\right)_{2\times2}\\ & =\left(\begin{array}{cc} F_{0} & F_{1}\\ F_{1} & F_{2} \end{array}\right)_{2\times2} \end{align*}

The inverse of this matrix is \begin{align*} \left(\frac{1}{N}R^{\prime}WR\right)^{-1} & =\left(\begin{array}{cc} F_{0} & F_{1}\\ F_{1} & F_{2} \end{array}\right)^{-1}\\ &=\frac{1}{F_{0}F_{2}-\left(F_{1}\right)^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)\\ & =\left(\begin{array}{cc} \frac{F_{2}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} & -\frac{F_{1}}{F_{0}F_{2}-\left(F_{1}\right)^{2}}\\ -\frac{F_{1}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} & \frac{F_{0}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} \end{array}\right) \end{align*}

Going back to Lemma 1, we substitute F_{j}=h^{j}\left(f(0)v_{j}+o_{p}(1)\right). Let's start with the determinant \begin{align*} F_{0}F_{2}-F_{1}^{2} & =\left[f(0)v_{0}+o_{p}\left(1\right)\right]\times\left[h^{2}\left(f(0)v_{2}+o_{p}\left(1\right)\right)\right]-\left[h\left(f(0)v_{1}+o_{p}\left(1\right)\right)\right]^{2}\\ & =h^{2}\left[f(0)v_{0}+o_{p}\left(1\right)\right]\times\left[f(0)v_{2}+o_{p}\left(1\right)\right]-h^{2}\left[f(0)v_{1}+o_{p}\left(1\right)\right]^{2}\\ & =h^{2}\left[f(0)^{2}v_{0}v_{2}+o_{p}\left(1\right)\right]-h^{2}\left[f(0)^{2}v_{1}^{2}+o_{p}\left(1\right)\right]\\ & =h^{2}\left(f(0)^{2}\left[v_{0}v_{2}-v_{1}^{2}\right]+o_{p}\left(1\right)\right) \end{align*}

Treating the numerators of each entry similarly allows us to write \begin{align*} \left(\frac{1}{N}R^{\prime}WR\right)^{-1} & =\frac{1}{F_{0}F_{2}-F_{1}^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)\\ & =\frac{1}{h^{2}\left(f(0)^{2}\left[v_{0}v_{2}-v_{1}^{2}\right]+o_{p}\left(1\right)\right)}\left(\begin{array}{cc} h^{2}\left(f(0)v_{2}+o_{p}\left(1\right)\right) & -h\left(f(0)v_{1}+o_{p}\left(1\right)\right)\\ -h\left(f(0)v_{1}+o_{p}\left(1\right)\right) & f(0)v_{0}+o_{p}\left(1\right) \end{array}\right)\\ & =\left(\begin{array}{cc} \frac{v_{2}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right) & -\frac{1}{h}\left(\frac{v_{1}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right)\\ -\frac{1}{h}\left(\frac{v_{1}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right) & \frac{1}{h^{2}}\left(\frac{v_{0}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right) \end{array}\right)\\ & =\left(\begin{array}{cc} O_{p}\left(1\right) & O_{p}\left(\frac{1}{h}\right)\\ O_{p}\left(\frac{1}{h}\right) & O_{p}\left(\frac{1}{h^{2}}\right) \end{array}\right) \end{align*}

Lets turn to the second part. \begin{align*}R^{\prime}WT & =\left(\begin{array}{ccc} 1 & \cdots & 1\\ X_{1} & \cdots & X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & K_{h}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{c} T_{1}\\ \vdots\\ T_{N} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{c} T_{1}\\ \vdots\\ T_{N} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)T_{i}\\ \sum K_{h}\left(X_{i}\right)X_{i}T_{i} \end{array}\right)_{2\times 1} \end{align*}

The absolute value of this gives us \begin{align*} \left|R^{\prime}WT\right| & =\left(\begin{array}{c} \left|\sum K_{h}\left(X_{i}\right)T_{i}\right|\\ \left|\sum K_{h}\left(X_{i}\right)X_{i}T_{i}\right| \end{array}\right)\\ & =\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)\left|T_{i}\right|\\ \sum K_{h}\left(X_{i}\right)X_{i}\left|T_{i}\right| \end{array}\right)\\ & \leq\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)\left[\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\left|X_{i}^{3}\right|\right]\\ \sum K_{h}\left(X_{i}\right)X_{i}\left[\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\left|X_{i}^{3}\right|\right] \end{array}\right)\\ & =\left(\begin{array}{c} \sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\left[\sum K_{h}\left(X_{i}\right)\left|X_{i}^{3}\right|\right]\\ \sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\left[\sum K_{h}\left(X_{i}\right)X_{i}\left|X_{i}^{3}\right|\right] \end{array}\right)\\ & =\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} \left[\sum K_{h}\left(X_{i}\right)\left|X_{i}^{3}\right|\right]\\ \left[\sum K_{h}\left(X_{i}\right)X_{i}\left|X_{i}^{3}\right|\right] \end{array}\right)\\ & \sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} \left[\sum K_{h}\left(X_{i}\right)X_{i}^{3}\right]\\ \left[\sum K_{h}\left(X_{i}\right)X_{i}^{4}\right] \end{array}\right) \end{align*} where we got the last equality since we are only considering X\geq0. Adding N inverse we get \begin{align*} \left|\frac{1}{N}R^{\prime}WT\right| & =\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} \frac{1}{N}\left[\sum K_{h}\left(X_{i}\right)X_{i}^{3}\right]\\ \frac{1}{N}\left[\sum K_{h}\left(X_{i}\right)X_{i}^{4}\right] \end{array}\right)\\ & =\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} F_{3}\\ F_{4} \end{array}\right)\\ & \leq\left(\begin{array}{c} o_{p}\left(h^{2}\right)\\ o_{p}\left(h^{3}\right) \end{array}\right) \end{align*}

This finally allows us to write \begin{align*} e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WT & \leq\left(\begin{array}{cc} O_{p}\left(1\right) & O_{p}\left(\frac{1}{h}\right)\\ O_{p}\left(\frac{1}{h}\right) & O_{p}\left(\frac{1}{h^{2}}\right) \end{array}\right)\left(\begin{array}{c} o_{p}\left(h^{2}\right)\\ o_{p}\left(h^{3}\right) \end{array}\right)\\ & =e_{1}^{\prime}\left(\begin{array}{c} O_{p}\left(1\right)\times o_{p}\left(h^{2}\right)+O_{p}\left(\frac{1}{h}\right)\times o_{p}\left(h^{3}\right)\\ O_{p}\left(\frac{1}{h}\right)\times o_{p}\left(h^{2}\right)+O_{p}\left(\frac{1}{h^{2}}\right)\times o_{p}\left(h^{3}\right) \end{array}\right)\\ & =O_{p}\left(1\right)\times o_{p}\left(h^{2}\right)+O_{p}\left(\frac{1}{h}\right)\times o_{p}\left(h^{3}\right)\\ & =o_{p}\left(h^{2}\right) \end{align*}

Phew. Halfway there.

Proof: Bias Second Element

We now turn to consider e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WS. We already did some work, so we only need to find R^{\prime}WS.

\begin{align*} R^{\prime}WS & =\left(\begin{array}{ccc} 1 & \cdots & 1\\ X_{1} & \cdots & X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & K_{h}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{c} m^{\left(2\right)}\left(0\right)\frac{X_{1}^{2}}{2}\\ \vdots\\ m^{\left(2\right)}\left(0\right)\frac{X_{N}^{2}}{2} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{c} m^{\left(2\right)}\left(0\right)\frac{X_{1}^{2}}{2}\\ \vdots\\ m^{\left(2\right)}\left(0\right)\frac{X_{N}^{2}}{2} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2}\\ \sum K_{h}\left(X_{i}\right)X_{i}m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2} \end{array}\right)_{2\times1}\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)X_{i}^{2}\\ \sum K_{h}\left(X_{i}\right)X_{i}^{3} \end{array}\right) \end{align*}

Dividing by N we get

\begin{align*} \frac{1}{N}R^{\prime}WS & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{c} F_{2}\\ F_{3} \end{array}\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{c} h^{2}\left(f\left(0\right)v_{2}+o_{p}\left(1\right)\right)\\ h^{3}\left(f\left(0\right)v_{3}+o_{p}\left(1\right)\right) \end{array}\right) \end{align*}

Putting it together we have

\begin{align*}e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WS & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{cc} \frac{v_{2}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right) & -\frac{1}{h}\left(\frac{v_{1}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right)\end{array}\right)\left(\begin{array}{c} h^{2}\left(f\left(0\right)v_{2}+o_{p}\left(1\right)\right)\\ h^{3}\left(f\left(0\right)v_{3}+o_{p}\left(1\right)\right) \end{array}\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left[\frac{v_{2}h^{2}\left(f\left(0\right)v_{2}+o_{p}\left(1\right)\right)}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}-\frac{v_{1}h^{2}\left(f\left(0\right)v_{3}+o_{p}\left(1\right)\right)}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(h^{2}\right)\right]\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left[\frac{v_{2}^{2}h^{2}f\left(0\right)-v_{1}v_{3}h^{2}f\left(0\right)}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(h^{2}\right)\right]\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}h^{2}\left[\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}\right]+o_{p}\left(h^{2}\right) \end{align*}

where in the first equality e_{1}^{\prime} picks out the first row of \left(\frac{1}{N}R^{\prime}WR\right)^{-1}, and we used that the factors of N cancel, i.e. \left(R^{\prime}WR\right)^{-1}R^{\prime}WS=\left(\frac{1}{N}R^{\prime}WR\right)^{-1}\frac{1}{N}R^{\prime}WS.

All the above allows us to write the bias as equal to

\begin{align*} Bias\left(\widehat{m}\left(0\right)\right) & =e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}h^{2}\left[\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}\right]+o_{p}\left(h^{2}\right)+o_{p}\left(h^{2}\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}h^{2}\left[\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}\right]+o_{p}\left(h^{2}\right) \end{align*}

Proof: Variance

Define the N\times N matrix \Sigma with typical element \begin{equation*} \Sigma_{ij}=\begin{cases} \sigma^{2}\left(X_{i}\right) & i=j\\ 0 & o.w. \end{cases} \end{equation*}

The conditional variance of our estimator is equal to

\begin{equation*} \mathbb{V}\left(\widehat{m}\left(0\right)\mid X\right)=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\Sigma WR\left(R^{\prime}WR\right)^{-1}e_{1} \end{equation*}

Due to our work above, we only need to do some work on the middle element.

\begin{align*} R^{\prime}W & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\\ WR & =\left(\begin{array}{cc} K_{h}\left(X_{1}\right) & K_{h}\left(X_{1}\right)X_{1}\\ \vdots & \vdots\\ K_{h}\left(X_{N}\right) & K_{h}\left(X_{N}\right)X_{N} \end{array}\right)_{N\times2}\\ R^{\prime}W\Sigma WR & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{ccc} \sigma^{2}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & \sigma^{2}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{cc} K_{h}\left(X_{1}\right) & K_{h}\left(X_{1}\right)X_{1}\\ \vdots & \vdots\\ K_{h}\left(X_{N}\right) & K_{h}\left(X_{N}\right)X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right)\sigma^{2}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\sigma^{2}\left(X_{N}\right)\\ K_{h}\left(X_{1}\right)\sigma^{2}\left(X_{1}\right)X_{1} & \cdots & K_{h}\left(X_{N}\right)\sigma^{2}\left(X_{N}\right)X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{cc} K_{h}\left(X_{1}\right) & K_{h}\left(X_{1}\right)X_{1}\\ \vdots & \vdots\\ K_{h}\left(X_{N}\right) & K_{h}\left(X_{N}\right)X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{cc} \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right) & \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right)X_{i}\\ \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right)X_{i} & \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right)X_{i}^{2} \end{array}\right) \end{align*}

Dividing by N we get

\begin{align*} \frac{1}{N}R^{\prime}W\Sigma WR & =\left(\begin{array}{cc} G_{0} & G_{1}\\ G_{1} & G_{2} \end{array}\right) \end{align*}

We remind that we already found that \begin{align*} \left(\frac{1}{N}R^{\prime}WR\right)^{-1} & =\frac{1}{F_{0}F_{2}-F_{1}^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right) \end{align*}

And putting it all together \begin{align*} \mathbb{V}\left(\widehat{m}\left(0\right)\mid X\right) & =\frac{1}{N}e_{1}^{\prime}\frac{1}{F_{0}F_{2}-F_{1}^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)\left(\begin{array}{cc} G_{0} & G_{1}\\ G_{1} & G_{2} \end{array}\right)\frac{1}{F_{0}F_{2}-F_{1}^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)e_{1}\\ & =\frac{1}{N}e_{1}^{\prime}\frac{1}{\left(F_{0}F_{2}-F_{1}^{2}\right)^{2}}\left(\begin{array}{cc} F_{2}G_{0}-F_{1}G_{1} & F_{2}G_{1}-F_{1}G_{2}\\ -F_{1}G_{0}+F_{0}G_{1} & -F_{1}G_{1}+F_{0}G_{2} \end{array}\right)\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)e_{1}\\ & =e_{1}^{\prime}\frac{1}{N\left(F_{0}F_{2}-F_{1}^{2}\right)^{2}}\left(\begin{array}{cc} F_{2}\left[F_{2}G_{0}-F_{1}G_{1}\right]-F_{1}\left[F_{2}G_{1}-F_{1}G_{2}\right] & -F_{1}\left[F_{2}G_{0}-F_{1}G_{1}\right]+F_{0}\left[F_{2}G_{1}-F_{1}G_{2}\right]\\ F_{2}\left[-F_{1}G_{0}+F_{0}G_{1}\right]-F_{1}\left[-F_{1}G_{1}+F_{0}G_{2}\right] & -F_{1}\left[-F_{1}G_{0}+F_{0}G_{1}\right]+F_{0}\left[-F_{1}G_{1}+F_{0}G_{2}\right] \end{array}\right)e_{1}\\ & =\frac{F_{2}^{2}G_{0}-2F_{1}F_{2}G_{1}+F_{1}^{2}G_{2}}{N\left(F_{0}F_{2}-F_{1}^{2}\right)^{2}} \end{align*}

Applying Lemma 1 and Lemma 2 we use F_{j}=h^{j}f\left(0\right)v_{j}+o_{p}\left(h^{j}\right) and G_{j}=h^{j-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{j}+o_{p}\left(h^{j-1}\right) which leads to \begin{align*} F_{2}^{2}G_{0} & =\left(h^{2}f\left(0\right)v_{2}+o_{p}\left(h^{2}\right)\right)^{2}\left(h^{-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{0}+o_{p}\left(h^{-1}\right)\right)\\ & =\left(h^{4}f^{2}\left(0\right)v_{2}^{2}+o_{p}\left(h^{4}\right)\right)\left(h^{-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{0}+o_{p}\left(h^{-1}\right)\right)\\ & =h^{3}f^{3}\left(0\right)v_{2}^{2}\sigma^{2}\left(0\right)\pi_{0}+o_{p}\left(h^{3}\right)\\ 2F_{1}F_{2}G_{1} & =2\left(hf\left(0\right)v_{1}+o_{p}\left(h\right)\right)\left(h^{2}f\left(0\right)v_{2}+o_{p}\left(h^{2}\right)\right)\left(\sigma^{2}\left(0\right)f\left(0\right)\pi_{1}+o_{p}\left(1\right)\right)\\ & =2\left(h^{3}f^{2}\left(0\right)v_{1}v_{2}+o_{p}\left(h^{3}\right)\right)\left(\sigma^{2}\left(0\right)f\left(0\right)\pi_{1}+o_{p}\left(1\right)\right)\\ & =2h^{3}f^{3}\left(0\right)v_{1}v_{2}\sigma^{2}\left(0\right)\pi_{1}+o_{p}\left(h^{3}\right)\\ F_{1}^{2}G_{2} & =\left(hf\left(0\right)v_{1}+o_{p}\left(h\right)\right)^{2}\left(h\sigma^{2}\left(0\right)f\left(0\right)\pi_{2}+o_{p}\left(h\right)\right)\\ & =\left(h^{2}f^{2}\left(0\right)v_{1}^{2}+o_{p}\left(h^{2}\right)\right)\left(h\sigma^{2}\left(0\right)f\left(0\right)\pi_{2}+o_{p}\left(h\right)\right)\\ & =h^{3}f^{3}\left(0\right)v_{1}^{2}\sigma^{2}\left(0\right)\pi_{2}+o_{p}\left(h^{3}\right)\\ F_{2}^{2}G_{0}-2F_{1}F_{2}G_{1}+F_{1}^{2}G_{2} & =h^{3}f^{3}\left(0\right)v_{2}^{2}\sigma^{2}\left(0\right)\pi_{0}+o_{p}\left(h^{3}\right)-2h^{3}f^{3}\left(0\right)v_{1}v_{2}\sigma^{2}\left(0\right)\pi_{1}\\ & -h^{3}f^{3}\left(0\right)v_{1}^{2}\sigma^{2}\left(0\right)\pi_{2}+o_{p}\left(h^{3}\right)+o_{p}\left(h^{3}\right)\\ & =h^{3}f^{3}\left(0\right)\sigma^{2}\left(0\right)\left[v_{2}^{2}\pi_{0}-2v_{1}v_{2}\pi_{1}+v_{1}^{2}\pi_{2}\right]+o_{p}\left(h^{3}\right) \end{align*}

That was the numerator. The denominator is similar to something we did previously \begin{align*} N\left(F_{0}F_{2}-F_{1}^{2}\right)^{2} & =N\left[h^{2}f^{2}\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)+o_{p}\left(h^{2}\right)\right]^{2}\\ & =Nh^{4}f^{4}\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)^{2}+o_{p}\left(Nh^{4}\right) \end{align*}

Putting it together we get \begin{align*} \mathbb{V}\left(\widehat{m}\left(0\right)\mid X\right) & =\frac{h^{3}f^{3}\left(0\right)\sigma^{2}\left(0\right)\left[v_{2}^{2}\pi_{0}-2v_{1}v_{2}\pi_{1}+v_{1}^{2}\pi_{2}\right]+o_{p}\left(h^{3}\right)}{Nh^{4}f^{4}\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)^{2}+o_{p}\left(Nh^{4}\right)}\\ & =\frac{\sigma^{2}\left(0\right)\left[v_{2}^{2}\pi_{0}-2v_{1}v_{2}\pi_{1}+v_{1}^{2}\pi_{2}\right]}{Nhf\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)^{2}}+o_{p}\left(\frac{1}{Nh}\right) \end{align*}

which concludes the proof of the lemma.

Illustration

Let's take a look at what we just proved. We'll start with the bias, then the variance, and lastly the MSE.

Let's set up an example RD DGP and look only at observations above the cutoff, as in Lemma 3.

# function that creates RD dgp with outcome function `y_function`
rd_dgp <- function(N, jump = 0.25, y_function, seed = 1, error_sd = 0.4) {
  set.seed(seed)
  data.frame(X = rnorm(N)) |> 
  as_tibble() |> 
  mutate(
    W = 1*(X >= 0),
    Y = y_function(X) + rnorm(N, sd = error_sd)
  ) |> 
  # add discontinuity
  mutate(Y = ifelse(W == 1, Y + jump, Y))
}

# some random outcome function
y_function1 <- function(X) {10 + X + 0.2*X^2 - 0.2*X^3}

# example dgp and plot
df <- rd_dgp(N = 1000, jump = 1, y_function = y_function1)

df |> 
  ggplot(aes(x=X, y=Y)) + 
  geom_point(aes(color = as.factor(W))) + 
  theme_bw() + geom_vline(xintercept = 0, linetype = "dashed")

Okay, so we have a DGP. Let's add a kernel and an example bandwidth.

# this is our DGP from before
N <- 1000
jump <- 1
df <- rd_dgp(N = N, jump = jump, y_function = y_function1)

# create a kernel function
k_triangular = function(u) {1*(u >= -1 & u <= 1) * (1-abs(u))}

# choose some bandwidth
h = df$X |> sd()

# calculate weights K((X-c)/h), noting that as in the lemma we set c = 0.
df <- df |> 
  mutate(weights = k_triangular(X/h))  

df |> 
  ggplot(aes(x=X, y=Y, color = weights)) + 
  geom_point() + 
  geom_vline(xintercept = h, linetype = "dashed") +
  geom_vline(xintercept = -h, linetype = "dashed") +
  scale_color_gradientn(name = "Weights", 
                        colours = wesanderson::wes_palette(name = "Zissou1", type = "continuous")) + 
  theme_bw()

Lastly, let's focus on an estimator of the conditional expectation m(0).

## as in the lemma, keep only obs above the cutoff
df_above <- df |> 
  filter(X >= 0)


## lets calculate the estimator by hand, as we did in the lemma
# define all the needed vectors and matrices
X <- df_above$X
R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
W <- diag(x = k_triangular(X/h))
Y <- df_above$Y
e_1 <- c(1, 0)

# calculate
e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Y
       [,1]
[1,] 10.974
## lets put this into a function
m_0_function <- function(X, Y, k, h, c = 0) {
  R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
  W <- diag(x = k((X-c)/h))
  e_1 <- c(1, 0)
  
  # calculate
  m_0 <- e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Y
  return(m_0)
 
}

m_0_function(X = df_above$X, Y = df_above$Y, k = k_triangular, h = sd(df$X))
       [,1]
[1,] 10.974

And last but not least, let's find out the truth.

## to calculate the truth above the cutoff
## we use the function for m(X), and the jump
y_function1(0) + jump
[1] 11

Now that we have everything, we can simulate DGPs and calculate the bias.

## create a function that simulates a dgp, and calculates the bias
single_simulation <- function(N = 1000, jump = 1, y_function, k, h, seed = 1) {
  # simulate a data set
  df <- rd_dgp(N = N, jump = jump, y_function = y_function, seed = seed)
  # calculate weights K((X-c)/h), noting that as in the lemma we set c = 0.
  df <- df |> 
    mutate(weights = k(X/h))
  # keep obs above cutoff
  df <- df |> 
    filter(X >= 0)
  # estimate m_0
  m_0_hat <- m_0_function(X = df$X, Y = df$Y, k = k, h = h)
  # return data frame of estimate and bias
  data.frame(estimate = m_0_hat) |> 
    mutate(
      truth = y_function(0) + jump,
      bias = estimate - truth
    )
}

# test the function
single_simulation(N = 10000, y_function = y_function1, k = k_triangular, h = 0.2)
  estimate truth      bias
1   10.985    11 -0.014864

Let's run this simulation many times and calculate the mean bias over the simulations.

simulation_function <- function(num_it= 10, N = 1000, jump = 1, y_function, k, h) {
  lapply(1:num_it, single_simulation, N = N, jump = jump, y_function = y_function, k = k, h = h) |> 
    dplyr::bind_rows() |> 
    mutate(iteration = row_number())
}

h_vector = seq(0.1, 1, 0.1)

full = NULL
for(h in h_vector) {
  full = rbind(full, 
               simulation_function(num_it = num_it, N = N, jump = 1, y_function = y_function1, k = k_triangular, h = h ) |> 
                 mutate(h = h)
  )
  print(glue::glue("Finished h = {h}, at time {Sys.time()}."))
}
Finished h = 0.1, at time 2022-09-28 14:51:20.
Finished h = 0.2, at time 2022-09-28 14:51:21.
Finished h = 0.3, at time 2022-09-28 14:51:23.
Finished h = 0.4, at time 2022-09-28 14:51:24.
Finished h = 0.5, at time 2022-09-28 14:51:25.
Finished h = 0.6, at time 2022-09-28 14:51:27.
Finished h = 0.7, at time 2022-09-28 14:51:28.
Finished h = 0.8, at time 2022-09-28 14:51:29.
Finished h = 0.9, at time 2022-09-28 14:51:31.
Finished h = 1, at time 2022-09-28 14:51:32.
full |> 
  group_by(h) |> 
  summarise(mean_bias = mean(bias)) |> 
  ggplot(aes(x = h, y = mean_bias)) +
  geom_line()

Let's calculate the first-order approximation we found.

## we'll use the equation in the last lemma

## we start with the true second derivative
# lets start with the true bias
# the conditional expectation function we have used so far is
y_function1
function(X) {10 + X + 0.2*X^2 - 0.2*X^3}
<bytecode: 0x00000213747d9670>
# taking second derivative by X we get
y_function_1_sec_dev = function(X) {0.2*2*1 - 0.2*3*2*X}

# hence the true second derivative at zero is 
y_function_1_sec_dev(0)
[1] 0.4
## we next calculate all the elements that depend on K
v_j = function(u, k, j) {k(u) * u^j}
v_j_int = function(k, j) {integrate(f = v_j, lower = 0, upper = Inf, j = j, k = k)$value}

pi_j = function(u, k, j) {k(u)^2 * u^j}
pi_j_int = function(k, j) {integrate(f = pi_j, lower = 0, upper = Inf, j = j, k = k)$value}

C_1_k <- function(k = k_triangular) {
  1 / 2 * ((v_j_int(k = k, j = 2) ^ 2) -
             (v_j_int(k = k, j = 1) * v_j_int(k = k, j = 3))) /
    ((v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2)) -
       v_j_int(k = k, j = 1) ^ 2)
}

C_1_k()
[1] -0.05
## now we can calculate the true first order approximation of the bias
bias_approx = function(k, h) {
  C_1_k(k = k_triangular) * y_function_1_sec_dev(0) * h^2
}

## an example
bias_approx(k = k_triangular, h = 2)
[1] -0.08

Let's add this for each bandwidth and see how close the approximation is to the simulated mean bias.

full |> 
  group_by(h) |> 
  summarise(mean_bias = mean(bias)) |> 
  mutate(approx_true_mean_bias = bias_approx(k = k_triangular, h = h)) |> 
  ggplot(aes(x = h)) + 
  geom_line(aes(y = mean_bias, color = "Mean Bias")) + 
  geom_line(aes(y = approx_true_mean_bias, color = "First Order Approx. Bias"))

Okay, this didn't come out as close as I thought it would. Let's continue with the variance.

## as in the lemma, keep only obs above the cutoff
df <- rd_dgp(N = 1000, jump = 1, y_function = y_function1) |> 
  filter(X >= 0)

## lets calculate the estimator by hand, as we did in the lemma
# define all the needed vectors and matrices
X <- df_above$X
R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
W <- diag(x = k_triangular(X/h))
Y <- df_above$Y
e_1 <- c(1, 0)
Sigma <- diag(x = rep(1, times = length(X)))

# calculate
e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Sigma%*%W%*%R %*% solve(t(R)%*%W%*%R) %*% matrix(c(1,0), ncol = 1)
         [,1]
[1,] 0.015304
## lets put this into a function
V_0_function <- function(X, Y, k, h, c = 0) {
  R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
  W <- diag(x = k((X-c)/h))
  e_1 <- c(1, 0)
  Sigma <- diag(x = rep(1, times = length(X)))
  
  # calculate
  V_0 <- e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Sigma%*%W%*%R %*% solve(t(R)%*%W%*%R) %*% matrix(c(1,0), ncol = 1)
  return(V_0)
 
}

V_0_function(X = df$X, Y = df$Y, k = k_triangular, h = sd(df$X))
         [,1]
[1,] 0.024837
## next we calculate the first order approximation
C_2_k <- function(k) {
  (
    (v_j_int(k = k, j = 2)^2 * pi_j_int(k = k, j = 0)) - 
     (2*v_j_int(k = k, j = 1) * v_j_int(k = k, j = 2) * pi_j_int(k = k, j = 1)) + 
     (v_j_int(k = k, j = 1)^2 * pi_j_int(k = k, j = 2))
    ) / 
    (
      (v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2)) - 
        (v_j_int(k = k, j = 1)^2)
      )^2
}

## now we can calculate the true first order approximation of the variance
## note that since we are simulating, we know the density function, which is unknown in observable data...
variance_approx = function(k, h, N, sigma = 1) {
  (C_2_k(k = k_triangular) * sigma^1) / 
    (N * h * dnorm(x = 0))
}

## putting everything together
# create a function that simulates a dgp, and calculates all of the above
single_simulation <- function(N = 1000, jump = 1, y_function, k, h, seed = 1) {
  # simulate a data set
  df <- rd_dgp(N = N, jump = jump, y_function = y_function, seed = seed)
  # calculate weights K((X-c)/h), noting that as in the lemma we set c = 0.
  df <- df |> 
    mutate(weights = k(X/h))
  # keep obs above cutoff
  df <- df |> 
    filter(X >= 0)
  # estimate m_0
  m_0_hat <- m_0_function(X = df$X, Y = df$Y, k = k, h = h)
  # return data frame of estimate and bias
  data.frame(estimate = m_0_hat) |> 
    mutate(
      truth = y_function(0) + jump,
      bias = estimate - truth,
      variance = V_0_function(X = df$X, Y = df$Y, k = k, h = h)
    )
}

simulation_function <- function(num_it= 10, N = 1000, jump = 1, y_function, k, h) {
  lapply(1:num_it, single_simulation, N = N, jump = jump, y_function = y_function, k = k, h = h) |> 
    dplyr::bind_rows() |> 
    mutate(iteration = row_number())
}

h_vector = seq(0.1, 3, 0.1)

full = NULL
for(h in h_vector) {
  full = rbind(full, 
               simulation_function(num_it = num_it, N = N, jump = 1, y_function = y_function1, k = k_triangular, h = h ) |> 
                 mutate(h = h) |> 
                 mutate(approx_bias = bias_approx(k = k, h = h),
                        approx_var = variance_approx(k = k, h = h, N = N))
  )
  print(glue::glue("Finished h = {h}, at time {Sys.time()}."))
}
Finished h = 0.1, at time 2022-09-28 14:51:35.
Finished h = 0.2, at time 2022-09-28 14:51:36.
Finished h = 0.3, at time 2022-09-28 14:51:38.
Finished h = 0.4, at time 2022-09-28 14:51:40.
Finished h = 0.5, at time 2022-09-28 14:51:42.
Finished h = 0.6, at time 2022-09-28 14:51:44.
Finished h = 0.7, at time 2022-09-28 14:51:46.
Finished h = 0.8, at time 2022-09-28 14:51:47.
Finished h = 0.9, at time 2022-09-28 14:51:49.
Finished h = 1, at time 2022-09-28 14:51:51.
Finished h = 1.1, at time 2022-09-28 14:51:53.
Finished h = 1.2, at time 2022-09-28 14:51:55.
Finished h = 1.3, at time 2022-09-28 14:51:57.
Finished h = 1.4, at time 2022-09-28 14:51:58.
Finished h = 1.5, at time 2022-09-28 14:52:00.
Finished h = 1.6, at time 2022-09-28 14:52:02.
Finished h = 1.7, at time 2022-09-28 14:52:04.
Finished h = 1.8, at time 2022-09-28 14:52:06.
Finished h = 1.9, at time 2022-09-28 14:52:07.
Finished h = 2, at time 2022-09-28 14:52:09.
Finished h = 2.1, at time 2022-09-28 14:52:11.
Finished h = 2.2, at time 2022-09-28 14:52:13.
Finished h = 2.3, at time 2022-09-28 14:52:15.
Finished h = 2.4, at time 2022-09-28 14:52:17.
Finished h = 2.5, at time 2022-09-28 14:52:19.
Finished h = 2.6, at time 2022-09-28 14:52:21.
Finished h = 2.7, at time 2022-09-28 14:52:22.
Finished h = 2.8, at time 2022-09-28 14:52:24.
Finished h = 2.9, at time 2022-09-28 14:52:26.
Finished h = 3, at time 2022-09-28 14:52:28.
full |> 
  group_by(h) |> 
  summarise_all(mean) |> 
  ggplot(aes(x = h)) + 
  geom_line(aes(y = bias, color = "Mean Bias")) + 
  geom_line(aes(y = approx_bias, color = "First Order Approx Bias"))

full |> 
  group_by(h) |> 
  summarise_all(mean) |> 
  ggplot(aes(x = h)) + 
  geom_line(aes(y = variance, color = "Mean Variance")) + 
  geom_line(aes(y = approx_var, color = "First Order Approx Variance"))

The variance approximation is spot on! Let's finish with the MSE.

full |> 
  mutate(bias_sq = bias^2) |> 
  group_by(h) |> 
  summarise_all(mean) |> 
  rename(mse = bias_sq) |> 
  mutate(mse_sum = bias^2 + variance) |> 
  mutate(mse_approx = approx_bias^2 + approx_var) |> 
  ggplot(aes(x = h)) + 
  geom_line(aes(y = mse, color = "MSE")) + 
  geom_line(aes(y = mse_sum, color = "MSE Sum")) + 
  geom_line(aes(y = mse_approx, color = "MSE Approx.")) +
  theme_bw() + 
  labs(x = "Bandwidth", y = "MSE", color = NULL)

Proposition: MSE in RD

Proposition Statement

Proposition 1 (RD MSE) We finish with the MSE and the MSE-minimizing bandwidth in RD settings. We remind that we denote the cutoff by c. Define the first order approximation of the MSE as AMSE\left(h\right)=C_{1}h^{4}\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}+\frac{C_{2}}{Nhf\left(c\right)}\left(\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)\right)

The MSE and AMSE are connected via MSE\left(h\right)=AMSE\left(h\right)+o_{p}\left(h^{4}+\frac{1}{Nh}\right)

And the bandwidth that minimizes the AMSE is equal to h_{opt}\equiv\arg\min_{h}AMSE\left(h\right)=\left[\frac{\left(C_{2}/4C_{1}\right)}{Nf\left(c\right)}\left(\frac{\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)}{\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}}\right)\right]^{1/5}

Proof

We will make an additional assumption, that m_{+}^{\left(2\right)}\left(c\right) \neq m_{-}^{\left(2\right)}\left(c\right), which is needed due to the denominator in h_{opt}.

We will start with considering the RD MSE as defined above.

In the context of RD, we are essentially using Lemma 3 twice, once above and once below the cutoff.

\begin{align*}MSE(h) & =\mathbb{E}\left[\left(\left(\widehat{\mu}_{+}-\mu_{+}\right)-\left(\widehat{\mu}_{-}-\mu_{-}\right)\right)^{2}\mid X\right]\\ & =\mathbb{E}\left[\left(\widehat{\mu}_{+}-\mu_{+}\right)^{2}\mid X\right]+\mathbb{E}\left[\left(\widehat{\mu}_{-}-\mu_{-}\right)^{2}\mid X\right]-2\mathbb{E}\left[\left(\widehat{\mu}_{+}-\mu_{+}\right)\left(\widehat{\mu}_{-}-\mu_{-}\right)\mid X\right]\\ & =MSE\left(\widehat{\mu}_{+}\right)+MSE\left(\widehat{\mu}_{-}\right)-2\mathbb{E}\left[\widehat{\mu}_{+}-\mu_{+}\mid X\right]\times\mathbb{E}\left[\widehat{\mu}_{-}-\mu_{-}\mid X\right]\\ & =MSE\left(\widehat{\mu}_{+}\right)+MSE\left(\widehat{\mu}_{-}\right)-2Bias\left(\widehat{\mu}_{+}\right)\times Bias\left(\widehat{\mu}_{-}\right) \end{align*}

where the third equality uses that \widehat{\mu}_{+} and \widehat{\mu}_{-} are estimated from disjoint sets of observations, so that conditional on X their covariance is zero.

Substituting the expression we found in Lemma 3, we get

\begin{align*}MSE(h) & =\left(C_{1}^{1/2}m_{+}^{\left(2\right)}\left(c\right)h^{2}\right)^{2}+C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f_{+}\left(c\right)N_{+}h}+o_{p}\left(h^{4}\right)+o_{p}\left(\frac{1}{N_{+}h}\right)\\ & +\left(C_{1}^{1/2}m_{-}^{\left(2\right)}\left(c\right)h^{2}\right)^{2}+C_{2}\frac{\sigma_{-}^{2}\left(c\right)}{f_{-}\left(c\right)N_{-}h}+o_{p}\left(h^{4}\right)+o_{p}\left(\frac{1}{N_{-}h}\right)\\ & -2\left(C_{1}^{1/2}m_{+}^{\left(2\right)}\left(c\right)h^{2}\right)\times\left(C_{1}^{1/2}m_{-}^{\left(2\right)}\left(c\right)h^{2}\right)+o_{p}\left(h^{4}\right)\\ & =C_{1}h^{4}\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}+C_{2}\left(\frac{\sigma_{+}^{2}\left(c\right)}{f_{+}\left(c\right)N_{+}h}+\frac{\sigma_{-}^{2}\left(c\right)}{f_{-}\left(c\right)N_{-}h}\right)+o_{p}\left(h^{4}\right)+o_{p}\left(\frac{1}{N_{+}h}\right)+o_{p}\left(\frac{1}{N_{-}h}\right) \end{align*}

where N_{+} and N_{-} denote the number of observations above and below the cutoff, and f_{+}, f_{-} the densities of the running variable at the cutoff from each side.

We’re almost there.

Consider the fractions N_{+}/N and N_{-}/N. In the limit N\rightarrow\infty these are equal to Pr\left(X\geq c\right) and Pr\left(X<c\right) respectively. In our sample the equality doesn't hold exactly, but the error diminishes with N. That is, N_{+}/N=Pr\left(X\geq c\right)+O_{p}\left(N^{-1/2}\right) and N_{-}/N=Pr\left(X<c\right)+O_{p}\left(N^{-1/2}\right).

Next, consider the density of X at the cutoff from above, f_{+}\left(c\right). We can write this as f_{X\mid X\geq c}\left(c\right), i.e. the density of X at c conditional on X being greater than or equal to c. Using Bayes' rule we can write this as f_{X\mid X\geq c}\left(c\right)=f_{X}\left(c\right)/Pr\left(X\geq c\right).

Putting these together we have

\begin{align*} f_{+}\left(c\right)N_{+} & =\frac{f\left(c\right)}{Pr\left(X\geq c\right)}\times\left(N\times Pr\left(X\geq c\right)+N\times O_{p}\left(N^{-1/2}\right)\right)\\ & =f\left(c\right)N+O_{p}\left(N^{1/2}\right) \end{align*}

and similarly f_{-}\left(c\right)N_{-}=f\left(c\right)N+O_{p}\left(N^{1/2}\right). Using this result we can write

\begin{align*} C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f_{+}\left(c\right)N_{+}h}+o_{p}\left(\frac{1}{N_{+}h}\right) & =C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f\left(c\right)Nh+O_{p}\left(N^{1/2}h\right)}+o_{p}\left(\frac{1}{Nh}\right)\\ & =C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f\left(c\right)Nh}+o_{p}\left(\frac{1}{Nh}\right) \end{align*}

since the O_{p}\left(N^{1/2}h\right) term is of smaller order than f\left(c\right)Nh.

Using such a result for above and below the cutoff allows us to write

MSE(h)=C_{1}h^{4}\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}+\frac{C_{2}}{f\left(c\right)Nh}\left(\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)\right)+o_{p}\left(h^{4}+\frac{1}{Nh}\right)

This concludes the proof of the MSE and AMSE.

Optimal Bandwidth

Define the optimal bandwidth as in Proposition 1. If we take the derivative of the AMSE with respect to h and set it to zero, we get

\begin{align*} \frac{\partial AMSE}{\partial h} & =4C_{1}h^{3}\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}-\frac{C_{2}}{f\left(c\right)Nh^{2}}\left(\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)\right)=0\\ & \Rightarrow4C_{1}h^{3}\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}=\frac{C_{2}}{f\left(c\right)Nh^{2}}\left(\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)\right)\\ & \Rightarrow h^{5}=\frac{C_{2}}{f\left(c\right)N}\left(\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)\right)\times\left(4C_{1}\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}\right)^{-1}\\ & =\frac{C_{2}}{4C_{1}}\frac{\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)}{f\left(c\right)N\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}}\\ & \Rightarrow h_{opt}=\left(\frac{C_{2}}{4C_{1}}\frac{\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)}{f\left(c\right)N\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}}\right)^{1/5} \end{align*}

Illustration

Let's calculate the approximate optimal bandwidth for a simulation.

## we next calculate all the elements that depend on K
v_j = function(u, k, j) {k(u) * u^j}
v_j_int = function(k, j) {integrate(f = v_j, lower = 0, upper = Inf, j = j, k = k)$value}

pi_j = function(u, k, j) {k(u)^2 * u^j}
pi_j_int = function(k, j) {integrate(f = pi_j, lower = 0, upper = Inf, j = j, k = k)$value}

C_1_k <- function(k = k_triangular) {
  nom <- v_j_int(k = k, j = 2) ^ 2 - v_j_int(k = k, j = 1) * v_j_int(k = k, j = 3)
  denom <- v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2) - v_j_int(k = k, j = 1) ^ 2
  return(1/4 *(nom / denom) ^ 2)
}

## next we calculate the first order approximation
C_2_k <- function(k = k_triangular) {
  nom <- v_j_int(k = k, j = 2)^2 * pi_j_int(k = k, j = 0) - 2 * v_j_int(k = k, j = 1) * v_j_int(k = k, j = 2) * pi_j_int(k = k, j = 1) + v_j_int(k = k, j = 1)^2 * pi_j_int(k = k, j = 2)
  denom <- (v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2) - v_j_int(k = k, j = 1) ^ 2)^2
  return(nom / denom)
}

C_k <- function(k = k_triangular) {C_2_k(k=k) / (4*C_1_k(k=k))}

## begin with a function, that given truth, finds the approximation of the optimal bandwidth
# the quantity (1-2/pi) is the variance of a standard normal truncated at zero
h_opt <- function(C, f_0, N, sigma_plus = (1-2/pi), sigma_minus = (1-2/pi), m_sq_plus, m_sq_minus) {
  ((C / (f_0 * N)) * ((sigma_plus + sigma_minus) / (m_sq_plus - m_sq_minus)^2))^(1/5)

}

Note that, in line with the assumption above, we now use an outcome function whose second derivative differs above and below the cutoff:

y_function2 <- function(X) {
  (X >= 0) * (10 + X + 0.2 * X ^ 2 - 0.3 * X ^ 3) +
    (X < 0) * (10 + 0.5 * X + 0.4 * X ^ 2 - 0.1 * X ^ 3)
}

# example dgp and plot
df <- rd_dgp(N = 1000, jump = 1, y_function = y_function2)

df |> 
  ggplot(aes(x=X, y=Y)) + 
  geom_point(aes(color = as.factor(W))) + 
  theme_bw() + geom_vline(xintercept = 0, linetype = "dashed")

And now we simulate

h_vector = seq(0.1, 3, 0.1)

full = NULL
for(h in h_vector) {
  full = rbind(full, 
               simulation_function(num_it = num_it, N = N, jump = 1, y_function = y_function2, k = k_triangular, h = h ) |> 
                 mutate(h = h) |> 
                 mutate(approx_bias = bias_approx(k = k, h = h),
                        approx_var = variance_approx(k = k, h = h, N = N))
  )
  print(glue::glue("Finished h = {h}, at time {Sys.time()}."))
}
Finished h = 0.1, at time 2022-09-28 14:52:32.
Finished h = 0.2, at time 2022-09-28 14:52:33.
Finished h = 0.3, at time 2022-09-28 14:52:35.
Finished h = 0.4, at time 2022-09-28 14:52:37.
Finished h = 0.5, at time 2022-09-28 14:52:39.
Finished h = 0.6, at time 2022-09-28 14:52:41.
Finished h = 0.7, at time 2022-09-28 14:52:43.
Finished h = 0.8, at time 2022-09-28 14:52:45.
Finished h = 0.9, at time 2022-09-28 14:52:46.
Finished h = 1, at time 2022-09-28 14:52:48.
Finished h = 1.1, at time 2022-09-28 14:52:50.
Finished h = 1.2, at time 2022-09-28 14:52:52.
Finished h = 1.3, at time 2022-09-28 14:52:54.
Finished h = 1.4, at time 2022-09-28 14:52:56.
Finished h = 1.5, at time 2022-09-28 14:52:58.
Finished h = 1.6, at time 2022-09-28 14:52:59.
Finished h = 1.7, at time 2022-09-28 14:53:01.
Finished h = 1.8, at time 2022-09-28 14:53:03.
Finished h = 1.9, at time 2022-09-28 14:53:05.
Finished h = 2, at time 2022-09-28 14:53:07.
Finished h = 2.1, at time 2022-09-28 14:53:09.
Finished h = 2.2, at time 2022-09-28 14:53:11.
Finished h = 2.3, at time 2022-09-28 14:53:12.
Finished h = 2.4, at time 2022-09-28 14:53:14.
Finished h = 2.5, at time 2022-09-28 14:53:16.
Finished h = 2.6, at time 2022-09-28 14:53:18.
Finished h = 2.7, at time 2022-09-28 14:53:20.
Finished h = 2.8, at time 2022-09-28 14:53:22.
Finished h = 2.9, at time 2022-09-28 14:53:24.
Finished h = 3, at time 2022-09-28 14:53:25.
h_opt_approx = h_opt(C = C_k(k = k_triangular), f_0 = dnorm(0), N = N, 
                     sigma_plus = (1-2/pi), sigma_minus = (1-2/pi), 
                     # second derivatives of y_function2 at the cutoff: 2*0.2 = 0.4 above, 2*0.4 = 0.8 below
                     m_sq_plus = 0.4, m_sq_minus = 0.8)

full |> 
  mutate(bias_sq = bias^2) |> 
  group_by(h) |> 
  summarise_all(mean) |> 
  rename(mse = bias_sq) |> 
  mutate(mse_sum = bias^2 + variance) |> 
  mutate(mse_approx = approx_bias^2 + approx_var) |> 
  ggplot(aes(x = h)) + 
  geom_line(aes(y = mse, color = "MSE Truth")) + 
  geom_line(aes(y = mse_sum, color = "MSE Truth Sum")) + 
  geom_line(aes(y = mse_approx, color = "MSE Approx.")) +
  theme_bw() + 
  labs(x = "Bandwidth", y = "MSE", color = NULL) + 
  geom_vline(xintercept = h_opt_approx)

References

Imbens, Guido, and Karthik Kalyanaraman. 2012. “Optimal Bandwidth Choice for the Regression Discontinuity Estimator.” The Review of Economic Studies 79 (3): 933–59.