# Set parameters for number of iterations and observations in simulations
N = 1000
num_it = 100
Going over Imbens and Kalyanaraman 2012 - MSE Approximation and Optimal Bandwidth
This document goes over the technical details in Imbens and Kalyanaraman (2012). All errors are my own.
Setup and Estimand
Setup
Consider a sample of N units with running variable X and outcome Y. Define potential outcomes of treatment and control as Y(1) and Y(0). Define the treatment assignment mechanism W=\mathbf{1}(X\geq c) for some known cutoff c. The function \mathbf{1}(.) is the indicator function. This defines how treatment, potential and observed outcomes are connected.
\begin{equation} Y=\begin{cases} Y(1) & W=1\leftrightarrow X\geq c\\ Y(0) & W=0\leftrightarrow X<c \end{cases} \end{equation}We also define the expectation of the observed outcome conditional on the running variable m(x)=\mathbb{E}[Y \mid X_i=x].
Estimand
The causal parameter of interest in the sharp RD design, i.e. the defined treatment assignment mechanism, is the average treatment effect local to the cutoff
\tau = \mathbb{E}\left[Y_i(1)-Y_i(0) \mid X_i = c\right]
Identification and Estimation
Identification
If we assume that:
- The conditional distribution functions F_{Y(w) \mid X}(y \mid x) for w\in\{0,1\} are continuous in x around c for all y,
- The conditional first moments \mathbb{E}[Y_i(w) \mid X_i = x] for w\in\{0,1\} exist and are continuous in x around c,
then \tau=\mu_{+}-\mu_{-}\quad\text{where}\quad \mu_{+}=\lim_{x\searrow c}m\left(x\right)\quad\text{and}\quad\mu_{-}=\lim_{x\nearrow c}m\left(x\right)
Estimation
We estimate the boundary of m(x) above and below c, i.e. \mu_{+} and \mu_{-}, via local weighted least squares.
- Local, as in we use only observations around the cutoff, defined via the bandwidth h. The choice of the bandwidth h is the focus of this paper.
- Weighted, as in we give these observations weights as a function of their distance from the cutoff, defined via the kernel K(.).
Kernels
Three examples of often used kernels are
- The uniform kernel gives the same weight to all observations in the support u\in[-1,1]:K_{uniform}(u)=0.5, otherwise gives weight zero.
- One can check that this satisfies the usual kernel requirements. Clearly, K_{uniform}(u)=0.5 satisfies non-negativity and symmetry. Let's check normalization:
\int_{-1}^{1}0.5du = \left[0.5u\right]_{-1}^{1} = 0.5-\left(-0.5\right) = 1
- The edge or triangular kernel gives more weight to observations closer to the cutoff, within the support u\in[-1,1]: K_{triangular}(u) = 1 - |u|, otherwise gives weight zero.
- The Epanechnikov kernel also gives more weight to observations closer to the cutoff, but with a more "bell"-shaped curve than the triangular kernel, u\in[-1,1]: K_{Epanechnikov}(u) = \frac{3}{4}(1-u^2), otherwise gives weight zero.
Let's visualize these three examples.
# load the libraries used throughout
library(ggplot2)
library(dplyr)
library(purrr)
# define each kernel
k_uniform <- function(u) {1*(u >= -1 & u <= 1) * 0.5}
k_triangular <- function(u) {1*(u >= -1 & u <= 1) * (1 - abs(u))}
k_epanechnikov <- function(u) {1*(u >= -1 & u <= 1) * (3/4) * (1 - u^2)}
ggplot() +
# define support limits
xlim(-1.5, 1.5) +
# add each function, use colors for legend
geom_function(aes(color = "Uniform"), fun = k_uniform) +
geom_function(aes(color = "Triangular"), fun = k_triangular) +
geom_function(aes(color = "Epanechnikov"), fun = k_epanechnikov) +
# some cosmetics
scale_color_brewer(name = NULL, palette = "Dark2") +
theme_bw() +
labs(x = "Support", y = "Weight")Local weighted Least squares regression
Given a choice of bandwidth h and kernel K(.) with support [-h,h], we estimate from above and below the cutoff two weighted regressions
\begin{align*} \left(\widehat{\alpha}_{-},\widehat{\beta}_{-}\right) & =\arg\min_{\alpha,\beta}\sum_{i=1}^{N}\mathbf{1}\left(X_{i}<c\right)\times\left(Y_{i}-\alpha-\beta\left(X_{i}-c\right)\right)^{2}\times K\left(\frac{X_{i}-c}{h}\right)\\ \left(\widehat{\alpha}_{+},\widehat{\beta}_{+}\right) & =\arg\min_{\alpha,\beta}\sum_{i=1}^{N}\mathbf{1}\left(X_{i}\geq c\right)\times\left(Y_{i}-\alpha-\beta\left(X_{i}-c\right)\right)^{2}\times K\left(\frac{X_{i}-c}{h}\right) \end{align*}We set the estimated expectation function as \widehat{\mu}_{-} = \widehat{\alpha}_{-} and \widehat{\mu}_{+} = \widehat{\alpha}_{+}, which gives us our estimator \widehat{\tau}=\widehat{\mu}_{+}-\widehat{\mu}_{-}.
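To see the estimator in action before turning to the MSE, here is a minimal sketch that runs the two kernel-weighted regressions with lm() and takes the difference of the two intercepts. The DGP, the bandwidth, the cutoff at zero and the use of lm() are illustrative choices for this note, not the paper's implementation.
# minimal sketch: sharp RD estimate via two kernel-weighted local linear fits
# (illustrative DGP with cutoff c = 0, triangular kernel, arbitrary bandwidth h)
set.seed(1)
N <- 1000
X <- rnorm(N)
Y <- 1 + X + 0.5 * (X >= 0) + rnorm(N, sd = 0.3)  # true jump at the cutoff is 0.5
cutoff <- 0
h <- 0.5
k_triangular <- function(u) {1*(u >= -1 & u <= 1) * (1 - abs(u))}
w <- k_triangular((X - cutoff) / h)
# weighted least squares below and above the cutoff;
# observations outside the bandwidth get weight zero and do not affect the fit
fit_minus <- lm(Y ~ I(X - cutoff), weights = w, subset = X < cutoff)
fit_plus  <- lm(Y ~ I(X - cutoff), weights = w, subset = X >= cutoff)
# the estimator is the difference of the two intercepts
tau_hat <- unname(coef(fit_plus)[1] - coef(fit_minus)[1])
tau_hat
With the triangular kernel, only observations within h of the cutoff receive positive weight, so this is exactly the local weighted least squares problem written above.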
MSE Optimization
Our goal is to find a solution for the bandwidth that minimizes the MSE. To do this, we will consider a first order approximation. Before stating the final proposition, we begin with some helpful lemmas.
First Lemma
Suppose we are looking at observations only above the cutoff c. Without loss of generality, normalize the cutoff such that c=0.
Assumptions
- We have N pairs (Y_i, X_i) i.i.d with X_i\geq 0.
- m(x) = \mathbb{E}[Y_i \mid X_i=x] is three times continuously differentiable.
- The density of X, f(x), is continuously differentiable at x=0, with f(0)>0.
- The conditional variance \sigma^2(x)=Var(Y \mid X=x)>0 is bounded and continuous at x=0.
- We have a kernel K:\mathbb{R}_+\rightarrow \mathbb{R}, with K(u)=0 for u\geq\bar{u}, and \int_{0}^{\bar{u}}K(u)du=1.
Define K_h(u)=K(u/h)/h, and \mu=m(0).
Lemma Statement
Lemma 1 (Kernel Weighted Sample Moments)
Define F_j = N^{-1}\sum_{i=1}^NK_h(X_i)X_i^j. Under the above assumptions 1-5 F_j=h^jf(0)v_j+o_p(h^j) where v_j=\int_{0}^{\infty} t^j K(t) dt.
O-notation
A quick reminder: for random variables A_N and a positive sequence b_N, we write A_N=O_p(b_N) if A_N/b_N is bounded in probability, and A_N=o_p(b_N) if A_N/b_N converges to zero in probability.
Proof
Proof. We break F_j apart into an expectation and a deviation component. Since the X_i, and hence the K_h(X_i), are i.i.d. random variables, we can write F_j = \mathbb{E}[F_j] + O_p(\mathbb{V}(F_j)^{1/2}) (loosely: a random variable deviates from its expectation by at most a bounded number of standard deviations, in probability). We start with the expectation. Some algebra and the definition of K_h(.) produce
\begin{align*} \mathbb{E}\left[F_{j}\right] & =N^{-1}\sum_{i=1}^{N}\mathbb{E}\left[K_{h}\left(X_{i}\right)X_{i}^{j}\right]\\ & =\mathbb{E}\left[K_{h}\left(X_{i}\right)X_{i}^{j}\right]\\ & =\int_{0}^{\infty}K_{h}\left(x\right)x^{j}f(x)dx\\ & =\int_{0}^{\infty}h^{-1}K\left(x/h\right)x^{j}f(x)dx \end{align*}Using a change of variables z=x/h, which produces x=hz and dx=hdz, gives us
\begin{align*} \mathbb{E}\left[F_{j}\right] & =\int_{0}^{\infty}h^{-1}K\left(z\right)z^{j}h^{j}f(zh)hdz\\ & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f(zh)dz \end{align*}Adding and subtracting f(0) we get
\begin{align*} \mathbb{E}\left[F_{j}\right] & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}\left(f(zh)+f\left(0\right)-f\left(0\right)\right)dz\\ & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f\left(0\right)dz+h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}\left(f(zh)-f\left(0\right)\right)dz\\ & =h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f\left(0\right)dz+h^{j+1}\int_{0}^{\infty}K\left(z\right)z^{j+1}\frac{\left(f(zh)-f\left(0\right)\right)}{zh}dz \end{align*}We can simplify the left element as
\begin{align*} h^{j}\int_{0}^{\infty}K\left(z\right)z^{j}f\left(0\right)dz & =h^{j}f\left(0\right)\int_{0}^{\infty}K\left(z\right)z^{j}dz\\ & =h^{j}f\left(0\right)v_{j} \end{align*}The right element is of order h^{j+1}: since f is continuously differentiable at 0 and K has bounded support, the difference quotient \frac{f(zh)-f\left(0\right)}{zh} stays bounded for small h, so the remaining integral is finite. Hence we get \mathbb{E}\left[F_{j}\right]=h^{j}f\left(0\right)v_{j}+O_{p}\left(h^{j+1}\right)=h^{j}f(0)v_{j}+o_{p}\left(h^{j}\right)
We continue with the variance of F_j.
\begin{align*} \mathbb{V}\left(F_{j}\right) & =N^{-2}\sum_{i=1}^{N}\mathbb{V}\left(K_{h}\left(X_{i}\right)X_{i}^{j}\right)\\ & \leq N^{-2}\sum_{i=1}^{N}\mathbb{E}\left[\left(K_{h}\left(X_{i}\right)\right)^{2}X_{i}^{2j}\right]\\ & =N^{-1}\mathbb{E}\left[\left(K_{h}\left(X_{i}\right)\right)^{2}X_{i}^{2j}\right] \end{align*}where the first equality uses independence across observations and the inequality uses \mathbb{V}\left(Z\right)\leq\mathbb{E}\left[Z^{2}\right]. Using the definition of K_h(.) we get \begin{align*} \mathbb{V}\left(F_{j}\right) & \leq N^{-1}\mathbb{E}\left[\left(h^{-1}K\left(X_{i}/h\right)\right)^{2}X_{i}^{2j}\right]\\ & =\frac{1}{Nh^{2}}\mathbb{E}\left[\left(K\left(X_{i}/h\right)\right)^{2}X_{i}^{2j}\right]\\ & =\frac{1}{Nh^{2}}\int_{0}^{\infty}\left(K\left(x/h\right)\right)^{2}x^{2j}f(x)dx \end{align*}
Again using a change of variables z=x/h we get
\begin{align*} \mathbb{V}\left(F_{j}\right) & \leq\frac{1}{Nh^{2}}\int_{0}^{\infty}\left(K\left(z\right)\right)^{2}z^{2j}h^{2j}f\left(zh\right)hdz\\ & =\frac{h^{2j}}{Nh}\int_{0}^{\infty}\left(K\left(z\right)\right)^{2}z^{2j}f\left(zh\right)dz\\ & =O_{p}\left(\frac{h^{2j-1}}{N}\right) \end{align*}since the remaining integral is bounded (K has bounded support and f is continuous). Putting everything together, and using that Nh\rightarrow\infty so that O_{p}\left(\mathbb{V}\left(F_{j}\right)^{1/2}\right)=O_{p}\left(h^{j}\left(Nh\right)^{-1/2}\right)=o_{p}\left(h^{j}\right), we get \begin{align*} F_{j} & =\mathbb{E}\left[F_{j}\right]+O_{p}\left(\mathbb{V}\left(F_{j}\right)^{1/2}\right)\\ & =h^{j}f\left(0\right)v_{j}+o_{p}\left(h^{j}\right)+o_{p}\left(h^{j}\right)\\ & =h^{j}f\left(0\right)v_{j}+o_{p}\left(h^{j}\right)\\ & =h^{j}\left(f\left(0\right)v_{j}+o_{p}\left(1\right)\right) \end{align*}
Completing the proof.
What did we just prove?
Let's pick a distribution for the running variable and examine the statement we just proved.
set.seed(1)
# distribution parameters
N <- 1000
# sample from the distribution
X <- rnorm(n = N, mean = 0, sd = 1)
# show distribution in histogram
hist(X)
We continue by looking at the assigned kernel weights.
# define a kernel
k_triangular <- function(u) {
1*(u >= -1 & u <= 1) * (1 - abs(u))
}
# define K_h
kernel_h <- function(u, h, kernel) {
(1/h) * kernel(u/h)
}
# put this into a data-frame
df <- X |>
as.data.frame() |>
mutate(W = 1*(X >= 0))
# define bandwidth
h <- df$X |> sd()
# add kernel weights
df <- df |>
mutate(K_h = kernel_h(u = X, h = h, kernel = k_triangular))
# plot the weights by X
df |>
mutate(treatment = case_when(W == 0 ~ "Control", W == 1 ~ "Treatment")) |>
ggplot() +
# add points by X and treatment status
geom_point(aes(x=X, y = treatment, color = K_h),
# need some jitter
position = position_jitter(width = 0, height = 0.4),
# some cosmetics
stroke = 2, shape = 16, alpha = .75) +
geom_vline(xintercept = 0, linetype = "dashed") +
# add a diverging color series
scale_color_gradientn(name = "Weights",
colours = wesanderson::wes_palette(name = "Zissou1", type = "continuous")) +
theme_bw()
Finally, we consider the weighted moments.
### as in the lemma, we only proved the result for observations above the cutoff
# filter correct DGP
df <- df |> filter(X >= 0)
# function to calculate the weighted moments
weighted_moments <- function(X, j, kernel, h) {
N = length(X)
K_h_X = kernel_h(u = X, h = h, kernel = kernel)
X_j = X^j
F_j = N^-1 * sum(K_h_X * X_j)
return(F_j)
}
# lets calculate different moments
F_j_df <- map(0:5, weighted_moments, X = df$X, h = h, kernel = k_triangular) |>
unlist() |>
as.data.frame() |>
rename(F_j = 1) |>
mutate(j = row_number()-1)
F_j_df
       F_j j
1 0.342195 0
2 0.119008 1
3 0.060413 2
4 0.036576 3
5 0.024625 4
6 0.017816 5
Let's play around with this. We saw that F_j = h^j\left( f(0)v_j + o_p(1) \right), which can also be written as F_j / h^j = f(0)v_j + o_p(1), and rearranged to F_j / h^j - f(0)v_j = o_p(1). Let's see this in the simulated data.
# for a normal distribution with mean zero and sd 1, which is the underlying DGP in this example
f_0 = dnorm(x = 0)
# by definition
v_j_fun <- function(t, j, kernel) {t^j * kernel(t)}
v_j_int = function(j, kernel) {
integrate(f = v_j_fun, lower = 0, upper = Inf, j = j, kernel = kernel)$value
}
F_j_df |>
# calculate other elements
mutate(f_0 = f_0,
v_j = map_dbl(0:5, v_j_int, kernel = k_triangular)) |>
# calculate final equation
mutate(result = (F_j/h^j) - (f_0*v_j))
       F_j j     f_0      v_j    result
1 0.342195 0 0.39894 0.500000 0.1427239
2 0.119008 1 0.39894 0.166667 0.0485021
3 0.060413 2 0.39894 0.083333 0.0231600
4 0.036576 3 0.39894 0.050000 0.0130506
5 0.024625 4 0.39894 0.033333 0.0081686
6 0.017816 5 0.39894 0.023810 0.0055084
We can see that the calculated difference shrinks with j. Keep in mind, though, that the lemma is a statement about the bandwidth: the o_p(h^j) term becomes negligible relative to h^j f(0)v_j as h shrinks (with Nh growing).
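As a complementary check of that h-asymptotic, here is a small self-contained sketch with a one-sided running variable whose density at zero is known exactly: X drawn from an Exponential(1) distribution, so f(0)=1. The exponential DGP, the sample size and the bandwidth grid are illustrative choices for this note, not taken from the paper.
## self-contained check of Lemma 1 in h: with X ~ Exponential(1) (so f(0) = 1),
## the scaled gap F_j / h^j - f(0) * v_j should shrink as h shrinks
## (up to sampling noise of order (N*h)^(-1/2))
set.seed(2)
N_check <- 100000
X_check <- rexp(N_check, rate = 1)
k_tri_onesided <- function(u) {1*(u >= 0 & u <= 1) * (1 - u)}
F_j_check <- function(x, j, h) {mean((1/h) * k_tri_onesided(x/h) * x^j)}
v_j_check <- function(j) {integrate(function(t) t^j * k_tri_onesided(t), lower = 0, upper = Inf)$value}
h_grid <- c(0.8, 0.4, 0.2, 0.1, 0.05)
sapply(h_grid, function(h_val) F_j_check(X_check, j = 0, h = h_val) / h_val^0 - 1 * v_j_check(0))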
Second Lemma
Lemma Statement
Lemma 2 (Variance and Kernel Weighted Sample Moments)
Define G_j = N^{-1}\sum_{i=1}^N\left[(K_h(X_i))^2\times \sigma^2(X_i) \times X_i^j\right]. Under the above assumptions 1-5 G_{j}=h^{j-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{j}\left(1+o_{p}\left(1\right)\right) where \pi_{j}=\int_{0}^{\infty}t^{j}\left(K(t)\right)^{2}dt.
We leave the proof of this lemma aside, as it follows the same steps as the proof of Lemma 1, and continue toward the main proposition; a quick numerical sanity check is sketched below.
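The check below uses illustrative assumptions made up for this note: X drawn from an Exponential(1) distribution, so f(0)=1, and a constant conditional variance \sigma^2(x)=0.25.
## numerical check of Lemma 2 under illustrative assumptions
## (X ~ Exponential(1) so f(0) = 1, constant conditional variance sigma^2 = 0.25)
set.seed(3)
N_check <- 100000
X_check <- rexp(N_check, rate = 1)
sigma2  <- 0.25
k_tri_onesided <- function(u) {1*(u >= 0 & u <= 1) * (1 - u)}
K_h_check <- function(u, h) {(1/h) * k_tri_onesided(u/h)}
pi_j_check <- function(j) {integrate(function(t) t^j * k_tri_onesided(t)^2, lower = 0, upper = Inf)$value}
G_j_check <- function(x, j, h) {mean(K_h_check(x, h)^2 * sigma2 * x^j)}
h_check <- 0.1
# compare the sample moment with its approximation h^(j-1) * sigma^2(0) * f(0) * pi_j
sapply(0:3, function(j) c(G_j = G_j_check(X_check, j, h_check),
                          approx = h_check^(j-1) * sigma2 * 1 * pi_j_check(j)))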
Third Lemma: Bias, Variance and MSE of Boundary Kernel Estimators
Lemma Statement
Lemma 3 (One-Sided MSE) Recall that we are considering only observations where the running variable X is above the cutoff, which we normalized (without loss of generality) to c=0. We take as inputs a bandwidth h and kernel K. Estimate the expectation at the boundary by \left(\widehat{\mu},\widehat{\beta}\right)=\arg\min_{\mu,\beta}\sum_{i}\left(Y_{i}-\mu-\beta X_{i}\right)^{2}\times K_{h}\left(X_{i}\right)
The bias and variance of the estimator of the boundary are given by \begin{align*} Bias\left(\widehat{\mu}\right) & =\mathbb{E}\left[\widehat{\mu}\mid X\right]-\mu=C_{1}^{1/2}m^{\left(2\right)}\left(0\right)h^{2}+o_{p}\left(h^{2}\right)\\ \mathbb{V}\left(\widehat{\mu}\mid X\right) & =C_{2}\frac{\sigma^{2}\left(0\right)}{f\left(0\right)Nh}+o_{p}\left(\frac{1}{Nh}\right) \end{align*}
And this gives us the MSE \begin{align*} MSE\left(\widehat{\mu}\right) & =\mathbb{E}\left[\left(\widehat{\mu}-\mu\right)^{2}\mid X\right]=\left(Bias\left(\widehat{\mu}\right)\right)^{2}+\mathbb{V}\left(\widehat{\mu}\mid X\right)\\ & =\left(C_{1}^{1/2}m^{\left(2\right)}\left(0\right)h^{2}+o_{p}\left(h^{2}\right)\right)^{2}+C_{2}\frac{\sigma^{2}\left(0\right)}{f\left(0\right)Nh}+o_{p}\left(\frac{1}{Nh}\right)\\ & =C_{1}\left(m^{\left(2\right)}\left(0\right)\right)^{2}h^{4}+C_{2}\frac{\sigma^{2}\left(0\right)}{f\left(0\right)Nh}+o_{p}\left(h^{4}+\frac{1}{Nh}\right) \end{align*}
Proof: Setup
Define by R the N\times 2 matrix of covariates, which is made up of a constant 1 and the running variable X \begin{equation*} R=\left(\begin{array}{cc} 1 & X_{1}\\ \vdots & \vdots\\ 1 & X_{N} \end{array}\right)_{N\times2}=\left[\iota\;X\right] \end{equation*} where \iota is an N-length vector of ones. Also define by W the N\times N weighting matrix, with typical element \begin{equation*} W_{ij}=\begin{cases} K_{h}\left(X_{i}\right) & i=j\\ 0 & o.w. \end{cases} \end{equation*}
Lastly, we define the vector e_1 = (1, 0)^\prime. We can now write the estimator \widehat{\mu} as \widehat{m}\left(0\right)=\widehat{\mu}=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WY For intuition, consider the following. If we had wanted an unweighted estimation, i.e. OLS, we could calculate our estimators via \widehat{\beta}_{unweighted} = \left(R^{\prime}R\right)^{-1}R^{\prime}Y. In order to incorporate the weights, we add the W matrix, \widehat{\beta}_{weighted} = \left(R^{\prime}WR\right)^{-1}R^{\prime}WY. And lastly, in order to pick the first coefficient out of \widehat{\beta}_{weighted}, we multiply by the "choosing vector" e_1.
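As a quick check of this matrix expression (a toy example made up for this note, not the paper's code), the closed form e_1'(R'WR)^{-1}R'WY coincides with the intercept of a weighted regression fitted with lm():
## toy check: e_1' (R'WR)^{-1} R'WY is just the intercept of a weighted least squares fit
set.seed(4)
n_toy <- 200
x_toy <- abs(rnorm(n_toy))                      # running variable above the cutoff c = 0
y_toy <- 2 + 0.5 * x_toy + rnorm(n_toy, sd = 0.1)
h_toy <- 0.5
k_tri_toy <- function(u) {1*(u >= 0 & u <= 1) * (1 - u)}
w_toy <- (1/h_toy) * k_tri_toy(x_toy/h_toy)     # K_h(X_i)
R_toy <- cbind(1, x_toy)
W_toy <- diag(w_toy)
closed_form <- c(1, 0) %*% solve(t(R_toy) %*% W_toy %*% R_toy) %*% t(R_toy) %*% W_toy %*% y_toy
lm_fit <- lm(y_toy ~ x_toy, weights = w_toy)
c(closed_form = as.numeric(closed_form), lm_intercept = unname(coef(lm_fit)[1]))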
Let's consider the bias of our estimator, which we write as Bias\left(\widehat{m}\left(0\right)\right)=\mathbb{E}\left[\widehat{m}\left(0\right)\mid X\right]-m\left(0\right)
To calculate this bias, we begin with writing out \mathbb{E}\left[\widehat{m}\left(0\right)\mid X\right]=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WM
where M=\left(m\left(X_{1}\right),...,m\left(X_{N}\right)\right). To approximate this expression, we use the assumption that m(x)=\mathbb{E}[Y \mid X=x] is thrice differentiable, and hence a Taylor expansion can be written as
m\left(X_{i}\right)=m\left(0\right)+m^{\left(1\right)}\left(0\right)X_{i}+m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2}+T_i
where we denote by T_i the Taylor expansion remainder, so that we can write an equality (and not an approximation). This element is bounded by \left|T_{i}\right| \leq \max_{c\in supp(X)}\left|m^{\left(3\right)}\left(c\right)\right|\times\left|X_{i}^{3}\right|
Let's continue with the proof. Denote by S an N\times 1 vector with typical element S_i=m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2}, and by T the N\times 1 vector with typical element T_i as defined above. Using this notation, we can rewrite the Taylor expansion for a single X_i as
\begin{equation*} m\left(X_{i}\right)=(1\;X_{i})\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right) + S_{i}+T_{i} \end{equation*}and for the whole matrix M we can write
\begin{equation*} M=R\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right)+S+T \end{equation*}Returning to the conditional expectation of our estimate, we can now write \begin{align*} \mathbb{E}\left[\widehat{m}\left(0\right)\mid X\right] & =e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WM\\ & =e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(R\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right)+S+T\right)\\ & =e_{1}^{\prime}\left(\begin{array}{c} m\left(0\right)\\ m^{\left(1\right)}\left(0\right) \end{array}\right)+e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right)\\ & =m\left(0\right)+e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right) \end{align*}
Substituting this into the bias equation gives us Bias\left(\widehat{m}\left(0\right)\right)=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right)
Proof: Bias First Element
Let's consider each of these in turn. We'll start with e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WT. Consider the inverse element separately.
\begin{align*} R^{\prime}WR & =\left(\begin{array}{ccc} 1 & \cdots & 1\\ X_{1} & \cdots & X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & K_{h}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{cc} 1 & X_{1}\\ \vdots & \vdots\\ 1 & X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{cc} 1 & X_{1}\\ \vdots & \vdots\\ 1 & X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{cc} \sum K_{h}\left(X_{i}\right) & \sum K_{h}\left(X_{i}\right)X_{i}\\ \sum K_{h}\left(X_{i}\right)X_{i} & \sum K_{h}\left(X_{i}\right)X_{i}^{2} \end{array}\right)_{2\times2} \end{align*}Multiplying this by the N inverse gives us
\begin{align*} \frac{1}{N}R^{\prime}WR & =\left(\begin{array}{cc} N^{-1}\sum K_{h}\left(X_{i}\right) & N^{-1}\sum K_{h}\left(X_{i}\right)X_{i}\\ N^{-1}\sum K_{h}\left(X_{i}\right)X_{i} & N^{-1}\sum K_{h}\left(X_{i}\right)X_{i}^{2} \end{array}\right)_{2\times2}\\ & =\left(\begin{array}{cc} F_{0} & F_{1}\\ F_{1} & F_{2} \end{array}\right)_{2\times2} \end{align*}The inverse of this matrix is \begin{align*} \left(\frac{1}{N}R^{\prime}WR\right)^{-1} & =\left(\begin{array}{cc} F_{0} & F_{1}\\ F_{1} & F_{2} \end{array}\right)^{-1}\\ &=\frac{1}{F_{0}F_{2}-\left(F_{1}\right)^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)\\ & =\left(\begin{array}{cc} \frac{F_{2}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} & -\frac{F_{1}}{F_{0}F_{2}-\left(F_{1}\right)^{2}}\\ -\frac{F_{1}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} & \frac{F_{0}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} \end{array}\right) \end{align*}
Going back to Lemma 1, we use F_{j}=h^{j}\left(f(0)v_{j}+o_{p}(1)\right). Let's start with the determinant in the denominator: \begin{align*} F_{0}F_{2}-F_{1}^{2} & =\left[f(0)v_{0}+o_{p}\left(1\right)\right]\times\left[h^{2}\left(f(0)v_{2}+o_{p}\left(1\right)\right)\right]-\left[h\left(f(0)v_{1}+o_{p}\left(1\right)\right)\right]^{2}\\ & =h^{2}\left[f(0)v_{0}+o_{p}\left(1\right)\right]\times\left[f(0)v_{2}+o_{p}\left(1\right)\right]-h^{2}\left[f(0)v_{1}+o_{p}\left(1\right)\right]^{2}\\ & =h^{2}\left[f(0)v_{0}\times f(0)v_{2}+o_{p}\left(1\right)\right]-h^{2}\left[f(0)^{2}v_{1}^{2}+o_{p}\left(1\right)\right]\\ & =h^{2}\left[f(0)^{2}\left(v_{0}v_{2}-v_{1}^{2}\right)+o_{p}\left(1\right)\right] \end{align*}
A similar process for the numerators allows us to write \begin{align*} \left(\frac{1}{N}R^{\prime}WR\right)^{-1} & =\left(\begin{array}{cc} \frac{F_{2}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} & -\frac{F_{1}}{F_{0}F_{2}-\left(F_{1}\right)^{2}}\\ -\frac{F_{1}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} & \frac{F_{0}}{F_{0}F_{2}-\left(F_{1}\right)^{2}} \end{array}\right)\\ & =\left(\begin{array}{cc} \frac{h^{2}\left(f(0)v_{2}+o_{p}(1)\right)}{h^{2}f(0)^{2}\left[v_{0}v_{2}-v_{1}^{2}\right]} & -\frac{h\left(f(0)v_{1}+o_{p}(1)\right)}{h^{2}f(0)^{2}\left[v_{0}v_{2}-v_{1}^{2}\right]}\\ -\frac{h\left(f(0)v_{1}+o_{p}(1)\right)}{h^{2}f(0)^{2}\left[v_{0}v_{2}-v_{1}^{2}\right]} & \frac{f(0)v_{0}+o_{p}(1)}{h^{2}f(0)^{2}\left[v_{0}v_{2}-v_{1}^{2}\right]} \end{array}\right)\\ & =\left(\begin{array}{cc} \frac{v_{2}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right) & -\frac{1}{h}\left(\frac{v_{1}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right)\\ -\frac{1}{h}\left(\frac{v_{1}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right) & \frac{1}{h^{2}}\left(\frac{v_{0}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right) \end{array}\right)\\ & =\left(\begin{array}{cc} O_{p}\left(1\right) & O_{p}\left(\frac{1}{h}\right)\\ O_{p}\left(\frac{1}{h}\right) & O_{p}\left(\frac{1}{h^{2}}\right) \end{array}\right) \end{align*}
Lets turn to the second part. \begin{align*}R^{\prime}WT & =\left(\begin{array}{ccc} 1 & \cdots & 1\\ X_{1} & \cdots & X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & K_{h}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{c} T_{1}\\ \vdots\\ T_{N} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{c} T_{1}\\ \vdots\\ T_{N} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)T_{i}\\ \sum K_{h}\left(X_{i}\right)X_{i}T_{i} \end{array}\right)_{2\times 1} \end{align*}
The absolute value of this gives us \begin{align*} \left|R^{\prime}WT\right| & =\left(\begin{array}{c} \left|\sum K_{h}\left(X_{i}\right)T_{i}\right|\\ \left|\sum K_{h}\left(X_{i}\right)X_{i}T_{i}\right| \end{array}\right)\\ & \leq\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)\left|T_{i}\right|\\ \sum K_{h}\left(X_{i}\right)X_{i}\left|T_{i}\right| \end{array}\right)\\ & \leq\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)\left[\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\left|X_{i}^{3}\right|\right]\\ \sum K_{h}\left(X_{i}\right)X_{i}\left[\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\left|X_{i}^{3}\right|\right] \end{array}\right)\\ & =\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)\left|X_{i}^{3}\right|\\ \sum K_{h}\left(X_{i}\right)X_{i}\left|X_{i}^{3}\right| \end{array}\right)\\ & =\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)X_{i}^{3}\\ \sum K_{h}\left(X_{i}\right)X_{i}^{4} \end{array}\right) \end{align*} where the first inequality is the triangle inequality (using X_i\geq0 and K\geq0 in the second row) and the last equality holds since we only consider X_i\geq0, so \left|X_{i}^{3}\right|=X_{i}^{3}. Dividing by N we get \begin{align*} \left|\frac{1}{N}R^{\prime}WT\right| & \leq\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} \frac{1}{N}\sum K_{h}\left(X_{i}\right)X_{i}^{3}\\ \frac{1}{N}\sum K_{h}\left(X_{i}\right)X_{i}^{4} \end{array}\right)\\ & =\sup_{c}\left|m^{\left(3\right)}\left(c\right)\right|\times\left(\begin{array}{c} F_{3}\\ F_{4} \end{array}\right)\\ & \leq\left(\begin{array}{c} o_{p}\left(h^{2}\right)\\ o_{p}\left(h^{3}\right) \end{array}\right) \end{align*}
This finally allows us to write \begin{align*} e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WT & \leq\left(\begin{array}{cc} O_{p}\left(1\right) & O_{p}\left(\frac{1}{h}\right)\\ O_{p}\left(\frac{1}{h}\right) & O_{p}\left(\frac{1}{h^{2}}\right) \end{array}\right)\left(\begin{array}{c} o_{p}\left(h^{2}\right)\\ o_{p}\left(h^{3}\right) \end{array}\right)\\ & =e_{1}^{\prime}\left(\begin{array}{c} O_{p}\left(1\right)\times o_{p}\left(h^{2}\right)+O_{p}\left(\frac{1}{h}\right)\times o_{p}\left(h^{3}\right)\\ O_{p}\left(\frac{1}{h}\right)\times o_{p}\left(h^{2}\right)+O_{p}\left(\frac{1}{h^{2}}\right)\times o_{p}\left(h^{3}\right) \end{array}\right)\\ & =O_{p}\left(1\right)\times o_{p}\left(h^{2}\right)+O_{p}\left(\frac{1}{h}\right)\times o_{p}\left(h^{3}\right)\\ & =o_{p}\left(h^{2}\right) \end{align*}
Phew. Halfway there.
Proof: Bias Second Element
We now turn to consider e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WS. We already did most of the work above, so we only need to find R^{\prime}WS.
\begin{align*} R^{\prime}WS & =\left(\begin{array}{ccc} 1 & \cdots & 1\\ X_{1} & \cdots & X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & K_{h}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{c} m^{\left(2\right)}\left(0\right)\frac{X_{1}^{2}}{2}\\ \vdots\\ m^{\left(2\right)}\left(0\right)\frac{X_{N}^{2}}{2} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{c} m^{\left(2\right)}\left(0\right)\frac{X_{1}^{2}}{2}\\ \vdots\\ m^{\left(2\right)}\left(0\right)\frac{X_{N}^{2}}{2} \end{array}\right)_{N\times1}\\ & =\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2}\\ \sum K_{h}\left(X_{i}\right)X_{i}m^{\left(2\right)}\left(0\right)\frac{X_{i}^{2}}{2} \end{array}\right)_{2\times1}\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{c} \sum K_{h}\left(X_{i}\right)X_{i}^{2}\\ \sum K_{h}\left(X_{i}\right)X_{i}^{3} \end{array}\right) \end{align*}Dividing by N we get
\begin{align*} \frac{1}{N}R^{\prime}WS & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{c} F_{2}\\ F_{3} \end{array}\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{c} h^{2}\left(f\left(0\right)v_{2}+o_{p}\left(1\right)\right)\\ h^{3}\left(f\left(0\right)v_{3}+o_{p}\left(1\right)\right) \end{array}\right) \end{align*}Putting it together we have
\begin{align*}e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}WS & =e_{1}^{\prime}\left(\frac{1}{N}R^{\prime}WR\right)^{-1}\frac{1}{N}R^{\prime}WS\\ & =\left(\begin{array}{cc} \frac{v_{2}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right) & -\frac{1}{h}\left(\frac{v_{1}}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(1\right)\right)\end{array}\right)\frac{m^{\left(2\right)}\left(0\right)}{2}\left(\begin{array}{c} h^{2}\left(f\left(0\right)v_{2}+o_{p}\left(1\right)\right)\\ h^{3}\left(f\left(0\right)v_{3}+o_{p}\left(1\right)\right) \end{array}\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}\left[\frac{v_{2}h^{2}\left(f\left(0\right)v_{2}+o_{p}\left(1\right)\right)}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}-\frac{v_{1}h^{2}\left(f\left(0\right)v_{3}+o_{p}\left(1\right)\right)}{f(0)\left[v_{0}v_{2}-v_{1}^{2}\right]}+o_{p}\left(h^{2}\right)\right]\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}h^{2}\left[\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}\right]+o_{p}\left(h^{2}\right) \end{align*}where the second equality uses that e_{1}^{\prime} picks out the first row of the inverse. All the above allows us to write the bias as equal to
\begin{align*} Bias\left(\widehat{m}\left(0\right)\right) & =e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\left(S+T\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}h^{2}\left[\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}\right]+o_{p}\left(h^{2}\right)+o_{p}\left(h^{2}\right)\\ & =\frac{m^{\left(2\right)}\left(0\right)}{2}h^{2}\left[\frac{v_{2}^{2}-v_{1}v_{3}}{v_{0}v_{2}-v_{1}^{2}}\right]+o_{p}\left(h^{2}\right) \end{align*}
Proof: Variance
Define the N\times N matrix \Sigma with typical element \begin{equation*} \Sigma_{ij}=\begin{cases} \sigma^2(X_i) & i=j\\ 0 & o.w. \end{cases} \end{equation*}
The conditional variance of our estimator is equal to
\begin{equation*} \mathbb{V}\left(\widehat{m}\left(0\right)\mid X\right)=e_{1}^{\prime}\left(R^{\prime}WR\right)^{-1}R^{\prime}W\Sigma WR\left(R^{\prime}WR\right)^{-1}e_{1} \end{equation*}Due to our work above, we only need to do some work on the middle element.
\begin{align*} R^{\prime}W & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\\ WR & =\left(\begin{array}{cc} K_{h}\left(X_{1}\right) & K_{h}\left(X_{1}\right)X_{1}\\ \vdots & \vdots\\ K_{h}\left(X_{N}\right) & K_{h}\left(X_{N}\right)X_{N} \end{array}\right)_{N\times2}\\ R^{\prime}W\Sigma WR & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\\ X_{1}K_{h}\left(X_{1}\right) & \cdots & X_{N}K_{h}\left(X_{N}\right) \end{array}\right)_{2\times N}\left(\begin{array}{ccc} \sigma^{2}\left(X_{1}\right) & & 0\\ & \ddots\\ 0 & & \sigma^{2}\left(X_{N}\right) \end{array}\right)_{N\times N}\left(\begin{array}{cc} K_{h}\left(X_{1}\right) & K_{h}\left(X_{1}\right)X_{1}\\ \vdots & \vdots\\ K_{h}\left(X_{N}\right) & K_{h}\left(X_{N}\right)X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{ccc} K_{h}\left(X_{1}\right)\sigma^{2}\left(X_{1}\right) & \cdots & K_{h}\left(X_{N}\right)\sigma^{2}\left(X_{N}\right)\\ K_{h}\left(X_{1}\right)\sigma^{2}\left(X_{1}\right)X_{1} & \cdots & K_{h}\left(X_{N}\right)\sigma^{2}\left(X_{N}\right)X_{N} \end{array}\right)_{2\times N}\left(\begin{array}{cc} K_{h}\left(X_{1}\right) & K_{h}\left(X_{1}\right)X_{1}\\ \vdots & \vdots\\ K_{h}\left(X_{N}\right) & K_{h}\left(X_{N}\right)X_{N} \end{array}\right)_{N\times2}\\ & =\left(\begin{array}{cc} \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right) & \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right)X_{i}\\ \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right)X_{i} & \sum\left(K_{h}\left(X_{i}\right)\right)^{2}\sigma^{2}\left(X_{i}\right)X_{i}^{2} \end{array}\right) \end{align*}Dividing by the N
\begin{align*} \frac{1}{N}R^{\prime}W\Sigma WR & =\left(\begin{array}{cc} G_{0} & G_{1}\\ G_{1} & G_{2} \end{array}\right) \end{align*}We remind that we already found that \begin{align*} \left(\frac{1}{N}R^{\prime}WR\right)^{-1} & =\frac{1}{F_{0}F_{2}-F_{1}^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right) \end{align*}
And putting it all together \begin{align*} \mathbb{V}\left(\widehat{m}\left(0\right)\mid X\right) & =\frac{1}{N}e_{1}^{\prime}\frac{1}{F_{0}F_{2}-F_{1}^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)\left(\begin{array}{cc} G_{0} & G_{1}\\ G_{1} & G_{2} \end{array}\right)\frac{1}{F_{0}F_{2}-F_{1}^{2}}\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)e_{1}\\ & =\frac{1}{N}e_{1}^{\prime}\frac{1}{\left(F_{0}F_{2}-F_{1}^{2}\right)^{2}}\left(\begin{array}{cc} F_{2}G_{0}-F_{1}G_{1} & F_{2}G_{1}-F_{1}G_{2}\\ -F_{1}G_{0}+F_{0}G_{1} & -F_{1}G_{1}+F_{0}G_{2} \end{array}\right)\left(\begin{array}{cc} F_{2} & -F_{1}\\ -F_{1} & F_{0} \end{array}\right)e_{1}\\ & =e_{1}^{\prime}\frac{1}{N\left(F_{0}F_{2}-F_{1}^{2}\right)^{2}}\left(\begin{array}{cc} F_{2}\left[F_{2}G_{0}-F_{1}G_{1}\right]-F_{1}\left[F_{2}G_{1}-F_{1}G_{2}\right] & -F_{1}\left[F_{2}G_{0}-F_{1}G_{1}\right]+F_{0}\left[F_{2}G_{1}-F_{1}G_{2}\right]\\ F_{2}\left[-F_{1}G_{0}+F_{0}G_{1}\right]-F_{1}\left[-F_{1}G_{1}+F_{0}G_{2}\right] & -F_{1}\left[-F_{1}G_{0}+F_{0}G_{1}\right]+F_{0}\left[-F_{1}G_{1}+F_{0}G_{2}\right] \end{array}\right)e_{1}\\ & =\frac{F_{2}^{2}G_{0}-2F_{1}F_{2}G_{1}+F_{1}^{2}G_{2}}{N\left(F_{0}F_{2}-F_{1}^{2}\right)^{2}} \end{align*}
Applying Lemma 1 and Lemma 2 we use F_{j}=h^{j}f\left(0\right)v_{j}+o_{p}\left(h^{j}\right) and G_{j}=h^{j-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{j}+o_{p}\left(h^{j-1}\right) which leads to \begin{align*} F_{2}^{2}G_{0} & =\left(h^{2}f\left(0\right)v_{2}+o_{p}\left(h^{2}\right)\right)^{2}\left(h^{-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{0}+o_{p}\left(h^{-1}\right)\right)\\ & =\left(h^{4}f^{2}\left(0\right)v_{2}^{2}+o_{p}\left(h^{4}\right)\right)\left(h^{-1}\sigma^{2}\left(0\right)f\left(0\right)\pi_{0}+o_{p}\left(h^{-1}\right)\right)\\ & =h^{3}f^{3}\left(0\right)v_{2}^{2}\sigma^{2}\left(0\right)\pi_{0}+o_{p}\left(h^{3}\right)\\ 2F_{1}F_{2}G_{1} & =2\left(hf\left(0\right)v_{1}+o_{p}\left(h\right)\right)\left(h^{2}f\left(0\right)v_{2}+o_{p}\left(h^{2}\right)\right)\left(\sigma^{2}\left(0\right)f\left(0\right)\pi_{1}+o_{p}\left(1\right)\right)\\ & =2\left(h^{3}f^{2}\left(0\right)v_{1}v_{2}+o_{p}\left(h^{3}\right)\right)\left(\sigma^{2}\left(0\right)f\left(0\right)\pi_{1}+o_{p}\left(1\right)\right)\\ & =2h^{3}f^{3}\left(0\right)v_{1}v_{2}\sigma^{2}\left(0\right)\pi_{1}+o_{p}\left(h^{3}\right)\\ F_{1}^{2}G_{2} & =\left(hf\left(0\right)v_{1}+o_{p}\left(h\right)\right)^{2}\left(h\sigma^{2}\left(0\right)f\left(0\right)\pi_{2}+o_{p}\left(h\right)\right)\\ & =\left(h^{2}f^{2}\left(0\right)v_{1}^{2}+o_{p}\left(h^{2}\right)\right)\left(h\sigma^{2}\left(0\right)f\left(0\right)\pi_{2}+o_{p}\left(h\right)\right)\\ & =h^{3}f^{3}\left(0\right)v_{1}^{2}\sigma^{2}\left(0\right)\pi_{2}+o_{p}\left(h^{3}\right)\\ F_{2}^{2}G_{0}-2F_{1}F_{2}G_{1}+F_{1}^{2}G_{2} & =h^{3}f^{3}\left(0\right)v_{2}^{2}\sigma^{2}\left(0\right)\pi_{0}+o_{p}\left(h^{3}\right)-2h^{3}f^{3}\left(0\right)v_{1}v_{2}\sigma^{2}\left(0\right)\pi_{1}\\ & -h^{3}f^{3}\left(0\right)v_{1}^{2}\sigma^{2}\left(0\right)\pi_{2}+o_{p}\left(h^{3}\right)+o_{p}\left(h^{3}\right)\\ & =h^{3}f^{3}\left(0\right)\sigma^{2}\left(0\right)\left[v_{2}^{2}\pi_{0}-2v_{1}v_{2}\pi_{1}+v_{1}^{2}\pi_{2}\right]+o_{p}\left(h^{3}\right) \end{align*}
That was the numerator. The denominator is similar to what we did previously \begin{align*} N\left(F_{0}F_{2}-F_{1}^{2}\right)^{2} & =N\left[h^{2}f^{2}\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)+o_{p}\left(h^{2}\right)\right]^{2}\\ & =Nh^{4}f^{4}\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)^{2}+o_{p}\left(Nh^{4}\right) \end{align*}
Putting it together we get \begin{align*} \mathbb{V}\left(\widehat{m}\left(0\right)\mid X\right) & =\frac{h^{3}f^{3}\left(0\right)\sigma^{2}\left(0\right)\left[v_{2}^{2}\pi_{0}-2v_{1}v_{2}\pi_{1}+v_{1}^{2}\pi_{2}\right]+o_{p}\left(h^{3}\right)}{Nh^{4}f^{4}\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)^{2}+o_{p}\left(Nh^{4}\right)}\\ & =\frac{\sigma^{2}\left(0\right)\left[v_{2}^{2}\pi_{0}-2v_{1}v_{2}\pi_{1}+v_{1}^{2}\pi_{2}\right]}{Nhf\left(0\right)\left(v_{0}v_{2}-v_{1}^{2}\right)^{2}}+o_{p}\left(\frac{1}{Nh}\right) \end{align*}
which concludes the proof of the lemma.
Illustration
Let's take a look at what we just proved. We'll start with the bias, then the variance, and lastly the MSE.
Let's set up an example RD DGP and, as in Lemma 3, look only at observations above the cutoff.
# function that creates RD dgp with outcome function `y_function`
rd_dgp <- function(N, jump = 0.25, y_function, seed = 1, error_sd = 0.4) {
set.seed(seed)
data.frame(X = rnorm(N)) |>
as_tibble() |>
mutate(
W = 1*(X >= 0),
Y = y_function(X) + rnorm(N, sd = error_sd)
) |>
# add discontinuity
mutate(Y = ifelse(W == 1, Y + jump, Y))
}
# some random outcome function
y_function1 <- function(X) {10 + X + 0.2*X^2 - 0.2*X^3}
# example dgp and plot
df <- rd_dgp(N = 1000, jump = 1, y_function = y_function1)
df |>
ggplot(aes(x=X, y=Y)) +
geom_point(aes(color = as.factor(W))) +
theme_bw() + geom_vline(xintercept = 0, linetype = "dashed")Okay, so we have a DGP. Lets add a kernel and example bandwidth.
# this is our DGP from before
N <- 1000
jump <- 1
df <- rd_dgp(N = N, jump = jump, y_function = y_function1)
# create a kernel function
k_triangular = function(u) {1*(u >= -1 & u <= 1) * (1-abs(u))}
# choose some bandwidth
h = df$X |> sd()
# calculate weights K((X-c)/h), noting that as in the lemma we set c = 0.
df <- df |>
mutate(weights = k_triangular(X/h))
df |>
ggplot(aes(x=X, y=Y, color = weights)) +
geom_point() +
geom_vline(xintercept = h, linetype = "dashed") +
geom_vline(xintercept = -h, linetype = "dashed") +
scale_color_gradientn(name = "Weights",
colours = wesanderson::wes_palette(name = "Zissou1", type = "continuous")) +
theme_bw()
Lastly, let's focus on the estimator of the conditional expectation m(0).
## as in the lemma, keep only obs above the cutoff
df_above <- df |>
filter(X >= 0)
## lets calculate the estimator by hand, as we did in the lemma
# define all the needed vectors and matrices
X <- df_above$X
R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
W <- diag(x = k_triangular(X/h))
Y <- df_above$Y
e_1 <- c(1, 0)
# calculate
e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Y
       [,1]
[1,] 10.974
## lets put this into a function
m_0_function <- function(X, Y, k, h, c = 0) {
R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
W <- diag(x = k((X-c)/h))
e_1 <- c(1, 0)
# calculate
m_0 <- e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Y
return(m_0)
}
m_0_function(X = df_above$X, Y = df_above$Y, k = k_triangular, h = sd(df$X))
       [,1]
[1,] 10.974
And last but not least, let's compute the truth.
## to calculate the truth above the cutoff
## we use the function for m(X), and the jump
y_function1(0) + jump
[1] 11
Now that we have everything, we can simulate DGPs and calculate the bias.
## create a function that simulates a dgp, and calculates the bias
single_simulation <- function(N = 1000, jump = 1, y_function, k, h, seed = 1) {
# simulate a data set
df <- rd_dgp(N = N, jump = jump, y_function = y_function, seed = seed)
# calculate weights K((X-c)/h), noting that as in the lemma we set c = 0.
df <- df |>
mutate(weights = k(X/h))
# keep obs above cutoff
df <- df |>
filter(X >= 0)
# estimate m_0
m_0_hat <- m_0_function(X = df$X, Y = df$Y, k = k, h = h)
# return data frame of estimate and bias
data.frame(estimate = m_0_hat) |>
mutate(
truth = y_function(0) + jump,
bias = estimate - truth
)
}
# test the function
single_simulation(N = 10000, y_function = y_function1, k = k_triangular, h = 0.2)
  estimate truth      bias
1   10.985    11 -0.014864
Let's run this simulation many times and calculate the mean bias across simulations.
simulation_function <- function(num_it= 10, N = 1000, jump = 1, y_function, k, h) {
lapply(1:num_it, single_simulation, N = N, jump = jump, y_function = y_function, k = k, h = h) |>
dplyr::bind_rows() |>
mutate(iteration = row_number())
}
h_vector = seq(0.1, 1, 0.1)
full = NULL
for(h in h_vector) {
full = rbind(full,
simulation_function(num_it = num_it, N = N, jump = 1, y_function = y_function1, k = k_triangular, h = h ) |>
mutate(h = h)
)
print(glue::glue("Finished h = {h}, at time {Sys.time()}."))
}
Finished h = 0.1, at time 2022-09-28 14:51:20.
Finished h = 0.2, at time 2022-09-28 14:51:21.
Finished h = 0.3, at time 2022-09-28 14:51:23.
Finished h = 0.4, at time 2022-09-28 14:51:24.
Finished h = 0.5, at time 2022-09-28 14:51:25.
Finished h = 0.6, at time 2022-09-28 14:51:27.
Finished h = 0.7, at time 2022-09-28 14:51:28.
Finished h = 0.8, at time 2022-09-28 14:51:29.
Finished h = 0.9, at time 2022-09-28 14:51:31.
Finished h = 1, at time 2022-09-28 14:51:32.
full |>
group_by(h) |>
summarise(mean_bias = mean(bias)) |>
ggplot(aes(x = h, y = mean_bias)) +
geom_line()
Let's calculate the first-order approximation we found.
## we'll use the equation in the last lemma
## we start with the true second derivative
# lets start with the true bias
# the conditional expectation function we have used so far is
y_function1
function(X) {10 + X + 0.2*X^2 - 0.2*X^3}
# taking second derivative by X we get
y_function_1_sec_dev = function(X) {0.2*2*1 - 0.2*3*2*X}
# hence the true second derivative at zero is
y_function_1_sec_dev(0)
[1] 0.4
## we next calculate all the elements that depend on K
v_j = function(u, k, j) {k(u) * u^j}
v_j_int = function(k, j) {integrate(f = v_j, lower = 0, upper = Inf, j = j, k = k)$value}
pi_j = function(u, k, j) {k(u)^2 * u^j}
pi_j_int = function(k, j) {integrate(f = pi_j, lower = 0, upper = Inf, j = j, k = k)$value}
C_1_k <- function(k = k_triangular) {
1 / 2 * ((v_j_int(k = k, j = 2) ^ 2) -
(v_j_int(k = k, j = 1) * v_j_int(k = k, j = 3))) /
((v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2)) -
v_j_int(k = k, j = 1) ^ 2)
}
C_1_k()
[1] -0.05
## now we can calculate the true first order approximation of the bias
bias_approx = function(k, h) {
C_1_k(k = k_triangular) * y_function_1_sec_dev(0) * h^2
}
## an example
bias_approx(k = k_triangular, h = 2)
[1] -0.08
Let's add this approximation for each bandwidth and see how close it is to the simulated mean bias.
full |>
group_by(h) |>
summarise(mean_bias = mean(bias)) |>
mutate(approx_true_mean_bias = bias_approx(k = k_triangular, h = h)) |>
ggplot(aes(x = h)) +
geom_line(aes(y = mean_bias, color = "Mean Bias")) +
geom_line(aes(y = approx_true_mean_bias, color = "First Order Approx. Bias"))
Okay, this didn't come out as close as I thought it would. Let's continue with the variance.
## as in the lemma, keep only obs above the cutoff
df <- rd_dgp(N = 1000, jump = 1, y_function = y_function1) |>
filter(X >= 0)
## lets calculate the estimator by hand, as we did in the lemma
# define all the needed vectors and matrices
X <- df_above$X
R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
W <- diag(x = k_triangular(X/h))
Y <- df_above$Y
e_1 <- c(1, 0)
Sigma <- diag(x = rep(1, times = length(X)))
# calculate
e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Sigma%*%W%*%R %*% solve(t(R)%*%W%*%R) %*% matrix(c(1,0), ncol = 1)
         [,1]
[1,] 0.015304
## lets put this into a function
V_0_function <- function(X, Y, k, h, c = 0) {
R <- matrix(c(rep(1, times = length(X)), X), ncol = 2)
W <- diag(x = k((X-c)/h))
e_1 <- c(1, 0)
Sigma <- diag(x = rep(1, times = length(X)))
# calculate
V_0 <- e_1 %*% solve(t(R)%*%W%*%R) %*% t(R)%*%W%*%Sigma%*%W%*%R %*% solve(t(R)%*%W%*%R) %*% matrix(c(1,0), ncol = 1)
return(V_0)
}
V_0_function(X = df$X, Y = df$Y, k = k_triangular, h = sd(df$X))
         [,1]
[1,] 0.024837
## next we calculate the first order approximation
C_2_k <- function(k) {
(
(v_j_int(k = k, j = 2)^2 * pi_j_int(k = k, j = 0)) -
(2*v_j_int(k = k, j = 1) * v_j_int(k = k, j = 2) * pi_j_int(k = k, j = 1)) +
(v_j_int(k = k, j = 1)^2 * pi_j_int(k = k, j = 2))
) /
(
(v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2)) -
(v_j_int(k = k, j = 1)^2)
)^2
}
## now we can calculate the true first order approximation of the variance
## note that since we are simulating, we know the density function, which is unknown in observable data...
variance_approx = function(k, h, N, sigma = 1) {
# sigma is the conditional standard deviation at the boundary, so the formula uses sigma^2
(C_2_k(k = k_triangular) * sigma^2) /
(N * h * dnorm(x = 0))
}
## putting everything together
# create a function that simulates a dgp, and calculates all of the above
single_simulation <- function(N = 1000, jump = 1, y_function, k, h, seed = 1) {
# simulate a data set
df <- rd_dgp(N = N, jump = jump, y_function = y_function, seed = seed)
# calculate weights K((X-c)/h), noting that as in the lemma we set c = 0.
df <- df |>
mutate(weights = k(X/h))
# keep obs above cutoff
df <- df |>
filter(X >= 0)
# estimate m_0
m_0_hat <- m_0_function(X = df$X, Y = df$Y, k = k, h = h)
# return data frame of estimate and bias
data.frame(estimate = m_0_hat) |>
mutate(
truth = y_function(0) + jump,
bias = estimate - truth,
variance = V_0_function(X = df$X, Y = df$Y, k = k, h = h)
)
}
simulation_function <- function(num_it= 10, N = 1000, jump = 1, y_function, k, h) {
lapply(1:num_it, single_simulation, N = N, jump = jump, y_function = y_function, k = k, h = h) |>
dplyr::bind_rows() |>
mutate(iteration = row_number())
}
h_vector = seq(0.1, 3, 0.1)
full = NULL
for(h in h_vector) {
full = rbind(full,
simulation_function(num_it = num_it, N = N, jump = 1, y_function = y_function1, k = k_triangular, h = h ) |>
mutate(h = h) |>
mutate(approx_bias = bias_approx(k = k, h = h),
approx_var = variance_approx(k = k, h = h, N = N))
)
print(glue::glue("Finished h = {h}, at time {Sys.time()}."))
}
Finished h = 0.1, at time 2022-09-28 14:51:35.
Finished h = 0.2, at time 2022-09-28 14:51:36.
Finished h = 0.3, at time 2022-09-28 14:51:38.
Finished h = 0.4, at time 2022-09-28 14:51:40.
Finished h = 0.5, at time 2022-09-28 14:51:42.
Finished h = 0.6, at time 2022-09-28 14:51:44.
Finished h = 0.7, at time 2022-09-28 14:51:46.
Finished h = 0.8, at time 2022-09-28 14:51:47.
Finished h = 0.9, at time 2022-09-28 14:51:49.
Finished h = 1, at time 2022-09-28 14:51:51.
Finished h = 1.1, at time 2022-09-28 14:51:53.
Finished h = 1.2, at time 2022-09-28 14:51:55.
Finished h = 1.3, at time 2022-09-28 14:51:57.
Finished h = 1.4, at time 2022-09-28 14:51:58.
Finished h = 1.5, at time 2022-09-28 14:52:00.
Finished h = 1.6, at time 2022-09-28 14:52:02.
Finished h = 1.7, at time 2022-09-28 14:52:04.
Finished h = 1.8, at time 2022-09-28 14:52:06.
Finished h = 1.9, at time 2022-09-28 14:52:07.
Finished h = 2, at time 2022-09-28 14:52:09.
Finished h = 2.1, at time 2022-09-28 14:52:11.
Finished h = 2.2, at time 2022-09-28 14:52:13.
Finished h = 2.3, at time 2022-09-28 14:52:15.
Finished h = 2.4, at time 2022-09-28 14:52:17.
Finished h = 2.5, at time 2022-09-28 14:52:19.
Finished h = 2.6, at time 2022-09-28 14:52:21.
Finished h = 2.7, at time 2022-09-28 14:52:22.
Finished h = 2.8, at time 2022-09-28 14:52:24.
Finished h = 2.9, at time 2022-09-28 14:52:26.
Finished h = 3, at time 2022-09-28 14:52:28.
full |>
group_by(h) |>
summarise_all(mean) |>
ggplot(aes(x = h)) +
geom_line(aes(y = bias, color = "Mean Bias")) +
geom_line(aes(y = approx_bias, color = "First Order Approx Bias"))
full |>
group_by(h) |>
summarise_all(mean) |>
ggplot(aes(x = h)) +
geom_line(aes(y = variance, color = "Mean Variance")) +
geom_line(aes(y = approx_var, color = "First Order Approx Variance"))
The variance is spot on! Let's finish with the MSE.
full |>
mutate(bias_sq = bias^2) |>
group_by(h) |>
summarise_all(mean) |>
rename(mse = bias_sq) |>
mutate(mse_sum = bias^2 + variance) |>
mutate(mse_approx = approx_bias^2 + approx_var) |>
ggplot(aes(x = h)) +
geom_line(aes(y = mse, color = "MSE")) +
geom_line(aes(y = mse_sum, color = "MSE Sum")) +
geom_line(aes(y = mse_approx, color = "MSE Approx.")) +
theme_bw() +
labs(x = "Bandwidth", y = "MSE", color = NULL)Proposition: MSE in RD
Proposition Statement
Proposition 1 (RD MSE) We finish with the MSE and the MSE-minimizing bandwidth in the RD setting. Recall that we denote the cutoff by c. Define the first order approximation of the MSE as AMSE\left(h\right)=C_{1}h^{4}\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}+\frac{C_{2}}{f\left(c\right)Nh}\left(\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)\right)
The MSE and AMSE are connected via MSE\left(h\right)=AMSE\left(h\right)+o_{p}\left(h^{4}+\frac{1}{Nh}\right)
And the bandwidth that minimizes the AMSE is equal to h_{opt}\equiv\arg\min_{h}AMSE\left(h\right)=\left[\frac{C_{2}/\left(4C_{1}\right)}{Nf\left(c\right)}\left(\frac{\sigma_{+}^{2}\left(c\right)+\sigma_{-}^{2}\left(c\right)}{\left(m_{+}^{\left(2\right)}\left(c\right)-m_{-}^{\left(2\right)}\left(c\right)\right)^{2}}\right)\right]^{1/5}
Proof
We will make an additional assumption, that m_{+}^{\left(2\right)}\left(c\right) \neq m_{-}^{\left(2\right)}\left(c\right), which is needed because of the denominator in h_{opt}.
We will start with considering the RD MSE as defined above.
In the context of RD, we are essentially using Lemma 3 twice, once above and once below the cutoff.
\begin{align*}MSE(h) & =\mathbb{E}\left[\left(\left(\widehat{\mu}_{+}-\mu_{+}\right)-\left(\widehat{\mu}_{-}-\mu_{-}\right)\right)^{2}\mid X\right]\\ & =\mathbb{E}\left[\left(\widehat{\mu}_{+}-\mu_{+}\right)^{2}-2\left(\widehat{\mu}_{+}-\mu_{+}\right)\left(\widehat{\mu}_{-}-\mu_{-}\right)+\left(\widehat{\mu}_{-}-\mu_{-}\right)^{2}\mid X\right]\\ & =\mathbb{E}\left[\left(\widehat{\mu}_{+}-\mu_{+}\right)^{2}\mid X\right]+\mathbb{E}\left[\left(\widehat{\mu}_{-}-\mu_{-}\right)^{2}\mid X\right]\\ & -2\mathbb{E}\left[\left(\widehat{\mu}_{+}-\mu_{+}\right)\left(\widehat{\mu}_{-}-\mu_{-}\right)\mid X\right]\\ & =MSE\left(\widehat{\mu}_{+}\right)+MSE\left(\widehat{\mu}_{-}\right)\\ COV\left(\widehat{\mu}_{+},\widehat{\mu}_{-} \mid X\right)=0\rightarrow & -2\mathbb{E}\left[\left(\widehat{\mu}_{+}-\mu_{+}\right)\mid X\right]\times\mathbb{E}\left[\left(\widehat{\mu}_{-}-\mu_{-}\right)\mid X\right]\\ & =MSE\left(\widehat{\mu}_{+}\right)+MSE\left(\widehat{\mu}_{-}\right)\\ & -2Bias\left(\widehat{\mu}_{+}\right)\times Bias\left(\widehat{\mu}_{-}\right) \end{align*}
Substituting the expression we found in Lemma 3, we get
\begin{align*}MSE(h) & =\left(C_{1}^{1/2}m_{+}^{\left(2\right)}\left(c\right)h^{2}\right)^{2}+o_{p}\left(h^{4}\right)+C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f_{+}\left(c\right)N_{+}h}+o_{p}\left(\frac{1}{N_{+}h}\right)\\ & +\left(C_{1}^{1/2}m_{-}^{\left(2\right)}\left(c\right)h^{2}\right)^{2}+o_{p}\left(h^{4}\right)+C_{2}\frac{\sigma_{-}^{2}\left(c\right)}{f_{-}\left(c\right)N_{-}h}+o_{p}\left(\frac{1}{N_{-}h}\right)\\ & -2\left(C_{1}^{1/2}m_{+}^{\left(2\right)}\left(c\right)h^{2}\right)\times\left(C_{1}^{1/2}m_{-}^{\left(2\right)}\left(c\right)h^{2}\right)+o_{p}\left(h^{4}\right)\\ & =C_{1}h^{4}\left(\left(m_{+}^{\left(2\right)}\left(c\right)\right)^{2}-2m_{+}^{\left(2\right)}\left(c\right)m_{-}^{\left(2\right)}\left(c\right)+\left(m_{-}^{\left(2\right)}\left(c\right)\right)^{2}\right)+o_{p}\left(h^{4}\right)\\ & +C_{2}\left(\frac{\sigma_{+}^{2}\left(c\right)}{f_{+}\left(c\right)N_{+}h}+\frac{\sigma_{-}^{2}\left(c\right)}{f_{-}\left(c\right)N_{-}h}\right)+o_{p}\left(\frac{1}{N_{+}h}\right)+o_{p}\left(\frac{1}{N_{-}h}\right) \end{align*}
We’re almost there.
Consider the fractions N_{+}/N and N_{-}/N. In the limit N\rightarrow\infty these converge to Pr\left(X\geq c\right) and Pr\left(X<c\right) respectively. In any finite sample the equality does not hold exactly, but the error diminishes with N: by the central limit theorem, N_{+}/N=Pr\left(X\geq c\right)+O_{p}\left(N^{-1/2}\right) and N_{-}/N=Pr\left(X<c\right)+O_{p}\left(N^{-1/2}\right).
Next, consider the density of X at the cutoff from above, f_{+}\left(c\right). We can write this as f_{X:X\geq c}\left(c\right), i.e. the density of X at c conditional on X being greater than or equal to c. Using the definition of a conditional density (Bayes' rule), we can write this as f_{X:X\geq c}\left(c\right)=f_{X}\left(c\right)/Pr\left(X\geq c\right).
Putting these together we have
\begin{align*} f_{+}\left(c\right)N_{+} & =\frac{f\left(c\right)}{Pr\left(X\geq c\right)}\times\left(N\times Pr\left(X\geq c\right)+N\times O_{p}\left(N^{-1/2}\right)\right)\\ & =f\left(c\right)N+O_{p}\left(N^{1/2}\right) \end{align*}and similarly f_{-}\left(c\right)N_{-}=f\left(c\right)N+O_{p}\left(N^{1/2}\right). Using these results we can write
\begin{align*} C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f_{+}\left(c\right)N_{+}h}+o_{p}\left(\frac{1}{N_{+}h}\right) & =C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f\left(c\right)Nh+O_{p}\left(N^{1/2}h\right)}+o_{p}\left(\frac{1}{Nh}\right)\\ & =C_{2}\frac{\sigma_{+}^{2}\left(c\right)}{f\left(c\right)Nh}+o_{p}\left(\frac{1}{Nh}\right) \end{align*}where the second equality uses that the O_{p}\left(N^{1/2}h\right) term is of smaller order than f\left(c\right)Nh. An analogous result below the cutoff allows us to write
MSE(h)=C_{1}h^{4}\left(m_{+}^{\left(2\right)}\left(0\right)-m_{-}^{\left(2\right)}\left(0\right)\right)^{2}+\frac{C_{2}}{f\left(0\right)Nh}\left(\sigma_{+}^{2}\left(0\right)+\sigma_{-}^{2}\left(0\right)\right)+o_{p}\left(h^{4}+\frac{1}{Nh}\right)
This concludes the proof of the MSE and AMSE.
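Circling back to the N_+/N step used in the proof, here is a tiny illustrative check (standard normal running variable with cutoff at zero, made up for this note) that the share above the cutoff is close to Pr(X \geq c) and that f_+(c)N_+ is close to f(c)N, up to sampling noise.
## quick check of the N_+/N and f_+(c) N_+ approximations (illustrative DGP)
set.seed(5)
N_sim <- 100000
X_sim <- rnorm(N_sim)
N_plus <- sum(X_sim >= 0)
f_plus_c <- dnorm(0) / pnorm(0, lower.tail = FALSE)  # density of X at c conditional on X >= c
c(share_above = N_plus / N_sim,
  prob_above = pnorm(0, lower.tail = FALSE),
  f_plus_N_plus = f_plus_c * N_plus,
  f_c_N = dnorm(0) * N_sim)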
Optimal Bandwidth
Take the optimal bandwidth as defined in Proposition 1. Setting the derivative of the AMSE with respect to h equal to zero gives
\begin{align*} \frac{\partial AMSE}{\partial h} & =4C_{1}h^{3}\left(m_{+}^{\left(2\right)}\left(0\right)-m_{-}^{\left(2\right)}\left(0\right)\right)^{2}-\frac{C_{2}}{f\left(0\right)Nh^{2}}\left(\sigma_{+}^{2}\left(0\right)+\sigma_{-}^{2}\left(0\right)\right)=0\\ & 4C_{1}h^{3}\left(m_{+}^{\left(2\right)}\left(0\right)-m_{-}^{\left(2\right)}\left(0\right)\right)^{2}=\frac{C_{2}}{f\left(0\right)Nh^{2}}\left(\sigma_{+}^{2}\left(0\right)+\sigma_{-}^{2}\left(0\right)\right)\\ & h^{5}=\frac{C_{2}}{f\left(0\right)N}\left(\sigma_{+}^{2}\left(0\right)+\sigma_{-}^{2}\left(0\right)\right)\times\left(4C_{1}\left(m_{+}^{\left(2\right)}\left(0\right)-m_{-}^{\left(2\right)}\left(0\right)\right)^{2}\right)^{-1}\\ & =\frac{C_{2}}{4C_{1}}\frac{\left(\sigma_{+}^{2}\left(0\right)+\sigma_{-}^{2}\left(0\right)\right)}{f\left(0\right)N\left(m_{+}^{\left(2\right)}\left(0\right)-m_{-}^{\left(2\right)}\left(0\right)\right)^{2}}\\ & h=\left(\frac{C_{2}}{4C_{1}}\frac{\left(\sigma_{+}^{2}\left(0\right)+\sigma_{-}^{2}\left(0\right)\right)}{f\left(0\right)N\left(m_{+}^{\left(2\right)}\left(0\right)-m_{-}^{\left(2\right)}\left(0\right)\right)^{2}}\right)^{1/5} \end{align*}
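As a sanity check of this first-order condition, the sketch below plugs in some made-up values for the constants (none of them come from the paper or from the simulations in this note) and confirms that minimizing the AMSE numerically with optimize() lands on the closed-form bandwidth.
## check the closed-form minimizer against a numerical minimization of the AMSE
## (all values below are illustrative placeholders, not estimates)
C_1_ex <- 0.05
C_2_ex <- 2.5
f_c_ex <- 0.4
N_ex <- 1000
sigma2_plus_ex <- 0.2
sigma2_minus_ex <- 0.3
m2_diff_ex <- 0.5  # m_+^(2)(c) - m_-^(2)(c)
amse_ex <- function(h) {
  C_1_ex * h^4 * m2_diff_ex^2 +
    (C_2_ex / (f_c_ex * N_ex * h)) * (sigma2_plus_ex + sigma2_minus_ex)
}
h_closed_form <- ((C_2_ex / (4 * C_1_ex)) *
                    (sigma2_plus_ex + sigma2_minus_ex) /
                    (f_c_ex * N_ex * m2_diff_ex^2))^(1/5)
h_numerical <- optimize(amse_ex, interval = c(0.001, 5))$minimum
c(closed_form = h_closed_form, numerical = h_numerical)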
Illustration
Let's calculate the optimal bandwidth for a simulation.
## we next calculate all the elements that depend on K
v_j = function(u, k, j) {k(u) * u^j}
v_j_int = function(k, j) {integrate(f = v_j, lower = 0, upper = Inf, j = j, k = k)$value}
pi_j = function(u, k, j) {k(u)^2 * u^j}
pi_j_int = function(k, j) {integrate(f = pi_j, lower = 0, upper = Inf, j = j, k = k)$value}
C_1_k <- function(k = k_triangular) {
nom <- v_j_int(k = k, j = 2) ^ 2 - v_j_int(k = k, j = 1) * v_j_int(k = k, j = 3)
denom <- v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2) - v_j_int(k = k, j = 1) ^ 2
return(1/4 *(nom / denom) ^ 2)
}
## next we calculate the first order approximation
C_2_k <- function(k = k_triangular) {
nom <- v_j_int(k = k, j = 2)^2 * pi_j_int(k = k, j = 0) - 2 * v_j_int(k = k, j = 1) * v_j_int(k = k, j = 2) * pi_j_int(k = k, j = 1) + v_j_int(k = k, j = 1)^2 * pi_j_int(k = k, j = 2)
denom <- (v_j_int(k = k, j = 0) * v_j_int(k = k, j = 2) - v_j_int(k = k, j = 1) ^ 2)^2
return(nom / denom)
}
C_k <- function(k = k_triangular) {C_2_k(k=k) / (4*C_1_k(k=k))}
## begin with a function, that given truth, finds the approximation of the optimal bandwidth
# the quantity (1-2/pi) is the variance of a standard normal truncated at zero
h_opt <- function(C, f_0, N, sigma_plus = (1-2/pi), sigma_minus = (1-2/pi), m_sq_plus, m_sq_minus) {
((C / (f_0 * N)) * ((sigma_plus + sigma_minus) / (m_sq_plus - m_sq_minus)^2))^(1/5)
}
Note that, in line with the assumption above, we use an outcome function whose second derivative differs above and below the cutoff.
y_function2 <- function(X) {
(X >= 0) * (10 + X + 0.2 * X ^ 2 - 0.3 * X ^ 3) +
(X < 0) * (10 + 0.5 * X + 0.4 * X ^ 2 - 0.1 * X ^ 3)
}
# example dgp and plot
df <- rd_dgp(N = 1000, jump = 1, y_function = y_function2)
df |>
ggplot(aes(x=X, y=Y)) +
geom_point(aes(color = as.factor(W))) +
theme_bw() + geom_vline(xintercept = 0, linetype = "dashed")And now we simulate
h_vector = seq(0.1, 3, 0.1)
full = NULL
for(h in h_vector) {
full = rbind(full,
simulation_function(num_it = num_it, N = N, jump = 1, y_function = y_function2, k = k_triangular, h = h ) |>
mutate(h = h) |>
mutate(approx_bias = bias_approx(k = k, h = h),
approx_var = variance_approx(k = k, h = h, N = N))
)
print(glue::glue("Finished h = {h}, at time {Sys.time()}."))
}
Finished h = 0.1, at time 2022-09-28 14:52:32.
Finished h = 0.2, at time 2022-09-28 14:52:33.
Finished h = 0.3, at time 2022-09-28 14:52:35.
Finished h = 0.4, at time 2022-09-28 14:52:37.
Finished h = 0.5, at time 2022-09-28 14:52:39.
Finished h = 0.6, at time 2022-09-28 14:52:41.
Finished h = 0.7, at time 2022-09-28 14:52:43.
Finished h = 0.8, at time 2022-09-28 14:52:45.
Finished h = 0.9, at time 2022-09-28 14:52:46.
Finished h = 1, at time 2022-09-28 14:52:48.
Finished h = 1.1, at time 2022-09-28 14:52:50.
Finished h = 1.2, at time 2022-09-28 14:52:52.
Finished h = 1.3, at time 2022-09-28 14:52:54.
Finished h = 1.4, at time 2022-09-28 14:52:56.
Finished h = 1.5, at time 2022-09-28 14:52:58.
Finished h = 1.6, at time 2022-09-28 14:52:59.
Finished h = 1.7, at time 2022-09-28 14:53:01.
Finished h = 1.8, at time 2022-09-28 14:53:03.
Finished h = 1.9, at time 2022-09-28 14:53:05.
Finished h = 2, at time 2022-09-28 14:53:07.
Finished h = 2.1, at time 2022-09-28 14:53:09.
Finished h = 2.2, at time 2022-09-28 14:53:11.
Finished h = 2.3, at time 2022-09-28 14:53:12.
Finished h = 2.4, at time 2022-09-28 14:53:14.
Finished h = 2.5, at time 2022-09-28 14:53:16.
Finished h = 2.6, at time 2022-09-28 14:53:18.
Finished h = 2.7, at time 2022-09-28 14:53:20.
Finished h = 2.8, at time 2022-09-28 14:53:22.
Finished h = 2.9, at time 2022-09-28 14:53:24.
Finished h = 3, at time 2022-09-28 14:53:25.
h_opt_approx = h_opt(C = C_k(k = k_triangular), f_0 = dnorm(0), N = N,
sigma_plus = (1-2/pi), sigma_minus = (1-2/pi),
# second derivatives of y_function2 at the cutoff: 0.4 from above, 0.8 from below
                     m_sq_plus = 0.4, m_sq_minus = 0.8)
full |>
mutate(bias_sq = bias^2) |>
group_by(h) |>
summarise_all(mean) |>
rename(mse = bias_sq) |>
mutate(mse_sum = bias^2 + variance) |>
mutate(mse_approx = approx_bias^2 + approx_var) |>
ggplot(aes(x = h)) +
geom_line(aes(y = mse, color = "MSE Truth")) +
geom_line(aes(y = mse_sum, color = "MSE Truth Sum")) +
geom_line(aes(y = mse_approx, color = "MSE Approx.")) +
theme_bw() +
labs(x = "Bandwidth", y = "MSE", color = NULL) +
geom_vline(xintercept = h_opt_approx)