Problem/Task Description

The Defense Advanced Research Projects Agency (DARPA) has provided us with data from its air-to-air missile defense program.

The data provided include two variables. The first is the actual distance (km) between the missile and the test drone. The second is the error (m) between the missile's estimated position of the drone and the drone's true position. See Figure 1 below.

Figure 1: Error (m) in the drone's estimated position versus Distance (km) between missile and drone.

We have been tasked with building a model that estimates the average error in drone position at a fixed distance between the drone and missile. In other words, given a fixed distance \(x^{*}\), we wish to estimate \(E(y|x^{*})\). Thus, we are constructing a model that outputs an estimate \(\hat{E}(y|x^{*})\) of the average error between the true and estimated positions at a fixed distance \(x^{*}\). In particular, DARPA wishes us to calculate the average error when the distance between missile and drone is 0.1 km, 1 km, and 10 km, respectively.

Our Methodology: Kernel Regression

We will use Kernel Regression to estimate the functional relationship between Distance (km) and Error in Position Estimation (m).

In short, we wish to write \(\hat{E}(y|x^{*})\) as a weighted sum of the observed response values \(y_i\). Kernel Regression helps us find the weights \(w_i\), which are themselves functions of the predictor data \(x_i\) and the fixed distance (km) value \(x^{*}\). In mathematical notation, \[ \hat{E}(y|x^{*}) = \sum_{i=1}^n w_i y_i \] See Appendix A for more details on what these weights \(w_i\) are and how we find the optimal ones to fit our data.

Using this methodology we arrive at the following model. See Figure 2 for a graphical representation of the estimated relationship between Distance (km) and Error in Estimation (m). Our fitted estimates are shown in red.
Figure 2: Kernel regression fit (red) of Error (m) versus Distance (km).

Estimate of the Average Error when the Distance is Fixed at 0.1 km

We will provide a point estimate and a 95% confidence interval for the average error (m) when the distance is fixed at 0.1 km.

The point estimate is \(-0.07611554\) and the 95% confidence interval is \((-0.1564094, 0.02416303)\). We estimate the average error (m) to lie in this range when the distance is fixed at 0.1 km. See Figure 3 below for a graphical representation of our confidence interval; note that the right-hand panel is zoomed in around the fixed distance of 0.1 km. The green point is our point estimate, and the red and blue points are the lower and upper bounds of the 95% confidence interval.

Figure 3: Point estimate (green) and 95% confidence interval bounds (red, blue) at a fixed distance of 0.1 km; the right panel is zoomed in around 0.1 km.

Estimate of the Average Error when the Distance is Fixed at 1 km

We will provide a point estimate and a 95% confidence interval for the average error (m) when the distance is fixed at 1 km.

The point estimate is \(-0.1742748\) and the 95% confidence interval is \((-0.4067928, 0.02019911)\). We estimate the average error (m) to lie in this range when the distance is fixed at 1 km. See Figure 4 below for a graphical representation of our confidence interval; note that the right-hand panel is zoomed in around the fixed distance of 1 km. The green point is our point estimate, and the red and blue points are the lower and upper bounds of the 95% confidence interval.
Figure 4: Point estimate (green) and 95% confidence interval bounds (red, blue) at a fixed distance of 1 km; the right panel is zoomed in around 1 km.

Estimate of the Average Error when the Distance is Fixed at 10 km

Unfortunately, we cannot provide estimates of the average error when the distance is fixed at 10 km. Since the largest recorded distance in the data set provided is 8 km, we are unable to provide estimates of the error (m) beyond a fixed distance of 8 km. There are no data for our model to draw on outside the observed range of distances, from 1.25e-07 km to 8 km. Put simply, we do not know what relationship may exist outside this range of distance values.

While we have graphically displayed three possible scenarios (see the scenario plots in Appendix B), there are infinitely many ways the relationship between Distance (km) and Error (m) could behave beyond our observed data values. For this reason we will not give an estimate of the average error for a fixed distance of 10 km. If such an estimate is required, we recommend collecting additional empirical data at those distances.

Appendix A: Further Analysis of Kernel Regression

As stated earlier, we wish to write \(\hat{E}(y|x^{*})\) as a weighted sum of the observed response values \(y_i\). Kernel Regression helps us find the weights \(w_i\), which are themselves functions of the predictor data \(x_i\) and the fixed distance (km) value \(x^{*}\). In mathematical notation, \[ \hat{E}(y|x^{*}) = \sum_{i=1}^n w_i y_i \]

In our model we used a Normal (Gaussian) density as our kernel function \(K\). In other words, \(K\) is defined by the following: \[ K(x, x_i, h) = \frac{1}{\sqrt{2\pi}\,h}\exp\!\left(\frac{-(x_i-x)^2}{2h^2}\right) \]

Our weights are then defined by the formula: \[ w_i = \frac{K(x^{*}, x_i, h)}{\sum_{j=1}^n K(x^{*}, x_j, h)} \]
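To make the weight formula concrete, here is a minimal sketch with made-up toy values (the names K, x.toy, y.toy, and h.toy are illustrative only and are not part of our analysis). Note that the constant \(\frac{1}{\sqrt{2\pi}h}\) appears in both the numerator and denominator of \(w_i\), so it cancels; this is why the code in Appendix B omits it.

# A minimal sketch with toy values (not our data): evaluate the Gaussian kernel
# at a fixed x*, normalize to get the weights w_i, and form the estimate.
K = function(x.star, xi, h) (1/(sqrt(2 * pi) * h)) * exp(-(xi - x.star)^2/(2 * h^2))

x.toy = c(0.5, 1.0, 1.5, 2.0)    # toy predictor values (km)
y.toy = c(0.2, -0.1, 0.4, 0.0)   # toy responses (m)
x.star = 1.2
h.toy = 0.5

w = K(x.star, x.toy, h.toy)/sum(K(x.star, x.toy, h.toy))
sum(w)          # the weights always sum to 1
sum(w * y.toy)  # the kernel regression estimate of E(y | x*)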

Our choice of the bandwidth \(h\) dictates how smooth the fitted model is. We seek the \(h\) that minimizes the leave-one-out sum of squared residuals between the estimates \(\hat{y}_i\), each computed with observation \(i\) held out, and the observed \(y_i\); without the hold-out, the criterion would be driven toward very small \(h\) that simply interpolates the data.
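As a rough sketch of what this bandwidth selection looks like (assuming the function SSR.KR and the data frame datan defined in Appendix B), one could evaluate the leave-one-out criterion over a grid of candidate bandwidths and take the minimizer; in our analysis we instead let optim() carry out this one-dimensional search, arriving at h = 0.03155492.

# Sketch: the leave-one-out SSR criterion evaluated over a grid of bandwidths
# (assumes SSR.KR and datan from Appendix B are already defined).
h.grid = seq(0.02, 1, length.out = 100)
ssr.grid = sapply(h.grid, SSR.KR, data = datan)
plot(h.grid, ssr.grid, type = "l", xlab = "Bandwidth h", ylab = "Leave-one-out SSR")
h.grid[which.min(ssr.grid)]  # approximate minimizer; optim() refined this to h = 0.03155492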

See the graphs in Figure 5, which show how the choice of \(h\) changes how well we are able to fit the relationship between Distance (km) and Error (m).
Figure 5: Kernel regression fits (red) for bandwidths h = 5, h = 1, h = 0.5, and the optimal h = 0.03155492.

As we can see, as \(h\) is fine-tuned to minimize the leave-one-out sum of squared residuals, the red curve follows the relationship between Distance (km) and Error (m) more and more closely.

Appendix B: Code Used

Below you will find all code used in this consulting engagement, including the bootstrap method used to estimate a 95% confidence interval for each of our point estimates.

# Read in the data: x = Distance (km), y = Error (m)
data = read.csv("exam1.csv")
x = data$Distance..km.
y = data$Error..m.
datan = data.frame(x = x, y = y)
plot(x, y, xlab = "Distance (km) between missile and drone", ylab = "Error (m) [True Position - Estimated Position]")

#--------------------------
# Leave-one-out sum of squared residuals for a given bandwidth h:
# the prediction at x1[i] is computed with observation i held out.
SSR.KR = function(h, data) {
    x1 = data$x
    y = data$y
    y1 = rep(0, length(x1))
    for (i in 1:length(x1)) {
        K = (1/sqrt(2 * pi)) * exp(-0.5 * ((x1[i] - x1[-i])/h)^2)
        y1[i] = sum(K * y[-i])/sum(K)
    }
    res = y - y1
    sum(res^2)
}

model = optim(2, SSR.KR, data = datan, method = "Brent", lower = 0, upper = 10)  # optimal bandwidth: 0.03155492
h = model$par

# x1 is a mesh of x-values at which we want predictions;
# y1 holds the kernel regression predictions at these x1 values
x1 = seq(min(x), max(x), length.out = 1000)
y1 = rep(0, 1000)
for (i in 1:1000) {
    y1[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x1[i] - x)/h)^2) * y)/sum((1/(sqrt(2 * 
        pi))) * exp(-0.5 * ((x1[i] - x)/h)^2))
}
# plot of original data x and y and the function estimation
plot(x, y, xlab = "Distance (km) between missile and drone", ylab = "Error (m) [True Position - Estimated Position]")
lines(x1, y1, col = 2)

#-------------------------
x.star = 0.1
y.star = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x.star - x)/h)^2) * y)/sum((1/(sqrt(2 * 
    pi))) * exp(-0.5 * ((x.star - x)/h)^2))
# y.star = -0.0761155422994658

# Residual bootstrap for the confidence intervals.
# Step 1: compute the original leave-one-out residuals. Here x1 is the observed
# data and y1 holds the leave-one-out predictions at those x-values.
x1 = x
y1 = rep(0, length(x1))
orig.res = rep(0, length(x1))
for (i in 1:length(x1)) {
    y1[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x1[i] - x[-i])/h)^2) * y[-i])/sum((1/(sqrt(2 * 
        pi))) * exp(-0.5 * ((x1[i] - x[-i])/h)^2))
}
orig.res = y - y1

#------------------------------------
#------------------------------------
#------------------------------------
numSim = 1000
x.star = 0.1
y.star = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x.star - x)/h)^2) * y)/sum((1/(sqrt(2 * 
    pi))) * exp(-0.5 * ((x.star - x)/h)^2))
# y.star  # -0.07611554
yhat.star.vec = rep(-100, numSim)  # 1000 bootstrap replicates
for (k in 1:numSim) {
    
    # Step 2: Resample the predictors from the original sample (same size, with replacement).
    bs.x = sample(x, length(x), replace = TRUE)
    
    bs.y.hat = rep(0, length(bs.x))
    bs.y = rep(0, length(bs.x))
    
    # Step 3: Plug bs.x into original model to get bs.y.hat.
    for (i in 1:length(bs.x)) {
        bs.y.hat[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((bs.x[i] - x)/h)^2) * 
            y)/sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((bs.x[i] - x)/h)^2))
        
    }
    
    # Step 4: Add a resampled residual to get bs.y. We now have (bs.x, bs.y) for
    # this bootstrap iteration.
    bs.y = bs.y.hat + sample(orig.res, length(bs.x), replace = TRUE)
    
    # Step 5 (not used): one could refit the model on each bootstrapped data set, e.g.
    # bs.data = data.frame(x = bs.x, y = bs.y); bs.model = optim(1, SSR.KR.bs, data = bs.data);
    # bs.h = bs.model$par, and then use bs.h in place of h in Step 6 below.
    
    # Step 6: Plug x.star into the model evaluated on the bootstrapped data. This estimates E(y|x*).
    yhat.star.vec[k] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x.star - bs.x)/h)^2) * 
        bs.y)/sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x.star - bs.x)/h)^2))
    
    # No Step 7: no residual is added here because we are estimating the average
    # response E(y|x*), not an individual predicted response.
}
lower = quantile(yhat.star.vec, 0.025, na.rm = TRUE)  #-0.1564094
upper = quantile(yhat.star.vec, 0.975, na.rm = TRUE)  #0.02416303

par(mfrow = c(1, 2))
plot(x, y, xlab = "Distance (km)", ylab = "Error (m)")
points(x.star, y.star, pch = 22, col = "green")
points(x.star, lower, pch = 22, col = "red")
points(x.star, upper, pch = 22, col = "blue")

x.sub = x[x > 0 & x < 0.2]
y.sub = y[x > 0 & x < 0.2]
plot(x.sub, y.sub, xlab = "Distance (km)", ylab = "Error (m)", main = "Zoom around Distance 0.1km")
points(x.star, y.star, pch = 22, col = "green")
points(x.star, lower, pch = 22, col = "red")
points(x.star, upper, pch = 22, col = "blue")
#------------------------------------
#------------------------------------
#------------------------------------
numSim = 1000
x.star = 1
y.star = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x.star - x)/h)^2) * y)/sum((1/(sqrt(2 * 
    pi))) * exp(-0.5 * ((x.star - x)/h)^2))
# y.star #-0.1742581
yhat.star.vec = rep(-100, numSim)  # 1000 bootstrap replicates
for (k in 1:numSim) {
    
    # Step 2: Resample the predictors from the original sample (same size, with replacement).
    bs.x = sample(x, length(x), replace = TRUE)
    
    bs.y.hat = rep(0, length(bs.x))
    bs.y = rep(0, length(bs.x))
    
    # Step 3: Plug bs.x into original model to get bs.y.hat.
    for (i in 1:length(bs.x)) {
        bs.y.hat[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((bs.x[i] - x)/h)^2) * 
            y)/sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((bs.x[i] - x)/h)^2))
        
    }
    
    # Step 4: Add a resampled residual to get bs.y. We now have (bs.x, bs.y) for
    # this bootstrap iteration.
    bs.y = bs.y.hat + sample(orig.res, length(bs.x), replace = TRUE)
    
    # Step 5 (not used): one could refit the model on each bootstrapped data set, e.g.
    # bs.data = data.frame(x = bs.x, y = bs.y); bs.model = optim(1, SSR.KR.bs, data = bs.data);
    # bs.h = bs.model$par, and then use bs.h in place of h in Step 6 below.
    
    # Step 6: Plug x.star into the model evaluated on the bootstrapped data. This estimates E(y|x*).
    yhat.star.vec[k] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x.star - bs.x)/h)^2) * 
        bs.y)/sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x.star - bs.x)/h)^2))
    
    # No Step 7: no residual is added here because we are estimating the average
    # response E(y|x*), not an individual predicted response.
}
lower = quantile(yhat.star.vec, 0.025, na.rm = TRUE)  #-0.4067928
upper = quantile(yhat.star.vec, 0.975, na.rm = TRUE)  #0.02019911

par(mfrow = c(1, 2))
plot(x, y, xlab = "Distance (km)", ylab = "Error (m)")
points(x.star, y.star, pch = 22, col = "green")
points(x.star, lower, pch = 22, col = "red")
points(x.star, upper, pch = 22, col = "blue")

x.sub = x[x > 0.8 & x < 1.2]
y.sub = y[x > 0.8 & x < 1.2]
plot(x.sub, y.sub, xlab = "Distance (km)", ylab = "Error (m)", main = "Zoom around Distance 1km")
points(x.star, y.star, pch = 22, col = "green")
points(x.star, lower, pch = 22, col = "red")
points(x.star, upper, pch = 22, col = "blue")
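
#------------------------------------
# Optional refactor (sketch only; the results above come from the code above,
# not from this helper). The two bootstrap blocks differ only in x.star, so the
# same residual bootstrap could be wrapped in a function such as the hypothetical
# boot.ci.kr() below (assumes x, y, h, and orig.res are defined as above).
# The kernel's 1/sqrt(2*pi) constant cancels in each ratio and is omitted here.
boot.ci.kr = function(x.star, numSim = 1000) {
    yhat.star.vec = rep(NA, numSim)
    for (k in 1:numSim) {
        bs.x = sample(x, length(x), replace = TRUE)
        bs.y.hat = sapply(bs.x, function(xi) {
            K = exp(-0.5 * ((xi - x)/h)^2)
            sum(K * y)/sum(K)
        })
        bs.y = bs.y.hat + sample(orig.res, length(bs.x), replace = TRUE)
        K.star = exp(-0.5 * ((x.star - bs.x)/h)^2)
        yhat.star.vec[k] = sum(K.star * bs.y)/sum(K.star)
    }
    quantile(yhat.star.vec, c(0.025, 0.975), na.rm = TRUE)
}
# Example use: boot.ci.kr(0.1) and boot.ci.kr(1) reproduce the intervals above
# up to Monte Carlo error.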

#------------------------------------------
# Three hypothetical scenarios for how the relationship might continue beyond 8 km
par(mfrow = c(2, 2))
plot(x, y, xlab = "Distance (km)", ylab = "Error (m)")

x.new = c(x, seq(8, 10, length.out = 50))
y.new = c(y, seq(7, -10, length.out = 50))
plot(x.new, y.new, xlab = "Distance (km)", ylab = "Error (m)", main = "Scenario 1")

y.new2 = c(y, seq(7, 0, length.out = 25), rep(0, 25))
plot(x.new, y.new2, xlab = "Distance (km)", ylab = "Error (m)", main = "Scenario 2")

y.new3 = c(y, rep(7, 25), seq(7, -2, length.out = 25))
plot(x.new, y.new3, xlab = "Distance (km)", ylab = "Error (m)", main = "Scenario 3")



# Appendix A (Figure 5): kernel regression fits for several bandwidth choices
par(mfrow = c(2, 2))

h = 5
x1 = seq(min(x), max(x), length.out = 1000)
y1 = rep(0, 1000)
for (i in 1:1000) {
    y1[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x1[i] - x)/h)^2) * y)/sum((1/(sqrt(2 * 
        pi))) * exp(-0.5 * ((x1[i] - x)/h)^2))
}
# plot of original data x and y and the function estimation
plot(x, y, xlab = "Distance (km)", ylab = "Error (m) ", main = "h=5")
lines(x1, y1, col = 2)

# ==============================
h = 1
x1 = seq(min(x), max(x), length.out = 1000)
y1 = rep(0, 1000)
for (i in 1:1000) {
    y1[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x1[i] - x)/h)^2) * y)/sum((1/(sqrt(2 * 
        pi))) * exp(-0.5 * ((x1[i] - x)/h)^2))
}
# plot of original data x and y and the function estimation
plot(x, y, xlab = "Distance (km)", ylab = "Error (m) ", main = "h=1")
lines(x1, y1, col = 2)

# ==============================
h = 0.5
x1 = seq(min(x), max(x), length.out = 1000)
y1 = rep(0, 1000)
for (i in 1:1000) {
    y1[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x1[i] - x)/h)^2) * y)/sum((1/(sqrt(2 * 
        pi))) * exp(-0.5 * ((x1[i] - x)/h)^2))
}
# plot of original data x and y and the function estimation
plot(x, y, xlab = "Distance (km)", ylab = "Error (m) ", main = "h=0.5")
lines(x1, y1, col = 2)
# ==============================
h = model$par
x1 = seq(min(x), max(x), length.out = 1000)
y1 = rep(0, 1000)
for (i in 1:1000) {
    y1[i] = sum((1/(sqrt(2 * pi))) * exp(-0.5 * ((x1[i] - x)/h)^2) * y)/sum((1/(sqrt(2 * 
        pi))) * exp(-0.5 * ((x1[i] - x)/h)^2))
}
# plot of original data x and y and the function estimation
plot(x, y, xlab = "Distance (km)", ylab = "Error (m) ", main = "h=0.03155492 Optimal")
lines(x1, y1, col = 2)