\(\Huge \text{Problem Statement:} \\~\\ \large \text{We have been tasked by the Defense Advanced Research Projects Agency (DARPA) to analyze data from their air missile defense program. Specifically, DARPA}\\ \large\text{wants to know the accuracy of their hostile-missile intercept program. The data provided measure the distance between DARPA's missile and what they call a "hostile" object.}\\ \large\text{Essentially, we are analyzing the distance error between the two objects.} \large \text{ The report below presents the results of a model that estimates the average error given the distance}\\ \large\text{between the target drone and the missile.}\\~\\\)

\(\Huge \text{The Data at Hand}\\~\\ \large \text{The dataset provided has two variables. The first is the distance (km) between the missile and a test drone; the second is the error (m) between the test drone and its true position. The distance}\\ \large \text{is the predictor variable and the error is the response variable in this model. Figure 1 is a plot of the dataset, showing that the error increases as the distance increases.}\)
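\(\large\text{As a quick sanity check, the sketch below loads the file and summarizes both columns. It assumes the exam1.csv file and column names used in Appendix 1.2.}\)

## Quick look at the dataset (a sketch; assumes exam1.csv as in Appendix 1.2)
data = read.csv('exam1.csv')
str(data)                      # two numeric columns: Distance..km. and Error..m.
summary(data$Distance..km.)    # range of the predictor (distance, km)
summary(data$Error..m.)        # range of the response (error, m)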

\(\Huge \text{The Model and What it Tells Us} \\~\\ \large \text{We are tasked with finding point estimates for the average error when the distance is 0.1, 1, and 10 km.}\\ \large\text{Using a kernel regression model, we fit a curve to the data, using cross-validation to find an optimal bandwidth h. Figure 2 depicts our model in red with h = 0.03154297.}\)

## Plot of the fitted model (red line); assumes x, y, and h from Appendix 1.2

x1 = seq(min(x), max(x), length.out = 100)   # evaluation grid over the observed distances
y1 = rep(0, 100)
for(i in 1:100){
  # Nadaraya-Watson estimate with a Gaussian kernel at grid point x1[i]
  y1[i] = sum(exp(-.5*((x1[i]-x)/h)^2)*y)/sum(exp(-.5*((x1[i]-x)/h)^2))
}
plot(x, y, xlab = 'Distance (km) between missile and drone', ylab = 'Error (m)', main = 'Distance vs Error (Fig. 2)', ylim = c(-8,8))
lines(x1, y1, col = 2)

\(\large \text{Below are two graphs that show the confidence interval for the point estimate at a distance of 0.1 km. Figure 3 depicts the point estimate (brown) and the 95% confidence interval (blue).}\\\large\text{For the point estimate at x = 0.1 km, we get an error of -0.761 m with a 95% confidence interval of [-0.142, 0.0167]. Note that this point estimate falls outside the interval, which we discuss below.}\)

\(\large\text{Evaluating the model at distance x = 1 km, we get a point estimate of -0.1742 m and a 95% confidence interval of [-0.38401544, -0.01256813]. Like the plots above, the confidence interval is colored in blue}\\ \large\text{and the point estimate is colored brown. It's worth noting that the point estimate lies within the confidence interval for x = 1 km, which suggests the analysis at 0.1 km is weak compared to the one at 1 km.}\)

\(\Huge\text{The Good, The Bad, and The Ugly}\\~\\ \large\text{In the analysis above, we are given points between 0 and 8 km for the predictor variable. However, we are tasked with finding a point estimate and confidence interval at x = 10 km.}\\ \large\text{Unfortunately, this point lies outside our data set: the maximum of the predictor variable is 8 km. Behavior past 8 km is unknown, and there are infinitely many ways the data could continue.}\\~\\ \large\text{That brings us near the end of this analysis; we have some point estimates. The good estimate, rather, the best of the three, is at x = 1 km, where we get a useful point estimate and confidence interval for the error.}\\ \large\text{We also got a bad point estimate at x = 0.1 km, where the point estimate falls outside the confidence interval (we generally don't want this), and ultimately an ugly point estimate at x = 10 km.}\\ \large\text{The latter is ugly because of the many possibilities for answering that question with data that doesn't exist.}\\ \large\text{In sum, 1 km was the best candidate for meaningful analysis; the estimate at 0.1 km is not useful for predicting error, and the one at 10 km simply doesn't make sense given our data.}\\~\\ \Huge\text{Appendix 1.1 - Methodology}\\~\\ \large\text{This is a brief discussion of the methodology, specifically the rationale behind our kernel regression model. We begin with the problem we're trying to solve. The task assigned to us was to evaluate point}\\ \large\text{estimates and confidence intervals at different values of the predictor variable. For that, we look to regression analysis; for a non-parametric approach, we specifically use kernel regression.}\\ \large\text{Equation (1) depicts the problem we want to solve: we want to find the average error given a data point x, the distance between the test drone and the missile. Given a set of weights,}\\ \large\text{which are found using (2) and assuming a Gaussian kernel defined by (3), we ultimately arrive at our average error.}\\ \hspace{18cm} \Large (1)\hspace{1cm} \hat{E}(y|x)= \sum_{i=1}^{n}w_iy_i \\ \hspace{18cm} \Large (2)\hspace{1cm}w_i = \frac{K(x,x_i,h)}{\sum_{j=1}^{n}K(x,x_j,h)}\\ \hspace{18cm} \Large (3)\hspace{1cm}K(x,x_i,h)= \frac{1}{\sqrt{2\pi}h}e^{-\frac{1}{2}\left(\frac{x-x_i}{h}\right)^2}\\ \large\text{We see that (3) has an undefined parameter, the bandwidth h. The kernel relies on a value of h such that the sum of squared residuals is minimized. The best h for this model was obtained using}\\ \large\text{cross-validation; the details are listed in Appendix 1.2, which lists all the code used in evaluating the model. Equations (1), (2), (3), and h are all we need to compose our analysis.}\\ \large\text{Below is the code used for this analysis and details of the methods used, such as bootstrapping a confidence interval and cross-validating for our optimal value of h.}\\~\\\)
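\(\large\text{As a bridge between the equations above and the code in Appendix 1.2, below is a minimal sketch that implements (1)-(3) directly in R. Note that the constant } \frac{1}{\sqrt{2\pi}h} \text{ in (3)}\\ \large\text{cancels when forming the weights in (2), which is why the appendix code drops it and works with the exponential alone.}\)

## Minimal sketch: Eq. (1)-(3) translated directly into R.
## Assumes vectors x (distances) and y (errors) and bandwidth h are defined as in Appendix 1.2.

K = function(x0, xi, h) exp(-0.5*((x0 - xi)/h)^2) / (sqrt(2*pi)*h)   # Eq. (3), Gaussian kernel

Ehat = function(x0, x, y, h){
  w = K(x0, x, h) / sum(K(x0, x, h))   # Eq. (2), normalized weights
  sum(w * y)                           # Eq. (1), weighted average of the responses
}

## Example: Ehat(1, x, y, h) reproduces point.estimates(1) from Appendix 1.2.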

\(\Huge\text{Appendix 1.2}\\ \large\text{This is the code used for the analysis.}\)

## Read data and plot raw data 
data = read.csv('exam1.csv')
x = data$Distance..km.
y = data$Error..m.
plot(x, y, xlab = 'Distance (km) between missile and drone', ylab = 'Error (m)', main = 'Distance vs Error (Fig. 1)', ylim = c(-8,8))

## Kernel Regression and Cross-Validation, find optimal h. Use optimal h and superimpose model onto raw data plot. 

n = length(x)
## Leave-one-out cross-validation: for a candidate bandwidth h, predict each
## y[i] from all other points and accumulate the squared residuals.
CV = function(h){
  SSR = 0
  for(i in 1:n){
    y1 = sum(exp(-.5*((x[i]-x[-i])/h)^2)*y[-i])/sum(exp(-.5*((x[i]-x[-i])/h)^2))
    resid = y[i] - y1
    SSR = SSR + resid^2
  }
  SSR
}

## Minimize the CV criterion; optim() warns that Nelder-Mead is unreliable in
## one dimension, but it converges here to h = 0.03154297.
h = optim(1, CV)$par
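## Optional sanity check (a sketch, not part of the original analysis): plot the
## CV criterion over a grid of bandwidths to confirm the minimum sits near the
## value optim() returned; optimize() is a 1-D alternative to optim().
h.grid = seq(0.01, 0.2, length.out = 40)
plot(h.grid, sapply(h.grid, CV), type = 'l', xlab = 'h', ylab = 'CV SSR')
abline(v = h, lty = 2)
# h = optimize(CV, interval = c(0.001, 1))$minimum   # alternative 1-D search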


## Code below looks at residual diagnostics to check for constant variance.

x1 = x
y1 = rep(0, n)
for(i in 1:n){
  # fitted value at each observed distance
  y1[i] = sum(exp(-.5*((x1[i]-x)/h)^2)*y)/sum(exp(-.5*((x1[i]-x)/h)^2))
}
yhat = y1
res = y - y1
plot(yhat, res, xlab = 'Fitted values', ylab = 'Residuals')
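## Extra diagnostic (a sketch): residuals against the predictor, with a zero
## reference line, to look for trends or non-constant spread.
plot(x, res, xlab = 'Distance (km)', ylab = 'Residual (m)')
abline(h = 0, lty = 2)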

## Everything looks good and we can begin bootstrapping values for the confidence interval. 

## X is a distance at which to estimate; h was determined by cross-validation via optim(1, CV).
## I could have combined these into one function, but for simplicity I'll keep them separate.

point.estimates = function(X){
  x1 = X
  y1 = rep(0, length(x1))
  h = 0.03154297   # optimal bandwidth from cross-validation above
  for(i in 1:length(x1)){
    # kernel-regression estimate of the average error at distance x1[i]
    y1[i] = sum(exp(-.5*((x1[i]-x)/h)^2)*y)/sum(exp(-.5*((x1[i]-x)/h)^2))
  }
  return(y1)
}

## Confidence Interval function for a given point X. 

C.I = function(X){
  h = 0.03154297          # optimal bandwidth from cross-validation above
  n = length(x)           # bootstrap sample size (1400 in this data set)
  BS.cond.avg.y = rep(0, 1000)
  for(k in 1:1000){
    # Resample predictor values with replacement
    BS.x = sample(x, n, replace = TRUE)

    # Fitted values at the resampled distances, from the original fit
    x1 = BS.x
    y1 = rep(0, length(x1))
    for(i in 1:length(x1)){
      y1[i] = sum(exp(-.5*((x1[i]-x)/h)^2)*y)/sum(exp(-.5*((x1[i]-x)/h)^2))
    }
    BS.yhat = y1

    # Add a resampled residual to BS.yhat to turn it into a bootstrap response BS.y
    BS.y = BS.yhat + sample(res, n, replace = TRUE)

    # Refit the kernel regression on the bootstrap sample and evaluate at X
    x2 = X
    y2 = rep(0, length(x2))
    for(i in 1:length(x2)){
      y2[i] = sum(exp(-.5*((x2[i]-BS.x)/h)^2)*BS.y)/sum(exp(-.5*((x2[i]-BS.x)/h)^2))
    }
    BS.cond.avg.y[k] = y2
  }
  # 2.5th and 97.5th percentiles of the bootstrap distribution
  low.bound = sort(BS.cond.avg.y)[25]
  high.bound = sort(BS.cond.avg.y)[975]
  return(c(low.bound, high.bound))
}

## Plot of the fitted model (red line)

x1 = seq(min(x), max(x), length.out = 100)   # evaluation grid over the observed distances
y1 = rep(0, 100)
for(i in 1:100){
  y1[i] = sum(exp(-.5*((x1[i]-x)/h)^2)*y)/sum(exp(-.5*((x1[i]-x)/h)^2))
}
plot(x, y, xlab = 'Distance (km) between missile and drone', ylab = 'Error (m)', main = 'Distance vs Error (Fig. 2)', ylim = c(-8,8))
lines(x1, y1, col = 2)


## Raw data plot next to a zoomed-in view around X = 0.1.
## Confidence interval and point estimate for X = 0.1
point.estimates(X = 0.1)
C.I(0.1)

par(mfrow = c(1,2))
plot(x, y, xlab = 'Distance (km) between missile and drone', ylab = 'Error (m)', main = 'Distance vs Error (Fig. 3)', ylim = c(-8,8))
lines(x = c(0.1, 0.1), y = c(-0.14557, 0.02232), col = 'blue', lwd = 5)
points(x = 0.1, y = -0.761, col = 'brown', lwd = 3, pch = 20)
## Zoom in
plot(x, y, xlab = 'Distance (km) between missile and drone', ylab = 'Error (m)', main = 'Distance vs Error (Fig 3.1)', xlim = c(0,0.2), ylim = c(-1,1))
lines(x = c(0.1, 0.1), y = c(-0.14557, 0.02232), col = 'blue', lwd = 3)
points(x = 0.1, y = -0.761, col = 'brown', lwd = 3, pch = 19)


## Raw data plot next to a zoomed-in view around X = 1.
## Confidence interval and point estimate for X = 1.

point.estimates(1)
C.I(1)

par(mfrow = c(1,2))
plot(x, y, xlab = 'Distance (km) between missile and drone', ylab = 'Error (m)', main = 'Distance vs Error (Fig. 4)', ylim = c(-8,8))
lines(x = c(1, 1), y = c(-0.3840, -0.01256), col = 'blue', lwd = 5)
points(x = 1, y = -0.1742581, col = 'brown', lwd = 3, pch = 20)
## Zoom in
plot(x, y, xlab = 'Distance (km) between missile and drone', ylab = 'Error (m)', main = 'Distance vs Error (Fig 4.1)', xlim = c(1,2), ylim = c(-1,1))
lines(x = c(1, 1), y = c(-0.3840, -0.01256), col = 'blue', lwd = 3)
points(x = 1, y = -0.1742581, col = 'brown', lwd = 3, pch = 19)