Local kernel regression, also known as kernel smoothing or local
regression, is a non-parametric regression technique used to estimate
the underlying relationship between variables in a dataset. It is
particularly useful when that relationship is complex and cannot be
adequately captured by simple linear models.
In local kernel regression, instead of fitting a single global function to the entire dataset, a separate local fit is computed wherever a prediction is needed. At each such point, the estimate is a weighted average of nearby observations, with weights determined by a kernel function that assigns higher weights to points close to the point of interest and lower weights to points farther away, so that nearby observations have more influence on the prediction.
The bandwidth parameter controls the size of the local neighborhood used in the regression: a small bandwidth produces a wiggly, highly local fit, while a large bandwidth averages over more points and yields a smoother estimate.
Local kernel regression is a flexible and powerful method for modeling complex relationships in data, but it can be computationally intensive, especially for large datasets. Additionally, the choice of kernel function and bandwidth can have a significant impact on the results, so careful selection and tuning of these parameters are important.
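To make the weighted-average idea concrete, here is a minimal hand-rolled Nadaraya-Watson estimator with a Gaussian kernel in base R. This is only an illustrative sketch: the helper name nw, the simulated data, and the bandwidth value are illustrative choices, and the gplm function kreg() used below does the actual estimation.
# Hand-rolled Nadaraya-Watson estimate at a single point x0 (illustrative sketch only)
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)      # kernel weights: large for observations near x0
  sum(w * y) / sum(w)           # locally weighted average of the responses
}
set.seed(1)
x <- runif(200, -3, 3)
y <- sin(x) + rnorm(200, sd = 0.3)
xg <- seq(-3, 3, length.out = 100)                 # grid of prediction points
yhat <- sapply(xg, nw, x = x, y = y, h = 0.3)      # h is the bandwidth
plot(x, y, col = "gray"); lines(xg, yhat, lwd = 2) # estimated curve over the scatter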
Non-parametric regression is a statistical technique used to model the relationship between variables without assuming a specific functional form for the relationship. Unlike parametric regression models, which assume a predetermined form (such as linear, quadratic, or exponential), non-parametric regression models estimate the relationship directly from the data, making them more flexible and able to capture complex patterns.
There are several methods for non-parametric regression, including:
Kernel Smoothing (Local Regression): This method, described above, estimates the local relationship between variables using weighted averages of nearby data points, with weights determined by a kernel function.
Splines: Splines are piecewise polynomials that are connected at specific points, called knots. They provide a flexible way to model nonlinear relationships by fitting separate polynomial functions to different segments of the data.
Smoothing Splines: Similar to splines, smoothing splines use piecewise polynomials but also penalize the roughness of the fitted curve, resulting in a smoother estimate of the underlying relationship.
Local Polynomial Regression: This technique fits separate polynomial functions to local neighborhoods of the data, typically using weighted least squares to give more importance to nearby observations.
Generalized Additive Models (GAMs): GAMs extend traditional linear models by allowing for nonlinear relationships between predictors and the response variable through smooth functions, often implemented using spline-based approaches.
Non-parametric regression techniques are particularly useful when the relationship between variables is complex, irregular, or unknown, as they do not impose strict assumptions about the functional form of the relationship. However, they typically require more data and computation than parametric models, and they can be sensitive to the choice of tuning parameters, such as the bandwidth in kernel smoothing or the number of knots in spline-based methods. A brief sketch of several of these methods in R follows.
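As a rough illustration, the sketch below fits a smoothing spline, a local polynomial regression (loess), and a spline-based GAM to the same simulated data. It is a minimal comparison, separate from the gplm examples that follow, and it assumes the mgcv package is available for the GAM; the simulated data and smoothing settings are arbitrary choices.
# Compare a few non-parametric fits on the same data (illustrative sketch)
set.seed(2)
x <- runif(300, -3, 3)
y <- sin(2 * x) + rnorm(300, sd = 0.4)
fit.ss    <- smooth.spline(x, y)          # smoothing spline (penalized roughness)
fit.loess <- loess(y ~ x, span = 0.3)     # local polynomial regression
library(mgcv)
fit.gam   <- gam(y ~ s(x))                # GAM with a spline smooth of x
xg <- seq(-3, 3, length.out = 200)
plot(x, y, col = "gray")
lines(predict(fit.ss, xg), col = "blue", lwd = 2)
lines(xg, predict(fit.loess, newdata = data.frame(x = xg)), col = "red", lwd = 2)
lines(xg, predict(fit.gam, newdata = data.frame(x = xg)), col = "darkgreen", lwd = 2)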
# install.packages('gplm')
library('gplm')   # provides kreg() for kernel regression
## Loading required package: AER
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
n <- 1000
x <- rnorm(n)                        # univariate predictor
m <- sin(x)                          # true regression function
y <- m + rnorm(n)                    # noisy observations
plot(x, y, col = "gray")
o <- order(x); lines(x[o], m[o], col = "green")   # true curve
lines(kreg(x, y), lwd = 2)           # kernel regression estimate
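The effect of the bandwidth can be seen by overlaying fits with different bandwidths on the plot above. This is only a sketch, assuming kreg accepts a bandwidth argument as the commented-out call in the next example suggests; the values 0.1 and 2 are arbitrary.
# Illustrative only: under- and over-smoothed fits (bandwidth argument assumed)
lines(kreg(x, y, bandwidth = 0.1), col = "red", lwd = 2)    # small bandwidth: wiggly fit
lines(kreg(x, y, bandwidth = 2), col = "blue", lwd = 2)     # large bandwidth: very smooth fit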
n <- 100
x <- 6 * cbind(runif(n), runif(n)) - 3      # two uniform predictors on [-3, 3]
m <- function(x1, x2) { 4 * sin(x1) + x2 }  # true regression surface
y <- m(x[, 1], x[, 2]) + rnorm(n)           # noisy observations
mh <- kreg(x, y)  ## alternatively: kreg(x, y, bandwidth = 1)
grid1 <- unique(mh$x[, 1])                  # grid values in each dimension
grid2 <- unique(mh$x[, 2])
est.m <- t(matrix(mh$y, length(grid1), length(grid2)))   # fitted values arranged on the grid
orig.m <- outer(grid1, grid2, m)            # true surface on the same grid
par(mfrow = c(1, 2))
persp(grid1, grid2, orig.m, main = "Original Function",
      theta = 30, phi = 30, expand = 0.5, col = "lightblue", shade = 0.5)
persp(grid1, grid2, est.m, main = "Estimated Function",
      theta = 30, phi = 30, expand = 0.5, col = "lightblue", shade = 0.5)
par(mfrow = c(1, 1))
Now with normally distributed x; note the boundary problem, which can be somewhat reduced by a Gaussian kernel:
n <- 1000
x <- cbind(rnorm(n), rnorm(n))              # two standard normal predictors: data are sparse in the tails
m <- function(x1, x2) { 4 * sin(x1) + x2 }  # same true regression surface
y <- m(x[, 1], x[, 2]) + rnorm(n)
mh <- kreg(x, y)  ## ,p="gaussian")
grid1 <- unique(mh$x[, 1])
grid2 <- unique(mh$x[, 2])
est.m <- t(matrix(mh$y, length(grid1), length(grid2)))   # fitted values arranged on the grid
orig.m <- outer(grid1, grid2, m)
par(mfrow = c(1, 2))
persp(grid1, grid2, orig.m, main = "Original Function",
      theta = 30, phi = 30, expand = 0.5, col = "lightblue", shade = 0.5)
persp(grid1, grid2, est.m, main = "Estimated Function",
      theta = 30, phi = 30, expand = 0.5, col = "lightblue", shade = 0.5)
par(mfrow = c(1, 1))