We’ll use R’s built-in swiss dataset to model relationships.
It contains a standardized fertility measure and socio-economic
indicators for 47 French-speaking provinces of Switzerland at about 1888.
summary(swiss)
##    Fertility      Agriculture     Examination      Education
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80
##  1st Qu.:  5.195   1st Qu.:18.15
##  Median : 15.140   Median :20.00
##  Mean   : 41.144   Mean   :19.94
##  3rd Qu.: 93.125   3rd Qu.:21.70
##  Max.   :100.000   Max.   :26.60
First, let’s plot the pairwise relationships between the variables in the dataset. This will help us identify any relationships we may want to model with non-parametric regression.
plot(swiss)
We see an interesting relationship between the Catholic percentage of a province’s population and the province’s fertility rate. There are more fertility measurements for provinces with either a high or a low Catholic percentage, and few in the middle. This resembles some of the examples in chapter 11 of Extending the Linear Model with R.
library(ggplot2)  # provides ggplot(), aes(), and geom_point()
ggplot(swiss, aes(x = Catholic, y = Fertility)) + geom_point()
We can use the ksmooth function from the
stats package in R to perform
Nadaraya-Watson kernel estimation. We try several bandwidth
values (\(\lambda\)) in order to see
which one provides the best fit.
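For reference, the Nadaraya-Watson estimator in its usual textbook form is a locally weighted average of the responses. The notation here is the standard one rather than anything taken from the ksmooth documentation: \(K\) is the kernel (a Gaussian when we pass "normal"), and \((x_i, y_i)\) are the observed predictor and response values.

\[
\hat{f}_\lambda(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{\lambda}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{\lambda}\right)}
\]

A small \(\lambda\) puts nearly all of the weight on the closest observations, while a large \(\lambda\) averages over most of the data. Note that ksmooth scales its kernels so that their quartiles sit at \(\pm 0.25\) times the bandwidth argument, so that argument matches \(\lambda\) only up to a constant rescaling.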
library(glue)  # for string interpolation in the plot titles

bandwidths <- c(1, 5, 10, 15, 20, 25, 30, 40, 50)
# Loop over the bandwidths and plot each fit so we can compare them visually
for (b in bandwidths) {
  smoothed <- ksmooth(swiss$Catholic, swiss$Fertility, kernel = "normal", bandwidth = b)
  plot(Fertility ~ Catholic, swiss, main = glue("bandwidth = {b}"))
  lines(smoothed)
}
The low bandwidth values overfit the data and do not give a smooth estimate. At the other end, the higher bandwidth values (40, 50) smooth the estimate too much. We are aiming for the Goldilocks zone when selecting \(\lambda\), and the fit that looks most likely to generalize is at about \(\lambda = 20\).
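To make the role of the bandwidth more concrete, here is a minimal sketch of the same kind of smoother written by hand. It is only an illustration, not how ksmooth is implemented: nw_estimate, kernel_sd, and grid are hypothetical names introduced here, and the conversion from bandwidth to standard deviation assumes the quartile scaling described in the ksmooth documentation.

```r
# Illustrative sketch only: a hand-rolled Nadaraya-Watson smoother.
# nw_estimate, kernel_sd, and grid are hypothetical names, not part of any package.
nw_estimate <- function(x, y, x0, kernel_sd) {
  sapply(x0, function(pt) {
    w <- dnorm(x - pt, sd = kernel_sd)  # Gaussian weight for every observation
    sum(w * y) / sum(w)                 # weighted average of the responses
  })
}

# ksmooth() documents that its kernels have quartiles at +/- 0.25 * bandwidth;
# for a Gaussian kernel that corresponds to this standard deviation (assumption).
bandwidth <- 20
kernel_sd <- bandwidth * 0.25 / qnorm(0.75)

# Evaluate the smoother on an evenly spaced grid over the observed range
grid <- seq(min(swiss$Catholic), max(swiss$Catholic), length.out = 100)
fit <- nw_estimate(swiss$Catholic, swiss$Fertility, grid, kernel_sd)

plot(Fertility ~ Catholic, swiss, main = "Hand-rolled Nadaraya-Watson, bandwidth = 20")
lines(grid, fit)
```

Changing bandwidth in this sketch should show the same trade-off as the plots above. The curve may differ slightly from the ksmooth output, since ksmooth can differ in details such as its evaluation grid or how it treats the tails of the kernel.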
Overall, non-parametric regression can be useful when there isn’t much prior knowledge about the underlying relationship between covariates. However, it should be used with care, as overfitting and poor generalization can be concerns.