Blog 5: Non-parametric Regression

We’ll use R’s built-in swiss dataset to model relationships between variables. It contains a standardized fertility measure and socio-economic indicators for 47 French-speaking provinces of Switzerland around 1888.

summary(swiss)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

First, let’s plot the pairwise relationships between the variables in the dataset. This will help us identify any relationships we may want to model with non-parametric regression.

plot(swiss)

We see an interesting relationship between the Catholic percentage of a province’s population and its fertility rate. Most provinces have either a very high or a very low Catholic percentage, so there are few fertility measurements in the middle of the range. This is similar to some of the examples in chapter 11 of Extending the Linear Model with R.

library(ggplot2)  # needed for ggplot()
ggplot(swiss, aes(x = Catholic, y = Fertility)) + geom_point()

We can use the ksmooth function from the stats package in R to perform Nadaraya-Watson kernel estimation. We can try several bandwidth values (\(\lambda\)) to see which ones provide the best fit.
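For reference, the estimator can be written as a kernel-weighted average of the observed responses. The notation below is an assumption added for illustration and is not taken from the original analysis: \(K\) is the kernel (a Gaussian density here), and \((x_i, y_i)\), \(i = 1, \dots, n\), are the observed Catholic and Fertility values.

\[
\hat{f}_\lambda(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{\lambda}\right) y_i}{\sum_{j=1}^{n} K\!\left(\frac{x - x_j}{\lambda}\right)}
\]

A small \(\lambda\) concentrates the weights on the nearest observations, while a large \(\lambda\) spreads them out and pulls the estimate toward the overall mean. Note that the bandwidth argument of ksmooth is not \(\lambda\) on exactly this scale: per its documentation, the kernels are scaled so that their quartiles are at \(\pm 0.25 \times\) bandwidth.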

library(glue)  # string interpolation for the plot titles

bandwidths <- c(1, 5, 10, 15, 20, 25, 30, 40, 50)

# Loop over smoothing parameters to judge the fit to the data visually
for (b in bandwidths) {
  # Nadaraya-Watson estimate with a Gaussian ("normal") kernel
  smoothed <- ksmooth(swiss$Catholic, swiss$Fertility, kernel = "normal", bandwidth = b)
  plot(Fertility ~ Catholic, data = swiss, main = glue("bandwidth = {b}"))
  lines(smoothed)
}

Low bandwidth values overfit the data and produce rough, wiggly estimates. At the other end, high bandwidth values (40, 50) smooth the estimate too much. We are aiming for the Goldilocks zone when selecting \(\lambda\), and the fit that looks most likely to generalize is at about \(\lambda = 20\).
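Choosing the bandwidth by eye is subjective, so one possible cross-check is leave-one-out cross-validation. The sketch below is not part of the original analysis: nw_estimate and loocv_mse are hypothetical helper names, and the estimator is hand-written with an untruncated Gaussian kernel scaled the way the ksmooth documentation describes, so its values may differ slightly from what ksmooth itself returns.

```r
# Hypothetical helper: Nadaraya-Watson estimate at the points x0,
# using a Gaussian kernel fitted to the data (x, y).
nw_estimate <- function(x0, x, y, bandwidth) {
  # Assumed scaling, following the ksmooth docs:
  # kernel quartiles at +/- 0.25 * bandwidth
  kernel_sd <- 0.25 * bandwidth / qnorm(0.75)
  sapply(x0, function(pt) {
    # Work on the log scale so tiny bandwidths cannot underflow to 0/0
    log_w <- dnorm(x - pt, sd = kernel_sd, log = TRUE)
    w <- exp(log_w - max(log_w))
    sum(w * y) / sum(w)
  })
}

# Hypothetical helper: leave-one-out mean squared prediction error
loocv_mse <- function(x, y, bandwidth) {
  errors <- sapply(seq_along(x), function(i) {
    # Predict observation i from all the other observations
    y[i] - nw_estimate(x[i], x[-i], y[-i], bandwidth)
  })
  mean(errors^2)
}

bandwidths <- c(1, 5, 10, 15, 20, 25, 30, 40, 50)
cv_scores <- sapply(bandwidths, function(b) {
  loocv_mse(swiss$Catholic, swiss$Fertility, b)
})

# One row per candidate bandwidth; lower scores suggest better generalization
cv_table <- data.frame(bandwidth = bandwidths, loocv_mse = cv_scores)
print(cv_table)

# Candidate with the lowest leave-one-out error
best_bandwidth <- bandwidths[which.min(cv_scores)]
```

The bandwidth with the lowest score can then be compared against the visual choice of about 20. With so few provinces in the middle of the Catholic range, the scores for neighbouring bandwidths may be close, so this is best read as a sanity check rather than a definitive answer.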

Overall, non-parametric regression can be useful when there isn’t much prior knowledge about the underlying relationship between the response and the covariates. However, it should be used sparingly, as overfitting and poor generalization can be concerns.