Data 621 Blog 4

Introduction

In this blog, we examine lowess, a statistical curve-fitting technique. We will provide some mathematical preliminaries, then give an example calculation followed by an example in R. Lastly, we describe a commonly used R graphical function where lowess is the underlying curve fitting methodology.

Mathematical Overview

Lowess regression stands for Locally weighted scatterplot smoothing. Lowess is a non-parametric method of estimating a curve to fit observed data points. Assuming a data set of \(N\) points in \(R^n\), at each point \(x\), a subset \(S(x)\) of nearby points is chosen based on a distance metric – usually the Euclidean distance - of size \(s < N\). Typically \(s\) is chosen to represent a fixed percentage \(\alpha\) of the available \(N\) points.

For each point \(x_i \in S(x)\), a weight function \(w(x_i)\) is evaluated in order to assign more influence to neighboring points to \(x\). A common choice is the tri-cubic function \(h(d)\) defined as:

\[h(d) = (1 - |d|^3)^3 \text{ if } |d| < 1 \] For the tri-cubic function, the weight of all points is scaled by the maximum distance realized within the subset \(S(x)\).

\[ w(x_i) = h\left( \frac{ || x - x_i || }{ \max_{j} || x - x_j ||} \right)\]

Once the weights are derived for the points in \(S(x)\), a local regression function is estimated in a neighborhood using a polynomial of degree 1 or 2 typically. Solving for the regression involves deriving the coefficients of a linear or quadratic function using a weighted least squares approach. We refer the reader to one of the references linked below to explore the mathematical details further.

One advantage of the Lowess approach is that the data set need not be linear. No functional form of the curve used to fit the data is assumed globally. The approach is computationally intensive but often used in exploratory data analysis where more theoretical approaches may fail.

Disdvantages of the lowess approach include its greater computing cost and lack of a clear mathematical formula to represent the fitted curve.

Example

R directly supports lowess through a direct function call in the base package. We will illustrate the functionality with a variation on a well known example of simulated data. Consider a sine wave function \(f(x)= sin(x)\) on the interval \(x \in [0, 4\pi]\) that represents the deterministic function term. We will add a simulated noise term to obtain the observed function \(g(x)\).

We define the observed function \(g\) as:

\[g(x) = f(x) + v(x) \cdot \epsilon\] where \(\epsilon\) is a random variable drawn from a uniform \([0,1]\) distribution and \(v(x)\) is a volatility function that depends on \(x\).

\[v(x) = 0.2 \frac{x}{2\pi}\]

We will take a sample of \(N\) points in the interval \(x_i \in [0, 2\pi]\), calculate their observed function value \(g(x_i)\) and apply the R package lowess function to fit a curve through the observed data. The code is presented below.

set.seed(10)
u = seq(0, 1, by = 0.01)   # Obtain a sample of points
x = 4 * pi * u             # Scale the sample of points to be uniform net over the desired interval

scale = 0.2                # Volatility scale term
f = sin(x)                 # Define the deterministic function term
e = runif(length(x)) - 0.5 # Define the vector of uniformly distributed random variates
v = scale * x / (2 * pi)

g = f + v * e              # Define the observed data sample

df = data.frame( x= x, y = g )  # Store the data into a dataframe

head(df) %>% kable(digits=3)  # Some sample output

x	y
0.000	0.000
0.126	0.125
0.251	0.248
0.377	0.370
0.503	0.475
0.628	0.582

A plot of the raw data shows the heteroskedasticity is noticeable for \(x > 10\).

ggplot(data=df, aes(x,y))  + geom_point()

The plot below shows the raw data and the underlying sine wave function in blue.

plot(df$x, df$y, main="True Underlying Function and Observed Data", xlab="x", ylab="y")
lines(data.frame(x, f) , col = 4, lty = 2 , lwd = 3)

Finally, the plot below shows the lowess fit with a span parameter of 0.2 which means 20% of the neighboring points are used to fit each curve. When span is smaller (say 0.1), the lowess curve (in red) will more closely hug the blue curve.

Instead, we have chosen the span to be sufficiently large to create visible divergence. Notice that the lowess curve systematically undershoots the blue curve at the maxima and overshoots the blue curve at the minima. This clear bias may be due to the fact that points in a local neighborhood of the extrema are biased to lower or higher respectively. The local regression shows higher weight more than bottom but the leverage of the data points in the sample penalizes errors farther away from the local extrema. This likely explains the observed bias.

plot(df$x, df$y, xlab = "x", ylab="y",
     main="Lowess Fit vs. True Underlying Function")
lines(lowess(df, f = 0.2), col = 2, lty = 1,  lwd = 2 )

lines(data.frame(x, f) , col = 4, lty = 2 , lwd = 3)
legend( "bottomleft", legend = c( "Lowess", "True Function"), 
        lty = c(1,2) , col = c( "red", "blue"))

Application using ggplot2

One application of lowess is implemented in the ggplot2 library as the geom_smooth() function. As Hadley Wickham points out, “This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence interval”. The method argument controls curve generation and defaults to “loess”.

A sample invocation of geom_smooth() with span parameter set to 0.2 matches the above previous lowess example invoked directly above.

ggplot( df , aes( x, y)) + geom_point() + geom_smooth(span=0.2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Wickham points out that loess does not work well for large datasets because it uses \(O(n^2)\) memory so the when \(N>1000\) points the loess algorithm is not used.

References

https://ggplot2-book.org/ by Hadley Wickham describes geom_smooth() and loess in Section 2.6.1

https://en.wikipedia.org/wiki/Local_regression describes the lowess regression method and the mathematical framework.

https://towardsdatascience.com/loess-373d43b03564 Joao Paulo Figueira presents the sine wave example without a heteroskedastic noise term and shows the tri-cubic weight function with code in Python.

Data 621 Blog 4

Lowess in Practice

Alexander Ng

4/12/2020

Introduction

Mathematical Overview

Example

Application using ggplot2

References