Let \(X\) be an \(N\times 2\) matrix, interpreted as a sample of \(N\) points in \(\mathbb R^2\) drawn from a density \(f(x)\).

Let \(\mathcal G\) be a rectangular grid in \(\mathbb R^2\) with \(k\) rectangular cells.

Define \(\mathbf z_{\mathcal G} = \{z_i: \, i = 1, \ldots, k\}\), where \(z_i\) is the number of points of \(X\) falling in rectangle \(i\) of \(\mathcal G\).

  1. Given \(\mathbf z_{\mathcal G}\), estimate the density \(f\) non-parametrically.

  2. Derive a significance test for the hypothesis that \(f\) is bivariate Gaussian.

Data simulation setup

Define simulation parameters for the underlying Gaussian density

N <- 1e4                                                   # number of latent points
mu <- c(x = 1, y = 1)                                      # mean of the Gaussian
Sigma <- matrix(c(4, 3, 3, 9), 2, 2)                       # covariance matrix
grid_size <- c(10, 10)                                     # number of cells per dimension
GT <- GridTopology(c(x = -9, y = -9), c(2, 2), grid_size)  # first cell centre, cell size, dimensions
SG <- SpatialGrid(GT)

Simulate latent and aggregated data

## Latent (unobserved) data
dat <- data.frame(rmvnorm(N, mu, Sigma))

## Aggregate into bins (with the aggregate() helper defined in the Appendix)
zG <- aggregate(dat, SG)

Visualise

ggplot(as.data.frame(zG), aes(x, y)) + 
  geom_point(data = dat) +
  geom_raster(aes(fill = counts), alpha = 0.95) +
  coord_fixed() +
  scale_fill_viridis()

1. Density surface estimation

We will use inverse-distance weighted (IDW) interpolation (see, e.g., Bivand et al. 2008, Sec. 8.3.1).

There are two tuning parameters that control the output:

  1. idp: the inverse distance power, which determines the degree to which nearer points are preferred over more distant ones; for large values, IDW converges to one-nearest-neighbour interpolation.

  2. nmax: the maximum number of nearest observations to average over; by default, all observations are used.

Since we have aggregated data, we want a low value of idp in order to smooth out the binned counts. On the other hand, since many of the grid cells have very low counts, this shrinks the prediction too strongly towards zero except very close to a centroid. We can mitigate this by limiting the average to the nmax nearest observations.

We need to experiment a bit with both values in order to get a reasonably smooth estimate.

pred.grid <- refine_grid(zG, 10)
idw.out <- idw(counts ~ 1, zG, pred.grid, idp = 1.5, nmax = 5)
## [inverse distance weighted interpolation]
ggplot(as.data.frame(idw.out), aes(x, y)) + 
  geom_point(data = dat) +
  geom_raster(aes(fill = var1.pred), alpha = 0.95) +
  coord_fixed() +
  scale_fill_viridis()

Comparison of predicted vs. true intensity.

## True generating density surface
idw.out$true_dens <- dmvnorm(coordinates(idw.out), mu, Sigma)

## Predicted intensity, rescaled to the same total as the true density
idw.out$pred <- idw.out$var1.pred/sum(idw.out$var1.pred)*sum(idw.out$true_dens)

ggplot(as.data.frame(idw.out), aes(true_dens, pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, col = 'darkgray')
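
As a quick numerical check of the agreement, a minimal sketch using the pred and true_dens columns created above:

## Correlation and RMSE between the predicted and true surfaces
with(as.data.frame(idw.out), {
  cat("correlation:", cor(pred, true_dens), "\n")
  cat("RMSE:       ", sqrt(mean((pred - true_dens)^2)), "\n")
})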

2. Testing for Normality

We could devise a Chi-squared goodness-of-fit test (e.g. http://itl.nist.gov/div898/handbook/eda/section3/eda35f.htm).

Basically, the procedure would be as follows:

  1. Assume the original distribution is bivariate Gaussian with unknown mean \(\mu\) and covariance matrix \(\Sigma\) (Null Hypothesis)

  2. Under this hypothesis, compute the expected frequency \(E_i\) of observations in each rectangle \(i = 1, \ldots, k\) of the grid. For this, the \(c = 5\) parameters of the distribution need to be estimated (i.e. the marginal means \(\mu = (x_0, y_0)\), the standard deviations \(\sigma_x\), \(\sigma_y\), and the correlation \(\rho\)).

  3. Measure the discrepancy between the observed and the expected with the statistic \[\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}\]

  4. When \(N\) is sufficiently large, this statistic approximately follows a \(\chi^2_{k-c-1}\) distribution (one further degree of freedom is lost because the cell counts sum to \(N\)). Compute the p-value \(p = P(\chi^2_{k-c-1} > \chi^2)\). (A code sketch follows this list.)
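
Steps 3 and 4 are mechanical once the \(E_i\) are available. A minimal sketch in base R, where observed and expected are hypothetical vectors holding the \(O_i\) and \(E_i\), and n_par is the number of estimated parameters \(c\):

## Chi-squared goodness-of-fit statistic and p-value from binned counts
chisq_gof <- function(observed, expected, n_par = 5) {
  stopifnot(length(observed) == length(expected))
  stat <- sum((observed - expected)^2 / expected)
  df <- length(observed) - n_par - 1           # k - c - 1 degrees of freedom
  p <- pchisq(stat, df = df, lower.tail = FALSE)
  list(statistic = stat, df = df, p.value = p)
}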

The technical issues in this case are:

  1. The estimation of \(\mu\) and \(\Sigma\) from binned data. For the mean, I guess that the centre of mass would be a reasonable choice, probably even unbiased. But the estimation of the standard deviations and the correlation is less clear.

I think the easiest would be a simulation approach: define a function of \(\mu\) and \(\Sigma\) that generates millions of samples from the corresponding Gaussian and calculates the resulting frequencies in each grid cell. The proportion of simulations whose frequencies reproduce the observed ones estimates the likelihood of \(\mu\) and \(\Sigma\); in practice, it is easier to estimate the cell probabilities from the simulated sample and evaluate the corresponding multinomial likelihood of the observed counts.

This function can then be used within a standard optimization algorithm to maximize the likelihood over \(\mu\) and \(\Sigma\), as sketched below.
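
A minimal sketch of this idea, using the zG and SG objects from above. The multinomial form is my practical rendering of the simulated likelihood, and the names mu0, sim_loglik and n_sim are introduced here for illustration:

## Centre of mass of the binned data: a natural starting value for mu
mu0 <- apply(coordinates(zG), 2, weighted.mean, w = zG$counts)

## Simulated log-likelihood of (mu, Sigma) given the binned counts
sim_loglik <- function(mu, Sigma, counts, grid, n_sim = 1e6) {
  sim <- data.frame(rmvnorm(n_sim, mu, Sigma))
  names(sim) <- c("x", "y")
  coordinates(sim) <- ~ x + y                  # promote to SpatialPoints
  cell <- over(sim, grid)                      # grid cell index of each draw (NA if outside the grid)
  k <- prod(gridparameters(grid)$cells.dim)
  p <- as.vector(table(factor(cell, levels = 1:k))) / n_sim  # Monte Carlo cell probabilities
  p <- pmax(p, 1e-12)                          # guard against log(0)
  sum(counts * log(p))                         # multinomial log-likelihood, up to a constant
}

For instance, sim_loglik(mu0, Sigma_guess, zG$counts, SG), with Sigma_guess a hypothetical starting covariance, could be wrapped into an objective for optim() over \(\mu\) and a parametrisation of \(\Sigma\).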

  2. The computation of the expected frequencies within each grid cell. This can be done either by numerical integration or by Monte Carlo simulation (see the sketch below).
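
For the numerical-integration route, mvtnorm::pmvnorm integrates a Gaussian over an axis-aligned rectangle, which is exactly what a grid cell is. A sketch under that assumption, where mu_hat and Sigma_hat stand for the hypothetical fitted parameters from issue 1:

## Expected counts E_i under the fitted Gaussian, by numerical integration
expected_counts <- function(mu_hat, Sigma_hat, grid, N) {
  ctr  <- coordinates(grid)                    # cell centres (k x 2)
  half <- gridparameters(grid)$cellsize / 2    # half cell widths in x and y
  E <- apply(ctr, 1, function(cc)
    pmvnorm(lower = cc - half, upper = cc + half,
            mean = mu_hat, sigma = Sigma_hat))
  N * as.numeric(E)                            # expected frequency per cell
}

These \(E_i\), together with the observed counts zG$counts, plug directly into the chisq_gof sketch above.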

References

Roger S. Bivand, Edzer J. Pebesma and Virgilio Gómez-Rubio. Applied Spatial Data Analysis With R. Second edition. Springer 2008.

Appendix

Packages used:

library(dplyr)
library(ggplot2)
library(viridis)
library(gstat)
library(mvtnorm)
library(sp)
theme_set(theme_bw())

Auxiliary functions

## Write a matrix x as a LaTeX string
mat2tex <- function(x, d = 0) {
  require(xtable)
  xt <- xtable(x,
               digits = d,
               align = rep("", ncol(x)+1)) # one empty alignment string per column, plus one
  print(xt,
        floating=FALSE,
        tabular.environment="pmatrix",
        hline.after=NULL,
        include.rownames=FALSE,
        include.colnames=FALSE,
        comment = FALSE)
}

## Count the number of points in each cell of a grid
## (note: this masks the aggregate() generic from the stats package)
aggregate <- function(pts, grid) {
  points <- as.data.frame(pts)
  coordinates(points) <- ~ x + y            # promote to SpatialPoints

  agg <- over(points, grid)                 # grid cell index of each point
  n_cells <- prod(gridparameters(grid)$cells.dim)
  counts <- unclass(table(factor(agg, levels = 1:n_cells)))
  SpatialGridDataFrame(grid, data.frame(counts))
}

## Refine a grid by a factor of k in each dimension
refine_grid <- function(grd, k) {
  stopifnot(k > 1)
  gp <- gridparameters(grd)
  dim <- gp$cells.dim * k                              # k times more cells per dimension
  size <- gp$cellsize / k                              # correspondingly smaller cells
  offset <- gp$cellcentre.offset - size * (k - 1) / 2  # centre of the new first cell
  names(offset) <- row.names(gp)
  SpatialGrid(GridTopology(offset, size, dim))
}