Week 1 - Kernel density estimation (KDE)

Introduction/organization

Structure, generally:

  1. quick summary of readings (\(\leq 10\) minutes)
    • quick because we all read it
    • sprint through definitions, concepts, theorems, important figures
  2. zoom in on 1 or 2 topics (\(\approx 20\) minutes)
    • a proof, demo, simulation, application, …
  3. Overleaf questions

Stop me if I go over 30 min. It is a book club, not a lecture.

Short-term plan:

  • Week 1: 30 minutes of 1, no topics
  • Week 2: using simulations for self-study
  • Week 3: proof of key theorem

Let me know if you want something different.

  • topic: kernel density estimation (KDE)
  • readings: HPSE 17.1-17.7, 17.9, 17.14, 17.15

Summary

One of the most basic tasks in data analysis is to construct a summary for one variable in your sample data.

Here is some data from the Current Population Survey:

library(tidyverse)
library(janitor)
library(AER)
data("CPSSWEducation")
glimpse(CPSSWEducation)
Rows: 2,950
Columns: 4
$ age       <int> 30, 30, 30, 30, 30, 30, 30, 29, 29, 30, 29, 29, 29, 29, 29, ~
$ gender    <fct> male, female, female, female, female, female, male, male, ma~
$ earnings  <dbl> 34.615383, 19.230770, 13.736263, 13.942307, 19.230770, 8.000~
$ education <int> 16, 16, 12, 13, 16, 12, 12, 16, 16, 12, 14, 18, 18, 11, 12, ~

How to summarize the information in a given variable?

For categorical variables, we can use frequency tables.

Here is one for gender:

CPSSWEducation %>% tabyl(gender)
 gender    n   percent
 female 1202 0.4074576
   male 1748 0.5925424

and one for education.

CPSSWEducation %>% tabyl(education)
 education   n    percent
         6  45 0.01525424
         8  35 0.01186441
         9  49 0.01661017
        10  35 0.01186441
        11  61 0.02067797
        12 887 0.30067797
        13 607 0.20576271
        14 307 0.10406780
        16 752 0.25491525
        18 172 0.05830508

You can use their graphical analog, the bar plot:

The height of the bar at value \(x\) is \[\widehat P(X = x) = \frac{1}{n} \sum_{i=1}^n 1\{X_i = x\}.\]
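The bar height is just a sample share. A minimal sketch, using a small made-up vector `x` (an assumption for illustration, not the CPS data) so that it runs without any packages:

```r
# Hypothetical sample of a discrete variable (illustration only, not the CPS data)
x <- c(12, 12, 13, 16, 12, 16, 14, 12)

# Empirical frequency of one value: the share of observations equal to 12
p_hat_12 <- mean(x == 12)

# The same calculation for every distinct value at once
p_hat_all <- prop.table(table(x))

p_hat_12
p_hat_all
```

The entries of `p_hat_all` are the bar heights of the corresponding bar plot, and they sum to one.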

For continuous variables, this does not work: most values occur only once or a handful of times, so the table is almost as long as the data.

CPSSWEducation %>% tabyl(earnings)
  earnings   n      percent
  2.136752   1 0.0003389831
  2.403846   3 0.0010169492
  2.564103   1 0.0003389831
  2.622378   1 0.0003389831
  2.644231   1 0.0003389831
  2.724359   1 0.0003389831
  2.747253   1 0.0003389831
  2.797203   1 0.0003389831
  2.884615   6 0.0020338983
  2.958580   1 0.0003389831
  3.058824   1 0.0003389831
  3.076923   1 0.0003389831
  3.125000   1 0.0003389831
  3.219697   1 0.0003389831
  3.296703   1 0.0003389831
  3.328402   2 0.0006779661
  3.571429   2 0.0006779661
  3.605769   1 0.0003389831
  3.846154   5 0.0016949153
  3.885004   1 0.0003389831
  3.891941   1 0.0003389831
  4.120879   2 0.0006779661
  4.166667   1 0.0003389831
  4.200000   1 0.0003389831
  4.273504   1 0.0003389831
  4.273626   1 0.0003389831
  4.307693   1 0.0003389831
  4.326923   2 0.0006779661
  4.423077   1 0.0003389831
  4.430769   1 0.0003389831
  4.487179   1 0.0003389831
  4.554656   1 0.0003389831
  4.567307   1 0.0003389831
  4.615385   1 0.0003389831
  4.653846   1 0.0003389831
  4.730769   1 0.0003389831
  4.807693  12 0.0040677966
  4.967033   1 0.0003389831
  5.000000   1 0.0003389831
  5.021368   1 0.0003389831
  5.076923   2 0.0006779661
  5.128205   1 0.0003389831
  5.144231   1 0.0003389831
  5.208333   2 0.0006779661
  5.263158   1 0.0003389831
  5.274725   1 0.0003389831
  5.288462   8 0.0027118644
  5.341880   1 0.0003389831
  5.384615   2 0.0006779661
  5.448718   1 0.0003389831
  5.494505   2 0.0006779661
  5.528846   2 0.0006779661
  5.538462   1 0.0003389831
  5.547337   2 0.0006779661
  5.555555   1 0.0003389831
  5.594406   1 0.0003389831
  5.673077   1 0.0003389831
  5.723443   1 0.0003389831
  5.769231  24 0.0081355932
  5.833333   1 0.0003389831
  5.982906   1 0.0003389831
  6.000000   2 0.0006779661
  6.009615   1 0.0003389831
  6.043956   1 0.0003389831
  6.122449   1 0.0003389831
  6.129807   1 0.0003389831
  6.153846   3 0.0010169492
  6.172840   1 0.0003389831
  6.221719   1 0.0003389831
  6.224696   1 0.0003389831
  6.230769   1 0.0003389831
  6.250000  17 0.0057627119
  6.258503   1 0.0003389831
  6.303922   1 0.0003389831
  6.318681   1 0.0003389831
  6.346154   1 0.0003389831
  6.384615   1 0.0003389831
  6.410256   3 0.0010169492
  6.557693   1 0.0003389831
  6.593407   4 0.0013559322
  6.596154   1 0.0003389831
  6.722689   1 0.0003389831
  6.730769  18 0.0061016949
  6.734694   1 0.0003389831
  6.736363   1 0.0003389831
  6.750000   1 0.0003389831
  6.770833   1 0.0003389831
  6.778846   1 0.0003389831
  6.783217   1 0.0003389831
  6.868132   1 0.0003389831
  6.875000   1 0.0003389831
  6.923077   6 0.0020338983
  6.938776   1 0.0003389831
  6.946154   1 0.0003389831
  6.971154   2 0.0006779661
  6.976744   1 0.0003389831
  7.000000   1 0.0003389831
  7.083333   1 0.0003389831
  7.142857   4 0.0013559322
  7.211538  42 0.0142372881
  7.250000   1 0.0003389831
  7.259615   1 0.0003389831
  7.272727   1 0.0003389831
  7.294117   1 0.0003389831
  7.307693   1 0.0003389831
  7.339744   1 0.0003389831
  7.350000   1 0.0003389831
  7.371795   1 0.0003389831
  7.403846   1 0.0003389831
  7.451923   1 0.0003389831
  7.478632   1 0.0003389831
  7.500000  15 0.0050847458
  7.504691   1 0.0003389831
  7.548077   1 0.0003389831
  7.591093   1 0.0003389831
  7.632653   1 0.0003389831
  7.632692   1 0.0003389831
  7.644231   1 0.0003389831
  7.692307  25 0.0084745763
  7.766272   1 0.0003389831
  7.788462   1 0.0003389831
  7.812500   2 0.0006779661
  7.820513   1 0.0003389831
  7.840237   1 0.0003389831
  7.843137   1 0.0003389831
  7.846154   1 0.0003389831
  7.932693   3 0.0010169492
  7.982584   1 0.0003389831
  7.984615   1 0.0003389831
  8.000000   5 0.0016949153
  8.012820   3 0.0010169492
  8.069231   1 0.0003389831
  8.076923   1 0.0003389831
  8.108109   1 0.0003389831
  8.119781   1 0.0003389831
  8.133846   1 0.0003389831
  8.136095   3 0.0010169492
  8.153846   1 0.0003389831
  8.170513   1 0.0003389831
  8.173077  22 0.0074576271
  8.190884   1 0.0003389831
  8.192788   1 0.0003389831
  8.222596   1 0.0003389831
  8.238867   1 0.0003389831
  8.241758   6 0.0020338983
  8.284023   1 0.0003389831
  8.284314   1 0.0003389831
  8.288462   1 0.0003389831
  8.333333   1 0.0003389831
  8.444445   1 0.0003389831
  8.461538   2 0.0006779661
  8.489583   1 0.0003389831
  8.547009   6 0.0020338983
  8.571428   1 0.0003389831
  8.603239   2 0.0006779661
  8.653846  33 0.0111864407
  8.741259   2 0.0006779661
  8.750000   2 0.0006779661
  8.791209   4 0.0013559322
  8.798077   1 0.0003389831
  8.800000   1 0.0003389831
  8.810375   1 0.0003389831
  8.814102   1 0.0003389831
  8.846154   1 0.0003389831
  8.875740   1 0.0003389831
  8.881119   1 0.0003389831
  8.888889   1 0.0003389831
  8.894231   2 0.0006779661
  8.935509   1 0.0003389831
  8.947369   1 0.0003389831
  8.974359   2 0.0006779661
  8.990385   1 0.0003389831
  9.000000   2 0.0006779661
  9.065934   1 0.0003389831
  9.081197   1 0.0003389831
  9.113461   1 0.0003389831
  9.114583   1 0.0003389831
  9.134615  19 0.0064406780
  9.183674   1 0.0003389831
  9.230769   7 0.0023728814
  9.245563   3 0.0010169492
  9.265735   1 0.0003389831
  9.285714   2 0.0006779661
  9.326923   1 0.0003389831
  9.340659   2 0.0006779661
  9.348290   1 0.0003389831
  9.355510   1 0.0003389831
  9.368836   1 0.0003389831
  9.375000   2 0.0006779661
  9.380863   1 0.0003389831
  9.391771   1 0.0003389831
  9.401710   1 0.0003389831
  9.410802   1 0.0003389831
  9.452991   1 0.0003389831
  9.473684   1 0.0003389831
  9.500000   1 0.0003389831
  9.519231   1 0.0003389831
  9.545455   1 0.0003389831
  9.558824   1 0.0003389831
  9.567307   1 0.0003389831
  9.568750   1 0.0003389831
  9.586057   1 0.0003389831
  9.600000   2 0.0006779661
  9.615385  98 0.0332203390
  9.616346   1 0.0003389831
  9.616667   1 0.0003389831
  9.635417   1 0.0003389831
  9.759231   1 0.0003389831
  9.759615   1 0.0003389831
  9.763313   2 0.0006779661
  9.790210   1 0.0003389831
  9.803922   2 0.0006779661
  9.807693   1 0.0003389831
  9.829060   2 0.0006779661
  9.871795   1 0.0003389831
  9.890110   3 0.0010169492
  9.935898   2 0.0006779661
  9.951923   1 0.0003389831
 10.000000  23 0.0077966102
 10.024038   1 0.0003389831
 10.038462   1 0.0003389831
 10.059172   1 0.0003389831
 10.096154  19 0.0064406780
 10.121457   2 0.0006779661
 10.149572   1 0.0003389831
 10.164835   1 0.0003389831
 10.204082   1 0.0003389831
 10.230769   1 0.0003389831
 10.250000   1 0.0003389831
 10.256411  11 0.0037288136
 10.302197   2 0.0006779661
 10.324519   1 0.0003389831
 10.336538   2 0.0006779661
 10.375000   1 0.0003389831
 10.384615   2 0.0006779661
 10.415865   1 0.0003389831
 10.416667   3 0.0010169492
 10.439561   1 0.0003389831
 10.474038   1 0.0003389831
 10.500000   1 0.0003389831
 10.526316   1 0.0003389831
 10.531136   1 0.0003389831
 10.555555   1 0.0003389831
 10.576923  36 0.0122033898
 10.625000   2 0.0006779661
 10.627530   3 0.0010169492
 10.683761   6 0.0020338983
 10.714286   1 0.0003389831
 10.769231   5 0.0016949153
 10.817307   5 0.0016949153
 10.856079   1 0.0003389831
 10.859729   1 0.0003389831
 10.897436   1 0.0003389831
 10.914761   2 0.0006779661
 10.925000   1 0.0003389831
 10.937500   1 0.0003389831
 10.964912   1 0.0003389831
 10.989011   9 0.0030508475
 11.000000   6 0.0020338983
 11.009615   1 0.0003389831
 11.015865   1 0.0003389831
 11.017629   1 0.0003389831
 11.025641   1 0.0003389831
 11.051282   1 0.0003389831
 11.057693  26 0.0088135593
 11.076923   2 0.0006779661
 11.100000   1 0.0003389831
 11.111111   2 0.0006779661
 11.188811   1 0.0003389831
 11.217949   5 0.0016949153
 11.250000   4 0.0013559322
 11.274509   1 0.0003389831
 11.298077   1 0.0003389831
 11.332417   1 0.0003389831
 11.351352   1 0.0003389831
 11.363636   1 0.0003389831
 11.367521   1 0.0003389831
 11.428572   1 0.0003389831
 11.434511   1 0.0003389831
 11.446886   2 0.0006779661
 11.458333   1 0.0003389831
 11.500000   1 0.0003389831
 11.513158   1 0.0003389831
 11.530488   1 0.0003389831
 11.538462  67 0.0227118644
 11.555555   1 0.0003389831
 11.557693   1 0.0003389831
 11.600000   1 0.0003389831
 11.602884   1 0.0003389831
 11.647436   1 0.0003389831
 11.653846   1 0.0003389831
 11.666667   1 0.0003389831
 11.730769   2 0.0006779661
 11.805555   1 0.0003389831
 11.834319   1 0.0003389831
 11.868132   1 0.0003389831
 11.875000   1 0.0003389831
 11.884266   1 0.0003389831
 11.899038   1 0.0003389831
 11.904762   3 0.0010169492
 11.914893   1 0.0003389831
 11.923077   3 0.0010169492
 11.950000   1 0.0003389831
 12.000000   5 0.0016949153
 12.019231  76 0.0257627119
 12.087913   3 0.0010169492
 12.115385   2 0.0006779661
 12.133699   1 0.0003389831
 12.204142   1 0.0003389831
 12.219551   1 0.0003389831
 12.237762   4 0.0013559322
 12.254902   1 0.0003389831
 12.259615   1 0.0003389831
 12.307693  13 0.0044067797
 12.336720   1 0.0003389831
 12.352942   1 0.0003389831
 12.393163   3 0.0010169492
 12.395299   1 0.0003389831
 12.400000   1 0.0003389831
 12.500000  46 0.0155932203
 12.538462   1 0.0003389831
 12.561058   1 0.0003389831
 12.585470   1 0.0003389831
 12.587413   1 0.0003389831
 12.596154   2 0.0006779661
 12.628572   1 0.0003389831
 12.637362   3 0.0010169492
 12.674825   1 0.0003389831
 12.692307   4 0.0013559322
 12.740385   4 0.0013559322
 12.750000   1 0.0003389831
 12.820192   1 0.0003389831
 12.820513  14 0.0047457627
 12.912087   1 0.0003389831
 12.920673   1 0.0003389831
 12.932693   1 0.0003389831
 12.941176   1 0.0003389831
 12.969588   1 0.0003389831
 12.980769  37 0.0125423729
 13.000000   1 0.0003389831
 13.020833   2 0.0006779661
 13.062409   2 0.0006779661
 13.076923   1 0.0003389831
 13.125000   1 0.0003389831
 13.186813   2 0.0006779661
 13.197115   1 0.0003389831
 13.221154   6 0.0020338983
 13.247863   4 0.0013559322
 13.286714   4 0.0013559322
 13.313609   1 0.0003389831
 13.333333   2 0.0006779661
 13.354701   4 0.0013559322
 13.365385   1 0.0003389831
 13.366827   1 0.0003389831
 13.373077   1 0.0003389831
 13.416816   1 0.0003389831
 13.422596   1 0.0003389831
 13.461538  37 0.0125423729
 13.500000   1 0.0003389831
 13.513514   1 0.0003389831
 13.574660   1 0.0003389831
 13.580129   2 0.0006779661
 13.621795   1 0.0003389831
 13.636364   1 0.0003389831
 13.650962   1 0.0003389831
 13.663968   1 0.0003389831
 13.675214   1 0.0003389831
 13.690476   1 0.0003389831
 13.701923   2 0.0006779661
 13.736263   4 0.0013559322
 13.750000   3 0.0010169492
 13.759134   1 0.0003389831
 13.846154   6 0.0020338983
 13.888889   1 0.0003389831
 13.899038   1 0.0003389831
 13.900000   1 0.0003389831
 13.927350   1 0.0003389831
 13.942307  23 0.0077966102
 13.986014   3 0.0010169492
 14.000000   2 0.0006779661
 14.005602   1 0.0003389831
 14.022436   1 0.0003389831
 14.041347   1 0.0003389831
 14.053254   2 0.0006779661
 14.102564   3 0.0010169492
 14.140271   1 0.0003389831
 14.150944   1 0.0003389831
 14.155983   1 0.0003389831
 14.170040   2 0.0006779661
 14.182693   3 0.0010169492
 14.194139   1 0.0003389831
 14.211538   1 0.0003389831
 14.230769   3 0.0010169492
 14.278846   1 0.0003389831
 14.285714   3 0.0010169492
 14.302885   1 0.0003389831
 14.320786   1 0.0003389831
 14.349650   1 0.0003389831
 14.400000   1 0.0003389831
 14.423077 123 0.0416949153
 14.473684   1 0.0003389831
 14.497042   1 0.0003389831
 14.500000   1 0.0003389831
 14.502885   1 0.0003389831
 14.519231   1 0.0003389831
 14.529915   3 0.0010169492
 14.544025   1 0.0003389831
 14.577259   1 0.0003389831
 14.583333   2 0.0006779661
 14.601140   1 0.0003389831
 14.615385   4 0.0013559322
 14.679808   1 0.0003389831
 14.687500   2 0.0006779661
 14.697803   1 0.0003389831
 14.743589   1 0.0003389831
 14.792899   1 0.0003389831
 14.823718   1 0.0003389831
 14.835165   1 0.0003389831
 14.847116   1 0.0003389831
 14.857142   1 0.0003389831
 14.860140   2 0.0006779661
 14.874519   1 0.0003389831
 14.884615   1 0.0003389831
 14.903846  16 0.0054237288
 14.914530   1 0.0003389831
 14.925000   1 0.0003389831
 14.957265  10 0.0033898305
 15.000000  10 0.0033898305
 15.034965   1 0.0003389831
 15.046538   1 0.0003389831
 15.064102   1 0.0003389831
 15.109890   2 0.0006779661
 15.139116   1 0.0003389831
 15.144231   4 0.0013559322
 15.170940   1 0.0003389831
 15.182186   2 0.0006779661
 15.196078   1 0.0003389831
 15.224359   2 0.0006779661
 15.250545   1 0.0003389831
 15.271539   1 0.0003389831
 15.297203   1 0.0003389831
 15.307693   1 0.0003389831
 15.384615  69 0.0233898305
 15.432693   1 0.0003389831
 15.453297   1 0.0003389831
 15.461538   1 0.0003389831
 15.468227   2 0.0006779661
 15.476191   1 0.0003389831
 15.488461   1 0.0003389831
 15.532544   1 0.0003389831
 15.548282   1 0.0003389831
 15.559441   1 0.0003389831
 15.625000   2 0.0006779661
 15.698587   1 0.0003389831
 15.701923   1 0.0003389831
 15.734265   3 0.0010169492
 15.769231   1 0.0003389831
 15.796703   1 0.0003389831
 15.811966   1 0.0003389831
 15.865385  13 0.0044067797
 15.913462   1 0.0003389831
 15.923077   2 0.0006779661
 15.961538   1 0.0003389831
 15.976332   1 0.0003389831
 16.000000   3 0.0010169492
 16.025640   6 0.0020338983
 16.042780   1 0.0003389831
 16.063942   1 0.0003389831
 16.105770   1 0.0003389831
 16.112267   2 0.0006779661
 16.153847   3 0.0010169492
 16.198225   1 0.0003389831
 16.201923   1 0.0003389831
 16.203703   2 0.0006779661
 16.239317   4 0.0013559322
 16.240000   1 0.0003389831
 16.250000   2 0.0006779661
 16.272190   1 0.0003389831
 16.346153  32 0.0108474576
 16.400000   1 0.0003389831
 16.433567   1 0.0003389831
 16.483517   2 0.0006779661
 16.538462   1 0.0003389831
 16.586538   3 0.0010169492
 16.615385   1 0.0003389831
 16.642012   1 0.0003389831
 16.651031   1 0.0003389831
 16.666666   8 0.0027118644
 16.700405   2 0.0006779661
 16.707787   1 0.0003389831
 16.722408   2 0.0006779661
 16.746796   1 0.0003389831
 16.826923  96 0.0325423729
 16.875000   2 0.0006779661
 16.923077   1 0.0003389831
 16.941391   1 0.0003389831
 16.987179   1 0.0003389831
 17.032967   2 0.0006779661
 17.051283   1 0.0003389831
 17.067308   1 0.0003389831
 17.077404   1 0.0003389831
 17.083334   1 0.0003389831
 17.094017   8 0.0027118644
 17.105263   2 0.0006779661
 17.115385   1 0.0003389831
 17.129328   1 0.0003389831
 17.156862   1 0.0003389831
 17.159763   1 0.0003389831
 17.170330   1 0.0003389831
 17.184942   1 0.0003389831
 17.187500   2 0.0006779661
 17.195673   1 0.0003389831
 17.307692  36 0.0122033898
 17.308655   1 0.0003389831
 17.320000   1 0.0003389831
 17.320261   1 0.0003389831
 17.369230   1 0.0003389831
 17.441860   1 0.0003389831
 17.482517   4 0.0013559322
 17.500000   2 0.0006779661
 17.521368   3 0.0010169492
 17.533937   1 0.0003389831
 17.548077   2 0.0006779661
 17.620192   1 0.0003389831
 17.628204   1 0.0003389831
 17.660910   1 0.0003389831
 17.675962   1 0.0003389831
 17.692308   5 0.0016949153
 17.701050   1 0.0003389831
 17.751480   1 0.0003389831
 17.788462  15 0.0050847458
 17.857143   1 0.0003389831
 17.889088   1 0.0003389831
 17.948717   8 0.0027118644
 17.965588   1 0.0003389831
 17.980770   1 0.0003389831
 18.000000   2 0.0006779661
 18.028847   3 0.0010169492
 18.076923   2 0.0006779661
 18.121302   1 0.0003389831
 18.131868   2 0.0006779661
 18.132692   1 0.0003389831
 18.154762   1 0.0003389831
 18.162394   1 0.0003389831
 18.191269   1 0.0003389831
 18.218624   1 0.0003389831
 18.229166   1 0.0003389831
 18.269230  26 0.0088135593
 18.307692   1 0.0003389831
 18.315018   1 0.0003389831
 18.376068   2 0.0006779661
 18.461538   2 0.0006779661
 18.491125   2 0.0006779661
 18.509615   2 0.0006779661
 18.576923   1 0.0003389831
 18.589743   2 0.0006779661
 18.626373   1 0.0003389831
 18.696581   2 0.0006779661
 18.711020   1 0.0003389831
 18.750000  10 0.0033898305
 18.750481   1 0.0003389831
 18.803419   2 0.0006779661
 18.812710   2 0.0006779661
 18.830128   1 0.0003389831
 18.923611   1 0.0003389831
 19.097221   1 0.0003389831
 19.151846   1 0.0003389831
 19.181923   1 0.0003389831
 19.183674   1 0.0003389831
 19.230770 116 0.0393220339
 19.270834   1 0.0003389831
 19.326923   2 0.0006779661
 19.428572   1 0.0003389831
 19.471153   1 0.0003389831
 19.580420   1 0.0003389831
 19.607843   3 0.0010169492
 19.615385   1 0.0003389831
 19.658119   3 0.0010169492
 19.667831   1 0.0003389831
 19.711538  15 0.0050847458
 19.792692   1 0.0003389831
 19.871796   2 0.0006779661
 19.951923   3 0.0010169492
 20.000000  11 0.0037288136
 20.052404   1 0.0003389831
 20.085470   2 0.0006779661
 20.089285   1 0.0003389831
 20.146521   1 0.0003389831
 20.192308  27 0.0091525424
 20.242914   4 0.0013559322
 20.270269   1 0.0003389831
 20.299145   2 0.0006779661
 20.340237   1 0.0003389831
 20.384615   3 0.0010169492
 20.408163   1 0.0003389831
 20.432692   1 0.0003389831
 20.495951   2 0.0006779661
 20.512821   3 0.0010169492
 20.566238   1 0.0003389831
 20.673077  16 0.0054237288
 20.710060   1 0.0003389831
 20.748987   1 0.0003389831
 20.759615   1 0.0003389831
 20.769230   2 0.0006779661
 20.833334   3 0.0010169492
 20.875000   1 0.0003389831
 20.884615   2 0.0006779661
 20.940170   1 0.0003389831
 20.979021   1 0.0003389831
 21.000000   1 0.0003389831
 21.049572   1 0.0003389831
 21.059999   1 0.0003389831
 21.079882   1 0.0003389831
 21.129808   1 0.0003389831
 21.153847  20 0.0067796610
 21.367521  14 0.0047457627
 21.394230   2 0.0006779661
 21.418270   1 0.0003389831
 21.449703   3 0.0010169492
 21.500000   2 0.0006779661
 21.538462   2 0.0006779661
 21.634615  33 0.0111864407
 21.678322   1 0.0003389831
 21.750000   1 0.0003389831
 21.853148   1 0.0003389831
 21.878365   1 0.0003389831
 21.978022  11 0.0037288136
 22.000000   2 0.0006779661
 22.035257   1 0.0003389831
 22.055555   1 0.0003389831
 22.115385  15 0.0050847458
 22.123398   1 0.0003389831
 22.189348   1 0.0003389831
 22.222221   3 0.0010169492
 22.235577   1 0.0003389831
 22.307692   3 0.0010169492
 22.355770   1 0.0003389831
 22.395834   1 0.0003389831
 22.435898   1 0.0003389831
 22.459866   1 0.0003389831
 22.555555   1 0.0003389831
 22.596153   4 0.0013559322
 22.649572   1 0.0003389831
 22.773279   1 0.0003389831
 22.893772   1 0.0003389831
 22.951923   1 0.0003389831
 22.993311   1 0.0003389831
 23.000000   1 0.0003389831
 23.072308   1 0.0003389831
 23.076923  30 0.0101694915
 23.092789   1 0.0003389831
 23.148148   2 0.0006779661
 23.192308   1 0.0003389831
 23.455385   1 0.0003389831
 23.504274   6 0.0020338983
 23.557692   3 0.0010169492
 23.928366   1 0.0003389831
 23.931623   2 0.0006779661
 24.000000   1 0.0003389831
 24.038462  62 0.0210169492
 24.188847   1 0.0003389831
 24.230770   1 0.0003389831
 24.267399   1 0.0003389831
 24.291498   2 0.0006779661
 24.358974   2 0.0006779661
 24.518269   1 0.0003389831
 24.519230   3 0.0010169492
 24.615385   2 0.0006779661
 24.666666   1 0.0003389831
 24.725275   2 0.0006779661
 24.786325   2 0.0006779661
 24.954212   1 0.0003389831
 25.000000  25 0.0084745763
 25.240385   1 0.0003389831
 25.303644   1 0.0003389831
 25.320513   1 0.0003389831
 25.480770   3 0.0010169492
 25.510204   1 0.0003389831
 25.641026   7 0.0023728814
 25.846153   1 0.0003389831
 25.961538   7 0.0023728814
 26.041666   1 0.0003389831
 26.153847   2 0.0006779661
 26.250000   1 0.0003389831
 26.315790   2 0.0006779661
 26.346153   2 0.0006779661
 26.373627   1 0.0003389831
 26.442308  15 0.0050847458
 26.470589   2 0.0006779661
 26.556776   1 0.0003389831
 26.595745   1 0.0003389831
 26.634615   1 0.0003389831
 26.709402   1 0.0003389831
 26.923077  10 0.0033898305
 27.097902   1 0.0003389831
 27.237762   1 0.0003389831
 27.243589   1 0.0003389831
 27.262020   1 0.0003389831
 27.307692   1 0.0003389831
 27.403847   5 0.0016949153
 27.472527   4 0.0013559322
 27.692308   1 0.0003389831
 27.777779   5 0.0016949153
 27.884615   8 0.0027118644
 27.972029   1 0.0003389831
 27.980770   1 0.0003389831
 28.000000   2 0.0006779661
 28.365385   2 0.0006779661
 28.571428   2 0.0006779661
 28.616154   1 0.0003389831
 28.632479   2 0.0006779661
 28.671329   1 0.0003389831
 28.846153  29 0.0098305085
 29.401709   1 0.0003389831
 29.585798   2 0.0006779661
 29.647436   1 0.0003389831
 29.807692   8 0.0027118644
 29.914530   3 0.0010169492
 30.000000   1 0.0003389831
 30.064102   1 0.0003389831
 30.288462   5 0.0016949153
 30.364372   1 0.0003389831
 30.769230   9 0.0030508475
 30.982906   2 0.0006779661
 31.185032   1 0.0003389831
 31.250000  17 0.0057627119
 31.250481   1 0.0003389831
 31.730770   3 0.0010169492
 32.000000   1 0.0003389831
 32.051281   3 0.0010169492
 32.066345   1 0.0003389831
 32.211540   4 0.0013559322
 32.500000   1 0.0003389831
 32.692307   7 0.0023728814
 32.905983   1 0.0003389831
 33.173077   1 0.0003389831
 33.284023   1 0.0003389831
 33.653847  14 0.0047457627
 33.783783   1 0.0003389831
 33.846153   1 0.0003389831
 33.854168   1 0.0003389831
 34.134617   1 0.0003389831
 34.188034   5 0.0016949153
 34.340660   1 0.0003389831
 34.615383  11 0.0037288136
 34.722221   1 0.0003389831
 34.920635   1 0.0003389831
 35.096153   2 0.0006779661
 35.153847   1 0.0003389831
 35.164837   2 0.0006779661
 35.256409   1 0.0003389831
 35.470085   1 0.0003389831
 35.576923   2 0.0006779661
 35.714287   1 0.0003389831
 36.057693  14 0.0047457627
 36.324787   1 0.0003389831
 36.538460   2 0.0006779661
 36.923077   1 0.0003389831
 37.019230   2 0.0006779661
 37.115383   1 0.0003389831
 37.245193   1 0.0003389831
 37.500000   1 0.0003389831
 37.980770   1 0.0003389831
 38.014313   1 0.0003389831
 38.461540  10 0.0033898305
 38.942307   2 0.0006779661
 39.062500   1 0.0003389831
 39.215687   1 0.0003389831
 39.262821   1 0.0003389831
 39.448719   1 0.0003389831
 39.903847   2 0.0006779661
 40.063942   1 0.0003389831
 40.064102   1 0.0003389831
 40.384617   1 0.0003389831
 40.598289   1 0.0003389831
 40.865383   8 0.0027118644
 41.346153   1 0.0003389831
 41.538460   1 0.0003389831
 41.958042   1 0.0003389831
 42.307693   2 0.0006779661
 42.735043   1 0.0003389831
 43.269230   5 0.0016949153
 43.269711   1 0.0003389831
 43.750000   1 0.0003389831
 44.070515   1 0.0003389831
 44.230770   1 0.0003389831
 45.096153   1 0.0003389831
 45.454544   1 0.0003389831
 45.673077   3 0.0010169492
 46.153847   4 0.0013559322
 47.003845   1 0.0003389831
 47.115383   1 0.0003389831
 48.076923   3 0.0010169492
 48.557693   1 0.0003389831
 49.450550   1 0.0003389831
 49.519230   2 0.0006779661
 50.480770   1 0.0003389831
 50.909092   1 0.0003389831
 50.961540   1 0.0003389831
 51.282051   1 0.0003389831
 52.163460   1 0.0003389831
 52.884617   3 0.0010169492
 53.418804   1 0.0003389831
 53.981731   1 0.0003389831
 54.945053   1 0.0003389831
 55.288460   2 0.0006779661
 56.089745   1 0.0003389831
 56.604168   1 0.0003389831
 57.692307   3 0.0010169492
 59.134617   1 0.0003389831
 59.230770   1 0.0003389831
 61.111111   1 0.0003389831
 61.188812   1 0.0003389831
 66.784134   1 0.0003389831
 72.115387   1 0.0003389831
 74.519234   1 0.0003389831
 76.923080   1 0.0003389831
 82.775917   1 0.0003389831
 84.134613   1 0.0003389831
 86.538460   2 0.0006779661
 92.500000   1 0.0003389831
 97.500000   1 0.0003389831

The barplot also looks weird:

A solution: frequency tables for classes:

tabyl(cut(CPSSWEducation$earnings,10))
 cut(CPSSWEducation$earnings, 10)    n      percent
                      (2.04,11.7]  922 0.3125423729
                      (11.7,21.2] 1384 0.4691525424
                      (21.2,30.7]  424 0.1437288136
                      (30.7,40.3]  145 0.0491525424
                      (40.3,49.8]   43 0.0145762712
                      (49.8,59.4]   20 0.0067796610
                      (59.4,68.9]    3 0.0010169492
                      (68.9,78.4]    3 0.0010169492
                        (78.4,88]    4 0.0013559322
                        (88,97.6]    2 0.0006779661

The visual analog of this is the histogram:

CPSSWEducation %>% ggplot(aes(x=earnings)) + geom_histogram(bins = 10)

Histogram has at least four drawbacks:

  1. does not integrate to 1
  2. discontinuous
  3. constant within a bin
  4. depends on bin choice
CPSSWEducation %>% 
  ggplot(aes(x=earnings)) + 
  geom_histogram(binwidth = 10)

CPSSWEducation %>% 
  ggplot(aes(x=earnings)) + 
  geom_histogram(binwidth = 5)

CPSSWEducation %>% 
  ggplot(aes(x=earnings)) + 
  geom_histogram(binwidth = 2)

CPSSWEducation %>% 
  ggplot(aes(x=earnings)) + 
  geom_histogram(binwidth = 0.1)

Enter: KDE

# KDE with the default bandwidth, overlaid on a density-scaled histogram
CPSSWEducation %>% 
  ggplot(aes(x=earnings)) + 
  geom_density() +
  geom_histogram(binwidth = 2, aes(y = after_stat(density)), alpha = 0.5)

# Large bandwidth (bw = 5): the estimate is oversmoothed
CPSSWEducation %>% 
  ggplot(aes(x=earnings)) + 
  geom_density(bw = 5) +
  geom_histogram(binwidth = 2, aes(y = after_stat(density)), alpha = 0.5)

# Small bandwidth (bw = 0.1): the estimate is undersmoothed
CPSSWEducation %>% 
  ggplot(aes(x=earnings)) + 
  geom_density(bw = 0.1) +
  geom_histogram(binwidth = 2, aes(y = after_stat(density)), alpha = 0.5)

Definition 1

The kernel density estimator (KDE) of \(f(x)\) is \[\widehat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K \left( \frac{X_i - x}{h} \right),\] where \(K(\cdot)\) is a weighting/kernel function, and \(h>0\) is a bandwidth.

The kernel function determines the relative weight given to observations around \(x\).
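To make Definition 1 concrete, here is a minimal sketch that codes the estimator by hand. The Gaussian kernel, the simulated data `x_obs`, the bandwidth `h = 0.4`, and the helper name `kde_at` are all assumptions made for the illustration.

```r
# Gaussian kernel: the standard normal density
K <- function(u) dnorm(u)

# KDE at a single point x0, written exactly as in Definition 1
kde_at <- function(x0, x_obs, h) {
  sum(K((x_obs - x0) / h)) / (length(x_obs) * h)
}

set.seed(1)                      # simulated data, for illustration only
x_obs <- rnorm(200)
h     <- 0.4                     # arbitrary bandwidth chosen for the demo

# Evaluate the estimator on a fine grid
grid  <- seq(-6, 6, length.out = 1201)
f_hat <- vapply(grid, function(g) kde_at(g, x_obs, h), numeric(1))

# Riemann-sum check that the estimate integrates to (approximately) one
area <- sum(f_hat) * (grid[2] - grid[1])
area
```

Plotting `f_hat` against `grid` should give a curve very close to `density(x_obs, bw = h)`, which uses the same Gaussian kernel but a faster binned approximation.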

Definition 2 A (second-order) kernel function \(K(u)\) satisfies

  1. \(0 \leq K(u) \leq \overline{K} < \infty\)
  2. \(K(u) = K(-u)\)
  3. \(\int_{-\infty}^{+\infty} K(u) du = 1\)
  4. \(\int_{-\infty}^{+\infty} |u|^r K(u) du < \infty\) for all positive integers \(r\).

Definition 3

A normalized kernel function is a kernel function that satisfies \(\int_{-\infty}^{+\infty} u^2 K(u) du = 1.\)

Definition 4

The roughness of a kernel function is \(R_K = \int_{-\infty}^{+\infty} K(u)^2 du\).
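Definitions 2 to 4 can be checked numerically for a specific kernel. The sketch below assumes the Gaussian kernel and uses base R's `integrate()`; the variable names are made up for the illustration.

```r
K <- function(u) dnorm(u)   # Gaussian kernel

# Property 3 of Definition 2: the kernel integrates to one
total_mass <- integrate(K, -Inf, Inf)$value

# Definition 3: the second moment equals one for a normalized kernel
second_mom <- integrate(function(u) u^2 * K(u), -Inf, Inf)$value

# Definition 4: the roughness R_K
roughness <- integrate(function(u) K(u)^2, -Inf, Inf)$value

c(total_mass = total_mass, second_mom = second_mom, roughness = roughness)

# Closed form of R_K for the Gaussian kernel, for comparison
1 / (2 * sqrt(pi))
```

The Gaussian kernel is therefore already normalized, and its roughness is \(1/(2\sqrt{\pi}) \approx 0.282\).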

More important than the choice of kernel is the choice of bandwidth.

Definition 5

A bandwidth or tuning parameter \(h>0\) is a real number used to control the degree of smoothing of a nonparametric estimator.

For any fixed \(h>0\), KDE is biased.

[draw]

Bias depends on:

  1. \(h\)
  2. kernel
  3. smoothness of \(f\)

Theorem 1 If \(f(x)\) is continuous in a neighbourhood of \(x\), then as \(h \to 0\),

\[E[\widehat{f}(x)] = f(x) + o(1).\]

Proof: Section 17.16.

With additional smoothness, we have the following.

Theorem 2

If \(f''(x)\) is continuous in a neighbourhood \(\mathcal{N}\) of \(x\), then as \(h\to 0\), \[E[\widehat{f}(x)] = f(x) + \frac{1}{2} f''(x)h^2 + o(h^2).\]

Proof sketch:

\[ \begin{aligned} E\left[ \widehat{f}(x)\right] &= E\left[ \frac{1}{nh} \sum_{i=1}^n K\left(\frac{X_i-x}{h}\right)\right] \\ &= \frac{1}{h} E\left[ K\left(\frac{X_i-x}{h}\right)\right] \\ &= \frac{1}{h} \int K\left(\frac{x_i-x}{h}\right) f(x_i) dx_i \\ &= \int K\left(u\right) f(x+uh) du. \end{aligned} \]

Use smoothness to expand:

\[ \begin{aligned} \int K\left(u\right) f(x+uh) du &= \int K\left(u\right) \left[ f(x) + uh f'(x) + \frac{u^2h^2}{2} f''(x) + o(h^2) \right] du \\ &= f(x) \int K(u)du + hf'(x) \int u K(u) du \\ &\phantom{=} + \frac{h^2}{2} f''(x) \int u^2 K(u)du + o(h^2) \\ &= f(x) + h^2 b(x) + o(h^2) \end{aligned} \]

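where the middle term vanishes because \(\int u K(u) du = 0\) by symmetry (Definition 2), and where the scaled bias, for a normalized kernel (Definition 3), is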

\[ b(x) = \frac{f''(x)}{2} \int u^2 K(u)du = \frac{f''(x)}{2}.\]
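The quality of the approximation in Theorem 2 can be checked without simulation, because \(E[\widehat f(x)] = \int K(u) f(x+uh) du\) is an ordinary integral. The sketch below assumes a standard normal density for \(f\), the Gaussian kernel, and the evaluation point \(x = 0\); none of these choices come from the readings.

```r
K  <- function(u) dnorm(u)               # Gaussian kernel (normalized)
f  <- function(x) dnorm(x)               # assumed true density: standard normal
f2 <- function(x) (x^2 - 1) * dnorm(x)   # its second derivative

# Exact expectation of the KDE at x0: integral of K(u) f(x0 + u h) du
mean_kde <- function(x0, h) {
  integrate(function(u) K(u) * f(x0 + u * h), -Inf, Inf)$value
}

x0 <- 0
for (h in c(0.4, 0.2, 0.1)) {
  exact   <- mean_kde(x0, h) - f(x0)     # exact bias
  leading <- 0.5 * f2(x0) * h^2          # leading term h^2 b(x) from Theorem 2
  cat(sprintf("h = %.1f  exact bias = %.6f  h^2 b(x) = %.6f\n", h, exact, leading))
}
```

The two columns should move closer together as \(h\) shrinks. The bias is negative at \(x = 0\) because the density has a peak there (\(f''(0) < 0\)), so smoothing flattens it.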

Theorem 3 The exact variance of \(\widehat f(x)\) is \[V_{\widehat f} = \text{var}\left[\widehat f(x)\right] = \frac{1}{nh^2} \text{var}\left[ K\left(\frac{X_i-x}{h}\right)\right].\]

If \(f(x)\) is continuous in an open neighbourhood of \(x\), then as \(h\to 0\) and \(nh \to \infty\), \[V_{\widehat f} = \frac{v(x)}{nh} + o\left(\frac{1}{nh}\right),\]

where the scaled variance is \[v(x) = f(x)R_K = f(x) \int (K(u))^2 du.\]
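The same trick works for Theorem 3. With the substitution \(u = (x_i - x)/h\), the exact variance can be written as \(\frac{1}{nh}\int K(u)^2 f(x+uh)du - \frac{1}{n}\left(\int K(u) f(x+uh)du\right)^2\). The sketch below compares it with the leading term \(v(x)/(nh)\), again assuming a standard normal \(f\), the Gaussian kernel, and \(x = 0\); the helper name `var_kde` is hypothetical.

```r
K   <- function(u) dnorm(u)          # Gaussian kernel
f   <- function(x) dnorm(x)          # assumed true density: standard normal
R_K <- 1 / (2 * sqrt(pi))            # roughness of the Gaussian kernel

# Exact variance of the KDE at x0 (Theorem 3), after substituting u = (x_i - x0) / h
var_kde <- function(x0, h, n) {
  m1 <- integrate(function(u) K(u)   * f(x0 + u * h), -Inf, Inf)$value  # E[K(.)] / h
  m2 <- integrate(function(u) K(u)^2 * f(x0 + u * h), -Inf, Inf)$value  # E[K(.)^2] / h
  m2 / (n * h) - m1^2 / n
}

x0 <- 0
n  <- 1e6                            # arbitrary sample size for the illustration
for (h in c(0.1, 0.01, 0.001)) {
  exact   <- var_kde(x0, h, n)
  leading <- f(x0) * R_K / (n * h)   # v(x) / (nh)
  cat(sprintf("h = %.3f  exact / leading = %.4f\n", h, exact / leading))
}
```

The ratio should approach one as \(h\) shrinks. The gap at larger \(h\) comes mostly from the \(\frac{1}{n}(E[\widehat f(x)])^2\) term, which is of smaller order than \(1/(nh)\).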

The IMSE is a measure of the distance between the KDE and \(f(x)\), summarized across all values of \(x\) in the support of \(X\):

Definition 6

The integrated mean squared error (IMSE) is \[\text{IMSE} = \int_{-\infty}^{+\infty} E \left[ \left( \widehat f(x) - f(x) \right)^2 \right] dx.\]

[draw]

The term under the IMSE integral is the MSE …

\[MSE(x) = E \left[ \left( \widehat f(x) - f(x) \right)^2 \right]\]

… which we know (HPSE, 6.11) is …

\[MSE(x) = bias(x)^2 + variance(x)\]

… which is approximately …

\[MSE(x) = h^4 b(x)^2 + o(h^4) + \frac{v(x)}{nh} + o\left(\frac{1}{nh}\right)\]

as \(h\to 0\) and \(nh \to \infty\).

We call the leading term the AMSE(x) …

\[AMSE(x) = h^4 b(x)^2 + \frac{v(x)}{nh}\]

… and its integral the approximate IMSE …

\[AIMSE = h^4 \int_{-\infty}^{+\infty} b(x)^2 dx + \frac{1}{nh} \int_{-\infty}^{+\infty} v(x) dx,\]

… which simplifies to

\[AIMSE = \frac{h^4}{4} \int_{-\infty}^{+\infty} (f''(x))^2 dx + \frac{R_K}{nh} = \frac{h^4}{4} R(f'') + \frac{R_K}{nh},\]

because \(b(x) = f''(x)/2\), \(v(x) = f(x)R_K\), and \(\int_{-\infty}^{+\infty} f(x) dx = 1\), and where \[R(f'') = \int_{-\infty}^{+\infty} (f''(x))^2 dx.\]

The AIMSE, \(\frac{h^4}{4} R(f'') + \frac{R_K}{nh}\),

  • increases in the roughness of \(f''\), \(R(f'')\)
  • increases in the roughness of the kernel, \(R_K\)
  • decreases with \(n\).

Furthermore:

  • bias increases with \(h\)
  • variance decreases in \(h\)

For a fixed kernel function, sample size, and unknown function \(f\), all we can control is \(h\).

FOC of AIMSE wrt \(h\):

\[h^3 R(f'') - h^{-2} R_K/n = 0\]

so the AIMSE-optimal bandwidth is

\[h_0 = \left(\frac{R_K}{R(f'')}\right)^{1/5} n^{-1/5}.\]
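A worked instance of the formula, as a sketch. Both choices are assumptions for illustration: the kernel is Gaussian and the unknown \(f\) is taken to be the \(N(0,1)\) density. Under them the constant is \((4/3)^{1/5} \approx 1.06\).

```r
# AIMSE-optimal bandwidth when K is Gaussian and f is the N(0, 1) density.
# Both choices are assumptions made for illustration.
K  <- function(u) dnorm(u)
f2 <- function(x) (x^2 - 1) * dnorm(x)   # second derivative of the N(0, 1) density

R_K  <- integrate(function(u) K(u)^2,  -Inf, Inf)$value   # equals 1 / (2 sqrt(pi))
R_f2 <- integrate(function(x) f2(x)^2, -Inf, Inf)$value   # equals 3 / (8 sqrt(pi))

const <- (R_K / R_f2)^(1/5)          # equals (4/3)^(1/5)
h0    <- function(n) const * n^(-1/5)

h0(c(100, 1000, 10000))
```

Multiplying \(n\) by 10 shrinks \(h_0\) only by a factor \(10^{-1/5} \approx 0.63\).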

This result, without further analysis, is useless in practice because it depends on the unknown quantity \(R(f'')\).

But we do gain theoretical knowledge.

Viewed as a function of the bandwidth, the AIMSE is

\[AIMSE(h) = \frac{h^4}{4} R(f'') + \frac{R_K}{nh}.\]

With \(h_0 = c\, n^{-1/5}\) for a constant \(c > 0\), both terms are of order \(n^{-4/5}\), so the AIMSE converges to 0 at rate

\[AIMSE(h_0) = \frac{c^4}{4} R(f'')\, n^{-4/5} + \frac{R_K}{c}\, n^{-4/5} \propto n^{-4/5}.\]

Compare this to the situation of using the mean \(\bar X\) of a sample of i.i.d. observations to estimate the mean \(\mu\), where

\[MSE = E(\bar X - \mu)^2 = bias^2 + variance = 0 + \frac{\sigma^2}{n} \propto n^{-1} \ll n^{-4/5}.\]

The KDE converges at a slower rate than the sample mean.
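A rough worked comparison of the two rates, ignoring constants and assuming the leading terms dominate. To cut the MSE by a factor of 10, the sample size must grow by a factor \(k\) that solves \(k^{-1} = 1/10\) for the sample mean, and \(k^{-4/5} = 1/10\) for the KDE with \(h \propto n^{-1/5}\):

\[k_{\bar X} = 10, \qquad k_{\widehat f} = 10^{5/4} \approx 17.8.\]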

Silverman’s rule of thumb (HPSE 17.9) turns theoretical bandwidth into practical advice.

Definition 7

Choosing the bandwidth equal to \[h_r = 0.9 \widetilde \sigma n^{-1/5},\] where \(\widetilde \sigma\) is defined in (HPSE 11.14), is referred to as Silverman’s rule of thumb.

KDE + Silverman is a proper estimator: depends on data alone.
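A minimal sketch of the rule in R. It assumes \(\widetilde \sigma = \min(s, \text{IQR}/1.34)\), which is the scale that base R's `bw.nrd0()` uses. Check this against the HPSE definition before relying on it. The data are simulated, not the CPS sample.

```r
# Silverman's rule of thumb computed from the data alone.
# Assumption: sigma_tilde = min(sample sd, IQR / 1.34), as in stats::bw.nrd0().
set.seed(10)
x <- rnorm(400, mean = 2, sd = 3)   # simulated data, for illustration only
n <- length(x)

sigma_tilde <- min(sd(x), IQR(x) / 1.34)
h_r <- 0.9 * sigma_tilde * n^(-1/5)

# density() uses bw = "nrd0" by default, so the bandwidths should agree
fit <- density(x, kernel = "gaussian")
c(manual = h_r, bw_nrd0 = bw.nrd0(x), density_default = fit$bw)
```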

Two more things: consistency and asymptotic normality.

Theorem 4 If \(f(x)\) is continuous in a neighbourhood of \(x\), then as \(h \to 0\) and \(nh \to \infty\),

\[\widehat f(x) - f(x) \stackrel{p}{\to} 0.\]

Proof: An implication of Chebyshev’s inequality (see HPSE, Th 7.2) applies, because the variance of \(\widehat f(x)\) goes to 0 and its expectation goes to \(f(x)\).

A standard CLT leads to:

\[ \begin{aligned} \sqrt{n} (\hat{f}(x)-f(x) - h^2 b(x)) &\stackrel{d}{\to} \mathcal{N}\left(0,\frac{1}{h} v(x)\right). \end{aligned} \]

The limit is well-defined for fixed \(h\), but does not give good guidance for \(h \to 0\).

Not useful when we use Silverman or other good choices of \(h\).

Theorem 5 If \(f''\) is continuous in a neighbourhood of \(x\), then as \(nh \to \infty\) such that \(h = O\left( n^{-1/5}\right)\),

\[\sqrt{nh} (\hat{f}(x)-f(x) - h^2 b(x)) \stackrel{d}{\to} \mathcal{N}(0,v(x)).\]

Proof: we will see similar proofs for nonparametric regression.

Left undiscussed:

  • more on optimal bandwidths and CV: week 2
  • proof of asymptotic normality: week 3
  • undersmoothing: week 3
  • multivariate density estimation: ?

Topic 1

From histogram to KDE

Problems:

  • how to choose the number of classes
  • the estimate is the same along the whole width of a bar
  • non-smooth: densities usually don’t look like this
  • YES show effect of kernel and bandwidth while demo-ing histogram to KDE
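One possible starting point for the demo, as a sketch. The data are simulated from a log-normal, a hypothetical stand-in for an earnings-like variable. The class counts, kernels, and bandwidth multipliers are arbitrary choices.

```r
# From histogram to KDE: effect of the number of classes, kernel, and bandwidth.
# The data are simulated (hypothetical), not the CPS sample.
set.seed(2024)
x <- rlnorm(500, meanlog = 2.7, sdlog = 0.5)

# Histograms: the picture depends strongly on the number of classes
h_few  <- hist(x, breaks = 5,  plot = FALSE)
h_many <- hist(x, breaks = 50, plot = FALSE)

# KDEs: the kernel matters little, the bandwidth matters a lot
d_gauss <- density(x, kernel = "gaussian")
d_epan  <- density(x, kernel = "epanechnikov")
d_small <- density(x, bw = 0.25 * bw.nrd0(x))   # undersmoothed
d_large <- density(x, bw = 4 * bw.nrd0(x))      # oversmoothed

hist(x, breaks = 20, freq = FALSE, main = "Histogram and KDEs", xlab = "x")
lines(d_gauss, lwd = 2)
lines(d_epan,  lty = 2)
lines(d_small, lty = 3)
lines(d_large, lty = 4)
```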

Topic 2

Proof of Theorem 17.6.

Prove it!

Topic 3

Simulation study

  • show slow rate
  • [DEMO: show the difference between estimation of \(E(X)\) and \(f(x)\) at a point: rate of convergence!] …
  • YES show curse of dimensionality and the slow rate. Also show that pointwise 95% CIs at multiple points do not cover all of those points simultaneously with 95% probability.
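A possible sketch for the slow-rate demo. The design is an assumption: \(X \sim N(0,1)\), Gaussian kernel, evaluation point \(x = 0\), and \(h = 1.06\, n^{-1/5}\). The function name `mse_at_n` is hypothetical. It compares the simulated MSE of \(\bar X\) for \(E(X)\) with that of \(\widehat f(0)\) for \(f(0)\) across sample sizes.

```r
# Rate of convergence: sample mean versus KDE at a point.
# Assumed design (not from the notes): X ~ N(0, 1), Gaussian kernel, x = 0.
set.seed(42)

mse_at_n <- function(n, reps = 2000) {
  h <- 1.06 * n^(-1/5)   # bandwidth of the AIMSE-optimal order
  draws <- replicate(reps, {
    x <- rnorm(n)
    c(mean = mean(x), kde = mean(dnorm(x / h)) / h)
  })
  c(n = n,
    mse_mean = mean((draws["mean", ] - 0)^2),
    mse_kde  = mean((draws["kde", ] - dnorm(0))^2))
}

res <- t(sapply(c(100, 400, 1600, 6400), mse_at_n))

# Slope of log(MSE) on log(n): about -1 for the mean, flatter for the KDE
slope_mean <- unname(coef(lm(log(res[, "mse_mean"]) ~ log(res[, "n"])))[2])
slope_kde  <- unname(coef(lm(log(res[, "mse_kde"])  ~ log(res[, "n"])))[2])

res
c(slope_mean = slope_mean, slope_kde = slope_kde)
```

In finite samples the KDE slope is typically somewhat flatter than \(-4/5\), because the remainder terms have not vanished yet.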

Topic 4

Cross-validation

In practice, we choose the bandwidth by minimizing an estimate of the integrated squared error (ISE), whose expectation is the mean integrated squared error (MISE):

\[ \begin{aligned} \int \left( \hat{f}_h(x) - f(x) \right)^2 dx &= \int \hat{f}_h(x)^2dx - 2 \int \hat{f}_h(x) f(x)dx + \int f(x)^2 dx \end{aligned} \]

Terms:

  1. Convolution kernel
  2. Leave-one-out
  3. Does not depend on \(h\)

CV: second term

In the second term,

\[ \int \hat{f}_h(x) f(x)dx = E \left(\hat{f}_h(X)\right)\]

Estimating this by

\[ \frac{1}{n} \sum_i \hat{f}_h(X_i)\]

does not work because \(\hat{f}_h\) and \(X_i\) are dependent: \(X_i\) itself is used to construct \(\hat{f}_h\).

CV: second term (2)

Instead, use the leave-one-out estimator

\[\hat{f}_{-i}(X_i) = \frac{1}{h(n-1)} \sum_{j \neq i} K\left( \frac{X_j-X_i}{h}\right)\]

to estimate \(E \left(\hat{f}_h(X)\right)\) by

\[ \frac{1}{n} \sum_i \hat{f}_{-i}(X_i) \]
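A small sketch of the leave-one-out average next to the naive plug-in average. The Gaussian kernel, the simulated data, and \(h = 0.4\) are assumptions for illustration. The naive version is larger because every \(\hat{f}_h(X_i)\) contains the own-observation term \(K(0)/(nh)\).

```r
# Naive versus leave-one-out estimate of E[f_hat_h(X)].
# Assumptions: Gaussian kernel, simulated N(0, 1) data, h = 0.4.
set.seed(7)
x <- rnorm(200)
n <- length(x)
h <- 0.4

Kmat <- dnorm(outer(x, x, "-") / h)   # Kmat[i, j] = K((X_i - X_j) / h)

# Naive: average of f_hat_h(X_i), which includes the j = i term
naive <- mean(rowSums(Kmat) / (n * h))

# Leave-one-out: drop the diagonal and rescale by n - 1
diag(Kmat) <- 0
loo <- mean(rowSums(Kmat) / ((n - 1) * h))

c(naive = naive, leave_one_out = loo)
```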

CV: first term

The first term can be written

\[ \begin{aligned} \int \hat{f}_h(x)^2dx &= \frac{1}{n^2h^2} \sum_{i} \sum_j \int K\left(\frac{X_i-x}{h}\right)K\left(\frac{X_j-x}{h}\right) dx \\ &\equiv \frac{1}{n^2h}\sum_i\sum_j \bar{K}\left( \frac{X_i-X_j}{h} \right) \end{aligned} \]

where \(\bar{K}(v) = \int K(u) K(v-u) du\) is the convolution kernel, which can be obtained from \(K\).
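For example, assuming a Gaussian kernel, the convolution of two standard normal densities is the \(N(0, 2)\) density, so \(\bar{K}(v)\) has a closed form. A short numerical check, with the hypothetical helper `K_bar_numeric`:

```r
# Convolution kernel of the Gaussian kernel: numerical integral versus closed form.
# Assumption: K is the standard normal density, so K_bar is the N(0, 2) density.
K_bar_numeric <- function(v) {
  integrate(function(u) dnorm(u) * dnorm(v - u), -Inf, Inf)$value
}

v <- c(0, 0.5, 1, 2)
K_bar_check <- cbind(v           = v,
                     numeric     = sapply(v, K_bar_numeric),
                     closed_form = dnorm(v, sd = sqrt(2)))
K_bar_check
```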

CV: implementation

To minimize the MISE, CV chooses:

\[ \hat{h} = \text{argmin} CV_{f}(h)\]

where

\[ CV_f(h) = \frac{1}{n^2h}\sum_i\sum_j \bar{K}\left( \frac{X_i-X_j}{h} \right) - \frac{2}{hn(n-1)} \sum_i \sum_{j \neq i} K\left( \frac{X_j-X_i}{h}\right)\]

It can be shown that \(\hat{h}\) converges to \(h^*\).
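A minimal sketch of the criterion \(CV_f(h)\) and a grid search for its minimizer. It assumes a Gaussian kernel, so that \(\bar{K}\) is the \(N(0,2)\) density, and uses simulated data. The name `cv_f` and the grid are illustrative choices. Base R's `bw.ucv()` minimizes a closely related criterion on binned data and is shown only for comparison.

```r
# Least-squares cross-validation for the KDE bandwidth.
# Assumptions: Gaussian kernel (K_bar is the N(0, 2) density), simulated data.
set.seed(123)
x <- rnorm(300)

cv_f <- function(h, x) {
  n <- length(x)
  d <- outer(x, x, "-") / h
  # First term: integral of f_hat^2 via the convolution kernel
  term1 <- sum(dnorm(d, sd = sqrt(2))) / (n^2 * h)
  # Second term: 2 times the leave-one-out average (diagonal removed)
  Kmat <- dnorm(d)
  diag(Kmat) <- 0
  term2 <- 2 * sum(Kmat) / (h * n * (n - 1))
  term1 - term2
}

# Grid search is slower than optimize() but cannot get stuck in a local minimum
h_grid <- seq(0.05, 1.5, by = 0.01)
cv_val <- sapply(h_grid, cv_f, x = x)
h_cv   <- h_grid[which.min(cv_val)]

c(cv = h_cv, ucv = bw.ucv(x), silverman = bw.nrd0(x))
```

The three bandwidths need not agree. The CV choice in particular is known to vary a lot from sample to sample.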

Questions

  • The order of a kernel is defined as the order of the first non-zero moment. We are using second-order kernels: first moment is zero, second is not.
    • question shows up twice. Check before adding!
  • Normalized kernel: see derivations above

Decide on important remaining questions to discuss next week