One of the most basic tasks in data analysis is to construct a summary for one variable in your sample data.
Here is some data from the Current Population Survey:
library(tidyverse)  # glimpse(), %>%, ggplot()
library(janitor)    # tabyl()
library(AER)        # provides the CPSSWEducation data
data("CPSSWEducation")
glimpse(CPSSWEducation)
Rows: 2,950
Columns: 4
$ age <int> 30, 30, 30, 30, 30, 30, 30, 29, 29, 30, 29, 29, 29, 29, 29, ~
$ gender <fct> male, female, female, female, female, female, male, male, ma~
$ earnings <dbl> 34.615383, 19.230770, 13.736263, 13.942307, 19.230770, 8.000~
$ education <int> 16, 16, 12, 13, 16, 12, 12, 16, 16, 12, 14, 18, 18, 11, 12, ~
How can we summarize the information in a given variable?
For categorical variables, we can use frequency tables.
Here is one for gender:
CPSSWEducation %>% tabyl(gender)
gender n percent
female 1202 0.4074576
male 1748 0.5925424
And here is one for education:
CPSSWEducation %>% tabyl(education)
education n percent
6 45 0.01525424
8 35 0.01186441
9 49 0.01661017
10 35 0.01186441
11 61 0.02067797
12 887 0.30067797
13 607 0.20576271
14 307 0.10406780
16 752 0.25491525
18 172 0.05830508
You can also use their graphical analog, the bar plot:
The height of the bar for a value, say \(X = 1\), is the sample proportion \[\widehat P(X = 1) = \frac{\sum_{i=1}^n 1\{X_i = 1\}}{n}.\]
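As a minimal sketch of the bar-height formula, the snippet below computes the relative frequency as the average of an indicator. The vector `x` is a small made-up sample, not the CPS data.

```r
# Hypothetical categorical sample (an assumption for illustration, not the CPS data)
x <- c("male", "female", "female", "male", "male")

# Relative frequency of one category: the average of the indicator 1{X_i = "female"}
p_female <- mean(x == "female")

# The same quantity for every category at once
p_all <- prop.table(table(x))

print(p_female)
print(p_all)
```

The second computation is what a frequency table reports in its proportion column.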
For continuous variables, this does not work, because most values occur only once or a handful of times:
CPSSWEducation %>% tabyl(earnings)
earnings n percent
2.136752 1 0.0003389831
2.403846 3 0.0010169492
2.564103 1 0.0003389831
2.622378 1 0.0003389831
2.644231 1 0.0003389831
2.724359 1 0.0003389831
2.747253 1 0.0003389831
2.797203 1 0.0003389831
2.884615 6 0.0020338983
2.958580 1 0.0003389831
3.058824 1 0.0003389831
3.076923 1 0.0003389831
3.125000 1 0.0003389831
3.219697 1 0.0003389831
3.296703 1 0.0003389831
3.328402 2 0.0006779661
3.571429 2 0.0006779661
3.605769 1 0.0003389831
3.846154 5 0.0016949153
3.885004 1 0.0003389831
3.891941 1 0.0003389831
4.120879 2 0.0006779661
4.166667 1 0.0003389831
4.200000 1 0.0003389831
4.273504 1 0.0003389831
4.273626 1 0.0003389831
4.307693 1 0.0003389831
4.326923 2 0.0006779661
4.423077 1 0.0003389831
4.430769 1 0.0003389831
4.487179 1 0.0003389831
4.554656 1 0.0003389831
4.567307 1 0.0003389831
4.615385 1 0.0003389831
4.653846 1 0.0003389831
4.730769 1 0.0003389831
4.807693 12 0.0040677966
4.967033 1 0.0003389831
5.000000 1 0.0003389831
5.021368 1 0.0003389831
5.076923 2 0.0006779661
5.128205 1 0.0003389831
5.144231 1 0.0003389831
5.208333 2 0.0006779661
5.263158 1 0.0003389831
5.274725 1 0.0003389831
5.288462 8 0.0027118644
5.341880 1 0.0003389831
5.384615 2 0.0006779661
5.448718 1 0.0003389831
5.494505 2 0.0006779661
5.528846 2 0.0006779661
5.538462 1 0.0003389831
5.547337 2 0.0006779661
5.555555 1 0.0003389831
5.594406 1 0.0003389831
5.673077 1 0.0003389831
5.723443 1 0.0003389831
5.769231 24 0.0081355932
5.833333 1 0.0003389831
5.982906 1 0.0003389831
6.000000 2 0.0006779661
6.009615 1 0.0003389831
6.043956 1 0.0003389831
6.122449 1 0.0003389831
6.129807 1 0.0003389831
6.153846 3 0.0010169492
6.172840 1 0.0003389831
6.221719 1 0.0003389831
6.224696 1 0.0003389831
6.230769 1 0.0003389831
6.250000 17 0.0057627119
6.258503 1 0.0003389831
6.303922 1 0.0003389831
6.318681 1 0.0003389831
6.346154 1 0.0003389831
6.384615 1 0.0003389831
6.410256 3 0.0010169492
6.557693 1 0.0003389831
6.593407 4 0.0013559322
6.596154 1 0.0003389831
6.722689 1 0.0003389831
6.730769 18 0.0061016949
6.734694 1 0.0003389831
6.736363 1 0.0003389831
6.750000 1 0.0003389831
6.770833 1 0.0003389831
6.778846 1 0.0003389831
6.783217 1 0.0003389831
6.868132 1 0.0003389831
6.875000 1 0.0003389831
6.923077 6 0.0020338983
6.938776 1 0.0003389831
6.946154 1 0.0003389831
6.971154 2 0.0006779661
6.976744 1 0.0003389831
7.000000 1 0.0003389831
7.083333 1 0.0003389831
7.142857 4 0.0013559322
7.211538 42 0.0142372881
7.250000 1 0.0003389831
7.259615 1 0.0003389831
7.272727 1 0.0003389831
7.294117 1 0.0003389831
7.307693 1 0.0003389831
7.339744 1 0.0003389831
7.350000 1 0.0003389831
7.371795 1 0.0003389831
7.403846 1 0.0003389831
7.451923 1 0.0003389831
7.478632 1 0.0003389831
7.500000 15 0.0050847458
7.504691 1 0.0003389831
7.548077 1 0.0003389831
7.591093 1 0.0003389831
7.632653 1 0.0003389831
7.632692 1 0.0003389831
7.644231 1 0.0003389831
7.692307 25 0.0084745763
7.766272 1 0.0003389831
7.788462 1 0.0003389831
7.812500 2 0.0006779661
7.820513 1 0.0003389831
7.840237 1 0.0003389831
7.843137 1 0.0003389831
7.846154 1 0.0003389831
7.932693 3 0.0010169492
7.982584 1 0.0003389831
7.984615 1 0.0003389831
8.000000 5 0.0016949153
8.012820 3 0.0010169492
8.069231 1 0.0003389831
8.076923 1 0.0003389831
8.108109 1 0.0003389831
8.119781 1 0.0003389831
8.133846 1 0.0003389831
8.136095 3 0.0010169492
8.153846 1 0.0003389831
8.170513 1 0.0003389831
8.173077 22 0.0074576271
8.190884 1 0.0003389831
8.192788 1 0.0003389831
8.222596 1 0.0003389831
8.238867 1 0.0003389831
8.241758 6 0.0020338983
8.284023 1 0.0003389831
8.284314 1 0.0003389831
8.288462 1 0.0003389831
8.333333 1 0.0003389831
8.444445 1 0.0003389831
8.461538 2 0.0006779661
8.489583 1 0.0003389831
8.547009 6 0.0020338983
8.571428 1 0.0003389831
8.603239 2 0.0006779661
8.653846 33 0.0111864407
8.741259 2 0.0006779661
8.750000 2 0.0006779661
8.791209 4 0.0013559322
8.798077 1 0.0003389831
8.800000 1 0.0003389831
8.810375 1 0.0003389831
8.814102 1 0.0003389831
8.846154 1 0.0003389831
8.875740 1 0.0003389831
8.881119 1 0.0003389831
8.888889 1 0.0003389831
8.894231 2 0.0006779661
8.935509 1 0.0003389831
8.947369 1 0.0003389831
8.974359 2 0.0006779661
8.990385 1 0.0003389831
9.000000 2 0.0006779661
9.065934 1 0.0003389831
9.081197 1 0.0003389831
9.113461 1 0.0003389831
9.114583 1 0.0003389831
9.134615 19 0.0064406780
9.183674 1 0.0003389831
9.230769 7 0.0023728814
9.245563 3 0.0010169492
9.265735 1 0.0003389831
9.285714 2 0.0006779661
9.326923 1 0.0003389831
9.340659 2 0.0006779661
9.348290 1 0.0003389831
9.355510 1 0.0003389831
9.368836 1 0.0003389831
9.375000 2 0.0006779661
9.380863 1 0.0003389831
9.391771 1 0.0003389831
9.401710 1 0.0003389831
9.410802 1 0.0003389831
9.452991 1 0.0003389831
9.473684 1 0.0003389831
9.500000 1 0.0003389831
9.519231 1 0.0003389831
9.545455 1 0.0003389831
9.558824 1 0.0003389831
9.567307 1 0.0003389831
9.568750 1 0.0003389831
9.586057 1 0.0003389831
9.600000 2 0.0006779661
9.615385 98 0.0332203390
9.616346 1 0.0003389831
9.616667 1 0.0003389831
9.635417 1 0.0003389831
9.759231 1 0.0003389831
9.759615 1 0.0003389831
9.763313 2 0.0006779661
9.790210 1 0.0003389831
9.803922 2 0.0006779661
9.807693 1 0.0003389831
9.829060 2 0.0006779661
9.871795 1 0.0003389831
9.890110 3 0.0010169492
9.935898 2 0.0006779661
9.951923 1 0.0003389831
10.000000 23 0.0077966102
10.024038 1 0.0003389831
10.038462 1 0.0003389831
10.059172 1 0.0003389831
10.096154 19 0.0064406780
10.121457 2 0.0006779661
10.149572 1 0.0003389831
10.164835 1 0.0003389831
10.204082 1 0.0003389831
10.230769 1 0.0003389831
10.250000 1 0.0003389831
10.256411 11 0.0037288136
10.302197 2 0.0006779661
10.324519 1 0.0003389831
10.336538 2 0.0006779661
10.375000 1 0.0003389831
10.384615 2 0.0006779661
10.415865 1 0.0003389831
10.416667 3 0.0010169492
10.439561 1 0.0003389831
10.474038 1 0.0003389831
10.500000 1 0.0003389831
10.526316 1 0.0003389831
10.531136 1 0.0003389831
10.555555 1 0.0003389831
10.576923 36 0.0122033898
10.625000 2 0.0006779661
10.627530 3 0.0010169492
10.683761 6 0.0020338983
10.714286 1 0.0003389831
10.769231 5 0.0016949153
10.817307 5 0.0016949153
10.856079 1 0.0003389831
10.859729 1 0.0003389831
10.897436 1 0.0003389831
10.914761 2 0.0006779661
10.925000 1 0.0003389831
10.937500 1 0.0003389831
10.964912 1 0.0003389831
10.989011 9 0.0030508475
11.000000 6 0.0020338983
11.009615 1 0.0003389831
11.015865 1 0.0003389831
11.017629 1 0.0003389831
11.025641 1 0.0003389831
11.051282 1 0.0003389831
11.057693 26 0.0088135593
11.076923 2 0.0006779661
11.100000 1 0.0003389831
11.111111 2 0.0006779661
11.188811 1 0.0003389831
11.217949 5 0.0016949153
11.250000 4 0.0013559322
11.274509 1 0.0003389831
11.298077 1 0.0003389831
11.332417 1 0.0003389831
11.351352 1 0.0003389831
11.363636 1 0.0003389831
11.367521 1 0.0003389831
11.428572 1 0.0003389831
11.434511 1 0.0003389831
11.446886 2 0.0006779661
11.458333 1 0.0003389831
11.500000 1 0.0003389831
11.513158 1 0.0003389831
11.530488 1 0.0003389831
11.538462 67 0.0227118644
11.555555 1 0.0003389831
11.557693 1 0.0003389831
11.600000 1 0.0003389831
11.602884 1 0.0003389831
11.647436 1 0.0003389831
11.653846 1 0.0003389831
11.666667 1 0.0003389831
11.730769 2 0.0006779661
11.805555 1 0.0003389831
11.834319 1 0.0003389831
11.868132 1 0.0003389831
11.875000 1 0.0003389831
11.884266 1 0.0003389831
11.899038 1 0.0003389831
11.904762 3 0.0010169492
11.914893 1 0.0003389831
11.923077 3 0.0010169492
11.950000 1 0.0003389831
12.000000 5 0.0016949153
12.019231 76 0.0257627119
12.087913 3 0.0010169492
12.115385 2 0.0006779661
12.133699 1 0.0003389831
12.204142 1 0.0003389831
12.219551 1 0.0003389831
12.237762 4 0.0013559322
12.254902 1 0.0003389831
12.259615 1 0.0003389831
12.307693 13 0.0044067797
12.336720 1 0.0003389831
12.352942 1 0.0003389831
12.393163 3 0.0010169492
12.395299 1 0.0003389831
12.400000 1 0.0003389831
12.500000 46 0.0155932203
12.538462 1 0.0003389831
12.561058 1 0.0003389831
12.585470 1 0.0003389831
12.587413 1 0.0003389831
12.596154 2 0.0006779661
12.628572 1 0.0003389831
12.637362 3 0.0010169492
12.674825 1 0.0003389831
12.692307 4 0.0013559322
12.740385 4 0.0013559322
12.750000 1 0.0003389831
12.820192 1 0.0003389831
12.820513 14 0.0047457627
12.912087 1 0.0003389831
12.920673 1 0.0003389831
12.932693 1 0.0003389831
12.941176 1 0.0003389831
12.969588 1 0.0003389831
12.980769 37 0.0125423729
13.000000 1 0.0003389831
13.020833 2 0.0006779661
13.062409 2 0.0006779661
13.076923 1 0.0003389831
13.125000 1 0.0003389831
13.186813 2 0.0006779661
13.197115 1 0.0003389831
13.221154 6 0.0020338983
13.247863 4 0.0013559322
13.286714 4 0.0013559322
13.313609 1 0.0003389831
13.333333 2 0.0006779661
13.354701 4 0.0013559322
13.365385 1 0.0003389831
13.366827 1 0.0003389831
13.373077 1 0.0003389831
13.416816 1 0.0003389831
13.422596 1 0.0003389831
13.461538 37 0.0125423729
13.500000 1 0.0003389831
13.513514 1 0.0003389831
13.574660 1 0.0003389831
13.580129 2 0.0006779661
13.621795 1 0.0003389831
13.636364 1 0.0003389831
13.650962 1 0.0003389831
13.663968 1 0.0003389831
13.675214 1 0.0003389831
13.690476 1 0.0003389831
13.701923 2 0.0006779661
13.736263 4 0.0013559322
13.750000 3 0.0010169492
13.759134 1 0.0003389831
13.846154 6 0.0020338983
13.888889 1 0.0003389831
13.899038 1 0.0003389831
13.900000 1 0.0003389831
13.927350 1 0.0003389831
13.942307 23 0.0077966102
13.986014 3 0.0010169492
14.000000 2 0.0006779661
14.005602 1 0.0003389831
14.022436 1 0.0003389831
14.041347 1 0.0003389831
14.053254 2 0.0006779661
14.102564 3 0.0010169492
14.140271 1 0.0003389831
14.150944 1 0.0003389831
14.155983 1 0.0003389831
14.170040 2 0.0006779661
14.182693 3 0.0010169492
14.194139 1 0.0003389831
14.211538 1 0.0003389831
14.230769 3 0.0010169492
14.278846 1 0.0003389831
14.285714 3 0.0010169492
14.302885 1 0.0003389831
14.320786 1 0.0003389831
14.349650 1 0.0003389831
14.400000 1 0.0003389831
14.423077 123 0.0416949153
14.473684 1 0.0003389831
14.497042 1 0.0003389831
14.500000 1 0.0003389831
14.502885 1 0.0003389831
14.519231 1 0.0003389831
14.529915 3 0.0010169492
14.544025 1 0.0003389831
14.577259 1 0.0003389831
14.583333 2 0.0006779661
14.601140 1 0.0003389831
14.615385 4 0.0013559322
14.679808 1 0.0003389831
14.687500 2 0.0006779661
14.697803 1 0.0003389831
14.743589 1 0.0003389831
14.792899 1 0.0003389831
14.823718 1 0.0003389831
14.835165 1 0.0003389831
14.847116 1 0.0003389831
14.857142 1 0.0003389831
14.860140 2 0.0006779661
14.874519 1 0.0003389831
14.884615 1 0.0003389831
14.903846 16 0.0054237288
14.914530 1 0.0003389831
14.925000 1 0.0003389831
14.957265 10 0.0033898305
15.000000 10 0.0033898305
15.034965 1 0.0003389831
15.046538 1 0.0003389831
15.064102 1 0.0003389831
15.109890 2 0.0006779661
15.139116 1 0.0003389831
15.144231 4 0.0013559322
15.170940 1 0.0003389831
15.182186 2 0.0006779661
15.196078 1 0.0003389831
15.224359 2 0.0006779661
15.250545 1 0.0003389831
15.271539 1 0.0003389831
15.297203 1 0.0003389831
15.307693 1 0.0003389831
15.384615 69 0.0233898305
15.432693 1 0.0003389831
15.453297 1 0.0003389831
15.461538 1 0.0003389831
15.468227 2 0.0006779661
15.476191 1 0.0003389831
15.488461 1 0.0003389831
15.532544 1 0.0003389831
15.548282 1 0.0003389831
15.559441 1 0.0003389831
15.625000 2 0.0006779661
15.698587 1 0.0003389831
15.701923 1 0.0003389831
15.734265 3 0.0010169492
15.769231 1 0.0003389831
15.796703 1 0.0003389831
15.811966 1 0.0003389831
15.865385 13 0.0044067797
15.913462 1 0.0003389831
15.923077 2 0.0006779661
15.961538 1 0.0003389831
15.976332 1 0.0003389831
16.000000 3 0.0010169492
16.025640 6 0.0020338983
16.042780 1 0.0003389831
16.063942 1 0.0003389831
16.105770 1 0.0003389831
16.112267 2 0.0006779661
16.153847 3 0.0010169492
16.198225 1 0.0003389831
16.201923 1 0.0003389831
16.203703 2 0.0006779661
16.239317 4 0.0013559322
16.240000 1 0.0003389831
16.250000 2 0.0006779661
16.272190 1 0.0003389831
16.346153 32 0.0108474576
16.400000 1 0.0003389831
16.433567 1 0.0003389831
16.483517 2 0.0006779661
16.538462 1 0.0003389831
16.586538 3 0.0010169492
16.615385 1 0.0003389831
16.642012 1 0.0003389831
16.651031 1 0.0003389831
16.666666 8 0.0027118644
16.700405 2 0.0006779661
16.707787 1 0.0003389831
16.722408 2 0.0006779661
16.746796 1 0.0003389831
16.826923 96 0.0325423729
16.875000 2 0.0006779661
16.923077 1 0.0003389831
16.941391 1 0.0003389831
16.987179 1 0.0003389831
17.032967 2 0.0006779661
17.051283 1 0.0003389831
17.067308 1 0.0003389831
17.077404 1 0.0003389831
17.083334 1 0.0003389831
17.094017 8 0.0027118644
17.105263 2 0.0006779661
17.115385 1 0.0003389831
17.129328 1 0.0003389831
17.156862 1 0.0003389831
17.159763 1 0.0003389831
17.170330 1 0.0003389831
17.184942 1 0.0003389831
17.187500 2 0.0006779661
17.195673 1 0.0003389831
17.307692 36 0.0122033898
17.308655 1 0.0003389831
17.320000 1 0.0003389831
17.320261 1 0.0003389831
17.369230 1 0.0003389831
17.441860 1 0.0003389831
17.482517 4 0.0013559322
17.500000 2 0.0006779661
17.521368 3 0.0010169492
17.533937 1 0.0003389831
17.548077 2 0.0006779661
17.620192 1 0.0003389831
17.628204 1 0.0003389831
17.660910 1 0.0003389831
17.675962 1 0.0003389831
17.692308 5 0.0016949153
17.701050 1 0.0003389831
17.751480 1 0.0003389831
17.788462 15 0.0050847458
17.857143 1 0.0003389831
17.889088 1 0.0003389831
17.948717 8 0.0027118644
17.965588 1 0.0003389831
17.980770 1 0.0003389831
18.000000 2 0.0006779661
18.028847 3 0.0010169492
18.076923 2 0.0006779661
18.121302 1 0.0003389831
18.131868 2 0.0006779661
18.132692 1 0.0003389831
18.154762 1 0.0003389831
18.162394 1 0.0003389831
18.191269 1 0.0003389831
18.218624 1 0.0003389831
18.229166 1 0.0003389831
18.269230 26 0.0088135593
18.307692 1 0.0003389831
18.315018 1 0.0003389831
18.376068 2 0.0006779661
18.461538 2 0.0006779661
18.491125 2 0.0006779661
18.509615 2 0.0006779661
18.576923 1 0.0003389831
18.589743 2 0.0006779661
18.626373 1 0.0003389831
18.696581 2 0.0006779661
18.711020 1 0.0003389831
18.750000 10 0.0033898305
18.750481 1 0.0003389831
18.803419 2 0.0006779661
18.812710 2 0.0006779661
18.830128 1 0.0003389831
18.923611 1 0.0003389831
19.097221 1 0.0003389831
19.151846 1 0.0003389831
19.181923 1 0.0003389831
19.183674 1 0.0003389831
19.230770 116 0.0393220339
19.270834 1 0.0003389831
19.326923 2 0.0006779661
19.428572 1 0.0003389831
19.471153 1 0.0003389831
19.580420 1 0.0003389831
19.607843 3 0.0010169492
19.615385 1 0.0003389831
19.658119 3 0.0010169492
19.667831 1 0.0003389831
19.711538 15 0.0050847458
19.792692 1 0.0003389831
19.871796 2 0.0006779661
19.951923 3 0.0010169492
20.000000 11 0.0037288136
20.052404 1 0.0003389831
20.085470 2 0.0006779661
20.089285 1 0.0003389831
20.146521 1 0.0003389831
20.192308 27 0.0091525424
20.242914 4 0.0013559322
20.270269 1 0.0003389831
20.299145 2 0.0006779661
20.340237 1 0.0003389831
20.384615 3 0.0010169492
20.408163 1 0.0003389831
20.432692 1 0.0003389831
20.495951 2 0.0006779661
20.512821 3 0.0010169492
20.566238 1 0.0003389831
20.673077 16 0.0054237288
20.710060 1 0.0003389831
20.748987 1 0.0003389831
20.759615 1 0.0003389831
20.769230 2 0.0006779661
20.833334 3 0.0010169492
20.875000 1 0.0003389831
20.884615 2 0.0006779661
20.940170 1 0.0003389831
20.979021 1 0.0003389831
21.000000 1 0.0003389831
21.049572 1 0.0003389831
21.059999 1 0.0003389831
21.079882 1 0.0003389831
21.129808 1 0.0003389831
21.153847 20 0.0067796610
21.367521 14 0.0047457627
21.394230 2 0.0006779661
21.418270 1 0.0003389831
21.449703 3 0.0010169492
21.500000 2 0.0006779661
21.538462 2 0.0006779661
21.634615 33 0.0111864407
21.678322 1 0.0003389831
21.750000 1 0.0003389831
21.853148 1 0.0003389831
21.878365 1 0.0003389831
21.978022 11 0.0037288136
22.000000 2 0.0006779661
22.035257 1 0.0003389831
22.055555 1 0.0003389831
22.115385 15 0.0050847458
22.123398 1 0.0003389831
22.189348 1 0.0003389831
22.222221 3 0.0010169492
22.235577 1 0.0003389831
22.307692 3 0.0010169492
22.355770 1 0.0003389831
22.395834 1 0.0003389831
22.435898 1 0.0003389831
22.459866 1 0.0003389831
22.555555 1 0.0003389831
22.596153 4 0.0013559322
22.649572 1 0.0003389831
22.773279 1 0.0003389831
22.893772 1 0.0003389831
22.951923 1 0.0003389831
22.993311 1 0.0003389831
23.000000 1 0.0003389831
23.072308 1 0.0003389831
23.076923 30 0.0101694915
23.092789 1 0.0003389831
23.148148 2 0.0006779661
23.192308 1 0.0003389831
23.455385 1 0.0003389831
23.504274 6 0.0020338983
23.557692 3 0.0010169492
23.928366 1 0.0003389831
23.931623 2 0.0006779661
24.000000 1 0.0003389831
24.038462 62 0.0210169492
24.188847 1 0.0003389831
24.230770 1 0.0003389831
24.267399 1 0.0003389831
24.291498 2 0.0006779661
24.358974 2 0.0006779661
24.518269 1 0.0003389831
24.519230 3 0.0010169492
24.615385 2 0.0006779661
24.666666 1 0.0003389831
24.725275 2 0.0006779661
24.786325 2 0.0006779661
24.954212 1 0.0003389831
25.000000 25 0.0084745763
25.240385 1 0.0003389831
25.303644 1 0.0003389831
25.320513 1 0.0003389831
25.480770 3 0.0010169492
25.510204 1 0.0003389831
25.641026 7 0.0023728814
25.846153 1 0.0003389831
25.961538 7 0.0023728814
26.041666 1 0.0003389831
26.153847 2 0.0006779661
26.250000 1 0.0003389831
26.315790 2 0.0006779661
26.346153 2 0.0006779661
26.373627 1 0.0003389831
26.442308 15 0.0050847458
26.470589 2 0.0006779661
26.556776 1 0.0003389831
26.595745 1 0.0003389831
26.634615 1 0.0003389831
26.709402 1 0.0003389831
26.923077 10 0.0033898305
27.097902 1 0.0003389831
27.237762 1 0.0003389831
27.243589 1 0.0003389831
27.262020 1 0.0003389831
27.307692 1 0.0003389831
27.403847 5 0.0016949153
27.472527 4 0.0013559322
27.692308 1 0.0003389831
27.777779 5 0.0016949153
27.884615 8 0.0027118644
27.972029 1 0.0003389831
27.980770 1 0.0003389831
28.000000 2 0.0006779661
28.365385 2 0.0006779661
28.571428 2 0.0006779661
28.616154 1 0.0003389831
28.632479 2 0.0006779661
28.671329 1 0.0003389831
28.846153 29 0.0098305085
29.401709 1 0.0003389831
29.585798 2 0.0006779661
29.647436 1 0.0003389831
29.807692 8 0.0027118644
29.914530 3 0.0010169492
30.000000 1 0.0003389831
30.064102 1 0.0003389831
30.288462 5 0.0016949153
30.364372 1 0.0003389831
30.769230 9 0.0030508475
30.982906 2 0.0006779661
31.185032 1 0.0003389831
31.250000 17 0.0057627119
31.250481 1 0.0003389831
31.730770 3 0.0010169492
32.000000 1 0.0003389831
32.051281 3 0.0010169492
32.066345 1 0.0003389831
32.211540 4 0.0013559322
32.500000 1 0.0003389831
32.692307 7 0.0023728814
32.905983 1 0.0003389831
33.173077 1 0.0003389831
33.284023 1 0.0003389831
33.653847 14 0.0047457627
33.783783 1 0.0003389831
33.846153 1 0.0003389831
33.854168 1 0.0003389831
34.134617 1 0.0003389831
34.188034 5 0.0016949153
34.340660 1 0.0003389831
34.615383 11 0.0037288136
34.722221 1 0.0003389831
34.920635 1 0.0003389831
35.096153 2 0.0006779661
35.153847 1 0.0003389831
35.164837 2 0.0006779661
35.256409 1 0.0003389831
35.470085 1 0.0003389831
35.576923 2 0.0006779661
35.714287 1 0.0003389831
36.057693 14 0.0047457627
36.324787 1 0.0003389831
36.538460 2 0.0006779661
36.923077 1 0.0003389831
37.019230 2 0.0006779661
37.115383 1 0.0003389831
37.245193 1 0.0003389831
37.500000 1 0.0003389831
37.980770 1 0.0003389831
38.014313 1 0.0003389831
38.461540 10 0.0033898305
38.942307 2 0.0006779661
39.062500 1 0.0003389831
39.215687 1 0.0003389831
39.262821 1 0.0003389831
39.448719 1 0.0003389831
39.903847 2 0.0006779661
40.063942 1 0.0003389831
40.064102 1 0.0003389831
40.384617 1 0.0003389831
40.598289 1 0.0003389831
40.865383 8 0.0027118644
41.346153 1 0.0003389831
41.538460 1 0.0003389831
41.958042 1 0.0003389831
42.307693 2 0.0006779661
42.735043 1 0.0003389831
43.269230 5 0.0016949153
43.269711 1 0.0003389831
43.750000 1 0.0003389831
44.070515 1 0.0003389831
44.230770 1 0.0003389831
45.096153 1 0.0003389831
45.454544 1 0.0003389831
45.673077 3 0.0010169492
46.153847 4 0.0013559322
47.003845 1 0.0003389831
47.115383 1 0.0003389831
48.076923 3 0.0010169492
48.557693 1 0.0003389831
49.450550 1 0.0003389831
49.519230 2 0.0006779661
50.480770 1 0.0003389831
50.909092 1 0.0003389831
50.961540 1 0.0003389831
51.282051 1 0.0003389831
52.163460 1 0.0003389831
52.884617 3 0.0010169492
53.418804 1 0.0003389831
53.981731 1 0.0003389831
54.945053 1 0.0003389831
55.288460 2 0.0006779661
56.089745 1 0.0003389831
56.604168 1 0.0003389831
57.692307 3 0.0010169492
59.134617 1 0.0003389831
59.230770 1 0.0003389831
61.111111 1 0.0003389831
61.188812 1 0.0003389831
66.784134 1 0.0003389831
72.115387 1 0.0003389831
74.519234 1 0.0003389831
76.923080 1 0.0003389831
82.775917 1 0.0003389831
84.134613 1 0.0003389831
86.538460 2 0.0006779661
92.500000 1 0.0003389831
97.500000 1 0.0003389831
The barplot also looks weird:
A solution: frequency tables for classes (intervals of values):
tabyl(cut(CPSSWEducation$earnings, 10))
cut(CPSSWEducation$earnings, 10) n percent
(2.04,11.7] 922 0.3125423729
(11.7,21.2] 1384 0.4691525424
(21.2,30.7] 424 0.1437288136
(30.7,40.3] 145 0.0491525424
(40.3,49.8] 43 0.0145762712
(49.8,59.4] 20 0.0067796610
(59.4,68.9] 3 0.0010169492
(68.9,78.4] 3 0.0010169492
(78.4,88] 4 0.0013559322
(88,97.6] 2 0.0006779661
The visual analog of this is the histogram:
CPSSWEducation %>% ggplot(aes(x = earnings)) + geom_histogram(bins = 10)
Think of the histogram as an estimate of \(f(x)\), the density of the continuous random variable \(X\), that sets \[\widehat P(X \in A) = \frac{\sum_i 1\{X_i \in A\}}{n}\] for each bin \(A\), and then \(\widehat f(x) = \widehat P(X \in A)\) for the bin \(A\) that contains \(x\).
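The following sketch illustrates the bin-proportion estimate on simulated data. The sample `x` and the equal-width `breaks` are assumptions made for illustration. It shows that the bin proportions sum to 1, but the step function built from them integrates to the bin width rather than to 1.

```r
set.seed(1)
x <- rexp(500, rate = 1 / 15)     # hypothetical right-skewed sample, not the CPS data

# Ten equal-width bins covering the range of the data
breaks <- seq(min(x), max(x), length.out = 11)
width  <- diff(breaks)[1]
bins   <- cut(x, breaks, include.lowest = TRUE)

# P-hat(X in A) for each bin A
p_hat <- as.numeric(table(bins)) / length(x)

total_prob    <- sum(p_hat)            # proportions sum to 1
area_unscaled <- sum(p_hat * width)    # integral of the step function f-hat = P-hat
area_scaled   <- sum(p_hat / width * width)  # integral after dividing by the bin width

print(c(total_prob = total_prob, area_unscaled = area_unscaled,
        area_scaled = area_scaled, width = width))
```

Dividing each proportion by the bin width is the rescaling that puts a histogram on the density scale.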
The histogram has at least four drawbacks:
1. it does not integrate to 1
2. it is discontinuous
3. it is constant within a bin
4. it depends on the bin choice
# Histogram with bins of width 10
CPSSWEducation %>%
  ggplot(aes(x = earnings)) +
  geom_histogram(binwidth = 10)
# Histogram with bins of width 5
CPSSWEducation %>%
  ggplot(aes(x = earnings)) +
  geom_histogram(binwidth = 5)
# Histogram with bins of width 2
CPSSWEducation %>%
  ggplot(aes(x = earnings)) +
  geom_histogram(binwidth = 2)
# Histogram with bins of width 0.1
CPSSWEducation %>%
  ggplot(aes(x = earnings)) +
  geom_histogram(binwidth = 0.1)
Enter: KDE
# KDE with the default bandwidth, drawn over a histogram on the density scale
CPSSWEducation %>%
  ggplot(aes(x = earnings)) +
  geom_density() +
  geom_histogram(binwidth = 2, aes(y = after_stat(density)), alpha = 0.5)
The KDE solves problems 1, 2, and 3. How about 4?
It swaps the choice of bins for the choice of a bandwidth.
Choosing the bandwidth: later!
# KDE with a large bandwidth (bw = 5): a smoother estimate
CPSSWEducation %>%
  ggplot(aes(x = earnings)) +
  geom_density(bw = 5) +
  geom_histogram(binwidth = 2, aes(y = after_stat(density)), alpha = 0.5)
# KDE with a small bandwidth (bw = 0.1): a much wigglier estimate
CPSSWEducation %>%
  ggplot(aes(x = earnings)) +
  geom_density(bw = 0.1) +
  geom_histogram(binwidth = 2, aes(y = after_stat(density)), alpha = 0.5)
Definition 1
The kernel density estimator (KDE) of \(f(x)\) is \[\widehat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K \left( \frac{X_i - x}{h} \right),\] where \(K(\cdot)\) is a weighting/kernel function, and \(h>0\) is a bandwidth.
This picture uses a Gaussian kernel.
This picture does not draw a kernel function on top of any \(x\) value that is not in the data.
Kernel function determines relative weight given to observations around \(x\) .
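To make Definition 1 concrete, here is a minimal hand-rolled sketch of the estimator with a Gaussian kernel. The function name `kde`, the simulated sample `X`, and the bandwidth `h = 3` are assumptions for illustration only. The comparison with base R's `density()` is approximate, since `density()` evaluates the estimate on a grid using a binned approximation.

```r
# Kernel density estimator from Definition 1, evaluated at each point in `x`.
# `K` defaults to the standard normal density (Gaussian kernel).
kde <- function(x, data, h, K = dnorm) {
  sapply(x, function(x0) sum(K((data - x0) / h)) / (length(data) * h))
}

set.seed(1)
X <- rexp(200, rate = 1 / 15)   # hypothetical stand-in for an earnings variable
h <- 3                          # bandwidth picked by hand for illustration

grid <- seq(min(X) - 5 * h, max(X) + 5 * h, length.out = 2001)
fhat <- kde(grid, X, h)

# Riemann-sum approximation to the integral of the estimate
area <- sum(fhat) * (grid[2] - grid[1])
print(area)

# Base R's density() with the same bandwidth should trace nearly the same curve
d <- density(X, bw = h, kernel = "gaussian")
max_gap <- max(abs(d$y - kde(d$x, X, h)))
print(max_gap)
```

Unlike the histogram, the estimate integrates to (approximately) 1 and is smooth in `x`.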
Definition 2 A (second-order) kernel function \(K(u)\) satisfies
1. \(0 \leq K(u) \leq \overline{K} < \infty\) (bounded),
2. \(K(u) = K(-u)\) (symmetric),
3. \(\int_{-\infty}^{+\infty} K(u) du = 1\) (integrates to 1),
4. \(\int_{-\infty}^{+\infty} |u|^r K(u) du < \infty\) for all positive integers \(r\).
Because of condition 3 (together with the non-negativity in condition 1), \(K\) is a density. Condition 4 says that the corresponding random variable has all moments finite.
Normalized kernels slightly simplify the derivations. This way, you don’t have to carry the variance term that can show up in Taylor expansions.
Definition 3
A normalized kernel function is a kernel function that satisfies \(\int_{-\infty}^{+\infty} u^2 K(u) du = 1.\)
Definition 4
The roughness of a kernel function is \(R_K = \int_{-\infty}^{+\infty} K(u)^2 du\).
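The conditions in Definitions 2 to 4 can be checked numerically. The sketch below does so for the Gaussian kernel and for the Epanechnikov kernel rescaled to unit variance. The helper name `kernel_summary` is hypothetical, and the rescaled Epanechnikov form used here is an assumption about which normalization is intended.

```r
gauss <- dnorm

# Epanechnikov kernel rescaled to have unit variance (support [-sqrt(5), sqrt(5)])
epan <- function(u) ifelse(abs(u) <= sqrt(5), 3 / (4 * sqrt(5)) * (1 - u^2 / 5), 0)

# Mass, second moment, and roughness of a kernel by numerical integration
kernel_summary <- function(K, lower, upper) {
  c(mass      = integrate(K, lower, upper)$value,
    variance  = integrate(function(u) u^2 * K(u), lower, upper)$value,
    roughness = integrate(function(u) K(u)^2, lower, upper)$value)
}

s_gauss <- kernel_summary(gauss, -Inf, Inf)
s_epan  <- kernel_summary(epan, -sqrt(5), sqrt(5))
print(rbind(gaussian = s_gauss, epanechnikov = s_epan))
```

Both kernels integrate to 1 and have unit second moment, so both are normalized; the Epanechnikov kernel has the smaller roughness.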
More important than the choice of kernel is the choice of bandwidth.
Definition 5
A bandwidth or tuning parameter \(h>0\) is a real number used to control the degree of smoothing of a nonparametric estimator.
For any fixed \(h>0\), the KDE is biased.
[draw]
Bias depends on:
\(h\)
kernel
smoothness of \(f\)
Theorem 1 If \(f(x)\) is continuous in a neighbourhood of \(x\) , then as \(h \to 0\) ,
\[E[\widehat{f}(x)] = f(x) + o(1).\]
Proof: Section 17.16.
With additional smoothness, we have the following.
Theorem 2
If \(f''(x)\) is continuous in a neighbourhood \(\mathcal{N}\) of \(x\), then as \(h\to 0\), \[E[\widehat{f}(x)] = f(x) + \frac{1}{2} f''(x)h^2 + o(h^2).\]
Proof sketch:
\[
\begin{aligned}
E\left( \hat{f}_h(x)\right) &=
E\left( \frac{1}{nh} \sum_i K\left(\frac{X_i-x}{h}\right)\right) \\
&= \frac{1}{h} E\left( K\left(\frac{X_i-x}{h}\right)\right) \\
&= \frac{1}{h} \int K\left(\frac{x_i-x}{h}\right) f(x_i) dx_i \\
&= \int K\left(u\right) f(x+uh) du. \\
\end{aligned}
\]
The second equality follows from the observations being i.i.d.
The fourth equality follows from the change of variables \(x_i = x+uh\).
Use smoothness to expand:
\[
\begin{aligned}
\int K\left(u\right) f(x+uh) du
&=
\int K\left(u\right) \left[ f(x) + uh f'(x) + \frac{u^2h^2}{2} f''(x) \right] du + o(h^2) \\
&= f(x) \int K(u)du + hf'(x) \int u K(u) du \\
&\phantom{=} + \frac{h^2}{2} f''(x) \int u^2 K(u)du + o(h^2) \\
&= f(x) + h^2 b(x) + o(h^2)
\end{aligned}
\]
where the scaled bias is
\[ b(x) = \frac{f''(x)}{2} \int u^2 K(u)du = \frac{f''(x)}{2}.\]
The first-order term vanishes because \(\int u K(u) du = 0\) by symmetry of \(K\), and the last equality in \(b(x)\) uses a normalized kernel.
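The bias approximation can be checked in a case where the exact expectation is available. The sketch below assumes, purely for illustration, that the data are standard normal and that the kernel is Gaussian. In that case \(E[\widehat f(x)]\) is the \(N(0, 1+h^2)\) density at \(x\), because it is the density of the sum of two independent normals.

```r
h <- c(0.4, 0.2, 0.1)   # bandwidths to compare
x <- 0                  # evaluation point

f  <- dnorm                                  # assumed true density: N(0, 1)
f2 <- function(x) (x^2 - 1) * dnorm(x)       # its second derivative

# Exact bias: E[f-hat(x)] is the N(0, 1 + h^2) density under these assumptions
exact_bias  <- dnorm(x, sd = sqrt(1 + h^2)) - f(x)

# Leading term from the expansion: h^2 * b(x) with b(x) = f''(x) / 2
approx_bias <- h^2 * f2(x) / 2

gap <- exact_bias - approx_bias
print(cbind(h, exact_bias, approx_bias, gap, gap_over_h2 = gap / h^2))
```

The last column shrinks as `h` shrinks, which is what a remainder of order \(o(h^2)\) means.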
If \(h \to 0\) and \(nh \to \infty\) , we can also approximate the variance:
Theorem 3 The exact variance of \(\widehat f(x)\) is \[V_{\widehat f} = \text{var}\left[\widehat f(x)\right] = \frac{1}{nh^2} \text{var}\left[ K\left(\frac{X_i-x}{h}\right)\right].\]
If \(f(x)\) is continuous in an open neighbourhood of \(x\) , then as \(h\to 0\) and \(nh \to \infty\) , \[V_{\widehat f} = \frac{v(x)}{nh} + o\left(\frac{1}{nh}\right).\]
where the scaled variance is \[v(x) = f(x)R_K = f(x) \int (K(u))^2 du.\]
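A similar check is possible for the variance. The sketch below again assumes standard normal data and a Gaussian kernel (assumptions for illustration), computes the exact variance from Theorem 3 by numerical integration, and compares it with \(v(x)/(nh)\). The names `exact_var` and `approx_var` are hypothetical.

```r
n   <- 1000
x   <- 0
R_K <- 1 / (2 * sqrt(pi))    # roughness of the Gaussian kernel

# Exact variance: var[K((X - x)/h)] / (n h^2), with X ~ N(0, 1)
exact_var <- function(h) {
  lo <- x - 10 * h
  hi <- x + 10 * h
  EK  <- integrate(function(t) dnorm((t - x) / h) * dnorm(t), lo, hi)$value
  EK2 <- integrate(function(t) dnorm((t - x) / h)^2 * dnorm(t), lo, hi)$value
  (EK2 - EK^2) / (n * h^2)
}

# Leading term: f(x) R_K / (n h)
approx_var <- function(h) dnorm(x) * R_K / (n * h)

h <- c(0.2, 0.02)
ratio <- sapply(h, exact_var) / approx_var(h)
print(cbind(h, ratio))
```

The ratio moves toward 1 as `h` gets smaller, in line with the \(o(1/(nh))\) remainder.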
The IMSE is a measure of the distance between the KDE and \(f(x)\) , summarized across all values of \(x\) in the support of \(X\) :
Definition 6
The integrated mean squared error (IMSE) is \[\text{IMSE} = \int_{-\infty}^{+\infty} E \left[ \left( \widehat f(x) - f(x) \right)^2 \right] dx.\]
[draw]
Other measures are possible.
In particular, shouldn’t the error be weighted by the density of \(X\), or shouldn’t the researcher at least be allowed to put weights on different values of \(x\)?
The term under the IMSE integral is the MSE …
\[MSE(x) = E \left[ \left( \widehat f(x) - f(x) \right)^2 \right]\]
… which we know (HPSE, 6.11) is …
\[MSE(x) = bias(x)^2 + variance(x)\]
… which is approximately …
\[MSE(x) = h^4 b(x)^2 + o(h^4) + \frac{v(x)}{nh} + o\left(\frac{1}{nh}\right)\]
as \(h\to 0\) and \(nh \to \infty\) .
We call the leading terms the asymptotic MSE, AMSE(x) …
\[AMSE(x) = h^4 b(x)^2 + \frac{v(x)}{nh}\]
… and its integral the approximate IMSE …
\[AIMSE = h^4 \int_{-\infty}^{+\infty} b(x)^2 dx + \frac{1}{nh} \int_{-\infty}^{+\infty} v(x) dx,\]
… which simplifies to
\[AIMSE = \frac{h^4}{4} \int_{-\infty}^{+\infty} (f''(x))^2 dx + \frac{R_K}{nh} = \frac{h^4}{4} R(f'') + \frac{R_K}{nh},\]
because \(b(x) = f''(x)/2\), \(v(x) = f(x)R_K\), and \(\int f(x) dx = 1\), and where \[R(f'') = \int_{-\infty}^{+\infty} (f''(x))^2 dx.\]
The AIMSE, \(\frac{h^4}{4} R(f'') + \frac{R_K}{nh}\),
increases in the roughness of \(f''\), \(R(f'')\)
increases in the roughness of the kernel, \(R_K\)
decreases with \(n\).
Furthermore:
the bias increases with \(h\)
the variance decreases with \(h\)
For a fixed kernel function, sample size, and unknown function \(f\) , all we can control is \(h\) .
The first-order condition (FOC) of the AIMSE with respect to \(h\) is
\[h^3 R(f'') - h^{-2} R_K/n = 0\]
so the AIMSE-optimal bandwidth is
\[h_0 = \left(\frac{R_K}{R(f'')}\right)^{1/5} n^{-1/5}.\]
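The first-order condition can be verified numerically in a case where \(R(f'')\) is computable. The sketch below assumes, hypothetically, that \(f\) is the \(N(0, \sigma^2)\) density and that the kernel is Gaussian, and compares the closed-form \(h_0\) with a direct numerical minimization of the AIMSE. The values of `sigma` and `n` are arbitrary.

```r
sigma <- 2
n     <- 1000
R_K   <- 1 / (2 * sqrt(pi))    # roughness of the Gaussian kernel

# Second derivative of the N(0, sigma^2) density (assumed true density)
f2 <- function(x) (x^2 / sigma^2 - 1) / sigma^2 * dnorm(x, sd = sigma)

# Roughness of f'' by numerical integration
R_f2 <- integrate(function(x) f2(x)^2, -Inf, Inf)$value

aimse <- function(h) h^4 / 4 * R_f2 + R_K / (n * h)

h_formula <- (R_K / R_f2)^(1 / 5) * n^(-1 / 5)            # solution of the FOC
h_numeric <- optimize(aimse, interval = c(0.01, 5))$minimum

print(c(h_formula = h_formula, h_numeric = h_numeric,
        constant = h_formula / (sigma * n^(-1 / 5))))
```

Under these assumptions the constant in front of \(\sigma n^{-1/5}\) is \((4/3)^{1/5} \approx 1.06\), the familiar normal-reference value.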
For the kernel, you should choose Epanechnikov, because it minimizes \(R_K\) among normalized second-order kernels.
The result
\[h_0 = \left(\frac{R_K}{R(f'')}\right)^{1/5} n^{-1/5},\]
without further analysis, is useless in practice because …
… it depends on the unknown quantity \(R(f'')\) .
But we do gain theoretical knowledge.
Viewed as a function of the bandwidth,
\[AIMSE(h) = \frac{h^4}{4} R(f'') + \frac{R_K}{nh},\]
converges to 0 at the rate
\[AIMSE(h_0) = \left( \frac{c^4}{4} R(f'') + \frac{R_K}{c} \right) n^{-4/5} \propto n^{-4/5}\]
when evaluated at \(h_0 = c\, n^{-1/5}\) with \(c = (R_K/R(f''))^{1/5}\): just substitute \(h_0\) into \(AIMSE(h)\).
This is the best possible rate for the KDE.
Compare this to the situation of using the mean \(\bar X\) of a sample of i.i.d. observations to estimate the mean \(\mu\), where
\[MSE = E(\bar X - \mu)^2 = bias^2 + variance = 0 + \frac{\sigma^2}{n} \propto n^{-1} \ll n^{-4/5}.\]
When one estimator is more efficient than another, it means that its asymptotic variance is lower: it is less variable.
A slower rate of convergence is more serious: the KDE approaches the truth at a slower rate than the sample mean, so it needs larger increases in sample size to achieve the same improvement in performance.
The KDE’s asymptotic variance, in some sense, is infinitely bigger than that of the sample mean.
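A small calculation makes the rate comparison tangible. The helper `growth_needed` is a hypothetical name; it returns the factor by which \(n\) must grow to shrink an error of order \(n^{-\text{rate}}\) by a given factor, ignoring constants.

```r
# Factor by which n must grow to shrink an error of order n^(-rate) by `reduction`
growth_needed <- function(reduction, rate) reduction^(1 / rate)

halve_mean <- growth_needed(2, rate = 1)       # sample mean: MSE of order n^(-1)
halve_kde  <- growth_needed(2, rate = 4 / 5)   # KDE: AIMSE of order n^(-4/5)

hundredth_mean <- growth_needed(100, rate = 1)
hundredth_kde  <- growth_needed(100, rate = 4 / 5)

print(c(halve_mean = halve_mean, halve_kde = halve_kde,
        hundredth_mean = hundredth_mean, hundredth_kde = hundredth_kde))
```

The gap between the two factors widens as the required reduction grows, which is the practical meaning of a slower rate.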
Silverman’s rule of thumb (HPSE 17.9) turns theoretical bandwidth into practical advice.
Definition 7
Choosing the bandwidth equal to \[h_r = 0.9 \widetilde \sigma n^{-1/5},\] where \(\widetilde \sigma\) is defined in (HPSE 11.14), is referred to as Silverman’s rule of thumb.
The KDE combined with Silverman’s rule is a proper estimator: it depends on the data alone.
I am spending less time on Silverman’s rule. We will learn about different, computationally intensive cross-validation methods for choosing bandwidths later. Rules such as Silverman’s are typically not available for more complicated settings.
Theorem 17.5 computes the roughness for normal densities. Silverman’s rule of thumb is derived assuming normal densities, so we can substitute the result of Theorem 17.5 into the formula for the optimal bandwidth.
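In R, a rule of this form is implemented by `bw.nrd0()`, which is also the default bandwidth of `density()`. The sketch below reproduces it by hand on simulated data. Note the assumption: R uses the robust scale \(\min(s, \text{IQR}/1.34)\) in place of \(\widetilde\sigma\), which may or may not coincide with the definition in HPSE 11.14.

```r
set.seed(1)
x <- rexp(500, rate = 1 / 15)    # hypothetical sample, not the CPS data
n <- length(x)

# ASSUMPTION: robust scale estimate used by R's bw.nrd0(); HPSE 11.14 may differ
sigma_tilde <- min(sd(x), IQR(x) / 1.34)

h_manual  <- 0.9 * sigma_tilde * n^(-1 / 5)   # rule of thumb computed by hand
h_builtin <- bw.nrd0(x)                       # base R implementation
h_default <- density(x)$bw                    # bandwidth density() picks by default

print(c(h_manual = h_manual, h_builtin = h_builtin, h_default = h_default))
```

Since the bandwidth is computed from the sample alone, the resulting KDE needs no user input beyond the data.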
Two more things: consistency and asymptotic normality.
Theorem 4 If \(f(x)\) is continuous in a neighbourhood of \(x\), then as \(h \to 0\) and \(nh \to \infty\),
\[\widehat f(x) - f(x) \stackrel{p}{\to} 0.\]
Proof: This is an implication of Chebyshev’s inequality (see HPSE, Th 7.2), using that the variance of \(\widehat f(x)\) goes to 0 and its expectation goes to \(f(x)\).
A standard CLT leads to:
\[
\begin{aligned}
\sqrt{n} (\hat{f}(x)-f(x) - h^2 b(x)) &\stackrel{"d"}{\to}
\mathcal{N}\left(0,\frac{1}{h} v(x)\right).
\end{aligned}
\]
The limit is well-defined for fixed \(h\) , but does not give good guidance for \(h \to 0\) .
Not useful when we use Silverman or other good choices of \(h\) .
Theorem 5 If \(f''\) is continuous in a neighbourhood of \(x\) , then as \(nh \to \infty\) such that \(h = O\left( n^{-1/5}\right)\) ,
\[\sqrt{nh} (\hat{f}(x)-f(x) - h^2 b(x)) \stackrel{d}{\to} \mathcal{N}(0,v(x)).\]
What does it mean to have \(\sqrt{nh} \sim n^{2/5} < n^{1/2}\) as the rate of convergence (assuming the optimal bandwidth)?
Proof: we will see similar proofs for nonparametric regression.
Left undiscussed:
more on optimal bandwidths and CV: week 2
proof of asymptotic normality: week 3
undersmoothing: week 3
multivariate density estimation: ?
Cross-validation
In practice, we choose a bandwidth by minimizing an estimate of the mean integrated squared error (MISE), the expectation of
\[
\begin{aligned}
\int \left( \hat{f}_h(x) - f(x) \right)^2 dx &=
\int \hat{f}_h(x)^2dx - 2 \int \hat{f}_h(x) f(x)dx + \int f(x)^2 dx
\end{aligned}
\]
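The three terms on the right-hand side:
1. first term: computed with the convolution kernel
2. second term: estimated by leave-one-out
3. third term: does not depend on \(h\)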
CV: second term
In the second term,
\[ \int \hat{f}_h(x) f(x)dx = E \left(\hat{f}_h(X)\right)\]
Estimating this by
\[ \frac{1}{n} \sum_i \hat{f}_h(X_i)\]
does not work because \(\hat{f}_h\) and \(X_i\) are dependent: \(X_i\) itself is used to compute \(\hat{f}_h\).
CV: second term (2)
Instead, use the leave-one-out estimator
\[\hat{f}_{-i}(X_i) = \frac{1}{h(n-1)} \sum_{j \neq i} K\left( \frac{X_j-X_i}{h}\right)\]
to estimate \(E \left(\hat{f}_h(X)\right)\) by
\[ \frac{1}{n} \sum_i \hat{f}_{-i}(X_i) \]
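The difference between the naive and the leave-one-out averages can be seen directly. The sketch below uses a simulated sample, a Gaussian kernel, and a hand-picked bandwidth (all assumptions for illustration). The naive average includes each observation's own kernel weight \(K(0)\), which inflates it.

```r
set.seed(1)
X <- rnorm(50)       # hypothetical sample
n <- length(X)
h <- 0.4             # bandwidth picked by hand for illustration

# Matrix of kernel weights K((X_j - X_i) / h) for all pairs (i, j)
Kmat <- dnorm(outer(X, X, "-") / h)

# Naive: average of f-hat_h(X_i), which includes the j = i term
naive <- mean(rowSums(Kmat) / (n * h))

# Leave-one-out: drop the j = i term and renormalize by n - 1
loo <- mean((rowSums(Kmat) - diag(Kmat)) / ((n - 1) * h))

# The two differ by the own-observation contribution K(0) / (n h)
own_term <- dnorm(0) / (n * h)
print(c(naive = naive, loo = loo, own_term = own_term))
```

The identity `naive = own_term + (n - 1) / n * loo` holds exactly, so the naive average is mechanically pushed up by a term that grows as `h` shrinks.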
CV: first term
The first term can be written
\[
\begin{aligned}
\int \hat{f}_h(x)^2dx &= \frac{1}{n^2h^2} \sum_{i} \sum_j \int K\left(\frac{X_i-x}{h}\right)K\left(\frac{X_j-x}{h}\right) dx \\
&\equiv \frac{1}{n^2h}\sum_i\sum_j \bar{K}\left( \frac{X_i-X_j}{h} \right)
\end{aligned}
\]
where \(\bar{K}(v) = \int K(u) K(v-u) du\) is the convolution kernel, which can be obtained from \(K\).
CV: implementation
To minimize the MISE, CV chooses:
\[ \hat{h} = \text{argmin} CV_{f}(h)\]
where
\[ CV_f(h) = \frac{1}{n^2h}\sum_i\sum_j \bar{K}\left( \frac{X_i-X_j}{h} \right) - \frac{2}{hn(n-1)} \sum_i \sum_{j \neq i} K\left( \frac{X_j-X_i}{h}\right)\]
It can be shown that \(\hat{h}\) converges to the MISE-optimal bandwidth \(h^*\).
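The criterion \(CV_f(h)\) can be sketched in a few lines for the Gaussian kernel. For that kernel the convolution kernel \(\bar K\) is the \(N(0, 2)\) density, since the sum of two independent standard normals has variance 2. The function name `cv_f`, the simulated sample, and the search interval are assumptions for illustration.

```r
# Cross-validation criterion CV_f(h) for a Gaussian kernel
cv_f <- function(h, X) {
  n <- length(X)
  D <- outer(X, X, "-") / h                       # (X_i - X_j) / h for all pairs
  # First term: convolution kernel, here the N(0, 2) density
  term1 <- sum(dnorm(D, sd = sqrt(2))) / (n^2 * h)
  # Second term: leave-one-out sum over j != i
  Kmat  <- dnorm(D)
  term2 <- 2 * (sum(Kmat) - sum(diag(Kmat))) / (h * n * (n - 1))
  term1 - term2
}

set.seed(1)
X <- rnorm(200)      # hypothetical sample

# Minimize the criterion over a hand-picked interval of bandwidths
h_cv <- optimize(cv_f, interval = c(0.05, 2), X = X)$minimum
print(c(h_cv = h_cv, silverman = bw.nrd0(X)))
```

The search interval matters in practice: the criterion can have several local minima, so a grid search is a reasonable alternative to `optimize()`.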