bootstrap

Review of Confidence Intervals

When we compute a confidence interval, we first compute an estimate of a parameter with a statistic.

For example, we draw a random sample then compute the sample mean to estimate the parameter.

With a single sample (or anything short of the whole population) we don’t know where the population mean lies, so we want to localize the population mean within an interval, computed from the data in the sample.

An interval is a range of values (e.g. all real numbers between 4801 and 6801: written [4801, 6801]). A confidence interval is a range of plausible intervals for the parameter. A 95% confidence interval for a parameter is an interval computed by a method guaranteed to successfully cover the parameter 95% of the time (i.e. for 95% of the samples).

We call 95% the “confidence level” for the parameter. Other confidence levels, anything between 0% and 100%, are possible.

As the confidence level increases, what happens to the width of the confidence interval?

As the confidence level increases, the width of the confidence will decrease.

As a follow up to the last question, will you be more confident that the population mean lies in a bigger interval, or less confident.

When the population mean lies in a bigger interval, I will be less confident.

As the sample size increases, what happens to the width of the confidence interval?

As the sample size increases, the width of confidence interval decreases.

In other words, will you be able to localize the population mean better (in a smaller interval) with a larger sample size?

With a larger sample size, it will be easier to localize the population mean better.

If the variablility in the population increases, what happens to the width of the confidence interval?

If the variability in the population increases, the width of the confidence interval will increase.

In other words, will you be able to localize the population mean better (in a smaller interval) with more variability in the population?

With more variability in the population, it will be harder to localize the population mean better.

Drawing Confidence Intervals Assuming You Know Shape of Sampling Distribution

According to the Central Limit Theorem, the sampling distribution for the sample mean has a Normal (bell shaped) distribution. Using this approximation, and the 68-95-99.7 Rule to assign a margin of error (half the width of the confidence interval, centered on sample mean) to be 2 times the standard error (a statistic that estimates the standard deviation of the sampling distribution). We did this in class the Friday before Thanksgiving.

Bootstrap

Now I want to talk about another method of estimating confidence intervals. It’s called the “bootstrap”. Why is it called the bootstrap? Well, you take a single sample, and estimate the variability in the population by “pulling yourself up by your bootstraps”. (I am not making this up.)

We have only one sample, but we want to use it to get more samples from the sampling distribution of the same size. We do this by sampling from replacement. We are pulling things out of a hat and throwing them back before we draw the next number. Say we have a sample of prices of diamonds: 1, 2, 3, 4.

Now we draw new samples with the same distribution:

set.seed(1)
sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 2 2 3 4

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 1 4 4 3

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 3 1 1 1

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 3 2 4 2

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 3 4 2 4

The idea is that we are drawing new samples such that a quarter of the time we choose 1, a quarter of the time we choose 2, and so on. This gives us an indication of the variability in our sampling distribution, using just the information in a single sample and without drawing more than one sample.

Now let’s draw a sample of the price of diamonds from the diamonds data set. I’ll use a sample size of 64, (We certainly can’t use a sample size of 1, and we probably don’t want it to be too small, although it would be an interesting experiment to see what happens with small sample sizes.)

library(ggplot2)
data(diamonds)

sample.size <- 1000
set.seed(2)
dia64 <- sample(diamonds$price, size=sample.size)
dia64

##    [1]  4702  1006   745  4516  2318  2316  4150  1635 13853   706   709
##   [12]  5368  1185  4661  9918  1728  2548  5183 12148  3528   911  9116
##   [23]  1651  4350  7644   645   582  7983  2432  4175  2832  4484  1436
##   [34]  1810 18736   842  1683  6164   476  4348  2589  6405  3998  4472
##   [45]  2315  1356  2530  7701   646  1436  2811  2862   957  2207  5976
##   [56]  1438  1299   552   814   383  1232  1898   837  5702  1755 11578
##   [67]  9123 13278  5084  3415  5982   610  3154  4694  4681  1157  6216
##   [78]  1799  9786   743  7744   931  2963  9702  4854  1744   550  6988
##   [89]  1079  7445  2542   626  8807   721 13477  4824 11040  3744  3999
##  [100] 11756  4867   630  2569  1608  6184   781  1966 12647  4318  4138
##  [111]  2960  1084  8574   745  1588  1442   529  3948  2359   448  3095
##  [122]  5456  2551   531  5392  1163   447  6558   980  7304  4928  2112
##  [133]  2940  2425  6787   914   680   394  4702  9590  4632  6164   844
##  [144]  6451 11958   491   475  6675 14674  6288  4667  8064  1998  9343
##  [155]   504  6133  1721  4561  3581   608  1656  2310  3180  1168  6389
##  [166]   882  3648  2515   557   435  2440  3827  5625  1939  9077  1338
##  [177]  7680  4234   472   697  2307   739  2358   504  4036   523  3674
##  [188]  1124  3401  3556  3880  9920  2511  4481   645  4871   594  4626
##  [199]   605  4158  1746   872   791  2484  8669  1546  5184  4928  4786
##  [210]  5378   552   665  1584   923  2530  3319  7313  2849  8128  4969
##  [221]  4363  2238  9645 15339   477  1716  2055 14388  1020  1605   477
##  [232]  7411  1908   669   527  5655   691  5044  5051 14215  1915  6477
##  [243]   709  1151  2364  2354  7227 18795  4321  7917  1002  3595  5543
##  [254]  3251  1087   557  3154  6081  3136   464  1145  3247   842  7924
##  [265]  1409   924  2929   965  1881   683  1865  2058  1689  6300  1273
##  [276] 16097  2558  5408  3495  5242  6048  2459   898   475  4969  2623
##  [287]   676   743  1227  2716  5613  6780   492 16688  1009 10312  4969
##  [298]  3599  4432   645  1624 15451 11218  1841  1187  3293  4844   957
##  [309]  1949  3189  3699 11667  1243  4241   942  3208   795 12308  1114
##  [320]  4609   886  4473  5704  3755  9302  5003   862  5627  3750  3460
##  [331]  4714  1807  2541  9171  2890   561  3900   734   523  1436  1749
##  [342]  1698  1356   558   658  8119   462  2061  4666   633  5619   775
##  [353]  9397  1807   589  2711  1326   791  3259   967  4350  4851  3856
##  [364]  2849   471  5390  4136  3326  1003  3016  1833  2872 17849   868
##  [375]   979   894  5602  2674   969   564  3830 14625 13703  5775  3694
##  [386]   803  9306   684  1746  1125   498  2114  7094   583  5026   576
##  [397]  4795  1721  5187  1191   814   732  4374  3068  2671  1617   764
##  [408]  7357  4497  1397   876   999 12342  3478  8079   872  1049 10662
##  [419]   767  2726  1665   402  6412   971  2391  4142   571   729   707
##  [430]  3366   924  2041  1865   612  3734  1593  4116  1777   544   645
##  [441]   882  4773  1170   436  5005   965  2450   708   666  5040   942
##  [452]  1103 13228  4213  1429  1600  7792  2498  2575   464  2251   557
##  [463]  2964 15235  5802  1174 15348   492  3158 11572  4293 11654  6100
##  [474]  3500   948  1320   681  2956  5992  1636 13499  1153  2657   388
##  [485]  1819 15873  1006 15278  1250  2013  6134   798   684  2647   802
##  [496]  4967  6779  1970  6925   715   963   713  7445  5266 11005  1073
##  [507]   421 15384  1348   944   947   574  8739  9813   625   450  1612
##  [518]   650 11106  1843   536  3274 12944  1019  8513  3011  2986   981
##  [529]  5632  1875  3595  2329   547  4238   931  1293   984  5864 15067
##  [540]  1810  2774  2345   644  2359 16665  2933  3927  1105   776  1240
##  [551]  2268  7876 15100   663  1662 16560  7476   754   663  1971  2491
##  [562]  2860   491 14266  1921  6661  5736   702   537  3890  2312   747
##  [573]  1815  1376 16704  1607  1952  2765   639  4769  4136   579  1031
##  [584]   764   675  1132  9828  3669  2147  1035  8012 11400   898  3036
##  [595]  2701  1076  3641  5211  5445  2974   421   572   625  5145  1401
##  [606]  4249  3008  2655  1179   848  1259  7168   742  5556  1443   605
##  [617]  1879   901  5458  2514   972   906   612   616  6552  2958  2952
##  [628]  1395  6097   596   942  8687  1845   867   692   900   675  4578
##  [639] 10437 15647  6581  1980  1723  5949  1441   813   855  1103   806
##  [650]  4718   983 10702  2737  6864  2414 13215  2155   837  8583  4081
##  [661]  1561 17760 13140  2079 10581  2938  1175  3010  1656  3806  8645
##  [672]  9407  1229   559 15515   523  2140  1736   916  4829   726  4480
##  [683]   872  7409  3792  4984   765 18342  1158   911  9483   408  8509
##  [694]  3358  1607  6123  1307   816   871   655  2100  1911  1095   705
##  [705]  7781  3290  2192 10046  2553  2812  2042  4417  4557  6351  1316
##  [716]  4959  1179   673   695   605  2197   530  2064  6416  7911  1250
##  [727]  6578  2630  3007  6501  4108   787  6892  2210  1806   611  8467
##  [738] 16778 18172  4836 10905   625  2544   945  4960 11946  1847   827
##  [749] 14948  6177  5540  9037  9495  2428  2583  9858   714  3023  1200
##  [760]  2507  7163 12648  5198   537   440   802  5292 13321  1944   752
##  [771]  3311  4659 18371  9182 15175   583   537  1013  1299  8253   731
##  [782]  8970  1963   854  1822  3444  1367  3422 12268  4626   838  1151
##  [793]  4445  4390  1754  5050   764   976  1432  1013 10090   666  2774
##  [804]   602  4375  1792  4864   748  1343   605  9659  8645  5338  1617
##  [815]  2810 13986  1774   720 16340  1774   652  3725  4179  3160   741
##  [826]  3511  3908  2973 11975  1332  3747  2923  1723   745  4391 17838
##  [837]  2200  1692   458   812  1177   707  3858  4429  2041  1690  1169
##  [848]  2923  1897 10429  1401  4476   485  4284   393  5088  4011  2265
##  [859]   838 13465  4037 14275  8167  1952  4350  1668  1942  1893 17877
##  [870] 10147 13329  4791  7035  2290   749   805 15767 18190  6907   698
##  [881]  2887  3308  8922  3919  3313  4743   802   765   770  1933 11660
##  [892]   383  5287  4248  1672  3511  2923  4060   956 12674  5805 12308
##  [903] 17197  5055  9366  1040  4459  2335  2196  7816  9115  5559  6503
##  [914]  4152  2447   772  1088  3468   742   802 14639  1235   435   802
##  [925]  9304   596  5476  5451  1446   756   945  2123   907  1605   640
##  [936]   435  1913  2367  3662   769  2167   411  2062  2954  1433  1421
##  [947]   574  3335   789  7812  4965   803   620  1619  5826   544   554
##  [958]  2608  6439  3801  4242  1908   698   765  4672  1806 13225  2310
##  [969]  2200  6664   576   651  3208   499   431   911  4523  2268   587
##  [980]   373  1343  4453  2717  9898  1899 12252  4218  1294   871 13986
##  [991] 15081   536  3854   625  4560  5014 10396  2469   702  4604

We are going to create a bootstrap distribution for the sample mean using this sample. We will take 1000 samples from this distribution:

set.seed(3)
bs1000 <- c()
for (i in 1:1000) {
  bs1000[i] <- mean(sample(dia64, size=sample.size, replace=TRUE))
}
head(bs1000)

## [1] 3825.653 3647.214 3993.650 4017.054 3904.232 3929.650

Now we are going to get 1000 samples from the whole sampling distribution of diamonds (without a bootstrap). This is obviously better, but we need to draw 1000 samples from our population. If it were very expensive to do this (e.g. we have to run 1000 different random surveys), we might not be able to do this.

set.seed(4)
dp1000 <- c(0)
for (i in 1:1000) {
  dp1000[i] = mean(sample(diamonds$price, size=sample.size))
}
head(dp1000)

## [1] 4036.461 3790.136 3905.845 4153.485 4066.295 3940.525

Now let’s display the two side by side.

labp <- cbind(rep('population',1000))
labb <- cbind(rep('bootstrap',1000))
labels <- rbind(labp, labb)
statp <- cbind(dp1000)
statb <- cbind(bs1000)
stats <- rbind(statp, statb)
df <- data.frame(stats=stats,labels=labels)

ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot()

Here is the claim: even though the means are different, the spread in the two distributions should be roughly the same, so that 95% of the time, the population mean will lie between the 2.5th percentile and the 97.5th percentile.

lo <- quantile(bs1000, probs=0.025)
hi <- quantile(bs1000, probs=0.975)
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=lo,color="green") + geom_hline(yintercept=hi,color="green")+ geom_hline(yintercept=mean(diamonds$price),color="black")

The black line is the mean of the entire population (not 1000 samples), the green lines are the percentiles that define the confidence interval for our original sample of 64; they percentiles of the 1000 bootstrap samples.

Note that our 95% confidence interval is one of the 95% that cover the true value of the parameter (the population mean).

New after Thanksgiving

I’ll condense the steps into two: (1) obtain the sample you plan to use, then (2) compute the confidence interval.

For (1) we will use subpopulations of diamonds:

set.seed(10)
sample.size <- 64
subpop <- subset(diamonds, color=="J")$price
mysample <- sample(subpop, size=sample.size)

Now (2) compute the CI

set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
  bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))
}
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci

##     2.5%    97.5% 
## 4602.061 6512.144

As a third, extra step, we can plot results, but unlike what we did above, we won’t plot the correct answer, because we won’t assume we know the correct answer.

df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")

Found on Wikipedia concerning this method: “This method can be applied to any statistic. It will work well in cases where the bootstrap distribution is symmetrical and centered on the observed statistic[27] and where the sample statistic is median-unbiased and has maximum concentration (or minimum risk with respect to an absolute value loss function). In other cases, the percentile bootstrap can be too narrow.[citation needed] When working with small sample sizes (i.e., less than 50), the percentile confidence intervals for (for example) the variance statistic will be too narrow. So that with a sample of 20 points, 90% confidence interval will include the true variance only 78% of the time[28].”

Try defining a sample with the “cut” characteristic of diamonds and repeating the calculation of confidence interval. Based on samples are “fair” or “premium” diamonds more expensive? Change the sample size until confidence intervals don’t overlap. About how large a sample size do you need (try 1, 4, 16, 64, 256, 1024)?

**Cut==“Premium”

set.seed(10)
sample.size <- 1024
subpop <- subset(diamonds,cut=="Premium")$price
mysample <- sample(subpop, size=sample.size)

set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
  bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))
}
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci

##     2.5%    97.5% 
## 4238.425 4781.810

df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")

**Cut=“Fair”

set.seed(10)
sample.size <- 1024
subpop <- subset(diamonds,cut="Fair")$price
mysample <- sample(subpop, size=sample.size)

set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
  bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))
}
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci

##     2.5%    97.5% 
## 3639.059 4093.393

df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")

Answer: Cut==“Premium” shows the more expensive result. And we should have the larger sample size, 1024, so that we won’t have the overlapping part.

bootstrap

He Zhao

November 21, 2017

Review of Confidence Intervals

Drawing Confidence Intervals Assuming You Know Shape of Sampling Distribution

Bootstrap

New after Thanksgiving