When we compute a confidence interval, we first compute an estimate of a parameter with a statistic.
For example, we draw a random sample then compute the sample mean to estimate the parameter.
With a single sample (or anything short of the whole population) we don’t know where the population mean lies, so we want to localize the population mean within an interval, computed from the data in the sample.
An interval is a range of values (e.g. all real numbers between 4801 and 6801: written [4801, 6801]). A confidence interval is a range of plausible intervals for the parameter. A 95% confidence interval for a parameter is an interval computed by a method guaranteed to successfully cover the parameter 95% of the time (i.e. for 95% of the samples).
We call 95% the “confidence level” for the parameter. Other confidence levels, anything between 0% and 100%, are possible.
As the confidence level increases, what happens to the width of the confidence interval?
As the confidence level increases, the width of the confidence will decrease.
As a follow up to the last question, will you be more confident that the population mean lies in a bigger interval, or less confident.
When the population mean lies in a bigger interval, I will be less confident.
As the sample size increases, what happens to the width of the confidence interval?
As the sample size increases, the width of confidence interval decreases.
In other words, will you be able to localize the population mean better (in a smaller interval) with a larger sample size?
With a larger sample size, it will be easier to localize the population mean better.
If the variablility in the population increases, what happens to the width of the confidence interval?
If the variability in the population increases, the width of the confidence interval will increase.
In other words, will you be able to localize the population mean better (in a smaller interval) with more variability in the population?
With more variability in the population, it will be harder to localize the population mean better.
According to the Central Limit Theorem, the sampling distribution for the sample mean has a Normal (bell shaped) distribution. Using this approximation, and the 68-95-99.7 Rule to assign a margin of error (half the width of the confidence interval, centered on sample mean) to be 2 times the standard error (a statistic that estimates the standard deviation of the sampling distribution). We did this in class the Friday before Thanksgiving.
Now I want to talk about another method of estimating confidence intervals. It’s called the “bootstrap”. Why is it called the bootstrap? Well, you take a single sample, and estimate the variability in the population by “pulling yourself up by your bootstraps”. (I am not making this up.)
We have only one sample, but we want to use it to get more samples from the sampling distribution of the same size. We do this by sampling from replacement. We are pulling things out of a hat and throwing them back before we draw the next number. Say we have a sample of prices of diamonds: 1, 2, 3, 4.
Now we draw new samples with the same distribution:
set.seed(1)
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 2 2 3 4
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 1 4 4 3
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 1 1 1
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 2 4 2
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 4 2 4
The idea is that we are drawing new samples such that a quarter of the time we choose 1, a quarter of the time we choose 2, and so on. This gives us an indication of the variability in our sampling distribution, using just the information in a single sample and without drawing more than one sample.
Now let’s draw a sample of the price of diamonds from the diamonds data set. I’ll use a sample size of 64, (We certainly can’t use a sample size of 1, and we probably don’t want it to be too small, although it would be an interesting experiment to see what happens with small sample sizes.)
library(ggplot2)
data(diamonds)
sample.size <- 1000
set.seed(2)
dia64 <- sample(diamonds$price, size=sample.size)
dia64
## [1] 4702 1006 745 4516 2318 2316 4150 1635 13853 706 709
## [12] 5368 1185 4661 9918 1728 2548 5183 12148 3528 911 9116
## [23] 1651 4350 7644 645 582 7983 2432 4175 2832 4484 1436
## [34] 1810 18736 842 1683 6164 476 4348 2589 6405 3998 4472
## [45] 2315 1356 2530 7701 646 1436 2811 2862 957 2207 5976
## [56] 1438 1299 552 814 383 1232 1898 837 5702 1755 11578
## [67] 9123 13278 5084 3415 5982 610 3154 4694 4681 1157 6216
## [78] 1799 9786 743 7744 931 2963 9702 4854 1744 550 6988
## [89] 1079 7445 2542 626 8807 721 13477 4824 11040 3744 3999
## [100] 11756 4867 630 2569 1608 6184 781 1966 12647 4318 4138
## [111] 2960 1084 8574 745 1588 1442 529 3948 2359 448 3095
## [122] 5456 2551 531 5392 1163 447 6558 980 7304 4928 2112
## [133] 2940 2425 6787 914 680 394 4702 9590 4632 6164 844
## [144] 6451 11958 491 475 6675 14674 6288 4667 8064 1998 9343
## [155] 504 6133 1721 4561 3581 608 1656 2310 3180 1168 6389
## [166] 882 3648 2515 557 435 2440 3827 5625 1939 9077 1338
## [177] 7680 4234 472 697 2307 739 2358 504 4036 523 3674
## [188] 1124 3401 3556 3880 9920 2511 4481 645 4871 594 4626
## [199] 605 4158 1746 872 791 2484 8669 1546 5184 4928 4786
## [210] 5378 552 665 1584 923 2530 3319 7313 2849 8128 4969
## [221] 4363 2238 9645 15339 477 1716 2055 14388 1020 1605 477
## [232] 7411 1908 669 527 5655 691 5044 5051 14215 1915 6477
## [243] 709 1151 2364 2354 7227 18795 4321 7917 1002 3595 5543
## [254] 3251 1087 557 3154 6081 3136 464 1145 3247 842 7924
## [265] 1409 924 2929 965 1881 683 1865 2058 1689 6300 1273
## [276] 16097 2558 5408 3495 5242 6048 2459 898 475 4969 2623
## [287] 676 743 1227 2716 5613 6780 492 16688 1009 10312 4969
## [298] 3599 4432 645 1624 15451 11218 1841 1187 3293 4844 957
## [309] 1949 3189 3699 11667 1243 4241 942 3208 795 12308 1114
## [320] 4609 886 4473 5704 3755 9302 5003 862 5627 3750 3460
## [331] 4714 1807 2541 9171 2890 561 3900 734 523 1436 1749
## [342] 1698 1356 558 658 8119 462 2061 4666 633 5619 775
## [353] 9397 1807 589 2711 1326 791 3259 967 4350 4851 3856
## [364] 2849 471 5390 4136 3326 1003 3016 1833 2872 17849 868
## [375] 979 894 5602 2674 969 564 3830 14625 13703 5775 3694
## [386] 803 9306 684 1746 1125 498 2114 7094 583 5026 576
## [397] 4795 1721 5187 1191 814 732 4374 3068 2671 1617 764
## [408] 7357 4497 1397 876 999 12342 3478 8079 872 1049 10662
## [419] 767 2726 1665 402 6412 971 2391 4142 571 729 707
## [430] 3366 924 2041 1865 612 3734 1593 4116 1777 544 645
## [441] 882 4773 1170 436 5005 965 2450 708 666 5040 942
## [452] 1103 13228 4213 1429 1600 7792 2498 2575 464 2251 557
## [463] 2964 15235 5802 1174 15348 492 3158 11572 4293 11654 6100
## [474] 3500 948 1320 681 2956 5992 1636 13499 1153 2657 388
## [485] 1819 15873 1006 15278 1250 2013 6134 798 684 2647 802
## [496] 4967 6779 1970 6925 715 963 713 7445 5266 11005 1073
## [507] 421 15384 1348 944 947 574 8739 9813 625 450 1612
## [518] 650 11106 1843 536 3274 12944 1019 8513 3011 2986 981
## [529] 5632 1875 3595 2329 547 4238 931 1293 984 5864 15067
## [540] 1810 2774 2345 644 2359 16665 2933 3927 1105 776 1240
## [551] 2268 7876 15100 663 1662 16560 7476 754 663 1971 2491
## [562] 2860 491 14266 1921 6661 5736 702 537 3890 2312 747
## [573] 1815 1376 16704 1607 1952 2765 639 4769 4136 579 1031
## [584] 764 675 1132 9828 3669 2147 1035 8012 11400 898 3036
## [595] 2701 1076 3641 5211 5445 2974 421 572 625 5145 1401
## [606] 4249 3008 2655 1179 848 1259 7168 742 5556 1443 605
## [617] 1879 901 5458 2514 972 906 612 616 6552 2958 2952
## [628] 1395 6097 596 942 8687 1845 867 692 900 675 4578
## [639] 10437 15647 6581 1980 1723 5949 1441 813 855 1103 806
## [650] 4718 983 10702 2737 6864 2414 13215 2155 837 8583 4081
## [661] 1561 17760 13140 2079 10581 2938 1175 3010 1656 3806 8645
## [672] 9407 1229 559 15515 523 2140 1736 916 4829 726 4480
## [683] 872 7409 3792 4984 765 18342 1158 911 9483 408 8509
## [694] 3358 1607 6123 1307 816 871 655 2100 1911 1095 705
## [705] 7781 3290 2192 10046 2553 2812 2042 4417 4557 6351 1316
## [716] 4959 1179 673 695 605 2197 530 2064 6416 7911 1250
## [727] 6578 2630 3007 6501 4108 787 6892 2210 1806 611 8467
## [738] 16778 18172 4836 10905 625 2544 945 4960 11946 1847 827
## [749] 14948 6177 5540 9037 9495 2428 2583 9858 714 3023 1200
## [760] 2507 7163 12648 5198 537 440 802 5292 13321 1944 752
## [771] 3311 4659 18371 9182 15175 583 537 1013 1299 8253 731
## [782] 8970 1963 854 1822 3444 1367 3422 12268 4626 838 1151
## [793] 4445 4390 1754 5050 764 976 1432 1013 10090 666 2774
## [804] 602 4375 1792 4864 748 1343 605 9659 8645 5338 1617
## [815] 2810 13986 1774 720 16340 1774 652 3725 4179 3160 741
## [826] 3511 3908 2973 11975 1332 3747 2923 1723 745 4391 17838
## [837] 2200 1692 458 812 1177 707 3858 4429 2041 1690 1169
## [848] 2923 1897 10429 1401 4476 485 4284 393 5088 4011 2265
## [859] 838 13465 4037 14275 8167 1952 4350 1668 1942 1893 17877
## [870] 10147 13329 4791 7035 2290 749 805 15767 18190 6907 698
## [881] 2887 3308 8922 3919 3313 4743 802 765 770 1933 11660
## [892] 383 5287 4248 1672 3511 2923 4060 956 12674 5805 12308
## [903] 17197 5055 9366 1040 4459 2335 2196 7816 9115 5559 6503
## [914] 4152 2447 772 1088 3468 742 802 14639 1235 435 802
## [925] 9304 596 5476 5451 1446 756 945 2123 907 1605 640
## [936] 435 1913 2367 3662 769 2167 411 2062 2954 1433 1421
## [947] 574 3335 789 7812 4965 803 620 1619 5826 544 554
## [958] 2608 6439 3801 4242 1908 698 765 4672 1806 13225 2310
## [969] 2200 6664 576 651 3208 499 431 911 4523 2268 587
## [980] 373 1343 4453 2717 9898 1899 12252 4218 1294 871 13986
## [991] 15081 536 3854 625 4560 5014 10396 2469 702 4604
We are going to create a bootstrap distribution for the sample mean using this sample. We will take 1000 samples from this distribution:
set.seed(3)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(dia64, size=sample.size, replace=TRUE))
}
head(bs1000)
## [1] 3825.653 3647.214 3993.650 4017.054 3904.232 3929.650
Now we are going to get 1000 samples from the whole sampling distribution of diamonds (without a bootstrap). This is obviously better, but we need to draw 1000 samples from our population. If it were very expensive to do this (e.g. we have to run 1000 different random surveys), we might not be able to do this.
set.seed(4)
dp1000 <- c(0)
for (i in 1:1000) {
dp1000[i] = mean(sample(diamonds$price, size=sample.size))
}
head(dp1000)
## [1] 4036.461 3790.136 3905.845 4153.485 4066.295 3940.525
Now let’s display the two side by side.
labp <- cbind(rep('population',1000))
labb <- cbind(rep('bootstrap',1000))
labels <- rbind(labp, labb)
statp <- cbind(dp1000)
statb <- cbind(bs1000)
stats <- rbind(statp, statb)
df <- data.frame(stats=stats,labels=labels)
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot()
Here is the claim: even though the means are different, the spread in the two distributions should be roughly the same, so that 95% of the time, the population mean will lie between the 2.5th percentile and the 97.5th percentile.
lo <- quantile(bs1000, probs=0.025)
hi <- quantile(bs1000, probs=0.975)
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=lo,color="green") + geom_hline(yintercept=hi,color="green")+ geom_hline(yintercept=mean(diamonds$price),color="black")
The black line is the mean of the entire population (not 1000 samples), the green lines are the percentiles that define the confidence interval for our original sample of 64; they percentiles of the 1000 bootstrap samples.
Note that our 95% confidence interval is one of the 95% that cover the true value of the parameter (the population mean).
I’ll condense the steps into two: (1) obtain the sample you plan to use, then (2) compute the confidence interval.
For (1) we will use subpopulations of diamonds:
set.seed(10)
sample.size <- 64
subpop <- subset(diamonds, color=="J")$price
mysample <- sample(subpop, size=sample.size)
Now (2) compute the CI
set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))
}
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci
## 2.5% 97.5%
## 4602.061 6512.144
As a third, extra step, we can plot results, but unlike what we did above, we won’t plot the correct answer, because we won’t assume we know the correct answer.
df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")
Found on Wikipedia concerning this method: “This method can be applied to any statistic. It will work well in cases where the bootstrap distribution is symmetrical and centered on the observed statistic[27] and where the sample statistic is median-unbiased and has maximum concentration (or minimum risk with respect to an absolute value loss function). In other cases, the percentile bootstrap can be too narrow.[citation needed] When working with small sample sizes (i.e., less than 50), the percentile confidence intervals for (for example) the variance statistic will be too narrow. So that with a sample of 20 points, 90% confidence interval will include the true variance only 78% of the time[28].”
Try defining a sample with the “cut” characteristic of diamonds and repeating the calculation of confidence interval. Based on samples are “fair” or “premium” diamonds more expensive? Change the sample size until confidence intervals don’t overlap. About how large a sample size do you need (try 1, 4, 16, 64, 256, 1024)?
**Cut==“Premium”
set.seed(10)
sample.size <- 1024
subpop <- subset(diamonds,cut=="Premium")$price
mysample <- sample(subpop, size=sample.size)
set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))
}
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci
## 2.5% 97.5%
## 4238.425 4781.810
df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")
**Cut=“Fair”
set.seed(10)
sample.size <- 1024
subpop <- subset(diamonds,cut="Fair")$price
mysample <- sample(subpop, size=sample.size)
set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))
}
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci
## 2.5% 97.5%
## 3639.059 4093.393
df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")
Answer: Cut==“Premium” shows the more expensive result. And we should have the larger sample size, 1024, so that we won’t have the overlapping part.