A census is a survey in which the entire population is measured. Although a census draws on the entire population, a sample can be more useful in some cases. A census is less flexible with respect to current events: it is conducted on a fixed schedule, so if we want to understand a situation soon after it happens (for example, people's reactions to an election, or thoughts on gun violence after a shooting), it is easier and quicker to take a sample.
Additionally, a sample is more cost-effective. The census is collected by the Census Bureau, so if the exact question we want to study was not asked, it can be difficult to use the census data at all.
The two types of nonsampling errors are selection bias and measurement error.
Selection bias occurs when some part of the population is not sampled. Some examples of this are:
- undercoverage: failing to include all of the target population in the sampling frame
- overcoverage: including population units in the sampling frame that are not in the target population
- nonresponse: failing to obtain responses from all of the chosen sample
Measurement error occurs when the recorded response tends to differ from the true value, often systematically in one direction. An example is someone overstating how often they work out: their reported number of workouts may be higher than the true number.
The way this question is asked may lead the respondent to answer in a specific way. Including external information, such as recent tragic shootings, stirs emotion. Additionally, the wording “indiscriminately kill dozens of people very rapidly” pushes anyone who disagrees to defend their ethics on the spot, which may skew the results.
The question should be stated more simply and without any external framing. Providing a definition of what an assault rifle is would also be helpful. A better way to state the question would be:
“Do you think assault rifles (selective-fire rifles that use an intermediate cartridge and a detachable magazine) should be banned?”
Statement: A simple random sample is a sample where every observation has the same sampling probability.
True, with one caveat: equal unit-level probabilities follow from simple random sampling but do not by themselves guarantee it. The full definition is that every possible subset of size n from the population has an equal chance of being the sample.
Source: Lohr
In a simple random sample, each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process, and each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals.
Source: Wikipedia
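As a quick illustration (not part of the assignment), base R's sample() draws exactly this kind of sample when replace = FALSE; the population size and sample size below are made up:
N <- 100 # hypothetical population size
n <- 10 # hypothetical sample size
set.seed(1) # for reproducibility
srs <- sample(1:N, size = n, replace = FALSE) # every subset of size n is equally likely
srs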
So, first I googled all the US colleges that have official Quidditch teams and found this list: https://www.usquidditch.org/teams. I am counting only individuals who are part of an official team as individuals playing Quidditch.
The list contains 143 teams across colleges and adult communities, and 107 of the 143 are strictly college teams. For each sampled school, we can find the ratio of students playing Quidditch to the school's total student population.
Once we have this, we can record the type of each school we looked at and use post-stratification to weight the schools by type (public, private) in order to estimate the total number of students who play Quidditch in the U.S. A sketch of this step follows.
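Here is a minimal sketch of that post-stratification calculation; the team sizes and stratum counts below are entirely made up for illustration:
# Hypothetical sample: Quidditch players counted at a few schools
schools <- data.frame(type = c("public", "public", "private"),
                      players = c(21, 18, 15)) # made-up team sizes
# Invented totals of U.S. colleges by type (post-stratification weights)
N_type <- c(public = 1600, private = 1700)
avg <- tapply(schools$players, schools$type, mean) # mean players per school, by type
total_players <- sum(N_type[names(avg)] * avg) # weight each stratum up to the U.S.
total_players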
One would need a sample of at least 475 people in order to estimate, with 95% confidence and a 0.045 margin of error, the proportion of adults between the ages of 30 and 50 who have heard of Lil Baby. The calculation uses n_0 = z^2 p(1-p)/e^2 with the conservative choice p = 1/2, which maximizes p(1-p).
n_0 <- (1.96^2*(1/2)*(1-1/2)) / (0.045)^2 # z^2 * p(1-p) / e^2
ceiling(n_0)
## [1] 475
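As a side note, if the population size N were known, the finite population correction from Lohr, n = n_0/(1 + n_0/N), would shrink this slightly. A quick sketch with a hypothetical N:
N <- 120e6 # hypothetical number of U.S. adults aged 30 to 50, for illustration only
n_0/(1 + n_0/N) # essentially unchanged because N is huge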
The mean of the sampling distribution is 8.8. The histogram is depicted below and shows a fairly symmetric, bell-shaped distribution.
pop <- data.frame(id = 1:10,
y = c(1,2,4,5,9,10,10,13,14,20),
x = c(6,11,14,15,24,26,28,32,34,45),
strata = c(1,1,2,2,3,3,4,4,5,5))
samps <- combn(pop$y, 5) # all choose(10, 5) = 252 possible SRSs of size 5
y_bar <- colMeans(samps) # mean of each possible sample
hist(y_bar, main="Sampling Distribution of Y")
abline(v=mean(y_bar),col="red")
mean(y_bar)
## [1] 8.8
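As a quick check (not part of the original output), this matches the population mean exactly, as expected since the sample mean under SRS is unbiased:
mean(pop$y) # the population mean is also 8.8
## [1] 8.8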
The standard error is 1.88.
Of the 252 confidence intervals, 244 cover the true mean of y, so the coverage is about 96.83%. (It is not exactly 95% because the sampling distribution is discrete and every interval uses the same fixed standard error.)
fpc <- 1 - (5/10) # finite population correction: 1 - n/N
s2 <- var(pop$y) # population variance S^2 (denominator N - 1 = 9)
var <- fpc*s2/5 # Var(y_bar) = (1 - n/N) * S^2 / n
se <- sqrt(var) # the standard error is the square root of the variance
se
## [1] 1.878534
sd(y_bar) # empirical SD of the 252 sample means; slightly above se because sd() divides by 251 rather than 252
## [1] 1.882272
low <- y_bar - 1.96*se
up <- y_bar + 1.96*se
conf <- data.frame(low,up)
coverage <- conf$low < mean(y_bar) & conf$up > mean(y_bar) # does each CI cover the true mean 8.8?
length(coverage)
## [1] 252
sum(coverage, na.rm=TRUE)
## [1] 244
244/252
## [1] 0.968254
The mean of this sampling distribution is also 8.8; its histogram is shown below. This sampling distribution is much narrower than the one from the SRS.
pop <- data.frame(id = 1:10, y = c(1,2,4,5,9,10,10,13,14,20),
x = c(6,11,14,15,24,26,28,32,34,45),
strata = c(1,1,2,2,3,3,4,4,5,5))
str1 <- pop$y[pop$strata == 1]
str2 <- pop$y[pop$strata == 2]
str3 <- pop$y[pop$strata == 3]
str4 <- pop$y[pop$strata == 4]
str5 <- pop$y[pop$strata == 5]
pop_str <- expand.grid(str1,str2,str3,str4,str5)
#Source https://stackoverflow.com/questions/18705153/generate-list-of-all-possible-combinations-of-elements-of-vector
pop_str # all 2^5 = 32 possible stratified samples (one unit drawn from each of the 5 strata)
## Var1 Var2 Var3 Var4 Var5
## 1 1 4 9 10 14
## 2 2 4 9 10 14
## 3 1 5 9 10 14
## 4 2 5 9 10 14
## 5 1 4 10 10 14
## 6 2 4 10 10 14
## 7 1 5 10 10 14
## 8 2 5 10 10 14
## 9 1 4 9 13 14
## 10 2 4 9 13 14
## 11 1 5 9 13 14
## 12 2 5 9 13 14
## 13 1 4 10 13 14
## 14 2 4 10 13 14
## 15 1 5 10 13 14
## 16 2 5 10 13 14
## 17 1 4 9 10 20
## 18 2 4 9 10 20
## 19 1 5 9 10 20
## 20 2 5 9 10 20
## 21 1 4 10 10 20
## 22 2 4 10 10 20
## 23 1 5 10 10 20
## 24 2 5 10 10 20
## 25 1 4 9 13 20
## 26 2 4 9 13 20
## 27 1 5 9 13 20
## 28 2 5 9 13 20
## 29 1 4 10 13 20
## 30 2 4 10 13 20
## 31 1 5 10 13 20
## 32 2 5 10 13 20
y_bar_str <- rowMeans(pop_str) # stratified estimate for each of the 32 samples
mean_y <- mean(y_bar_str)
mean_y
## [1] 8.8
hist(y_bar_str)
abline(v=mean_y, col="red")
Below is the sampling distribution of the ratio estimator of y. It is much more skewed than the other two, and its mean is 8.7722, slightly below the true mean of 8.8, reflecting the small bias of ratio estimation.
#Find x_bar, y_bar, and Bhat to find y_bar_r
x_bar_u <- sum(pop$x)/10 # population mean of x
x_comb <- as.data.frame(combn(pop$x, 5)) # x-values for each of the 252 samples
xbar <- colSums(x_comb)/5
y_comb <- as.data.frame(combn(pop$y, 5)) # y-values for each of the 252 samples
ybar <- colSums(y_comb)/5
avg_xy <- data.frame(xbar, ybar)
avg_xy$Bhat <- ybar/xbar # Bhat = ybar/xbar
y_bar_r <- avg_xy$Bhat*x_bar_u # y_bar_r = Bhat * x_bar_U
hist(y_bar_r)
abline(v=mean(y_bar_r),col="red")
We know that the true mean of the population is 8.8. We will compare SRS, stratified random sampling, and ratio estimation against it.
When we graph the biases, the SRS distribution is by far the widest; the stratified and ratio distributions are tightly concentrated around zero. From class, we know that stratified random sampling has much smaller variance than simple random sampling, and both the SRS and stratified estimators are unbiased.
From the relative-variance calculations, the ratio estimator has the smallest variance, followed by stratified random sampling; simple random sampling is the least efficient. However, the stratified estimator is exactly unbiased (its mean is 8.8), while the ratio estimator is slightly biased (mean 8.7722) and skewed, which can make it look more spread out in the histogram. So both estimators are clearly better than simple random sampling, but which is best is unclear: it depends on how we trade off variance against bias. A direct bias check appears after the code below.
#Bias
b_srs <- y_bar - 8.8 #RED
b_str <- y_bar_str - 8.8 #GREEN
b_rat <- y_bar_r - 8.8 #PURPLE/BLUE
hist(b_srs, col = "red")
hist(b_str, add=T, col=rgb(0, 1, 0, 0.5))
hist(b_rat, add=T, col=rgb(0,0,1,0.5))
#Relative Efficiency
var(y_bar_r)/var(y_bar_str) #ratio estimation vs stratified random sampling
## [1] 0.5536386
var(y_bar_str)/var(y_bar) #stratified random sampling vs simple random sampling
## [1] 0.1398507
var(y_bar_r)/var(y_bar) #ratio estimation vs simple random sampling
## [1] 0.07742676
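As a direct check on the bias discussion (not part of the original output), we can average the bias of each estimator; the SRS and stratified estimators should average to exactly zero, while the ratio estimator should come out to roughly 8.7722 - 8.8 = -0.028:
c(srs = mean(b_srs), str = mean(b_str), ratio = mean(b_rat)) # average bias of each estimator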
The average weekday price of 9 holes of golf is $20.15. The standard error is 1.629866. The coefficient of variation is estimated to be 0.08087325.
library(SDaA)
data('golfsrs')
y_bar <- mean(golfsrs$wkday9)
y_bar
## [1] 20.15333
s2 <- var(golfsrs$wkday9) # sample variance of the weekday 9-hole price
n <- nrow(golfsrs)
N <- 14938 #Took this number from Chapter 2 Problem 16 from our HW
se <- sqrt((1-n/N)*s2/n) # SE of y_bar under SRS, with the fpc
se
## [1] 1.629866
CV <- se/y_bar
CV
## [1] 0.08087325
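As a cross-check (not in the original analysis), the same estimate and standard error should be reproducible with the survey package, assuming it is installed; the column Npop is introduced here just to carry the population size:
library(survey)
g <- golfsrs # work on a copy so golfsrs itself is unchanged
g$Npop <- 14938 # population size for the finite population correction
dsgn <- svydesign(ids = ~1, fpc = ~Npop, data = g)
svymean(~wkday9, dsgn) # should match y_bar of about 20.15 with SE of about 1.63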
We are 95% confident that the proportion of golf courses in the United States that are public lies within the interval (0.3773908, 0.5559425).
head(golfsrs)
## rn state course holes type yearblt wkday18 wkday9
## 1 5491 RI Warwick Country Club 18 priv 1923 25 25
## 2 10276 VT Haystack Golf Club 18 semi 1972 40 24
## 3 6025 MN Pierz 9 pub 1939 NA 10
## 4 9739 GA Southerness G C 18 semi 1991 37 37
## 5 3463 CA Whispering Lakes 18 pub 1970 17 10
## 6 5883 MN Hidden Creek 18 pub 1996 16 12
## wkend18 wkend9 backtee rating par cart18 cart9 caddy pro
## 1 35 25 6453 71.8 69 15.0 7.5 y y
## 2 45 24 6549 71.1 72 30.0 18.0 n y
## 3 NA 10 3058 69.2 35 16.5 11.0 n n
## 4 45 37 6766 72.2 72 0.0 0.0 n y
## 5 20 10 6706 71.4 72 22.0 15.0 n y
## 6 18 12 7002 73.5 72 10.0 7.0 n y
golfsrs$public <- as.numeric(golfsrs$type == 'pub')
sum(golfsrs$public) #56 of the 120
## [1] 56
p <- 56/120
p
## [1] 0.4666667
pub_s2 <- var(golfsrs$public) # s^2 for the 0/1 public indicator
pub_se <- sqrt((1-(n/N))*(pub_s2/n)) # SE of a proportion under SRS with fpc: sqrt((1 - n/N) * s^2/n)
low <- p - 1.96*pub_se
up <- p + 1.96*pub_se
low
## [1] 0.3773908
up
## [1] 0.5559425
A useful variable to stratify on would be the type of golf course (public, semi-private, private), as prices may differ by type. Region of the country may also be useful: since we are given the state, we can segment the country into regions. The course rating could be another useful stratification variable, as higher-rated courses may cost more. A sketch of the region idea follows.
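As an illustration (not part of the original analysis), base R's state.abb and state.region lookup tables can turn the two-letter state codes into regions; any code outside the 50 states would come out NA:
golfsrs$region <- as.character(state.region[match(golfsrs$state, state.abb)])
table(golfsrs$region, useNA = "ifany") # candidate strata and their sample counts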
The estimate of the total number of courses that have caddies is 1537.2. The standard error of this estimate is 421.09.
N <- 15372 #New N given in problem
n <- 120
golfsrs$caddies <- as.numeric(golfsrs$caddy == "y")
cad <- sum(golfsrs$caddies)
y_bar <- cad/120 # 12 of the 120 courses had caddies, so y_bar is 0.1
t_hat <- N*y_bar
s2_cad <- var(golfsrs$caddies)
fpc <- 1-(n/N)
var_t_hat <- (N^2)*(fpc)*(s2_cad/n)
t_hat_se <- sqrt(var_t_hat)
t_hat
## [1] 1537.2
t_hat_se
## [1] 421.0914