Question 1

A census is a survey in which the entire population is measured. Though the census is able to pull from the entire population, a sample may be more useful in some cases. A census may not be as flexible with the current state of events. The census typically comes out every year, but if we wanted to understand a more recent situation closer to the event (for example, peoples reactions to an election, thoughts on gun violence after a shooting, etc.) it is easier and quicker to take a sample.

Additionally, a sample is more monetarily efficient. The census is collected by the Census Bureau. If the exact question we wanted to study was not asked, it might be difficult to use the entire population.

Question 2

The two types of nonsampling errors are selection bias and measurement error.

Selection bias occurs when some part of the population is not sampled. Some examples of this are

-undercoverage which is failing to include all of the target population in the sampling frame

-overcoverage which is including population units in the sampling frame that are not in the target population

-nonresponse which is failing to obtain responses from all of the chosen sample

Measurement Error occurs if the response has a tendency to differ from the true value in one direction. An example of this is somoene lying about the times they workout. Their answer may be a bit higher than they actually do.

Question 3

The way this question is asked may lead the reader to answering in a specific way. By including external information, such as recent tragic shootings, stirs emotion. Additionally, the wording “indiscriminantly kill dozens of people very rapidly” forces an individual who disagrees to really think about their ethics and how they feel about the topic which may skew the results.

The question should be state a bit more simply and without any external bias. Also, providing a definition of what an assault rifle might be helpful. A better way to state the question would be:

“Do you think automatic rifles (selective-fire rifle that uses an intermediate cartridge and a detachable magazine) should be banned?”

Question 4

Statement: A simple random sample is a sample where every observation has the same sampling probability.

True, thats the definition of a simple random sample. Every possible subset of size n from the population has an equal chance of being in the sample.

Source: Lohr

In a simple random sample, each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process, and each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals.

Source: Wikipedia

Question 5

So, first I googled all the colleges in the US that have official Quidditch Teams at their school and found this list: https://www.usquidditch.org/teams. I am counting only individuals who are part of this quidditch team as individuals playing quidditch.

There are 143 colleges and adult communities, and of that list 107 are listed as colleges.

Additionally, 107 of 143 are strictly college. We can find the ratio of students that are playing quidditch at each school vs the total population.

Once we have this, we can find out the type of schools we looked at and use post stratification to weight the schools (public, private) in order to estimate the total number of students that play Quidditch in the U.S.

Question 6

One would need a sample of at least 475 in order to estimate with 95% confidence and 0.045 MoE how many adults between the ages of 30 to 50 have heard of Lil Baby.

ceiling(n_0 <- (1.96^2*(1/2)*(1-1/2)) / (0.045)^2)
## [1] 475

Question 7

Part A.

The mean is 8.8. This histogram is depicted below and shows a fairly normal distribution.

pop <- data.frame(id = 1:10,
y = c(1,2,4,5,9,10,10,13,14,20),
x = c(6,11,14,15,24,26,28,32,34,45),
strata = c(1,1,2,2,3,3,4,4,5,5))

samps <- combn(pop$y,5)
y_bar <- colMeans(samps)

hist(y_bar, main="Sampling Distribution of Y")
abline(v=mean(y_bar),col="red")

mean(y_bar)
## [1] 8.8

Part B

The standard error is 1.88.

Of the 252 confidence intervals, 244 cover the true mean of y. Thus, the coverage is about 96.83%.

diff <- pop$y - mean(y_bar) 
fpc <- 1- (5/10)

s2 <- var(pop$y) #We estimate the variance with s^2
var <- fpc*s2/5 #The variance is the fpc*s2/n
se <- sqrt(var) #We take the square root of the variance to get the standard error
se
## [1] 1.878534
sd(y_bar)
## [1] 1.882272
low <- y_bar - 1.96*se
up <- y_bar + 1.96*se

conf <- data.frame(low,up)
coverage<-conf$low < mean(y_bar) & conf$up > mean(y_bar)

length(coverage)
## [1] 252
sum(coverage, na.rm=TRUE)
## [1] 244
244/252
## [1] 0.968254

Part C.

The mean of this sampling distribution is 8.8 and the sampling distribution is shown below. The histogram is shown below. This sampling distribution is much less narrower than the SRS.

pop <- data.frame(id = 1:10, y = c(1,2,4,5,9,10,10,13,14,20),
x = c(6,11,14,15,24,26,28,32,34,45),
strata = c(1,1,2,2,3,3,4,4,5,5))

str1 <-unlist(list(pop[pop$strata==1,]$y))
str2 <-unlist(list(pop[pop$strata==2,]$y))
str3 <- unlist(list(pop[pop$strata==3,]$y))
str4 <- unlist(list(pop[pop$strata==4,]$y))
str5 <- unlist(list(pop[pop$strata==5,]$y))

pop_str <- expand.grid(str1,str2,str3,str4,str5) 
#Source https://stackoverflow.com/questions/18705153/generate-list-of-all-possible-combinations-of-elements-of-vector

pop_str #All 32 combinations of possible 10 choose 5 = 252
##    Var1 Var2 Var3 Var4 Var5
## 1     1    4    9   10   14
## 2     2    4    9   10   14
## 3     1    5    9   10   14
## 4     2    5    9   10   14
## 5     1    4   10   10   14
## 6     2    4   10   10   14
## 7     1    5   10   10   14
## 8     2    5   10   10   14
## 9     1    4    9   13   14
## 10    2    4    9   13   14
## 11    1    5    9   13   14
## 12    2    5    9   13   14
## 13    1    4   10   13   14
## 14    2    4   10   13   14
## 15    1    5   10   13   14
## 16    2    5   10   13   14
## 17    1    4    9   10   20
## 18    2    4    9   10   20
## 19    1    5    9   10   20
## 20    2    5    9   10   20
## 21    1    4   10   10   20
## 22    2    4   10   10   20
## 23    1    5   10   10   20
## 24    2    5   10   10   20
## 25    1    4    9   13   20
## 26    2    4    9   13   20
## 27    1    5    9   13   20
## 28    2    5    9   13   20
## 29    1    4   10   13   20
## 30    2    4   10   13   20
## 31    1    5   10   13   20
## 32    2    5   10   13   20
y_bar_str <- rowMeans(pop_str)
mean_y <- sum(rowMeans(pop_str))/32
mean_y
## [1] 8.8
hist(y_bar_str)
abline(v=mean(mean_y),col="red")

Part D.

Below is the sampling distibtion of y. It is much more skewed than the other two. The mean is 8.7722.

#Find x_bar, y_bar, and Bhat to find y_bar_r

x_bar_u <- sum(pop$x)/10
x_comb <- as.data.frame(combn(pop$x, 5, FUN = NULL, simplify = TRUE))
xbar <- colSums(x_comb)/5
y_comb <- as.data.frame(combn(pop$y, 5, FUN = NULL, simplify = TRUE))
ybar <- colSums(y_comb)/5

avg_xy <- data.frame(xbar,ybar)
avg_xy$Bhat <- ybar/xbar

y_bar_r <- avg_xy$Bhat*x_bar_u

hist(y_bar_r)
abline(v=mean(y_bar_r),col="red")

Part E.

We know that the true mean of the population is 8.8. We will compare SRS, stratified random sampling, and ratio estimation to this.

When we graph our bias’s, we notice that they all are normally distributed. The SRS bias is much wider than the other two, followed by ratio, and then stratified random sampling. From class, we know that the stratified random sampling method has much less bias than simple random sampling.

From our calculations on relative variances, we notice that the ratio estimator is more efficient than stratified random sampling. The simple random sampling method is the least efficient. From the relative efficiency, one can argue that the stratified estimator is less biased than the ratio estimator. However, looking at the histogram above we see that the stratified random sampling is less variable. So, it can be argued both estimators are better than the estimator for simple random sampling but it is unclear which is best.

#Bias 
b_srs <- y_bar - 8.8 #RED
b_str <- y_bar_str - 8.8 #GREEN
b_rat <- y_bar_r - 8.8 #PURPLE/BLUE

hist(b_srs, col = "red")
hist(b_str, add=T, col=rgb(0, 1, 0, 0.5))
hist(b_rat, add=T, col=rgb(0,0,1,0.5))

#Relative Efficiency 
var(y_bar_r)/var(y_bar_str) #ratio estimation vs stratified random sampling
## [1] 0.5536386
var(y_bar_str)/var(y_bar) #stratified random sampling vs simple random sampling
## [1] 0.1398507
var(y_bar_r)/var(y_bar) #simple random sampling vs ratio estimation
## [1] 0.07742676

Question 8

Part A.

The average weekday price of 9 holes of golf is 20.15. The standard error here is 1.629866. The coefficient of variablility is estimated to be 0.08087325.

library(SDaA)
data('golfsrs')

y_bar <- mean(golfsrs$wkday9)
y_bar
## [1] 20.15333
s2 <- var(golfsrs$wkday9)
n <- nrow(golfsrs)
N <- 14938 #Took this number from Chapter 2 Problem 16 from our HW
se <- sqrt((1-n/N)*s2/n)
se
## [1] 1.629866
CV <- se/y_bar
CV
## [1] 0.08087325

Part B.

We are 95% confident that the proportion of golf courses that are public in the United States lies within the interval (0.3708411, 0.5624922).

head(golfsrs)
##      rn state               course holes type yearblt wkday18 wkday9
## 1  5491    RI Warwick Country Club    18 priv    1923      25     25
## 2 10276    VT   Haystack Golf Club    18 semi    1972      40     24
## 3  6025    MN                Pierz     9  pub    1939      NA     10
## 4  9739    GA      Southerness G C    18 semi    1991      37     37
## 5  3463    CA     Whispering Lakes    18  pub    1970      17     10
## 6  5883    MN         Hidden Creek    18  pub    1996      16     12
##   wkend18 wkend9 backtee rating par cart18 cart9 caddy pro
## 1      35     25    6453   71.8  69   15.0   7.5     y   y
## 2      45     24    6549   71.1  72   30.0  18.0     n   y
## 3      NA     10    3058   69.2  35   16.5  11.0     n   n
## 4      45     37    6766   72.2  72    0.0   0.0     n   y
## 5      20     10    6706   71.4  72   22.0  15.0     n   y
## 6      18     12    7002   73.5  72   10.0   7.0     n   y
golfsrs$public <- as.numeric(golfsrs$type == 'pub')
sum(golfsrs$public) #56 of the 120 
## [1] 56
p <- 56/120
p
## [1] 0.4666667
pub_s2 <- var(golfsrs$public)
s2_sum <- (pub_s2/sum(golfsrs$public))
pub_se <- sqrt((1-(sum(golfsrs$public)/n))*(s2_sum))

low <- p - 1.96*pub_se
up <- p + 1.96*pub_se

low
## [1] 0.3708411
up
## [1] 0.5624922

Part C.

A useful variable to stratify on would be the type of golf course as the prices may fluctuate based on the type. Another useful variable may be to stratify by the region of the country. As we are given the state, it might be useful to segment the country into regions. The rating of the golf course may be another useful tool as a higher rated golf course may cost more.

Part D.

The estimate of the total number of courses that have caddies is 1537. The standard error of this estimate is 421.09.

N <- 15372 #New N given in problem
n <- 120

golfsrs$caddies <- as.numeric(golfsrs$caddy == "y")
cad <- sum(golfsrs$caddies)
y_bar <- cad/120 #12/120 had caddies so thats why it is 0.1

t_hat <- N*y_bar 
s2_cad <- var(golfsrs$caddies) 
fpc <- 1-(n/N)
var_t_hat <- (N^2)*(fpc)*(s2_cad/n)
t_hat_se <- sqrt(var_t_hat)

t_hat
## [1] 1537.2
t_hat_se
## [1] 421.0914