1. Refer to Question 11 from Assignment 1:
  1. Compute the probability that another random sample of the same size will produce a sample mean that is at least the same value as the value of \(\bar{X}\) you observed in Question 11 of Assignment 1.

R./ Since we know the distribution is normal, the mean of the sample of 12 is 5.6875, the population mean is 5.0 and the standard deviation is 1.5 (as in the code below), we can use pnorm() with a standard error of 1.5/sqrt(12) to find the probability that another sample of 12 has a mean below 5.6875. Subtracting that probability from 1 gives the probability that another sample of 12 deliveries has a mean of at least 5.6875: 0.0562

p5.68 = pnorm(5.6875, 5.0, (1.5/sqrt(12)))
patleast5.68 = 1 - p5.68
patleast5.68
[1] 0.0561756
  1. Refer to the value of the sample standard deviation S (which you computed in Exercise 11 of Assignment 1): Compute the probability that another random sample (again, of the same size) will yield a sample standard deviation that is between 0.5 hour and 1 hour.

R./ P(0.5 < SD < 1) = P(SD < 1) - P(SD < 0.5)

Using Chi-square:

P(SD < 0.5) = P(11*0.5 / (1.5/sqrt(12))) = P(12.701) = pchisq(12.701, 11) = 0.6867

P(SD < 1) = P(11*1 / (1/sqrt(12))) = P(38.105) = pchisq(38.105, 11) = 0.9999

P(0.5 < SD < 1) = 0.9999 - 0.6867 = 0.3132

Xsq0.5 = (11*0.5 / (1.5/sqrt(12)))
P0.5 = pchisq(Xsq0.5, 11)
P0.5
[1] 0.686732
Xsq1 = (11*1 / (1/sqrt(12)))
P1 = pchisq(Xsq1, 11)
P1
[1] 0.9999249
P = P1 - P0.5
P
[1] 0.3131929
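
For comparison, the usual sampling result for a normal population is that (n-1)S²/σ² follows a chi-square distribution with n-1 degrees of freedom, so the bounds on S are squared before scaling. A minimal sketch of that form, assuming σ = 1.5 hours (the value used in part (a) above; the names `sigma`, `chisq.lo`, `chisq.hi` and `p.between` are illustrative and not from the assignment):

```r
n = 12          # sample size from Question 11
sigma = 1.5     # ASSUMED population standard deviation (hours)

# (n - 1) * S^2 / sigma^2 ~ chi-square with n - 1 degrees of freedom
chisq.lo = (n - 1) * 0.5^2 / sigma^2   # statistic when S = 0.5
chisq.hi = (n - 1) * 1.0^2 / sigma^2   # statistic when S = 1

# P(0.5 < S < 1) as a difference of chi-square CDF values
p.between = pchisq(chisq.hi, df = n - 1) - pchisq(chisq.lo, df = n - 1)
p.between
```

If a different σ applies, only the `sigma` line needs to change.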
  1. A 2012 poll carried out by Ipsos Reid in found that “42% of Canadians, who live outside of Quebec, believe Quebec will separate from Canada at some point in the future.” A pollster wishes to see if this sentiment is still present for Canadians who live outside of the province of Quebec.
  1. The pollster has determined that they are going to randomly sample n=1426 Canadian residents who are not residents of Quebec in an attempt to estimate p - the proportion of all Canadians who live outside of Quebec (“Rest of Canada”) - who believe that Quebec will separate from Canada within the next 10 years. Describe the distribution of \(\hat{p}\), the proportion of n=1426 randomly chosen Canadians who live outside of Quebec who believe Quebec will separate from Canada within the next 10 years. (Ensure that your description provides a (i) distribution shape (ii) a balancing point and (iii) a measure of spread.)

R./ The distribution is bell shaped (approximately normal). The balancing point corresponds to the population proportion (0.42). The standard deviation is 0.01307.

x = seq(300, 800, 0.1)
phat = (x/1426)
sdphat = sqrt(0.42*(1-0.42) / 1426)
sdphat
[1] 0.0130701
plot(phat, dnorm(phat, 0.42, sdphat), yaxt = 'n', xlab="Values of the Sample Proportion", ylab = "Density", main="Distribution of Sample Proportion from n = 1426", type="h", col='blue')
abline(v=0.42, col="blue") #mark the balancing point; base graphics calls are not chained with +

  1. A recent poll of n=1426 Canadians who are not residents of Quebec was taken. Of these, 541 thought that Quebec will separate from Canada in the next 10 years (356 indicated “might happen”; 128 responded “likely to happen”; 57 indicated “definitely will happen”). The sample proportion is computed to be \(\hat{p} = 541/1426 = 0.3794\). Under the condition of the 2012 poll result, how likely is it for another random sample of n=1426 Canadians (who reside outside of Quebec) to produce a sample proportion that is at most 0.3794?

R./ I calculated the standard deviation of the sample proportion and used pnorm() to find the probability that another sample has a sample proportion of at most 0.3794, which is 0.0009471

sdphat = sqrt(0.42*(1-0.42) / 1426) #calculate the standard deviation of phat under p = 0.42
sdphat
[1] 0.0130701
pless037 = pnorm(0.3794, mean=0.42, sd=sdphat)
pless037
[1] 0.000947136
  1. Consider the steps and associated R Code required to generate a distribution of the sample proportion, \(\hat{p}\), when sampling n=1426 Canadians who live outside of Quebec, then determining the proportion who believe Quebec will separate from Canada within the next 10 years. Carry out a simulation where you simulate 1000 random samples of n=1426. Create, then run your code to determine the proportion of your \(\hat{p}\)s that are less than or equal to 0.3794. Provide this proportion.

R./ After running the simulation, I found that the proportion of phats at or below 0.3794 is 0.001. This proportion changes every time the simulation is run; I have gotten values from 0 to 0.004

nsamples = 1000  #no. of simulations
sample.phat = numeric(nsamples) #data vector to hold 1000 sample proportions, each from a sample of n = 1426
size = 1426
s1426data = rbinom(1,size,0.42) #As described https://www.programmingr.com/examples/neat-tricks/sample-r-function/r-rbinom/
counter = 0

for(i in 1:nsamples){
  s1426data = rbinom(1,size,0.42) #simulates the number of "yes" answers in one sample of 1426 with p = 0.42
  sample.phat[i] = s1426data/size #compute the phat of the sample of 1426
  if (sample.phat[i] <= 0.3794){ #count phats less than or equal to 0.3794
      counter = counter + 1
  }
}

sample.phat[1:5]
[1] 0.4460028 0.4354839 0.3997195 0.4137447 0.4039271
sample.phat[995:999]
[1] 0.4235624 0.4004208 0.4417952 0.4130435 0.3920056
pphatless037 = counter/nsamples
pphatless037
[1] 0.001
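
The same simulation can be written without an explicit loop or counter. A minimal sketch, assuming a fixed seed is acceptable for reproducibility (the seed value and the names `counts`, `sim.phat` and `prop.low` are illustrative):

```r
set.seed(602)                          # arbitrary seed, only for reproducibility
nsamples = 1000                        # number of simulated samples
size = 1426                            # respondents per sample

counts = rbinom(nsamples, size, 0.42)  # one simulated count per sample
sim.phat = counts / size               # 1000 simulated sample proportions
prop.low = mean(sim.phat <= 0.3794)    # share of phats at or below 0.3794
prop.low
```

Taking the mean of a logical vector gives the proportion of TRUE values, which replaces the counter.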
  1. Billy purchases one 6-49 lottery ticket every week and keeps track of the number of “matches” he has on each of his tickets. To be clear, a “match” will occur when a number on his ticket matches a number that appears in the winning combination. A random variable X that keeps track of the number of matching numbers Billy experiences per week has the following probability distribution function, mean and standard deviation:

\(P(X=x) = \dfrac{\binom{6}{x}\binom{43}{6-x}}{\binom{49}{6}}, \quad x = 0, 1, 2, 3, 4, 5, 6\)

\(E(X) = \mu_X = \dfrac{36}{49} = 0.7347\)

\(SD(X) = \sigma_X = 0.75998 \approx 0.76\)

Billy claims that in a year (52 weeks), on average, he manages to have at least one matching number on his 6-49 ticket. What do you think about Billy’s claim? Provide a brief commentary about Billy’s claim using your current knowledge of statistics and probability theory.

R./ The probability of getting at least one matching number is equal to 1 minus the probability of not matching any number, P(X = 0).

P(X >= 1) = 1 - P(X = 0). We obtain P(X = 0) by replacing x with 0 in the probability function: P(X = 0) = choose(6, 0) * choose(43, 6 - 0) / choose(49, 6) = 0.4360

Replacing this value: P(X >= 1) = 1 - 0.4360 = 0.5640

R./ The probability of getting at least one matching number in any single weekly draw is 0.564, which is greater than the probability of getting no matching numbers, 0.436. It is possible that Billy does get at least one matching number every week for a year.

P0 = choose (6, 0) * choose(43, 6) / choose(49, 6) 
P0
[1] 0.435965
Pmore1 = 1 - P0
Pmore1
[1] 0.564035
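
Billy's claim can also be read as a statement about his average number of matches per ticket over 52 weeks. A rough sketch of that reading, assuming the Central Limit Theorem applies to the mean of 52 independent weekly tickets (the names `mu.x`, `sd.x`, `se.xbar` and `p.avg.atleast1` are illustrative, not from the assignment):

```r
mu.x = 36 / 49        # E(X) given in the question
sd.x = 0.75998        # SD(X) given in the question
weeks = 52            # one ticket per week for a year

se.xbar = sd.x / sqrt(weeks)   # standard error of the yearly average

# Approximate P(average matches per ticket >= 1) with a normal model
p.avg.atleast1 = 1 - pnorm(1, mean = mu.x, sd = se.xbar)
p.avg.atleast1
```

Under these assumptions the probability comes out below 1%, so a yearly average of one or more matches per ticket would be unusual.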
  1. A common measure of toxicity for any pollutant is the concentration of the pollutant that will kill half of the test species in a given amount of time (usually about 96 hours for the fish species). This measurement is called the LC50, which refers to the lethal concentration killing 50% of the test species.

The Environmental Protection Agency has collected data on LC50 measurements for certain chemicals likely to be found in freshwater and lakes. For a certain species of fish, the LC50 measurements (in parts per million) for DDT in 12 experiments to determine the LC50 “dose” are 16,5,21,19,10,5,8,2,7,2,4,9

  1. Use R studio to create the bootstrap distribution of the sample mean \(\bar{X}_{Boot,LC50}\). Use 2000 “bootstraps” in your work, and display the distribution.

R./ As follows

library(mosaic) #provides favstats() and qdata(), both used below
library(ggplot2)
library(lattice)
lc50 = c(16,5,21,19,10,5,8,2,7,2,4,9)  #create a vector with the values of LC50
lc50.df = data.frame(lc50)  # convert the above vector into a dataframe
lc50.df
ntimes = 100
bootdata = numeric(length(lc50.df$lc50)*ntimes)
for(i in 1:ntimes)
{  if (i == 1) bootdata = lc50.df$lc50
   else bootdata = c(lc50.df$lc50, bootdata)
}

N = 2000
n.size = length(lc50.df$lc50)
boot.mean = numeric(N)
boot.data = numeric(n.size)
#
for(i in 1:N){
  
  boot.data = sample(lc50.df$lc50, n.size, replace=TRUE) #sample with replacement 
  boot.mean[i] = mean(boot.data)
}
bootstrap.mean = data.frame(boot.mean) #data frame holding 2000 different bootstrap sample means
head(bootstrap.mean, 4)
tail(bootstrap.mean, 4)
ggplot(bootstrap.mean, aes(x = boot.mean)) + geom_histogram(col="red", fill="blue", binwidth = 0.5) + xlab("Values of Bootstrap Mean") + ylab("Count") + ggtitle("Distribution of Bootstrap Statistic: Sample Mean")
favstats(~boot.mean, data=bootstrap.mean)
  1. From your result in (a), compute the 95% bootstrap (percentile) confidence interval for \(\mu_{LC50}\), the mean LC50 measurement for DDT.

R./ Using qdata() on the bootstrap means to get the 2.5th and 97.5th percentiles, which are 5.75 and 12.5, and then displaying the distribution with these limits marked

p25.lc50 = qdata(~boot.mean, c(0.025, 0.975), data=bootstrap.mean)[1]
p25.lc50[1]
2.5% 
5.75 
p975.lc50 = qdata(~boot.mean, c(0.025, 0.975), data=bootstrap.mean)[2]
p975.lc50
97.5% 
 12.5 
ggplot(bootstrap.mean, aes(x = boot.mean)) + geom_histogram(col="red", fill="blue", binwidth = 0.5) + xlab("Values of Bootstrap Mean") + ylab("Count") + ggtitle("Distribution of Bootstrap Statistic: Sample Mean with 95% confidence Interval ") + geom_vline(xintercept = p25.lc50, color="red") + geom_vline(xintercept = p975.lc50, color="red")

  1. Repeat your estimation of \(\mu_{LC50}\), using the “other” confidence interval covered in Data 602. In the context of these data, interpret the meaning of the confidence interval. State any conditions/assumptions that are required in the computation of this confidence interval.

R./ Using Student’s t, the 95% confidence interval runs from 4.9181 (lower limit) to 13.0819 (upper limit)

Ttest.lc50 = t.test(~lc50, data=lc50.df)$conf #using default value of 0.95 confidence interval 
Ttest.lc50[1]
[1] 4.91814
Ttest.lc50[2]
[1] 13.08186
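
The interval that t.test() reports can be reproduced by hand as the sample mean plus or minus a t critical value times s/sqrt(n). A minimal sketch using base R only (the names `xbar`, `se`, `t.crit` and `ci.t` are illustrative); it relies on the usual conditions of a random sample from an approximately normal population:

```r
lc50 = c(16, 5, 21, 19, 10, 5, 8, 2, 7, 2, 4, 9)  # LC50 data from the question
n = length(lc50)

xbar = mean(lc50)                # sample mean
se = sd(lc50) / sqrt(n)          # estimated standard error of the mean
t.crit = qt(0.975, df = n - 1)   # 97.5th percentile of t with n - 1 df

ci.t = c(xbar - t.crit * se, xbar + t.crit * se)
ci.t
```

The two limits should agree with the `t.test()` output shown above.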
  1. Compare your results in parts (b) and (c). If you were to report one of these confidence intervals, which would you report? Explain your answer.

R./ Bootstrap 95% CI = (5.75, 12.5) (red in the graph). Student’s t 95% CI = (4.9181, 13.0819) (blue in the graph). To be on the safer side, I want to report the tighter interval, so I prefer the interval from the bootstrap

ggplot(bootstrap.mean, aes(x = boot.mean)) + geom_histogram(col="red", fill="blue", binwidth = 0.5) + xlab("Values of Bootstrap Mean") + ylab("Count") + ggtitle("Distribution of Bootstrap Statistic: Sample Mean with 95% confidence Interval ") + geom_vline(xintercept = p25.lc50, color="red") + geom_vline(xintercept = p975.lc50, color="red") + geom_vline(xintercept = Ttest.lc50[1], color="blue") + geom_vline(xintercept = Ttest.lc50[2], color="blue")

  1. Does one’s educational level influence their opinion about vaccinations? A recent Angus Reid2 survey was taken. Each person sampled was asked to respond to the statement “The science around vaccinations isn’t clear.”

Respondents either “strongly agree”, “moderately agree”, “moderately disagree”, or “strongly disagree”. The sample was partitioned by level of education.

There were n=670 respondents whose highest level of education was high school or less, of which 348 “disagreed” (moderately disagree or strongly disagree). There were also n=376 whose highest level of education was at least an undergraduate university education. Of these, 274 disagreed.

  1. Consider the population consisting of all persons whose highest level of education was high school or less and the bootstrap statistic \(\hat{p}_{Boot,HS}\). Using 1000 iterations/replications, create a bootstrap distribution of \(\hat{p}_{HS}\). Display your distribution.

R./ From the information given we can calculate pHS = 348 / 670 = 0.52. I use this proportion to create a random list of zeros (agreed) and ones (disagreed). I then tabulate the list to see how many zeros and ones it contains. Next I loop 1000 times, each time drawing a sample of size 670 from the list with replacement. The mean of each sample is calculated and stored in a vector called phats.hs; then I create a data frame from phats.hs and display its histogram

library(binom)
hs = rbinom(n=670,size=1,prob=0.52)
table(hs)
hs
  0   1 
343 327 
phats.hs = numeric(1000) #create a vector to store each of the 1000 bootstrap values of the sample proportion
for(i in 1:1000){
      temp.data = sample(hs, length(hs), replace=TRUE) #could have used resample(hs) here
      phats.hs[i] = mean(temp.data)
}
boot_phat_hs.df = data.frame(phats.hs)
head(boot_phat_hs.df, 4)
tail(boot_phat_hs.df, 4)
ggplot(boot_phat_hs.df, aes(x = phats.hs)) + geom_histogram(col="green", fill="blue", binwidth=0.01) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion")
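
The loop above resamples from a vector simulated with rbinom(), so the starting counts (343 and 327 here) vary from run to run. An alternative sketch, assuming the observed survey counts should serve as the original sample (348 who disagreed and 322 who did not, out of 670; the seed and the names `hs.obs` and `phats.hs.obs` are illustrative):

```r
set.seed(602)  # arbitrary seed, only for reproducibility

# Observed sample: 1 = disagreed (348 people), 0 = did not disagree (322 people)
hs.obs = c(rep(1, 348), rep(0, 670 - 348))

phats.hs.obs = numeric(1000)  # holds the 1000 bootstrap sample proportions
for (i in 1:1000) {
  resampled = sample(hs.obs, length(hs.obs), replace = TRUE)  # resample with replacement
  phats.hs.obs[i] = mean(resampled)                           # bootstrap proportion
}

mean(phats.hs.obs)  # should sit close to 348/670
```

Built this way, the bootstrap distribution is centred on the observed sample proportion rather than on a freshly simulated one.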

  1. Now consider a different population that consists of all persons whose highest level of education was at least an undergraduate degree. Repeat part (a), creating a bootstrap distribution for \(\hat{p}_{Boot,Uni}\). (Again, display your distribution).

R./ From the information given we can calculate pUni = 274 / 376 = 0.73. I use this proportion to create a random list of zeros (agreed) and ones (disagreed). I then tabulate the list to see how many zeros and ones it contains. Next I loop 1000 times, each time drawing a sample of size 376 from the list with replacement. The mean of each sample is calculated and stored in a vector called phats.uni; then I create a data frame from phats.uni and display its histogram

uni = rbinom(n=376,size=1,prob=0.73)
table(uni)
uni
  0   1 
108 268 
phats.uni = numeric(1000) 
for(i in 1:1000){
      temp.data = sample(uni, length(uni), replace=TRUE) 
      phats.uni[i] = mean(temp.data)
}
boot_phat_uni.df = data.frame(phats.uni)
head(boot_phat_uni.df, 4)
tail(boot_phat_uni.df, 4)
ggplot(boot_phat_uni.df, aes(x = phats.uni)) + geom_histogram(col="green", fill="blue", binwidth=0.01) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion") 

  1. You wish to estimate \(p_{Uni} - p_{HS}\), the difference between the proportion of all university-educated Canadians who disagree that the science of vaccinations isn’t clear and the proportion of all Canadians whose highest level of completed education is high school who believe the same. You wish to have 95% confidence in your result. Think about the code you created to generate the bootstrap distributions in parts (a) and (b). Modify the code you created in parts (a) and (b) to create a distribution of the bootstrap statistic \(\hat{p}_{Uni} - \hat{p}_{HS}\).

R./ As follows

phats.uni_hs = numeric(1000) 
for(i in 1:1000){
    temp.data.hs = sample(hs, length(hs), replace=TRUE) #could have used resample(hs) here
    temp.data.uni = sample(uni, length(uni), replace=TRUE)  
    phats.uni_hs[i] = mean(temp.data.uni) - mean(temp.data.hs)
}
boot_phat_uni_hs.df = data.frame(phats.uni_hs)
head(boot_phat_uni_hs.df, 4)
tail(boot_phat_uni_hs.df, 4)
  1. Consider your finding in part (c). Compute the 95% bootstrap percentile confidence interval for \(p_{Uni} - p_{HS}\). From your result, is the proportion of persons with at most a high school education who disagree that the science around vaccinations isn’t clear greater than the similar proportion of persons with at least an undergraduate university degree? Write a paragraph that supports your answer.

R./ The proportion of people with at least an undergraduate university education who disagree that the science around vaccinations isn’t clear is greater than the corresponding proportion of people whose highest level of education is high school or less. The 95% bootstrap percentile confidence interval for pUni−pHS is (0.1668, 0.2832). Because the whole interval lies above 0, the data support a difference of roughly 17 to 28 percentage points in favor of the university-educated group.

p25.uni_hs = qdata(~phats.uni_hs, c(0.025, 0.975), data=boot_phat_uni_hs.df)[1]
p25.uni_hs
     2.5% 
0.1667625 
p975.uni_hs = qdata(~phats.uni_hs, c(0.025, 0.975), data=boot_phat_uni_hs.df)[2]
p975.uni_hs
    97.5% 
0.2831919 
ggplot(boot_phat_uni_hs.df, aes(x = phats.uni_hs)) + geom_histogram(col="green", fill="blue", binwidth=0.01) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion") + geom_vline(xintercept = p25.uni_hs, color="red") + geom_vline(xintercept = p975.uni_hs, color="red")

  1. Nanos research3 recently completed a survey of n=1000 Canadians aged 18 years of age or older, asking each “what is your most important national issue of concern?” 163 responded “Inflation”, 149 responded “Environment”, 131 responded “Jobs/Economy”. Those were the Top Three.
  1. Compute a 95% confidence interval for pInflation, the proportion of all Canadians aged 18 years or older for whom “Inflation” is the most important national concern.

R./ I can calculate the 95% confidence interval using the plus-2/plus-4 adjustment, which gives (0.1427, 0.1885)

prop.test(163 + 2, 1000 + 4, correct=FALSE)$conf
[1] 0.142707 0.188537
attr(,"conf.level")
[1] 0.95
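
prop.test() with correct=FALSE returns a score-type interval for the counts it is given. For comparison, a sketch of the plus-four interval computed directly from its formula, with the adjusted proportion (x+2)/(n+4) plus or minus a normal critical value times its standard error (the names `p.tilde`, `se.tilde` and `ci.plus4` are illustrative):

```r
x = 163                      # respondents naming "Inflation"
n = 1000                     # sample size

p.tilde = (x + 2) / (n + 4)  # plus-four adjusted proportion
se.tilde = sqrt(p.tilde * (1 - p.tilde) / (n + 4))
z = qnorm(0.975)             # critical value for 95% confidence

ci.plus4 = c(p.tilde - z * se.tilde, p.tilde + z * se.tilde)
ci.plus4
```

With n = 1000 the two approaches give very similar limits.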
  1. Similar to your work in Question 4(b), create the distribution of the bootstrap statistic \(\hat{p}_{Boot,Inflation}\) and a 95% bootstrap percentile confidence interval for \(p_{Inflation}\).

R./ 95% confidence interval = (0.143, 0.191)

infl = rbinom(n=1000,size=1,prob=0.163)
infl.df = data.frame(infl)
head(infl.df, 4)
tail(infl.df)
N = 1000
n.size = length(infl.df$infl)
boot.proportion_infl = numeric(N)
boot.data_infl = numeric(n.size)


for(i in 1:N){
  
  boot.data_infl = sample(infl.df$infl, n.size, replace=TRUE) #sample with replacement 
  boot.proportion_infl[i] = mean(boot.data_infl)
}
bootstrap.proportion_infl = data.frame(boot.proportion_infl) #data frame holding 1000 different bootstrap sample proportions
head(bootstrap.proportion_infl, 4)
tail(bootstrap.proportion_infl, 4)
p2.5Infl = qdata(~boot.proportion_infl, c(0.025, 0.975), data = bootstrap.proportion_infl)[1]
p2.5Infl
 2.5% 
0.143 
p97.5Infl = qdata(~boot.proportion_infl, c(0.025, 0.975), data = bootstrap.proportion_infl)[2]
p97.5Infl
97.5% 
0.191 
ggplot(bootstrap.proportion_infl, aes(x = boot.proportion_infl)) + geom_histogram(col="green", fill="blue", binwidth=0.005) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion") + geom_vline(xintercept = p2.5Infl, color="red") + geom_vline(xintercept = p97.5Infl, color="red")

  1. A similar survey of Canadians in August 2023 - a little over a month ago - suggested that the proportion of all Canadians who identified “Inflation” as the most important national concern was pInflation,Aug_23=0.13 From your results in (a) and (b), can you infer that the proportion of all Canadians who believe “Inflation” is the most important national issue has increased since August of this year? Why or why not? Ensure you invoke a statistical justification.

R./ I can infer that the proportion of Canadians who believe inflation is the most important issue has increased. The 95% confidence intervals from the most recent survey, (0.1427, 0.1885) in part (a) and (0.143, 0.191) in part (b), both lie entirely above the August value of 0.13, so 0.13 is not a plausible value for the current proportion

  1. Refer to the data encountered in Question 4 of this assignment. Prior to completing parts (a) and (b), carefully read through both parts (a) and (b).
  1. Use the Bootstrap method to compute a 98% confidence interval for \(\tilde{\mu}_{LC50}\), the median LC50 level. Interpret the meaning of this interval in the context of these data.

R./ I will simulate the median with N=2000 bootstrap samples and then calculate the 98% confidence interval using qdata() for the 1st and 99th percentiles. Finally I will graph these results. 98% confidence interval = (4, 16). We can say with 98% confidence that the median LC50 for DDT is between 4 and 16 ppm

N = 2000
n.size = length(lc50.df$lc50)
boot.median = numeric(N)
boot.data = numeric(n.size)


for(i in 1:N){
  
  boot.data = sample(lc50.df$lc50, n.size, replace=TRUE) #sample with replacement 
  boot.median[i] = median(boot.data)
}

bootstrap.median = data.frame(boot.median) #data frame holding 2000 different bootstrap sample medians
head(bootstrap.median, 4)
tail(bootstrap.median, 4)
lc50.median_99 = qdata(boot.median, c(0.01, 0.99), data=bootstrap.median)
lc50.median_99
 1% 99% 
  4  16 
ggplot(bootstrap.median, aes(x = boot.median)) + geom_histogram(col="red", fill="blue", binwidth=0.8) + xlab("Values of the Sample Median") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Median") + geom_vline(xintercept = lc50.median_99[1], color="green") + geom_vline(xintercept = lc50.median_99[2], color="green")

  1. Compute the 95% bootstrap percentile confidence interval for the population standard deviation, σ. In addition, interpret the meaning of your interval in the context of these data.

R./ For this question, I will find the confidence interval for the variance first and then take the square root of its limits, which gives the 95% confidence interval for the sd: (3.1712, 7.8607)

We can say with 95% confidence that the standard deviation of LC50 for DDT is between 3.1712 and 7.8607 ppm

N = 2000
n.size = length(lc50.df$lc50)
boot.var = numeric(N)
boot.data = numeric(n.size)


for(i in 1:N){
  
  boot.data = sample(lc50.df$lc50, n.size, replace=TRUE) #sample with replacement 
  boot.var[i] = var(boot.data)
}

bootstrap.var = data.frame(boot.var) #data frame holding 2000 different bootstrap sample variance
head(bootstrap.var, 4)
tail(bootstrap.var, 4)
lc50.var_95 = qdata(boot.var, c(0.025, 0.975), data=bootstrap.var)
lc50.var_95
    2.5%    97.5% 
10.05682 61.79072 
ggplot(bootstrap.var, aes(x = boot.var)) + geom_histogram(col="red", fill="blue", binwidth=1) + xlab("Values of the Sample Variance") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Variance") + geom_vline(xintercept = lc50.var_95[1], color="green") + geom_vline(xintercept = lc50.var_95[2], color="green")

lc50.sd_95 = sqrt(lc50.var_95)
lc50.sd_95
    2.5%    97.5% 
3.171249 7.860707 
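
An alternative sketch that bootstraps the standard deviation directly rather than going through the variance (base R only; the seed and the names `boot.sd` and `ci.sd` are illustrative, and the limits will differ slightly from the run above because the resamples are random):

```r
set.seed(602)  # arbitrary seed, only for reproducibility
lc50 = c(16, 5, 21, 19, 10, 5, 8, 2, 7, 2, 4, 9)  # LC50 data from Question 4

N = 2000
boot.sd = numeric(N)  # holds the 2000 bootstrap standard deviations
for (i in 1:N) {
  boot.sd[i] = sd(sample(lc50, length(lc50), replace = TRUE))
}

ci.sd = quantile(boot.sd, c(0.025, 0.975))  # 95% percentile interval for sigma
ci.sd
```

Both routes rest on the same resamples in principle, since the square root preserves the ordering of the bootstrap variances.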
  1. A national survey 4 of n=399 “Gen Z”-ers - someone who is born in the years 1996 - 2010 (inclusive) was taken. Each was then asked the following question: “If a federal election were held tomorrow, which one of the following parties would you vote for in your constituency?” The results?

128 responded “Conservative” (Conservative Party of Canada) 96 responded “Liberal” (Liberal Party of Canada) 104 responded “NDP” (New Democratic Party of Canada) Respondents were provided with a few more “closed options”, including the Bloc Quebecois, People’s Party, and Green Party.

  1. Compute the 95% confidence interval for pCon, the proportion of all Gen Z-ers in Canada that will vote for their respective Conservative Member of Parliament candidate/constituency, in an election were “held tomorrow”.

R./ Since we know n and the number who chose Conservative, we can use prop.test() with its default 95% confidence level and the plus-2/plus-4 adjustment. 95% CI for pCon = (0.2788, 0.3697). According to these results we can say with 95% confidence that between 27.88% and 36.97% of Gen Z-ers would vote Conservative.

ci95con.prop = prop.test(128+2, 399+4, correct=FALSE)$conf
ci95con.prop
[1] 0.2788012 0.3697105
attr(,"conf.level")
[1] 0.95
  1. Consider the bootstrap statistic \(\tilde{p}_{Con} = \dfrac{X_{Con}+2}{399+4}\). Write the R code that will generate a bootstrap distribution for \(\tilde{p}_{Con}\). Use 1000 as the number of replications/iterations.
nsamples = 1000  #no. of simulations
sample.phat_con = numeric(nsamples) #data vector to hold 1000 samples representing the gen z-ers that choose conservative
size = 399

for(i in 1:nsamples){
  boot.data = rbinom(1,size,0.32) #simulates the number of Gen Z-ers choosing Conservative in a sample of 399, knowing Pcon = 128/399 = 0.32
  sample.phat_con[i] = (boot.data + 2)/(size +4) #compute the phat using plus2/plus4 and stores it in the sample.phat_con list
}

bootstrap.con = data.frame(sample.phat_con)
head(bootstrap.con, 4)
tail(bootstrap.con, 4)
  1. From your result in part (b), compute a 95% bootstrap confidence interval for pCon.
ci95con.boot = qdata(~sample.phat_con, c(0.025, 0.975), data=bootstrap.con)
ci95con.boot
     2.5%     97.5% 
0.2779156 0.3622829 
ggplot(bootstrap.con, aes(x = sample.phat_con)) + geom_histogram(col="green", fill="blue", binwidth=0.05) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion") + geom_vline(xintercept = ci95con.prop[1], color="red") + geom_vline(xintercept = ci95con.prop[2], color="red") + geom_vline(xintercept = ci95con.boot[1], color="purple") + geom_vline(xintercept = ci95con.boot[2], color="purple")

  1. Consider your results in parts (a) and (c). Compare the two results. If you had to pick one as the “best” estimate for the unknown value of pCon, which one would you select? Provide a justification for your choice.

R./ From a) CI95.prop = (0.2788, 0.3697) –> 0.3697 - 0.2788 = 0.0909 (red). From c) CI95.boot = (0.2779, 0.3623) –> 0.3623 - 0.2779 = 0.0844 (purple)

     These intervals are very close. If I have to choose one, I choose the narrower one, which in this case is the interval found by running the bootstrap
  1. In this question you will use the albertacountiessampled2019.csv. The data appearing in this file are the results of a random sample of n=100 counties in Alberta. For each county, the median household income was computed. This feature appears in the OriginalValue column.

Read this data into a R data frame. To read this into R, you will have to select the Raw tab appearing on the top right-hand side.

The bootstrap t statistic is defined as \(T_{Bootstrap} = \dfrac{\bar{X}_{Boot} - \bar{X}}{S/\sqrt{n}}\), where \(\bar{X}\) and \(S\) are the mean and standard deviation of the sample.

  1. From these data, generate the distribution of TBootstrap. Display this distribution. Use 1000 for the number of bootstraps.
alberta.counties = read.csv("albertacountiessampled2019.csv")
income.median = data.frame(alberta.counties$OriginalValue)
head(income.median, 4)
tail(income.median, 4)
stats = favstats(income.median$alberta.counties.OriginalValue)
stats
sample.mean = stats[6]
sample.mean
sample.sd = stats[7]
sample.sd
N = 1000
sample.size = length(income.median$alberta.counties.OriginalValue)
boot.data = numeric(sample.size)
boot.mean = numeric(N)
boot.t = numeric(N)

for (i in 1:N){
  boot.data = sample(alberta.counties$OriginalValue, sample.size, replace = TRUE)
  boot.mean[i] = mean(boot.data)
  t = (boot.mean[i] - sample.mean) / (sample.sd / sqrt(sample.size))
  t
  boot.t[i] = t
}

boot.meandf = data.frame(boot.mean)
head(boot.meandf)
tail(boot.meandf)
boot.tval = c(N)
for(i in 1:N){
  boot.tval[i] = boot.t[[i]][1]
}

boot.tdf = data.frame(boot.tval)
head(boot.tdf)
tail(boot.tdf)
ggplot(boot.tdf, aes(x = boot.tval)) + geom_histogram(col="red", fill="blue", binwidth=0.4) + xlab("Values of Bootstrap T") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap T")

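  1. Refer to part (a): A Bootstrap t Confidence Interval for μ is defined as: \(\left[\bar{X} - t^{*}_{\alpha/2}\left(\frac{S}{\sqrt{n}}\right),\ \bar{X} + t^{*}_{1-\alpha/2}\left(\frac{S}{\sqrt{n}}\right)\right]\) where \(t^{*}_{\alpha/2}\) and \(t^{*}_{1-\alpha/2}\) are the \(\alpha/2\) and \(1-\alpha/2\) percentiles from the \(T_{Bootstrap}\) distribution.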

Compute the 95% bootstrap t confidence interval for μMedian_Family_Income for all the counties in Alberta.

Note: \(-t_{\alpha/2}\) will be a negative value due to the nature of the Student’s t distribution. Hence the “-” sign in front.

percentileT = quantile(boot.tdf$boot.tval, c(0.025, 0.975))
percentileT
     2.5%     97.5% 
-2.058206  1.842106 
meanCI2.5 = sample.mean - (percentileT[1] * (sample.sd/sqrt(sample.size)))
meanCI2.5$mean
[1] 92128.91
meanCI97.5 = sample.mean - (percentileT[2] * (sample.sd/sqrt(sample.size)))
meanCI97.5$mean
[1] 83239.22
  1. Compute the (i) 95% bootstrap percentile confidence interval estimate of μMedian_Family_Income and (ii) the 95% Student’s t confidence interval for μMedian_Family_Income.
medianIncome.ci95 = qdata(~alberta.counties.OriginalValue, c(0.025, 0.975), data=income.median)
medianIncome.ci95
    2.5%    97.5% 
 34747.5 131119.5 
BootTci95 = qdata(boot.tval, c(0.025, 0.975), data=boot.tdf)
BootTci95
     2.5%     97.5% 
-2.058206  1.842106 
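
Part (c) asks for intervals for the mean, so the percentile interval is taken over bootstrap sample means and the Student's t interval comes from t.test(). A minimal sketch on synthetic stand-in data, since the CSV is not reproduced here (the helper `mean_cis`, the seed, and the simulated `income` vector are all hypothetical):

```r
# Hypothetical helper: bootstrap percentile and Student's t intervals for a mean
mean_cis = function(x, N = 1000, level = 0.95) {
  alpha = 1 - level
  boot.mean = numeric(N)
  for (i in 1:N) {
    boot.mean[i] = mean(sample(x, length(x), replace = TRUE))  # resample, then average
  }
  list(
    percentile = quantile(boot.mean, c(alpha / 2, 1 - alpha / 2)),
    student.t = as.numeric(t.test(x, conf.level = level)$conf.int)
  )
}

set.seed(602)                                   # arbitrary seed
income = rnorm(100, mean = 87000, sd = 23000)   # SYNTHETIC stand-in for OriginalValue
cis = mean_cis(income)
cis$percentile   # (i) bootstrap percentile interval for the mean
cis$student.t    # (ii) Student's t interval for the mean
```

With the assignment data, the call would be `mean_cis(alberta.counties$OriginalValue)`.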

D. Consider your results in parts (b) and (c): If you were to report one of these confidence intervals to a client, what would be your consultation? Provide a short paragraph explaining what the value of μMedian_Family_Income Could possibly be, and why you chose the method of confidence interval estimation to report from.

I would report the 95% confidence interval found in part (b), as it is tighter than the one reported in part (c):

  1. Part (b): 83239.22 - 92128.91
  2. Part (c): 34747.50 - 131119.50

---
title: "DATA602_ASSIGNMENT2"
output: html_notebook
---
1. Refer to Question 11 from Assignment 1:

a. Compute the probability that another random sample of the same size will produce a sample mean that is at least the same value as the value of \(\bar{X}\) you observed in Question 11 of Assignment 1.

R./ Since we know the distribution is normal, the mean of the sample of 12 is 5.6875, the population mean is 5.0 and the standard deviation is 1.5 (as in the code below), we can use pnorm() with a standard error of 1.5/sqrt(12) to find the probability that another sample of 12 has a mean below 5.6875. Subtracting that probability from 1 gives the probability that another sample of 12 deliveries has a mean of at least 5.6875: 0.0562

```{r}
p5.68 = pnorm(5.6875, 5.0, (1.5/sqrt(12)))
patleast5.68 = 1 - p5.68
patleast5.68
```
b. Refer to the value of the sample standard deviation S (which you computed in Exercise 11 of Assignment 1): Compute the probability that another random sample (again, of the same size) will yield a sample standard deviation that is between 0.5 hour and 1 hour.
R./ P(0.5 < SD < 1) = P(SD < 1) - P(SD < 0.5)
Using Chi-square
P(SD < 0.5) = P(11*0.5 / (1.5/sqrt(12))) = P(12.701) = pchisq(12.701, 11) = 0.6867
P(SD < 1) = P(11*1 / (1/sqrt(12))) = P(38.105) = pchisq(38.105, 11) = 0.9999
P(0.5 < SD < 1) = 0.9999 - 0.6867 = 0.3132

```{r}
Xsq0.5 = (11*0.5 / (1.5/sqrt(12)))
P0.5 = pchisq(Xsq0.5, 11)
P0.5
Xsq1 = (11*1 / (1/sqrt(12)))
P1 = pchisq(Xsq1, 11)
P1
P = P1 - P0.5
P
```
2. A 2012 poll carried out by Ipsos Reid in found that “42% of Canadians, who live outside of Quebec, believe Quebec will separate from Canada at some point in the future.” A pollster wishes to see if this sentiment is still present for Canadians who live outside of the province of Quebec.

a. The pollster has determined that they are going to randomly sample n=1426 Canadian residents who are not residents of Quebec in an attempt to estimate p - the proportion of all Canadians who live outside of Quebec (“Rest of Canada”) - who believe that Quebec will separate from Canada within the next 10 years. Describe the distribution of \(\hat{p}\), the proportion of n=1426 randomly chosen Canadians who live outside of Quebec who believe Quebec will separate from Canada within the next 10 years. (Ensure that your description provides a (i) distribution shape (ii) a balancing point and (iii) a measure of spread.)
 
 R./ The distribution is bell shaped (approximately normal).
 The balancing point corresponds to the population proportion (0.42).
 The standard deviation is 0.01307.
```{r}
x = seq(300, 800, 0.1)
phat = (x/1426)
sdphat = sqrt(0.42*(1-0.42) / 1426)
sdphat
```


```{r}
plot(phat, dnorm(phat, 0.42, sdphat), yaxt = 'n', xlab="Values of the Sample Proportion", ylab = "Density", main="Distribution of Sample Proportion from n = 1426", type="h", col='blue')
abline(v=0.42, col="blue") #mark the balancing point; base graphics calls are not chained with +
```
 b. A recent poll of n=1426 Canadians who are not residents of Quebec was taken. Of these, 541 thought that Quebec will separate from Canada in the next 10 years (356 indicated “might happen”; 128 responded “likely to happen”; 57 indicated “definitely will happen”). The sample proportion is computed to be \(\hat{p} = 541/1426 = 0.3794\). Under the condition of the 2012 poll result, how likely is it for another random sample of n=1426 Canadians (who reside outside of Quebec) to produce a sample proportion that is at most 0.3794?
 
R./ Under the 2012 result (p = 0.42), the standard deviation of the sample proportion is about 0.01307. Using pnorm() with this mean and standard deviation, the probability that another sample of n = 1426 gives a sample proportion of at most 0.3794 is about 0.0009471.
```{r}
sdphat = sqrt(0.42*(1-0.42) / 1426) # standard deviation of phat under p = 0.42
sdphat
pless037 = pnorm(0.3794, mean=0.42, sd=sdphat)
pless037
```
c. Consider the steps and associated R Code required to generate a distribution of the sample proportion, pˆ, when sampling n=1426 Canadians who live outside of Quebec, then determining the proportion who believe Quebec will separate from Canada within the next 10 years. Carry out a simulation where you simulate 1000 random samples of n=1426. Create, then run your code to determine the proportion of your pˆs that are less than or equal to 0.3794. Provide this proportion.

R./ After running the simulation, I found that the proportion of simulated \(\hat{p}\) values less than or equal to 0.3794 is 0.001.
This proportion changes every time the simulation is run; I have obtained values from 0 to 0.004.

```{r}
nsamples = 1000  # number of simulated samples
sample.phat = numeric(nsamples) # vector to hold the 1000 simulated sample proportions
size = 1426
counter = 0

for(i in 1:nsamples){
  s1426data = rbinom(1,size,0.42) # number of "will separate" answers in one sample of 1426 with p = 0.42; see https://www.programmingr.com/examples/neat-tricks/sample-r-function/r-rbinom/
  sample.phat[i] = s1426data/size # phat for this sample of 1426
  if (sample.phat[i] <= 0.3794){  # "less than or equal to" includes equality
      counter = counter + 1
  }
}

sample.phat[1:5]
sample.phat[996:1000]

pphatless037 = counter/nsamples
pphatless037
```
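Because each simulated count comes from a Binomial(1426, 0.42) distribution, the simulated proportion can be compared with the exact binomial probability. The chunk below is a minimal sketch under that assumption; `p_exact` and `p_normal` are illustrative names, and the simulated value will scatter around the exact one from run to run.

```{r}
# Sketch: exact binomial counterpart of the simulated proportion (assumes p = 0.42)
n_poll = 1426
p_2012 = 0.42
x_obs = 541  # observed count, 541/1426 = 0.3794

# P(X <= 541) for X ~ Binomial(1426, 0.42)
p_exact = pbinom(x_obs, size = n_poll, prob = p_2012)

# Normal approximation for the same event, as in part (b)
p_normal = pnorm(x_obs / n_poll, mean = p_2012, sd = sqrt(p_2012 * (1 - p_2012) / n_poll))

c(exact = p_exact, normal = p_normal)
```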
3. Billy purchases one 6-49 lottery ticket every week and keeps track of the number of “matches” he has on each of his tickets. To be clear, a “match” will occur when a number on his ticket matches a number that appears in the winning combination. A random variable X  that keeps track of the number of matching numbers Billy experiences per week has the probability distribution function with a mean and standard deviation of
P(X=x) = choose (6, x) * choose(43, 6−x) / choose(49, 6)      x=0,1,2,3,4,5,6.
E(X) = μX = 36 / 49 = 0.7347
SD(X) = σX = 0.75998 ≈ 0.76
Billy claims that in a year (52 weeks), on average, he manages to have at least one matching number on his 6-49 ticket. What do you think 
about Billy’s claim? Provide a brief commentary about Billy’s claim using your current knowledge of statistics and probability theory. 

R./ The probability of getting at least one matching number on a single ticket is 1 minus the probability of getting no matching numbers, P(X = 0):

P(X >= 1) = 1 - P(X = 0)
We obtain P(X = 0) by replacing x with 0 in the probability function:
P(X = 0) = choose(6, 0) * choose(43, 6 - 0) / choose(49, 6)
P(X = 0) = 0.4360

Replacing this value:
P(X >= 1) = 1 - 0.4360
P(X >= 1) = 0.5640

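R./ On any single weekly draw, the probability of at least one matching number is 0.564, which is greater than the probability of no matching numbers, 0.436.
Billy's claim, however, is about his average over 52 weeks being at least 1. The expected number of matches per ticket is only 0.7347, and the standard deviation of a 52-week average is about 0.76/sqrt(52) ≈ 0.105, so an average of 1 or more sits roughly 2.5 standard deviations above what is expected. By the Central Limit Theorem this happens in fewer than 1% of years, so the claim is possible but very unlikely.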

```{r}
P0 = choose (6, 0) * choose(43, 6) / choose(49, 6) 
P0

Pmore1 = 1 - P0
Pmore1
```
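The claim about the 52-week average can also be examined with the Central Limit Theorem. The chunk below is a minimal sketch, assuming the 52 weekly counts are independent and that the Normal approximation for their mean is adequate; `mu_x` and `sigma_x` are the values given in the question, and the other names are illustrative.

```{r}
# Sketch: P(52-week average number of matches >= 1) via the CLT
mu_x = 36 / 49      # E(X) from the question
sigma_x = 0.75998   # SD(X) from the question
weeks = 52

se_mean = sigma_x / sqrt(weeks)  # standard deviation of the 52-week mean
p_mean_at_least_1 = 1 - pnorm(1, mean = mu_x, sd = se_mean)
p_mean_at_least_1
```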
4. A common measure of toxicity for any pollutant is the concentration of the pollutant that will kill half of the test species in a given amount of time (usually about 96 hours for the fish species). This measurement is called the LC50 (the lethal concentration killing 50% of the test species).

The Environmental Protection Agency has collected data on LC50 measurements for certain chemicals likely to be found in freshwater and lakes. For a certain species of fish, the LC50 measurements (in parts per million) for DDT in 12 experiments to determine the LC50 “dose” are
16,5,21,19,10,5,8,2,7,2,4,9

a. Use RStudio to create the bootstrap distribution of the sample mean \(\overline{X}_{Boot,LC50}\). Use 2000 “bootstraps” in your work, and display the distribution.

R./ As follows

```{r}
library(mosaic) # provides favstats() and qdata(), which are used below
library(ggplot2)
library(lattice)
lc50 = c(16,5,21,19,10,5,8,2,7,2,4,9)  #create a vector with the values of LC50
lc50.df = data.frame(lc50)  # convert the above vector into a dataframe
lc50.df
```


```{r}
N = 2000  # number of bootstrap resamples
n.size = length(lc50.df$lc50)
boot.mean = numeric(N)  # holds the 2000 bootstrap sample means
for(i in 1:N){
  
  boot.data = sample(lc50.df$lc50, n.size, replace=TRUE) #sample with replacement 
  boot.mean[i] = mean(boot.data)
}
bootstrap.mean = data.frame(boot.mean) #data frame holding 2000 different bootstrap sample means
head(bootstrap.mean, 4)

```
```{r}
tail(bootstrap.mean, 4)
```


```{r}
ggplot(bootstrap.mean, aes(x = boot.mean)) + geom_histogram(col="red", fill="blue", binwidth = 0.5) + xlab("Values of Bootstrap Mean") + ylab("Count") + ggtitle("Distribution of Bootstrap Statistic: Sample Mean")
```
```{r}
favstats(~boot.mean, data=bootstrap.mean)
```
b. From your result in (a), compute the 95% bootstrap (percentile) confidence interval for \(\mu_{LC50}\), the mean LC50 measurement for DDT.

R./ Using qdata() on the bootstrap means, the 2.5th and 97.5th percentiles are about 5.75 and 12.58, so the 95% bootstrap percentile interval is (5.75, 12.58). The histogram below marks these two values.

```{r}
p25.lc50 = qdata(~boot.mean, c(0.025, 0.975), data=bootstrap.mean)[1]
p25.lc50[1]

p975.lc50 = qdata(~boot.mean, c(0.025, 0.975), data=bootstrap.mean)[2]
p975.lc50
```
```{r}
ggplot(bootstrap.mean, aes(x = boot.mean)) + geom_histogram(col="red", fill="blue", binwidth = 0.5) + xlab("Values of Bootstrap Mean") + ylab("Count") + ggtitle("Distribution of Bootstrap Statistic: Sample Mean with 95% confidence Interval ") + geom_vline(xintercept = p25.lc50, color="red") + geom_vline(xintercept = p975.lc50, color="red")
```
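The percentile interval does not depend on `qdata()`; base R's `quantile()` gives the same kind of result. The chunk below is a minimal sketch that repeats the bootstrap on its own copy of the data; the seed and the `_sketch` names are arbitrary choices, not part of the original solution, and the endpoints will differ slightly from the ones reported above.

```{r}
# Sketch: percentile bootstrap interval for the mean using base R only
set.seed(602)  # arbitrary seed, only so this sketch is reproducible
lc50_sketch = c(16,5,21,19,10,5,8,2,7,2,4,9)  # LC50 values from the question

# 2000 bootstrap means, each from a resample of size 12 drawn with replacement
boot_means_sketch = replicate(2000, mean(sample(lc50_sketch, replace = TRUE)))

# 2.5th and 97.5th percentiles of the bootstrap means
ci_percentile_sketch = quantile(boot_means_sketch, c(0.025, 0.975))
ci_percentile_sketch
```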
c. Repeat your estimation of \(\mu_{LC50}\), using the “other” confidence interval covered in Data 602. In the context of these data, interpret the meaning of the confidence interval. State any conditions/assumptions that are required in the computation of this confidence interval.

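R./ Using the Student's t method, the 95% confidence interval for \(\mu_{LC50}\) is (4.9181, 13.0819). We are 95% confident that the mean LC50 measurement for DDT for this species of fish is between about 4.92 and 13.08 parts per million. This interval requires that the 12 measurements are a random sample and that LC50 measurements are approximately Normally distributed, which matters here because n = 12 is small.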


```{r}
Ttest.lc50 = t.test(~lc50, data=lc50.df)$conf.int # 95% is the default confidence level; the one-sided formula interface comes from mosaic
Ttest.lc50[1]
Ttest.lc50[2]
```
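The Student's t interval can also be reproduced directly from its formula, mean ± t × S/sqrt(n) with n - 1 degrees of freedom. The chunk below is a minimal sketch of that arithmetic; the variable names are illustrative.

```{r}
# Sketch: Student's t interval for the mean computed from its formula
lc50_t = c(16,5,21,19,10,5,8,2,7,2,4,9)  # LC50 values from the question
n_t = length(lc50_t)
xbar_t = mean(lc50_t)
se_t = sd(lc50_t) / sqrt(n_t)       # S / sqrt(n)
t_crit = qt(0.975, df = n_t - 1)    # critical value for 95% confidence

ci_t_manual = c(lower = xbar_t - t_crit * se_t, upper = xbar_t + t_crit * se_t)
ci_t_manual
```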
d. Compare your results in parts (b) and (c). If you were to report one of these confidence intervals, which would you report? Explain your answer.

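R./ Bootstrap 95% CI = (5.75, 12.58) (red in the graph)
Student's t 95% CI = (4.9181, 13.0819) (blue in the graph)
I would report the bootstrap interval. With only n = 12 measurements that are skewed to the right (mean 9, median 7.5), the Normality condition behind the Student's t interval is hard to justify, while the bootstrap percentile interval does not require it. The bootstrap interval is also narrower, although a narrower interval is not by itself more reliable.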

```{r}
ggplot(bootstrap.mean, aes(x = boot.mean)) + geom_histogram(col="red", fill="blue", binwidth = 0.5) + xlab("Values of Bootstrap Mean") + ylab("Count") + ggtitle("Distribution of Bootstrap Statistic: Sample Mean with 95% confidence Interval ") + geom_vline(xintercept = p25.lc50, color="red") + geom_vline(xintercept = p975.lc50, color="red") + geom_vline(xintercept = Ttest.lc50[1], color="blue") + geom_vline(xintercept = Ttest.lc50[2], color="blue")
```
5. Does one’s educational level influence their opinion about vaccinations? A recent Angus Reid2 survey was taken. Each person sampled was asked to respond to the statement “The science around vaccinations isn’t clear.”

Respondents either “strongly agree”, “moderately agree”, “moderately disagree”, or “strongly disagree”. The sample was partitioned by level of education.

There were n=670 respondents whose highest level of education was high school or less, of which 348 “disagreed” (moderately disagree or strongly disagree). There were also n=376 whose highest level of education was at least an undergraduate university education. Of these, 274 disagreed.


a. Consider the population consisting of all persons whose highest level of education was high school or less and the bootstrap statistic \(\hat{p}_{Boot,HS}\). Using 1000 iterations/replications, create a bootstrap distribution of \(\hat{p}_{HS}\). Display your distribution.

R./ From the information given, \(\hat{p}_{HS}\) = 348 / 670 ≈ 0.52. I use this proportion with rbinom() to simulate a sample of 670 zeros (agree) and ones (disagree); because the sample is simulated, its counts need not equal 322 and 348 exactly.
I then tabulate this list to see how many zeros and ones it contains.
Next I loop 1000 times, each time drawing a resample of size 670 from the list with replacement. The mean of each resample is stored in a vector called phats.hs; I then create a data frame from phats.hs and display its histogram.

```{r}
library(binom)
hs = rbinom(n=670,size=1,prob=0.52)
table(hs)
```
```{r}
phats.hs = numeric(1000) #create a vector to store each of the 1000 bootstrap values of the sample proportion
for(i in 1:1000){
      temp.data = sample(hs, length(hs), replace=TRUE) # resample with replacement; mosaic's resample(hs) would also work
      phats.hs[i] = mean(temp.data)
}
boot_phat_hs.df = data.frame(phats.hs)
head(boot_phat_hs.df, 4)

```
```{r}
tail(boot_phat_hs.df, 4)
```


```{r}
ggplot(boot_phat_hs.df, aes(x = phats.hs)) + geom_histogram(col="green", fill="blue", binwidth=0.01) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion")
```
b. Now consider a different population that consists of all persons whose highest level of education was at least an undergraduate degree. Repeat part (a), creating a bootstrap distribution for \(\hat{p}_{Boot,Uni}\). (Again, display your distribution.)

R./ From the information given, \(\hat{p}_{Uni}\) = 274 / 376 ≈ 0.73. I use this proportion with rbinom() to simulate a sample of 376 zeros (agree) and ones (disagree), and then tabulate this list to see how many zeros and ones it contains.
Next I loop 1000 times, each time drawing a resample of size 376 from the list with replacement. The mean of each resample is stored in a vector called phats.uni; I then create a data frame from phats.uni and display its histogram.

```{r}
uni = rbinom(n=376,size=1,prob=0.73)
table(uni)
```
```{r}
phats.uni = numeric(1000) 
for(i in 1:1000){
      temp.data = sample(uni, length(uni), replace=TRUE) 
      phats.uni[i] = mean(temp.data)
}
boot_phat_uni.df = data.frame(phats.uni)
head(boot_phat_uni.df, 4)
```
```{r}
tail(boot_phat_uni.df, 4)
```


```{r}
ggplot(boot_phat_uni.df, aes(x = phats.uni)) + geom_histogram(col="green", fill="blue", binwidth=0.01) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion")
```
c. You wish to estimate \(p_{Uni} - p_{HS}\), the difference between the proportion of all university-educated Canadians who disagree that the science of vaccinations isn’t clear and the proportion of all Canadians whose highest level of completed education is high school who believe the same. You wish to have 95% confidence in your result. Think about the code you created to generate the bootstrap distributions in parts (a) and (b). Modify the code you created in parts (a) and (b) to create a distribution of the bootstrap statistic \(\hat{p}_{Uni} - \hat{p}_{HS}\).

R./ As follows

```{r}
phats.uni_hs = numeric(1000) 
for(i in 1:1000){
    temp.data.hs = sample(hs, length(hs), replace=TRUE) # resample with replacement; mosaic's resample(hs) would also work
    temp.data.uni = sample(uni, length(uni), replace=TRUE)  
    phats.uni_hs[i] = mean(temp.data.uni) - mean(temp.data.hs)
}
boot_phat_uni_hs.df = data.frame(phats.uni_hs)
head(boot_phat_uni_hs.df, 4)

```
```{r}
tail(boot_phat_uni_hs.df, 4)
```


d. Consider your finding in part (c). Compute the 95% bootstrap percentile confidence interval for \(p_{Uni} - p_{HS}\). From your result, is the proportion of persons with at most a high school education who disagree that the science around vaccinations isn’t clear greater than the similar proportion of persons with at least an undergraduate university degree? Write a paragraph that supports your answer.

R./ No. The proportion of people with at least an undergraduate university education who disagree that the science around vaccinations isn't clear is greater than the corresponding proportion of people whose highest level of education is high school or less. The 95% bootstrap percentile confidence interval for \(p_{Uni} - p_{HS}\) runs from 0.1668 to 0.2832. Because the whole interval lies above 0, we are 95% confident that the university-educated proportion is higher, by between about 16.7 and 28.3 percentage points.

```{r}
p25.uni_hs = qdata(~phats.uni_hs, c(0.025, 0.975), data=boot_phat_uni_hs.df)[1]
p25.uni_hs
p975.uni_hs = qdata(~phats.uni_hs, c(0.025, 0.975), data=boot_phat_uni_hs.df)[2]
p975.uni_hs

```
```{r}
ggplot(boot_phat_uni_hs.df, aes(x = phats.uni_hs)) + geom_histogram(col="green", fill="blue", binwidth=0.01) + xlab("Difference in Sample Proportions (Uni - HS)") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Difference in Sample Proportions") + geom_vline(xintercept = p25.uni_hs, color="red") + geom_vline(xintercept = p975.uni_hs, color="red")
```
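As a cross-check on the bootstrap result, the difference can also be estimated from the observed counts with the large-sample (Wald) interval for two proportions. This is a minimal sketch of a different technique, not the bootstrap method asked for in the question; the names are illustrative. Because the bootstrap above resamples simulated 0/1 vectors rather than the observed counts, its endpoints need not coincide with these.

```{r}
# Sketch: large-sample (Wald) 95% interval for p_Uni - p_HS from the observed counts
x_uni = 274; n_uni = 376
x_hs = 348;  n_hs = 670

phat_uni = x_uni / n_uni
phat_hs = x_hs / n_hs
diff_hat = phat_uni - phat_hs

# standard error of the difference between two independent sample proportions
se_diff = sqrt(phat_uni * (1 - phat_uni) / n_uni + phat_hs * (1 - phat_hs) / n_hs)

ci_diff_wald = diff_hat + c(-1, 1) * qnorm(0.975) * se_diff
ci_diff_wald
```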


6. Nanos research3 recently completed a survey of n=1000  Canadians aged 18 years of age or older, asking each “what is your most important national issue of concern?” 163 responded “Inflation”, 149 responded “Environment”, 131 responded “Jobs/Economy”. Those were the Top Three.

a. Compute a 95% confidence interval for pInflation, the proportion of all Canadians aged 18 years or older for whom “Inflation” is the most important national concern.

R./ I calculate the 95% confidence interval using the plus-2/plus-4 adjustment, which gives (0.1427, 0.1885).
```{r}
prop.test(163 + 2, 1000 + 4, correct=FALSE)$conf
```
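For comparison, the plus-2/plus-4 interval can be computed by hand from the adjusted proportion. The chunk below is a minimal sketch of that textbook formula with illustrative names. `prop.test()` builds its interval by inverting the score test (a Wilson interval), so its endpoints for the adjusted counts differ slightly from the hand calculation.

```{r}
# Sketch: plus-2/plus-4 interval computed from its formula
x_infl = 163
n_infl = 1000

p_tilde = (x_infl + 2) / (n_infl + 4)                    # adjusted proportion
se_tilde = sqrt(p_tilde * (1 - p_tilde) / (n_infl + 4))  # its standard error

ci_plus4 = p_tilde + c(-1, 1) * qnorm(0.975) * se_tilde
ci_plus4
```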
b. Similar to your work in Question 4(b), create the distribution of the bootstrap statistic \(\hat{p}_{Boot,Inflation}\) and a 95% bootstrap percentile confidence interval for \(p_{Inflation}\).
 
 R./ 95% bootstrap percentile confidence interval = (0.144, 0.188)
 
```{r}
infl = rbinom(n=1000,size=1,prob=0.163)
infl.df = data.frame(infl)
head(infl.df, 4)
```
```{r}
tail(infl.df)
```


```{r}
N = 1000
n.size = length(infl.df$infl)
boot.proportion_infl = numeric(N)
boot.data_infl = numeric(n.size)


for(i in 1:N){
  
  boot.data_infl = sample(infl.df$infl, n.size, replace=TRUE) #sample with replacement 
  boot.proportion_infl[i] = mean(boot.data_infl)
}
bootstrap.proportion_infl = data.frame(boot.proportion_infl) #data frame holding 1000 different bootstrap sample proportions
head(bootstrap.proportion_infl, 4)

```
```{r}
tail(bootstrap.proportion_infl, 4)
```


```{r}
p2.5Infl = qdata(~boot.proportion_infl, c(0.025, 0.975), data = bootstrap.proportion_infl)[1]
p2.5Infl
p97.5Infl = qdata(~boot.proportion_infl, c(0.025, 0.975), data = bootstrap.proportion_infl)[2]
p97.5Infl
```
```{r}
ggplot(bootstrap.proportion_infl, aes(x = boot.proportion_infl)) + geom_histogram(col="green", fill="blue", binwidth=0.005) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion") + geom_vline(xintercept = p2.5Infl, color="red") + geom_vline(xintercept = p97.5Infl, color="red")
```
c. A similar survey of Canadians in August 2023 - a little over a month ago - suggested that the proportion of all Canadians who identified “Inflation” as the most important national concern was pInflation,Aug_23 = 0.13.
From your results in (a) and (b), can you infer that the proportion of all Canadians who believe “Inflation” is the most important national issue has increased since August of this year? Why or why not? Ensure you invoke a statistical justification.

R./ Yes. Both 95% confidence intervals for the current proportion, (0.1427, 0.1885) from part (a) and (0.144, 0.188) from part (b), lie entirely above the August 2023 value of 0.13. Since 0.13 is not a plausible value for the current proportion at the 95% confidence level, I can infer that the proportion of Canadians who believe inflation is the most important national issue has increased.
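
The same conclusion can be examined with a one-sided z test for a single proportion. The chunk below is a minimal sketch, assuming the August value 0.13 is treated as a known reference value rather than an estimate with its own sampling error; the names are illustrative.

```{r}
# Sketch: one-sided z test of H0: p = 0.13 against HA: p > 0.13
x_now = 163
n_now = 1000
p0_aug = 0.13  # August 2023 value, treated as fixed

phat_now = x_now / n_now
z_stat = (phat_now - p0_aug) / sqrt(p0_aug * (1 - p0_aug) / n_now)  # standard error under H0
p_value = 1 - pnorm(z_stat)  # upper-tail probability

c(z = z_stat, p.value = p_value)
```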


7. Refer to the data encountered in Question 4 of this assignment. Prior to completing parts (a) and (b), carefully read through both parts (a) and (b).

a. Use the Bootstrap method to compute a 98% confidence interval for \(\tilde{\mu}_{LC50}\), the median LC50 level. Interpret the meaning of this interval in the context of these data.

R./ I simulate the bootstrap distribution of the median with N = 2000 resamples, and then calculate the 98% confidence interval using qdata() for the 1st and 99th percentiles. Finally, I graph these results.
98% confidence interval = (4, 16)
We are 98% confident that the median LC50 measurement for DDT is between 4 and 16 ppm.
```{r}
N = 2000
n.size = length(lc50.df$lc50)
boot.median = numeric(N)
boot.data = numeric(n.size)


for(i in 1:N){
  
  boot.data = sample(lc50.df$lc50, n.size, replace=TRUE) #sample with replacement 
  boot.median[i] = median(boot.data)
}

bootstrap.median = data.frame(boot.median) #data frame holding 2000 different bootstrap sample medians
head(bootstrap.median, 4)
```


```{r}
tail(bootstrap.median, 4)
```


```{r}
lc50.median_99 = qdata(~boot.median, c(0.01, 0.99), data=bootstrap.median) # the 1st and 99th percentiles give the 98% interval
lc50.median_99
```


```{r}
ggplot(bootstrap.median, aes(x = boot.median)) + geom_histogram(col="red", fill="blue", binwidth=0.8) + xlab("Values of the Sample Median") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Median") + geom_vline(xintercept = lc50.median_99[1], color="green") + geom_vline(xintercept = lc50.median_99[2], color="green")
```
b. Compute the 95% bootstrap percentile confidence interval for the population standard deviation, σ. In addition, interpret the meaning of your interval in the context of these data.

R./ For this question, I find the 95% bootstrap percentile interval for the variance first and then take the square root of its endpoints, which gives the 95% confidence interval for the standard deviation: (3.1712, 7.8607). The endpoints change slightly from run to run.

We are 95% confident that the population standard deviation of the LC50 measurements for DDT is between about 3.17 and 7.86 ppm.
```{r}
N = 2000
n.size = length(lc50.df$lc50)
boot.var = numeric(N)
boot.data = numeric(n.size)


for(i in 1:N){
  
  boot.data = sample(lc50.df$lc50, n.size, replace=TRUE) #sample with replacement 
  boot.var[i] = var(boot.data)
}

bootstrap.var = data.frame(boot.var) #data frame holding 2000 different bootstrap sample variance
head(bootstrap.var, 4)
```

```{r}
tail(bootstrap.var, 4)
```
```{r}
lc50.var_95 = qdata(~boot.var, c(0.025, 0.975), data=bootstrap.var)
lc50.var_95
```


```{r}
ggplot(bootstrap.var, aes(x = boot.var)) + geom_histogram(col="red", fill="blue", binwidth=1) + xlab("Values of the Sample Variance") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Variance") + geom_vline(xintercept = lc50.var_95[1], color="green") + geom_vline(xintercept = lc50.var_95[2], color="green")
```

```{r}
lc50.sd_95 = sqrt(lc50.var_95)
lc50.sd_95
```
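Taking the square root of the variance percentiles is legitimate because the square root is an increasing function, so it preserves the ordering of the bootstrap values. The chunk below is a minimal sketch of that point on its own bootstrap run. The seed and names are arbitrary, and `type = 1` is chosen so that `quantile()` returns observed values without interpolation, which makes the two routes agree exactly.

```{r}
# Sketch: percentiles of bootstrap SDs equal square roots of percentiles of bootstrap variances
set.seed(602)  # arbitrary seed, only so this sketch is reproducible
lc50_sd_sketch = c(16,5,21,19,10,5,8,2,7,2,4,9)  # LC50 values from Question 4

boot_var_sketch = replicate(2000, var(sample(lc50_sd_sketch, replace = TRUE)))

# route 1: percentiles of the variances, then square root
ci_from_var = sqrt(quantile(boot_var_sketch, c(0.025, 0.975), type = 1))
# route 2: square root of every variance, then percentiles
ci_from_sd = quantile(sqrt(boot_var_sketch), c(0.025, 0.975), type = 1)

rbind(ci_from_var, ci_from_sd)
```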
8. A national survey 4 of n=399  “Gen Z”-ers - someone who is born in the years 1996 - 2010 (inclusive) was taken. Each was then asked the following question:
“If a federal election were held tomorrow, which one of the following parties would you vote for in your constituency?”
The results?

128 responded “Conservative” (Conservative Party of Canada)
96 responded “Liberal” (Liberal Party of Canada)
104 responded “NDP” (New Democratic Party of Canada)
Respondents were provided with a few more “closed options”, including the Bloc Quebecois, People’s Party, and Green Party.

a. Compute the 95% confidence interval for \(p_{Con}\), the proportion of all Gen Z-ers in Canada that will vote for their respective Conservative Member of Parliament candidate/constituency, if an election were “held tomorrow”.

R./ Since we know n and the number who chose Conservative, we can use prop.test() with its default 95% confidence level, applying the plus-2/plus-4 adjustment.
\(p_{Con}\) 95% CI = (0.2788, 0.3697)
According to these results, we are 95% confident that between 27.88% and 36.97% of Gen Z-ers would vote Conservative.

```{r}
ci95con.prop = prop.test(128+2, 399+4, correct=FALSE)$conf
ci95con.prop
```
b. Consider the bootstrap statistic \(\tilde{p}_{Con} = \frac{X_{Con} + 2}{399 + 4}\). Write the R code that will generate a bootstrap distribution for \(\tilde{p}_{Con}\). Use 1000 as the number of replications/iterations.

```{r}
nsamples = 1000  # number of bootstrap replications
sample.phat_con = numeric(nsamples) # vector to hold the 1000 bootstrap values of the plus-2/plus-4 proportion
size = 399

for(i in 1:nsamples){
  boot.data = rbinom(1,size,0.32) # simulated number of Gen Z-ers choosing Conservative, using the sample proportion 128/399 ≈ 0.32
  sample.phat_con[i] = (boot.data + 2)/(size + 4) # plus-2/plus-4 proportion for this replication
}

bootstrap.con = data.frame(sample.phat_con)
head(bootstrap.con, 4)
```
```{r}
tail(bootstrap.con, 4)
```


c. From your result in part (b), compute a 95% bootstrap confidence interval for pCon.

```{r}
ci95con.boot = qdata(~sample.phat_con, c(0.025, 0.975), data=bootstrap.con)
ci95con.boot
```
```{r}
ggplot(bootstrap.con, aes(x = sample.phat_con)) + geom_histogram(col="green", fill="blue", binwidth=0.01) + xlab("Values of the Sample Proportion") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap Sample Proportion") + geom_vline(xintercept = ci95con.prop[1], color="red") + geom_vline(xintercept = ci95con.prop[2], color="red") + geom_vline(xintercept = ci95con.boot[1], color="purple") + geom_vline(xintercept = ci95con.boot[2], color="purple")
```

d. Consider your results in parts (a) and (c). Compare the two results. If you had to pick one as the “best” estimate for the unknown value of pCon, which one would you select? Provide a justification for your choice.

R./ From (a): CI95.prop = (0.2788, 0.3697) --> width 0.3697 - 0.2788 = 0.0909 (red)
From (c): CI95.boot = (0.2779, 0.3628) --> width 0.3628 - 0.2779 = 0.0849 (purple)

These intervals are very close. If I have to choose one, I choose the narrower one, which in this case is the interval found by running the bootstrap.

9. In this question you will use the albertacountiessampled2019.csv. The data appearing in this file are the results of a random sample of n=100 counties in Alberta. For each county, the median household income was computed. This feature appears in the OriginalValue column.

Read this data into a R data frame. To read this into R, you will have to select the Raw tab appearing on the top right-hand side.

The bootstrap t statistic is defined as
\(T_{Bootstrap} = \dfrac{\overline{X}_{Boot} - \overline{X}}{\left(\frac{S}{\sqrt{n}}\right)}\)
where \(\overline{X}\) and \(S\) are the mean and standard deviation of the sample.

a. From these data, generate the distribution of TBootstrap. Display this distribution. Use 1000 for the number of bootstraps.

```{r}
alberta.counties = read.csv("albertacountiessampled2019.csv")
income.median = data.frame(alberta.counties$OriginalValue)
head(income.median, 4)
```
```{r}
tail(income.median, 4)
```
```{r}
stats = favstats(income.median$alberta.counties.OriginalValue)
stats
```
```{r}
sample.mean = stats[6]
sample.mean
```

```{r}
sample.sd = stats[7]
sample.sd
```
```{r}
N = 1000  # number of bootstrap resamples
sample.size = length(income.median$alberta.counties.OriginalValue)
boot.mean = numeric(N)
boot.t = numeric(N)
se.mean = sample.sd[[1]] / sqrt(sample.size)  # S / sqrt(n); [[1]] extracts the number from the one-column favstats result

for (i in 1:N){
  boot.data = sample(alberta.counties$OriginalValue, sample.size, replace = TRUE)
  boot.mean[i] = mean(boot.data)
  boot.t[i] = (boot.mean[i] - sample.mean[[1]]) / se.mean  # bootstrap t statistic for this resample
}

boot.meandf = data.frame(boot.mean)
head(boot.meandf)
```

```{r}
tail(boot.meandf)
```

```{r}
boot.tval = numeric(N)  # vector to hold the N bootstrap t values
for(i in 1:N){
  boot.tval[i] = boot.t[[i]][1]
}

boot.tdf = data.frame(boot.tval)
head(boot.tdf)

```
```{r}
tail(boot.tdf)
```

```{r}
ggplot(boot.tdf, aes(x = boot.tval)) + geom_histogram(col="red", fill="blue", binwidth=0.4) + xlab("Values of Bootstrap T") + ylab("Count") + ggtitle("Histogram: Distribution of Bootstrap T")

```
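b. Refer to part (a): A Bootstrap t Confidence Interval for \(\mu\) is defined as:
\(\left[\overline{X} - t^{*}_{\frac{\alpha}{2}}\left(\frac{S}{\sqrt{n}}\right),\ \overline{X} + t^{*}_{1-\frac{\alpha}{2}}\left(\frac{S}{\sqrt{n}}\right)\right]\)
where \(t^{*}_{\frac{\alpha}{2}}\) and \(t^{*}_{1-\frac{\alpha}{2}}\) are the \(\frac{\alpha}{2}\) and \(1-\frac{\alpha}{2}\) percentiles from the \(T_{Bootstrap}\) distribution.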

Compute the 95% bootstrap t confidence interval for μMedian_Family_Income for all the counties in Alberta.

Note: \(-t_{\frac{\alpha}{2}}\) will be a negative value due to the nature of the Student’s t distribution. Hence the “-” sign in front.

```{r}
percentileT = quantile(boot.tdf$boot.tval, c(0.025, 0.975))
percentileT
```
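```{r}
se.income = sample.sd/sqrt(sample.size)  # S / sqrt(n)
meanCI.lower = sample.mean - (percentileT[2] * se.income)  # subtracting the 97.5th T percentile gives the lower endpoint
meanCI.lower$mean
```
 
```{r}
meanCI.upper = sample.mean - (percentileT[1] * se.income)  # the 2.5th T percentile is negative, so subtracting it gives the upper endpoint
meanCI.upper$mean
```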
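The whole bootstrap t procedure fits in a few lines once the sample mean and standard deviation are plain numbers. The chunk below is a minimal sketch on stand-in data: it reuses the LC50 values from Question 4 only so that it runs without the CSV file, so its numbers say nothing about county incomes. The seed and the `_demo` names are arbitrary.

```{r}
# Sketch: bootstrap t interval on stand-in data (LC50 values, not the county incomes)
set.seed(602)  # arbitrary seed, only so this sketch is reproducible
x_demo = c(16,5,21,19,10,5,8,2,7,2,4,9)
n_demo = length(x_demo)
xbar_demo = mean(x_demo)
se_demo = sd(x_demo) / sqrt(n_demo)  # S / sqrt(n)

# T_Bootstrap as defined in the question: bootstrap mean centred at the sample mean, scaled by S/sqrt(n)
t_boot_demo = replicate(1000, (mean(sample(x_demo, replace = TRUE)) - xbar_demo) / se_demo)
t_pct_demo = quantile(t_boot_demo, c(0.025, 0.975))

# the lower endpoint uses the upper T percentile; the upper endpoint uses the lower (negative) one
ci_boot_t_demo = c(lower = xbar_demo - t_pct_demo[[2]] * se_demo,
                   upper = xbar_demo - t_pct_demo[[1]] * se_demo)
ci_boot_t_demo
```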
c. Compute the (i) 95% bootstrap percentile confidence interval estimate of μMedian_Family_Income  and (ii) the 95% Student’s t confidence interval for μMedian_Family_Income.
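```{r}
medianIncome.ci95 = qdata(~boot.mean, c(0.025, 0.975), data=boot.meandf) # (i) 2.5th and 97.5th percentiles of the 1000 bootstrap means
medianIncome.ci95

```


```{r}
studentT.ci95 = t.test(alberta.counties$OriginalValue)$conf.int # (ii) Student's t interval; 95% is the default confidence level
studentT.ci95

```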
d. Consider your results in parts (b) and (c): If you were to report one of these confidence intervals to a client, what would be your consultation? Provide a short paragraph explaining what the value of μMedian_Family_Income could possibly be, and why you chose the method of confidence interval estimation to report from.

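R./ I would report the 95% bootstrap t confidence interval found in part (b): we are 95% confident that the mean of the median household income across all Alberta counties is between about 83,029 and 91,911.

b) 83029.28 - 91911.28
c) 34747.50 - 131119.50

The range in (c) is much wider because 34747.50 and 131119.50 are the 2.5th and 97.5th percentiles of the 100 sampled county values themselves; they describe the spread of individual counties rather than the uncertainty in the mean.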

