Name(s):
Matt Bufalino, Mary Keller, Will Mason, Jackie Cisneros, Paul Sliwka

Questions
Inference for One and Two Proportions

  1. Consider the following null hypothesis significance test: \(H_0: p=.35\) \(H_a: p\neq.35\)
    A sample of 300 provided a sample proportion \(\hat{p}\)=.275.
    1. Compute the value of the test statistic.
      # The answer is.... The standard error under the null hypothesis is .0275, so z = -2.7235
      p <- .275
      p
    ## [1] 0.275
      SE <- sqrt((.35*(1-.35))/300)
      SE
    ## [1] 0.02753785
      z <- (p-.35)/SE
      z
    ## [1] -2.723524
    1. What is the \(p\)-value?

      # The answer is.... The two-sided area more extreme than +/- 2.7235 is about .0065, so the p-value is < .01
    2. At \(\alpha\)=.05, what is your conclusion?

      # The answer is....Reject the Null Hypothesis 
    3. What is/are your conclusions?
      Answer: The hypothesized value .35 lies above the 95% confidence interval for the population proportion, (.224, .326), so it is not a plausible parameter value. We reject the null hypothesis and conclude that the population proportion differs from .35; the sample estimate is .275.
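
The steps in Question 1 can be collected into one small helper. This is a minimal sketch rather than a required method: `one_prop_z()` and its argument names are hypothetical, and it assumes the large-sample normal approximation with the standard error evaluated at the null value.

```r
# Hypothetical helper: large-sample z test for one proportion.
# p_hat = sample proportion, p0 = null value, n = sample size.
one_prop_z <- function(p_hat, p0, n) {
  se <- sqrt(p0 * (1 - p0) / n)                    # SE under the null hypothesis
  z  <- (p_hat - p0) / se
  p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)
  list(se = se, z = z, p_value = p_two_sided)
}

q1 <- one_prop_z(p_hat = .275, p0 = .35, n = 300)
q1$z        # same z as computed step by step above
q1$p_value  # two-sided p-value
```

Using `pnorm(..., lower.tail = FALSE)` avoids reading the tail area off a table and gives the two-sided p-value directly.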

  2. The Consumer Reports National Research Center conducted a telephone survey of 2,000 adults to learn about the major economic concerns for the future. The survey results showed that 1,760 of the respondents think the future health of Social Security is a major economic concern.
    1. What is the point estimate of the population proportion of adults who think the future health of Social Security is a major economic concern?
        # The answer is.... If successes are coded 1 and failures 0, the mean of the 0s and 1s is the proportion of successes. There are 1,760 ones out of n = 2,000, so the point estimate is .88
      PE <- 1760/2000
      PE
    ## [1] 0.88
    1. At 90% confidence, what is the margin of error?
        # The answer is..... 0.01195316
    SE <- sqrt((.88*(1-.88))/2000)
    SE 
    ## [1] 0.007266361
    MOE <- 1.645*SE 
    MOE
    ## [1] 0.01195316
    1. Develop a 90% confidence interval (two-sided) for the population proportion of adults who think the future health of Social Security is a major economic concern.
        # The answer is.... CI90 = (0.8680468, 0.8919532)
    PE + MOE
    ## [1] 0.8919532
    PE - MOE
    ## [1] 0.8680468
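
The interval above can also be produced without hard-coding 1.645. The sketch below is one possible way to do it: `prop_ci()` is a hypothetical helper name, and it assumes the same normal-approximation (Wald) interval used in this answer, with the critical value taken from `qnorm()`.

```r
# Hypothetical helper: normal-approximation (Wald) CI for one proportion.
# x = number of successes, n = sample size, conf = confidence level.
prop_ci <- function(x, n, conf = 0.95) {
  p_hat  <- x / n
  z_crit <- qnorm(1 - (1 - conf) / 2)   # about 1.645 when conf = 0.90
  moe    <- z_crit * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - moe, estimate = p_hat, upper = p_hat + moe)
}

ci90 <- prop_ci(x = 1760, n = 2000, conf = 0.90)
ci90
```

Because `qnorm(.95)` is slightly smaller than the rounded 1.645, the limits agree with the hand computation to about five decimal places rather than exactly.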
  3. Facebook was voted the most popular website, with 17% of a sample of 2,500 Internet users in the 12–17 age group using the site.
    1. At 95% confidence, what is the margin of error?
        # The answer is....0.01472481
        PE2 <- .17
        n2 <- 2500
        SE2  <- sqrt((PE2 *(1-PE2))/n2)
        SE2
    ## [1] 0.007512656
        MOE2 <- 1.96*SE2 
        MOE2
    ## [1] 0.01472481
    1. What is the interval estimate of the population proportion for which Facebook is the most popular website among Internet users, using a 95% two-sided confidence interval.
        # The answer is.... CI95 = (0.1552752,0.1847248)
      PE2 - MOE2
    ## [1] 0.1552752
      PE2 + MOE2
    ## [1] 0.1847248
    1. How would your conclusions have changed if only 1,000 youths participated in the survey and the same estimate was obtained?
        # The answer is....CI95=[0.146718, 0.193282] which is .17 +/- 0.02328196 so the CI got wider with a smaller sample size.  
        PE3 <- .17
        n <-  1000
        SE3  <- sqrt((PE3*(1-PE3))/n)
        SE3
    ## [1] 0.01187855
        MOE3 <- 1.96*SE3 
        MOE3
    ## [1] 0.02328196
        PE3 - MOE3
    ## [1] 0.146718
        PE3 + MOE3 
    ## [1] 0.193282
     i. For a similar setting, would you recommend spending the resources to obtain another sample of size 2,500?
    
     ```r
     # The answer is.... Probably not. A sample of 2,500 reduces sampling error and gives a more precise estimate than a sample of 1,000, but if the extra precision would not change any decision, it is arguably not worth the additional cost of the larger sample.
     ```
    1. How would your conclusions have changed if only 100 youths participated in the survey and the same estimate was obtained?
        # The answer is.... CI95=[0.09637597, 0.243624] which is .17 +/- 0.07362403 so the CI got much wider with a smaller sample size.
        PE4 <- .17
        n4 <-  100
        SE4  <- sqrt((PE4*(1-PE4))/n4)
        SE4
    ## [1] 0.03756328
        MOE4 <- 1.96*SE4 
        MOE4
    ## [1] 0.07362403
        PE4 - MOE4
    ## [1] 0.09637597
        PE4 + MOE4 
    ## [1] 0.243624
    a. What are the implications of sample size for how you use the data and for the conclusions you can draw?
    
     ```r
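     # The answer is.... With the same estimate of 17%, the 95% margin of error grows as the sample shrinks: +/- .0147 at n = 2,500 (CI95 = (0.1553, 0.1847)), +/- .0233 at n = 1,000, and +/- .0736 at n = 100. Smaller samples therefore support only coarser conclusions about the population proportion.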
     ```
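
The three sample sizes in Question 3 can be compared in one vectorized step. This is a minimal sketch assuming the same normal-approximation margin of error used above; `target_moe` is an illustrative value that does not come from the assignment.

```r
# Margin of error at 95% confidence for p_hat = .17 across the
# three sample sizes considered in Question 3.
p_hat <- .17
n     <- c(100, 1000, 2500)
moe   <- qnorm(.975) * sqrt(p_hat * (1 - p_hat) / n)
data.frame(n = n, moe = round(moe, 4))

# Smallest n whose margin of error is at most a chosen target
# (target_moe is an illustrative assumption, not part of the question).
target_moe <- .02
n_needed <- ceiling(qnorm(.975)^2 * p_hat * (1 - p_hat) / target_moe^2)
n_needed
```

The margin of error shrinks with the square root of n, so quadrupling the sample size only halves the interval width.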
  4. Eagle Outfitters is a chain of stores specializing in outdoor apparel and camping gear. They are considering a promotion that involves mailing discount coupons to all their credit card customers. This promotion will be considered a success if more than 10% of those receiving the coupons use them. Before going national with the promotion, coupons were sent to a sample of 100 credit card customers. The Eagle data file is available as a CSV file.
    1. Develop the null and alternative hypotheses that most appropriately address the question of having more than 10% coupon usage.
      Answer: n = 100; \(H_0: p \le .10\), \(H_a: p > .10\)
    2. The file Eagle contains the sample data. Develop a point estimate of the population proportion.
      Answer: .10
        PE5 <- 10/100
        PE5
    ## [1] 0.1
    1. Use \(\alpha\)=.05 to conduct your hypothesis test. Should Eagle go national with the promotion?
**Answer**: With the null value of .10, the standard error for the significance test is 0.03, so z = 0. The one-sided critical value is 1.645 (upper tail), and the area to the right of z = 0 is .50. Because z < 1.645 (p-value = .50 > .05), we fail to reject the null hypothesis, so the sample does not support going national with the promotion.
n5 <- 100
PE5 <- .10
null <- .10
SEforhyptest <- sqrt((null*(1-null))/n5)
SEforhyptest
## [1] 0.03
SEforCI <- sqrt((PE5*(1-PE5))/n5)
SEforCI
## [1] 0.03
zvalue <- (PE5-null)/SEforhyptest 
zvalue
## [1] 0
d. What is the most appropriate confidence interval that goes along with the hypothesis test in part c?  

```r
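    # The answer is.... a one-sided interval, CI95 = [.051, 1], with lower bound PE5 - 1.645*SEforCI; the two-sided limits are computed below for reference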
MOE5 <- 1.96*SEforCI
MOE5 
```

```
## [1] 0.0588
```

```r
PE5 + MOE5
```

```
## [1] 0.1588
```

```r
PE5 - MOE5 
```

```
## [1] 0.0412
```
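
Because the question asks about "more than 10%", the test is upper-tailed and the matching interval is a one-sided lower bound. The sketch below shows one way to compute both. It assumes the null boundary of .10 from the question's wording and the point estimate of .10 reported above, which has not been checked against the Eagle data file here.

```r
# Upper-tailed test of H0: p <= .10 vs Ha: p > .10 (sketch).
n     <- 100
p_hat <- .10      # point estimate as reported in this answer
p0    <- .10      # boundary value taken from "more than 10%"
alpha <- .05

se_null <- sqrt(p0 * (1 - p0) / n)
z       <- (p_hat - p0) / se_null
p_value <- pnorm(z, lower.tail = FALSE)   # upper tail only
z_crit  <- qnorm(1 - alpha)               # about 1.645 for a one-sided test

# Matching one-sided interval: a 95% lower confidence bound.
se_hat      <- sqrt(p_hat * (1 - p_hat) / n)
lower_bound <- p_hat - z_crit * se_hat
c(z = z, p_value = p_value, z_crit = z_crit, lower_bound = lower_bound)
```

The interval runs from `lower_bound` up to 1; the null is rejected only if that lower bound exceeds .10.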
    
  1. Describe a scenario in which a marketing research team would be interested in testing the difference between two proportions using \(\alpha\)= .05.
    Answer: A team could test whether the proportion of credit card holders who make a purchase differs between those who received a 10% coupon and those who received a 20% coupon.

  2. In a two group context, suppose that group 1 had a proportion of .75 with a sample size of 437, whereas the group 2 had a proportion of .687 with a sample size of 219.

    1. What is the \(p\)-value for the test of equal proportions?
       # The answer is.... The p-value is 0.0869, which is > .05
     n1 <- 437
     n2 <- 219
    
    p1_positive <- .75
    p2_positive <- .687  
    
    diff <- p1_positive - p2_positive
    diff
    ## [1] 0.063
    pooled_proportion <- (n1* p1_positive + n2* p2_positive)/(n1 + n2)
    pooled_proportion
    ## [1] 0.728968
    SE_diff_proportion <- sqrt(pooled_proportion*(1-pooled_proportion)*(1/n1 + 1/n2))
    SE_diff_proportion
    ## [1] 0.0368005
    z.test.diff <- diff/SE_diff_proportion
    z.test.diff
    ## [1] 1.711933
    p.test.diff <- 2*(1-pnorm(abs(z.test.diff)))
    p.test.diff       
    ## [1] 0.08690893
    1. What are the confidence interval limits for the population difference between proportions?
        # The answer is....CI95=[-0.0091,0.1351] which is .063 +/-  0.0721
    alpha <- .05
    C <- 1-alpha
    MOE.diff.prop <- SE_diff_proportion*qnorm(1-alpha/2)
    diff
    ## [1] 0.063
    MOE.diff.prop
    ## [1] 0.07212765
    c(diff - MOE.diff.prop, diff, diff + MOE.diff.prop)
    ## [1] -0.009127646  0.063000000  0.135127646
    1. What is your conclusion regarding the null hypothesis?
      Answer: The observed difference is .063, and the pooled estimate of the proportion for the hypothesis test is 0.7290, giving a standard error of the difference in proportions of .0368. The z value is 1.712, which is inside the critical values of +/- 1.96. Thus, we fail to reject the null hypothesis (p-value = .087 > .05).
    2. Suppose, instead, that the sample sizes of group 1 and group 2 were 645 and 931, respectively.
      1. What is the \(p\)-value for the test of equal proportions?
            # The answer is.... P-value < .01 (0.0066)
 n3 <- 645
 n4 <- 931
    
p3_positive <- .75
p4_positive <- .687  
    
diff2 <- p3_positive - p4_positive
diff2
## [1] 0.063
pooled_proportion2 <- (n3* p3_positive + n4* p4_positive)/(n3 + n4)
pooled_proportion2
## [1] 0.7127836
SE_diff_proportion2 <- sqrt(pooled_proportion2*(1-pooled_proportion2)*(1/n3 + 1/n4))
SE_diff_proportion2
## [1] 0.02317965
z.test.diff2 <- diff2/SE_diff_proportion2
z.test.diff2
## [1] 2.717901
p.test.diff2 <- 2*(1-pnorm(abs(z.test.diff2)))
p.test.diff2 
## [1] 0.006569743
    i. What are the confidence interval limits for the population difference between proportions?
            # The answer is.... CI95=[0.0176, 0.1084] which is .0630 +/- .0454
alpha2 <- .05
C2 <- 1-alpha2
MOE.diff.prop2 <- SE_diff_proportion2*qnorm(1-alpha2/2)
diff2
## [1] 0.063
MOE.diff.prop2
## [1] 0.04543128
c(diff2 - MOE.diff.prop2, diff2, diff2 + MOE.diff.prop2)
## [1] 0.01756872 0.06300000 0.10843128
    ii. What is your conclusion regarding the null hypothesis?  
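        **Answer**:  With the larger sample sizes, the p-value (.0066) is below .05 and the 95% confidence interval excludes zero, so we reject the null hypothesis of equal proportions.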
        
  1. What is different about the confidence intervals in the two scenarios and what causes the difference?
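    Answer:
    Scenario 1 (smaller samples): CI95=[-0.0091, 0.1351], which is .063 +/- .0721.
    Scenario 2 (larger samples): CI95=[0.0176, 0.1084], which is .063 +/- .0454.
    The point estimate of the difference is the same in both scenarios, but the larger samples shrink the standard error (from 0.0368 to 0.0232), so the second interval is narrower and no longer includes zero.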
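
Both scenarios repeat the same arithmetic, so it can be wrapped in a function. This is a sketch under stated assumptions: `two_prop_z()` is a hypothetical helper, the test uses the pooled standard error exactly as above, and the interval here uses the unpooled standard error, a common alternative to the pooled version used in the answers, so its limits differ slightly from those reported.

```r
# Hypothetical helper: two-proportion z test (pooled SE) plus a
# confidence interval based on the unpooled SE.
two_prop_z <- function(p1, n1, p2, n2, conf = 0.95) {
  d           <- p1 - p2
  pooled      <- (n1 * p1 + n2 * p2) / (n1 + n2)
  se_pooled   <- sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
  se_unpooled <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z   <- d / se_pooled
  moe <- qnorm(1 - (1 - conf) / 2) * se_unpooled
  list(diff = d, z = z,
       p_value = 2 * pnorm(abs(z), lower.tail = FALSE),
       ci = c(lower = d - moe, upper = d + moe))
}

small <- two_prop_z(.75, 437, .687, 219)
large <- two_prop_z(.75, 645, .687, 931)
c(small$z, large$z)
c(small$p_value, large$p_value)
rbind(small = small$ci, large = large$ci)
```

The pooled standard error is appropriate for the test because the null hypothesis assumes a common proportion; the interval makes no such assumption, which is why the unpooled form is often preferred there.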
  1. The Professional Golf Association (PGA) measured the putting accuracy of professional golfers playing on the PGA Tour and the best amateur golfers playing in the World Amateur Championship (Golf Magazine). A sample of 1,075 6-foot putts by professional golfers found 688 made putts. A sample of 1,200 6-foot putts by amateur golfers found 696 made putts.
    1. Give the proportions of made 6-foot putts by both professional golfers and amateur golfers.
        # The answer is.... Professional :  0.64  and Amateur:  0.58
np <- 1075
na <- 1200
    
pp_positive <- 688/1075
pp_positive 
## [1] 0.64
pa_positive <- 696/1200 
pa_positive
## [1] 0.58
b. What is the point estimate of the difference between the proportions of the two populations?  
        # The answer is....  .06
PGAdiff <- pp_positive - pa_positive
PGAdiff
## [1] 0.06
c. What is the 95% two-sided confidence interval for the difference between the two population proportions?  
        # The answer is.... CI95= (0.0198, 0.1002) which is 0.0600 +/- 0.0402 
PGApooled_proportion <- (np* pp_positive + na* pa_positive)/(np + na)
PGApooled_proportion
## [1] 0.6083516
PGASE_diff_proportion <- sqrt(PGApooled_proportion*(1-PGApooled_proportion)*(1/np + 1/na))
PGASE_diff_proportion
## [1] 0.02049847
PGAz.test.diff <- PGAdiff/PGASE_diff_proportion
PGAz.test.diff
## [1] 2.927048
PGAp.test.diff <- 2*(1-pnorm(abs(PGAz.test.diff)))
PGAp.test.diff 
## [1] 0.003421956
alpha <- .05
PGAC <- 1-alpha
PGAMOE.diff.prop <- PGASE_diff_proportion*qnorm(1-alpha/2)
PGAdiff
## [1] 0.06
PGAMOE.diff.prop
## [1] 0.04017625
c(PGAdiff - PGAMOE.diff.prop, PGAdiff, PGAdiff + PGAMOE.diff.prop)
## [1] 0.01982375 0.06000000 0.10017625
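d. Interpret the 95% confidence interval and provide a summary statement about the difference between the two populations.    
    **Answer**: We are 95% confident that professionals make between 2.0 and 10.0 percentage points more of their 6-foot putts than amateurs (CI95 = (0.0198, 0.1002)). The interval excludes zero and the p-value is < .01 (.0034), so we reject the null hypothesis of equal proportions; the z value of 2.927 is beyond the critical values of +/- 1.960.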
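
The hand computation for the putting data can be cross-checked with base R's `prop.test()`, using the raw counts from the question. This is offered as a sketch for comparison, not as the method the assignment requires; `correct = FALSE` turns off the continuity correction so the result is comparable to the z test above.

```r
# Cross-check with prop.test(), using the counts of made putts.
pga <- prop.test(x = c(688, 696), n = c(1075, 1200), correct = FALSE)

z_from_chisq <- sqrt(unname(pga$statistic))  # X-squared equals z^2 here
z_from_chisq
pga$p.value
pga$conf.int   # built from the unpooled standard error
```

The p-value should agree with the one computed by hand, while the confidence limits can differ slightly because `prop.test()` does not pool the proportions when building the interval.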
         
  1. Chicago O’Hare (ORD) and Atlanta Hartsfield-Jackson (ATL) are among the busiest airports in the United States. The congestion often leads to delayed flight arrivals as well as delayed flight departures. The Bureau of Transportation tracks the on-time and delayed performance at major airports (Travel & Leisure). A flight is considered delayed if it is more than 15 minutes behind schedule. The following sample data show the delayed departures at Chicago O’Hare and Atlanta Hartsfield-Jackson airports.
    library(tidyverse, quietly = TRUE)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
    Flights <- tribble(
                ~Airport, ~Flights, ~Delays,
                "ORD", 900, 252,
                "ATL", 1200, 312)
    Flights
a. State in words --- not in equation form --- the null hypotheses that can be used to infer whether the population proportions of delayed departures differ at these two airports.  
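    **Answer**: The proportion of departures that are delayed is the same at Chicago O’Hare as at Atlanta Hartsfield-Jackson.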
b. What is the point estimate of the proportion of flights that have delayed departures at Chicago O’Hare?  
        # The answer is.... ORD : .28 
ORDn <- 900
ORDp <- 252/900
ORDp
## [1] 0.28
b. What is the point estimate of the proportion of flights that have delayed departures at Atlanta Hartsfield-Jackson?  
        # The answer is.... ATL : .26 
ATLn <- 1200
ATLp <- 312/1200
ATLp
## [1] 0.26
c. What is the \(p\)-value for the hypothesis test?  
        # The answer is....P-value is >.05 (0.3061511) thus we fail to reject the null hypothesis.  

Flightspooled_proportion <- (ORDn* ORDp + ATLn* ATLp)/(ORDn + ATLn)
Flightspooled_proportion
## [1] 0.2685714
FlightsSE_diff_proportion <- sqrt(Flightspooled_proportion*(1-Flightspooled_proportion)*(1/ORDn + 1/ATLn))
FlightsSE_diff_proportion
## [1] 0.01954401
Flightsdiff <- ORDp - ATLp
Flightsdiff
## [1] 0.02
Flightsz.test.diff <- Flightsdiff/FlightsSE_diff_proportion
Flightsz.test.diff
## [1] 1.023332
Flightsp.test.diff <- 2*(1-pnorm(abs(Flightsz.test.diff)))
Flightsp.test.diff 
## [1] 0.3061511
alpha <- .05
FlightsC <- 1-alpha
FlightsMOE.diff.prop <- FlightsSE_diff_proportion*qnorm(1-alpha/2)
Flightsdiff
## [1] 0.02
FlightsMOE.diff.prop
## [1] 0.03830555
c(Flightsdiff - FlightsMOE.diff.prop, Flightsdiff, Flightsdiff + FlightsMOE.diff.prop)
## [1] -0.01830555  0.02000000  0.05830555
d. Provide a summary statement that describes the outcomes to the question of interest and your conclusion.  
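    **Answer**:  There is not a statistically significant difference in the proportion of delayed departures between Chicago O’Hare and Atlanta Hartsfield-Jackson: the p-value is > .05 (.306), and the z value of 1.023 is inside the critical values of +/- 1.960. The 95% confidence interval for the difference, (-0.0183, 0.0583), includes zero.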
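
The same test can be run directly from the table of counts instead of retyping each proportion. The sketch below uses only base R and a plain data frame with the same columns as the `Flights` tibble; the lower-case object name `flights` is chosen here for illustration.

```r
# Two-proportion z test starting from the table of counts (base R only).
flights <- data.frame(
  Airport = c("ORD", "ATL"),
  Flights = c(900, 1200),
  Delays  = c(252, 312)
)
flights$Prop <- flights$Delays / flights$Flights

pooled <- sum(flights$Delays) / sum(flights$Flights)
se     <- sqrt(pooled * (1 - pooled) * sum(1 / flights$Flights))
z      <- (flights$Prop[1] - flights$Prop[2]) / se   # ORD minus ATL
p_val  <- 2 * pnorm(abs(z), lower.tail = FALSE)
c(pooled = pooled, z = z, p_value = p_val)
```

Working from the counts keeps the pooled proportion tied to the data, so a corrected count only has to be changed in one place.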
    
  1. A Republican and a Democratic representative discussed the possibility of legislation that the two were putting forward together to a mixed audience of politically interested adults. Of those in the audience that participated in a follow-up questionnaire, 161 of 350 Republicans supported the legislation, whereas 79 of 250 Democrats supported it.
    1. With a Type I error rate of .05, is there a difference in the level of support for the legislation between Republicans and Democrats? Explain your conclusion.
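      Answer: Yes. The two-sided p-value (.0004) is below .05, so we reject the null hypothesis of equal support: 46.0% of Republicans and 31.6% of Democrats in the sample supported the legislation, a difference of 14.4 percentage points.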
    2. What is the \(p\)-value for the null hypothesis test?
        # The answer is...P-Value is < .001 (0.0003857468)
RepN <- 350
DemN <- 250

Rep_P_Positive <- 161/350
Rep_P_Positive
## [1] 0.46
Dem_P_Positive <- 79/250
Dem_P_Positive
## [1] 0.316
Politicspooled_proportion <- (RepN* Rep_P_Positive + DemN* Dem_P_Positive)/(RepN + DemN)
Politicspooled_proportion
## [1] 0.4
PoliticsSE_diff_proportion <- sqrt(Politicspooled_proportion*(1-Politicspooled_proportion)*(1/RepN + 1/DemN))
PoliticsSE_diff_proportion
## [1] 0.0405674
Politicsdiff <- Rep_P_Positive - Dem_P_Positive
Politicsdiff
## [1] 0.144
Politicsz.test.diff <- Politicsdiff/PoliticsSE_diff_proportion
Politicsz.test.diff
## [1] 3.549648
Politicsp.test.diff <- 2*(1-pnorm(abs(Politicsz.test.diff)))
Politicsp.test.diff
## [1] 0.0003857468
b. What is the 95% confidence interval for the difference in the proportions?  
        # The answer is.... CI95= (.0645, 0.2235) which is 0.14400000 +/- 0.0795
alpha <- .05
PoliticsC <- 1-alpha
PoliticsMOE.diff.prop <- PoliticsSE_diff_proportion*qnorm(1-alpha/2)
Politicsdiff
## [1] 0.144
PoliticsMOE.diff.prop
## [1] 0.07951065
c(Politicsdiff - PoliticsMOE.diff.prop, Politicsdiff, Politicsdiff + PoliticsMOE.diff.prop)
## [1] 0.06448935 0.14400000 0.22351065
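
The reject or fail-to-reject decisions in the two-group problems can be checked side by side. This sketch reuses the z statistics reported in the output above; the labels in the vector are hypothetical names added for readability.

```r
# z statistics reported in the two-group problems above.
z <- c(groups_small = 1.711933, groups_large = 2.717901,
       golf = 2.927048, flights = 1.023332, legislation = 3.549648)
alpha <- .05

p_value     <- 2 * pnorm(abs(z), lower.tail = FALSE)
reject_by_z <- abs(z) > qnorm(1 - alpha / 2)   # compare with +/- 1.96
reject_by_p <- p_value < alpha                 # compare p-value with alpha
data.frame(z, p_value = signif(p_value, 3), reject_by_z, reject_by_p)
```

The two decision rules always agree for a two-sided z test, which is a quick way to catch a mislabeled inequality such as writing "p < .05" next to a p-value of .087.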
  1. Optional: Listen to the How Do I Know This Medicine Works episode from the podcast. This episode features Marie Davidian, who is past president of the American Statistical Association and very well-known in the field. The podcast is available here: (http://www.cas.miamioh.edu/statsandstories/archives8.html)
    1. How does the evaluation of the question “does a medication work” relate to the question of “does this web-site tweak improve sales” or “does this advertisement work?”
      Answer:
    2. How does “personalized medicine” relate to the idea of a “personalized recommendation” system?
      Answer:
    3. What does “double blind” mean in the context of trials?
      Answer:
    4. What is the “gold standard” in evaluating treatment effectiveness?
      Answer: