5.6

#A 90% confidence interval for a population mean is (65,77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

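SM <- (65+77)/2
SM
## [1] 71
ME <- (77-65)/2
ME
## [1] 6
# ME = t* x SD/sqrt(n), with t* from the t distribution with n - 1 = 24 df
SD <- ME*sqrt(25)/qt(0.95, df = 24)
round(SD, 2)
## [1] 17.53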
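
As a cross-check on the relationship ME = t* x s/sqrt(n), the sketch below recovers the three quantities from the interval endpoints and then rebuilds the interval from them. The helper name ci_to_summary is made up for illustration and is not part of the exercise.

```r
# Hypothetical helper: recover the sample mean, margin of error, and sample SD
# from the endpoints of a t-based confidence interval for a mean.
ci_to_summary <- function(lower, upper, n, level = 0.90) {
  xbar  <- (lower + upper) / 2
  me    <- (upper - lower) / 2
  tstar <- qt(1 - (1 - level) / 2, df = n - 1)  # critical value with n - 1 df
  s     <- me * sqrt(n) / tstar
  list(mean = xbar, me = me, sd = s, tstar = tstar)
}

res <- ci_to_summary(65, 77, n = 25)

# Rebuild the interval from the recovered values; it should match (65, 77)
rebuilt <- res$mean + c(-1, 1) * res$tstar * res$sd / sqrt(25)
print(round(c(mean = res$mean, me = res$me, sd = res$sd), 2))
print(rebuilt)
```

If the rebuilt interval did not match the original endpoints, the recovered standard deviation would be suspect.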

5.12

#5.12 Auto exhaust and lead exposure. Researchers interested in lead exposure due to car exhaust sampled the blood of 52 police officers subjected to constant inhalation of automobile exhaust fumes while working traffic enforcement in a primarily urban environment. The blood samples of these officers had an average lead concentration of 124.32 μg/l and a SD of 37.74 μg/l; a previous study of individuals from a nearby suburb, with no history of exposure, found an average blood level concentration of 35 μg/l.

#(a) Write down the hypotheses that would be appropriate for testing if the police officers appear to have been exposed to a higher concentration of lead. 
#Ho: mu = 35 ug/l; Ha: mu > 35 ug/l, where mu is the mean blood lead concentration of the police officers

#(b) Explicitly state and check all conditions necessary for inference on these data.
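#Random: The problem doesn't say whether the officers were randomly sampled. Normal: We can't see the data, but n = 52 is large enough for the t procedures unless the distribution is strongly skewed. Independent: The sample must be less than 10% of the population, so this condition is satisfied as long as there are at least 520 such officers.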

#(c) Test the hypothesis that the downtown police officers have a higher lead exposure than the group in the previous study. Interpret your results in context.
n <- 52
pmean <- 35
smean <- 124.32
SD <- 37.74
# Only summary statistics are given, so compute the t statistic directly
# instead of running t.test() on simulated data
t_stat <- (smean - pmean)/(SD/sqrt(n))
round(t_stat, 2)
## [1] 17.07
# One-sided p-value for Ha: mu > 35
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)
p_value < 0.001
## [1] TRUE
#The p-value is essentially 0, so we reject Ho. The data provide strong evidence that the officers' average blood lead concentration is higher than 35 ug/l.
#(d) Based on your preceding result, without performing a calculation, would a 99% confidence interval for the average blood concentration level of police officers contain 35 μg/l?
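#No. The test rejects Ho with a p-value far below .01, so 35 ug/l is not a plausible value for the mean and a 99% confidence interval would not contain it.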
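
Part (d) asks for an answer without a calculation, but the interval is easy to compute as a check. The sketch below builds the 99% t interval from the summary statistics in the problem; the variable names are chosen here for illustration.

```r
# Summary statistics from the problem statement
n    <- 52
xbar <- 124.32
s    <- 37.74
mu0  <- 35

se    <- s / sqrt(n)
tstar <- qt(0.995, df = n - 1)           # two-sided 99% critical value
ci99  <- xbar + c(-1, 1) * tstar * se
print(round(ci99, 1))
# TRUE means 35 lies below the whole interval
print(mu0 < ci99[1])
```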

5.18

#5.18 Paired or not, Part II? In each of the following scenarios, determine if the data are paired.
#(a) We would like to know if Intel’s stock and Southwest Airlines’ stock have similar rates of return. To find out, we take a random sample of 50 days, and record Intel’s and Southwest’s stock on those same days.
#Paired. Each Intel observation is matched with the Southwest observation from the same day.
#(b) We randomly sample 50 items from Target stores and note the price for each. Then we visit Walmart and collect the price for each of those same 50 items.
#Paired. Each Target price is matched with the Walmart price for the same item.
#(c) A school board would like to determine whether there is a difference in average SAT scores for students at one high school versus another high school in the district. To check, they take a simple random sample of 100 students from each high school.
#Not paired. The two samples are drawn independently, so no student at one school corresponds to a particular student at the other.

5.24

#5.24 Sample size and pairing. Determine if the following statement is true or false, and if false, explain your reasoning: If comparing means of two groups with equal sample sizes, always use a paired test.
#False. Equal sample sizes do not make the data paired. A paired test applies only when each observation in one group has a natural one-to-one correspondence with an observation in the other group.
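
To illustrate what pairing changes, the sketch below uses made-up prices for the same five items at two stores. The numbers are hypothetical and not from any exercise. It shows that a paired test is just a one-sample test on the item-by-item differences, and that ignoring the pairing gives a different statistic even though the sample sizes are equal.

```r
# Hypothetical prices for the same five items at two stores (made-up numbers)
store_a <- c(12.5, 3.2, 8.9, 20.0, 5.5)
store_b <- c(11.9, 3.5, 8.1, 19.2, 5.0)

# Paired test: analyzes the five item-by-item differences
paired <- t.test(store_a, store_b, paired = TRUE)
# Equivalent one-sample test on the differences
one_sample <- t.test(store_a - store_b, mu = 0)
# Unpaired (Welch) test: treats the two columns as independent samples
unpaired <- t.test(store_a, store_b)

print(c(paired = unname(paired$statistic),
        one_sample = unname(one_sample$statistic),
        unpaired = unname(unpaired$statistic)))
```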

5.30

#In Exercise 5.28, we discussed diamond prices (standardized by weight) for diamonds with weights 0.99 carats and 1 carat. See the table for summary statistics, and then construct a 95% confidence interval for the average difference between the standardized prices of 0.99 and 1 carat diamonds. You may assume the conditions for inference are met.
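m99 <- 44.51
SD99 <- 13.32
n99 <- 23
m1 <- 56.81
SD1 <- 16.13
n1 <- 23
# The question asks for one interval for the difference in means, so use the
# standard error of the difference and a t critical value with
# min(n99, n1) - 1 = 22 df rather than Z = 1.96 on each group separately
SE <- sqrt(SD99^2/n99 + SD1^2/n1)
tstar <- qt(0.975, df = min(n99, n1) - 1)
CIdiff <- (m99 - m1) + c(-1, 1)*tstar*SE
round(CIdiff, 2)
## [1] -21.35  -3.25
#We are 95% confident that 0.99 carat diamonds cost on average between 3.25 and 21.35 less (in standardized price) than 1 carat diamonds.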

5.36

# 5.36 Gaming and distracted eating, Part II. The researchers from Exercise 5.35 also investigated the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction) was 6.1, with a standard deviation of 1.8. Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different?
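n <- 22
sMean <- 4.9
sSD <- 1.8
pMean <- 6.1
# Both groups are samples, so use the two-sample standard error; the control
# group is assumed to also have 22 patients, as in Exercise 5.35
t_score <- (sMean-pMean)/sqrt(sSD^2/n + sSD^2/n)
round(t_score, 2)
## [1] -2.21
p_value <- 2*pt(-abs(t_score), df = n - 1)
p_value < 0.05
## [1] TRUE

The two-sided p-value is below .05, so we reject the null hypothesis. The data provide evidence that the average number of items recalled differs between the treatment and control groups.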

5.42

#We would like to test if students who are in the social sciences, natural sciences, arts and humanities, and other fields spend the same amount of time studying for this course. What type of test should we use? Explain your reasoning.
# We are comparing the mean of one numerical variable (study time) across four groups, so ANOVA (an F test) is appropriate. A chi-square test is not, because it is meant for counts of categorical outcomes.
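
A minimal sketch of what such an ANOVA looks like in R is below. The exercise supplies no data, so the study hours are simulated and purely hypothetical; the field labels and the data frame name are also made up for illustration.

```r
set.seed(1)  # simulated, hypothetical data; not from the exercise
fields <- c("social", "natural", "arts", "other")
study <- data.frame(
  field = factor(rep(fields, each = 30)),
  hours = rnorm(120, mean = 10, sd = 3)   # same true mean in every field
)

# One-way ANOVA: does mean study time differ across the four fields?
fit <- aov(hours ~ field, data = study)
tab <- summary(fit)[[1]]
print(tab)

f_stat  <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
```

With four groups of 30, the table has 3 degrees of freedom for the groups and 116 for the residuals.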

5.48

#The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.
#a)Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.
#Ho: mu1 = mu2 = mu3 = mu4 = mu5 (the average hours worked is the same in all five groups)
#Ha: at least one group's mean differs from the others
#b) Check conditions and describe any assumptions you must make to proceed with the test
#Independent: There seems to be independence across the groups

#Normal: The distributions seem to be approximately normal, except for bachelor's and maybe HS

#Variability: Seems equal variability across the groups

#c) Below is part of the output associated with this test. Fill in the empty cells.
Dfdegree = 4 
DfRes = 1167 
DfT = 1171
Prf <- 0.0682
Fstat <- qf( 1 - Prf, 4 , DfRes)
Fstat
## [1] 2.188931
MSG <- 501.54
MSG
## [1] 501.54
MSR <- MSG / Fstat
MSR
## [1] 229.1255
SSG <- 4 * MSG
SSG
## [1] 2006.16
SSRes <- 267382
SSRes
## [1] 267382
SST <- SSG + SSRes
SST
## [1] 269388.2
#d)What is the conclusion of the test?
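#Because the p-value (0.0682) is greater than .05, the null hypothesis cannot be rejected. The data do not provide convincing evidence that average hours worked differs across the five groups.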
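
As a consistency check on the table, the mean square error can also be computed directly from the residual sum of squares and its degrees of freedom, and the F statistic and p-value follow from it. The sketch below uses only the values given in the exercise's partial output; the variable names are chosen here for illustration.

```r
# Values given in the exercise's partial output
df_group <- 4
df_resid <- 1167
msg      <- 501.54   # mean square between groups
ss_resid <- 267382   # residual sum of squares

mse    <- ss_resid / df_resid            # mean square error
f_stat <- msg / mse
p_val  <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

print(round(c(MSE = mse, F = f_stat, p = p_val), 4))
```

The p-value recovered this way should agree with the reported 0.0682 up to rounding.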

6.8

#In January 2011, The Marist Poll published a report stating that 66% of adults nationally think licensed drivers should be required to retake their road test once they reach 65 years of age. It was also reported that interviews were conducted on 1,018 American adults, and that the margin of error was 3% using a 95% confidence level.
#(a) Verify the margin of error reported by The Marist Poll.
ME = 1.96*sqrt(.66*(1-.66)/1018)
round(ME, 3)
## [1] 0.029
#The margin of error is about 2.9%, which rounds to the reported 3%.
#(b) Based on a 95% confidence interval, does the poll provide convincing evidence that more than 70% of the population think that licensed drivers should be required to retake their road test once they turn 65?
p1 = .66
p2 = .70
n = 1018
z = (p1-p2)/sqrt(p2*(1-p2)/n)
z
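## [1] -2.784994
#No. The 95% confidence interval is .66 +/- .03 = (.63, .69), which lies entirely below .70, and the z score is negative, so the poll provides no evidence that more than 70% of the population holds this view.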
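
Since part (b) is phrased in terms of a confidence interval, it may help to compute the interval explicitly. This is a sketch using the poll's reported point estimate and sample size; the variable names are illustrative.

```r
p_hat <- 0.66
n     <- 1018
zstar <- qnorm(0.975)                       # about 1.96

me <- zstar * sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * me
print(round(me, 3))
print(round(ci, 3))
# TRUE means the whole interval lies below 70%
print(ci[2] < 0.70)
```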

6.16

#Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
#a)A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test to determine if these data provide strong evidence supporting this statement.
#Ho: p = .50, where p is the proportion of adults who decide not to go to college because they can't afford it

#Ha: p < .50 (only a minority decide not to go to college because they can't afford it)
p1 = .48
p2 = .50
n = 331
z1 = (p1-p2)/sqrt(p2*(1-p2)/n) 
z1
## [1] -0.7277362
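
The z statistic can also be turned into a one-sided p-value, which gives the same decision as comparing it with the critical value. A short sketch, recomputing z from the survey numbers so that it is self-contained:

```r
p_hat <- 0.48
p0    <- 0.50
n     <- 331

z       <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- pnorm(z)          # lower-tail area, matching Ha: p < 0.50
print(round(c(z = z, p_value = p_value), 3))
```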

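-.728 is greater than the critical value of -1.64, so we do not reject the null hypothesis. The data do not provide strong evidence that fewer than 50% decide not to go to college because they cannot afford it.

6.24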

#The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group.
#A hypothesis test would reject the conclusion that the survival rate is the same in each group, and so we might like to calculate a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
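#We can't use the normal approximation because the success-failure condition is not met: not every group has at least 10 survivors and at least 10 deaths. If we constructed the interval anyway, its actual coverage could differ from the stated confidence level, so the interval would be unreliable.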

6.32

# A news article reports that “Americans have differing views on two potentially inconvenient and invasive practices that airports could implement to uncover potential terrorist attacks.” This news piece was based on a survey conducted among a random sample of 1,137 adults nationwide, interviewed by telephone November 7-10, 2010, where one of the questions on the survey was “Some airports are now using ‘full-body’ digital x-ray machines to electronically screen passengers in airport security lines. Do you think these new x-ray machines should or should not be used at airports?” Below is a summary of responses based on party affiliation.
#a) Conduct an appropriate hypothesis test evaluating whether there is a difference in the proportion of Republicans and Democrats who think the full-body scans should be applied in airports. Assume that all relevant conditions are met.
# Counts by response (rows) and party (columns)
responses <- matrix(c(264,38,16,299,55,15), nrow=3)
colnames(responses) <- c("Republican", "Democrat")
rownames(responses) <- c("Should", "Should Not", "Don't Know/No Answer")
responses
##                      Republican Democrat
## Should                      264      299
## Should Not                   38       55
## Don't Know/No Answer         16       15
chisq.test(responses)
## 
##  Pearson's Chi-squared test
## 
## data:  responses
## X-squared = 1.5381, df = 2, p-value = 0.4635

The p-value (0.4635) is greater than .05, so we do not reject the null hypothesis. The data do not provide evidence of a difference between the parties.
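
The question asks specifically about the proportion who answered "Should", so a two-proportion z test on that response alone is another reasonable approach. The sketch below is an alternative to the chi-square test above, not a replacement for it. It takes the counts from the table and uses a pooled standard error; the variable names are illustrative.

```r
# "Should" counts and party totals from the table above
x <- c(Republican = 264, Democrat = 299)
n <- c(Republican = 264 + 38 + 16, Democrat = 299 + 55 + 15)

p_hat  <- x / n
p_pool <- sum(x) / sum(n)                       # pooled proportion under Ho
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z      <- unname((p_hat[1] - p_hat[2]) / se)
p_value <- 2 * pnorm(-abs(z))                   # two-sided

print(round(c(z = z, p_value = p_value), 3))
```

The built-in prop.test(x, n, correct = FALSE) reports the square of this z statistic as its X-squared value.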

6.40

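True or false, Part II. Determine if the statements below are true or false. For each false statement, suggest an alternative wording to make it a true statement.

(a) As the degrees of freedom increases, the mean of the chi-square distribution increases. TRUE

(b) If you found X2 = 10 with df = 5 you would fail to reject H0 at the 5% significance level.

pchisq(10, 5, lower.tail = FALSE)
## [1] 0.07523525

TRUE, because the p-value of 0.075 is greater than .05.

(c) When finding the p-value of a chi-square test, we always shade the tail areas in both tails. False. The chi-square statistic is always positive, so we shade only the upper tail. Alternative wording: when finding the p-value of a chi-square test, we always shade the upper tail area.

(d) As the degrees of freedom increases, the variability of the chi-square distribution decreases. False. The variance of the chi-square distribution is twice its degrees of freedom. Alternative wording: as the degrees of freedom increases, the variability of the chi-square distribution increases.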
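
Parts (a) and (d) can be checked empirically. The sketch below simulates chi-square draws for a few arbitrarily chosen degrees of freedom and compares the sample mean and standard deviation with the theoretical values, mean = df and sd = sqrt(2 x df).

```r
set.seed(42)
dfs <- c(2, 5, 10, 20)   # degrees of freedom chosen for illustration
sims <- sapply(dfs, function(k) {
  x <- rchisq(100000, df = k)
  c(mean = mean(x), sd = sd(x))
})
colnames(sims) <- paste0("df=", dfs)
print(round(sims, 2))
# Theory: mean = df and sd = sqrt(2 * df), so both grow with df
print(rbind(mean = dfs, sd = round(sqrt(2 * dfs), 2)))
```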

6.48

Researchers conducted a study investigating the relationship between caffeinated coffee consumption and risk of depression in women. They collected data on 50,739 women free of depression symptoms at the start of the study in the year 1996, and these women were followed through 2006. The researchers used questionnaires to collect data on caffeinated coffee consumption, asked each individual about physician-diagnosed depression, and also asked about the use of antidepressants. The table below shows the distribution of incidences of depression by amount of caffeinated coffee consumption.

(a) What type of test is appropriate for evaluating if there is an association between coffee intake and depression? A chi-square test of independence is most appropriate.

(b) Write the hypotheses for the test you identified in part (a). Ho: Depression is independent of coffee consumption. Ha: Depression is not independent of coffee consumption.

(c) Calculate the overall proportion of women who do and do not suffer from depression. Don't suffer: 48,132/50,739 = 0.949

Do suffer: 2,607/50,739 = 0.051

(d) Identify the expected count for the highlighted cell, and calculate the contribution of this cell to the test statistic, i.e. (Observed − Expected)^2/Expected.
Exp=(2607*6617)/50739
Exp
## [1] 339.9854
Obs <- 373
ChiSq = (Obs-Exp)^2/Exp
ChiSq
## [1] 3.205914
(e) The test statistic is χ² = 20.93. What is the p-value?
ChiSq=20.93
# 2 depression outcomes by 5 coffee consumption levels
n_rows=2
n_cols=5
df=(n_rows-1)*(n_cols-1)
round(pchisq(ChiSq, df, lower.tail = FALSE), 5)
## [1] 0.00033
(f) What is the conclusion of the hypothesis test? Because the p-value is less than .05, we reject the null hypothesis. The data provide evidence that coffee consumption and depression are associated.

(g) One of the authors of this study was quoted on the NYTimes as saying it was “too early to recommend that women load up on extra coffee” based on just this study. Do you agree with this statement? Explain your reasoning. Yes. This is an observational study, so the association does not show that coffee causes the difference in depression rates; confounding variables could explain it.

6.56

An experiment conducted by the MythBusters, a science entertainment TV program on the Discovery Channel, tested if a person can be subconsciously influenced into yawning if another person near them yawns. 50 people were randomly assigned to two groups: 34 to a group where a person near them yawned (treatment) and 16 to a group where there wasn’t a person yawning near them (control). The following table shows the results of this experiment. A simulation was conducted to understand the distribution of the test statistic under the assumption of independence: having someone yawn near another person has no influence on if the other person will yawn. In order to conduct the simulation, a researcher wrote yawn on 14 index cards and not yawn on 36 index cards to indicate whether or not a person yawned. Then he shuffled the cards and dealt them into two groups of size 34 and 16 for treatment and control, respectively. He counted how many participants in each simulated group yawned in an apparent response to a nearby yawning person, and calculated the difference between the simulated proportions of yawning. This simulation was repeated 10,000 times using software to obtain 10,000 differences that are due to chance alone. The histogram shows the distribution of the simulated differences.

a) What are the hypotheses for testing if yawning is contagious, i.e. whether it is more likely for someone to yawn if they see someone else yawning? Ho: Yawning is not contagious

Ha: Yawning is contagious

b) Calculate the observed difference between the yawning rates under the two scenarios. Control = 4/16 = .25

Treatment = 10/34 = .29

.29 - .25 = .04 or 4%

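c) Estimate the p-value using the figure above and determine the conclusion of the hypothesis test. The p-value is the proportion of simulated differences that are at least as large as the observed difference of .04. The null distribution is centered near 0 and the observed difference is close to its center, so a large share of the simulated differences are at least this large and the p-value is far above .05. Therefore we fail to reject the null hypothesis; the data do not provide convincing evidence that yawning is contagious.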
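
The card-shuffling procedure described in the exercise is simple to reproduce, which gives a way to estimate the p-value without the histogram. The sketch below is one possible implementation, with a seed and variable names chosen here. It also computes the same tail probability exactly from the hypergeometric distribution.

```r
set.seed(2024)
cards <- rep(c(1, 0), times = c(14, 36))   # 1 = yawn, 0 = not yawn

# Shuffle, deal the first 34 cards to treatment, and count treatment yawners
sim_trt_yawns <- replicate(10000, sum(sample(cards)[1:34]))

# With 14 yawners in total, the simulated difference in proportions is at least
# the observed one exactly when the treatment group has 10 or more yawners
p_sim   <- mean(sim_trt_yawns >= 10)
# Exact version of the same tail probability from the hypergeometric distribution
p_exact <- phyper(9, m = 14, n = 36, k = 34, lower.tail = FALSE)

print(c(simulated = p_sim, exact = round(p_exact, 3)))
```

Working with counts rather than differences in proportions avoids floating-point comparisons at the observed value.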