# 5.6 Working backwards, Part II. A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.
n <- 25
x1 <- 65
x2 <- 77

SMean <- (x2 + x1) / 2
#Sample Mean
SMean
## [1] 71
MoE <- (x2-x1)/2
#Margin of error
MoE
## [1] 6
df <- 25 - 1
p <- 0.9
p_2tails <- p + (1 - p)/2

t_val <- qt(p_2tails, df)

# Since ME = t * SE
SE <- MoE / t_val
SE
## [1] 3.506963
# Since SE = sd/sqrt(n)
sd <- SE * sqrt(n)
sd
## [1] 17.53481
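# A quick consistency check on the steps above: a minimal sketch, assuming the
# interval was built as mean +/- t* x sd/sqrt(n) with n - 1 degrees of freedom.
# The helper name recover_from_ci is introduced here for illustration only.
recover_from_ci <- function(lower, upper, n, conf) {
  t_star <- qt(conf + (1 - conf) / 2, df = n - 1)  # two-sided critical value
  moe <- (upper - lower) / 2                       # half the interval width
  list(mean = (lower + upper) / 2,
       moe = moe,
       sd = moe / t_star * sqrt(n))
}
res <- recover_from_ci(65, 77, n = 25, conf = 0.90)
# Rebuilding the interval from the recovered values should return the endpoints
rebuilt <- res$mean + c(-1, 1) * qt(0.95, df = 24) * res$sd / sqrt(25)
all.equal(rebuilt, c(65, 77))
## [1] TRUE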
#5.12 Auto exhaust and lead exposure. Researchers interested in lead exposure due to car exhaust sampled the blood of 52 police officers subjected to constant inhalation of automobile exhaust fumes while working traffic enforcement in a primarily urban environment. The blood samples of these officers had an average lead concentration of 124.32 µg/l and a SD of 37.74 µg/l; a previous study of individuals from a nearby suburb, with no history of exposure, found an average blood level concentration of 35 µg/l.
n <- 52
pMean <- 35
sMean <- 124.32
sSD <- 37.74
#(a) Write down the hypotheses that would be appropriate for testing if the police officers appear to have been exposed to a higher concentration of lead. 
#Ho: mu = 35. The average blood lead concentration of police officers equals that of the unexposed suburban group.
#Ha: mu > 35. The average blood lead concentration of police officers is higher than that of the unexposed group.
#(b) Explicitly state and check all conditions necessary for inference on these data. 

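#-> Independence: the exercise does not say the 52 officers were randomly sampled, so independence has to be assumed. If they are a random sample, 52 is less than 10% of the population whenever the population of such officers exceeds 520.
#-> Normality: no plot of the data is available, so skew cannot be checked directly. With n = 52 (at least 30), the sampling distribution of the mean should be nearly normal unless the data are extremely skewed.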
#(c) Test the hypothesis that the downtown police officers have a higher lead exposure than the group in the previous study. Interpret your results in context. 
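# Only summary statistics are available, so this call simulates 52 draws from a normal distribution with the sample mean and SD.
# No seed is set, so the output below comes from one particular draw and will differ between runs.
# t.test is two-sided by default; a one-sided test of "higher" would add alternative = "greater".
t.test(rnorm(n = 52, sMean, sSD), mu = pMean)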
## 
##  One Sample t-test
## 
## data:  rnorm(n = 52, sMean, sSD)
## t = 17.766, df = 51, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 35
## 95 percent confidence interval:
##  117.8927 139.0130
## sample estimates:
## mean of x 
##  128.4529
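#According to this t-test, the p-value (< 2.2e-16) is far below 0.05, so we reject Ho: there is strong evidence that the police officers have a higher average blood lead concentration than the unexposed group.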
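# The test statistic can also be computed directly from the reported summary
# statistics, with no simulated data. A minimal sketch, assuming a one-sided
# alternative (mean > 35); the variable names are introduced here for illustration.
n_off <- 52
xbar_off <- 124.32
s_off <- 37.74
mu0 <- 35
se_off <- s_off / sqrt(n_off)                 # standard error of the mean
t_stat <- (xbar_off - mu0) / se_off           # roughly 17.07
p_one_sided <- pt(t_stat, df = n_off - 1, lower.tail = FALSE)
t_stat
p_one_sided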
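#(d) Based on your preceding result, without performing a calculation, would a 99% confidence interval for the average blood concentration level of police officers contain 35 µg/l?
#No. The test rejects Ho with a p-value far below 0.01, so 35 µg/l is not a plausible value for the mean even at the 99% level. A 99% interval is wider than the 95% interval shown above, but its lower bound moves down only slightly and stays far above 35.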
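# Although part (d) asks for no calculation, the answer can be checked.
# A minimal sketch, assuming a t-based interval on the summary statistics with df = 51;
# ci99 is a name introduced here for illustration.
ci99 <- 124.32 + c(-1, 1) * qt(0.995, df = 52 - 1) * 37.74 / sqrt(52)
ci99
# TRUE means 35 lies below the whole interval
ci99[1] > 35
## [1] TRUE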

#5.18 Paired or not, Part II? In each of the following scenarios, determine if the data are paired. 
#(a) We would like to know if Intel's stock and Southwest Airlines' stock have similar rates of return. To find out, we take a random sample of 50 days, and record Intel's and Southwest's stock on those same days. 
#Paired. Each Intel observation is matched with the Southwest observation from the same day, so the two series share that day's market conditions and should be analyzed as 50 daily pairs.
#(b) We randomly sample 50 items from Target stores and note the price for each. Then we visit Walmart and collect the price for each of those same 50 items. 
#Paired. The same 50 items are priced at both stores, so each Target price corresponds to exactly one Walmart price for the same item.

#(c) A school board would like to determine whether there is a difference in average SAT scores for students at one high school versus another high school in the district. To check, they take a simple random sample of 100 students from each high school.
#Not paired. The two samples are drawn independently and contain different students, so no student at one school corresponds to a particular student at the other.

#5.24 Sample size and pairing. Determine if the following statement is true or false, and if false, explain your reasoning: If comparing means of two groups with equal sample sizes, always use a paired test.
#False. Equal sample sizes do not make data paired. A paired t-test is appropriate only when each observation in one group has a natural correspondence with exactly one observation in the other group (for example, the same subject measured twice). Two independent groups that happen to have the same size should be compared with a two-sample test.
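# To illustrate why pairing matters, here is a minimal sketch on simulated
# (hypothetical) data: 20 subjects measured twice, where the second measurement
# is the first plus a small shift. The data and names are made up for illustration.
set.seed(1)
before <- rnorm(20, mean = 50, sd = 10)
after <- before + rnorm(20, mean = 2, sd = 1)
paired_res <- t.test(after, before, paired = TRUE)  # uses the 20 within-subject differences
unpaired_res <- t.test(after, before)               # treats the groups as independent
c(paired = paired_res$p.value, unpaired = unpaired_res$p.value)
# The paired test removes subject-to-subject variation, so it detects the shift
# far more easily; it is only valid because the observations really are paired.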

# 5.30 Diamonds, Part II. In Exercise 5.28, we discussed diamond prices (standardized by weight) for diamonds with weights 0.99 carats and 1 carat. See the table for summary statistics, and then construct a 95% confidence interval for the average difference between the standardized prices of 0.99 and 1 carat diamonds. You may assume the conditions for inference are met.
Z = 1.96
m_99 <- 44.51
SD_99 <- 13.32
n_99 <- 23
m_1 <- 56.81
SD_1 <- 16.13
n_1 <- 23
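# Separate 95% z-based intervals for each group mean. These describe each group on its own; they are not an interval for the difference in means that the exercise asks for.
CI_99 <- c(m_99-(Z*SD_99/sqrt(n_99)), m_99+(Z*SD_99/sqrt(n_99)))
CI_1 <- c(m_1-(Z*SD_1/sqrt(n_1)), m_1+(Z*SD_1/sqrt(n_1)))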
CI_99
## [1] 39.06627 49.95373
CI_1
## [1] 50.21786 63.40214
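# The exercise asks for an interval for the difference in means. A minimal sketch,
# assuming independent groups, a t critical value in place of Z = 1.96, and the
# conservative df = min(n1, n2) - 1 = 22. Names are introduced here for illustration.
diff_hat <- 44.51 - 56.81                       # mean(0.99 carat) - mean(1 carat)
se_diff <- sqrt(13.32^2 / 23 + 16.13^2 / 23)    # SE of the difference
t_star <- qt(0.975, df = min(23, 23) - 1)
ci_diff <- diff_hat + c(-1, 1) * t_star * se_diff
ci_diff   # roughly (-21.35, -3.25)
# The whole interval is below zero, which points to a lower average standardized
# price for 0.99 carat diamonds.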
# 5.36 Gaming and distracted eating, Part II. The researchers from Exercise 5.35 also investigated the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction) was 6.1, with a standard deviation of 1.8. Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different?

n<- 22
sMean <- 4.9
sSD <- 1.8
pMean <- 6.1

z = (sMean - pMean)/(sSD/sqrt(n))
alpha = .05
z.half.alpha = qnorm(1-(alpha/2))
c(-z.half.alpha, z.half.alpha)
## [1] -1.959964  1.959964
z
## [1] -3.126944
pval = 2 * pnorm(z)
pval
## [1] 0.001766337
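#Because the p-value (about 0.0018) is less than .05, we reject the null hypothesis: the data provide evidence that the average number of items recalled differs between the treatment and control groups. Note that this calculation treats the control mean of 6.1 as a fixed reference value and ignores the sampling variability in the control group, so it overstates the evidence relative to a two-sample comparison.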
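# A two-sample version of the comparison, sketched under a labeled assumption:
# the control group size is not stated in this exercise, so n_ctl = 22 is assumed
# to match the treatment group. It uses the conservative df = min(n1, n2) - 1.
# Names are introduced here for illustration.
n_trt <- 22
n_ctl <- 22   # assumption, not given in the text above
se_2s <- sqrt(1.8^2 / n_trt + 1.8^2 / n_ctl)   # SE of the difference in means
t_2s <- (4.9 - 6.1) / se_2s                    # roughly -2.21
p_2s <- 2 * pt(-abs(t_2s), df = min(n_trt, n_ctl) - 1)
t_2s
p_2s
# Under these assumptions the p-value falls between 0.02 and 0.05, so the
# conclusion at the 5% level is the same, though the evidence is weaker than the
# one-sample z calculation suggests.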

# 5.42 Which test? We would like to test if students who are in the social sciences, natural sciences, arts and humanities, and other fields spend the same amount of time studying for this course. What type of test should we use? Explain your reasoning.

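#ANOVA is the appropriate test: we are comparing the mean study time across more than two groups (four fields), and ANOVA tests whether all group means are equal in a single test instead of many pairwise t-tests.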
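# What such a test could look like in R: a minimal sketch on simulated
# (hypothetical) study hours for 15 students per field. The data frame and its
# column names are made up for illustration.
set.seed(42)
study <- data.frame(
  field = factor(rep(c("social", "natural", "arts", "other"), each = 15)),
  hours = rnorm(60, mean = 10, sd = 3)
)
fit <- aov(hours ~ field, data = study)   # one-way ANOVA: hours by field
summary(fit)                              # F test of equal group means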

# 5.48 Work hours and education. The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.
mu <- c(38.67, 39.6, 41.39, 42.55, 40.85)
sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)
n <- c(121, 546, 97, 253, 155)
k <- 5
MSG <- 501.54
SSE <- 267382
# Total sample size (kept separate from the vector of group sizes n)
N <- sum(n)
N
## [1] 1172
p <- 0.0682
#Find Df
dfG <- k-1
dfE <- N-k
dfT <- N-1
df <- c(dfG, dfE, dfT)
df
## [1]    4 1167 1171
# Find Sum Sq (SSG = MSG * dfG, SST = SSG + SSE)
SSG <- MSG * dfG
SS <- c(SSG, SSE, SSG + SSE)
# Find Mean Sq
MSE <- SSE / dfE
MS <- c(MSG, MSE, NA)
MSE
## [1] 229.1191
MS
## [1] 501.5400 229.1191       NA
# Find F-value
Fv <- MSG / MSE
Fv
## [1] 2.188992
# One value per row in every column, so nothing is recycled down the table
myTable.dt <- data.frame(df, SS, MS, c(Fv, NA, NA), c(p, NA, NA))
colnames(myTable.dt) <- c("Df", "Sum Sq", "Mean Sq", "F Value", "Pr(>F)")
rownames(myTable.dt) <- c("degree", "Residuals", "Total")
myTable.dt[1:5]
##             Df    Sum Sq  Mean Sq  F Value Pr(>F)
## degree       4   2006.16 501.5400 2.188992 0.0682
## Residuals 1167 267382.00 229.1191       NA     NA
## Total     1171 269388.16       NA       NA     NA
#The p-value (0.0682) is greater than .05, so we fail to reject the null hypothesis: the data do not provide convincing evidence that average hours worked differ across the educational attainment groups.
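# The reported p-value can be cross-checked from the F statistic. A minimal
# sketch, assuming residual df = N - k = 1172 - 5 = 1167; names are introduced
# here for illustration.
F_obs <- 501.54 / (267382 / (1172 - 5))   # MSG / MSE
p_from_F <- pf(F_obs, df1 = 5 - 1, df2 = 1172 - 5, lower.tail = FALSE)
F_obs
p_from_F   # should be close to the reported 0.0682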