Load Packages
## prettydoc knitr
## TRUE TRUE
Chapter 5 - Inference for Numerical Data
Practice: 5.5, 5.13, 5.19, 5.31, 5.45
Graded: 5.6, 5.14, 5.20, 5.32, 5.48
5.6 Working backwards, Part II.
A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.
# givens
ci <- 0.9
ci_values <- c(65, 77)
n <- 25
df <- n - 1
# calcs
me <- (ci_values[2] - ci_values[1])/2
xbar <- ci_values[1] + me
cv <- abs(qt((1 - ci)/2, df))
se <- me/cv
s <- se * sqrt(n)
# answers
xbar## [1] 71
me## [1] 6
s## [1] 17.53481
5.14 SAT scores.
SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.
- Raina wants to use a 90% confidence interval. How large a sample should she collect?
# givens
sigma <- 250
ci <- 0.9
me <- 25
# calcs
cv <- abs(qnorm((1 - ci)/2))
se <- me/cv
n <- (sigma/se)^2
ceiling(n)## [1] 271
- Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.
Given a maximum ME of 25, the sample size is going to have to be larger. With a CI of .99, the critical value (CV) is going to increase. Since ME is SE * CV , the SE is going to need to be smaller to still get an ME of 25. Since SE is the SD divided by the square root of n, in order to get a smaller SE, the sample size will need to be larger.
- Calculate the minimum required sample size for Luke.
# givens
sigma <- 250
ci <- 0.99
me <- 25
# calcs
cv <- abs(qnorm((1 - ci)/2))
se <- me/cv
n <- (sigma/se)^2
ceiling(n)## [1] 664
5.20 High School and Beyond, Part I.
The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.
Is there a clear difference in the average reading and writing scores? No, not necessarily. In the box plots, the medians are not too far apart, and the IQRs overlap each other. In the histogram, the distribution of the score differences is centered around 0.
Are the reading and writing scores of each student independent of each other? No, these are paired data since each student took the both tests. For each observation in one group, there is exactly one specially-corresponding observation in the other group.
Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?
\(H_{0}: \mu_{diff}=0\) \(H_{A}: \mu_{diff}\neq0\)
- Check the conditions required to complete this test.
- The observations are a simple random sample of less than 10% of the population.
- The sample size is at least 30.
- Per the histogram, the distribution is not strongly skewed.
- The average observed difference in scores is ¯xreadwrite = 0.545, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?
xbar <- 0.545
s <- 8.887
n <- 200
df <- n - 1
se <- s/sqrt(n)
null_hyp <- 0
t_score <- (xbar - null_hyp)/se
pt(t_score, df, lower.tail = F) * 2## [1] 0.3868365
print("No, because p = .39, we fail to reject the null hypothesis. The sample's average difference in scores do not provide convincing evidence that there is a diffenence in scores")## [1] "No, because p = .39, we fail to reject the null hypothesis. The sample's average difference in scores do not provide convincing evidence that there is a diffenence in scores"
What type of error might we have made? Explain what the error means in the context of the application. We may have made a Type 2 error. Because we didn’t find evidence of a difference, we would have made a Type 2 error if the alternative hypothesis, that there is a difference, is true.
Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning. Yes, \(\bar{x}\) had a large p-value if 0 was population mean. Conversely, 0 is going to be within the confidence interval constructed around \(\bar{x}\).
5.32 Fuel eciency of manual and automatic cars, Part I.
Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012.
Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.
Automatic Manual
Mean 16.12 19.85 SD 3.58 4.51 n 26 26
# givens
n1 <- 26
n2 <- n1
xbar1 <- 16.12
xbar2 <- 19.85
xbar_diff <- xbar2 - xbar1
s1 <- 3.58
s2 <- 4.51
alpha <- 0.05
null_hyp <- 0
# calculations
se_diff <- sqrt(s1^2/n1 + s2^2/n2)
df_diff <- min(n1, n2) - 1
t_score <- (xbar_diff - null_hyp)/se_diff
pvalue <- pt(t_score, df_diff, lower.tail = F) * 2
# conclusions
paste("p-value = ", pvalue)## [1] "p-value = 0.00288361481312234"
ifelse(pvalue < alpha, "Reject the null hypothesis", "Fail to reject the null hypothesis")## [1] "Reject the null hypothesis"
5.45 Coffee, depression, and physical activity.
Caffeine is the world’s most widely used stimulant, with approximately 80% consumed in the form of coffee. Participants in a study investigating the relationship between coffee consumption and exercise were asked to report the number of hours they spent per week on moderate (e.g., brisk walking) and vigorous (e.g., strenuous sports and jogging) exercise. Based on these data the researchers estimated the total hours of metabolic equivalent tasks (MET) per week, a value always greater than 0. The table below gives summary statistics of MET for women in this study based on the amount of coffee consumed.
- Write the hypotheses for evaluating if the average physical activity level varies among the different levels of coffee consumption.
Check conditions and describe any assumptions you must make to proceed with the test.
Below is part of the output associated with this test. Fill in the empty cells.
| na | Df | Sum Sq | Mean Sq | F-value | Pr(>F) |
|---|---|---|---|---|---|
| coffee | 4 | 10508 | 2627 | 5.213345 | 0.0003 |
| Residuals | 50,734 | 25,564,819 | 503.8991 | ||
| Total | 50,738 | 25,575,327 |
k <- 5 #groups
n <- 50739 #total sample
DF_G <- k - 1
DF_E <- n - k
DF_T <- DF_G + DF_E
DF_G## [1] 4
DF_E## [1] 50734
DF_T## [1] 50738
SSE <- 25564819
SST <- 25575327
SSG <- SST - SSE
SSG## [1] 10508
MSG <- SSG/DF_G
MSG## [1] 2627
MSE <- SSE/DF_E
MSE## [1] 503.8991
fvalue <- MSG/MSE
fvalue## [1] 5.213345
pf(fvalue, DF_G, DF_E, lower.tail = F)## [1] 0.0003391831
- What is the conclusion of the test?
5.48 Work hours and education.
The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.
- Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.
\(H_{0}:\) The population mean of each group is equal to the others
\(H_{A}:\) At least one pair of means is different
- Check conditions and describe any assumptions you must make to proceed with the test. We are given very little information about the conditions. We are going to have to assume the following:
- The observations were collected by a simple random sample. It is obviously less than 10% of the population.
- All of the group sample sizes are rather large so we can relax the normality assumption.
- We can’t be completely certain that the variance in the groups are about equal. The SDs are similiar except for Jr Coll.
- Below is part of the output associated with this test. Fill in the empty cells.
| na | Df | Sum Sq | Mean Sq | F-value | Pr(>F) |
|---|---|---|---|---|---|
| degree | 4 | 2006.16 | 501.54 | 2.188992 | 0.0682 |
| Residuals | 1167 | 267382 | 229.1191 | ||
| Total | 1171 | 269388.2 |
# givens
k <- 5 #groups
n <- 1172 #total sample
SSE <- 267382
MSG <- 501.54
# calcs
DF_G <- k - 1
DF_E <- n - k
DF_T <- DF_G + DF_E
DF_G## [1] 4
DF_E## [1] 1167
DF_T## [1] 1171
SSG <- DF_G * MSG
SSG## [1] 2006.16
SST <- SSG + SSE
SST## [1] 269388.2
MSE <- SSE/DF_E
MSE## [1] 229.1191
fvalue <- MSG/MSE
fvalue## [1] 2.188992
pf(fvalue, DF_G, DF_E, lower.tail = F)## [1] 0.06819325
- What is the conclusion of the test?
Since the p-value is greater than .05, we are failing to reject the null hypothesis. The data are not convincing evidence that at least one of pair of means is different.
Chapter 6 - Inference for Categorical Data
Practice: 6.5, 6.11, 6.27, 6.43, 6.47
Graded: 6.6, 6.12, 6.20, 6.28, 6.44, 6.48
6.6 2010 Healthcare Law.
On June 28, 2012 the U.S. Supreme Court upheld the much debated 2010 healthcare law, declaring it constitutional. A Gallup poll released the day after this decision indicates that 46% of 1,012 Americans agree with this decision. At a 95% confidence level, this sample has a 3% margin of error. Based on this information, determine if the following statements are true or false, and explain your reasoning.
- We are 95% confident that between 43% and 49% of Americans in this sample support the decision of the U.S. Supreme Court on the 2010 healthcare law.
False. A CI is constructed to estimate the population proportion.
- We are 95% confident that between 43% and 49% of Americans support the decision of the U.S. Supreme Court on the 2010 healthcare law.
True. This is a correct interpretation of a CI with a 3% ME.
- If we considered many random samples of 1,012 Americans, and we calculated the sample proportions of those who support the decision of the U.S. Supreme Court, 95% of those sample proportions will be between 43% and 49%.
False. 95% of the CIs will contain the population proportion.
- The margin of error at a 90% confidence level would be higher than 3%
False. For a 90% CI, the critical value descreases from 1.96 to 1.64. If the SE remains the same, this will decrease the ME.
6.12 Legalization of marijuana, Part I.
The 2010 General Social Survey asked 1,259 US residents: “Do you think the use of marijuana should be made legal, or not?” 48% of the respondents said it should be made legal.
- Is 48% a sample statistic or a population parameter? Explain.
Since it is the result of a survey, 48% is a sample statistic.
- Construct a 95% confidence interval for the proportion of US residents who think marijuana should be made legal, and interpret it in the context of the data.
# givens
n <- 1259
p_hat <- 0.48
cL <- 0.95
cv <- abs(qnorm((1 - cL)/2))
# calculations
se <- sqrt(p_hat * (1 - p_hat)/n)
me <- se * cv
# CI values
p_hat + me## [1] 0.5075967
p_hat - me## [1] 0.4524033
We are 95% confident that between 45% and 51% of US residents think that marijuana should be legal.
- A critic points out that this 95% confidence interval is only accurate if the statistic follows a normal distribution, or if the normal model is a good approximation. Is this true for these data? Explain.
Yes. It is true if observations are indedendent, which is a reasonable assumption given the organization. Also, the sample size is sufficiently large since there are more than 10 successes and failures.
- A news piece on this survey’s findings states, “Majority of Americans think marijuana should be legalized.” Based on your confidence interval, is this news piece’s statement justified?
No, most of the confidence interval is below 50%.
6.20 Legalize Marijuana, Part II.
As discussed in Exercise 6.12, the 2010 General Social Survey reported a sample where about 48% of US residents thought marijuana should be made legal. If we wanted to limit the margin of error of a 95% confidence interval to 2%, about how many Americans would we need to survey?
max_me <- 0.02
cL <- 0.95
cv <- abs(qnorm((1 - cL)/2))
max_p_hat <- 0.48
se <- max_me/cv
n <- max_p_hat * (1 - max_p_hat)/se^2
ceiling(n)## [1] 2398
6.28 Sleep deprivation, CA vs. OR, Part I.
According to a report on sleep deprivation by the Centers for Disease Control and Prevention, the proportion of California residents who reported insufficient rest or sleep during each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon residents. These data are based on simple random samples of 11,545 California and 4,691 Oregon residents. Calculate a 95% confidence interval for the difference between the proportions of Californians and Oregonians who are sleep deprived and interpret it in context of the data.
# givens
n1 <- 11545
n2 <- 4691
p_hat1 <- 0.08
p_hat2 <- 0.088
cL <- 0.95
cv <- abs(qnorm((1 - cL)/2))
# calculations
p_hat_diff <- p_hat1 - p_hat2
se <- sqrt(p_hat1 * (1 - p_hat1)/n1 + p_hat2 * (1 - p_hat2)/n2)
me <- se * cv
# CI values
p_hat_diff## [1] -0.008
CI <- c(p_hat_diff - me, p_hat_diff + me)
CI## [1] -0.017497954 0.001497954
print("We are 95% confident that the proportion of Californians who reported being sleep deprived is between 1.7% less than and 0.1% more than Oregon residents who reported the same.")## [1] "We are 95% confident that the proportion of Californians who reported being sleep deprived is between 1.7% less than and 0.1% more than Oregon residents who reported the same."
6.44 Barking deer.
Microhabitat factors associated with forage and bed sites of barking deer in Hainan Island, China were examined from 2001 to 2002.
In this region woods make up 4.8% of the land, cultivated grass plot makes up 14.7%, and deciduous forests makes up 39.6%. Of the 426 sites where the deer forage, 4 were categorized as woods, 16 as cultivated grassplot, and 61 as deciduous forests. The table below summarizes these data.
Woods Cultivated grassplot Deciduous forests Other Total 4 16 61 345 426
Write the hypotheses for testing if barking deer prefer to forage in certain habitats over others. \(H_{0}:\) The barking deer have no preference in certain habitats. The observed counts refect natural sampling fluctuation. \(H_{A}:\) The barking deer prefer certain habitats over others.
What type of test can we use to answer this research question?
a chi-squared goodness of fitness test
- Check if the assumptions and conditions required for this test are satisfied.
We must assumed that the researchers attempted to collect cases that were independent of each other. Each scenario has at least 5 expected cases.
point_estimates <- c(4, 16, 61, 345)
p <- c(0.048, 0.147, 0.396)
p <- c(p, 1 - sum(p))
EC <- p * sum(point_estimates)
EC## [1] 20.448 62.622 168.696 174.234
EC >= 5## [1] TRUE TRUE TRUE TRUE
- Do these data provide convincing evidence that barking deer prefer to forage in certain habitats over others? Conduct an appropriate hypothesis test to answer this research question.
# manual chi-squared calc
habitat <- c("Woods", "Cultivated grassplot", "Deciduous forests", "Other")
my_df <- data.frame(habitat, point_estimates, p, expected_cases = EC, zvalues = (point_estimates -
EC)/sqrt(EC))
kable(my_df)| habitat | point_estimates | p | expected_cases | zvalues |
|---|---|---|---|---|
| Woods | 4 | 0.048 | 20.448 | -3.637372 |
| Cultivated grassplot | 16 | 0.147 | 62.622 | -5.891521 |
| Deciduous forests | 61 | 0.396 | 168.696 | -8.291769 |
| Other | 345 | 0.409 | 174.234 | 12.937041 |
xsq <- sum(my_df$zvalues^2)
paste("chi-squared test statistic:", round(xsq, 2))## [1] "chi-squared test statistic: 284.06"
# using r function
a <- chisq.test(point_estimates, p = p)
a##
## Chi-squared test for given probabilities
##
## data: point_estimates
## X-squared = 284.06, df = 3, p-value < 2.2e-16
a$p.value## [1] 2.799724e-61
print("Since the p-value is near 0, we reject the null hypothesis that the barking deer have no habitat preferences.")## [1] "Since the p-value is near 0, we reject the null hypothesis that the barking deer have no habitat preferences."
6.48 Coffee and Depression.
Researchers conducted a study investigating the relationship between caffeinated coffee consumption and risk of depression in women. They collected data on 50,739 women free of depression symptoms at the start of the study in the year 1996, and these women were followed through 2006. The researchers used questionnaires to collect data on caffeinated coffee consumption, asked each individual about physician-diagnosed depression, and also asked about the use of antidepressants. The table below shows the distribution of incidences of depression by amount of caffeinated coffee consumption.
- What type of test is appropriate for evaluating if there is an association between coffee intake and depression?
Since this is a bivariate contigency table, we will use the chi-square test of independence.
- Write the hypotheses for the test you identified in part (a).
\(H_{0}:\) The amount of coffee consumed has no difference in the incidences of depression. \(H_{A}:\) The amount of coffee consumed has differences in the incidences of depression.
- Calculate the overall proportion of women who do and do not suffer from depression.
# create matrix
values <- c(670, 373, 905, 564, 95, 11545, 6244, 16329, 11726, 2288)
coldims <- c("<1 cup/week", "2-6 cups/week", "1 cup/day", "2-3 cups/day", ">= 4 cups/day")
rowdims <- c("Depression", "nDepression")
mdims <- list(rowdims, coldims)
my_matrix <- matrix(values, 2, 5, byrow = TRUE, dimnames = mdims)
sum(my_matrix[1, ])/sum(my_matrix)## [1] 0.05138059
sum(my_matrix[2, ])/sum(my_matrix)## [1] 0.9486194
- Identify the expected count for the highlighted cell, and calculate the contribution of this cell to the test statistic, i.e. (Observed - Expected)^2/Expected.
obs_value <- my_matrix[1, 2]
EV <- sum(my_matrix[, 2]) * sum(my_matrix[1, ])/sum(my_matrix)
(obs_value - EV)^2/EV## [1] 3.205914
- The test statistic is x^2 = 20.93. What is the p-value?
my_chisq_test <- chisq.test(my_matrix)
my_chisq_test$p.value## [1] 0.0003267104
# my_matrix my_chisq_test$expected- What is the conclusion of the hypothesis test?
Because we have a tiny p-value, we will reject the null hypothesis that there is no difference
- One of the authors of this study was quoted on the NYTimes as saying it was “too early to recommend that women load up on extra coffee” based on just this study. Do you agree with this statement? Explain your reasoning.
I agree with this statement for a few reasons
Because the researchers didn’t randomly assign the women into the treatment groups, the results of the study are generializiable to the population, but can only infer correlation and not causality. There may be a confounding variable not addressed in the study.
There may be other negative side affects to increasing coffee consumption, and there may be more effective ways to prevent depression, like exercise.
Practice
5.5 Working backwards, Part I.
A 95% confidence interval for a population mean, µ, is given as (18.985, 21.015). This confidence interval is based on a simple random sample of 36 observations. Calculate the sample mean and standard deviation. Assume that all conditions necessary for inference are satisfied. Use the t-distribution in any calculations.
# givens
ci <- 0.95
ci_values <- c(18.985, 21.015)
n <- 36
# calcs
me <- (ci_values[2] - ci_values[1])/2
xbar <- ci_values[1] + me
cv <- abs(qt((1 - ci)/2, n - 1))
se <- me/cv
s <- se * sqrt(n)
xbar## [1] 20
s## [1] 2.999841
5.13 Car insurance savings.
A market researcher wants to evaluate car insurance savings at a competing company. Based on past studies he is assuming that the standard deviation of savings is $100. He wants to collect data such that he can get a margin of error of no more than $10 at a 95% confidence level. How large of a sample should he collect?
# givens
sigma <- 100
ci <- 0.95
me <- 10
# calcs
cv <- abs(qnorm((1 - ci)/2))
se <- me/cv
n <- (sigma/se)^2
ceiling(n)## [1] 385
5.19 Global warming, Part I.
Is there strong evidence of global warming? Let’s consider a small scale example, comparing how temperatures have changed in the US from 1968 to 2008. The daily high temperature reading on January 1 was collected in 1968 and 2008 for 51 randomly selected locations in the continental US. Then the difference between the two readings (temperature in 2008 - temperature in 1968) was calculated for each of the 51 different locations. The average of these 51 values was 1.1 degrees with a standard deviation of 4.9 degrees. We are interested in determining whether these data provide strong evidence of temperature warming in the continental US. (a) Is there a relationship between the observations collected in 1968 and 2008? Or are the observations in the two groups independent? Explain. They have a special correspondence since the same locations were used in both samples. Consequently, we should use the t-distribution. (b) Write hypotheses for this research in symbols and in words.
\(H_{0}: \mu_{diff}=0\) \(H_{A}: \mu_{diff}>0\)
Check the conditions required to complete this test.
Calculate the test statistic and find the p-value.
s <- 4.9
xbar <- 1.1
n <- 51
df <- n - 1
se <- s/sqrt(n)
t_score <- (xbar - 0)/se
pt(t_score, df, lower.tail = F)## [1] 0.05759731
- What do you conclude? Interpret your conclusion in context.
- What type of error might we have made? Explain in context what the error means.
- Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the temperature measurements from 1968 and 2008 to include 0? Explain your reasoning.
5.31 Chicken diet and weight, Part I.
Chicken farming is a multi-billion dollar industry, and any methods that increase the growth rate of young chicks can reduce consumer costs while increasing company profits, possibly by millions of dollars.
An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Below are some summary statistics from this data set along with box plots showing the distribution of weights by feed type.
Mean SD n
horsebean 160.20 38.63 10 linseed 218.75 52.24 12
Describe the distributions of weights of chickens that were fed linseed and horsebean. The horsebean distribution has a much lower median, lower IQR, and narrower IQR than the linseed distribution, which has a much higher median, higher IQR and wider IQR. The IQRs do not overlap.
Do these data provide strong evidence that the average weights of chickens that were fed linseed and horsebean are different? Use a 5% significance level.
# givens
n1 <- 10
n2 <- 12
xbar1 <- 160.2
xbar2 <- 218.75
xbar_diff <- xbar2 - xbar1
s1 <- 38.63
s2 <- 52.24
sign_level <- 0.05
null_hyp <- 0
# calculations
se_diff <- sqrt(s1^2/n1 + s2^2/n2)
df_diff <- min(n1, n2) - 1
t_score <- (xbar_diff - null_hyp)/se_diff
pvalue <- pt(t_score, df_diff, lower.tail = F) * 2
# conclusions
paste("p-value = ", pvalue)## [1] "p-value = 0.0145523176477712"
ifelse(pvalue < sign_level, "Reject the null hypothesis", "Fail to reject the null hypothesis")## [1] "Reject the null hypothesis"
- What type of error might we have committed? Explain. Type 1
- Would your conclusion change if we used \(\alpha\) = 0.01? yes
6.5 Prop 19 in California.
In a 2010 Survey USA poll, 70% of the 119 respondents between the ages of 18 and 34 said they would vote in the 2010 general election for Prop 19, which would change California law to legalize marijuana and allow it to be regulated and taxed. At a 95% confidence level, this sample has an 8% margin of error. Based on this information, determine if the following statements are true or false, and explain your reasoning.
- We are 95% confident that between 62% and 78% of the California voters in this sample support Prop 19.
False. A CI is constructed to estimate the population proportion.
- We are 95% confident that between 62% and 78% of all California voters between the ages of 18 and 34 support Prop 19.
True
- If we considered many random samples of 119 California voters between the ages of 18 and 34, and we calculated 95% confidence intervals for each, 95% of them will include the true population proportion of 18-34 year old Californians who support Prop 19.
True, that is a correct interpretation of the CI.
- In order to decrease the margin of error to 4%, we would need to quadruple (multiply by 4) the sample size. False. Assuming the same .7 proportion, we would need to more than quadruple it to 505.
max_me <- 0.04
cL <- 0.95
cv <- abs(qnorm((1 - cL)/2))
max_p_hat <- 0.7
se <- max_me/cv
n <- max_p_hat * (1 - max_p_hat)/se^2
ceiling(n)## [1] 505
- Based on this confidence interval, there is sufficient evidence to conclude that a majority of California voters between the ages of 18 and 34 support Prop 19. True ####6.11 Study abroad.
A survey on 1,509 high school seniors who took the SAT and who completed an optional web survey between April 25 and April 30, 2007 shows that 55% of high school seniors are fairly certain that they will participate in a study abroad program in college.
Is this sample a representative sample from the population of all high school seniors in the US? Explain your reasoning. No
Let’s suppose the conditions for inference are met. Even if your answer to part (a) indicated that this approach would not be reliable, this analysis may still be interesting to carry out (though not report). Construct a 90% confidence interval for the proportion of high school seniors (of those who took the SAT) who are fairly certain they will participate in a study abroad program in college, and interpret this interval in context.
n <- 1509
p_hat <- 0.55
cL <- 0.9
cv <- abs(qnorm((1 - cL)/2))
se <- sqrt(p_hat * (1 - p_hat)/n)
me <- se * cv
p_hat + me## [1] 0.5710654
p_hat - me## [1] 0.5289346
What does “90% confidence” mean? We are 90% confident that the population proportion is within the confidence interval.
Based on this interval, would it be appropriate to claim that the majority of high school seniors are fairly certain that they will participate in a study abroad program in college?
If we assume the conditions for inference are met, yes.
6.19 College smokers.
We are interested in estimating the proportion of students at a university who smoke. Out of a random sample of 200 students from this university, 40 students smoke.
- Calculate a 95% confidence interval for the proportion of students at this university who smoke, and interpret this interval in context. (Reminder: Check conditions.)
# givens
n <- 200
p_hat <- 0.2
cL <- 0.95
cv <- abs(qnorm((1 - cL)/2))
# calculations
se <- sqrt(p_hat * (1 - p_hat)/n)
me <- se * cv
# CI values
p_hat + me## [1] 0.2554362
p_hat - me## [1] 0.1445638
- If we wanted the margin of error to be no larger than 2% at a 95% confidence level for the proportion of students who smoke, how big of a sample would we need?
max_me <- 0.02
cL <- 0.95
cv <- abs(qnorm((1 - cL)/2))
max_p_hat <- 0.2
se <- max_me/cv
n <- max_p_hat * (1 - max_p_hat)/se^2
ceiling(n)## [1] 1537
6.27 Public Option, Part III.
Exercise 6.13 presents the results of a poll evaluating support for the health care public option plan in 2009. 70% of 819 Democrats and 42% of 783 Independents support the public option.
- Calculate a 95% confidence interval for the difference between (pD pI ) and interpret it in this context. We have already checked conditions for you.
# givens
n1 <- 819
n2 <- 783
p_hat1 <- 0.7
p_hat2 <- 0.42
cL <- 0.95
cv <- abs(qnorm((1 - cL)/2))
# calculations
p_hat_diff <- p_hat1 - p_hat2
se <- sqrt(p_hat1 * (1 - p_hat1)/n1 + p_hat2 * (1 - p_hat2)/n2)
me <- se * cv
# CI values
p_hat_diff## [1] 0.28
c(p_hat_diff - me, p_hat_diff + me)## [1] 0.2333084 0.3266916
- True or false: If we had picked a random Democrat and a random Independent at the time of this poll, it is more likely that the Democrat would support the public option than the Independent. ####6.43 Rock-paper-scissors.
Rock-paper-scissors is a hand game played by two or more people where players choose to sign either rock, paper, or scissors with their hands. For your statistics class project, you want to evaluate whether players choose between these three options randomly, or if certain options are favored above others. You ask two friends to play rock-paper-scissors and count the times each option is played. The following table summarizes the data:
Rock Paper Scissors 43 21 35
Use these data to evaluate whether players choose between these three options randomly, or if certain options are favored above others. Make sure to clearly outline each step of your analysis, and interpret your results in context of the data and the research question.
\(H_{0}: \mu_{diff}=0\) \(H_{A}: \mu_{diff}\neq0\)
pe <- c(43, 21, 35)
pe_len <- length(pe)
p <- c(rep(sum(pe)/pe_len, pe_len))
my_df <- data.frame(pe, p, z = (pe - p)/sqrt(p))
sum(my_df$z^2)## [1] 7.515152
chisq.test(pe)##
## Chi-squared test for given probabilities
##
## data: pe
## X-squared = 7.5152, df = 2, p-value = 0.02334
6.47 Offshore drilling, Part III.
The table below summarizes a data set we first encountered in Exercise 6.29 that examines the responses of a random sample of college graduates and nongraduates on the topic of oil drilling. Complete a chi-square test for these data to check whether there is a statistically significant difference in responses from college graduates and non-graduates.
values <- c(154, 132, 180, 126, 104, 131)
rowdims <- c("Support", "Oppose", "Do not know")
coldims <- c("Grad", "nGrad")
mdims <- list(rowdims, coldims)
my_matrix <- matrix(values, 3, 2, byrow = TRUE, dimnames = mdims)
test <- chisq.test(my_matrix)
test##
## Pearson's Chi-squared test
##
## data: my_matrix
## X-squared = 11.461, df = 2, p-value = 0.003246
test$expected## Grad nGrad
## Support 151.4728 134.5272
## Oppose 162.0653 143.9347
## Do not know 124.4619 110.5381