Solution to Sample Stat 239 Final Exam

Question 1: One-Sample z-Confidence Interval for a Population Proportion (8 points)

You are investigating the proportion of customers satisfied with a product. From a sample of 100 customers, 70 express satisfaction. Calculate a 90% confidence interval for the true proportion of satisfied customers.

prop.test(n = 100, x = 70, conf.level = 0.90)

## 
##  1-sample proportions test with continuity correction
## 
## data:  70 out of 100, null probability 0.5
## X-squared = 15.21, df = 1, p-value = 9.619e-05
## alternative hypothesis: true p is not equal to 0.5
## 90 percent confidence interval:
##  0.6149607 0.7738142
## sample estimates:
##   p 
## 0.7

A 90% confidence interval is 0.615 to 0.774, meaning that you are 90% confident that at least 61.5% and at most 77.4% of customers are satisfied with the product.
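The interval above comes from prop.test, which inverts a score test (with continuity correction) rather than using the textbook z formula named in the question title. As a minimal sketch, assuming the large-sample normal approximation is acceptable here, the z (Wald) interval can be computed by hand; the variable names are illustrative only.

```r
# Textbook (Wald) z-interval for a proportion, computed by hand.
x <- 70             # satisfied customers
n <- 100            # sample size
conf_level <- 0.90

p_hat <- x / n
z_star <- qnorm(1 - (1 - conf_level) / 2)   # critical value, about 1.645
se <- sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error of p_hat
wald_ci <- c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
round(wald_ci, 3)
```

This gives roughly 0.625 to 0.775, close to but not identical with the prop.test interval, because the two methods are built differently.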

Question 2: Two-Sample z-Confidence Interval for the Difference Between Two Population Proportions (8 points)

In a study comparing the effectiveness of two advertising strategies, you collect individual responses from 80 people exposed to Strategy A and 90 people exposed to Strategy B. Among those exposed to Strategy A, 30 purchase the product, while among those exposed to Strategy B, 45 purchase the product. Test whether there is a significant difference in the proportion of people making a purchase between the two strategies.

n1 = 80
x1 = 30
n2 = 90
x2 = 45
prop.test(n = c(n1, n2), x = c(x1, x2), conf.level = 0.95)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 2.2011, df = 1, p-value = 0.1379
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.28487646  0.03487646
## sample estimates:
## prop 1 prop 2 
##  0.375  0.500

The 95% confidence interval for the difference in the proportion of people making a purchase between Strategy A and Strategy B (A - B) is -0.285 to 0.035.

Since 0 is in the interval, there is no evidence of a difference between the two strategies. The p-value of 0.1379 (larger than commonly used significance levels such as 0.05) also suggests that there is no significant difference between the two strategies.
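For comparison with the textbook two-proportion z-test, here is a minimal sketch of the pooled z statistic computed by hand. It assumes no continuity correction, so its p-value differs slightly from the corrected prop.test output above; the object names are illustrative.

```r
# Two-proportion z-test by hand (pooled standard error, no continuity correction).
x1 <- 30; n1 <- 80    # Strategy A
x2 <- 45; n2 <- 90    # Strategy B

p1 <- x1 / n1
p2 <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)                          # pooled proportion under H0
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z <- (p1 - p2) / se_pool
p_value <- 2 * pnorm(-abs(z))                            # two-sided p-value

# z^2 should agree with X-squared from prop.test when correct = FALSE
uncorrected <- prop.test(x = c(x1, x2), n = c(n1, n2), correct = FALSE)
c(z = z, z_squared = z^2, p_value = p_value)
```

The conclusion is unchanged: the uncorrected p-value is still above 0.05.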

Question 3: One-Sample z-test for a Population Proportion (8 points)

Suppose a pharmaceutical company has developed a new drug that they claim is effective in treating a particular condition. They claim that more than 70% of patients who take the drug will experience an improvement. To test this claim, they take a sample of 200 patients and find that 150 of them show improvement. Show the test details.

no_successes = 150   # Number of successes (patients who show improvement)
sample_size = 200    # Sample size

# Null hypothesis
p0 = 0.70        # Null hypothesis H0: p = 0.70

# Perform the one-sample proportion test with alternative hypothesis Ha: p > 0.70
prop.test(n = sample_size, x = no_successes, p = p0, alternative = "greater")

## 
##  1-sample proportions test with continuity correction
## 
## data:  no_successes out of sample_size, null probability p0
## X-squared = 2.1488, df = 1, p-value = 0.07134
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
##  0.6938964 1.0000000
## sample estimates:
##    p 
## 0.75

# Don't reject H0

The p-value (0.07134) is larger than 0.05, so at the 0.05 significance level the data do not provide significant evidence that more than 70% of patients who take the drug would experience an improvement.
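The z statistic behind this test can be sketched by hand as follows. This assumes the uncorrected normal approximation, so the p-value is a little smaller than the continuity-corrected one reported by prop.test; the names are illustrative.

```r
# One-sample z-test for a proportion by hand (H0: p = 0.70, Ha: p > 0.70).
x <- 150
n <- 200
p0 <- 0.70

p_hat <- x / n
se0 <- sqrt(p0 * (1 - p0) / n)              # standard error computed under H0
z <- (p_hat - p0) / se0
p_value <- pnorm(z, lower.tail = FALSE)     # upper-tail p-value

# Cross-check against prop.test without continuity correction
uncorrected <- prop.test(x = x, n = n, p = p0,
                         alternative = "greater", correct = FALSE)
c(z = z, p_value = p_value)
```

Either way the p-value stays above 0.05, so the decision not to reject H0 does not depend on the correction.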

Question 4: Two-Sample z-test for the Difference Between Two Population Proportions (8 points)

A company has two sales teams, Team A and Team B, and they want to know if there is a significant difference in the success rates of closing deals between the two teams. Team A had 100 sales attempts with 60 successes, while Team B had 120 sales attempts with 72 successes. Show the test details.

# Sample data: Numbers of successes and sample sizes
no_successes_1 = 60   # Number of successes in sample 1
sample_size_1 = 100          # Sample size in sample 1

no_successes_2 = 72   # Number of successes in sample 2
sample_size_2 = 120          # Sample size in sample 2

# Perform the 2-sample proportion test with alternative hypothesis Ha: p1 not equal to p2
prop.test(n = c(sample_size_1, sample_size_2), x = c(no_successes_1, no_successes_2), alternative = "two.sided")

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(no_successes_1, no_successes_2) out of c(sample_size_1, sample_size_2)
## X-squared = 0, df = 1, p-value = 1
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1300093  0.1300093
## sample estimates:
## prop 1 prop 2 
##    0.6    0.6

# Don't reject H0

The p-value (1) indicates no evidence of a difference in the success rates of closing deals between the two teams; the two sample proportions are identical (0.6).

Question 5: One-Sample t-Confidence Interval for a Population Mean (8 points)

You want to estimate the average time students spend commuting to campus. From a random sample of 9 students, you collect the following data on their daily commuting times (in minutes):

[20, 25, 22, 18, 24, 21, 23, 19, 20].

Calculate a 95% confidence interval for the true mean commuting time.

time = c(20, 25, 22, 18, 24, 21, 23, 19, 20)
t.test(time, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  time
## t = 27.29, df = 8, p-value = 3.504e-09
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.53065 23.13602
## sample estimates:
## mean of x 
##  21.33333

95% Confidence Interval is 19.53 to 23.14 minutes.
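As a check on where these endpoints come from, the interval can be rebuilt from the formula mean ± t* × s/√n. This is a minimal sketch with illustrative variable names.

```r
# One-sample t-interval by hand: mean +/- t* x s / sqrt(n)
time <- c(20, 25, 22, 18, 24, 21, 23, 19, 20)
n <- length(time)

t_star <- qt(0.975, df = n - 1)          # critical value for 95% confidence
margin <- t_star * sd(time) / sqrt(n)    # margin of error
ci <- mean(time) + c(-1, 1) * margin
round(ci, 2)
```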

Question 6: Two-Sample t-Confidence Interval for the Difference Between Two Population Means (8 points)

In a study comparing the effectiveness of two diets on weight loss, you collect individual data on weight loss for each participant.

Diet X group: [2, 3, 1, 2, 1, 3, 2, 4, 1]

Diet Y group: [3, 2, 4, 1, 3, 2, 4, 2, 3]

Test whether there is a significant difference in the mean weight loss between the two diets.

x = c(2, 3, 1, 2, 1, 3, 2, 4, 1)
y = c(3, 2, 4, 1, 3, 2, 4, 2, 3)

t.test(x, y, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -1.1471, df = 15.956, p-value = 0.2683
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.5825037  0.4713926
## sample estimates:
## mean of x mean of y 
##  2.111111  2.666667

There is no significant difference (p-value = 0.2683) in the mean weight loss between the two diets.

Also, 0 is in the 95% confidence interval, which shows no evidence of a difference in the mean weight loss between the two diets.
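The Welch statistic and its fractional degrees of freedom reported above can be reproduced by hand. The following is a minimal sketch (names are illustrative) using the Welch-Satterthwaite approximation for the degrees of freedom.

```r
# Welch two-sample t-test by hand (unequal variances).
x <- c(2, 3, 1, 2, 1, 3, 2, 4, 1)   # Diet X
y <- c(3, 2, 4, 1, 3, 2, 4, 2, 3)   # Diet Y

vx <- var(x) / length(x)            # squared standard error of mean(x)
vy <- var(y) / length(y)            # squared standard error of mean(y)
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom
df_welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
c(t = t_stat, df = df_welch, p_value = p_value)
```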

Question 7: One-Sample t-Test on a Population Mean (8 points)

A company claims that the average response time for their customer service is less than 5 minutes. You collect individual data from a sample of 10 customer service responses, and the response times are:

[4, 5, 6, 4, 5, 6, 4, 5, 6, 4].

Test the company’s claim at a 5% significance level.

x = c(4, 5, 6, 4, 5, 6, 4, 5, 6, 4)
hypothesized_mean = 5
# Perform a one-sample t-test with alternative hypothesis Ha: mu < 5
t.test(x, mu = hypothesized_mean, alternative = "less")

## 
##  One Sample t-test
## 
## data:  x
## t = -0.36116, df = 9, p-value = 0.3632
## alternative hypothesis: true mean is less than 5
## 95 percent confidence interval:
##      -Inf 5.407566
## sample estimates:
## mean of x 
##       4.9

# Don't reject H0

The p-value (0.3632) suggests there is no evidence that the average response time for their customer service is less than 5 minutes.
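The same decision can be reached with a rejection region instead of a p-value. This is a minimal sketch with illustrative names: for a lower-tailed test at the 5% level, H0 is rejected only if the t statistic falls below the lower 5% critical value.

```r
# One-sample t-test by hand using the critical-value approach (Ha: mu < 5).
x <- c(4, 5, 6, 4, 5, 6, 4, 5, 6, 4)
mu0 <- 5
alpha <- 0.05
n <- length(x)

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
t_crit <- qt(alpha, df = n - 1)      # lower-tail critical value
reject_h0 <- t_stat < t_crit         # TRUE would mean "reject H0"
c(t = t_stat, critical = t_crit)
reject_h0
```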

Question 8: Two-Sample t-Test on the difference between Two Population Means (8 points)

Consider a hypothetical biological study where researchers are investigating the effect of a new drug on the average lifespan of two different species of laboratory mice. The study involves two independent groups: one group treated with the new drug (Group A) and another group receiving a placebo (Group B).

Group A (Drug Treatment):

Lifespans of mice: 800, 820, 810, 825, 830

Group B (Placebo):

Lifespans of mice: 790, 805, 800, 795, 810

Test whether there is a significant increase in the average lifespan of mice in the treatment group compared to the placebo group.

group1 = c(800, 820, 810, 825, 830)
group2 = c(790, 805, 800, 795, 810)

# Perform a two-sample t-test with alternative hypothesis Ha: mu1 > mu2
t.test(group1, group2, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = 2.6389, df = 6.908, p-value = 0.01694
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  4.770531      Inf
## sample estimates:
## mean of x mean of y 
##       817       800

The p-value (0.0169) suggests that there is a significant increase in the average lifespan of mice in the treatment group compared to the placebo group.

Question 9: Paired Two-Sample t Confidence Interval and t Test (8 points)

Let’s say we are investigating the effectiveness of a training program designed to improve exam scores. We have a group of 10 students, and we measure their scores before and after the training. The data are

before_training: 75, 68, 82, 90, 78, 65, 88, 72, 95, 80

after_training: 82, 75, 88, 92, 85, 70, 92, 78, 98, 86

Test whether there is a significant difference in scores before and after training.
Give a 95% confidence interval for the difference in scores before and after training (after - before).

before_training = c(75, 68, 82, 90, 78, 65, 88, 72, 95, 80)
after_training = c(82, 75, 88, 92, 85, 70, 92, 78, 98, 86)
d = after_training - before_training
t.test(d, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  d
## t = 9.4851, df = 9, p-value = 5.546e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  4.035978 6.564022
## sample estimates:
## mean of x 
##       5.3

The p-value (5.546e-06, almost 0) suggests that there is a significant difference in scores before and after training.

The 95% confidence interval for the difference in scores (after - before) is 4.04 to 6.56.
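Running a one-sample t-test on the differences is equivalent to a paired t-test on the two vectors. A minimal sketch showing both calls side by side (object names are illustrative):

```r
# Paired t-test two ways: on the differences, and with paired = TRUE.
before_training <- c(75, 68, 82, 90, 78, 65, 88, 72, 95, 80)
after_training  <- c(82, 75, 88, 92, 85, 70, 92, 78, 98, 86)

diff_fit   <- t.test(after_training - before_training, conf.level = 0.95)
paired_fit <- t.test(after_training, before_training,
                     paired = TRUE, conf.level = 0.95)

# Both approaches give the same statistic and interval
rbind(differences = as.numeric(diff_fit$conf.int),
      paired      = as.numeric(paired_fit$conf.int))
```

The order of the arguments in the paired call matters: putting after_training first keeps the difference as after - before.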

Question 10: Chi-Square Test for Independence (8 points)

You are analyzing survey data to examine whether there is an association between occupation and preferred mode of transportation (Car, Public Transit, Bicycle). You collect data from 170 individuals, and the observed frequencies are as follows:

Office Workers: Car - 50, Public Transit - 30, Bicycle - 10
Students: Car - 20, Public Transit - 40, Bicycle - 20

Conduct a chi-square test for independence at a 5% significance level.

# Make a data matrix of 2 rows (occupations) and 3 columns (transportation mode)
chisq.test(x=matrix(c(50, 20, 30, 40, 10, 20), 2, 3))

## 
##  Pearson's Chi-squared test
## 
## data:  matrix(c(50, 20, 30, 40, 10, 20), 2, 3)
## X-squared = 17.09, df = 2, p-value = 0.0001945

The small p-value (0.00019) is below 0.05 and provides strong evidence of an association between occupation and preferred mode of transportation.
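The chi-square approximation is usually considered reasonable when all expected counts are at least 5. A minimal sketch of how to inspect them follows; the dimnames labels are added for readability and are not part of the original call.

```r
# Observed counts with labels, then expected counts under independence.
observed <- matrix(c(50, 20, 30, 40, 10, 20), nrow = 2, ncol = 3,
                   dimnames = list(Occupation = c("Office Workers", "Students"),
                                   Transport = c("Car", "Public Transit", "Bicycle")))

result <- chisq.test(observed)

# Expected count = (row total x column total) / grand total
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(result$expected, 2)
all(result$expected >= 5)    # rule-of-thumb check for the approximation
```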

Question 11: Chi-Square Goodness of Fit (8 points)

Regarding favorite ice cream flavors, you expect 50% of all individuals like vanilla, 30% like chocolate, and 20% like strawberry (the instructor added this missing info). You collect data on favorite ice cream flavors in a group of 60 individuals. The observed frequencies are as follows:

Vanilla: 25
Chocolate: 20
Strawberry: 15

Conduct a chi-square goodness-of-fit test to determine if the observed distribution of ice cream flavors matches the expected distribution.

chisq.test(x = c(25, 20, 15), p = c(0.50, 0.30, 0.20))

## 
##  Chi-squared test for given probabilities
## 
## data:  c(25, 20, 15)
## X-squared = 1.8056, df = 2, p-value = 0.4054

The large p-value (0.4054) suggests that there is no evidence that the observed distribution of ice cream flavors differs from the expected distribution; the data are consistent with the 50%/30%/20% split.
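The statistic can be rebuilt from the expected counts, which makes the comparison explicit. This is a minimal sketch with illustrative names.

```r
# Chi-square goodness-of-fit by hand.
observed <- c(Vanilla = 25, Chocolate = 20, Strawberry = 15)
p_expected <- c(0.50, 0.30, 0.20)

expected <- sum(observed) * p_expected                 # 30, 18, 12
chi_sq <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)
c(chi_sq = chi_sq, p_value = p_value)
```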

Question 12: Simple Linear Regression (8 points)

You are conducting a study to investigate the relationship between the number of hours individuals spend jogging per week and their cardiovascular fitness levels, measured in terms of the maximum oxygen consumption (VO2 max). Collect data from 8 participants and record both the weekly jogging hours and their corresponding VO2 max levels.

Participant 1: Jogging Hours - 3, VO2 max - 40
Participant 2: Jogging Hours - 4, VO2 max - 45
Participant 3: Jogging Hours - 2, VO2 max - 38
Participant 4: Jogging Hours - 5, VO2 max - 50
Participant 5: Jogging Hours - 3, VO2 max - 42
Participant 6: Jogging Hours - 6, VO2 max - 55
Participant 7: Jogging Hours - 4, VO2 max - 48
Participant 8: Jogging Hours - 5, VO2 max - 52

Perform a simple linear regression analysis to predict cardiovascular fitness levels (VO2 max) based on the number of hours spent jogging per week. What is the regression equation? Is the slope significantly different from 0 at the 0.05 significance level?

jogging = c(3, 4, 2, 5, 3, 6, 4, 5)
vo2 = c(40, 45, 38, 50, 42, 55, 48, 52)
mydata = data.frame(V02_Max = vo2, Jogging = jogging)

model = lm(V02_Max ~ Jogging, data = mydata)
summary(model)

## 
## Call:
## lm(formula = V02_Max ~ Jogging, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.750 -0.875  0.000  0.875  1.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.2500     1.5975   17.68 2.10e-06 ***
## Jogging       4.5000     0.3819   11.78 2.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.323 on 6 degrees of freedom
## Multiple R-squared:  0.9586, Adjusted R-squared:  0.9517 
## F-statistic: 138.9 on 1 and 6 DF,  p-value: 2.256e-05

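The regression equation is V02_Max = 28.25 + 4.5 × Jogging: each additional hour of jogging per week is associated with an estimated 4.5-unit increase in VO2 max. The p-value for the slope (2.26e-05, basically 0) is far below 0.05, so the slope is significantly different from 0 at the 0.05 significance level.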
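The fitted coefficients can be pulled out of the model object and used for prediction. This is a minimal sketch; the lowercase variable names and the choice of 4 jogging hours are illustrative assumptions, not part of the exam question.

```r
# Extract the fitted line, a 95% CI for the slope, and one prediction.
jogging <- c(3, 4, 2, 5, 3, 6, 4, 5)
vo2 <- c(40, 45, 38, 50, 42, 55, 48, 52)

fit <- lm(vo2 ~ jogging)
coef(fit)                                           # intercept and slope
slope_ci <- confint(fit, "jogging", level = 0.95)   # CI for the slope
pred <- predict(fit, newdata = data.frame(jogging = 4))
slope_ci
pred
```

An interval for the slope that excludes 0 agrees with the significance test on the slope.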

Question 13: ANOVA (4 points)

You want to test whether there is a significant difference in test scores among three different teaching methods. Collect data from 3 groups, with 10 students in each group:

Method A: 75, 78, 80, 82, 85, 88, 90, 92, 95, 98
Method B: 72, 75, 78, 80, 82, 85, 88, 90, 92, 95
Method C: 70, 72, 75, 78, 80, 82, 85, 88, 90, 92

Perform an ANOVA to test for a significant difference in test scores among the three teaching methods.

a = c(75, 78, 80, 82, 85, 88, 90, 92, 95, 98)
b = c(72, 75, 78, 80, 82, 85, 88, 90, 92, 95)
c = c(70, 72, 75, 78, 80, 82, 85, 88, 90, 92)
teach = data.frame(Method = rep(c('A', 'B', 'C'), each = 10), 
                   Scores = c(a, b, c))

anova_result = aov(Scores~Method, data = teach)
summary(anova_result)

##             Df Sum Sq Mean Sq F value Pr(>F)
## Method       2  130.1   65.03   1.132  0.337
## Residuals   27 1551.8   57.47

The p-value (0.337) suggests that there is no evidence of a difference in mean test scores among the three teaching methods.

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Scores ~ Method, data = teach)
## 
## $Method
##     diff       lwr      upr     p adj
## B-A -2.6 -11.00622 5.806219 0.7261563
## C-A -5.1 -13.50622 3.306219 0.3048325
## C-B -2.5 -10.90622 5.906219 0.7436839

None of the adjusted p-values for the pairwise comparisons is small, so the post-hoc test confirms the conclusion just made.
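The diff column in the Tukey table is just the difference between group means. A minimal sketch to confirm this (the name c_scores is an illustrative substitute for the vector called c above):

```r
# Group means and their pairwise differences.
a <- c(75, 78, 80, 82, 85, 88, 90, 92, 95, 98)
b <- c(72, 75, 78, 80, 82, 85, 88, 90, 92, 95)
c_scores <- c(70, 72, 75, 78, 80, 82, 85, 88, 90, 92)

teach <- data.frame(Method = rep(c("A", "B", "C"), each = 10),
                    Scores = c(a, b, c_scores))

group_means <- tapply(teach$Scores, teach$Method, mean)
group_means
c("B-A" = group_means[["B"]] - group_means[["A"]],
  "C-A" = group_means[["C"]] - group_means[["A"]],
  "C-B" = group_means[["C"]] - group_means[["B"]])
```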