Data 606 Final Exam

December 17th, 2018

3. The time taken to complete a statistics final by all students is normally distributed with a mean of 120 minutes and a standard deviation of 10 minutes.

a. Find the probability that a randomly selected student will take more than 150 minutes to complete the test.

pnorm(150, 120, 10, lower.tail = FALSE)

## [1] 0.001349898

b. Find the probability that the mean time taken to complete the test by a random sample of 16 students would be between 122 and 126 minutes.

#Standard error of the mean for a sample of 16 is:
10/sqrt(16)

## [1] 2.5

#Answer
pnorm(126, 120, 2.5, lower.tail = TRUE) - pnorm(122, 120, 2.5, lower.tail = TRUE)

## [1] 0.2036579
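The same probability can be cross-checked by standardizing the bounds first. This is a minimal sketch; `se`, `z_lo`, `z_hi`, and `p_between` are names introduced here for illustration only.

```r
# Standard error of the sample mean for n = 16
se <- 10 / sqrt(16)

# Convert the bounds 122 and 126 to z-scores
z_lo <- (122 - 120) / se
z_hi <- (126 - 120) / se

# Area under the standard normal curve between the two z-scores
p_between <- pnorm(z_hi) - pnorm(z_lo)
p_between
```

Working on the z scale makes it easy to see that the interval runs from 0.8 to 2.4 standard errors above the mean.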

4. Rh-negative blood appears in 15% of the United States population.

a. Find the probability that out of 7 randomly selected U.S. residents at least 3 of them have Rh-negative blood.

pbinom(2, 7, 0.15, lower.tail = FALSE)

## [1] 0.07376516

b. Use the normal approximation to find the probability that in a group of 100 randomly selected people fewer than 17.5% will have Rh-negative blood.

Mean is n*p

Standard deviation = sqrt(n * p * (1-p))

mu<-100*0.15
st.dev.<-sqrt(100*0.15*0.85)

pnorm(17.5, mu, st.dev., lower.tail = TRUE)

## [1] 0.7580801
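As a sanity check on the approximation, it can be compared with the exact binomial probability. In a group of 100, "fewer than 17.5%" means at most 17 people. The sketch below assumes that reading; the variable names are introduced here for illustration.

```r
n <- 100
p <- 0.15

# Normal approximation, as in the solution above
approx_p <- pnorm(17.5, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Exact binomial probability of at most 17 Rh-negative people
exact_p <- pbinom(17, size = n, prob = p)

c(normal_approx = approx_p, exact_binomial = exact_p)
```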

5.

a. The U.S. Travel Industry estimated that Americans planned to spend an average of 4.8 nights away on vacations in 1995 (U.S. News & World Report, June 12, 1995). Suppose that this mean was based on a random sample of 100 Americans and the population standard deviation was 1.5 nights. Construct a 90% confidence interval for the mean length of vacations Americans planned in 1995.

z<-qnorm(0.05, lower.tail=FALSE)

4.8 - z * 1.5 / sqrt(100)

## [1] 4.553272

4.8 + z * 1.5 / sqrt(100)

## [1] 5.046728

\(\color{red}{\text{Answer:}}\) (4.55, 5.05)
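The interval calculation can be wrapped in a small helper so that the confidence level is easy to change. This is only a sketch: `z_ci_mean` is a hypothetical function name, not part of base R, and it assumes the population standard deviation is known.

```r
# z-based confidence interval for a mean with known population sd
z_ci_mean <- function(xbar, sigma, n, conf = 0.90) {
  z <- qnorm((1 - conf) / 2, lower.tail = FALSE)  # critical value
  margin <- z * sigma / sqrt(n)                   # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

ci <- z_ci_mean(xbar = 4.8, sigma = 1.5, n = 100, conf = 0.90)
round(ci, 2)
```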

b. A poll of 1226 adults revealed that 49% believe that the devil may sometimes possess earthlings. Find a 95% confidence interval for the population proportion of the adults who hold this opinion. (Source: “Demons Begone,” Asheville Citizen-Times, April 5, 1991.)

z<-qnorm(0.025, lower.tail=FALSE)
z

## [1] 1.959964

popprop<-0.49
n<-1226

SE<-sqrt(popprop*(1-popprop)/n)
popprop - z*SE

## [1] 0.4620175

popprop + z*SE

## [1] 0.5179825

\(\color{red}{\text{Answer:}}\) (46.2%, 51.8%)

6. Grocery stores, drugstores, and large supermarkets all use scanners to calculate a customer’s bill. Scanners should be as accurate as possible. A state agency regularly monitors stores by randomly selecting items and comparing the shelf price with the checkout scanner price. During one check by the agency, 16 items were found to be incorrectly scanned. The amounts of overcharge (in cents) were

200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40

A negative sign indicates an undercharge - the scanner price was below the shelf price.

a. Make a stemplot of the data and interpret it.

a<-c(200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40)
stem(a)

## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   -1 | 20
##   -0 | 765
##    0 | 2334455
##    1 | 00
##    2 | 0
##    3 | 0

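\(\color{red}{\text{Answer:}}\) The distribution of the data appears to be close enough to normal, with most values between -70 and 100 and a somewhat longer right tail. The minimum value is -120, and the maximum value is 300. No value stands out as an extreme outlier in the stemplot alone, although 300 sits well above the rest.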

b. Compute the mean and the range.

#Mean:
mean(a)

## [1] 35.0625

#Range:
max(a)-min(a)

## [1] 420

c. Give the five-number summary of the data.

quantile(a)

##     0%    25%    50%    75%   100% 
## -120.0  -52.5   35.0   62.5  300.0

d. Construct a boxplot and interpret.

boxplot(a)

\(\color{red}{\text{Answer:}}\) The data appear nearly normal. The middle half of the values lies between -52.5 and 62.5, so the spread is moderate (IQR = 115), and 300 appears to be an outlier.

e. Use the 1.5xIQR criterion to spot suspected outliers.

quantile(a)

##     0%    25%    50%    75%   100% 
## -120.0  -52.5   35.0   62.5  300.0

#IQR
iqr<-IQR(a)

#Q1 - 1.5*IQR
-52.5-1.5*iqr

## [1] -225

#Q3 + 1.5*IQR
62.5+1.5*iqr

## [1] 235

\(\color{red}{\text{Answer:}}\) 300 is an outlier.
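The fences can also be computed and applied directly to the data rather than typed in by hand. A minimal sketch; `q`, `fence_lo`, `fence_hi`, and `outliers` are names introduced here.

```r
a <- c(200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40)

# First and third quartiles (R's default quantile method, as used above)
q <- quantile(a, c(0.25, 0.75))

# 1.5 x IQR fences
fence_lo <- q[[1]] - 1.5 * IQR(a)
fence_hi <- q[[2]] + 1.5 * IQR(a)

# Values outside the fences are suspected outliers
outliers <- a[a < fence_lo | a > fence_hi]
outliers
```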

f. For these data the sample standard deviation is 1.083 (in dollars, that is, 108.3 cents). Test the hypothesis that the mean overcharge is more than 0 at the 0.05 significance level.

The sample is random and our sample is less than 10% of the total scans. The skew is not very strong.

H0: The mean overcharge = 0

HA: The mean overcharge > 0

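#The data are in cents, so the stated standard deviation of 1.083 (dollars) is 108.3 cents
#With n = 16 and a sample standard deviation, we use a t statistic with 15 degrees of freedom
t_stat<-(35.0625-0)/(108.3/sqrt(16))

round(pt(t_stat, df = 15, lower.tail = FALSE), 2)

## [1] 0.11

Since the p-value of about 0.11 is greater than the 0.05 significance level, we fail to reject H0. The data do not provide convincing evidence that the average overcharge is greater than 0.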
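As a cross-check, the test can be run on the raw data (in cents) with R's built-in `t.test`, which computes the sample standard deviation itself and so avoids any mismatch between cents and dollars. A minimal sketch; `res` is a name introduced here.

```r
a <- c(200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40)

# Sample standard deviation in cents
sd(a)

# One-sided, one-sample t test of H0: mu = 0 against HA: mu > 0
res <- t.test(a, mu = 0, alternative = "greater")
res$statistic   # t statistic
res$parameter   # degrees of freedom (n - 1)
res$p.value     # one-sided p-value
```

Comparing `res$p.value` with 0.05 gives the decision without any manual unit conversion.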

Sorted data:

-120, -99, -70, -60, -50, 20, 30, 30, 40, 40, 50, 50, 100, 100, 200, 300

7. Do cars traveling in the right lane of I-94 travel slower than those in the left lane? The following sample information was obtained. Use the 0.01 significance level to provide an answer to this question.

#                              Right Lane    Left Lane
# Sample size                       5            6
# Sample mean                      65           69
# Sample standard deviation      4.12         3.22

H0: The right lane mean equals the left lane mean

HA: The right lane mean is slower than the left lane mean

We assume independence within groups: all sample observations are sampled randomly, and the observations we are looking at are less than 10% of the entire population of all cars. We also assume independence between groups, which is reasonable since the sample is random. We assume a normal distribution without strong skew.

#PointEstimate = MuLL - MuRL
PointEstimate<-69-65

#Standard Error
SE<-sqrt(((3.22^2)/6)+((4.12^2)/5))

T<-(PointEstimate-0)/SE
T

## [1] 1.767258

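#p-value: upper-tail area of the t distribution with min(5, 6) - 1 = 4 degrees of freedom
#(pt gives the tail probability; dt would give only the density)
round(pt(T, 4, lower.tail = FALSE), 3)

## [1] 0.076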

\(\color{red}{\text{Answer:}}\) Because our p-value is larger than 0.01, we do not reject our null hypothesis. The data do not provide convincing evidence that cars in the right lane travel slower than cars in the left lane.
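The one-sided p-value can be computed from the summary statistics alone. The sketch below does so with the conservative 4 degrees of freedom and also with the Welch-Satterthwaite approximation. The Welch degrees of freedom are an alternative shown for comparison, not the method used in the solution, and all variable names are introduced here for illustration.

```r
n_r <- 5;  s_r <- 4.12;  mean_r <- 65   # right lane
n_l <- 6;  s_l <- 3.22;  mean_l <- 69   # left lane

# Per-group variance of the sample mean
v_r <- s_r^2 / n_r
v_l <- s_l^2 / n_l

# Test statistic for H0: the means are equal
t_stat <- (mean_l - mean_r) / sqrt(v_r + v_l)

# Conservative degrees of freedom: smaller sample size minus 1
p_conservative <- pt(t_stat, df = min(n_r, n_l) - 1, lower.tail = FALSE)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_r + v_l)^2 / (v_r^2 / (n_r - 1) + v_l^2 / (n_l - 1))
p_welch <- pt(t_stat, df = df_welch, lower.tail = FALSE)

c(t = t_stat, df_welch = df_welch, p_conservative = p_conservative, p_welch = p_welch)
```

Both p-values can then be compared with the 0.01 significance level.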

8. A noted medical researcher has suggested that a heart attack is less likely to occur among adults who actively participate in athletics. A random sample of 300 adults is obtained. Of that total, 100 are found to be athletically active. Within this group, 10 suffered heart attacks; among the 200 athletically inactive adults, 25 had suffered heart attacks.

a. Test the hypothesis that the proportion of adults who are active and suffered heart attacks is different than the proportion of adults who are not active and suffered heart attacks. Use the 0.05 significance level.

Verifying Conditions: We assume that each group is a simple random sample from less than 10% of the population, so the observations are independent both within and between the samples. The success-failure condition also holds: each group has at least 10 successes and at least 10 failures (10 and 90 among the active adults, 25 and 175 among the inactive adults), so a normal model is reasonable for the difference in proportions.

Hypothesis Testing:

H0: There is no difference between the proportion of adults who are active and suffered heart attacks and the proportion of adults who are not active and suffered heart attacks.

HA: There is a difference between the proportion of adults who are active and suffered heart attacks and the proportion of adults who are not active and suffered heart attacks.

#Pooled Proportion
n1<-100
n2<-200
pp<-(10+25)/(n1+n2)

#Proportions of active adults who suffered heart attacks
p1<-10/n1
  
#Proportions of not active adults who suffered heart attacks

p2<-25/n2

#Point Estimate of difference
pe<-p1-p2

#Standard Error Calculation

SE<-sqrt(((pp*(1-pp))/n1)+((pp*(1-pp))/n2))
SE

## [1] 0.03931709

#Using the point estimate and standard error to calculate a p-value for the hypothesis test

Z<-(pe-0)/SE

#Two-sided p-value; -abs(Z) gives the correct tail whatever the sign of Z
pvalue<-2*pnorm(-abs(Z))
round(pvalue, 2)

## [1] 0.52

\(\color{red}{\text{Answer:}}\) Because the p-value is larger than 0.05, we cannot reject our null hypothesis. The data do not provide convincing evidence of a difference between the proportion of adults who are active and suffered heart attacks and the proportion of adults who are not active and suffered heart attacks.
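The pooled two-proportion z test can be cross-checked with R's built-in `prop.test`. With `correct = FALSE` (no continuity correction), its chi-square statistic equals the square of the pooled z statistic, so the two-sided p-values should agree. A minimal sketch; `res` and `z_from_chisq` are names introduced here.

```r
# Heart attacks (x) out of group sizes (n): active, then inactive adults
res <- prop.test(x = c(10, 25), n = c(100, 200), correct = FALSE)

res$estimate                               # the two sample proportions
z_from_chisq <- sqrt(unname(res$statistic)) # |Z| recovered from the chi-square statistic
z_from_chisq
res$p.value                                # two-sided p-value
```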

b. Construct a 99% confidence interval for the difference between the proportions of all active and inactive adults who suffered heart attacks.

#Recalculating the standard error using the sample proportions:

SE<-sqrt(((p1*(1-p1))/n1)+((p2*(1-p2))/n2))
SE

## [1] 0.03803781

#z for 99% confidence interval
z<-qnorm(0.005, lower.tail=FALSE)
z

## [1] 2.575829

#Confidence interval = pointestimate ± z*SE
round(pe-z*SE,2)

## [1] -0.12

round(pe+z*SE,2)

## [1] 0.07

\(\color{red}{\text{Answer:}}\) (-0.12,0.07)

9. Based on interviews of couples seeking divorces, a social worker compiles the following data related to the period of acquaintanceship before marriage and the duration of marriage.

#                                      Duration of Marriage
# Acquaint. bef. mar.       At most 4 years       More than 4 years     Total
# Under 0.5 years                  11                   8                 19
# 0.5-1.5 years                    28                  24                 52
# Over 1.5 years                   21                  19                 40
# Total                            60                  51                111

Perform a test to determine whether the data substantiate an association between the stability of a marriage and the period of acquaintanceship prior to marriage. Use α = 0.05.

H0: There is no difference in stability of marriage between the three groups

HA: There is some difference in stability of marriage between the three groups

A chi-square test for a two-way table may be used to test this hypothesis. There are two conditions that must be checked before performing a chi-square test:

Independence. Each case that contributes a count to the table must be independent of all the other cases in the table.
Sample size / distribution. Each particular scenario must have at least 5 expected cases.

Our data meets both of these conditions.

#As a first step, we compute the expected values for each of the six table cells.

EV11<-19*60/111
EV11

## [1] 10.27027

EV12<-19*51/111
EV12

## [1] 8.72973

EV21<-52*60/111
EV21

## [1] 28.10811

EV22<-52*51/111
EV22

## [1] 23.89189

EV31<-40*60/111
EV31

## [1] 21.62162

EV32<-40*51/111
EV32

## [1] 18.37838

#Based on expected values above we confirm that our data meets sample size/distribution condition.

#Compute the chi-square test statistic.
chisquare<-(((11-EV11)^2)/EV11)+(((8-EV12)^2)/EV12)+(((28-EV21)^2)/EV21)+(((24-EV22)^2)/EV22)+(((21-EV31)^2)/EV31)+(((19-EV32)^2)/EV32)
chisquare

## [1] 0.1526503

#Degrees of freedom: 
df<- (3-1)*(2-1)
df

## [1] 2

p_Val <- pchisq(chisquare, df, lower.tail = FALSE)
p_Val

## [1] 0.9265149

\(\color{red}{\text{Answer:}}\) We do not reject the null hypothesis because the p-value is greater than 0.05. The data do not provide convincing evidence of an association between the period of acquaintanceship before marriage and the stability of the marriage.
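The hand computation can be cross-checked with R's built-in `chisq.test`, which derives the expected counts from the row and column totals and so avoids typing each one by hand. A minimal sketch; `tab`, `res`, and the short dimension labels are introduced here for illustration.

```r
# Observed counts: rows are acquaintanceship periods, columns are marriage durations
tab <- matrix(c(11,  8,
                28, 24,
                21, 19),
              nrow = 3, byrow = TRUE,
              dimnames = list(acquaintance = c("Under 0.5", "0.5-1.5", "Over 1.5"),
                              duration = c("Shorter", "Longer")))

# Pearson chi-square test of independence
# (R applies a continuity correction only to 2 x 2 tables, so none is used here)
res <- chisq.test(tab)

res$expected    # expected counts under independence
res$statistic   # chi-square statistic
res$parameter   # degrees of freedom
res$p.value
```

Inspecting `res$expected` also confirms the condition that every cell has at least 5 expected cases.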