DATA 606 - Final Exam

3. The time taken to complete a statistics final by all students is normally distributed with a mean of 120 minutes and a standard deviation of 10 minutes.

a. Find the probability that a randomly selected student will take more than 150 minutes to complete the test.

Ans: The normal distribution can be represented by N(\(\mu\) = 120, \(\sigma\) = 10).

Z score = (150 - 120)/10 = 30/10 = 3

Based on the normal probability table, the cumulative probability for a Z score of 3.0 is 0.9987.

Probability of student taking more than 150 minutes = 1 - 0.9987 = 0.0013.
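The table lookup above can be cross-checked in R. This is a minimal sketch that evaluates the upper tail directly with `pnorm`; the variable name `p_over_150` is only illustrative.

```r
# P(X > 150) for X ~ N(mean = 120, sd = 10), computed without a table lookup
p_over_150 <- pnorm(150, mean = 120, sd = 10, lower.tail = FALSE)
p_over_150
```

The unrounded value is slightly above 0.0013 because the table rounds the cumulative probability to four decimals.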

b. Find the probability that the mean time taken to complete the test by a random sample of 16 students would be between 122 and 126 minutes.

Ans: Sample size, n = 16; standard error, SE = \(\sigma\)/\(\sqrt{n}\) = 10/4 = 2.5

x1 <- 122
x2 <- 126
Mean <- 120
SD <- 10
SE <- SD / sqrt(16)
Z1 <- (x1 - Mean) / SE
Z2 <- (x2 - Mean) / SE
p1 <- 1 - pnorm(Z1)
p2 <- 1 - pnorm(Z2)
p1 - p2

## [1] 0.2036579

The probability that the mean time taken to complete the test by a random sample of 16 students would be between 122 and 126 minutes is about 20.4%.
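The same probability can also be obtained without standardizing by hand, by passing the standard error to `pnorm` as the standard deviation of the sample mean. This is a sketch; the names `se` and `p_between` are illustrative.

```r
# Sampling distribution of the mean: N(120, 10 / sqrt(16))
se <- 10 / sqrt(16)

# P(122 < x_bar < 126) as a difference of two cumulative probabilities
p_between <- pnorm(126, mean = 120, sd = se) - pnorm(122, mean = 120, sd = se)
p_between
```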

4. Rh-negative blood appears in 15% of the United States population.

a. Find the probability that out of 7 randomly selected U.S. residents at least 3 of them have Rh-negative blood.

Ans: Define a success as a randomly selected person having Rh-negative blood. The number of successes in 7 independent trials follows a binomial distribution, so the probability of at least 3 successes is the complement of the summed probabilities of 0, 1 and 2 successes -

n = 7, k = 0 & p = 0.15

P0  = $\frac { n! }{ k!(n-k)! } { p }^{ k }{ (1-p) }^{ n-k }$
  = (0.15)^0*(0.85)^7
  = 0.3206

n = 7, k = 1 & p = 0.15

P1  = $\frac { n! }{ k!(n-k)! } { p }^{ k }{ (1-p) }^{ n-k }$
  = 7 * (0.15)^1*(0.85)^6
  = 0.396

n = 7, k = 2 & p = 0.15

P2  = $\frac { n! }{ k!(n-k)! } { p }^{ k }{ (1-p) }^{ n-k }$
  = 21 * (0.15)^2*(0.85)^5
  = 21 * 0.0225 * 0.4437053125
  = 0.2096

Probability that at least 3 people have Rh-negative blood = 1 - P0 - P1 - P2 = 1 - 0.3206 - 0.396 - 0.2096 = 0.0738
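The hand calculation can be checked with R's binomial functions. This sketch computes the upper tail in two equivalent ways; the variable names are illustrative.

```r
# P(X >= 3) = P(X > 2) for X ~ Binomial(n = 7, p = 0.15)
p_at_least_3 <- pbinom(2, size = 7, prob = 0.15, lower.tail = FALSE)

# Same quantity as the complement of P(X = 0) + P(X = 1) + P(X = 2)
p_check <- 1 - sum(dbinom(0:2, size = 7, prob = 0.15))

c(p_at_least_3, p_check)
```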

b. Use the normal approximation to find the probability that in a group of 100 randomly selected people fewer than 17.5% will have Rh-negative blood.

Ans: Fewer than 17.5% of 100 people means fewer than 17.5 people. Applying the normal approximation to the binomial distribution, I calculated \(\mu\) and \(\sigma\) -

\(\mu\) = n * p = 100 * 0.15 = 15

\(\sigma\) = \(\sqrt { n*p*(1-p) }\) = \(\sqrt {100*0.15*0.85}\) = 3.57

So we will use Normal Distribution using N(\(\mu\) = 15, \(\sigma\) = 3.57) and calculate Z score -

Z = (17.5 - 15)/3.57 = 0.7

From the normal probability table, the cumulative probability for Z = 0.70 is 0.7580. Hence there is about a 75.8% probability that fewer than 17.5% of 100 randomly selected people will have Rh-negative blood.
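As a check on the table lookup, the normal approximation can be evaluated directly with `pnorm`. The exact binomial probability is included only for comparison and is not part of the question; since the count is an integer, "fewer than 17.5" is taken here to mean at most 17 people. All variable names are illustrative.

```r
n <- 100
p <- 0.15
mu <- n * p                       # 15
sigma <- sqrt(n * p * (1 - p))    # about 3.57

# Normal approximation: P(X < 17.5)
p_normal <- pnorm(17.5, mean = mu, sd = sigma)

# Exact binomial comparison: P(X <= 17)
p_exact <- pbinom(17, size = n, prob = p)

c(normal = p_normal, exact = p_exact)
```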

a. The U.S. Travel Industry estimated that Americans planned to spend an average of 4.8 nights away on vacations in 1995 (U.S. News & World Report, June 12, 1995). Suppose that this mean was based on a random sample of 100 Americans and the population standard deviation was 1.5 nights. Construct a 90% confidence interval for the mean length of vacations Americans planned in 1995.

Ans: Here, sample size, n = 100; sample mean, \(\bar{x}\) = 4.8; population standard deviation, \(\sigma\) = 1.5

Standard Error, \({ \sigma }_{ \bar { x } }\) = \(\sigma/\sqrt{n}\) = 1.5/10 = 0.15

For constructing a 90% confidence interval, I will use a \({ z }^{ * }\) coefficient of 1.65.

Margin of Error = \({ z }^{ * }\) x \({ \sigma }_{ \bar { x } }\) = 1.65 * 0.15 = 0.2475

Hence the 90% confidence interval for the mean number of vacation nights = \(\bar{x}\) \(\pm\) 0.2475 = (4.8 - 0.2475, 4.8 + 0.2475) = (4.5525, 5.0475)
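The interval can be reproduced in R. This sketch uses `qnorm(0.95)` for the critical value instead of the rounded table value 1.65, so the limits differ slightly in the third decimal; the variable names are illustrative.

```r
n <- 100
x_bar <- 4.8     # sample mean
sigma <- 1.5     # population standard deviation

z_star <- qnorm(0.95)               # two-sided 90% critical value, about 1.645
moe <- z_star * sigma / sqrt(n)     # margin of error

ci <- c(lower = x_bar - moe, upper = x_bar + moe)
ci
```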

b. A poll of 1226 adults revealed that 49% believe that the devil may sometimes possess earthlings. Find a 95% confidence interval for the population proportion of the adults who hold this opinion. (Source: “Demons Begone,” Asheville Citizen-Times, April 5, 1991).

Ans:

Sample size, n = 1226; sample proportion, \(\hat{p}\) = 0.49; \({ z }^{ * }\) = 1.96

Applying principles of constructing 95% confidence interval for proportion -

n = 1226
p = 0.49
Z = 1.96
lower <- p - Z * sqrt((p*(1-p))/n)
upper <- p + Z * sqrt((p*(1-p))/n)
c(lower, upper)

## [1] 0.462017 0.517983

cat("95% confidence interval is (",lower, upper,")")

## 95% confidence interval is ( 0.462017 0.517983 )

6. Grocery stores, drugstores, and large supermarkets all use scanners to calculate a customer’s bill. Scanners should be as accurate as possible. A state agency regularly monitors stores by randomly selecting items and comparing the shelf price with the checkout scanner price. During one check by the agency, 16 items were found to be incorrectly scanned. The amounts of overcharge (in cents) were

200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40

A negative sign indicates an undercharge, that is, the scanner price was below the shelf price.

a. Make a stemplot of the data and interpret.

Ans:

data <- c(200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40)
stem(sort(data))

## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   -1 | 20
##   -0 | 765
##    0 | 2334455
##    1 | 00
##    2 | 0
##    3 | 0

b. Compute the mean and the range.

Ans:

mean(data)

## [1] 35.0625

range(data)

## [1] -120  300

c. Give the five-number summary of the data.

Ans:

summary(data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -120.00  -52.50   35.00   35.06   62.50  300.00

d. Construct a boxplot and interpret.

Ans:

boxplot(data)

The boxplot shows that the median is 35.

The upper and lower quartiles are approximately 60 and -50 respectively. This means that about 75% of the data lie below 60 and about 25% lie below -50.

The box itself spans the inter-quartile range (upper quartile - lower quartile).

The whiskers extend to the most extreme data points that lie within 1.5 times the inter-quartile range of the box. Any point beyond the whiskers is considered an outlier. The data have one outlier: 300.

e. Use the 1.5xIQR criterion to spot suspected outliers.

Ans:

# Q3 and Q1 were taken from the result of summary() function above
Q3 <- 62.50
Q1 <- -52.50
IQR <- Q3 - Q1
below <- Q1 - 1.5 * IQR
above <- Q3 + 1.5 * IQR
cat("Lower limit based on 1.5*IQR: ",below)

## Lower limit based on 1.5*IQR:  -225

cat("Upper limit based on 1.5*IQR: ",above)

## Upper limit based on 1.5*IQR:  235

All values below -225 or above 235 should be considered suspected outliers. In our case there is one outlier: 300.
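Rather than copying the quartiles from the `summary()` output, the fences can be computed from the data. This is a sketch using `quantile()` and `IQR()` with R's default quantile definition; the names `q`, `iqr`, `fences` and `outliers` are illustrative.

```r
data <- c(200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40)

# First and third quartiles (default type 7, the same definition summary() uses)
q <- quantile(data, probs = c(0.25, 0.75))
iqr <- IQR(data)

# 1.5 x IQR fences
fences <- c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)

# Values outside the fences are the suspected outliers
outliers <- data[data < fences[["lower"]] | data > fences[["upper"]]]
fences
outliers
```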

f. For this data sample standard deviation is 1.083. Test the hypothesis that the mean overcharge is more than 0 at 0.05 significance level.

Ans:

\({ H }_{ 0 }\): the mean overcharge is less than or equal to 0 (\(\mu \le 0\))

\({ H }_{ A }\): the mean overcharge is more than 0

t_test<-t.test(data, alternative = "greater")
t_test

## 
##  One Sample t-test
## 
## data:  data
## t = 1.295, df = 15, p-value = 0.1074
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  -12.40101       Inf
## sample estimates:
## mean of x 
##   35.0625

Since the p-value (0.1074) is greater than 0.05, we do not have enough evidence to reject the \({ H }_{ 0 }\) hypothesis in favor of \({ H }_{ A }\).
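The t statistic reported by `t.test` can be reproduced by hand, which makes the role of the sample standard deviation explicit. In this sketch `sd(data)` comes out at roughly 108.3 in cents; my assumption is that the 1.083 quoted in the question is the same quantity expressed in dollars. The variable names are illustrative.

```r
data <- c(200, -99, 100, -50, 40, -60, 20, 30, 50, 300, -120, 100, 50, 30, -70, 40)

n <- length(data)
s <- sd(data)                            # sample standard deviation, in cents

# One-sample t statistic against a hypothesized mean of 0
t_stat <- (mean(data) - 0) / (s / sqrt(n))

# Upper-tail p-value for the alternative "mean > 0"
p_val <- pt(t_stat, df = n - 1, lower.tail = FALSE)

c(sd = s, t = t_stat, p = p_val)
```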

Sorted data:

-120, -99, -70, -60, -50, 20, 30, 30, 40, 40, 50, 50, 100, 100, 200, 300

7. Do cars traveling in the right lane of I-94 travel slower than those in the left lane? The following sample information was obtained. Use the 0.01 significance level to provide an answer to this question.

Ans:

\({ H }_{ 0 }\): There is no difference in average speed of Cars traveling in the right and left lanes of I-94 (\({ \mu }_{ Right }\) = \({ \mu }_{ Left }\))

\({ H }_{ A }\): Cars traveling in the right lane of I-94 travel slower than those in the left lane (\({ \mu }_{ Right } < { \mu }_{ Left }\))

n_Right <- 5
mean_Right <- 65
SD_Right <- 4.12

n_Left <- 6
mean_Left <- 69
SD_Left <- 3.22

Mean_L_R <- mean_Left - mean_Right

SE_L_R <- sqrt(((SD_Right^2)/n_Right) + ((SD_Left^2)/n_Left))

## For calculating degrees of freedom, I chose  n_Right - 1, since n_Right < n_Left
df <- n_Right - 1

T_score <- (Mean_L_R - 0)/SE_L_R
T_score

## [1] 1.767258

p_Value <- pt(-abs(T_score),df)

p_Value

## [1] 0.07596231

Since the p-value is greater than 0.01, we do not have sufficient evidence to reject \({ H }_{ 0 }\) in favor of \({ H }_{ A }\).
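The choice of degrees of freedom above (smaller sample size minus 1) is a conservative rule. As a sketch of an alternative, the Welch-Satterthwaite approximation gives a larger, non-integer degrees of freedom from the same summary statistics. This is not the method used in the answer, and the names `df_welch` and `p_welch` are illustrative.

```r
n_Right <- 5; SD_Right <- 4.12; mean_Right <- 65
n_Left  <- 6; SD_Left  <- 3.22; mean_Left  <- 69

# Per-group variance of the sample mean
v_Right <- SD_Right^2 / n_Right
v_Left  <- SD_Left^2 / n_Left

t_score <- (mean_Left - mean_Right) / sqrt(v_Right + v_Left)

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v_Right + v_Left)^2 /
  (v_Right^2 / (n_Right - 1) + v_Left^2 / (n_Left - 1))

# One-sided p-value for H_A: mu_Right < mu_Left
p_welch <- pt(-abs(t_score), df_welch)

c(t = t_score, df = df_welch, p = p_welch)
```

With either choice of degrees of freedom the p-value stays above 0.01, so the conclusion does not change.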

8. A noted medical researcher has suggested that a heart attack is less likely to occur among adults who actively participate in athletics. A random sample of 300 adults is obtained. Of that total, 100 are found to be athletically active. Within this group, 10 suffered heart attacks; among the 200 athletically inactive adults, 25 had suffered heart attacks.

a. Test the hypothesis that the proportion of adults who are active and suffered heart attacks is different than the proportion of adults who are not active and suffered heart attacks. Use the 0.05 significance level.

Ans:

\({ H }_{ 0 }\): proportion of adults who are active and suffered heart attacks = the proportion of adults who are not active and suffered heart attacks

\({ H }_{ A }\): proportion of adults who are active and suffered heart attacks \(\neq\) the proportion of adults who are not active and suffered heart attacks

I performed a two-sided hypothesis test -

result <- prop.test(x = c(10, 25), n = c(100, 200),alternative = c("two.sided"))
result

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(10, 25) out of c(100, 200)
## X-squared = 0.19811, df = 1, p-value = 0.6562
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.10705274  0.05705274
## sample estimates:
## prop 1 prop 2 
##  0.100  0.125

Since the p-value (0.6562) is greater than 0.05, we do not have enough evidence to reject \({ H }_{ 0 }\) in favor of \({ H }_{ A }\).
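For comparison, the textbook pooled two-proportion z-test can be computed by hand. This sketch omits the continuity correction that `prop.test` applies by default, so its p-value differs from the one above; the squared z statistic matches `prop.test(..., correct = FALSE)`. Variable names are illustrative.

```r
x <- c(10, 25)     # heart attacks: active, inactive
n <- c(100, 200)   # group sizes

p_hat <- x / n
p_pool <- sum(x) / sum(n)                       # pooled proportion under H0

se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z <- (p_hat[1] - p_hat[2]) / se_pool

p_two_sided <- 2 * pnorm(-abs(z))

# Same test via prop.test without the continuity correction
uncorrected <- prop.test(x = x, n = n, correct = FALSE)

c(z = z, p = p_two_sided, X_squared = unname(uncorrected$statistic))
```

The conclusion is the same either way: the p-value is well above 0.05.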

b. Construct a 99% confidence interval for the difference between the proportions of all active and inactive adults who suffered heart attacks.

Ans:

result2<- prop.test(x = c(10, 25), n = c(100, 200),alternative = c("two.sided"),conf.level = 0.99)
result2

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(10, 25) out of c(100, 200)
## X-squared = 0.19811, df = 1, p-value = 0.6562
## alternative hypothesis: two.sided
## 99 percent confidence interval:
##  -0.13047891  0.08047891
## sample estimates:
## prop 1 prop 2 
##  0.100  0.125

The 99% confidence interval for the difference in the proportions of all active and inactive adults who suffered heart attacks is (-0.13047891, 0.08047891). Since the interval contains 0, it is consistent with no difference between the two groups.
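To show where these limits come from, the interval can be rebuilt by hand: an unpooled standard error, the 99% critical value, and the continuity correction that `prop.test` adds to the margin by default. This is a sketch with illustrative variable names; `ci_plain` is the interval without the correction.

```r
x <- c(10, 25)
n <- c(100, 200)

p_hat <- x / n
diff_hat <- p_hat[1] - p_hat[2]                 # -0.025

se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n))   # unpooled standard error
z_star <- qnorm(0.995)                          # two-sided 99% critical value

cc <- 0.5 * sum(1 / n)                          # continuity correction term

ci_plain <- diff_hat + c(-1, 1) * z_star * se_diff
ci_corrected <- diff_hat + c(-1, 1) * (z_star * se_diff + cc)

rbind(plain = ci_plain, corrected = ci_corrected)
```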

9. Based on interviews of couples seeking divorces, a social worker compiles the following data related to the period of acquaintanceship before marriage and the duration of marriage.

Perform a test to determine whether the data substantiate an association between the stability of a marriage and the period of acquaintanceship prior to marriage. Use \(\alpha\) = 0.05.

Ans:

\({ H }_{ 0 }\): there is no association between the stability of a marriage and the period of acquaintanceship prior to marriage

\({ H }_{ A }\): there is an association between the stability of a marriage and the period of acquaintanceship prior to marriage

chi = (11 - 10.3)^2/10.3 + (8-8.7)^2/8.7+(28-28.1)^2/28.1+(24-23.9)^2/23.9+(21-21.6)^2/21.6+(19-18.4)^2/18.4
chi

## [1] 0.1409008

num_col  = 2
num_row = 3
df = (num_col-1)*(num_row-1)
1-pchisq(chi, df)

## [1] 0.931974

The chi-square statistic is 0.1409008 and the p-value is 0.931974. Since the p-value is greater than \(\alpha\) = 0.05, the result is not significant. There is not enough evidence to reject \({ H }_{ 0 }\) in favor of \({ H }_{ A }\).
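The same test can be run with `chisq.test`, which computes the expected counts without rounding. The contingency table itself is not reproduced in this write-up, so the 3 x 2 layout below is an assumption reconstructed from the observed counts in the formula above (one row per observed/expected pair of terms). Because the expected counts are unrounded, the statistic differs slightly from the hand calculation.

```r
# ASSUMED layout: rows and columns inferred from the hand calculation,
# not taken from the original table
observed <- matrix(c(11,  8,
                     28, 24,
                     21, 19),
                   ncol = 2, byrow = TRUE)

res <- chisq.test(observed)

res$expected      # unrounded expected counts
res$statistic     # chi-square statistic
res$parameter     # degrees of freedom
res$p.value
```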

Soumya Ghosh

December 22, 2018