HW1_hyunjin

Question 1

A statistics instructor used to provide their students with a formula sheet, and found that the median grade on exams was 74. Then, they switched to having the students write their own note-sheets. Here is the data from a small class after they switched:

X = c(64,72,73,79,80,81,84,89,90,92,95,98)

a) What aspect of this data may make the traditional (parametric) test for a sample mean un-advisable? Pick one, and explain your answer.

n = 12, which is less than 30. With such a small sample, the Central Limit Theorem cannot be relied on, and it is hard to verify that the data follow a specific (e.g., normal) distribution.

length(X)

## [1] 12

b) Specify the null and alternative hypothesis the instructor would like to test.

\[\begin{align*} H_0 &: \theta \leq 74 \\ H_A &: \theta > 74 \end{align*}\]

where \(\theta\) is the true median exam grade after the switch to note-sheets.

c) Calculate the approximate test-statistic, using the normal approximation to a binomial distribution

S = B+, where B+ is the number of observations greater than 74.

B+ = 9 \[\begin{align*} Z_s = \frac{9-12*0.5}{\sqrt{12*(0.25)}} = 1.732 \end{align*}\]

Z = (9-12*0.5)/sqrt(12*(0.25))
Z

## [1] 1.732051
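
The count B+ can also be obtained from the data instead of being typed in by hand. The following is a minimal sketch; the names `theta0`, `B_plus` and `Z_s` are illustrative and not part of the original solution.

```r
# Exam grades after the switch to note-sheets (from the problem statement)
X <- c(64, 72, 73, 79, 80, 81, 84, 89, 90, 92, 95, 98)
theta0 <- 74                 # hypothesized median (illustrative name)

n <- length(X)
B_plus <- sum(X > theta0)    # number of observations above theta0

# Under H0, B+ ~ Binomial(n, 0.5): mean = 0.5 * n, variance = 0.25 * n
Z_s <- (B_plus - 0.5 * n) / sqrt(0.25 * n)
c(B_plus = B_plus, Z_s = Z_s)
```

This should reproduce B+ = 9 and the test statistic 1.732 shown above.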

d) Calculate and interpret the p-value for the test.

P(Z > 1.732) = 0.0416, so the p-value is 0.0416. If the true median were 74 (the boundary of the null hypothesis), we would observe data this extreme or more extreme with probability 0.0416.
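
As a sketch of how this upper-tail probability could be computed in R, `pnorm()` gives the normal-approximation p-value. The exact binomial tail is included only as a cross-check and is not required by the question; the variable names are illustrative.

```r
n <- 12
B_plus <- 9
Z_s <- (B_plus - 0.5 * n) / sqrt(0.25 * n)

# Normal approximation: P(Z > Z_s)
p_approx <- pnorm(Z_s, lower.tail = FALSE)

# Exact cross-check: P(B+ >= 9) when B+ ~ Binomial(12, 0.5)
p_exact <- pbinom(B_plus - 1, n, 0.5, lower.tail = FALSE)

round(c(p_approx = p_approx, p_exact = p_exact), 4)
```

The approximate value is about 0.0416 and the exact tail is 299/4096, about 0.0730. Both exceed 0.01, so the decision at α = 0.01 is the same either way.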

e) State your conclusion about the hypotheses in (b) in terms of the problem, if α = 0.01.

At a significance level of 0.01, the p-value (0.0416) is greater than 0.01, so we fail to reject the null hypothesis. There is not enough evidence to conclude that the median exam grade is higher than 74 after the switch to student-written note-sheets.

Question 2

Continue with the data from Problem 1.

a) Using the normal approximation to the binomial distribution, calculate the lower and upper bound for a 90% confidence interval for the true median.

\[\begin{align*} \alpha = 0.1 \end{align*}\]

a = lower bound

\[\begin{align*} a &= -Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n \\ &= -1.64*\sqrt{0.25*12}+0.5*12 = 3.1594 \end{align*}\]

b = upper bound

\[\begin{align*} b &= Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n + 1 \\ &= 1.64*\sqrt{0.25*12}+0.5*12+1 = 9.8406 \end{align*}\]

Rounding a and b to the nearest whole positions gives 3 and 10, so the CI is \((X_{(3)},X_{(10)})\) or (73, 92).

b) Using the normal approximation to the binomial distribution, calculate the lower and upper bound for a 95% confidence interval.

\[\begin{align*} \alpha = 0.05 \end{align*}\]

a = lower bound

\[\begin{align*} a &= -Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n \\ &= -1.96*\sqrt{0.25*12}+0.5*12 = 2.61 \end{align*}\]

b = upper bound

\[\begin{align*} b &= Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n + 1 \\ &= 1.96*\sqrt{0.25*12}+0.5*12+1 = 10.39 \end{align*}\]

Rounding a and b to the nearest whole positions again gives 3 and 10, so the CI is \((X_{(3)},X_{(10)})\) or (73, 92), the same interval as in (a).
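
The position formulas for a and b can be wrapped in a small function. This is only a sketch: `approx_median_ci` is a hypothetical helper, it rounds to the nearest position as in the hand calculation, and it uses `qnorm()` (1.645 rather than 1.64 for 90%), so a and b may differ slightly in the third decimal.

```r
# Hypothetical helper: normal-approximation CI for the median
approx_median_ci <- function(x, conf.level) {
  x <- sort(x)
  n <- length(x)
  z <- qnorm(1 - (1 - conf.level) / 2)
  a <- 0.5 * n - z * sqrt(0.25 * n)       # lower position
  b <- 0.5 * n + 1 + z * sqrt(0.25 * n)   # upper position
  lo <- max(1, round(a))                  # nearest position, kept within 1..n
  hi <- min(n, round(b))
  c(a = a, b = b, lower = x[lo], upper = x[hi])
}

X <- c(64, 72, 73, 79, 80, 81, 84, 89, 90, 92, 95, 98)
ci90 <- approx_median_ci(X, 0.90)
ci95 <- approx_median_ci(X, 0.95)
rbind(ci90, ci95)
```

With n = 12 both confidence levels round to positions 3 and 10, which is why the two intervals coincide.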

c) Using R, calculate the exact 90% confidence interval for the true median.

library(BSDA)  # provides SIGN.test()
median = median(X)
median

## [1] 82.5

CI = SIGN.test(X, md = median, alternative = "two.sided", conf.level = 0.90)
CI$Confidence.Intervals

##                   Conf.Level  L.E.pt  U.E.pt
## Lower Achieved CI     0.8540 79.0000 90.0000
## Interpolated CI       0.9000 76.4309 90.8564
## Upper Achieved CI     0.9614 73.0000 92.0000

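Using R, the interpolated CI at the 0.90 confidence level is (76.4309, 90.8564). The nearest exactly achievable levels are 85.40%, with interval (79, 90), and 96.14%, with interval (73, 92).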

d) Using R, calculate the exact 95% confidence interval for the true median.

median = median(X)
median

## [1] 82.5

CI = SIGN.test(X, md = median, alternative = "two.sided", conf.level = 0.95)
CI$Confidence.Intervals

##                   Conf.Level  L.E.pt  U.E.pt
## Lower Achieved CI     0.8540 79.0000 90.0000
## Interpolated CI       0.9500 73.6382 91.7873
## Upper Achieved CI     0.9614 73.0000 92.0000

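Using R, the interpolated CI at the 0.95 confidence level is (73.6382, 91.7873). The nearest exactly achievable levels are 85.40%, with interval (79, 90), and 96.14%, with interval (73, 92).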
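
The "achieved" confidence levels in the SIGN.test output can be checked from the binomial distribution: the interval \((X_{(k)}, X_{(n-k+1)})\) covers the median with probability \(1 - 2P(B \leq k-1)\), where B ~ Binomial(n, 0.5). A short sketch in base R (variable names are illustrative):

```r
X <- sort(c(64, 72, 73, 79, 80, 81, 84, 89, 90, 92, 95, 98))
n <- length(X)
k <- 1:5

# Exact coverage of the interval (X_(k), X_(n-k+1)) for the median
coverage <- 1 - 2 * pbinom(k - 1, n, 0.5)

data.frame(k = k,
           lower = X[k],
           upper = X[n - k + 1],
           coverage = round(coverage, 4))
```

For k = 4 the coverage is 0.8540 and for k = 3 it is 0.9614, so no pair of order statistics gives exactly 90% or 95%. This is why the output reports an interpolated interval between the two.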

Question 3

a) What is the minimum and maximum of our dataset?

min(X)

## [1] 64

max(X)

## [1] 98

The minimum of the dataset is 64 and the maximum is 98.

b) Would you say that some of the confidence intervals calculated are impractical? If so, which ones, and why?

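I think they are impractical because the intervals are not narrow. For example, (73, 92) has a width of 19, which is more than half of the full range of the data (98 - 64 = 34), so it says little about where the true median lies.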

Question 4

The weights for a particular breed of chicken follow in pounds (in increasing order by row): 3.75 3.78 3.84 3.84 3.88 3.92 3.93 3.93 3.94 3.94 3.96 3.96 3.96 3.98 3.99 4.02 4.02 4.03 4.06 4.06 4.09 4.10 4.12 4.17 Note, this data can be found under the Dataset folder on Piazza, as chicken.csv. The breeder claims that the median weight for the chickens is 4 lbs.

a) State the null and alternative hypothesis.

\[\begin{align*} H_0 &: \theta = 4 \\ H_A &: \theta \neq 4 \end{align*}\]

b) Calculate the approximate test-statistic using the normal approximation to binomial.

S = B+ = 9 (the number of observations greater than 4 lbs)

n = 24

Zs = \(\frac{9-24*0.5}{\sqrt{24*0.25}}\) = -1.2247

c) Calculate the p-value associated with the test statistic in (b).

2P(Z < -1.2247) = 0.221, so the p-value is 0.221. If the true median were 4 lbs, we would observe data this extreme or more extreme with probability 0.221.

d) Use R to calculate the exact p-value for the hypothesis test. How large of a difference was there?

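Using R, I got Zs = -1.2247 and a two-sided p-value of 0.2207, which matches the hand calculation in (c) up to rounding. Note that the code below still uses the normal approximation; an exact p-value would come directly from the Binomial(24, 0.5) distribution.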

setwd("~/Downloads")
chicken <- read.csv("chicken.csv")
X2 = chicken$weight
B = sum(X2 > 4)
B

## [1] 9

n = length(X2)
n

## [1] 24

Zs = (B - n*0.50)/sqrt(0.25*n)
Zs

## [1] -1.224745

p.value.two.sided = 2*pnorm(abs(Zs),lower.tail= FALSE)
p.value.two.sided

## [1] 0.2206714
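
The code above reproduces the normal approximation. One possible way to get an exact p-value is sketched below using base R only. It assumes chicken.csv contains the 24 weights listed in the problem, which are typed in directly here; the names `weights`, `p_exact` and `p_cc` are illustrative.

```r
# Chicken weights in pounds, as listed in the problem statement
weights <- c(3.75, 3.78, 3.84, 3.84, 3.88, 3.92, 3.93, 3.93,
             3.94, 3.94, 3.96, 3.96, 3.96, 3.98, 3.99, 4.02,
             4.02, 4.03, 4.06, 4.06, 4.09, 4.10, 4.12, 4.17)
n <- length(weights)
B <- sum(weights > 4)

# Exact two-sided p-value: double the smaller tail of Binomial(n, 0.5)
p_exact <- min(1, 2 * pbinom(min(B, n - B), n, 0.5))

# Same test via binom.test(), as a cross-check
p_binom <- binom.test(B, n, p = 0.5)$p.value

# Normal approximation with a continuity correction, for comparison
p_cc <- 2 * pnorm((B + 0.5 - 0.5 * n) / sqrt(0.25 * n))

round(c(p_exact = p_exact, p_binom = p_binom, p_cc = p_cc), 4)
```

The exact p-value is about 0.3075, noticeably larger than the uncorrected approximation of 0.2207, and the continuity-corrected approximation is much closer to it. The conclusion at α = 0.05 does not change.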

e) State your conclusion in terms of the problem, if α = 0.05.

At a significance level of 0.05, the p-value (0.221) is greater than 0.05, so we fail to reject the null hypothesis. There is not enough evidence to dispute the breeder's claim that the median weight of the chickens is 4 lbs.

Question 5

Continue with Problem 4.

a) Using R, calculate the exact 90% confidence interval for the true median

CI2 = SIGN.test(X2, md = median(X2), alternative = "two.sided",conf.level = 0.90)
CI2$Confidence.Intervals

##                   Conf.Level L.E.pt U.E.pt
## Lower Achieved CI     0.8484 3.9400   4.02
## Interpolated CI       0.9000 3.9341   4.02
## Upper Achieved CI     0.9361 3.9300   4.02

Using R, the interpolated CI at the 0.90 confidence level is (3.9341, 4.02).

b) Using the normal approximation to the binomial distribution, calculate the lower and upper bound for a 90% confidence interval for the true median.

\[\begin{align*} \alpha = 0.1 \end{align*}\]

a = lower bound

\[\begin{align*} a &= -Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n \\ &= -1.64*\sqrt{0.25*24}+0.5*24 = 7.9828 \end{align*}\]

b = upper bound

\[\begin{align*} b &= Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n + 1 \\ &= 1.64*\sqrt{0.25*24}+0.5*24+1 = 17.0172 \end{align*}\]

Rounding a and b to the nearest whole positions gives 8 and 17, so the CI is \((X_{(8)},X_{(17)})\) or (3.93, 4.02).

c) Which interval is narrower, and why?

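They are almost the same. The exact interpolated interval (3.9341, 4.02) is slightly narrower than the approximate interval (3.93, 4.02). They agree closely because n = 24 is large enough for the normal approximation to the binomial to work well.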

d) Interpret the interval from (b).

We are 90% confident that the true median weight for the chickens is between 3.93 and 4.02 lbs.

Question 6

The salaries of 30 faculty members from a particular department follow (in tens of thousands of dollars, rounded to a whole number): 77 85 89 92 93 93 93 93 94 95 95 96 98 99 99 99 100 102 102 102 103 104 108 111 112 112 113 113 114 117 Note, this data can be found under the Dataset folder on Piazza, as salary.csv

a) Approximate the 95% confidence interval for the percentile which corresponds to the value 99.

Using R, \(\hat{F}(99) = 0.53333...\)

salary <- read.csv("salary.csv")
X3 = salary$cash
X3 = sort(X3)
CDF = cumsum(table(X3))/length(X3)
CDF

##         77         85         89         92         93         94         95 
## 0.03333333 0.06666667 0.10000000 0.13333333 0.26666667 0.30000000 0.36666667 
##         96         98         99        100        102        103        104 
## 0.40000000 0.43333333 0.53333333 0.56666667 0.66666667 0.70000000 0.73333333 
##        108        111        112        113        114        117 
## 0.76666667 0.80000000 0.86666667 0.93333333 0.96666667 1.00000000

\(\alpha = 0.05\)

n = 30

Using the formula below, \[\begin{align*} \hat{F}(x) \pm Z_{1-\frac{\alpha}{2}} \sqrt{\frac{\hat{F}(x)*(1-\hat{F}(x))}{n}} \end{align*}\]

lower bound is \[\begin{align*} 0.5333 - 1.96*\sqrt{\frac{0.5333*(1-0.5333)}{30}} = 0.3548 \end{align*}\]

upper bound is \[\begin{align*} 0.5333 + 1.96*\sqrt{\frac{0.5333*(1-0.5333)}{30}} = 0.7118 \end{align*}\]

The 95% CI is (0.3548, 0.7118). Thus, we estimate that x = 99 lies somewhere between the 35.48th and 71.18th percentiles.

b) Approximate the 95% confidence interval for the 10th percentile.

\(P^* = 0.1\)

\(\alpha = 0.05\)

lower bound is \[\begin{align*} & n*P^*-Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ &= 30*0.1-1.96*\sqrt{0.1*(1-0.1)*30} = -0.2206 \end{align*}\] Since this is less than 1, we use the 1st position.

upper bound is \[\begin{align*} & n*P^*+1+Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ &= 30*0.1+1+1.96*\sqrt{0.1*(1-0.1)*30} = 7.2206 \end{align*}\] Since this is close to 7, we round it down to the 7th position. Therefore, the CI is \((X_{(1)},X_{(7)})\) or (77, 93).

c) Approximate the 90% confidence interval for the percentile which corresponds to the value 94.

Using R, \(\hat{F}(94) = 0.3\)

\(\alpha = 0.1\)

n = 30

Using the formula below, \[\begin{align*} \hat{F}(x) \pm Z_{1-\frac{\alpha}{2}} \sqrt{\frac{\hat{F}(x)*(1-\hat{F}(x))}{n}} \end{align*}\]

lower bound is \[\begin{align*} 0.3 - 1.65*\sqrt{\frac{0.3*(1-0.3)}{30}} = 0.1620 \end{align*}\]

upper bound is \[\begin{align*} 0.3 + 1.65*\sqrt{\frac{0.3*(1-0.3)}{30}} = 0.4380 \end{align*}\]

The 90% CI is (0.1620, 0.4380). Thus, we estimate that x = 94 lies somewhere between the 16.2th and 43.8th percentiles.
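
The interval for \(\hat{F}(x)\) used in (a) and (c) can be computed in R along the following lines. This is a sketch: `ecdf_ci` is a hypothetical helper, the salaries are typed in from the problem statement rather than read from salary.csv, and `qnorm()` gives 1.645 instead of 1.65 for the 90% level, so the last decimals can differ slightly from the hand calculation.

```r
# Salaries in tens of thousands of dollars (from the problem statement)
salary <- c(77, 85, 89, 92, 93, 93, 93, 93, 94, 95, 95, 96, 98, 99, 99,
            99, 100, 102, 102, 102, 103, 104, 108, 111, 112, 112, 113,
            113, 114, 117)

# Hypothetical helper: normal-approximation CI for F(value)
ecdf_ci <- function(x, value, conf.level) {
  n <- length(x)
  F_hat <- mean(x <= value)                  # empirical CDF at `value`
  z <- qnorm(1 - (1 - conf.level) / 2)
  half <- z * sqrt(F_hat * (1 - F_hat) / n)  # half-width of the interval
  c(F_hat = F_hat, lower = F_hat - half, upper = F_hat + half)
}

ci99 <- ecdf_ci(salary, 99, 0.95)
ci94 <- ecdf_ci(salary, 94, 0.90)
round(rbind(ci99, ci94), 4)
```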

d) Approximate the 90% confidence interval for the 90th percentile.

\(P^* = 0.9\)

\(\alpha = 0.1\)

lower bound is \[\begin{align*} & n*P^*-Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ &= 30*0.9-1.65*\sqrt{0.9*(1-0.9)*30} = 24.2888 \end{align*}\] Since this is close to 24, we round it down to the 24th position.

upper bound is \[\begin{align*} & n*P^*+1+Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ &= 30*0.9+1+1.65*\sqrt{0.9*(1-0.9)*30} = 30.7112 \end{align*}\] Since this is close to 31, we would round it up to the 31st position, but n = 30, so we use the 30th (largest) position. Therefore, the CI is \((X_{(24)},X_{(30)})\) or (111, 117).
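
The position formulas from (b) and (d) can be sketched as one function. `percentile_ci` is a hypothetical helper; it rounds to the nearest position and then clamps the result to 1..n, which mirrors the adjustments made by hand above.

```r
salary <- c(77, 85, 89, 92, 93, 93, 93, 93, 94, 95, 95, 96, 98, 99, 99,
            99, 100, 102, 102, 102, 103, 104, 108, 111, 112, 112, 113,
            113, 114, 117)

# Hypothetical helper: normal-approximation CI for the p-th percentile
percentile_ci <- function(x, p, conf.level) {
  x <- sort(x)
  n <- length(x)
  z <- qnorm(1 - (1 - conf.level) / 2)
  half <- z * sqrt(p * (1 - p) * n)
  a <- n * p - half          # lower position
  b <- n * p + 1 + half      # upper position
  lo <- min(n, max(1, round(a)))   # clamp positions to 1..n
  hi <- min(n, max(1, round(b)))
  c(a = a, b = b, lower = x[lo], upper = x[hi])
}

p10 <- percentile_ci(salary, 0.10, 0.95)
p90 <- percentile_ci(salary, 0.90, 0.90)
rbind(p10, p90)
```

Under these assumptions the intervals come out as (77, 93) for the 10th percentile and (111, 117) for the 90th percentile, in agreement with the hand calculations.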

Question 7

Answer the following questions with TRUE or FALSE. It is good practice to explain your answers.

a) Non-parametric tests have no assumptions.

False

Non-parametric methods do not require the typical assumptions of traditional (parametric) techniques. However, they still have assumptions, just less strict ones. (There is a PDF of all assumptions for the non-parametric tests on Piazza as well.)

b) When the sample size is small, the main assumptions of parametric tests may be violated.

True

With a small sample size, we cannot assume that the sampling distribution is approximately normal, so this is true. We can use a non-parametric test instead.

c) The median is heavily influenced by outliers.

False

The median is not significantly influenced by outliers, whereas the mean is.

d) The mean is heavily influenced by outliers.

True

Yes. Outliers are a common problem and can violate the assumptions of common parametric tests for the mean.
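
A small illustration of (c) and (d), using the exam grades from Question 1 with one value replaced by a made-up outlier (the value 4 is hypothetical, as if 64 had been mistyped):

```r
X <- c(64, 72, 73, 79, 80, 81, 84, 89, 90, 92, 95, 98)

X_out <- X
X_out[1] <- 4   # hypothetical outlier replacing the grade 64

# Compare mean and median with and without the outlier
rbind(original     = c(mean = mean(X),     median = median(X)),
      with_outlier = c(mean = mean(X_out), median = median(X_out)))
```

The mean drops by 5 points while the median stays at 82.5, which is consistent with the answers above.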

HW1_hyunjin_chang
