A statistics instructor used to provide their students with a formula sheet, and found that the median exam grade was 74. Then, they switched to having the students write their own note-sheets. Here is the data from a small class after the switch:
X = c(64,72,73,79,80,81,84,89,90,92,95,98)
n = 12, which is less than 30. With such a small sample, it is hard to say that the data follow any specific distribution.
length(X)
## [1] 12
\[\begin{align*} H_0: \theta \leq 74 \\ H_A: \theta > 74 \end{align*}\]
S = B+, where B+ is the number of observations that are greater than 74.
B+ = 9 \[\begin{align*} Z_s = \frac{9-12*0.5}{\sqrt{12*(0.25)}} = 1.732 \end{align*}\]
Z = (9-12*0.5)/sqrt(12*(0.25))
Z
## [1] 1.732051
P(Z > 1.732) = 0.0416, so the p-value is 0.0416. If the true median were 74 or less, we would observe our data or something more extreme with probability 0.0416.
At a significance level of 0.01, since the p-value is greater than 0.01, we fail to reject the null hypothesis. We do not have sufficient evidence to conclude that the median exam grade is greater than 74.
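Because n = 12 is small, it can be useful to compare the normal approximation above with an exact binomial tail probability. The sketch below is not part of the original solution; it assumes that under the null hypothesis B+ follows a Binomial(n, 0.5) distribution, and the variable names (`theta0`, `p_exact`, `p_normal`) are illustrative.

```r
# Exam scores from Problem 1
X <- c(64, 72, 73, 79, 80, 81, 84, 89, 90, 92, 95, 98)
theta0 <- 74                       # hypothesized median

n <- length(X)
B_plus <- sum(X > theta0)          # sign statistic: observations above 74

# Normal approximation, matching the hand calculation
Z <- (B_plus - 0.5 * n) / sqrt(0.25 * n)
p_normal <- pnorm(Z, lower.tail = FALSE)

# Exact upper tail: P(B >= B_plus) when B ~ Binomial(n, 0.5)
p_exact <- pbinom(B_plus - 1, size = n, prob = 0.5, lower.tail = FALSE)

c(normal = p_normal, exact = p_exact)
```

The exact tail works out to 299/4096, or about 0.073, which is larger than the approximate value of 0.0416. Both are above 0.01, so the decision at that level is the same either way.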
Continue with the data from Problem 1.
\[\begin{align*} \alpha = 0.1 \end{align*}\] a = lower bound \[\begin{align*} a = -Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n \end{align*}\] \[\begin{align*} =-1.64*\sqrt{0.25*12}+0.5*12 = 3.1594 \end{align*}\] b = upper bound \[\begin{align*} b = Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n + 1 \end{align*}\] \[\begin{align*} =1.64*\sqrt{0.25*12}+0.5*12+1 = 9.8406 \end{align*}\]
so the CI is \((X_{(3)},X_{(10)})\) or (73, 92)
\[\begin{align*} \alpha = 0.05 \end{align*}\] a = lower bound \[\begin{align*} a = -Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n \end{align*}\] \[\begin{align*} =-1.96*\sqrt{0.25*12}+0.5*12 = 2.61 \end{align*}\] b = upper bound \[\begin{align*} b = Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n + 1 \end{align*}\] \[\begin{align*} =1.96*\sqrt{0.25*12}+0.5*12+1 = 10.39 \end{align*}\]
so the CI is \((X_{(3)},X_{(10)})\) or (73, 92)
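The hand calculations for both confidence levels can be checked with a short helper. This is a minimal sketch, assuming the positions a and b are rounded to the nearest integer as done above; the function name `sign_ci_normal` is made up for illustration, and it uses `qnorm()` rather than the table values 1.64 and 1.96.

```r
# Exam scores from Problem 1
X <- c(64, 72, 73, 79, 80, 81, 84, 89, 90, 92, 95, 98)

# Normal-approximation CI for the median based on order statistics
sign_ci_normal <- function(x, alpha) {
  n <- length(x)
  z <- qnorm(1 - alpha / 2)
  a <- -z * sqrt(0.25 * n) + 0.5 * n        # lower position
  b <-  z * sqrt(0.25 * n) + 0.5 * n + 1    # upper position
  # Round to the nearest position and keep it inside 1..n
  pos <- pmin(pmax(round(c(a, b)), 1), n)
  sort(x)[pos]
}

ci90 <- sign_ci_normal(X, alpha = 0.10)
ci95 <- sign_ci_normal(X, alpha = 0.05)
ci90
ci95
```

With only 12 observations, the rounded positions are 3 and 10 at both levels, which is why the 90% and 95% intervals coincide.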
library(BSDA) # provides SIGN.test
median = median(X)
median
## [1] 82.5
CI = SIGN.test(X, md = median, alternative = "two.sided", conf.level = 0.90)
CI$Confidence.Intervals
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8540 79.0000 90.0000
## Interpolated CI 0.9000 76.4309 90.8564
## Upper Achieved CI 0.9614 73.0000 92.0000
Using R, the interpolated CI at confidence level 0.90 is (76.4309, 90.8564).
median = median(X)
median
## [1] 82.5
CI = SIGN.test(X, md = median, alternative = "two.sided", conf.level = 0.95)
CI$Confidence.Intervals
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8540 79.0000 90.0000
## Interpolated CI 0.9500 73.6382 91.7873
## Upper Achieved CI 0.9614 73.0000 92.0000
Using R, the interpolated CI at confidence level 0.95 is (73.6382, 91.7873).
min(X)
## [1] 64
max(X)
## [1] 98
The minimum of the dataset is 64 and the maximum is 98.
I think they are impractical because the resulting interval is not narrow.
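To put a number on how wide the interval (64, 98) is, its confidence level can be computed directly. This is a sketch under an added assumption: the scores are independent draws from a continuous distribution, so each one falls below the true median with probability 0.5. The interval from the minimum to the maximum then misses the median only when all 12 observations land on the same side of it.

```r
n <- 12

# P(all observations above the median) = P(all below) = 0.5^n
p_miss_one_side <- pbinom(0, size = n, prob = 0.5)

# Confidence level of the interval (min, max)
conf_level <- 1 - 2 * p_miss_one_side
conf_level
```

The result is 1 - 2/4096, roughly 0.9995. The interval almost certainly contains the median, but only because it spans the entire sample.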
The weights for a particular breed of chicken follow in pounds (in increasing order by row): 3.75 3.78 3.84 3.84 3.88 3.92 3.93 3.93 3.94 3.94 3.96 3.96 3.96 3.98 3.99 4.02 4.02 4.03 4.06 4.06 4.09 4.10 4.12 4.17 Note, this data can be found under the Dataset folder on Piazza, as chicken.csv. The breeder claims that the median weight for the chickens is 4 lbs.
\[\begin{align*} H_0: \theta = 4 \\ H_A: \theta \neq 4 \end{align*}\]
S = B+ = 9 (the number of observations that are greater than 4 lbs)
B+ = 9
n = 24
Zs = \(\frac{9-24*0.5}{\sqrt{24*0.25}}\) = -1.2247
2P(Z < -1.2247) = 0.221, so the p-value is 0.221. If the true median were 4, we would observe our data or something more extreme with probability 0.221.
Using R, I got Zs = -1.2247 and a two-sided p-value of 0.2207, so there is no meaningful difference from the hand calculation.
setwd("~/Downloads")
chicken <- read.csv("chicken.csv")
X2 = chicken$weight
B = sum(X2 > 4)
B
## [1] 9
n = length(X2)
n
## [1] 24
Zs = (B - n*0.50)/sqrt(0.25*n)
Zs
## [1] -1.224745
p.value.two.sided = 2*pnorm(abs(Zs),lower.tail= FALSE)
p.value.two.sided
## [1] 0.2206714
At a significance level of 0.05, since our p-value of 0.221 is greater than 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence to conclude that the median weight of the chickens differs from 4 lbs.
Continue with Problem 4.
CI2 = SIGN.test(X2, md = median(X2), alternative = "two.sided",conf.level = 0.90)
CI2$Confidence.Intervals
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8484 3.9400 4.02
## Interpolated CI 0.9000 3.9341 4.02
## Upper Achieved CI 0.9361 3.9300 4.02
Using R, the interpolated CI at confidence level 0.90 is (3.9341, 4.02).
\[\begin{align*} \alpha = 0.1 \end{align*}\] a = lower bound \[\begin{align*} a = -Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n \end{align*}\] \[\begin{align*} =-1.64*\sqrt{0.25*24}+0.5*24 = 7.9828 \end{align*}\] b = upper bound \[\begin{align*} b = Z_{1-\frac{\alpha}{2}}*\sqrt{0.25*n} + 0.5*n + 1 \end{align*}\] \[\begin{align*} =1.64*\sqrt{0.25*24}+0.5*24+1 = 17.0172 \end{align*}\]
so the CI is \((X_{(8)},X_{(17)})\) or (3.93, 4.02)
The two intervals are almost the same, because the sample is large enough for the normal approximation to work well.
We are 90% confident that the true median weight for the chickens is between 3.93 and 4.02 lbs.
The salaries of 30 faculty members from a particular department follow (in tens of thousands of dollars, rounded to a whole number): 77 85 89 92 93 93 93 93 94 95 95 96 98 99 99 99 100 102 102 102 103 104 108 111 112 112 113 113 114 117 Note, this data can be found under the Dataset folder on Piazza, as salary.csv
Using R, \(\hat{F}(99) = 0.53333...\)
salary <- read.csv("salary.csv")
X3 = salary$cash
X3 = sort(X3)
CDF = cumsum(table(X3))/length(X3)
CDF
## 77 85 89 92 93 94 95
## 0.03333333 0.06666667 0.10000000 0.13333333 0.26666667 0.30000000 0.36666667
## 96 98 99 100 102 103 104
## 0.40000000 0.43333333 0.53333333 0.56666667 0.66666667 0.70000000 0.73333333
## 108 111 112 113 114 117
## 0.76666667 0.80000000 0.86666667 0.93333333 0.96666667 1.00000000
\(\alpha = 0.05\)
n = 30
Using the formula below, \[\begin{align*} \hat{F}(x) \pm Z_{1-\frac{\alpha}{2}} \sqrt{\frac{\hat{F}(x)*(1-\hat{F}(x))}{n}} \end{align*}\]
lower bound is \[\begin{align*} 0.5333 - 1.96*\sqrt{\frac{0.5333*(1-0.5333)}{30}} = 0.3548 \end{align*}\]
upper bound is \[\begin{align*} 0.5333 + 1.96*\sqrt{\frac{0.5333*(1-0.5333)}{30}} = 0.7118 \end{align*}\]
The 95% CI is (0.3548, 0.7118). Thus, we estimate that x = 99 falls somewhere between the 35.48th and 71.18th percentiles.
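The interval for \(\hat{F}(99)\) can be reproduced with base R's `ecdf()`. This is a minimal sketch: the salaries are typed in from the problem statement rather than read from salary.csv, and the helper name `ecdf_ci` is illustrative. It uses `qnorm(0.975)` instead of the rounded 1.96, so the last digit may differ slightly from the hand calculation.

```r
# Salaries from the problem statement (tens of thousands of dollars)
salary <- c(77, 85, 89, 92, 93, 93, 93, 93, 94, 95, 95, 96, 98, 99, 99,
            99, 100, 102, 102, 102, 103, 104, 108, 111, 112, 112, 113,
            113, 114, 117)

# Normal-approximation CI for the CDF at a single point
ecdf_ci <- function(x, at, alpha) {
  n <- length(x)
  F_hat <- ecdf(x)(at)                       # proportion of x <= at
  z <- qnorm(1 - alpha / 2)
  half_width <- z * sqrt(F_hat * (1 - F_hat) / n)
  c(estimate = F_hat, lower = F_hat - half_width, upper = F_hat + half_width)
}

ci_99 <- ecdf_ci(salary, at = 99, alpha = 0.05)
ci_99
```

The same helper can be called with a different point and significance level, for example `ecdf_ci(salary, at = 94, alpha = 0.10)`.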
\(P^* = 0.1\)
\(\alpha = 0.05\)
lower bound is \[\begin{align*} n*P^*-Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ = 30*0.1-1.96*\sqrt{0.1*(1-0.1)*30} = -0.2206 \end{align*}\] since it is less than 1, we can round it up to 1st position.
upper bound is \[\begin{align*} n*P^*+1+Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ = 30*0.1+1+1.96*\sqrt{0.1*(1-0.1)*30} = 7.2206 \end{align*}\] Since it is close to 7, we can round it down to the 7th position. Therefore, the CI is (77, 93).
Using R, \(\hat{F}(94) = 0.3\)
\(\alpha = 0.1\)
n = 30
Using the formula below, \[\begin{align*} \hat{F}(x) \pm Z_{1-\frac{\alpha}{2}} \sqrt{\frac{\hat{F}(x)*(1-\hat{F}(x))}{n}} \end{align*}\]
lower bound is \[\begin{align*} 0.3 - 1.65*\sqrt{\frac{0.3*(1-0.3)}{30}} = 0.1620 \end{align*}\]
upper bound is \[\begin{align*} 0.3 + 1.65*\sqrt{\frac{0.3*(1-0.3)}{30}} = 0.4380 \end{align*}\]
The 90% CI is (0.1620, 0.4380). Thus, we estimate that x = 94 falls somewhere between the 16.2th and 43.8th percentiles.
\(P^* = 0.9\)
\(\alpha = 0.1\)
lower bound is \[\begin{align*} n*P^*-Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ = 30*0.9-1.65*\sqrt{0.9*(1-0.9)*30} = 24.2888 \end{align*}\] Since it is close to 24, we can round it down to the 24th position.
upper bound is \[\begin{align*} n*P^*+1+Z_{1-\frac{\alpha}{2}}\sqrt{P^* * (1-P^*)*n} \\ = 30*0.9+1+1.65*\sqrt{0.9*(1-0.9)*30} = 30.7112 \end{align*}\] Since it is close to 31, we can round it up to 31. However, since n = 30, the largest available position is 30. Therefore, the CI is (111, 117).
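The position formulas used for the 10th and 90th percentile intervals can be wrapped in one helper. This is a sketch, not the course's official method: it assumes positions are rounded to the nearest integer and then capped to the range 1..n, which matches the rounding decisions made by hand above. The function name `quantile_ci_normal` is made up, and the salaries are typed in from the problem statement.

```r
# Salaries from the problem statement (tens of thousands of dollars)
salary <- c(77, 85, 89, 92, 93, 93, 93, 93, 94, 95, 95, 96, 98, 99, 99,
            99, 100, 102, 102, 102, 103, 104, 108, 111, 112, 112, 113,
            113, 114, 117)

# Normal-approximation CI for the p-th quantile based on order statistics
quantile_ci_normal <- function(x, p, alpha) {
  n <- length(x)
  z <- qnorm(1 - alpha / 2)
  spread <- z * sqrt(p * (1 - p) * n)
  raw <- c(n * p - spread, n * p + 1 + spread)   # unrounded positions
  pos <- pmin(pmax(round(raw), 1), n)            # round, then cap to 1..n
  sort(x)[pos]
}

ci_p10 <- quantile_ci_normal(salary, p = 0.1, alpha = 0.05)
ci_p90 <- quantile_ci_normal(salary, p = 0.9, alpha = 0.10)
ci_p10
ci_p90
```

Capping at 1 and n is what turns the out-of-range positions (-0.22 and 30.71) into the smallest and largest observations.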
Answer the following questions with TRUE or FALSE. It is good practice to explain your answers.
False
Non-parametric statistics use techniques that do not require the typical assumptions of traditional techniques. However, non-parametric methods still have assumptions of their own, just less strict ones. (There is a PDF of all assumptions for non-parametric tests on Piazza as well.)
True
We cannot assume that the data are approximately normal when the sample size is small, so this is true. We can use a non-parametric test instead.
False
The median is not significantly influenced by outliers, while the mean is.
True
Yes. Outliers are a common problem that can violate the assumptions of common parametric tests for the mean.
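The different sensitivity of the mean and the median to outliers is easy to see on the exam scores from Problem 1. In the sketch below, the outlier is hypothetical: it assumes the top score of 98 was mistakenly recorded as 980.

```r
# Exam scores from Problem 1
X <- c(64, 72, 73, 79, 80, 81, 84, 89, 90, 92, 95, 98)

# Hypothetical data-entry error: the largest score 98 is recorded as 980
X_outlier <- X
X_outlier[which.max(X_outlier)] <- 980

c(mean = mean(X), median = median(X))
c(mean = mean(X_outlier), median = median(X_outlier))
```

The median stays at 82.5 in both cases, while the mean is pulled far upward by the single bad value.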