The mean height is 171.1 cm and the median height is 170.3 cm.
The standard deviation is 9.4 cm and \(IQR = Q3 - Q1 = 177.8 - 163.8 = 14.0\) cm.
To decide whether a person is unusually tall or short, we calculate their Z-score and the associated probability of being at that height or more extreme. If that probability is sufficiently small, we decide the height is unusual. We treat a height with \(|Z| > 1.5\) as unusual. Applying this criterion, we evaluate a person of height 180 cm: \[ Z_{180} = \frac{180-171.1}{9.4} = 0.9468085 \]
z180 = (180 - 171.1) /9.4
1 - pnorm( z180 )
## [1] 0.1718682
We conclude that a height of 180 cm is not unusually tall, since 17.2% of adults are this height or taller and the Z-score of 0.95 is below 1.5.
Next, we consider a short person of height 155 cm:
(z155 = ( 155 - 171.1 ) / 9.4 )
## [1] -1.712766
pnorm( z155)
## [1] 0.0433778
We conclude that a height of 155 cm is unusual because only 4.3% of adults are this height or shorter and the Z score is -1.71 < -1.5.
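The two checks above follow the same recipe, so they can be wrapped in a small helper. The sketch below is illustrative only: the function name `is_unusual_height` and its arguments are hypothetical, the mean of 171.1 cm, standard deviation of 9.4 cm, and cutoff of 1.5 are taken from the text above, and heights are assumed to be approximately normal.

```r
# Hypothetical helper: flags a height as unusual when |Z| exceeds a cutoff.
# Defaults (mean 171.1 cm, sd 9.4 cm, cutoff 1.5) come from the text above.
is_unusual_height <- function(height, mu = 171.1, sigma = 9.4, cutoff = 1.5) {
  z <- (height - mu) / sigma
  # Tail probability of being at this height or more extreme on the same side
  tail_prob <- pnorm(-abs(z))
  list(z = z, tail_prob = tail_prob, unusual = abs(z) > cutoff)
}

tall  <- is_unusual_height(180)
short <- is_unusual_height(155)
print(c(tall$z, tall$tail_prob))    # Z and upper-tail probability for 180 cm
print(c(short$z, short$tail_prob))  # Z and lower-tail probability for 155 cm
```

Using `pnorm(-abs(z))` gives the tail on whichever side the height falls, so the same call covers both the tall and the short case.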
Another random sample would include different individuals because of sampling variation. Hence the sample mean and standard deviation would likely differ from the values above.
The variability of the sample mean as an estimate is measured by the standard error. For the bdims dataset, the standard error is about 0.42 cm, computed from the sample standard deviation and the sample size as shown below.
sd( bdims$hgt) / sqrt( length( bdims$hgt))
## [1] 0.4177887
FALSE. The 95% confidence interval is about the average spending of all American adults, not just the 436 survey participants. We already have perfect knowledge of the average spending of the sampled Americans.
FALSE. The confidence interval is valid even if the true distribution is skewed, provided that the sample size is large enough and independence is assured. A large sample size such as \(n=436\) is sufficient to offset the skewness.
FALSE. The confidence interval is a statement about the population mean: we are 95% confident that it lies inside the interval. It does not say that 95% of the means of other random samples will fall inside this specific interval.
TRUE. This is the definition of a confidence interval.
TRUE. A lower confidence level gives a narrower confidence interval.
FALSE. The margin of error decreases with the square root of \(n\). Thus, to reduce the margin of error to a third, we need 9 times as many observations in the sample, not three times.
TRUE. The sample mean is midway between the endpoints of the confidence interval. Taking the right endpoint minus the sample mean gives the margin of error, which is 4.4 as claimed.
89.11 - 84.71
## [1] 4.4
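The square-root law behind the answer about tripling the sample size can be made explicit. As a sketch, write the margin of error as \(ME = z^{*} s/\sqrt{n}\) (the symbols \(z^{*}\), \(s\), and \(k\) are notation introduced here, not taken from the exercise) and ask which multiple \(k\) of the sample size cuts it to a third:

\[ \frac{z^{*} s}{\sqrt{kn}} = \frac{1}{3}\cdot\frac{z^{*} s}{\sqrt{n}} \;\Longrightarrow\; \sqrt{k} = 3 \;\Longrightarrow\; k = 9 \]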
Yes, the conditions for inference are met. Intelligence appears to be normally distributed, and a sample of size \(n=36\) is large enough to make inferences. The sample is random and includes less than 10% of the large city’s school children, so the observations are likely independent.
We will use a one-sided hypothesis test. The null hypothesis is:
\(H_{0}: \text{Avg(Age Gifted Count to 10)} \ge \text{Avg(Age General Count to 10)}\)
and the alternative hypothesis is:
\(H_{1}: \text{Avg(Age Gifted Count to 10)} < \text{Avg(Age General Count to 10)}\)
( Z = ( 30.69 - 32 ) / (4.31 /sqrt(36) ) )
## [1] -1.823666
(pnorm(-abs(Z)) )
## [1] 0.0341013
Since the one-sided p-value of 3.41% is less than the significance level of 10%, we reject the null hypothesis and conclude that gifted children first count to 10 at a younger age, on average, than the general population.
The p-value is for a one-sided hypothesis test. It is the probability of observing a sample mean of 30.69 months or lower, given that the average age at which gifted children first count to 10 is equal to or greater than that of children in general. Here that probability is 3.41%.
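The test above can be collected into one reusable function. This is a minimal sketch, not the course's prescribed method: the name `z_test_lower` and its arguments are hypothetical, and it assumes a lower-tailed Z test computed from summary statistics, with the sample mean, null value, standard deviation, sample size, and significance level taken from the calculation above.

```r
# Hypothetical helper: lower-tailed one-sample Z test from summary statistics.
z_test_lower <- function(xbar, mu0, s, n, alpha = 0.10) {
  se <- s / sqrt(n)            # standard error of the sample mean
  z  <- (xbar - mu0) / se      # test statistic
  p  <- pnorm(z)               # P(Z <= z): evidence that the mean is below mu0
  list(z = z, p_value = p, reject = p < alpha)
}

res <- z_test_lower(xbar = 30.69, mu0 = 32, s = 4.31, n = 36)
print(res)
```

Because the alternative is one-sided and points downward, the p-value is the lower tail `pnorm(z)` rather than a doubled tail.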
To calculate a 90% confidence interval for the average age at which gifted children first count to 10 successfully, we use a critical value of \(z^{*} = 1.645\).
( ci_lower_age = 30.69 - 1.645 * 4.31/sqrt(36) )
## [1] 29.50834
( ci_upper_age = 30.69 + 1.645 * 4.31/sqrt(36) )
## [1] 31.87166
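The critical value 1.645 does not have to be looked up: it is the 95th percentile of the standard normal distribution. The sketch below, with a hypothetical helper name `z_interval`, derives it with `qnorm()` and rebuilds the interval from the same summary statistics. Because `qnorm()` returns the unrounded value (about 1.6449), the endpoints agree with those above only to about three decimal places.

```r
# Hypothetical helper: normal-based confidence interval from summary statistics.
z_interval <- function(xbar, s, n, conf = 0.90) {
  z_star <- qnorm(1 - (1 - conf) / 2)  # about 1.645 when conf = 0.90
  margin <- z_star * s / sqrt(n)       # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

ci <- z_interval(xbar = 30.69, s = 4.31, n = 36)
print(round(ci, 2))
```

The same function applies to any interval of this form by changing `xbar`, `s`, `n`, or `conf`.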
\[Z = (118.2 - 100)/( 6.5/\sqrt{36})\]
( Z = (118.2 - 100)/( 6.5/sqrt(36) ) )
## [1] 16.8
Using a two-sided hypothesis test, the p-value associated with the Z-score of 16.8 is:
2*pnorm(-abs(Z))
## [1] 2.44044e-63
This p-value is much smaller than the significance level of 0.10. Thus, we reject the null hypothesis that the mothers’ average IQ is the same as the population average.
(ci_lower = 118.2 - 1.645 * 6.5/sqrt(36) )
## [1] 116.4179
(ci_upper = 118.2 + 1.645 * 6.5/sqrt(36) )
## [1] 119.9821
The sampling distribution is the probability distribution of a statistic calculated from repeated random samples of size \(n\) drawn from the population.
As \(n\) increases, the shape of the sampling distribution of the sample mean approaches a normal distribution. The mean of the sampling distribution is the population mean. The spread of the sampling distribution shrinks in proportion to \(1/\sqrt{n}\), so it approaches zero as the sample size goes to infinity.
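A small simulation can illustrate these properties. The sketch below makes several assumptions that are not part of the exercise: the population is a right-skewed exponential distribution with mean 1 and standard deviation 1, the seed and the sample sizes are arbitrary, and `sd_of_means` is a hypothetical helper name.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Standard deviation of `reps` simulated sample means for a given sample size,
# drawing from a right-skewed exponential population with mean 1 and sd 1.
sd_of_means <- function(n, reps = 5000) {
  sd(replicate(reps, mean(rexp(n, rate = 1))))
}

sizes     <- c(10, 40, 160)
simulated <- sapply(sizes, sd_of_means)
theory    <- 1 / sqrt(sizes)   # sigma / sqrt(n) with sigma = 1

print(round(cbind(n = sizes, simulated, theory), 3))
```

Each fourfold increase in \(n\) should roughly halve the simulated spread, in line with the \(1/\sqrt{n}\) rule.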
\[ Z(10500) = \frac{ 10500 - 9000}{ 1000} = 1.5 \]
1 - pnorm(1.5)
## [1] 0.0668072
(s = 1000/sqrt(15) )
## [1] 258.1989
(Z = ( 10500 - 9000)/ 258.2 )
## [1] 5.80945
(1-pnorm(Z) )
## [1] 3.13392e-09
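The two probabilities above differ only in the standard deviation that is used. As a sketch, the comparison can be written with the `mean`, `sd`, and `lower.tail` arguments of `pnorm()`, which avoids standardizing by hand. The variable names are illustrative, and the unrounded standard error is used, so the second probability differs slightly from the value obtained with 258.2.

```r
mu <- 9000; sigma <- 1000; n <- 15   # values used in the calculations above

# P(single observation > 10500)
p_single <- pnorm(10500, mean = mu, sd = sigma, lower.tail = FALSE)
# P(mean of n observations > 10500), using the standard error sigma / sqrt(n)
p_mean <- pnorm(10500, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)

print(c(single = p_single, mean_of_15 = p_mean))
```

Setting `lower.tail = FALSE` returns the upper tail directly, which is more accurate for very small probabilities than computing `1 - pnorm(...)`.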
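library(ggplot2)
df <- data.frame(pop = rnorm(300, 9000, 1000), samp = rnorm(300, 9000, 1000 / sqrt(15)))
# Overlay the population distribution (red) and the sampling distribution of the mean for n = 15 (blue)
ggplot(df) + geom_histogram(aes(x = pop), fill = "red", alpha = 0.2, bins = 25) + geom_histogram(aes(x = samp), fill = "blue", alpha = 0.3, bins = 25)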
Assuming the mistake did not affect the sample statistic itself but only the p-value computed from it, the p-value should decrease when \(n\) increases from 50 to 500. For example, if we are assessing the difference between a sample mean and a target population mean, the sampling distribution is centered at the same mean but has a smaller standard error with \(n=500\). The denominator of the Z-score therefore decreases, which gives a Z-score of larger magnitude and thus a lower p-value.
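A quick numerical check of this argument is sketched below. The inputs are hypothetical and chosen only for illustration (a sample mean 1 unit away from the null value and a standard deviation of 5); none of them come from the exercise, and `two_sided_p` is an assumed helper name.

```r
# Hypothetical inputs, chosen only for illustration: a sample mean 1 unit away
# from the null value, with standard deviation 5.
two_sided_p <- function(n, diff = 1, s = 5) {
  z <- diff / (s / sqrt(n))   # the standard error shrinks as n grows
  2 * pnorm(-abs(z))          # two-sided p-value
}

p50  <- two_sided_p(50)
p500 <- two_sided_p(500)
print(c(n50 = p50, n500 = p500))
```

With the difference and standard deviation held fixed, the only thing that changes between the two calls is \(n\), so any drop in the p-value is due to the smaller standard error.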