##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
The sample mean is the midpoint of the confidence interval. Since we know the lower and upper endpoints (65 and 77), \(\bar{x} = 71\).
( sample_mean = 0.5 * ( 77 + 65) )
## [1] 71
The margin of error ME is half the width of the confidence interval. Therefore, \(ME = 77 - \bar{x} = 77 - 71 = 6\).
The sample standard deviation s satisfies the equation:
\[ME = t_{\alpha/2}\frac{s}{\sqrt{n}} = t_{0.05}\frac{s}{\sqrt{25}}\]
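Here \(n=25\), so there are \(n-1 = 24\) degrees of freedom, and a 90 percent confidence level corresponds to a significance level of \(\alpha = 1 - 0.90 = 0.10\). Since the confidence interval is two-sided, each tail has area \(\alpha/2 = 0.05\), and the critical value is \(t_{\alpha/2} = 1.710882\), the absolute value of the lower-tail quantile computed below.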
qt(0.05, 24)
## [1] -1.710882
This implies \(ME = 6 = 1.710882 \frac{s}{5}\), which gives a sample standard deviation of \[ s = \frac{ME}{t_{\alpha/2}} \sqrt{n} = \frac{6}{1.710882} \cdot 5 = 17.53.\]
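The same steps can be collected in one place. This is a minimal sketch, assuming the interval endpoints 65 and 77 used above; the variable names are illustrative, and the critical value is taken from the upper tail with `qt(1 - alpha / 2, ...)` so that it is positive.

```r
# Recover the sample mean, margin of error, and sample standard deviation
# from the endpoints of a two-sided t-based confidence interval.
ci_lower   <- 65
ci_upper   <- 77
n_obs      <- 25
conf_level <- 0.90

sample_mean     <- (ci_lower + ci_upper) / 2   # midpoint of the interval
margin_of_error <- (ci_upper - ci_lower) / 2   # half the interval width

# Upper-tail critical value; qt(0.05, 24) has the same magnitude, opposite sign.
alpha  <- 1 - conf_level
t_star <- qt(1 - alpha / 2, df = n_obs - 1)

# Solve ME = t_star * s / sqrt(n) for s.
sample_sd <- margin_of_error * sqrt(n_obs) / t_star

round(c(mean = sample_mean, ME = margin_of_error, t_star = t_star, s = sample_sd), 2)
```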
\[ME = z \cdot SE = z\frac{s}{\sqrt{n}}\] Here \(ME=25\), a 90 percent confidence level requires \(z=1.6448\) (from qnorm for a two-sided interval), and the standard deviation is \(s=250\), so we get: \[25 = 1.6448 \frac{250}{\sqrt{n}}\]
(n = ( 1.6448 * 250 /25)^2 )
## [1] 270.5367
We see that \(n \geq 270.54\), so \(n=271\) is the smallest integer satisfying the requirement.
Luke’s sample size needs to be larger than Raina’s because he requires a higher confidence level for the same margin of error. A higher confidence level means a larger z-score, and the required sample size is proportional to the square of the z-score.
For Luke’s minimum required sample size, we obtain:
\[ n = \left(\frac{z s}{ME}\right)^2 = \left(2.575829 \cdot \frac{250}{25}\right)^2 \]
(n = (qnorm(0.005) * 250 / 25)^2)
## [1] 663.4897
Rounding up to the next integer, we see that n must be at least 664 students.
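Both sample-size calculations follow the same formula, so they can be wrapped in one function. The sketch below uses a hypothetical helper, `min_sample_size()`, which is not part of the original solution; it assumes a two-sided z-interval and that Luke's confidence level is 99 percent, which is what `qnorm(0.005)` implies.

```r
# Hypothetical helper: smallest n such that z * sigma / sqrt(n) <= margin_of_error.
min_sample_size <- function(conf_level, sigma, margin_of_error) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # positive two-sided critical value
  ceiling((z_star * sigma / margin_of_error)^2)
}

raina_n <- min_sample_size(0.90, sigma = 250, margin_of_error = 25)
luke_n  <- min_sample_size(0.99, sigma = 250, margin_of_error = 25)

c(raina = raina_n, luke = luke_n)
```

Using `ceiling()` rather than `round()` guarantees that the margin of error is not exceeded.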
summary(hsb2$read)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.00 44.00 50.00 52.23 60.00 76.00
summary(hsb2$write)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 45.75 54.00 52.77 60.00 67.00
summary(hsb2$read - hsb2$write)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -21.000 -7.000 0.000 -0.545 6.000 24.000
qqplot( hsb2$read, hsb2$write)
abline(0,1)
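In the above plot, many points fall near the line y = x, which indicates that reading and writing scores have similar distributions. Note that a Q-Q plot compares the sorted quantiles of the two variables, so by itself it does not show how each student's reading score relates to that student's writing score. Because both scores come from the same student, however, we expect them to be positively correlated rather than independent.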
\[H_{0}: \mu_{read} - \mu_{write} = 0\] \[H_{A}: \mu_{read} - \mu_{write} \neq 0\]
The hypotheses are stated in terms of the population means, since the sample means are already known. The differences also appear to come from a nearly normal distribution, as the histogram of the differences below shows.
hist( hsb2$read - hsb2$write, breaks=15)
t.test(hsb2$read - hsb2$write, mu =0, conf.level = .95 )
##
## One Sample t-test
##
## data: hsb2$read - hsb2$write
## t = -0.86731, df = 199, p-value = 0.3868
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -1.7841424 0.6941424
## sample estimates:
## mean of x
## -0.545
With a p-value of about 39% (0.3868), we cannot reject the null hypothesis at the 5% significance level. The data do not provide convincing evidence that the average reading and writing scores differ.
We may have made a Type II error: failing to reject the null hypothesis when the alternative is actually true. That would mean we concluded there is no evidence of a difference between average reading and writing scores when the averages really are different.
Yes. Since we did not reject the null hypothesis, I would expect the confidence interval for the average difference between reading and writing scores to contain 0 in its interior.
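The link between the test and the interval can be checked from the printed t-test output alone. This is a sketch that assumes only the reported mean difference, t statistic, and degrees of freedom; the variable names are illustrative.

```r
# Values copied from the printed t.test() output.
mean_diff <- -0.545
t_stat    <- -0.86731
df_t      <- 199

# The t statistic is mean_diff / SE, so the standard error can be backed out.
std_error <- mean_diff / t_stat

# 95% confidence interval: estimate +/- critical value * standard error.
t_star <- qt(0.975, df = df_t)
ci     <- mean_diff + c(-1, 1) * t_star * std_error

# Two-sided p-value for the same statistic.
p_two_sided <- 2 * pt(-abs(t_stat), df = df_t)

ci
p_two_sided
```

The interval contains 0 exactly when the two-sided p-value exceeds 0.05, which is the case here.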
We need to calculate the standard error of the difference in means of two samples, each of size \(n=26\).
n = 26
mean_auto = 16.12
sd_auto = 3.58
mean_manual = 19.85
sd_manual = 4.51
( standard_error = sqrt( ( sd_auto^2 / n) + (sd_manual^2 / n) ) )
## [1] 1.12927
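Finally, we calculate the T-score of the difference in means and the p-value, using the conservative \(n - 1 = 25\) degrees of freedom. The pt call below returns the one-tail probability, about 0.14%. Because the alternative hypothesis is two-sided, the p-value is twice that, about 0.29%. Since this is much less than 5%, we reject the null hypothesis and conclude the means are different.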
( t_score = ( mean_auto - mean_manual) / standard_error )
## [1] -3.30302
( p_value = pt( t_score, df = n - 1 ) )
## [1] 0.001441807
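Since `pt()` with a negative t-score returns only the lower-tail area, a two-sided p-value needs both tails. The sketch below recomputes the test from the same summary statistics; `p_one_sided` and `p_two_sided` are illustrative names.

```r
# Summary statistics from the exercise (26 cars in each group).
n_per_group <- 26
mean_auto   <- 16.12
sd_auto     <- 3.58
mean_manual <- 19.85
sd_manual   <- 4.51

standard_error <- sqrt(sd_auto^2 / n_per_group + sd_manual^2 / n_per_group)
t_score        <- (mean_auto - mean_manual) / standard_error

# Conservative degrees of freedom: min(n1 - 1, n2 - 1) = 25.
df_t <- n_per_group - 1

p_one_sided <- pt(-abs(t_score), df = df_t)   # area in one tail
p_two_sided <- 2 * p_one_sided                # area in both tails

c(t = t_score, one_sided = p_one_sided, two_sided = p_two_sided)
```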
The null hypothesis is that the mean number of hours worked is the same across all educational attainment groups. The alternative hypothesis is that at least one group mean differs from the others.
The conditions that need to be checked are:
The data within each group are nearly normal. This appears to be roughly true. However, the high school subgroup is right skewed: some high school cases work over 80 hours per week. On the other hand, many bachelor’s cases work less than 30 hours, which produces a left skew.
The variability across groups is about equal. The heights of the boxes in the boxplots show similar variation, with one exception: the bachelor’s box is somewhat narrower.
The assumption we have to make is that the variability of the bachelor’s group is close enough to that of the other groups.
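The equal-variability condition can also be checked numerically. The sketch below uses the five group standard deviations from the exercise's summary table and compares the largest to the smallest. The threshold of 2 is a common rule of thumb, not a formal test, and is an assumption here rather than part of the original solution.

```r
# Group standard deviations of hours worked, from the summary table.
group_sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)

# Ratio of the largest to the smallest standard deviation.
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio

# Rule of thumb (assumption): a ratio below about 2 suggests roughly equal spread.
sd_ratio < 2
```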
means = c( 38.67, 39.6 , 41.39, 42.55,40.85 )
deviations = c( 15.81, 14.97, 18.1, 13.62, 15.51 )
n = c( 121, 546, 97, 253, 155 )
df = data.frame( mu = means, stdev = deviations, numcases = n)
(n_cases = sum(n) )
## [1] 1172
k = 5
(df = k -1 )
## [1] 4
(dfResidual = n_cases - k )
## [1] 1167
Next, we recover the F-statistic from the p-value given in the table, using the built-in quantile function qf.
prob = .0682
( F_stat = qf( 1 - prob, df, dfResidual) )
## [1] 2.188931
Since we know that:
\[F=\frac{MSG}{MSE}\]
This implies we can solve for MSE:
( MSE = 501.54 / F_stat ) # From the table: Mean Sq.
## [1] 229.1255
MSG = 501.54
In turn, we can now solve for SSG and SST as well:
SSG = df * MSG
SSE = 267382 # from the table
SST = SSE + SSG
Lastly, we note that the total degrees of freedom is the sum of \(n-k\) and \(k-1\), which is \(n-1 = 1171\).
anova_df = data.frame( Anova = c("degree", "Residuals", "Total"),
Df = c( 4, 1167, 1171 ) ,
SumSq = c( SSG, 267382, SST) ,
MeanSq = c( MSG, MSE, NA) ,
FStat = c(F_stat, NA, NA) ,
Prob = c( 0.0682, NA, NA)
)
knitr::kable(anova_df)
| Anova | Df | SumSq | MeanSq | FStat | Prob |
|---|---|---|---|---|---|
| degree | 4 | 2006.16 | 501.5400 | 2.188931 | 0.0682 |
| Residuals | 1167 | 267382.00 | 229.1255 | NA | NA |
| Total | 1171 | 269388.16 | NA | NA | NA |
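As a consistency check, the residual mean square can also be computed directly as SSE divided by the residual degrees of freedom, without going through the rounded p-value. This sketch uses only the values given in the table (MSG, SSE, and the degrees of freedom); the `_direct` names are illustrative.

```r
# Values given in the ANOVA table.
df_group <- 4
df_resid <- 1167
MSG      <- 501.54
SSE      <- 267382

# Mean square error computed directly from its definition.
MSE_direct <- SSE / df_resid

# F statistic and its upper-tail p-value.
F_direct <- MSG / MSE_direct
p_direct <- pf(F_direct, df_group, df_resid, lower.tail = FALSE)

c(MSE = MSE_direct, F = F_direct, p = p_direct)
```

The direct MSE is very close to, but not exactly, the 229.1255 in the table above, since that value was backed out of a p-value rounded to four decimal places.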