Data 606 Assignment 5


5.6 Working backwards

The sample mean is the midpoint of the confidence interval. Since we know the lower and upper endpoints, \(\bar{x} = 71\).

( sample_mean = 0.5 * ( 77 + 65) )

## [1] 71

The margin of error ME is half the width of the confidence interval. Therefore, \(ME = 77 - \bar{x} = 77 - 71 = 6\).

The sample standard deviation s satisfies the equation:

\[ME = t_{\alpha/2}\frac{s}{\sqrt{n}} = t_{0.05}\frac{s}{\sqrt{25}}\]

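Here \(n=25\), so there are \(n - 1 = 24\) degrees of freedom, and the 90 percent confidence level gives a significance level of \(\alpha = 1 - 0.90 = 0.10\). Since the confidence interval is two-sided, each tail holds \(\alpha/2 = 0.05\), and the critical value is \(t_{\alpha/2} = 1.710882\). The lower-tail quantile computed below is negative; the critical value is its absolute value.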

qt(0.05, 24)

## [1] -1.710882

This implies \(ME = 6 = 1.710882 \frac{s}{5}\), which gives a sample standard deviation s of \[ s = \frac{ME}{t_{\alpha/2}} \sqrt{n} = \frac{6}{1.710882} \times 5 \approx 17.53.\]
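
The steps above can be collected into one small function. This is only a sketch: `ci_to_summary` and its argument names are hypothetical helpers introduced here for illustration, and it assumes the interval is a two-sided t-interval for a single mean.

```r
# Hypothetical helper (not part of the assignment): recover the sample mean,
# margin of error and sample standard deviation from a two-sided t-interval.
ci_to_summary <- function(lower, upper, n, conf_level = 0.90) {
  x_bar  <- (lower + upper) / 2                       # midpoint of the interval
  me     <- (upper - lower) / 2                       # half the interval width
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # positive critical value
  s      <- me * sqrt(n) / t_star                     # solve ME = t* s / sqrt(n)
  list(mean = x_bar, margin = me, t_star = t_star, sd = s)
}

res <- ci_to_summary(lower = 65, upper = 77, n = 25, conf_level = 0.90)
round(unlist(res), 4)
```

Using the upper-tail quantile `qt(0.95, 24)` avoids the sign flip that comes with the lower-tail quantile `qt(0.05, 24)`.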

5.14 SAT Scores

Raina can assume a normal distribution for estimation purposes. The key equation is:

\[ME = z \cdot SE = z\frac{s}{\sqrt{n}}\] With a margin of error \(ME=25\), a critical value \(z=1.6448\) for 90 percent confidence (from qnorm for a two-sided interval) and a standard deviation \(s=250\), we get: \[25 = 1.6448 \frac{250}{\sqrt{n}}\]

(n =  ( 1.6448 * 250 /25)^2 )

## [1] 270.5367

We see that \(n \geq 270.54\), which means \(n=271\) is the smallest integer satisfying the requirement.

Luke’s sample size needs to be bigger than Raina’s because he has a higher confidence requirement for the interval. This means the z-score is larger. The sample size is proportional to the square of the z-score.
For Luke’s minimum required sample size, we obtain:

\[ n = (\frac{z s}{ME})^2 = (2.575829 \frac{250}{25})^2 \]

(n = (qnorm(0.005) * 250 / 25)^2)

## [1] 663.4897

Rounding up to the next integer, we see Luke must sample at least 664 students.
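
Both sample sizes follow from the same formula, so they can be computed with one function. The sketch below assumes a two-sided interval based on the normal distribution; `required_n` is a hypothetical helper name, not something from the textbook or the DATA606 package.

```r
# Hypothetical helper: smallest n for which z* * sd / sqrt(n) <= margin.
required_n <- function(margin, sd, conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  ceiling((z_star * sd / margin)^2)          # always round up, never down
}

n_raina <- required_n(margin = 25, sd = 250, conf_level = 0.90)
n_luke  <- required_n(margin = 25, sd = 250, conf_level = 0.99)
c(raina = n_raina, luke = n_luke)
```

Rounding with `ceiling` rather than `round` matters here: rounding down would give a margin of error slightly larger than the target.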

5.20 High School and Beyond, Part I

There is no clear difference between the reading and writing scores. When we look at the summary statistics of reading minus writing, the distribution is fairly symmetric with a zero median difference and negative mean of -0.545 (i.e. reading < writing score). This is calculated below:

summary(hsb2$read)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   44.00   50.00   52.23   60.00   76.00

summary(hsb2$write)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.00   45.75   54.00   52.77   60.00   67.00

summary(hsb2$read - hsb2$write)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -21.000  -7.000   0.000  -0.545   6.000  24.000

Reading and writing scores of each student are not independent. Intuitively, if a student is good at reading, they should be good at writing.

plot( hsb2$read, hsb2$write)

abline(0,1)

In the above scatterplot of the paired scores, the points trend upward along the y = x line: students with higher reading scores tend to have higher writing scores. This indicates the scores are positively correlated.

The null and alternative hypotheses concern the population mean of the paired differences, \(\mu_{diff} = \mu_{read} - \mu_{write}\):

\[H_{0}: \mu_{diff} = 0\] \[H_{A}: \mu_{diff} \neq 0\]

The differences can reasonably be treated as independent because one student’s difference should be unrelated to another’s, and the sample of 200 is far less than 10% of the US high school senior population.

Also, the differences appear to come from a nearly normal distribution, as the histogram of the differences below shows.

hist( hsb2$read - hsb2$write, breaks=15)

Let’s run a t-test on the paired differences at the 5% significance level (95% confidence level).

t.test(hsb2$read - hsb2$write, mu =0, conf.level = .95 )

## 
##  One Sample t-test
## 
## data:  hsb2$read - hsb2$write
## t = -0.86731, df = 199, p-value = 0.3868
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -1.7841424  0.6941424
## sample estimates:
## mean of x 
##    -0.545

With a p-value of about 0.39, we cannot reject the null hypothesis at the 5% significance level. The data do not provide convincing evidence that the average reading and writing scores differ.

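We may have made a Type II error, which means failing to reject the null hypothesis when it is actually false. In that case we concluded there is no evidence of a difference between the average reading and writing scores when the averages really are different.
Yes. Since we failed to reject the null hypothesis at the 5% level, I would expect the 95% confidence interval for the average difference between reading and writing scores to contain 0, and the interval (-1.78, 0.69) reported above does.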
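
As a check on the link between the test and the interval, the confidence interval and p-value can be rebuilt from the numbers printed by `t.test` alone. This is a sketch that uses only the reported mean, t statistic and degrees of freedom; the variable names are chosen here for illustration, and small discrepancies from rounding of the printed values are expected.

```r
# Values copied from the t.test output above.
mean_diff <- -0.545
t_stat    <- -0.86731
df_diff   <- 199

se_diff <- mean_diff / t_stat          # since t = mean / SE under H0: mu = 0
t_star  <- qt(0.975, df = df_diff)     # critical value for a 95% interval

ci <- mean_diff + c(-1, 1) * t_star * se_diff   # point estimate +/- margin
p_two_sided <- 2 * pt(-abs(t_stat), df = df_diff)

round(ci, 4)
round(p_two_sided, 4)
```

The interval straddles 0 exactly when the two-sided p-value exceeds 0.05, which is why the two conclusions agree.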

5.32 Fuel efficiency

We need to calculate the standard error of the difference in means of two samples, each of size \(n=26\).

n = 26

mean_auto = 16.12
sd_auto = 3.58

mean_manual = 19.85
sd_manual = 4.51

( standard_error = sqrt( ( sd_auto^2 / n) + (sd_manual^2 / n) ) )

## [1] 1.12927

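Finally, we calculate the T-score of the difference in means and the p-value. Because the alternative hypothesis is two-sided (the means are different), we double the one-tail probability. Since the p-value of about 0.29% is much less than 5%, we strongly reject the null hypothesis and conclude the means are different.

( t_score = ( mean_auto - mean_manual) / standard_error )

## [1] -3.30302

# two-sided p-value: twice the tail area beyond |t|
( p_value = 2 * pt( -abs(t_score), df = n - 1 ) )

## [1] 0.002883614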
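
The same calculation can be wrapped in a reusable function that works from summary statistics. This is a sketch, not the textbook's code: `two_sample_t` is a hypothetical helper, it uses the conservative degrees of freedom \(\min(n_1, n_2) - 1\), and it reports a two-sided p-value by doubling the one-tail probability.

```r
# Hypothetical helper: two-sample t-test from means, SDs and sample sizes.
two_sample_t <- function(m1, s1, n1, m2, s2, n2) {
  se      <- sqrt(s1^2 / n1 + s2^2 / n2)      # SE of the difference in means
  t_stat  <- (m1 - m2) / se
  df_cons <- min(n1, n2) - 1                  # conservative degrees of freedom
  p_two   <- 2 * pt(-abs(t_stat), df = df_cons)
  list(se = se, t = t_stat, df = df_cons, p_value = p_two)
}

# Automatic vs. manual transmission, using the summary statistics above.
fuel <- two_sample_t(m1 = 16.12, s1 = 3.58, n1 = 26,
                     m2 = 19.85, s2 = 4.51, n2 = 26)
str(fuel)
```

A Welch approximation for the degrees of freedom would give a somewhat larger df and a slightly smaller p-value; the conservative choice is used here because it needs no extra formula.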

5.48 Work Hours and Education

The null hypothesis is that the mean hours worked is the same across all educational attainment groups. The alternative hypothesis is that at least one group mean differs from the others.
The conditions that need to be checked are:

Observations are independent within and across groups. This seems plausible and will be assumed.
The data within each group are nearly normal. This appears to be somewhat true. However, the high school subgroup has a right (positive) skew: some high school cases work over 80 hours per week. On the other hand, many bachelor’s cases work less than 30 hours, which is a left skew.
The variability across groups is about equal. The boxes in the boxplots have similar heights, which reflects similar variation. One exception is that the bachelor’s box is somewhat narrower.

The assumption we have to make is that the variability of the bachelor’s group is about equal to that of the other groups.

To calculate the ANOVA table, we first put the data into a data frame:

means      = c( 38.67, 39.6 , 41.39, 42.55,40.85 )
deviations = c( 15.81, 14.97, 18.1, 13.62, 15.51 )
n          = c( 121,   546,  97, 253,  155 )

# named group_stats so it is not overwritten by the degrees of freedom (df) below
group_stats = data.frame( mu = means,  stdev = deviations, numcases = n)

(n_cases = sum(n) )

## [1] 1172

k = 5

(df = k -1 )

## [1] 4

(dfResidual = n_cases - k )

## [1] 1167

Next, we recover the F-statistic from the p-value given in the table, using the built-in quantile function qf.

prob = .0682

( F_stat = qf( 1 - prob, df, dfResidual) )

## [1] 2.188931

Since we know that:

\[F=\frac{MSG}{MSE}\]

This implies we can solve for MSE:

( MSE = 501.54 / F_stat )  # From the table: Mean Sq.

## [1] 229.1255

MSG = 501.54

In turn, we can now solve for SSG and SST as well:

SSG = df * MSG

SSE = 267382   # from the table

SST = SSE + SSG

Lastly, we note the total degrees of freedom is the sum of \(n-k\) and \(k-1\), which is \(n-1 = 1171\), where \(n = 1172\) is the total number of cases.

anova_df = data.frame( Anova = c("degree", "Residuals", "Total"),
         Df = c( 4, 1167, 1171 ) ,
         SumSq = c( SSG, 267382,  SST) ,
         MeanSq = c( MSG, MSE, NA) ,  
         FStat = c(F_stat, NA, NA) ,
         Prob = c( 0.0682, NA, NA)
         )

knitr::kable(anova_df)

Anova	Df	SumSq	MeanSq	FStat	Prob
degree	4	2006.16	501.5400	2.188931	0.0682
Residuals	1167	267382.00	229.1255	NA	NA
Total	1171	269388.16	NA	NA	NA

We conclude that the p-value of 0.0682 is larger than 0.05, so the result is not significant at the 5% level. Therefore, we cannot reject the null hypothesis that all the group means are equal.
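
As a cross-check, the ANOVA quantities can also be approximated directly from the group means, standard deviations and sizes listed earlier. This is a sketch under the assumption that those rounded summary statistics are all we have, so the results only roughly match the table; the variable names are chosen here for illustration.

```r
# Group summary statistics from the exercise (rounded values).
grp_mean <- c(38.67, 39.60, 41.39, 42.55, 40.85)
grp_sd   <- c(15.81, 14.97, 18.10, 13.62, 15.51)
grp_n    <- c(121, 546, 97, 253, 155)

k          <- length(grp_n)
n_total    <- sum(grp_n)
grand_mean <- sum(grp_n * grp_mean) / n_total   # weighted overall mean

ssg_check <- sum(grp_n * (grp_mean - grand_mean)^2)  # between-group sum of squares
sse_check <- sum((grp_n - 1) * grp_sd^2)             # within-group sum of squares

f_check <- (ssg_check / (k - 1)) / (sse_check / (n_total - k))
p_check <- pf(f_check, k - 1, n_total - k, lower.tail = FALSE)

round(c(SSG = ssg_check, SSE = sse_check, F = f_check, p = p_check), 4)
```

The values land close to the table's SSG of 2006.16, SSE of 267382 and F of about 2.19, with the small gaps attributable to rounding in the published means and standard deviations.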