Working backwards, Part II. A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.
Answer 5.6
The sample mean is the midpoint of the confidence interval: for an interval (x1, x2), it is (x2 + x1)/2.
n <- 25
x1 <- 65
x2 <- 77
SMean <- (x2 + x1) / 2
SMean
## [1] 71
The sample mean is 71.
The margin of error is half the width of the confidence interval: for an interval (x1, x2), it is (x2 − x1)/2.
n <- 25
x1 <- 65
x2 <- 77
ME <- (x2 - x1) / 2
ME
## [1] 6
The margin of error is ME = 6.
To calculate the sample standard deviation, we use $ME = t^{*} \cdot SE$, obtaining $t^{*}$ from the qt() function with df = 25 - 1 = 24.
df <- 25 - 1
p <- 0.9
p_2tails <- p + (1 - p)/2
t_val <- qt(p_2tails, df)
# Since ME = t * SE
SE <- ME / t_val
# Since SE = sd/sqrt(n)
sd <- SE * sqrt(n)
sd
## [1] 17.53481
The sample standard deviation is sd ≈ 17.53.
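The three steps above can be collected into one helper. The following is only a sketch: the function name `ci_to_summary` and its argument names are hypothetical, introduced here for illustration, and it assumes a t-based interval for a single mean.

```r
# Hypothetical helper: recover the sample mean, margin of error, and sample
# standard deviation from a t-based confidence interval for one mean.
ci_to_summary <- function(lower, upper, n, conf_level) {
  x_bar  <- (upper + lower) / 2                       # midpoint of the interval
  me     <- (upper - lower) / 2                       # half-width of the interval
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # critical t value
  s      <- me / t_star * sqrt(n)                     # from ME = t* * s / sqrt(n)
  list(mean = x_bar, margin_of_error = me, sd = s)
}

res <- ci_to_summary(lower = 65, upper = 77, n = 25, conf_level = 0.90)
res$mean
res$margin_of_error
res$sd
```

Called with the values from this exercise, it should reproduce the mean of 71, the margin of error of 6, and the standard deviation of about 17.53.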
SAT scores. SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.
Answer 5.14
Answer a)
For this, I use $ME = z \cdot SE$, and since $SE = \frac{sd}{\sqrt{n}}$, we have $ME = z \cdot \frac{sd}{\sqrt{n}}$. Rearranging gives $\frac{ME}{z} = \frac{sd}{\sqrt{n}}$, and therefore:
$n = \left(\frac{z \cdot sd}{ME}\right)^2$
z <- 1.65 # critical value for a 90% confidence interval (1.645 rounded to 1.65)
ME <- 25
sd <- 250
n <- ((z * sd) / ME ) ^ 2
n
## [1] 272.25
Rounding up, the sample size should be at least 273 students.
Answer b)
Luke's sample should be larger: a 99% confidence interval requires a larger critical value z than a 90% interval, and the required sample size grows with the square of z times the standard deviation.
Answer c)
z <- 2.575 # critical value for a 99% confidence interval
ME <- 25
sd <- 250
n <- ((z * sd) / ME ) ^ 2
n
## [1] 663.0625
Rounding up, the sample size should be at least 664 students.
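As a cross-check, the same formula can be evaluated with the unrounded critical values from qnorm(). This is a sketch only; `sample_size_for_me` is a hypothetical helper name, not something from the textbook.

```r
# Hypothetical helper: smallest n satisfying z* * sigma / sqrt(n) <= me.
sample_size_for_me <- function(sigma, me, conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # unrounded normal critical value
  ceiling((z_star * sigma / me)^2)           # round up to a whole student
}

n_raina <- sample_size_for_me(sigma = 250, me = 25, conf_level = 0.90)
n_luke  <- sample_size_for_me(sigma = 250, me = 25, conf_level = 0.99)
c(raina = n_raina, luke = n_luke)
```

With the unrounded 90% critical value (about 1.645 rather than 1.65), Raina's requirement comes out to 271 instead of 273, while Luke's stays at 664. The small gap for Raina is purely a rounding effect in z.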
High School and Beyond, Part I. The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.
knitr::include_graphics("/Users/priyashaji/Documents/cuny msds/Spring'19/data 606/homeworks/homework_5/Screen Shot 2019-03-24 at 4.13.23 PM.png")
knitr::include_graphics("/Users/priyashaji/Documents/cuny msds/Spring'19/data 606/homeworks/homework_5/Screen Shot 2019-03-24 at 4.13.32 PM.png")
Answer 5.20
Answer a)
No clear difference is visible between the average reading and writing scores. The distribution of differences is fairly normal and centered near zero, though there seems to be a slight skew to the right.
Answer b)
The scores of different students are independent of one another, but the reading and writing scores of the same student are not independent of each other.
Answer c)
The question refers to the difference in the average scores of students rather than the average difference in scores, so the hypotheses could be as follows:
H_0: The difference between the average reading and writing scores equals zero. That is: $\mu_r - \mu_w = 0$
H_A: The difference between the average reading and writing scores does NOT equal zero. That is: $\mu_r - \mu_w \neq 0$
Answer d)
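Independence of observations: The reading and writing scores are paired within each student, so the two sets of scores are not independent of each other. A paired analysis instead requires the differences to be independent across students, which is reasonable for a simple random sample.
Observations come from a nearly normal distribution: The box plots provided in the text suggest the data are reasonably normally distributed, with no outliers.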
Answer e)
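The hypotheses for the average difference test are:
H_0: The average difference between reading and writing scores equals zero. That is: $\mu_{diff} = 0$
H_A: The average difference between reading and writing scores does NOT equal zero. That is: $\mu_{diff} \neq 0$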
The paired data is presumably from less than 10% of the population of senior high schoolers, and from a simple random sample. We noted that the differences are nearly normally distributed, so the conditions are met in order to apply the t-distribution.
sd_Diff <- 8.887
mu_Dif <- -0.545
n <- 200
SE_Diff <- sd_Diff / sqrt(n)
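# Compute T statistic
t_value <- (mu_Dif - 0) / SE_Diff
df <- n - 1
# t_value is negative, so this is the lower-tail probability;
# the two-sided p-value is twice this
p <- pt(t_value, df = df)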
p
## [1] 0.1934182
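The value above is a one-tail probability; doubling it for the two-sided alternative gives a p-value of about 0.387. Since the p-value is not less than 0.05, we fail to reject the null hypothesis: there is not convincing evidence of a difference between students' reading and writing exam scores.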
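Because the alternative hypothesis is two-sided, the p-value should count both tails. A minimal sketch of that calculation, using the summary statistics of the differences quoted above (the variable names are chosen here for illustration):

```r
# Summary statistics of the paired differences, as quoted in this answer
diff_mean <- -0.545
diff_sd   <- 8.887
n_pairs   <- 200

se_paired <- diff_sd / sqrt(n_pairs)       # standard error of the mean difference
t_paired  <- (diff_mean - 0) / se_paired   # test statistic under H_0
# Two-sided p-value: probability in both tails beyond |t|
p_paired  <- 2 * pt(-abs(t_paired), df = n_pairs - 1)
p_paired
```

The result is twice the one-tail probability and is still well above 0.05, so the decision does not change.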
Note that the reading and writing scores are not independent within a student, which is why the test is carried out on the paired differences rather than as a comparison of two independent samples.
Answer f)
Type I error: Rejecting the null hypothesis when it is actually true.
Type II error: Failing to reject the null hypothesis when the alternative hypothesis is actually true.
In this case, we may have made a Type II error by failing to reject $H_0$; that is, we might have missed a real difference in the average student reading and writing exam scores.
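How likely a Type II error is depends on the true difference, which is unknown. The sketch below uses `power.t.test()` from base R's stats package and assumes a purely hypothetical true mean difference of 1 point; that value is an illustration, not something taken from the data.

```r
# Assumption: a hypothetical true mean difference of 1 point (not from the data)
assumed_delta <- 1

pw <- power.t.test(n = 200,                 # number of pairs
                   delta = assumed_delta,   # assumed true mean difference
                   sd = 8.887,              # standard deviation of the differences
                   sig.level = 0.05,
                   type = "paired",
                   alternative = "two.sided")

# Probability of a Type II error under the assumed difference
type2_prob <- 1 - pw$power
type2_prob
```

A smaller assumed difference would make a Type II error more likely, and a larger one less likely.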
Answer g)
Yes, I would expect a confidence interval for the average difference between reading and writing scores to include 0, because we failed to reject the null hypothesis.
When the confidence interval includes 0 for this kind of hypothesis test, 0 is a plausible value for the difference, so the data do not point to a difference in either direction.
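This can be checked directly. The sketch below assumes a 95% confidence level, chosen to match the 0.05 significance level used in the test, and reuses the summary statistics of the differences.

```r
# Summary statistics of the paired differences, as quoted in part e)
d_mean <- -0.545
d_sd   <- 8.887
d_n    <- 200

t_star_95 <- qt(0.975, df = d_n - 1)          # critical value for 95% confidence
me_95     <- t_star_95 * d_sd / sqrt(d_n)     # margin of error
ci_95     <- c(lower = d_mean - me_95, upper = d_mean + me_95)
ci_95
```

The interval is roughly (-1.78, 0.69), which contains 0, in agreement with the test result.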
Fuel efficiency of manual and automatic cars, Part I. Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel e
knitr::include_graphics("/Users/priyashaji/Documents/cuny msds/Spring'19/data 606/homeworks/homework_5/Screen Shot 2019-03-24 at 4.39.55 PM.png")
Answer 5.32
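The hypotheses for this test are as follows:
H_0: The average fuel efficiency of automatic and manual cars is the same. That is: $\mu_a - \mu_m = 0$
H_A: The average fuel efficiency of automatic and manual cars is NOT the same. That is: $\mu_a - \mu_m \neq 0$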
From the text we have as follows:
n <- 26
# Automatic
mu_a <- 16.12
sd_a <- 3.58
# Manual
mu_m <- 19.85
sd_m <- 4.51
# difference in sample means
mu_Diff <- mu_a - mu_m
# standard error of this point estimate
SE_Diff <- ( (sd_a ^ 2 / n) + ( sd_m ^ 2 / n) ) ^ 0.5
t_val <- (mu_Diff - 0) / SE_Diff
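# conservative degrees of freedom: min(n1 - 1, n2 - 1)
df <- n - 1
# t_val is negative, so this is the lower-tail probability;
# the two-sided p-value is twice this
p <- pt(t_val, df = df)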
p
## [1] 0.001441807
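Doubling this tail probability for the two-sided alternative gives a p-value of about 0.0029. Since the p-value is less than 0.05, we reject the null hypothesis H0 and conclude that there is strong evidence of a difference in average fuel efficiency between manual and automatic transmissions.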
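For completeness, here is a sketch of the two-sided version of this test, built from the same summary statistics. The variable names are chosen for illustration, and the degrees of freedom follow the conservative min(n1 - 1, n2 - 1) rule used above.

```r
# Summary statistics from the text (26 cars per group)
n_each      <- 26
mean_auto   <- 16.12
sd_auto     <- 3.58
mean_manual <- 19.85
sd_manual   <- 4.51

# Standard error of the difference in sample means
se_fuel <- sqrt(sd_auto^2 / n_each + sd_manual^2 / n_each)
t_fuel  <- (mean_auto - mean_manual - 0) / se_fuel

df_fuel <- n_each - 1                               # conservative df
p_fuel  <- 2 * pt(-abs(t_fuel), df = df_fuel)       # both tails
p_fuel
```

The two-sided p-value is twice the tail probability printed earlier and remains far below 0.05.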
Work hours and education. The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.
knitr::include_graphics("/Users/priyashaji/Documents/cuny msds/Spring'19/data 606/homeworks/homework_5/Screen Shot 2019-03-24 at 4.44.53 PM.png")
Answer 5.48
Answer a)
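The hypotheses for this ANOVA test follow:
H_0: The average number of hours worked is the same across all five educational attainment groups. That is: $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$
H_A: At least one group's average number of hours worked differs from the others.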
Answer b)
The observations are independent within and across groups: I will assume independence within and across the groups based on the nature of the provided data.
The data within each group are nearly normal: The box plots do not fully support nearly normal data within each group. Every group has outliers, although some groups seem to follow a roughly normal distribution.
The variability across the groups is about equal: Judging by the standard deviations, the variability appears broadly similar across the groups.
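One quick way to put a number on the equal-variability condition is to compare the largest and smallest group standard deviations. The threshold of about 2 in the comment is an informal rule of thumb assumed here, not something stated in the text.

```r
# Group standard deviations of hours worked, from the summary table
group_sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)

# Ratio of the largest to the smallest standard deviation;
# a ratio well below about 2 is often taken as "about equal" (informal rule)
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio
```

The ratio is about 1.33, so the spreads are of similar size.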
knitr::include_graphics("/Users/priyashaji/Documents/cuny msds/Spring'19/data 606/homeworks/homework_5/Screen Shot 2019-03-24 at 4.46.54 PM.png")
Answer c)
mu <- c(38.67, 39.6, 41.39, 42.55, 40.85)
sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)
n <- c(121, 546, 97, 253, 155)
data_table <- data.frame (mu, sd, n)
n <- sum(data_table$n)
k <- length(data_table$mu)
# Finding degrees of freedom
df <- k - 1
dfResidual <- n - k
# Using the qf function on the Pr(>F) to get the F-statistic:
Prf <- 0.0682
F_statistic <- qf( 1 - Prf, df , dfResidual)
# F-statistic = MSG/MSE
MSG <- 501.54
MSE <- MSG / F_statistic
# MSG = 1 / df * SSG
SSG <- df * MSG
SSE <- 267382
# SST = SSG + SSE, and df_Total = df + dfResidual
SST <- SSG + SSE
dft <- df + dfResidual
            Df     Sum Sq  Mean Sq   F value  Pr(>F)
degree       4    2006.16   501.54  2.188984  0.0682
Residuals 1167  267382.00   229.12
Total     1171  269388.16
Answer d)
Since the p-value of 0.0682 is greater than 0.05, we fail to reject the null hypothesis: the data do not provide convincing evidence of a difference in average hours worked between the groups.
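The completed table can be cross-checked in the forward direction, going from the sums of squares to the p-value with pf() instead of working backwards with qf(). This is a sketch with illustrative variable names, using only the values from the table.

```r
# Values taken from the completed ANOVA table
ssg      <- 2006.16   # sum of squares between groups
sse      <- 267382    # sum of squares of the residuals
df_group <- 4         # k - 1
df_resid <- 1167      # n - k

msg    <- ssg / df_group   # mean square between groups
mse    <- sse / df_resid   # mean square error
f_stat <- msg / mse        # F statistic
# Upper-tail probability of the F distribution gives Pr(>F)
p_anova <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)
c(MSG = msg, MSE = mse, F = f_stat, p = p_anova)
```

The recomputed values should agree with the table to rounding: MSG of 501.54, MSE of about 229.12, F of about 2.189, and a p-value of about 0.0682.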