Stats 606 HW 5


5.6 Working backwards

A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

# the sample mean is the midpoint of the interval

midpoint = (65 + 77)/2

paste("The sample mean is the midpoint of the two CI bounds: ", midpoint)

[1] "The sample mean is the midpoint of the two CI bounds:  71"

# with df = 25 - 1 = 24, we use the critical t value of 1.71 from the book or the qt function
# (the variable is named z, but it holds a t score because the population SD is unknown)
z <- 1.71

round(abs(qt(0.10/2,24)),2)

[1] 1.71

paste("We are using a t score of 1.71 with 24 degrees of freedom")

[1] "We are using a t score of 1.71 with 24 degrees of freedom"

# SE = s / sqrt(n)

# Using one of the values

# 71 + z*SE = 77
# z*SE = 6


SE <- 6/z
paste0("The SE is ", round(SE,2))

[1] "The SE is 3.51"

paste0("The margin of error is SE * t_score: ", round(SE*z))

[1] "The margin of error is SE * t_score: 6"

# Using the SE formula: SE = s / sqrt(n)
n <- 25
# SE = sd / sqrt(n)
# sd = SE * sqrt(n)
sd <- SE * sqrt(n)
paste("the sample standard deviation is: ", round(sd,4))

[1] "the sample standard deviation is:  17.5439"

5.14 SAT scores

SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

Raina wants to use a 90% confidence interval. How large a sample should she collect?

# Given values
sd <- 250
me <- 25
# upper-tail area for a two-sided 90% interval (alpha/2)
CI <- .10/2
# the margin of error must satisfy z * SE <= 25,
# where z is the critical value for a 90% two-sided confidence interval
z <- qnorm(CI, lower.tail = FALSE)

SE = me / z
SE

[1] 15.19892

# SE = 250 / sqrt(n)
# n = (250 / SE)^2, rounded up so the margin of error does not exceed 25
n <- (sd/SE)^2
paste("The sample size should be at least", ceiling(n))

[1] "The sample size should be at least 271"

Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

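Luke's sample should be larger. A 99% confidence level uses a larger critical value than a 90% level, so to keep the margin of error at 25 points the standard error must be smaller, which requires a larger sample. The calculation below shows what the sample size should be.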

Calculate the minimum required sample size for Luke.

CI <- .01/2
z <- qnorm(CI, lower.tail = FALSE)
z

[1] 2.575829

SE <- 25/z
SE

[1] 9.705612

# round up: rounding down would leave the margin of error slightly above 25
n<- (250/SE)^2
paste("The sample size should be at least", ceiling(n))

[1] "The sample size should be at least 664"

5.20 High School and Beyond, Part I

The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

Is there a clear difference in the average reading and writing scores?

The writing scores appear slightly higher than the reading scores, but the difference is not clear: the histogram of differences is approximately normal and centered close to 0. The IQR of the reading scores is much wider than that of the writing scores.

Are the reading and writing scores of each student independent of each other?

No. Each student has both a reading and a writing score, and reading and writing ability tend to be fairly correlated, so the two scores of a given student are not independent (the data are paired). The students themselves can be treated as independent of one another: they are a simple random sample, and 200 students is less than 10% of the population as long as the population exceeds 2000 students.

Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

H0: mu_diff = 0. The average difference between students' reading and writing scores is 0.
H1: mu_diff != 0. There is a difference between students' average reading and writing scores.

Check the conditions required to complete this test.

The conditions are:

1. Independence: the students are a random sample, and the sample size must be less than 10% of the population.
2. Normality: the differences are taken from a fairly normal distribution. Here, we will assume this is fairly normal.

The average observed difference in scores is x̄_read-write = -0.545, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

No. The two-sided p-value is about 0.39, well above 0.05, so we do not reject the null hypothesis. The data do not provide convincing evidence of a difference between the average reading and writing scores.

x_bar_diff <- -0.545
sd_diff <- 8.887
n <- 200
# standard error of the mean difference: s_diff / sqrt(n)
SE <- sd_diff / sqrt(n)
t <- abs((x_bar_diff - 0)/SE)
paste("The t value is: ", round(t,4))

[1] "The t value is:  0.8673"

paste("The two-sided p-value with 199 degrees of freedom is: ", round(2*pt(t,199, lower.tail = FALSE),2))

[1] "The two-sided p-value with 199 degrees of freedom is:  0.39"

What type of error might we have made? Explain what the error means in the context of the application.

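We could have made a Type 2 error: failing to reject the null hypothesis when it is actually false. In this context, that would mean concluding there is no evidence of a difference between the average reading and writing scores when a difference does exist.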

Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

Yes. Because we did not reject the null hypothesis that the mean difference is 0, a confidence interval at the corresponding confidence level would be expected to include 0.

5.32 Fuel efficiency of manual and automatic cars

Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

We will use the following hypotheses:

H0: The difference in average fuel efficiency = 0
H1: The difference in average fuel efficiency != 0

mileage <- matrix(c(16.12, 19.85, 3.58, 4.41, 26, 26), nrow = 3, ncol = 2, byrow = TRUE)

colnames(mileage) <- c("Automatic", "Manual")
rownames(mileage) <- c("Mean", "SD", "n")
mileage

     Automatic Manual
Mean     16.12  19.85
SD        3.58   4.41
n        26.00  26.00

mean_diff<- mileage[1,1] - mileage[1,2]
paste("The mean difference is: ", mean_diff)

[1] "The mean difference is:  -3.73"

SE <- sqrt((mileage[2,1]^2/mileage[3,1])+(mileage[2,2]^2/mileage[3,2]))
paste("The SE is: ", round(SE, 4))

[1] "The SE is:  1.114"

t <- mean_diff/SE
paste("The t value is: ", round(abs(t),4))

[1] "The t value is:  3.3484"

p_times2 <- round(2*pt(abs(t), df=mileage[3,1]-1, lower.tail = FALSE),4)
paste("The p-value is: ",p_times2)

[1] "The p-value is:  0.0026"

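Since the p-value (0.0026) is less than 0.05, we reject the null hypothesis. The data provide strong evidence of a difference in average city mileage between cars with manual and automatic transmissions.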

5.48 Work hours and education

The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

H0: The average hours worked is the same for all educational attainment levels
H1: The average hours worked differs for at least one pair of educational attainment levels

Check conditions and describe any assumptions you must make to proceed with the test.

I am assuming the following conditions:

1. Independence of the observations within and between groups
2. The data are approximately normal within each group
3. The variance is constant between groups, or fairly equal.

Below is part of the output associated with this test. Fill in the empty cells.

# group degrees of freedom: k - 1 (dfG)
degree <- (5 -1)
paste("The group degrees of freedom are: ", degree)

[1] "The group degrees of freedom are:  4"

# residual degrees of freedom: n - k (dfE)
residuals <- 1172 - 5 
paste("The residual degrees of freedom are: ", residuals)

[1] "The residual degrees of freedom are:  1167"

# total degrees of freedom: n - 1 (dfT)
dft <- residuals + degree
paste("The total degrees of freedom are: ",dft)

[1] "The total degrees of freedom are:  1171"

# SSG (Sum Sq for groups): sum of n_i * (group mean - overall mean)^2
x <- 40.45
SSG <- (121*(38.67-x)^2)+(546*(39.6-x)^2)+(97*(41.39-x)^2)+(253*(42.55-x)^2)+(155*(40.85-x)^2)
paste("The group Sum Sq is: ", round(SSG,2))

[1] "The group Sum Sq is:  2004.1"

# Total Sum Sq: SST = SSG + SSE, with SSE = 267382
paste("The total Sum Sq is: ", round(SSG + 267382,2))

[1] "The total Sum Sq is:  269386.1"

# mean square for groups: MSG = SSG / dfG
MSG <- SSG / degree
paste("The MSG is: ", round(MSG,2))

[1] "The MSG is:  501.03"

# mean square error: MSE = SSE / dfE
SSE <- 267382
MSE<-SSE/residuals
paste("MSE is: ", round(MSE,2))

[1] "MSE is:  229.12"

f.value <- MSG/MSE
paste("The F Value is: ",round(f.value,2))

[1] "The F Value is:  2.19"

What is the conclusion of the test?

Using a significance level of 0.05, the p-value is greater than 0.05. Therefore, we do not reject the null hypothesis: the variation between the group means can reasonably be attributed to chance.

---
title: "Stats 606 HW 5"
author: "David Apolinar"
date: "3/24/2019"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    source_code: embed
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Row {.tabset .tabset-fade}
-------------------------------------

### 5.6 Working backwards

A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.


```{r 5.6}
# the sample mean is the midpoint of the interval

midpoint = (65 + 77)/2

paste("The sample mean is the midpoint of the two CI bounds: ", midpoint)

# with df = 25 - 1 = 24, we use the critical t value of 1.71 from the book or the qt function
# (the variable is named z, but it holds a t score because the population SD is unknown)
z <- 1.71

round(abs(qt(0.10/2,24)),2)
paste("We are using a t score of 1.71 with 24 degrees of freedom")

# SE = s / sqrt(n)

# Using the upper bound of the interval

# 71 + z*SE = 77
# z*SE = 6


SE <- 6/z
paste0("The SE is ", round(SE,2))

paste0("The margin of error is SE * t_score: ", round(SE*z))
# Using the SE formula: SE = s / sqrt(n)
n <- 25
# SE = sd / sqrt(n)
# sd = SE * sqrt(n)
sd <- SE * sqrt(n)
paste("the sample standard deviation is: ", round(sd,4))

```
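
As a cross-check, the same quantities can be recovered from the interval bounds without rounding the critical value. This is only a minimal sketch: the helper name `ci_to_summary` and its arguments are hypothetical (not part of the original solution), and it assumes the interval has the form sample mean ± t* × s / sqrt(n).

```r
# Hypothetical helper: recover the sample mean, margin of error, and sample SD
# from a t-based confidence interval (lower, upper) built from n observations.
ci_to_summary <- function(lower, upper, n, conf = 0.90) {
  x_bar  <- (lower + upper) / 2                  # midpoint of the interval
  me     <- (upper - lower) / 2                  # half-width = margin of error
  t_star <- qt(1 - (1 - conf) / 2, df = n - 1)   # critical t value
  se     <- me / t_star                          # because ME = t* x SE
  s      <- se * sqrt(n)                         # because SE = s / sqrt(n)
  list(mean = x_bar, margin_of_error = me, t_star = t_star, se = se, sd = s)
}

res <- ci_to_summary(65, 77, n = 25, conf = 0.90)
res$mean              # 71
res$margin_of_error   # 6
round(res$sd, 2)      # sample standard deviation
```

Because `qt()` returns an unrounded critical value slightly above 1.71, the standard deviation from this sketch comes out marginally smaller than the figure based on the rounded table value.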

### 5.14 SAT scores

SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

(a) Raina wants to use a 90% confidence interval. How large a sample should she collect?
```{r}
# Given values
sd <- 250
me <- 25
# upper-tail area for a two-sided 90% interval (alpha/2)
CI <- .10/2
# the margin of error must satisfy z * SE <= 25,
# where z is the critical value for a 90% two-sided confidence interval
z <- qnorm(CI, lower.tail = FALSE)

SE = me / z
SE
# SE = 250 / sqrt(n)
# n = (250 / SE)^2, rounded up so the margin of error does not exceed 25
n <- (sd/SE)^2
paste("The sample size should be at least", ceiling(n))
```

(b) Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

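Luke's sample should be larger. A 99% confidence level uses a larger critical value than a 90% level, so to keep the margin of error at 25 points the standard error must be smaller, which requires a larger sample. The calculation below shows what the sample size should be.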

(c) Calculate the minimum required sample size for Luke.

```{r c}
CI <- .01/2
z <- qnorm(CI, lower.tail = FALSE)
z
SE <- 25/z
SE
# round up: rounding down would leave the margin of error slightly above 25
n<- (250/SE)^2
paste("The sample size should be at least", ceiling(n))
```
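
The two sample-size calculations follow the same pattern, so they can be wrapped in one function. The sketch below is illustrative only: `min_sample_size` is a hypothetical helper name, and it assumes the population standard deviation (250) is known so that a normal critical value applies. The result is rounded up with `ceiling()` because the margin of error may not exceed the target.

```r
# Hypothetical helper: smallest n such that z* x sigma / sqrt(n) <= me
min_sample_size <- function(sigma, me, conf) {
  z_star <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  ceiling((z_star * sigma / me)^2)      # always round up to a whole observation
}

n_raina <- min_sample_size(sigma = 250, me = 25, conf = 0.90)
n_luke  <- min_sample_size(sigma = 250, me = 25, conf = 0.99)
c(Raina = n_raina, Luke = n_luke)

# check: the achieved margins of error should not exceed 25 points
me_raina <- qnorm(0.95)  * 250 / sqrt(n_raina)
me_luke  <- qnorm(0.995) * 250 / sqrt(n_luke)
c(Raina = me_raina, Luke = me_luke)
```

Rounding up rather than to the nearest integer matters for Luke: the unrounded value is about 663.49, so 663 observations would leave the margin of error just above 25 points, while 664 satisfies the requirement.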

### 5.20 High School and Beyond, Part I

The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

(a) Is there a clear difference in the average reading and writing scores?

The writing scores appear slightly higher than the reading scores, but the difference is not clear: the histogram of differences is approximately normal and centered close to 0. The IQR of the reading scores is much wider than that of the writing scores.

(b) Are the reading and writing scores of each student independent of each other?

No. Each student has both a reading and a writing score, and reading and writing ability tend to be fairly correlated, so the two scores of a given student are not independent (the data are paired). The students themselves can be treated as independent of one another: they are a simple random sample, and 200 students is less than 10% of the population as long as the population exceeds 2000 students.

(c) Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

H0: mu_diff = 0. The average difference between students' reading and writing scores is 0.

H1: mu_diff != 0. There is a difference between students' average reading and writing scores.

(d) Check the conditions required to complete this test.

The conditions are:

  1. Independence: the students are a random sample, and the sample size must be less than 10% of the population.
  2. Normality: the differences are taken from a fairly normal distribution. Here, we will assume this is fairly normal.

(e) The average observed difference in scores is x̄_read-write = -0.545, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

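No. The two-sided p-value is about 0.39, well above 0.05, so we do not reject the null hypothesis. The data do not provide convincing evidence of a difference between the average reading and writing scores.

```{r}
x_bar_diff <- -0.545
sd_diff <- 8.887
n <- 200
# standard error of the mean difference: s_diff / sqrt(n)
SE <- sd_diff / sqrt(n)
t <- abs((x_bar_diff - 0)/SE)
paste("The t value is: ", round(t,4))
paste("The two-sided p-value with 199 degrees of freedom is: ", round(2*pt(t,199, lower.tail = FALSE),2))
```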

(f) What type of error might we have made? Explain what the error means in the context of the application.

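We could have made a Type 2 error: failing to reject the null hypothesis when it is actually false. In this context, that would mean concluding there is no evidence of a difference between the average reading and writing scores when a difference does exist.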

(g) Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

Yes. Because we did not reject the null hypothesis that the mean difference is 0, a confidence interval at the corresponding confidence level would be expected to include 0.
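
The claim about the interval can be checked directly from the summary statistics given in part (e). The sketch below is a minimal illustration; the 95% confidence level is an assumption (chosen to match a two-sided test at the 0.05 level), and the variable names are introduced here for the example.

```r
x_bar_diff <- -0.545   # observed mean difference (read - write)
sd_diff    <- 8.887    # standard deviation of the differences
n          <- 200      # number of students

se     <- sd_diff / sqrt(n)        # standard error of the mean difference
t_star <- qt(0.975, df = n - 1)    # critical value for an assumed 95% interval
ci     <- x_bar_diff + c(-1, 1) * t_star * se
round(ci, 2)

# TRUE when the interval contains 0
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero
```

The lower bound is negative and the upper bound is positive, so 0 lies inside the interval, in agreement with the decision not to reject the null hypothesis.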

### 5.32 Fuel efficiency of manual and automatic cars

Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

We will use the following hypotheses:

H0: The difference in average fuel efficiency = 0

H1: The difference in average fuel efficiency != 0

```{r}
mileage <- matrix(c(16.12, 19.85, 3.58, 4.41, 26, 26), nrow = 3, ncol = 2, byrow = TRUE)

colnames(mileage) <- c("Automatic", "Manual")
rownames(mileage) <- c("Mean", "SD", "n")
mileage
mean_diff<- mileage[1,1] - mileage[1,2]
paste("The mean difference is: ", mean_diff)

SE <- sqrt((mileage[2,1]^2/mileage[3,1])+(mileage[2,2]^2/mileage[3,2]))
paste("The SE is: ", round(SE, 4))
t <- mean_diff/SE
paste("The t value is: ", round(abs(t),4))
p_times2 <- round(2*pt(abs(t), df=mileage[3,1]-1, lower.tail = FALSE),4)
paste("The p-value is: ",p_times2)
```

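Since the p-value is less than 0.05, we reject the null hypothesis. The data provide strong evidence of a difference in average city mileage between cars with manual and automatic transmissions.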
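
The test above uses the conservative choice of degrees of freedom, min(n1, n2) - 1 = 25. As a sensitivity check, the sketch below also computes the Welch-Satterthwaite approximation to the degrees of freedom. This alternative is not part of the original solution; it is shown only to illustrate that the conclusion does not depend on the choice, and all variable names are introduced for the example.

```r
m1 <- 16.12; s1 <- 3.58; n1 <- 26   # automatic
m2 <- 19.85; s2 <- 4.41; n2 <- 26   # manual

v1 <- s1^2 / n1                     # variance of the automatic sample mean
v2 <- s2^2 / n2                     # variance of the manual sample mean
se <- sqrt(v1 + v2)                 # standard error of the difference
t_stat <- (m1 - m2) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

p_conservative <- 2 * pt(abs(t_stat), df = min(n1, n2) - 1, lower.tail = FALSE)
p_welch        <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

round(c(t = t_stat, df_welch = df_welch), 2)
c(conservative = p_conservative, welch = p_welch) < 0.05
```

Both p-values fall below 0.05. The Welch degrees of freedom are larger than 25, so that p-value is the smaller of the two, and the conservative test is the stricter check.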


### 5.48 Work hours and education

The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

(a) Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

H0: The average hours worked is the same for all educational attainment levels

H1: The average hours worked differs for at least one pair of educational attainment levels

(b) Check conditions and describe any assumptions you must make to proceed with the test.

I am assuming the following conditions:

  1. Independence of the observations within and between groups
  2. The data are approximately normal within each group
  3. The variance is constant between groups, or fairly equal.

(c) Below is part of the output associated with this test. Fill in the empty cells.

```{r}
# group degrees of freedom: k - 1 (dfG)
degree <- (5 -1)
paste("The group degrees of freedom are: ", degree)

# residual degrees of freedom: n - k (dfE)
residuals <- 1172 - 5 
paste("The residual degrees of freedom are: ", residuals)

# total degrees of freedom: n - 1 (dfT)
dft <- residuals + degree
paste("The total degrees of freedom are: ",dft)
# SSG (Sum Sq for groups): sum of n_i * (group mean - overall mean)^2
x <- 40.45
SSG <- (121*(38.67-x)^2)+(546*(39.6-x)^2)+(97*(41.39-x)^2)+(253*(42.55-x)^2)+(155*(40.85-x)^2)
paste("The group Sum Sq is: ", round(SSG,2))

# Total Sum Sq: SST = SSG + SSE, with SSE = 267382
paste("The total Sum Sq is: ", round(SSG + 267382,2))

# mean square for groups: MSG = SSG / dfG
MSG <- SSG / degree
paste("The MSG is: ", round(MSG,2))

# mean square error: MSE = SSE / dfE
SSE <- 267382
MSE<-SSE/residuals
paste("MSE is: ", round(MSE,2))

f.value <- MSG/MSE
paste("The F Value is: ",round(f.value,2))
```

(d) What is the conclusion of the test?

Using a significance level of 0.05, the p-value is greater than 0.05. Therefore, we do not reject the null hypothesis: the variation between the group means can reasonably be attributed to chance.
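
The p-value referred to in the conclusion is the upper-tail area of the F distribution beyond the observed F value. The sketch below recomputes it from the same inputs as part (c). It is an illustration, not the textbook's output: SSG is built from rounded group means, so the F value and p-value may differ slightly from a published ANOVA table, and the variable names are introduced for the example.

```r
df_g <- 5 - 1        # group degrees of freedom
df_e <- 1172 - 5     # residual degrees of freedom

# group sizes and means as used in the SSG calculation in part (c)
n_i    <- c(121, 546, 97, 253, 155)
mean_i <- c(38.67, 39.6, 41.39, 42.55, 40.85)
grand  <- 40.45      # overall mean

ssg <- sum(n_i * (mean_i - grand)^2)
sse <- 267382
f_value <- (ssg / df_g) / (sse / df_e)

# upper-tail area of the F distribution with (df_g, df_e) degrees of freedom
p_value <- pf(f_value, df1 = df_g, df2 = df_e, lower.tail = FALSE)
round(c(F = f_value, p = p_value), 4)
p_value > 0.05
```

The p-value lies between 0.05 and 0.10, so the result is not significant at the 0.05 level but would be at the 0.10 level.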