library(mosaic)
NCbirths <- read.csv("births.csv")

Question 1

What are the names of the variables that indicate whether or not the mother smoked and the weight of the baby?

The variables are Habit (whether or not the mother smoked) and weight (the weight of the baby).

Question 2

In addition to the variable that indicates whether or not the mother smoked, choose one numerical (quantitative) variable and one categorical variable that you think might also affect the weight of a baby. Explain why.

A quantitative variable that might affect the baby's weight is Gained, because the weight a mother gains during pregnancy tends to be associated with the weight of the child. A categorical variable that might affect weight is Premie, because premature babies typically weigh less than babies carried to full term.

Question 3

Create a data frame that has only the four variables you identified in Questions 1 and 2 and the observations for which the mother was not a smoker.

result <- subset(NCbirths, subset = Habit == "NonSmoker", 
                 select = c("weight", "Habit", "Gained", "Premie"))

Question 4

Demonstrate that you have created the data frame in Question 3 by using the head() function to print out the first few rows of your data frame.

head(result)

Question 5

Report the mean of

(a) the weights of all babies.

(b) the weights of those born to smoking mothers.

(c) the weights of those born to non-smoking mothers.

mean(NCbirths$weight)
#> [1] 116.0591
mean(weight ~ Habit, data = NCbirths)
#>           NonSmoker    Smoker 
#>  118.6667  116.8416  108.4225

The unlabeled first column (118.67) is the group of births whose Habit value is blank. The means are (a) 116.06 for all babies, (b) 108.42 for babies born to smoking mothers, and (c) 116.84 for babies born to non-smoking mothers.

Question 6

Report the standard deviation of

(a) the weights of all babies.

(b) the weights of those born to smoking mothers.

(c) the weights of those born to non-smoking mothers.

sd(NCbirths$weight)
#> [1] 20.40667
sd(weight ~ Habit, data = NCbirths)
#>           NonSmoker    Smoker 
#>  21.09660  20.29014  20.03352

The standard deviations are (a) 20.41 for all babies, (b) 20.03 for babies born to smoking mothers, and (c) 20.29 for babies born to non-smoking mothers.

# Keep only the births with a recorded smoking status (drop blank Habit values).
Births <- NCbirths[NCbirths$Habit != "", ]

Question 7

Using Births, create two graphics that help answer this question: Does the birth weight of babies born to smoking mothers differ from those born to non-smoking mothers?

histogram(~ weight | Habit, data = Births, layout = c(1, 2))

freqpolygon(~ weight | Habit, data = Births, layout = c(1, 2))

Question 8

Set the seed to 45 and simulate tossing 20 coins, where the probability of heads is 0.4.

set.seed(45)
rflip(n = 20, prob = 0.4)

Question 9

Set the seed to 45 and simulate 1000 repetitions of tossing 20 coins, where the probability of heads is 0.4. Save the output from this simulation, and print the first six lines from the saved data frame.

set.seed(45)
flip.data45 <- do(1000) * rflip(n = 20, prob = 0.4)
head(flip.data45)

Question 10

Set the seed to 225 and choose 25 numbers out of the numbers from 1 to 1998, without replacement.

set.seed(225)
sample(1:1998, replace=FALSE, size = 25)
#>  [1] 1464 1693 1217  944 1592 1161 1094  556 1362  615  749 1383 1549 1421 1495
#> [16] 1653 1866 1057  130  972  836  325  192  523  875

Question 11

Set the seed to 45 and simulate 20 coin flips using the sample() function, where the probability of heads (represented by 1) is 0.4. Print the proportion of heads obtained.

set.seed(45)
coin11 <- sample(c(1,0), replace = TRUE, size = 20, prob = c(.4, .6))
print(coin11)
#>  [1] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
mean(coin11)
#> [1] 0.15
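
As a cross-check on the sample() approach, the same experiment can be sketched with base R's rbinom(), which draws 0/1 outcomes directly. The names flips_alt and prop_heads are hypothetical. Because rbinom() may consume random numbers differently from sample(), the individual flips need not match the output above, even with the same seed.

```r
# A minimal sketch: 20 Bernoulli trials with P(heads = 1) = 0.4.
set.seed(45)
flips_alt <- rbinom(n = 20, size = 1, prob = 0.4)  # each draw is 0 or 1
prop_heads <- mean(flips_alt)                      # proportion of heads
prop_heads
```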

Question 12

State the null and alternative hypotheses in symbols.

H0: π = 0.75

Ha: π > 0.75

Question 13

Simulate the null distribution in R using the methods developed in this lab (i.e., the for loop and sample() function). Set the seed to 405. Do not use the do() * rflip() notation.

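pi_null <- 0.75   # hypothesized probability under the null (avoids masking R's built-in pi)
p_hat <- 8/10     # observed sample proportion
n <- 10           # number of shots per simulated sample
N <- 1000         # number of simulated samples

sim_prop <- numeric(N)  # storage for the simulated proportions

set.seed(405)

for (i in 1:N) {
  # 1 = made shot, 0 = miss
  flips <- sample(c(1, 0), size = n, replace = TRUE, prob = c(pi_null, 1 - pi_null))
  sim_prop[i] <- mean(flips)
}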
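
The question requires the for loop, so the following is only a sanity check and not a replacement. It is a vectorized sketch that draws the number of successes in each sample with rbinom() and converts the counts to proportions. The names p_null, n_shots, n_sims, and sim_prop_alt are hypothetical, and the individual values will generally differ from the loop's output even with the same seed.

```r
# Assumed values, copied from the setup above.
p_null  <- 0.75   # probability of success under the null
n_shots <- 10     # trials per simulated sample
n_sims  <- 1000   # number of simulated samples

set.seed(405)
successes    <- rbinom(n_sims, size = n_shots, prob = p_null)  # successes per sample
sim_prop_alt <- successes / n_shots                            # sample proportions
mean(sim_prop_alt)  # should land close to 0.75
```

If the loop is correct, a histogram of its results should have roughly the same center and spread as one built from sim_prop_alt.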

Question 14

Create a plot of the null distribution you simulated in Question 13.

histogram(sim_prop, breaks = 10)

Question 15

Compute the (approximate) p-value based on your simulation.

sum(sim_prop >= p_hat) / N
#> [1] 0.522
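
The simulated p-value can be compared with the exact binomial tail probability. This sketch assumes the null model above (10 independent trials with success probability 0.75); exact_p is a hypothetical name.

```r
# P(X >= 8) for X ~ Binomial(10, 0.75), i.e. P(p_hat >= 0.8) under the null.
# pbinom(7, ..., lower.tail = FALSE) returns P(X > 7).
exact_p <- pbinom(7, size = 10, prob = 0.75, lower.tail = FALSE)
round(exact_p, 4)
#> [1] 0.5256
```

The simulated value of 0.522 is close to this exact probability, as expected with 1000 repetitions.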

Question 16

Compute the value of the standardized statistic (z-statistic) based on the standard deviation of the null distribution you simulated in Question 13.

(p_hat - mean(sim_prop)) / sd(sim_prop)
#> [1] 0.3839956
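
For comparison, the standardized statistic can also be sketched with the theoretical standard deviation of a sample proportion under the null, sqrt(π(1 − π)/n), in place of the simulated one. The names p_null, n_shots, sd_theory, and z_theory are hypothetical.

```r
p_null  <- 0.75   # hypothesized proportion
n_shots <- 10     # sample size
p_obs   <- 8/10   # observed proportion

# Theoretical SD of the sample proportion under the null.
sd_theory <- sqrt(p_null * (1 - p_null) / n_shots)
z_theory  <- (p_obs - p_null) / sd_theory
round(sd_theory, 4)
#> [1] 0.1369
round(z_theory, 3)
#> [1] 0.365
```

The simulation-based value of 0.38 differs slightly because it uses the mean and standard deviation of the 1000 simulated proportions instead of their theoretical values.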

Question 17

State your conclusion about the strength of evidence.

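With a p-value of 0.522 and a standardized statistic of 0.38, the observed proportion is well within what the null hypothesis predicts. There is little to no evidence against the null hypothesis and therefore little support for the alternative.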

Question 18

Suppose the player instead had made all 10 shots. Using the same simulation results from Question 13, state your conclusion about the strength of evidence with a significance level of 0.05. Can you think of any limitation(s) to the researcher’s study? Explain.

# Reuse sim_prop and N from Question 13; only the observed proportion changes.
p_hat <- 10/10

sum(sim_prop >= p_hat) / N
#> [1] 0.055

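The p-value of 0.055 is slightly above the 0.05 significance level, so we fail to reject the null hypothesis, although a p-value this close to the cutoff still provides moderate evidence against it. One limitation of the study is the small sample size of only 10 shots, which allows for high variability in the sample proportion.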
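
Because the simulated p-value sits so close to 0.05, it helps to see how much it could move from one simulation to the next. The sketch below computes the exact probability of 10 successes out of 10 under the null and an approximate standard error for a p-value estimated from 1000 repetitions. The names exact_p10, n_sims, and mc_se are hypothetical, and the standard error uses the usual binomial approximation.

```r
# Exact probability of 10 successes in 10 trials when P(success) = 0.75.
exact_p10 <- dbinom(10, size = 10, prob = 0.75)  # same as 0.75^10

# Approximate standard error of a p-value estimated from n_sims repetitions.
n_sims <- 1000
mc_se  <- sqrt(exact_p10 * (1 - exact_p10) / n_sims)

round(exact_p10, 4)
#> [1] 0.0563
round(mc_se, 4)
#> [1] 0.0073
```

The exact value is less than one standard error above 0.05, so a different seed could plausibly produce a simulated p-value on either side of the cutoff. More repetitions or the exact calculation would give a more stable conclusion.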