1. Using R, write a program to calculate all the prime numbers less than 100. A prime number is a positive integer, greater than 1, divisible without remainder only by 1 and itself. Your function should return a vector of all the primes < 100. (15 pt)

get.primes <- function(m) {
  # Calculate all the prime numbers less than `m`
  # Return: a vector of the primes
  primes <- vector("numeric")
  for (i in 2:(m - 1)) {
    is.prime <- TRUE
    for (j in primes) {
      if (i %% j == 0) {
        is.prime <- FALSE
        break  # one divisor is enough; stop checking
      }
    }
    if (is.prime) {
      primes[length(primes) + 1] <- i
    }
  }
  primes
}

get.primes(100)
##  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83
## [24] 89 97

Or, using the Sieve of Eratosthenes (a much more efficient algorithm):

s.o.e <- function(m) {
  # nums[k] holds the value k + 1; composites are crossed out with 0
  nums <- 2:m
  for (i in 2:sqrt(m)) {
    if (nums[i - 1] != 0) {  # skip i if it was already crossed out
      for (j in seq(i^2, m, i)) {
        nums[j - 1] <- 0
      }
    }
  }
  nums[nums != 0]
}

s.o.e(100)
##  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83
## [24] 89 97
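
A quick sanity check that the two implementations agree:

stopifnot(all(get.primes(100) == s.o.e(100)))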

2. Using R, generate a sample of n = 100 random numbers from a Weibull distribution with shape = 1.03 and scale = 1. Make two side-by-side plots:

  1. the probability density for (a) the sample (b) Weibull density (analytical) (c) normal density (analytical), with the same mean and standard deviation as Weibull analytical density.
  2. CDF for the same

Annotate graphs indicating clearly which curve corresponds to which distribution. Make a table to summarize and compare parameters (i.e. mean, median, mode, skewness and standard deviation) of three distributions (sample, Weibull, and normal). Be sure to nicely label your axes and title the graph. [Hints: Weibull distribution, Skewness, modeest, density(), ecdf(), legend(), par(mfrow=c(1,2)), e1071::skewness()] (15pt)

# set a seed so our graph can stay the same every time we run the code
set.seed(1984)

dat.shape <- 1.03
dat.scale <- 1

dat <- rweibull(n = 100, shape = dat.shape, scale = dat.scale)

sample.pdf <- density(dat)
sample.cdf <- ecdf(dat)

# Construct functions for analytical values of a Weibull distribution
weibull.pdf <- function(x, shape = dat.shape, scale = dat.scale) {
  # (shape/scale) * (x/scale)^(shape-1) * exp(-(x/scale)^shape)
  dweibull(x, shape, scale)
}
weibull.cdf <- function(x, shape = dat.shape, scale = dat.scale) {
  # 1 - exp(- (x/scale)^shape)
  pweibull(x, shape, scale)
}
weibull.mean <- function(shape = dat.shape, scale = dat.scale) {
  scale * gamma(1 + 1/shape)
}
weibull.sd <- function(shape = dat.shape, scale = dat.scale) {
  scale * sqrt(gamma(1 + 2/shape) - (gamma(1 + 1/shape))^2)
}
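
As an extra sanity check (a sketch I have added, not required by the question), the closed-form mean should match numerical integration of \(x f(x)\) over the Weibull density:

# E[X] by numerical integration; should agree with the closed form
int.mean <- integrate(function(x) x * dweibull(x, dat.shape, dat.scale), 0, Inf)$value
stopifnot(abs(int.mean - weibull.mean()) < 1e-6)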
  
# The normal distribution uses the same analytical values of
# the mean and sd from the weibull distribution in the given shape and scale
norm.mu <- weibull.mean()
norm.sigma <- weibull.sd()

norm.pdf <- function(x, mu = norm.mu, sd = norm.sigma) {
  # 1 / (sqrt(2*pi)*sd) * exp(-((x-mu)^2/(2*sd^2)))
  dnorm(x, mu, sd)
}
norm.cdf <- function(x, mu = norm.mu, sd = norm.sigma) {
  # the normal CDF has no closed form, so use pnorm() directly
  pnorm(x, mu, sd)
}

# Setting parameters for the plot
par(mfrow=c(1, 2), mar = c(4, 4, 3, 2), mgp = c(2.6, 1, 0), mex = .8)

cols <- list(sample = "red", weibull = "blue", norm = "green")  # colors
plot(sample.pdf, main = "Probability Density",
     col = cols$sample,
     ylim = c(0, 1),
     xlim = c(-3, 7),
     xlab = "sd", ylab = "density")
plot(weibull.pdf, col = cols$weibull,
     ylim = c(0, 1),
     xlim = c(-3, 7),
     add = TRUE)
plot(norm.pdf, col = cols$norm,
     ylim = c(0, 1),
     xlim = c(-3, 7),
     add = TRUE)
legend("topright", legend = names(cols), col = unlist(cols),
       cex = 0.8,
       lwd = 1)

plot(sample.cdf, main = "Cumulative Distribution",
     do.points = FALSE,
     verticals = TRUE,
     col = cols$sample,
     ylim = c(0, 1),
     xlim = c(-5, 5),
     xlab = "X", ylab = "P(x <= X)")
plot(weibull.cdf, col = cols$weibull,
     ylim = c(0, 1),
     xlim = c(-5, 5),
     add = TRUE)
plot(norm.cdf, col = cols$norm,
     ylim = c(0, 1),
     xlim = c(-5, 5),
     add = TRUE)
legend("topleft", legend = names(cols), col = unlist(cols),
       inset = c(0.04, 0.08),
       cex = 0.8,
       lwd = 1)

Summary table:

library(knitr)
# `weibullMode` is provided by the "modeest" package;
# `skewness` comes from "e1071" (per the hint in the question)
suppressPackageStartupMessages(library(modeest))
suppressPackageStartupMessages(library(e1071))

getmode <- function(x) {
  # the most frequent value; for a continuous sample with no ties this
  # degenerates to the first value, so treat it as a rough estimate
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
weibull.median <- function(shape = dat.shape, scale = dat.scale) {
  scale * (log(2))^(1/shape)
}
weibull.skewness <- function(shape = dat.shape, scale = dat.scale) {
  mu <- weibull.mean(shape, scale)  # the analytical mean
  sigma <- weibull.sd(shape, scale)  # the analytical sd
  (gamma(1 + 3/shape) * scale^3 - 3*mu*sigma^2 - mu^3) / sigma^3
}
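
As a quick Monte Carlo sanity check of the analytic skewness (a sketch; the sample size of 10^6 is arbitrary):

# a large Weibull sample's skewness should be close to weibull.skewness()
# (~1.91; shape 1.03 is nearly exponential, whose skewness is exactly 2)
set.seed(2)
e1071::skewness(rweibull(1e6, shape = dat.shape, scale = dat.scale))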

df <- data.frame(
  mean = c(mean(dat), norm.mu, norm.mu),
  sd = c(sd(dat), weibull.sd(), norm.sigma),
  # for the normal distribution, the `median` and `mode` equal the `mean`
  median = c(median(dat), weibull.median(), norm.mu),
  mode = c(getmode(dat), weibullMode(dat.shape, dat.scale), norm.mu),
  skewness = c(skewness(dat)[[1]], weibull.skewness(), 0)
)
row.names(df) <- c("sample", "weibull", "normal")
df <- round(df, 5)
kable(df)
|        |    mean|      sd|  median|    mode| skewness|
|:-------|-------:|-------:|-------:|-------:|--------:|
|sample  | 0.96779| 1.02636| 0.64692| 0.42809|  1.93538|
|weibull | 0.98803| 0.95937| 0.70059| 0.03229|  1.91334|
|normal  | 0.98803| 0.95937| 0.98803| 0.98803|  0.00000|

3. You flip a coin five times.

  1. What’s the chance of getting three or more heads in a row? (5 pt)
  2. What’s the chance of getting three or more heads in a row conditional on knowing the first flip was a heads? (5 pt)

Answer

  1. Let \(H\) denote getting a head on a flip, and \(T\) getting a tail. The sample space contains \(2^5 = 32\) equally likely outcomes. The outcomes for the event “getting three or more heads in a row” are \(\{HHHTT, HHHTH, THHHT, TTHHH, HTHHH, HHHHT, THHHH, HHHHH\}\). There are in total 8 such outcomes, therefore \[P(\textrm{3 or more heads in a row}) = \frac{8}{32} = 0.25\] We can also calculate the probability directly: let \(X\) denote either a head or a tail; the probability of “getting three or more heads in a row” is the sum over the following mutually exclusive patterns: \(HHHTX, HHHHX, THHHX, XTHHH\). \[P = 0.5^4 + 0.5^4 + 0.5^4 + 0.5^4 = 0.25\]

  2. Since we already know the outcome of the first flip, the sample space shrinks to \(2^4 = 16\) outcomes. The favorable outcomes are \(\{HHHTT, HHHTH, HHHHT, HHHHH, HTHHH\}\), therefore \[P = \frac{5}{16} = 0.3125\] Alternatively, with the first flip fixed as \(h\), the mutually exclusive patterns are \(hHHTX, hHHHX, hTHHH\): \[P = 0.5^3 + 0.5^3 + 0.5^4 = 0.3125\]
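
Both results can be verified by brute-force enumeration in R (a short sketch; the helper logic is mine):

# enumerate all 2^5 equally likely sequences (1 = heads, 0 = tails)
flips <- as.matrix(expand.grid(rep(list(0:1), 5)))
# does a sequence contain a run of 3 or more heads?
has.run <- apply(flips, 1, function(s) {
  r <- rle(s)
  any(r$lengths[r$values == 1] >= 3)
})
mean(has.run)                   # 0.25
mean(has.run[flips[, 1] == 1])  # 0.3125, given the first flip is heads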


4. NASA has declared that the Earth is likely to be hit by an asteroid this year based on an astronomical observation it has made. These things are hard to judge for certain, but it is known that the test NASA used is pretty good - it has an accuracy (sensitivity) of 99% and a false positive rate of only 1%. It is further known that the general probability of an asteroid hitting earth in any given year is 1 in 100,000. What is the probability we will actually be hit by an asteroid this year given NASA’s test? (10 pt)

Answer

Let \(A\) be the event that an asteroid hits the earth, and \(B\) the event that NASA predicts a hit is coming. We know that

\[P(B|A) = 0.99, P(B|\bar{A}) = 0.01, P(A) = 1/100000, P(\bar{A}) = 99999/100000\]

The probability that NASA’s prediction is correct and we will actually be hit is

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})} = \frac{0.99 \times 0.00001}{0.99 \times 0.00001 + 0.01 \times 0.99999} = 0.0989\%\]

The probability that we will actually be hit is still very small.
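
The same calculation in R (a small sketch; the variable names are mine):

p.A <- 1 / 100000       # prior probability of a hit in a given year
p.B.given.A <- 0.99     # sensitivity
p.B.given.notA <- 0.01  # false positive rate
p.B <- p.B.given.A * p.A + p.B.given.notA * (1 - p.A)
p.B.given.A * p.A / p.B  # ~0.000989, i.e. about 0.0989%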


5. The average number of snow days in Boston in a winter month is avg = 2 Assuming these events follow a poisson distribution, calculate (using R) the probability of getting n = 5 or more snow days in a month. (5 pt)

Answer

In such a Poisson distribution, the expected number of occurrences is \(\lambda = 2\). Since the Poisson distribution is a discrete distribution over the non-negative integers, the probability of getting n = 5 or more snow days, \(P[X \geq 5]\), is equivalent to \(P[X > 4]\). Using the CDF of the Poisson, ppois(q, lambda, lower.tail), we can set lower.tail = FALSE and calculate \(P[X > n - 1]\).

avg <- 2
n <- 5
ppois(q = n - 1, lambda = avg, lower.tail = FALSE)
## [1] 0.05265302
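
Equivalently, we can sum the first five probability masses and subtract from 1:

1 - sum(dpois(0:(n - 1), lambda = avg))
## [1] 0.05265302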

6. You want to know how many hours of sleep the average college student gets.

You start out with a preliminary survey of 11 people, and get the following data (in hours): 7, 9, 7, 8, 10, 6, 10, 6, 6, 8, 10. You hypothesize that despite what the doctors say, the average college student does not get 7 hours of sleep a night. What does your survey say? State your null hypothesis, research hypothesis (two tailed), and calculate your threshold value, test statistic, and p value. Do you reject the null or not? (10 pt)

Answer

Let \(x\) be hours of sleep the average college student gets.

\[\textrm{Null hypothesis}\ H_{0}: x = 7\] \[\textrm{Research hypothesis}\ H_{a}: x \ne 7\]

With the given data, \(n = 11, \bar{x} = 7.9091, s = 1.6404, se = 0.4946\). Under the null hypothesis the test statistic follows a t distribution with \(n - 1 = 10\) degrees of freedom. Choosing a significance level \(\alpha = 0.05\) (a 5% chance of a Type I error), the critical threshold for the two-tailed test is qt(.975, df = n - 1) = 2.228. Our test statistic must be less than -2.228 or greater than 2.228 in order to reject the null.

However, the test statistic is

\[t = \frac{\bar{x} - \mu}{se} = \frac{7.9091 - 7}{0.4946} = 1.838\]

It is neither greater than 2.228 nor less than -2.228, so we cannot reject the null. Maybe the average college student does get 7 hours of sleep after all.

Alternatively, we can calculate the p-value: 2 * (1 - pt(tstatistic, df)) = 2 * (1 - pt(1.838, 10)) = 0.0959, which is larger than the threshold \(\alpha = 0.05\) we have chosen.

dat <- c(7, 9, 7, 8, 10, 6, 10, 6, 6, 8, 10)
n <- length(dat)
x.bar <- mean(dat)
s <- sd(dat)
se <- s / sqrt(n)
tstatistic <- (x.bar - 7) / se  # 1.838
p.value <- 2 * (1 - pt(tstatistic, df = n - 1))
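
The built-in t.test() reproduces this result:

t.test(dat, mu = 7)  # t ~ 1.84, df = 10, p ~ 0.096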

7. Despite the disappointing results in 6, you are confident in your hypothesis. Assuming your sample standard deviation and mean do not change and you want to survey as few people as possible, how many additional people would you have to survey to reject the null at the \(\alpha = 0.01\) level? Plot p-value as a function of sample size. (5 pt)

Answer

The p-value of the sample in Question 6 is \(0.0959\); to reject the null at the \(\alpha = 0.01\) level, we need to reduce the p-value to less than \(0.01\).

For any given sample, the test statistic is

\[T = \frac{\bar{x} - \mu}{se} = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{\bar{x} - \mu}{s} \sqrt{n}\]

Since \(s, \bar{x}, \mu\) are all the same for each sample,

\begin{align*} \frac{T_1}{T_2} &= \frac{\sqrt{n_1}}{\sqrt{n_2}} \\ T_2 &= T_1\frac{\sqrt{n_2}}{\sqrt{n_1}} \\ n_2 &= \frac{n_1}{(T_1 / T_2)^2} \end{align*}

The bigger the sample size (and hence the larger the degrees of freedom), the closer the t distribution is to a normal distribution. The desired test statistic should therefore lie somewhere between -qnorm(0.005) and -qt(0.005, df = 10), i.e. in \((2.576, 3.169)\).

\begin{align*} n_2 &\in \left(\frac{11}{(1.838 / 2.576)^2}, \frac{11}{(1.838 / 3.169)^2}\right) \\ &= (21.6, 32.7) \end{align*}

The smallest possible \(n\) is \(22\), let’s see if that gives us a desired p-value.

\[T_2 = T_1\frac{\sqrt{n_2}}{\sqrt{n_1}} = 1.838 \times \frac{\sqrt{22}}{\sqrt{11}} = 2.599\]

\[\textrm{p-value} = \textrm{2 * (1 - pt(2.599, df = 22 - 1))} = 0.0168\]

\(0.0168\) is still slightly higher than the desired p-value of \(0.01\), so we try the next integers as \(n\) until we get a low enough p-value. The smallest such integer is \(26\), giving a p-value of \(0.00858\).

We’d need to survey \(26 - 11 = 15\) additional people.
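
We can confirm the search in R (a sketch reusing the sample statistics from Question 6):

# find the smallest n whose two-tailed p-value drops below 0.01,
# assuming the sample mean and sd stay fixed
x.bar <- 7.9091; mu0 <- 7; s <- 1.6404
n <- 11
repeat {
  n <- n + 1
  tstatistic <- (x.bar - mu0) / (s / sqrt(n))
  p <- 2 * (1 - pt(tstatistic, df = n - 1))
  if (p < 0.01) break
}
c(n, p)  # 26, ~0.0086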

The relationship between the sample size and the p-value is plotted as below:

func <- function(n) {
  # two-tailed p-value as a function of sample size, holding the
  # sample mean and sd from Question 6 fixed
  x1 <- 7.9091  # sample mean
  x0 <- 7       # hypothesized mean
  s <- 1.6404   # sample sd
  tstatistic <- (x1 - x0) / (s / sqrt(n))
  2 * (1 - pt(tstatistic, df = n - 1))
}

par(mfrow=c(1, 1), mex = .7)
plot(func, 11, 80,
  main = "p-value for different sample sizes",
  ylab = "p-value", xlab = "Sample Size")


8. You survey the same 11 people during finals period, and get the following hours: data.resample = c(5, 9, 9, 7, 5, 6, 7, 5, 6, 8, 9). Do college students get significantly less sleep than usual during finals? (10 pt)

dat1 <- c(7, 9, 7, 8, 10, 6, 10, 6, 6, 8, 10)  # normal times
dat2 <- c(5, 9, 9, 7, 5, 6, 7, 5, 6, 8, 9)  # during finals period

m1 <- mean(dat1)
s1 <- sd(dat1)
se1 <- s1 / sqrt(length(dat1))

m2 <- mean(dat2)
s2 <- sd(dat2)
se2 <- s2 / sqrt(length(dat2))

\[H_0: \mu_1 = \mu_2\] \[H_a: \mu_1 \neq \mu_2\]

For the sample data from normal times, \[n_1 = 11, \bar{x}_1 = 7.9091, s_1 = 1.6404, se_1 = 0.4946\]

For the sample data from the finals period, \[n_2 = 11, \bar{x}_2 = 6.9091, s_2 = 1.6404, se_2 = 0.4946\]

The two groups have the same sample size, the same standard deviation, and the same standard error. Therefore

\[se_{\textrm{diff}} = \sqrt{se_1^2 + se_2^2} = \sqrt{2} \times 0.4946 = 0.6995, \quad df = 2n - 2 = 20\] \[\textrm{Test statistic} = \frac{\bar{x}_1 - \bar{x}_2}{se_{\textrm{diff}}} = \frac{7.9091 - 6.9091}{0.6995} = 1.4297\]

Assuming a two-tailed test with \(\alpha = 0.05\), the critical threshold is calculated as:

alpha <- 0.05
qt(1 - alpha / 2, df = 20)
## [1] 2.085963

Since \(1.4297 < 2.085963\), we cannot reject the null. There is NO statistically significant difference in hours of sleep between normal times and the finals period.
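
R’s pooled two-sample t-test agrees with the hand calculation:

t.test(dat1, dat2, var.equal = TRUE)  # t ~ 1.43, df = 20, p ~ 0.17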


9. You are a very bad gardener, and hypothesize that feeding houseplants vodka might help them relax and grow better. You perform an experiment to test your hypothesis, giving 15 houseplants water spiked with vodka, and 15 houseplants water alone. These are your results:

| condition | live | die |
|-----------|-----:|----:|
| treatment |    4 |  11 |
| control   |    8 |   7 |

This looks pretty bad for the treatment, but being as good at statistics as you are bad at gardening, you test it using the chi-square test. What are your results? (10 pt)

Answer

First, calculate the expected value for each cell based on the total number of each variable (live or die, treatment or control):

| condition | live  | die    | total |
|-----------|------:|-------:|------:|
| treatment | 4 (6) | 11 (9) |    15 |
| control   | 8 (6) |  7 (9) |    15 |
| total     |    12 |     18 |    30 |

The chi-square test statistic

\[\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} = \frac{(4 - 6)^2}{6} + \frac{(11 - 9)^2}{9} + \frac{(8 - 6)^2}{6} + \frac{(7 - 9)^2}{9} = 2.222\]

The degrees of freedom

\[df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1\]

Therefore the 95% confidence threshold value is

alpha <- 0.05
qchisq(1 - alpha, df = 1)
## [1] 3.841459

Since the threshold value (\(3.841459\)) is larger than our test statistic (\(2.222\)), we cannot reject the null. We cannot say for sure that the treatment caused more of the plants to die.
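
R’s built-in chisq.test() reproduces the hand calculation once the Yates continuity correction is disabled:

counts <- matrix(c(4, 8, 11, 7), nrow = 2,
                 dimnames = list(c("treatment", "control"), c("live", "die")))
chisq.test(counts, correct = FALSE)  # X-squared = 2.2222, df = 1, p ~ 0.14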


10. Perhaps you got things backwards, and plants need more stimulation to thrive. So you adjust your experiment into three treatment groups: water, vodka, and coffee. These are your results:

| condition | mean days alive | sd |  n |
|-----------|----------------:|---:|---:|
| water     |              50 | 10 | 20 |
| vodka     |              45 |  7 | 10 |
| coffee    |              55 |  4 | 10 |

The overall mean is 50 days (as we said, you’re a bad gardener). Use an F test to determine if there is any significant difference among these three groups. (10 pt)

Answer

\(H_0\): there is no significant difference among these three groups.
\(H_a\): at least one of these groups is different.
Total sampled population: \(20 + 10 + 10 = 40\)
Overall mean: \(\bar{y} = (50*20 + 45*10 + 55*10)/40 = 50\)
Number of groups: \(G = 3\)

\begin{align*} \textrm{Between variance} &= \frac{n_1(\bar{y}_1 - \bar{y})^2 + ... + n_G(\bar{y}_G - \bar{y})^2}{G - 1} \\ &= \frac{20(50 - 50)^2 + 10(45 - 50)^2 + 10(55 - 50)^2}{3 - 1} \\ &= 250 \\ \\ \textrm{Within variance} &= \frac{(n_1 - 1)s_1^2 + ... + (n_G - 1)s_G^2}{N - G} \\ &= \frac{(20 - 1) \times 10^2 + (10 - 1) \times 7^2 + (10 - 1) \times 4^2}{40 - 3} \\ &= 67.162162 \end{align*}

\[F = \frac{\textrm{Between variance}}{\textrm{Within variance}} = \frac{250}{67.162162} = 3.7223\]

For \(\alpha = 0.05\), with \(df_1 = G - 1 = 2\) and \(df_2 = N - G = 37\), the threshold value is calculated as:

alpha <- 0.05
G <- 3  # number of groups
N <- 40  # total sampled population
qf(1 - alpha, df1 = G - 1, df2 = N - G)
## [1] 3.251924

Our F value (\(3.7223\)) is larger than 3.251924, thus we CAN reject the null. The mean days alive of my houseplants was significantly impacted by at least one of the liquids I’ve been feeding them.
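
The same ANOVA computed from the summary statistics in R (a sketch; pf() gives the corresponding p-value):

means <- c(50, 45, 55)
sds <- c(10, 7, 4)
ns <- c(20, 10, 10)
G <- length(means)  # 3 groups
N <- sum(ns)        # 40 plants
grand <- sum(ns * means) / N                              # 50
between <- sum(ns * (means - grand)^2) / (G - 1)          # 250
within <- sum((ns - 1) * sds^2) / (N - G)                 # 67.162
F.value <- between / within                               # 3.7223
pf(F.value, df1 = G - 1, df2 = N - G, lower.tail = FALSE) # p ~ 0.034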