FBB Statistics (Term 7)

Решение задач к зачету по курсу «Методы прикладной статистики в биологии»

Задача 1

In an investigation on toxins produced by molds that infect corn crops, a biochemist prepares extracts of the mold culture and then measures the amount of the toxic substance per gram of solution. From six preparations of the mold culture the following observations on toxic substances (mg) are obtained: 1.2, 0.8, 0.6, 1.1, 1.2, 0.8. Find a 95% confidence interval for the mean amount of toxic substances.

x <- c(1.2, 0.8, 0.6, 1.1, 1.2, 0.8)
n <- length(x)
sample.mean <- mean(x)
s <- sqrt(sum((x-sample.mean)^2)/(n-1))  # s <- sd(x)
SE <- s/sqrt(n)
t <- qt(1-0.025, df=n-1)
# Assumptions: normal distribution.
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval

## [1] 0.6865937 1.2134063

Задача 2

Trace metals in drinking water wells affect the flavor of the water and unusually high concentrations can pose a health hazard. Furthermore, the water in well may vary in the concentration of the trace metals depending upon from where it is drawn. In the paper, “Trace Metals of South Indian River Region” (Environmental Studies, 1982, 62-6), trace metal concentrations (mg/L) on zinc were found from water drawn from the bottom and the top of each of 6 wells. The data follows:

location <- seq(1, 6)
bottom <- c(.430, .266, .567, .531, .707, .716)
top <- c(.415, .238, .390, .410, .605, .609)
problem2.table <- matrix(c(location, bottom, top), nrow=6)
problem2.table

##      [,1]  [,2]  [,3]
## [1,]    1 0.430 0.415
## [2,]    2 0.266 0.238
## [3,]    3 0.567 0.390
## [4,]    4 0.531 0.410
## [5,]    5 0.707 0.605
## [6,]    6 0.716 0.609

Find a 95% confidence interval for the mean difference in the zinc concentrations in this area between water drawn from the top and bottom of wells.

problem2.table <- cbind(problem2.table, problem2.table[,2] - problem2.table[,3])
x <- problem2.table[,4]
n <- length(x)
sample.mean <- mean(x)
s <- sqrt(sum((x-sample.mean)^2)/(n-1))  # s <- sd(x)
SE <- s/sqrt(n)
t <- qt(1-0.025, df=n-1)
# Assumptions: normal distribution.
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval

## [1] 0.02797823 0.15535510

Задача 3

A turkey producer knows from previous experience that profits are maximized by selling turkeys when their average weight is 12 kilograms. Before determining whether to put all their full grown turkeys on the market this month, the producer wishes to estimate their mean weight. Prior knowledge indicates that turkey weights have a standard deviation of around 1.5 kilograms. What is the number of turkeys that must be sampled in order to estimate their true mean weight to within 0.5 kilograms with 95% confidence?

within.range <- 0.5  # mu = sample.mean + within.range, where within.range = z * sigma/sqrt(n)
z <- qnorm(1 - 0.025)
sigma <- 1.5
n <- (z * sigma / within.range)^2
n <- ceiling(n)
n

## [1] 35

Задача 4

A statistical procedure to estimate the mean shell thickness of eggs from chickens contaminated with PCBs (polychlorinated biphenyls) obtains a point estimate of 0.70 mm and an estimated standard error of 0.05 mm. This means:

The standard deviation of actual shell thickness in the sample was 0.05 mm. Нет, 0.05 - стандратное отклонение эстиматора (т.е. для 0.07). Оценка стандартного отклонения по выборке в sqrt(n) раз больше: SE = sigma/sqrt(n).
We are 95% confident that the sample mean shell thickness is accurate to with 0.05 mm. Нет. Почему 95%?
An estimate of the standard deviation of the sample mean shell thickness over repeated samples is 0.05 mm. Ага.
The standard deviation of the population mean over all eggs is about 0.05 mm. Нет. Population SD мы не можем знать.
An approximate 95% confidence interval for the sample mean shell thickness is 0.70mm ± 0.10mm. Ага:

SE <- 0.05
within.range <- qnorm(1 - 0.025) * SE
format(round(within.range, 2), nsmall=2)

## [1] "0.10"

Задача 5

A statistician selects a random sample of 200 seeds from a large shipment of a certain variety of tomato seeds and tests the sample for percentage germination. If 155 of the 200 seeds germinate, what is a 95% confidence interval for the population proportion of seeds that germinate?

p.estimator <- 155/200
n <- 200
p <- 0.5  # Assume the worst case possible
SE <- sqrt(p * (1- p) / n)
z <- qnorm(1 - 0.025)
confidence.interval <- c(p.estimator - z * SE, p.estimator + z * SE)
confidence.interval

## [1] 0.7057048 0.8442952

Задача 6

Some scientists believe that a new drug would benefit about half of all people with a certain blood disorder. To estimate the proportion of patients who would benefit from taking the drug, the scientists will administer it to a random sample of patients who have the blood disorder. What sample size is needed so that the 95% confidence interval will have a width of 0.06?

confidence.interval.width <- 0.06
p <- 0.5
within.range <- confidence.interval.width / 2  # within.range = z * SE, where SE = sqrt(p * (1 - p) / n)
z <- qnorm(1 - 0.025)
n <- p * (1 - p) * z^2 / (within.range)^2
n <- ceiling(n)
n

## [1] 1068

Задача 7

You would like to estimate the percentage of “regular users of vitamins” in a large population and you would like your estimate to be accurate to within 4 percentage points, 19 times out of 20. Approximately how large should your sample size be?

within.range <- 0.04  # within.range = z * SE, where SE = sqrt(p * (1 - p) / n)
p <- 0.5  # Assume the worst case possible
z <- qnorm(1 - (1-19/20)/2)  # 0.025 of probability by tail
n <- p * (1 - p) * z^2 / (within.range)^2
n <- ceiling(n)
n

## [1] 601

Задача 8

In order for the confidence interval in the previous question to be valid:

we must assume that we have a random sample from a normal population. Нет. Для n > 40 распределение выборочных средних будет близко к нормальному для любого распределения генеральной совокупности.
we must assume that we have a random sample from some population (but it need not be a normal population because of the Central Limit Theorem). Ага.
we must assume that the population is normal (but we do not require a random sample because of the Central Limit Theorem). Нет. См. выше.
we do not need to assume that the population is normal nor that the sample is random (because of the Central Limit Theorem). Нет. Sample должен быть random.
we must assume that we have a random sample. Ага.

Задача 9

An experiment was conducted to compare the efficacies of two drugs in the prevention of tapeworms in the stomachs of a new breed of sheep. Samples of size 5 and 8 from each breed were given the drug and the two sample means were 28.6 and 40.0 worms/sheep. From previous studies, it is known that the variances in the two groups are 198 and 232, respectively, and that the number of worms in the stomachs has an approximate normal distribution. Find a 95% confidence interval for the difference in the mean number of worms per sheep.

Так две породы овец или два вида лекарств?

n <- 5
m <- 8
sample.mean <- 40.0 - 28.6
t <- qt(1 - 0.025, df=min(n, m)-1)
SE <- sqrt(198 / n + 232 / m)
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval

## [1] -11.59594  34.39594

Задача 10

A study was conducted to estimate the effectiveness of doing assignments in an introductory statistics course. Students in one section taught by instructor A received no assignments. Students in another section taught by instructor B, received assignments. The final grade of each student was recorded. A 95% confidence interval for the difference in the mean grades (Section A - Section B) was computed to be −3.5 ± 1.8. This means:

There is evidence that doing assignments improves the average grade because the difference in the population means is less than zero. Нет.
There is little evidence that doing assignments improves the average grade because the 95% confidence interval does not cover 0. Нет.
There is evidence that doing assignments improves the average grade because the 95% confidence interval does not cover 0. Ага.
There is evidence that doing assignments does not improve the average grade because the 95% confidence interval does not cover 0. Нет.
There is little evidence that doing assignments does not improve the average grade because the 95% confidence interval does cover 0. Ага.

Задача 11

In a study of iron deficiency among infants, random samples of infants following different feeding programs were compared. One group contained breast-fed infants, while the children in another group were fed by a standard baby formula without any iron supplements. Here are summary results of blood hemoglobin levels at 12 months of age.

breast.fed <- c(8, 13.3, 1.7)  # sample size, sample mean and sample SD
formula.fed <- c(10, 12.4, 1.8)
problem11.table <- matrix(c(breast.fed, formula.fed), nrow=2, byrow=T)
problem11.table

##      [,1] [,2] [,3]
## [1,]    8 13.3  1.7
## [2,]   10 12.4  1.8

Establish a 98% confidence interval for the mean difference in hemoglobin level between the two populations of infants.

n <- problem11.table[1,1]
m <- problem11.table[2,1]
sample.mean <- problem11.table[1,2] - problem11.table[2,2]
SE <- sqrt((problem11.table[1,3])^2 / n + (problem11.table[2,3])^2 / m)
t <- qt(1-0.01, df=min(n, m)-1)
# Assumptions: normal distribution.
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval

## [1] -1.581699  3.381699

# Assumptions: equal variances.
SE <- sqrt(((problem11.table[1,3])^2 * (n-1) + (problem11.table[2,3])^2 * (m-1)) / (n+m-2)) * sqrt(1/n + 1/m)
t <- qt(1-0.01, df=n + m - 2)
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval

## [1] -1.253065  3.053065

Задача 12

Two surgical procedures are widely used to treat a certain type of cancer. To compare the success rates of the two procedures, a random sample from each type of procedure is obtained, and the number of patients with no reoccurrence of the disease after 1 year was recorded. Here is the data.

procedureA <- c(100, 78)  # n and 'No occurence'
procedureB <- c(120, 102)
problem12.table <- matrix(c(procedureA, procedureB), nrow=2, byrow=T)
problem12.table

##      [,1] [,2]
## [1,]  100   78
## [2,]  120  102

Find a 95% confidence interval for the difference in success rates.

n <- procedureA[1]
m <- procedureB[1]
p.estimator.A <- procedureA[2] / procedureA[1]
p.estimator.B <- procedureB[2] / procedureB[1]
p.estimator.diff <- p.estimator.B - p.estimator.A
# Assumptions: actual p is about p.estimator.
p.A <- p.estimator.A
p.B <- p.estimator.B
SE <- sqrt(p.A * (1 - p.A) / n + p.B * (1 - p.B) / m)
z <- qnorm(1 - 0.025)
confidence.interval <- c(p.estimator.diff - z * SE, p.estimator.diff + z * SE)
confidence.interval

## [1] -0.0333126  0.1733126

sprintf(round(confidence.interval*100, 0), fmt="%d%%")

## [1] "-3%" "17%"

# Assumptions: the worst case possible.
p.A <- 0.5
p.B <- 0.5
SE <- sqrt(p.A * (1 - p.A) / n + p.B * (1 - p.B) / m)
z <- qnorm(1 - 0.025)
confidence.interval <- c(p.estimator.diff - z * SE, p.estimator.diff + z * SE)
confidence.interval

## [1] -0.06269019  0.20269019

sprintf(round(confidence.interval*100, 0), fmt="%d%%")

## [1] "-6%" "20%"

Задача 13

A sample of 8 patients had their lung capacity measured before and after a certain treatment with the following results:

patient <- seq(1,8)
before <- c(750, 860, 950, 830, 750, 680, 720, 810)
after <- c(850, 880, 930, 860, 800, 740, 760, 800)
problem13.table <- matrix(c(patient, before, after), ncol=3)
problem13.table

##      [,1] [,2] [,3]
## [1,]    1  750  850
## [2,]    2  860  880
## [3,]    3  950  930
## [4,]    4  830  860
## [5,]    5  750  800
## [6,]    6  680  740
## [7,]    7  720  760
## [8,]    8  810  800

Is there any evidence that the treatment provides an increase in lung capacity?

Здесь правильно применять one-tailed test?

x <- after - before
n <- length(x)
# H0: no difference (x = 0)
# H1: there is an increase (x > 0)
sample.mean <- mean(x)
s <- sqrt(sum((x-sample.mean)^2) / (n-1))  # s <- sd(x)
p.value <- 1 - pt((sample.mean-0) / (s/sqrt(n)), df=n-1)
p.value

## [1] 0.02116503

t.test(after, before, paired=T, df=n-1, alternative='greater')$p.value

## [1] 0.02116503

There is evidence that the treatment provides an increase in lung capacity.
На 5% уровне значимости мы отвергаем гипотезу H0.

Задача 14

The power takeoff driveline on tractors used in agriculture is a potentially serious hazard to operators of farm equipment. The driveline is covered by a shield in new tractors, but for a variety of reasons, the shield is often missing on older tractors. Two type of shields are the bolt-on and the flipup. It was believed that the bolt-on shield was perceived as a nuisance by the operators and deliberately removed, but the flip-up shield is easily lifted for inspection and maintenance and may be left in place. In a study initiated by the National Safety Council of the U.S., a sample of older tractors with both types of shields was taken to see what proportion were removed. Of 183 tractors designed to have bolt-on shields, 35 had been removed. Of the 136 tractors with flip-up shields, 15 were removed. We wish to test the hypothesis H: pb = pf vs A: pb <> pf where pb and pf are the proportion of tractors with the bolt-on and flip-up shields removed, respectively. Find the corresponding p-value.

n <- 183
n.removed <- 35
m <- 136
m.removed <- 15
pb.estimator <- n.removed / n
pf.estimator <- m.removed / m
# Assumptions: actual p is about p.estimator.
pb <- pb.estimator
pf <- pf.estimator
SE <- sqrt(pb * (1 - pb) / n + pf * (1 - pf) / m)
p.value <- 2 * pnorm((pf.estimator-pb.estimator - 0) / SE)
p.value

## [1] 0.04081365

# Assumptions: the worst case possible.
pb <- 0.5
pf <- 0.5
SE <- sqrt(pb * (1 - pb) / n + pf * (1 - pf) / m)
p.value <- 2 * pnorm((pf.estimator-pb.estimator - 0) / SE)
p.value

## [1] 0.1526427

Задача 15

A large supermarket chain will increase its stock of bakery products if more than 20% of its customers are purchasers of bakery products. A random sample of 100 customers found 28% purchased bakery items. A 5% significance test is conducted to determine if the chain should increase its bakery stock. Should the supermarket chain increase its stock of bakery products?

n <- 100
p.estimator <- 0.28
threshold <- 0.2
# Assumptions: the worst case possible.
p <- 0.5
SE <- sqrt(p * (1 - p) / n)
p.value <- 1 - pnorm((p.estimator - threshold) / SE)
p.value

## [1] 0.05479929

There is little evidence that the supermarket chain should increase its stock of bakery products.

Задача 16

One criticism of reforestation efforts after timber harvesting is that too few of the seedlings survive. An experiment was conducted to assess if mulching the slash (limbs, roots, small branches, etc.) and leaving the mulch on the ground improves the survival rate compared to just leaving the slash on the ground. It is believed that mulching will cause the material to break down sooner and release the nutrients to the seedlings. A total of 500 seedlings were randomly assigned to the two treatments and the two year survival rate was measured. Of the 250 seedling receiving the “mulching” treatment, 75 survived; of the 250 seedlings receiving the “control” treatment, 55 survived. Do these data support the criticism point stated in the problem?

m <- 250; c <- 250
m.survived <- 75; c.survived <- 55
pm.estimator <- m.survived / m
pc.estimator <- c.survived / c
# Assumptions: actual p is about p.estimator.
pm <- pm.estimator
pc <- pc.estimator
SE <- sqrt(pm * (1 - pm) / m + pc * (1 - pc) / c)
p.value <- 1 - pnorm((pm.estimator-pc.estimator - 0 ) / SE)
p.value

## [1] 0.02029713

# Assumptions: the worst case possible.
pm <- 0.5; pc <- 0.5
SE <- sqrt(pm * (1 - pm) / m + pc * (1 - pc) / c)
p.value <- 1 - pnorm((pm.estimator-pc.estimator - 0 ) / SE)
p.value

## [1] 0.03681914

Нет?

Задача 17

DDT is an insecticide that accumulates up the food chain. Predator birds can be contaminated with quite high levels of the chemical by eating many lightly contaminated preys. One effect of DDT upon birds is to inhibit the production of the enzyme carbonic anhydrase which controls calcium metabolism. It is believed that this causes egg shells to be thinner and weaker than normal and makes the eggs more prone to breakage. (This is one of reasons why the condor in California is near extinction.) An experiment was conducted where 16 sparrow hawks were fed a mixture of 3 ppm dieldrin and 15 ppm DDT (a combination often found in contaminated prey). The first egg laid by each bird was measured and the mean shell thickness was found to be 0.19 mm with a standard deviation of 0.01 mm. A normal egg shell is assumed to have a mean thickness of 0.2 mm. Carry out an appropriate statistical test.

n <- 16
sample.mean <- 0.19
s <- 0.01
mu <- 0.2
SE <- s / sqrt(n)
p.value <- pt((sample.mean - mu) / SE, df=n-1)
p.value

## [1] 0.0005796584

Задача 18

Resting pulse rate is an important measure of the fitness of a person’s cardiovascular system with a lower rate indicative of greater fitness. The mean pulse rate for all adult males is approximately 72 beats per minute. A random sample of 25 male students currently enrolled in the Faculty of Agriculture and now taking 5.211 was selected and the mean pulse resting pulse rate was found to be 80 beats per minute with a standard deviation of 20 beats per minute. The experimenter wishes to test if the students are less fit, on average, than the general population. Help her to perform an appropriate statistical test.

mu <- 72
n <- 25
sample.mean <- 80
s <- 20
SE <- s / sqrt(n)
p.value <- 1 - pt((sample.mean - mu) / SE, df=n-1)
p.value

## [1] 0.02846992

Задача 19

A 95% confidence interval for μ is calculated to be (1.7, 3.5). It is now decided to test the hypothesis H0 : μ = 0 vs HA:μ <> 0 at the 5% significance level, using the same data as was used to construct the confidence interval.

We cannot test the hypothesis without the original data. Нет.
We cannot test the hypothesis at the 5% level because the 5% test is connected to the 97.5% confidence interval. Нет.
We can only make the connection between hypothesis tests and confidence interval if the sample sizes are large. Нет. Для малых n sigma может быть известным, а ещё мы можем испоьзовать t-распределение. Однако n мы не знаем, поэтому не сможем определить степени свободы t-распределения.
We would reject H0 at 5% level. Ага.

confidence.interval <- c(1.7, 3.5)
sample.mean <- mean(confidence.interval)  # mu = sample.mean +- z * sigma/sqrt(n)
within.range <- confidence.interval[2] - sample.mean
SE <- within.range / qnorm(1 - 0.025)
p.value <- 1 - pnorm((sample.mean - 0)/ SE)
p.value

## [1] 7.47579e-09

We would accept H0 at 5% level. Нет.

Задача 20

We wish to test if a new feed increases the mean weight gain compared to an old feed. At the conclusion of the experiment it was found that the new feed gave a 10 kg bigger gain than the old feed. A two-sample t-test with the proper one-sided alternative was done and the resulting p-value was .082. This means:

There is an 8.2% chance the null hypothesis is true. Нет.
There was only a 8.2% chance of observing an increase greater than 10 kg (assuming the null hypothesis was true). Ага.
There was only an 8.2% chance of observing an increase greater than 10 kg (assuming the null hypothesis was false). Нет.
There is an 8.2% chance the alternate hypothesis is true. Нет.
There is only an 8.2% chance of getting a 10 kg increase. Нет.

Задача 21

In order to compare two kinds of feed, thirteen pigs are split into two groups, and each group received one feed. The following are the gains in weight (kilograms) after a fixed period of time:

feedA <- c(8.0, 7.4, 5.8, 6.2, 8.8, 9.5, NA)
feedB <- c(12.0, 18.2, 8.0, 9.6, 8.2, 9.9, 10.3)
problem21.table <- matrix(c(feedA, feedB), nrow=2, byrow=T)
problem21.table

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    8  7.4  5.8  6.2  8.8  9.5   NA
## [2,]   12 18.2  8.0  9.6  8.2  9.9 10.3

Find the p-value for the corresponding statistical test. What is your conclusion?

feedA <- feedA[!is.na(feedA)]
lenA <- length(feedA)
lenB <- length(feedB)
sample.mean.A <- mean(feedA); sample.mean.B <- mean(feedB)
SE <- sqrt((sd(feedA))^2 / lenA + (sd(feedB))^2 / lenB)
p.value <- 1 - pt((sample.mean.B-sample.mean.A - 0) / SE, df=min(lenA, lenB)-1)
p.value

## [1] 0.03667544

На уровне значимости 5% есть основания полагать, что второй корм даёт больший прирост в весе, чем первый.

Задача 22

Each person in a random sample of 50 was asked to state his/her sex and preferred color. The resulting frequencies are shown below.

male <- c(5, 14, 6)
female <- c(15, 6, 4)
problem22.table <- data.frame(rbind(male, female))
colnames(problem22.table) <- c("red", "blue", "green")
problem22.table

##        red blue green
## male     5   14     6
## female  15    6     4

Carry out a chi-square test is used to test the null hypothesis that sex and preferred color are independent. What other test could you suggest for this problem?

sumup <- margin.table
O <- as.table(as.matrix(problem22.table))
E <- sumup(O, 1) %*% t(sumup(O, 2)) / sumup(O)
E  # Expected

##        red blue green
## male    10   10     5
## female  10   10     5

chi <- sum((E - O)^2 / E)
chi.df <- prod(dim(problem22.table) - 1)  # df = (n-1)*(m-1)
p.value <- 1 - pchisq(chi, df=chi.df)
p.value

## [1] 0.01356856

Fisher's exact test.

Задача 23

An educator plans to evaluate two different teaching methods by comparing the average tests cores of two classes of students, each taught by one of the two techniques under study. Suppose the two classes have 35 and 38 students, respectively, with standard deviations in tests scores of 24.5 and 21.2, respectively. Is an observed difference in average test scores of 2.4 significant?

n <- 35; m <- 38
sd.n <- 24.5; sd.m <- 21.2
sample.mean <- 2.4
SE <- sqrt(sd.n^2 / n + sd.m^2 / m)
p.value <- 1 - pt((sample.mean - 0) / SE, df=min(n, m))
p.value

## [1] 0.3292284

Нет (на уровне значиомости 5%).

Задача 24

A highway superintendent states that five bridges into a city are used in the ratio 2:3:3:4:6 during the morning rush hour. A highway study of a random sample of 6000 cars indicates that 720, 970, 1013, 1380, and 1917 cars use the five bridges, respectively. At what significance level can the superintendent’s claim be rejected?

counts.o <- c(720, 970, 1013, 1380, 1917)
ratio.e <- c(2, 3, 3, 4, 6)
counts.e <- sum(counts.o) / sum(ratio.e) * ratio.e
counts.e  # Expected

## [1]  666.6667 1000.0000 1000.0000 1333.3333 2000.0000

chi <- sum((counts.e - counts.o)^2 / counts.e)
chi.df <- length(counts.o) - 1  # df = n - 1
p.value <- 1 - pchisq(chi, df=chi.df)
p.value

## [1] 0.03400959

The superintendent's claim can be rejected at 1% significance level.

Задача 25

An author of a new book claims that anyone following his suggested diet program will lose an average of 2.9 pounds per week. A researcher believes that the true figure will be lower and plans a test involving random sample of 36 overweight people. She will reject the author’s claim if the mean weight loss in the volunteer group is less than 2.5 pounds per week. Assume that the standard deviation among individuals is 1.2 pounds per week. If the true mean value is 2.4 pounds per week, what is the probability that the researcher will mistakenly fail to reject the author’s false claim of 2.8 pounds?

n <- 36
s <- 1.2
sample.mean <- 2.5
mu <- 2.4
SE <- s / sqrt(n)
reject.prob <- 1 - pt((sample.mean - mu) / SE, df=n-1)
reject.prob

## [1] 0.3101022

Задача 26

A telephone company’s records indicate that private customers pay on average 17.10 per month for long-distance telephone calls. A random sample of 10 customers’ bills during a given month produced a sample mean of 22.10 expended for long-distance calls and a sample variance of 45. A 5% significance test is to be performed to determine if the mean level of billing for long distance calls per month is in excess of 17.10. The calculated value of the test statistic and the critical value respectively are:

(2.36, 1.8331) Ага.
(1.17, 2.2622)
(2.36, 2.2622)
(1.17, 1.8331)
(0.025, 1.8125)

mu <- 17.10
n <- 10
sample.mean <- 22.10
s <- sqrt(45)
SE <- s / sqrt(n)
test.statistics <- (sample.mean - mu) / SE
critical.value <- qt(1-0.05, df=n-1)
c(test.statistics, critical.value)

## [1] 2.357023 1.833113

Задача 27

Some varieties of nematodes feed on the roots of lawn grasses and crops such as strawberries and tomatoes. The pest, which is particularly troublesome in warm climates, can be treated by the application of nematocides. However, because of the size of the worms, it is very difficult to count them directly. Hence, the yield of a crop is used as a surrogate for the number of worms. Four brands of nematocides are to be compared. Twelve plots of land of comparable fertility that were suffering from nematodes were planted with a crop. Each nematocide was applied to three plots; the assignment of the nematocide to the plot was made at random. At harvest time, the yields of each plot were recorded and part of the ANOVA table appears below:

n <- 4; m <- 3;
df.E <- 8; df.T <- 11
SSA <- 3.456; SSE <- 1.200; SST <- 4.656

Complete the ANOVA table and carry out the corresponding statistical test. What is your conclusion?

df.X <- df.T - df.E  # df.X <- n - 1
df.X

## [1] 3

MSA <- SSA / (n - 1)
MSA

## [1] 1.152

MSE <- SSE / (m * n - n)
MSE

## [1] 0.15

F <- MSA / MSE
F

## [1] 7.68

p.value <- 1 - pf(F, df.X, df.E)
p.value

## [1] 0.009671289

Задача 28

Suppose that we have grown one bacterium in broth culture at 3 different pH levels at 4 different temperatures. We have 12 flasks in all, but no replicates.

problem28.table <- data.frame(rbind(c(25, 10, 19, 40), c(30, 15, 25, 45), c(35, 20, 30, 55), c(40, 15, 22, 40)))
colnames(problem28.table) <- c('temp', 'pH 5.5',  'pH 6.5', 'pH 7.5')
problem28.table

##   temp pH 5.5 pH 6.5 pH 7.5
## 1   25     10     19     40
## 2   30     15     25     45
## 3   35     20     30     55
## 4   40     15     22     40

Construct the ANOVA table and carry out the corresponding statistical tests. What is your conclusion?

observed <- as.matrix(problem28.table[-1])
rownames(observed) <- c("t 25", "t 30", "t 35", "t 40")
n <- nrow(observed); m <- ncol(observed)
observed <- cbind(observed, apply(observed, 1, mean))
observed <- rbind(observed, apply(observed, 2, mean))
rownames(observed)[5] <- "mean"; colnames(observed)[4] <- "mean"
mean.n <- observed[,m+1]; mean.m <- observed[n+1,]
mean.global <- mean.n[n+1]
observed

##      pH 5.5 pH 6.5 pH 7.5     mean
## t 25     10     19     40 23.00000
## t 30     15     25     45 28.33333
## t 35     20     30     55 35.00000
## t 40     15     22     40 25.66667
## mean     15     24     45 28.00000

expected <- t(apply(observed[-n-1,], 1, function(x) mean.m + x[4] - mean.global))
expected <- rbind(expected, mean.m)
rownames(expected)[n+1] <- "mean"
expected

##        pH 5.5   pH 6.5   pH 7.5     mean
## t 25 10.00000 19.00000 40.00000 23.00000
## t 30 15.33333 24.33333 45.33333 28.33333
## t 35 22.00000 31.00000 52.00000 35.00000
## t 40 12.66667 21.66667 42.66667 25.66667
## mean 15.00000 24.00000 45.00000 28.00000

SST <- sum((observed[-n-1,-m-1] - mean.global)^2)
SSA <- m * sum((mean.n[-n-1] - mean.global)^2)
SSB <- n * sum((mean.m[-m-1] - mean.global)^2)
SSE <- SST - SSA - SSB
round(c(SST, SSA, SSB, SSE), 1)

## [1] 2162.0  238.7 1896.0   27.3

MSE <- SSE / ((n * m - 1) - (n - 1) - (m - 1))
MSA <- SSA / (n - 1)
MSB <- SSB / (m - 1)
round(c(MSA, MSB, MSE), 1)

## [1]  79.6 948.0   4.6

F.A <- MSA / MSE
F.B <- MSB / MSE
round(c(F.A, F.B), 1)

## [1]  17.5 208.1

c(1-pf(F.A, n-1, n*m-n-m+1), 1-pf(F.B, m-1, n*m-n-m+1))

## [1] 2.280032e-03 2.870213e-06

table28 <- data.frame(as.character(rep(problem28.table[,1], each=3)), rep(c("pH 5.5", "pH 6.5", "pH 7.5"), 4), as.vector(unlist(t(problem28.table[,-1]))), stringsAsFactors = F)
colnames(table28)<-c("temp", "pH", "counts")
summary(aov(counts ~ temp+pH, data=table28))

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## temp         3  238.7    79.6   17.46  0.00228 ** 
## pH           2 1896.0   948.0  208.10 2.87e-06 ***
## Residuals    6   27.3     4.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Задача 29

The following information was obtained from the manager of a city water department for predicting the consumption of water (in gallons) from the size of household. [ Household Size (x) versus Water Used (y), 10 observations, values are not shown] Here are the summary statistics:

sum.x <- 57
sum.y <- 10038
sum.x2 <- 433
sum.y2 <- 11641474
sum.xy <- 67669

Find the equation of the least squares regression for water consumption on household size.

n <- 10
slope <- (sum.xy - sum.x * sum.y / n) / (sum.x2 - sum.x^2 / n)
intercept <- (sum.y - slope * sum.x) / n
sprintf("y = %.2f * x + %.2f", slope, intercept)

## [1] "y = 96.69 * x + 452.66"

Relate the slope of the regression line to the value of correlation coefficient. Interpret, using the correct units, both figures in the context of the problem.

slope.inverted <- (sum.xy - sum.x * sum.y / n) / (sum.y2 - (sum.y)^2 / n)
r <- sqrt(slope * slope.inverted)  # b * b* = r^2
r

## [1] 0.8035266

sprintf("%.2f%% дисперсии Y объясняется линейной регрессией.", r^2 * 100)

## [1] "64.57% дисперсии Y объясняется линейной регрессией."

Find a 90% confidence interval for the regression slope

a <- intercept; b <- slope;
sigma.e <- sqrt((sum.y2 - 2 * (a * sum.y + b * sum.xy) + (n * a^2 + 2*a*b*sum.x + b^2 * sum.x2)) / (n - 2))
SE <- sigma.e / (sum.x2 - (sum.x)^2 / n)
t <- qt(0.05, df=n-2)
confidence.interval <- c(slope - t * SE, slope + t * SE)
confidence.interval

## [1] 101.22148  92.16243

Find a 90% prediction interval for the water consumption given that household size is 10.

sxx <- sum(sum.x2 - sum.x^2 / n) / (n - 1)
syy <- sum(sum.y2 - sum.y^2 / n) / (n - 1)
sxy <- sum(sum.xy - sum.x * sum.y / n) / (n - 1)
r <- sxy / sqrt(sxx * syy)
SSE <- (n - 1) * syy * (1 - r^2)
sigma.e <- sqrt(SSE / (n - 2))  # This line and above: just a clear way to calculate the common values
x0 <- 10
t <- qt(1-0.05, df=n-2)
within.range <- t * sigma.e * sqrt(1 / n + (x0 - sum.x / n)^2 / (sum.x2 - sum.x^2 / n) + 1)
c(a + b * x0 - within.range, a + b * x0 + within.range)

## [1]  867.5501 1971.6007

What assumptions do you need for the above reasoning to be valid?

Задача 30

A research group was interested in predicting the number of bus riders per capita in census districts. They felt that the rider-ship per capita, Y , could be predicted using the average income, X, for the census district. A sample of 29 census districts were taken and the observations on the samples were used to obtain n = 29, mean(Y) = 62.1429, mean(X) = 3452.178; Sum(X − mean(X))(Y − mean(Y)) = 189312.0; Sum(X − mean(X))2= 19910691.0; Sum(Y − mean(Y))2 = 13369.381; MSE = 428.5 Based on this data, find a 98% confidence interval for the slope of the regression line.

n <- 29
mean.x <- 3452.178; mean.y <- 62.1429
sxx <- 19910691.0 / (n - 1)
syy <- 13369.381 / (n - 1)
sxy <- 189312.0 / (n - 1)
MSE <- 428.5
b <- sxy / sxx
SE <- sqrt(MSE / (sxx * (n - 1)))
t <- qt(1-0.01, df=n-2)
c(b - t * SE, b + t * SE)

## [1] -0.00196282  0.02097894

Задача 31

The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

height <- c(1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83)
weight <- c(62.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46)

Find the equation of the least squares regression for weight versus height

n <- length(height)
mean.height <- sum(height) / n
mean.weight <- sum(weight) / n
sxx <- sum((height - mean.height)^2) / (n - 1)
syy <- sum((weight - mean.weight)^2) / (n - 1)
sxy <- sum((weight - mean.weight) * (height - mean.height)) / (n - 1)
b <- sxy / sxx
a <- mean.weight - b * mean.height
sprintf("y = %.2f + %.2f * x", a, b)

## [1] "y = -22.07 + 51.38 * x"

lm(weight ~ height)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Coefficients:
## (Intercept)       height  
##      -22.07        51.38

Calculate and interpret the coefficient of determination. What does residual plot tell you about validity of the linear model?

r <- sxy / sqrt(sxx * syy)
r^2

## [1] 0.8184522

Do you find any outliers or influential scores in the dataset? If so, explain.

plot(height, weight)
abline(lm(weight ~ height))

По графику видно, что есть маленькая и очень тяжёлая женщина.

Use log transformations to find out whether the linear model provides a good enough fit. If so, explain. If not, provide a better model for these data (you may use EXCEL or other tools for calculations)

# log(y) = a + b * log(x) + e
plot(log(height), log(weight))
abline(lm(log(weight) ~ log(height)))

# y = a + b * log(x) + e
plot(log(height), weight) 
abline(lm(weight ~ log(height)))

# log(y) = a + b * x + e
plot(height, log(weight))
abline(lm(log(weight) ~ height))

Задача 32

Two methods are used to measure the business acumen expressed by heroines of romance novels. The mean of method 1 is 88.6 with a variance of 109.63 in a sample of size 41. The mean of method 2 is 85.1 with a variance of 65.99 in a sample of size 21. Before you test if the means of these two methods are the same, you should test if they possess homogeneous variances. Find the corresponding p-value.

mean.1 <- 88.6;     mean.2 <- 85.1
var.1 <- 109.63;    var.2 <- 65.99
n.1 <- 41;          n.2 <- 21
# Test for homogeneous variances
p.value.homogeneous <- 1 - pf((var.1 / var.2), df1=n.1-1, df2=n.2-1)
p.value.homogeneous

## [1] 0.1121884

# Different veriances
SE <- sqrt(var.1 / n.1 + var.2 / n.2)
p.value <- 1 - pt((mean.1 - mean.2 - 0) / SE, df=min(n.1, n.2)-1)
p.value

## [1] 0.08110655

# If the variances were homogeneous
SE <- sqrt(1 / n.1 + 1 / n.2) * sqrt(((n.1 - 1) * var.1 + (n.2 - 1) * var.2) / (n.1 + n.2 - 2))
p.value <- 1 - pt((mean.1 - mean.2 - 0) / SE, df=n.1+n.2-2)
p.value

## [1] 0.09303994