Решение задач к зачету по курсу «Методы прикладной статистики в биологии»
In an investigation on toxins produced by molds that infect corn crops, a biochemist prepares extracts of the mold culture and then measures the amount of the toxic substance per gram of solution. From six preparations of the mold culture the following observations on toxic substances (mg) are obtained: 1.2, 0.8, 0.6, 1.1, 1.2, 0.8. Find a 95% confidence interval for the mean amount of toxic substances.
x <- c(1.2, 0.8, 0.6, 1.1, 1.2, 0.8)
n <- length(x)
sample.mean <- mean(x)
s <- sqrt(sum((x-sample.mean)^2)/(n-1)) # s <- sd(x)
SE <- s/sqrt(n)
t <- qt(1-0.025, df=n-1)
# Assumptions: normal distribution.
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval
## [1] 0.6865937 1.2134063
Trace metals in drinking water wells affect the flavor of the water and unusually high concentrations can pose a health hazard. Furthermore, the water in well may vary in the concentration of the trace metals depending upon from where it is drawn. In the paper, “Trace Metals of South Indian River Region” (Environmental Studies, 1982, 62-6), trace metal concentrations (mg/L) on zinc were found from water drawn from the bottom and the top of each of 6 wells. The data follows:
location <- seq(1, 6)
bottom <- c(.430, .266, .567, .531, .707, .716)
top <- c(.415, .238, .390, .410, .605, .609)
problem2.table <- matrix(c(location, bottom, top), nrow=6)
problem2.table
## [,1] [,2] [,3]
## [1,] 1 0.430 0.415
## [2,] 2 0.266 0.238
## [3,] 3 0.567 0.390
## [4,] 4 0.531 0.410
## [5,] 5 0.707 0.605
## [6,] 6 0.716 0.609
Find a 95% confidence interval for the mean difference in the zinc concentrations in this area between water drawn from the top and bottom of wells.
problem2.table <- cbind(problem2.table, problem2.table[,2] - problem2.table[,3])
x <- problem2.table[,4]
n <- length(x)
sample.mean <- mean(x)
s <- sqrt(sum((x-sample.mean)^2)/(n-1)) # s <- sd(x)
SE <- s/sqrt(n)
t <- qt(1-0.025, df=n-1)
# Assumptions: normal distribution.
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval
## [1] 0.02797823 0.15535510
A turkey producer knows from previous experience that profits are maximized by selling turkeys when their average weight is 12 kilograms. Before determining whether to put all their full grown turkeys on the market this month, the producer wishes to estimate their mean weight. Prior knowledge indicates that turkey weights have a standard deviation of around 1.5 kilograms. What is the number of turkeys that must be sampled in order to estimate their true mean weight to within 0.5 kilograms with 95% confidence?
within.range <- 0.5 # mu = sample.mean + within.range, where within.range = z * sigma/sqrt(n)
z <- qnorm(1 - 0.025)
sigma <- 1.5
n <- (z * sigma / within.range)^2
n <- ceiling(n)
n
## [1] 35
A statistical procedure to estimate the mean shell thickness of eggs from chickens contaminated with PCBs (polychlorinated biphenyls) obtains a point estimate of 0.70 mm and an estimated standard error of 0.05 mm. This means:
Нет, 0.05 - стандратное отклонение эстиматора (т.е. для 0.07). Оценка стандартного отклонения по выборке в sqrt(n) раз больше: SE = sigma/sqrt(n).Нет. Почему 95%?Ага.Нет. Population SD мы не можем знать.Ага:SE <- 0.05
within.range <- qnorm(1 - 0.025) * SE
format(round(within.range, 2), nsmall=2)
## [1] "0.10"
A statistician selects a random sample of 200 seeds from a large shipment of a certain variety of tomato seeds and tests the sample for percentage germination. If 155 of the 200 seeds germinate, what is a 95% confidence interval for the population proportion of seeds that germinate?
p.estimator <- 155/200
n <- 200
p <- 0.5 # Assume the worst case possible
SE <- sqrt(p * (1- p) / n)
z <- qnorm(1 - 0.025)
confidence.interval <- c(p.estimator - z * SE, p.estimator + z * SE)
confidence.interval
## [1] 0.7057048 0.8442952
Some scientists believe that a new drug would benefit about half of all people with a certain blood disorder. To estimate the proportion of patients who would benefit from taking the drug, the scientists will administer it to a random sample of patients who have the blood disorder. What sample size is needed so that the 95% confidence interval will have a width of 0.06?
confidence.interval.width <- 0.06
p <- 0.5
within.range <- confidence.interval.width / 2 # within.range = z * SE, where SE = sqrt(p * (1 - p) / n)
z <- qnorm(1 - 0.025)
n <- p * (1 - p) * z^2 / (within.range)^2
n <- ceiling(n)
n
## [1] 1068
You would like to estimate the percentage of “regular users of vitamins” in a large population and you would like your estimate to be accurate to within 4 percentage points, 19 times out of 20. Approximately how large should your sample size be?
within.range <- 0.04 # within.range = z * SE, where SE = sqrt(p * (1 - p) / n)
p <- 0.5 # Assume the worst case possible
z <- qnorm(1 - (1-19/20)/2) # 0.025 of probability by tail
n <- p * (1 - p) * z^2 / (within.range)^2
n <- ceiling(n)
n
## [1] 601
In order for the confidence interval in the previous question to be valid:
Нет. Для n > 40 распределение выборочных средних будет близко к нормальному для любого распределения генеральной совокупности.Ага.Нет. См. выше.Нет. Sample должен быть random.Ага.An experiment was conducted to compare the efficacies of two drugs in the prevention of tapeworms in the stomachs of a new breed of sheep. Samples of size 5 and 8 from each breed were given the drug and the two sample means were 28.6 and 40.0 worms/sheep. From previous studies, it is known that the variances in the two groups are 198 and 232, respectively, and that the number of worms in the stomachs has an approximate normal distribution. Find a 95% confidence interval for the difference in the mean number of worms per sheep.
Так две породы овец или два вида лекарств?
n <- 5
m <- 8
sample.mean <- 40.0 - 28.6
t <- qt(1 - 0.025, df=min(n, m)-1)
SE <- sqrt(198 / n + 232 / m)
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval
## [1] -11.59594 34.39594
A study was conducted to estimate the effectiveness of doing assignments in an introductory statistics course. Students in one section taught by instructor A received no assignments. Students in another section taught by instructor B, received assignments. The final grade of each student was recorded. A 95% confidence interval for the difference in the mean grades (Section A - Section B) was computed to be −3.5 ± 1.8. This means:
Нет.Нет.Ага.Нет.Ага.In a study of iron deficiency among infants, random samples of infants following different feeding programs were compared. One group contained breast-fed infants, while the children in another group were fed by a standard baby formula without any iron supplements. Here are summary results of blood hemoglobin levels at 12 months of age.
breast.fed <- c(8, 13.3, 1.7) # sample size, sample mean and sample SD
formula.fed <- c(10, 12.4, 1.8)
problem11.table <- matrix(c(breast.fed, formula.fed), nrow=2, byrow=T)
problem11.table
## [,1] [,2] [,3]
## [1,] 8 13.3 1.7
## [2,] 10 12.4 1.8
Establish a 98% confidence interval for the mean difference in hemoglobin level between the two populations of infants.
n <- problem11.table[1,1]
m <- problem11.table[2,1]
sample.mean <- problem11.table[1,2] - problem11.table[2,2]
SE <- sqrt((problem11.table[1,3])^2 / n + (problem11.table[2,3])^2 / m)
t <- qt(1-0.01, df=min(n, m)-1)
# Assumptions: normal distribution.
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval
## [1] -1.581699 3.381699
# Assumptions: equal variances.
SE <- sqrt(((problem11.table[1,3])^2 * (n-1) + (problem11.table[2,3])^2 * (m-1)) / (n+m-2)) * sqrt(1/n + 1/m)
t <- qt(1-0.01, df=n + m - 2)
confidence.interval <- c(sample.mean - t * SE, sample.mean + t * SE)
confidence.interval
## [1] -1.253065 3.053065
Two surgical procedures are widely used to treat a certain type of cancer. To compare the success rates of the two procedures, a random sample from each type of procedure is obtained, and the number of patients with no reoccurrence of the disease after 1 year was recorded. Here is the data.
procedureA <- c(100, 78) # n and 'No occurence'
procedureB <- c(120, 102)
problem12.table <- matrix(c(procedureA, procedureB), nrow=2, byrow=T)
problem12.table
## [,1] [,2]
## [1,] 100 78
## [2,] 120 102
Find a 95% confidence interval for the difference in success rates.
n <- procedureA[1]
m <- procedureB[1]
p.estimator.A <- procedureA[2] / procedureA[1]
p.estimator.B <- procedureB[2] / procedureB[1]
p.estimator.diff <- p.estimator.B - p.estimator.A
# Assumptions: actual p is about p.estimator.
p.A <- p.estimator.A
p.B <- p.estimator.B
SE <- sqrt(p.A * (1 - p.A) / n + p.B * (1 - p.B) / m)
z <- qnorm(1 - 0.025)
confidence.interval <- c(p.estimator.diff - z * SE, p.estimator.diff + z * SE)
confidence.interval
## [1] -0.0333126 0.1733126
sprintf(round(confidence.interval*100, 0), fmt="%d%%")
## [1] "-3%" "17%"
# Assumptions: the worst case possible.
p.A <- 0.5
p.B <- 0.5
SE <- sqrt(p.A * (1 - p.A) / n + p.B * (1 - p.B) / m)
z <- qnorm(1 - 0.025)
confidence.interval <- c(p.estimator.diff - z * SE, p.estimator.diff + z * SE)
confidence.interval
## [1] -0.06269019 0.20269019
sprintf(round(confidence.interval*100, 0), fmt="%d%%")
## [1] "-6%" "20%"
A sample of 8 patients had their lung capacity measured before and after a certain treatment with the following results:
patient <- seq(1,8)
before <- c(750, 860, 950, 830, 750, 680, 720, 810)
after <- c(850, 880, 930, 860, 800, 740, 760, 800)
problem13.table <- matrix(c(patient, before, after), ncol=3)
problem13.table
## [,1] [,2] [,3]
## [1,] 1 750 850
## [2,] 2 860 880
## [3,] 3 950 930
## [4,] 4 830 860
## [5,] 5 750 800
## [6,] 6 680 740
## [7,] 7 720 760
## [8,] 8 810 800
Is there any evidence that the treatment provides an increase in lung capacity?
Здесь правильно применять one-tailed test?
x <- after - before
n <- length(x)
# H0: no difference (x = 0)
# H1: there is an increase (x > 0)
sample.mean <- mean(x)
s <- sqrt(sum((x-sample.mean)^2) / (n-1)) # s <- sd(x)
p.value <- 1 - pt((sample.mean-0) / (s/sqrt(n)), df=n-1)
p.value
## [1] 0.02116503
t.test(after, before, paired=T, df=n-1, alternative='greater')$p.value
## [1] 0.02116503
There is evidence that the treatment provides an increase in lung capacity.
На 5% уровне значимости мы отвергаем гипотезу H0.
The power takeoff driveline on tractors used in agriculture is a potentially serious hazard to operators of farm equipment. The driveline is covered by a shield in new tractors, but for a variety of reasons, the shield is often missing on older tractors. Two type of shields are the bolt-on and the flipup. It was believed that the bolt-on shield was perceived as a nuisance by the operators and deliberately removed, but the flip-up shield is easily lifted for inspection and maintenance and may be left in place. In a study initiated by the National Safety Council of the U.S., a sample of older tractors with both types of shields was taken to see what proportion were removed. Of 183 tractors designed to have bolt-on shields, 35 had been removed. Of the 136 tractors with flip-up shields, 15 were removed. We wish to test the hypothesis H: pb = pf vs A: pb <> pf where pb and pf are the proportion of tractors with the bolt-on and flip-up shields removed, respectively. Find the corresponding p-value.
n <- 183
n.removed <- 35
m <- 136
m.removed <- 15
pb.estimator <- n.removed / n
pf.estimator <- m.removed / m
# Assumptions: actual p is about p.estimator.
pb <- pb.estimator
pf <- pf.estimator
SE <- sqrt(pb * (1 - pb) / n + pf * (1 - pf) / m)
p.value <- 2 * pnorm((pf.estimator-pb.estimator - 0) / SE)
p.value
## [1] 0.04081365
# Assumptions: the worst case possible.
pb <- 0.5
pf <- 0.5
SE <- sqrt(pb * (1 - pb) / n + pf * (1 - pf) / m)
p.value <- 2 * pnorm((pf.estimator-pb.estimator - 0) / SE)
p.value
## [1] 0.1526427
A large supermarket chain will increase its stock of bakery products if more than 20% of its customers are purchasers of bakery products. A random sample of 100 customers found 28% purchased bakery items. A 5% significance test is conducted to determine if the chain should increase its bakery stock. Should the supermarket chain increase its stock of bakery products?
n <- 100
p.estimator <- 0.28
threshold <- 0.2
# Assumptions: the worst case possible.
p <- 0.5
SE <- sqrt(p * (1 - p) / n)
p.value <- 1 - pnorm((p.estimator - threshold) / SE)
p.value
## [1] 0.05479929
There is little evidence that the supermarket chain should increase its stock of bakery products.
One criticism of reforestation efforts after timber harvesting is that too few of the seedlings survive. An experiment was conducted to assess if mulching the slash (limbs, roots, small branches, etc.) and leaving the mulch on the ground improves the survival rate compared to just leaving the slash on the ground. It is believed that mulching will cause the material to break down sooner and release the nutrients to the seedlings. A total of 500 seedlings were randomly assigned to the two treatments and the two year survival rate was measured. Of the 250 seedling receiving the “mulching” treatment, 75 survived; of the 250 seedlings receiving the “control” treatment, 55 survived. Do these data support the criticism point stated in the problem?
m <- 250; c <- 250
m.survived <- 75; c.survived <- 55
pm.estimator <- m.survived / m
pc.estimator <- c.survived / c
# Assumptions: actual p is about p.estimator.
pm <- pm.estimator
pc <- pc.estimator
SE <- sqrt(pm * (1 - pm) / m + pc * (1 - pc) / c)
p.value <- 1 - pnorm((pm.estimator-pc.estimator - 0 ) / SE)
p.value
## [1] 0.02029713
# Assumptions: the worst case possible.
pm <- 0.5; pc <- 0.5
SE <- sqrt(pm * (1 - pm) / m + pc * (1 - pc) / c)
p.value <- 1 - pnorm((pm.estimator-pc.estimator - 0 ) / SE)
p.value
## [1] 0.03681914
Нет?
DDT is an insecticide that accumulates up the food chain. Predator birds can be contaminated with quite high levels of the chemical by eating many lightly contaminated preys. One effect of DDT upon birds is to inhibit the production of the enzyme carbonic anhydrase which controls calcium metabolism. It is believed that this causes egg shells to be thinner and weaker than normal and makes the eggs more prone to breakage. (This is one of reasons why the condor in California is near extinction.) An experiment was conducted where 16 sparrow hawks were fed a mixture of 3 ppm dieldrin and 15 ppm DDT (a combination often found in contaminated prey). The first egg laid by each bird was measured and the mean shell thickness was found to be 0.19 mm with a standard deviation of 0.01 mm. A normal egg shell is assumed to have a mean thickness of 0.2 mm. Carry out an appropriate statistical test.
n <- 16
sample.mean <- 0.19
s <- 0.01
mu <- 0.2
SE <- s / sqrt(n)
p.value <- pt((sample.mean - mu) / SE, df=n-1)
p.value
## [1] 0.0005796584
Resting pulse rate is an important measure of the fitness of a person’s cardiovascular system with a lower rate indicative of greater fitness. The mean pulse rate for all adult males is approximately 72 beats per minute. A random sample of 25 male students currently enrolled in the Faculty of Agriculture and now taking 5.211 was selected and the mean pulse resting pulse rate was found to be 80 beats per minute with a standard deviation of 20 beats per minute. The experimenter wishes to test if the students are less fit, on average, than the general population. Help her to perform an appropriate statistical test.
mu <- 72
n <- 25
sample.mean <- 80
s <- 20
SE <- s / sqrt(n)
p.value <- 1 - pt((sample.mean - mu) / SE, df=n-1)
p.value
## [1] 0.02846992
A 95% confidence interval for μ is calculated to be (1.7, 3.5). It is now decided to test the hypothesis H0 : μ = 0 vs HA:μ <> 0 at the 5% significance level, using the same data as was used to construct the confidence interval.
Нет.Нет.Нет. Для малых n sigma может быть известным, а ещё мы можем испоьзовать t-распределение. Однако n мы не знаем, поэтому не сможем определить степени свободы t-распределения.Ага.confidence.interval <- c(1.7, 3.5)
sample.mean <- mean(confidence.interval) # mu = sample.mean +- z * sigma/sqrt(n)
within.range <- confidence.interval[2] - sample.mean
SE <- within.range / qnorm(1 - 0.025)
p.value <- 1 - pnorm((sample.mean - 0)/ SE)
p.value
## [1] 7.47579e-09
Нет.We wish to test if a new feed increases the mean weight gain compared to an old feed. At the conclusion of the experiment it was found that the new feed gave a 10 kg bigger gain than the old feed. A two-sample t-test with the proper one-sided alternative was done and the resulting p-value was .082. This means:
Нет.Ага.Нет.Нет.Нет.In order to compare two kinds of feed, thirteen pigs are split into two groups, and each group received one feed. The following are the gains in weight (kilograms) after a fixed period of time:
feedA <- c(8.0, 7.4, 5.8, 6.2, 8.8, 9.5, NA)
feedB <- c(12.0, 18.2, 8.0, 9.6, 8.2, 9.9, 10.3)
problem21.table <- matrix(c(feedA, feedB), nrow=2, byrow=T)
problem21.table
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 8 7.4 5.8 6.2 8.8 9.5 NA
## [2,] 12 18.2 8.0 9.6 8.2 9.9 10.3
Find the p-value for the corresponding statistical test. What is your conclusion?
feedA <- feedA[!is.na(feedA)]
lenA <- length(feedA)
lenB <- length(feedB)
sample.mean.A <- mean(feedA); sample.mean.B <- mean(feedB)
SE <- sqrt((sd(feedA))^2 / lenA + (sd(feedB))^2 / lenB)
p.value <- 1 - pt((sample.mean.B-sample.mean.A - 0) / SE, df=min(lenA, lenB)-1)
p.value
## [1] 0.03667544
На уровне значимости 5% есть основания полагать, что второй корм даёт больший прирост в весе, чем первый.
Each person in a random sample of 50 was asked to state his/her sex and preferred color. The resulting frequencies are shown below.
male <- c(5, 14, 6)
female <- c(15, 6, 4)
problem22.table <- data.frame(rbind(male, female))
colnames(problem22.table) <- c("red", "blue", "green")
problem22.table
## red blue green
## male 5 14 6
## female 15 6 4
Carry out a chi-square test is used to test the null hypothesis that sex and preferred color are independent. What other test could you suggest for this problem?
sumup <- margin.table
O <- as.table(as.matrix(problem22.table))
E <- sumup(O, 1) %*% t(sumup(O, 2)) / sumup(O)
E # Expected
## red blue green
## male 10 10 5
## female 10 10 5
chi <- sum((E - O)^2 / E)
chi.df <- prod(dim(problem22.table) - 1) # df = (n-1)*(m-1)
p.value <- 1 - pchisq(chi, df=chi.df)
p.value
## [1] 0.01356856
Fisher's exact test.
An educator plans to evaluate two different teaching methods by comparing the average tests cores of two classes of students, each taught by one of the two techniques under study. Suppose the two classes have 35 and 38 students, respectively, with standard deviations in tests scores of 24.5 and 21.2, respectively. Is an observed difference in average test scores of 2.4 significant?
n <- 35; m <- 38
sd.n <- 24.5; sd.m <- 21.2
sample.mean <- 2.4
SE <- sqrt(sd.n^2 / n + sd.m^2 / m)
p.value <- 1 - pt((sample.mean - 0) / SE, df=min(n, m))
p.value
## [1] 0.3292284
Нет (на уровне значиомости 5%).
A highway superintendent states that five bridges into a city are used in the ratio 2:3:3:4:6 during the morning rush hour. A highway study of a random sample of 6000 cars indicates that 720, 970, 1013, 1380, and 1917 cars use the five bridges, respectively. At what significance level can the superintendent’s claim be rejected?
counts.o <- c(720, 970, 1013, 1380, 1917)
ratio.e <- c(2, 3, 3, 4, 6)
counts.e <- sum(counts.o) / sum(ratio.e) * ratio.e
counts.e # Expected
## [1] 666.6667 1000.0000 1000.0000 1333.3333 2000.0000
chi <- sum((counts.e - counts.o)^2 / counts.e)
chi.df <- length(counts.o) - 1 # df = n - 1
p.value <- 1 - pchisq(chi, df=chi.df)
p.value
## [1] 0.03400959
The superintendent's claim can be rejected at 1% significance level.
An author of a new book claims that anyone following his suggested diet program will lose an average of 2.9 pounds per week. A researcher believes that the true figure will be lower and plans a test involving random sample of 36 overweight people. She will reject the author’s claim if the mean weight loss in the volunteer group is less than 2.5 pounds per week. Assume that the standard deviation among individuals is 1.2 pounds per week. If the true mean value is 2.4 pounds per week, what is the probability that the researcher will mistakenly fail to reject the author’s false claim of 2.8 pounds?
n <- 36
s <- 1.2
sample.mean <- 2.5
mu <- 2.4
SE <- s / sqrt(n)
reject.prob <- 1 - pt((sample.mean - mu) / SE, df=n-1)
reject.prob
## [1] 0.3101022
A telephone company’s records indicate that private customers pay on average 17.10 per month for long-distance telephone calls. A random sample of 10 customers’ bills during a given month produced a sample mean of 22.10 expended for long-distance calls and a sample variance of 45. A 5% significance test is to be performed to determine if the mean level of billing for long distance calls per month is in excess of 17.10. The calculated value of the test statistic and the critical value respectively are:
Ага.mu <- 17.10
n <- 10
sample.mean <- 22.10
s <- sqrt(45)
SE <- s / sqrt(n)
test.statistics <- (sample.mean - mu) / SE
critical.value <- qt(1-0.05, df=n-1)
c(test.statistics, critical.value)
## [1] 2.357023 1.833113
Some varieties of nematodes feed on the roots of lawn grasses and crops such as strawberries and tomatoes. The pest, which is particularly troublesome in warm climates, can be treated by the application of nematocides. However, because of the size of the worms, it is very difficult to count them directly. Hence, the yield of a crop is used as a surrogate for the number of worms. Four brands of nematocides are to be compared. Twelve plots of land of comparable fertility that were suffering from nematodes were planted with a crop. Each nematocide was applied to three plots; the assignment of the nematocide to the plot was made at random. At harvest time, the yields of each plot were recorded and part of the ANOVA table appears below:
n <- 4; m <- 3;
df.E <- 8; df.T <- 11
SSA <- 3.456; SSE <- 1.200; SST <- 4.656
Complete the ANOVA table and carry out the corresponding statistical test. What is your conclusion?
df.X <- df.T - df.E # df.X <- n - 1
df.X
## [1] 3
MSA <- SSA / (n - 1)
MSA
## [1] 1.152
MSE <- SSE / (m * n - n)
MSE
## [1] 0.15
F <- MSA / MSE
F
## [1] 7.68
p.value <- 1 - pf(F, df.X, df.E)
p.value
## [1] 0.009671289
Suppose that we have grown one bacterium in broth culture at 3 different pH levels at 4 different temperatures. We have 12 flasks in all, but no replicates.
problem28.table <- data.frame(rbind(c(25, 10, 19, 40), c(30, 15, 25, 45), c(35, 20, 30, 55), c(40, 15, 22, 40)))
colnames(problem28.table) <- c('temp', 'pH 5.5', 'pH 6.5', 'pH 7.5')
problem28.table
## temp pH 5.5 pH 6.5 pH 7.5
## 1 25 10 19 40
## 2 30 15 25 45
## 3 35 20 30 55
## 4 40 15 22 40
Construct the ANOVA table and carry out the corresponding statistical tests. What is your conclusion?
observed <- as.matrix(problem28.table[-1])
rownames(observed) <- c("t 25", "t 30", "t 35", "t 40")
n <- nrow(observed); m <- ncol(observed)
observed <- cbind(observed, apply(observed, 1, mean))
observed <- rbind(observed, apply(observed, 2, mean))
rownames(observed)[5] <- "mean"; colnames(observed)[4] <- "mean"
mean.n <- observed[,m+1]; mean.m <- observed[n+1,]
mean.global <- mean.n[n+1]
observed
## pH 5.5 pH 6.5 pH 7.5 mean
## t 25 10 19 40 23.00000
## t 30 15 25 45 28.33333
## t 35 20 30 55 35.00000
## t 40 15 22 40 25.66667
## mean 15 24 45 28.00000
expected <- t(apply(observed[-n-1,], 1, function(x) mean.m + x[4] - mean.global))
expected <- rbind(expected, mean.m)
rownames(expected)[n+1] <- "mean"
expected
## pH 5.5 pH 6.5 pH 7.5 mean
## t 25 10.00000 19.00000 40.00000 23.00000
## t 30 15.33333 24.33333 45.33333 28.33333
## t 35 22.00000 31.00000 52.00000 35.00000
## t 40 12.66667 21.66667 42.66667 25.66667
## mean 15.00000 24.00000 45.00000 28.00000
SST <- sum((observed[-n-1,-m-1] - mean.global)^2)
SSA <- m * sum((mean.n[-n-1] - mean.global)^2)
SSB <- n * sum((mean.m[-m-1] - mean.global)^2)
SSE <- SST - SSA - SSB
round(c(SST, SSA, SSB, SSE), 1)
## [1] 2162.0 238.7 1896.0 27.3
MSE <- SSE / ((n * m - 1) - (n - 1) - (m - 1))
MSA <- SSA / (n - 1)
MSB <- SSB / (m - 1)
round(c(MSA, MSB, MSE), 1)
## [1] 79.6 948.0 4.6
F.A <- MSA / MSE
F.B <- MSB / MSE
round(c(F.A, F.B), 1)
## [1] 17.5 208.1
c(1-pf(F.A, n-1, n*m-n-m+1), 1-pf(F.B, m-1, n*m-n-m+1))
## [1] 2.280032e-03 2.870213e-06
table28 <- data.frame(as.character(rep(problem28.table[,1], each=3)), rep(c("pH 5.5", "pH 6.5", "pH 7.5"), 4), as.vector(unlist(t(problem28.table[,-1]))), stringsAsFactors = F)
colnames(table28)<-c("temp", "pH", "counts")
summary(aov(counts ~ temp+pH, data=table28))
## Df Sum Sq Mean Sq F value Pr(>F)
## temp 3 238.7 79.6 17.46 0.00228 **
## pH 2 1896.0 948.0 208.10 2.87e-06 ***
## Residuals 6 27.3 4.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following information was obtained from the manager of a city water department for predicting the consumption of water (in gallons) from the size of household. [ Household Size (x) versus Water Used (y), 10 observations, values are not shown] Here are the summary statistics:
sum.x <- 57
sum.y <- 10038
sum.x2 <- 433
sum.y2 <- 11641474
sum.xy <- 67669
n <- 10
slope <- (sum.xy - sum.x * sum.y / n) / (sum.x2 - sum.x^2 / n)
intercept <- (sum.y - slope * sum.x) / n
sprintf("y = %.2f * x + %.2f", slope, intercept)
## [1] "y = 96.69 * x + 452.66"
slope.inverted <- (sum.xy - sum.x * sum.y / n) / (sum.y2 - (sum.y)^2 / n)
r <- sqrt(slope * slope.inverted) # b * b* = r^2
r
## [1] 0.8035266
sprintf("%.2f%% дисперсии Y объясняется линейной регрессией.", r^2 * 100)
## [1] "64.57% дисперсии Y объясняется линейной регрессией."
a <- intercept; b <- slope;
sigma.e <- sqrt((sum.y2 - 2 * (a * sum.y + b * sum.xy) + (n * a^2 + 2*a*b*sum.x + b^2 * sum.x2)) / (n - 2))
SE <- sigma.e / (sum.x2 - (sum.x)^2 / n)
t <- qt(0.05, df=n-2)
confidence.interval <- c(slope - t * SE, slope + t * SE)
confidence.interval
## [1] 101.22148 92.16243
sxx <- sum(sum.x2 - sum.x^2 / n) / (n - 1)
syy <- sum(sum.y2 - sum.y^2 / n) / (n - 1)
sxy <- sum(sum.xy - sum.x * sum.y / n) / (n - 1)
r <- sxy / sqrt(sxx * syy)
SSE <- (n - 1) * syy * (1 - r^2)
sigma.e <- sqrt(SSE / (n - 2)) # This line and above: just a clear way to calculate the common values
x0 <- 10
t <- qt(1-0.05, df=n-2)
within.range <- t * sigma.e * sqrt(1 / n + (x0 - sum.x / n)^2 / (sum.x2 - sum.x^2 / n) + 1)
c(a + b * x0 - within.range, a + b * x0 + within.range)
## [1] 867.5501 1971.6007
?
A research group was interested in predicting the number of bus riders per capita in census districts. They felt that the rider-ship per capita, Y , could be predicted using the average income, X, for the census district. A sample of 29 census districts were taken and the observations on the samples were used to obtain n = 29, mean(Y) = 62.1429, mean(X) = 3452.178; Sum(X − mean(X))(Y − mean(Y)) = 189312.0; Sum(X − mean(X))2= 19910691.0; Sum(Y − mean(Y))2 = 13369.381; MSE = 428.5 Based on this data, find a 98% confidence interval for the slope of the regression line.
n <- 29
mean.x <- 3452.178; mean.y <- 62.1429
sxx <- 19910691.0 / (n - 1)
syy <- 13369.381 / (n - 1)
sxy <- 189312.0 / (n - 1)
MSE <- 428.5
b <- sxy / sxx
SE <- sqrt(MSE / (sxx * (n - 1)))
t <- qt(1-0.01, df=n-2)
c(b - t * SE, b + t * SE)
## [1] -0.00196282 0.02097894
The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
height <- c(1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83)
weight <- c(62.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46)
n <- length(height)
mean.height <- sum(height) / n
mean.weight <- sum(weight) / n
sxx <- sum((height - mean.height)^2) / (n - 1)
syy <- sum((weight - mean.weight)^2) / (n - 1)
sxy <- sum((weight - mean.weight) * (height - mean.height)) / (n - 1)
b <- sxy / sxx
a <- mean.weight - b * mean.height
sprintf("y = %.2f + %.2f * x", a, b)
## [1] "y = -22.07 + 51.38 * x"
lm(weight ~ height)
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -22.07 51.38
r <- sxy / sqrt(sxx * syy)
r^2
## [1] 0.8184522
plot(height, weight)
abline(lm(weight ~ height))
По графику видно, что есть маленькая и очень тяжёлая женщина.
# log(y) = a + b * log(x) + e
plot(log(height), log(weight))
abline(lm(log(weight) ~ log(height)))
# y = a + b * log(x) + e
plot(log(height), weight)
abline(lm(weight ~ log(height)))
# log(y) = a + b * x + e
plot(height, log(weight))
abline(lm(log(weight) ~ height))
Two methods are used to measure the business acumen expressed by heroines of romance novels. The mean of method 1 is 88.6 with a variance of 109.63 in a sample of size 41. The mean of method 2 is 85.1 with a variance of 65.99 in a sample of size 21. Before you test if the means of these two methods are the same, you should test if they possess homogeneous variances. Find the corresponding p-value.
mean.1 <- 88.6; mean.2 <- 85.1
var.1 <- 109.63; var.2 <- 65.99
n.1 <- 41; n.2 <- 21
# Test for homogeneous variances
p.value.homogeneous <- 1 - pf((var.1 / var.2), df1=n.1-1, df2=n.2-1)
p.value.homogeneous
## [1] 0.1121884
# Different veriances
SE <- sqrt(var.1 / n.1 + var.2 / n.2)
p.value <- 1 - pt((mean.1 - mean.2 - 0) / SE, df=min(n.1, n.2)-1)
p.value
## [1] 0.08110655
# If the variances were homogeneous
SE <- sqrt(1 / n.1 + 1 / n.2) * sqrt(((n.1 - 1) * var.1 + (n.2 - 1) * var.2) / (n.1 + n.2 - 2))
p.value <- 1 - pt((mean.1 - mean.2 - 0) / SE, df=n.1+n.2-2)
p.value
## [1] 0.09303994