Mariam Ishaque
1.Analysis of Variance (one-way) In a one-way analysis of variance we are trying to find evidence that a single independent variable influences the results for a dependent variable. Suppose, for example, that we are evaluating a multivitamin tablet for iron to see if there is too much variability between individual tablets. Taking three tablets and analyzing each four times, we obtain the following results in mg Fe/g tablet. Use the data listed in Table A to perform an ANOVA calculation using R. In this example the different tablets are the independent variable and the dependent variable is the concentration of iron.
a.Put the data in a data frame containing two columns, one an index for the independent variable and the other for the dependent variable’s values.
b.Perform the analysis of variance, listing degrees of freedom, residuals, F value and p-value.
A<-c(5.67,5.67,5.55,5.57)
B<-c(5.75,5.47,5.43,5.45)
C<-c(4.74,4.45,4.65,4.94)
tablets=data.frame(A,B,C)
tabs=stack(tablets)
tabs
## values ind
## 1 5.67 A
## 2 5.67 A
## 3 5.55 A
## 4 5.57 A
## 5 5.75 B
## 6 5.47 B
## 7 5.43 B
## 8 5.45 B
## 9 4.74 C
## 10 4.45 C
## 11 4.65 C
## 12 4.94 C
anova(lm(values~ind,data=tabs))
## Analysis of Variance Table
##
## Response: values
## Df Sum Sq Mean Sq F value Pr(>F)
## ind 2 2.058 1.029 45.2 2e-05 ***
## Residuals 9 0.205 0.023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#With p = 2e-05 (< 0.05), there is a significant difference in iron concentration between the three tablets.
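As an optional follow-up (not part of the assigned question), a Tukey HSD comparison on the same one-way model can suggest which pairs of tablets differ; the sketch below refits the model with aov(), since TukeyHSD() expects an aov object.
#Optional post-hoc pairwise comparisons between tablets
tabs.aov<-aov(values~ind,data=tabs)
TukeyHSD(tabs.aov)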
2.Analysis of Variance (two-way) The data for the analysis of iron in a multivitamin has another independent variable, the acid used to dissolve the tablet, listed in the Table below. For example, the first two replicates for each tablet might have been obtained by dissolving the tablet in HCl and the third and fourth replicate by dissolving the tablet in HNO3. With two independent variables, we need to examine the effect of any difference between tablets and between acids, as well as any interaction between the acids and the tablets.
a.To accomplish a two-way analysis of variance, construct a data frame containing three columns: one for the dependent variable’s values, one providing an index for the tablets, and one providing an index for the acid.
b.Perform a two-way ANOVA, listing degrees of freedom, residuals, F values and p-values. State whether any of the factors are significant and whether or not there is a significant interaction between the factors.
A<-c(5.67,5.67,5.55,5.57)
B<-c(5.75,5.47,5.43,5.45)
C<-c(4.74,4.45,4.65,4.94)
tablets=data.frame(A,B,C)
tabs=stack(tablets)
acid<-c("HCl","HCl","HNO3","HNO3","HCl","HCl","HNO3","HNO3","HCl","HCl","HNO3","HNO3")
tabletsacid=data.frame(tabs,acid)
tabletsacid
## values ind acid
## 1 5.67 A HCl
## 2 5.67 A HCl
## 3 5.55 A HNO3
## 4 5.57 A HNO3
## 5 5.75 B HCl
## 6 5.47 B HCl
## 7 5.43 B HNO3
## 8 5.45 B HNO3
## 9 4.74 C HCl
## 10 4.45 C HCl
## 11 4.65 C HNO3
## 12 4.94 C HNO3
anova(lm(values~ind*acid,data=tabletsacid))
## Analysis of Variance Table
##
## Response: values
## Df Sum Sq Mean Sq F value Pr(>F)
## ind 2 2.058 1.029 49.91 0.00018 ***
## acid 1 0.002 0.002 0.10 0.75861
## ind:acid 2 0.079 0.039 1.91 0.22772
## Residuals 6 0.124 0.021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Only the tablet type is a significant factor. There is no significant effect from the acid type, and there is no significant interaction between the tablets and the acid.
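As an optional visual check (not required by the problem), an interaction plot of the cell means can illustrate the lack of a strong tablet-by-acid interaction; this is a sketch using base R's interaction.plot().
#Roughly parallel lines are consistent with the non-significant tablet:acid interaction above
interaction.plot(tabletsacid$acid,tabletsacid$ind,tabletsacid$values,xlab="Acid",ylab="Mean Fe (mg/g tablet)",trace.label="Tablet")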
3.Sample Size Estimation
In a pilot study that evaluated the standard uptake value (SUV) of 18F-fluorodeoxyglucose in breasts, classified as dense and fatty, the following values were determined. Assume that the sample sizes are equal, α=0.05, and β=0.20:
Table C
          Right Breast SUV      Left Breast SUV
          Mean     SD           Mean     SD
Dense     0.39     0.05         0.36     0.07
Fatty     0.32     0.10         0.31     0.08
a.Estimate the total number of subjects (N) required to determine whether the SUV in fatty right breasts is significantly different from the SUV in dense right breasts.
pooledSD<-sqrt((0.05^2 + 0.1^2)/2)
pooledSD
## [1] 0.07906
library(pwr)
pwr.t.test(d=(0.39-0.32)/pooledSD,power=0.8,sig.level=0.05,type="two.sample",alternative="two.sided")
##
## Two-sample t test power calculation
##
## n = 21.03
## d = 0.8854
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
#Rounding n = 21.03 up to the next whole subject, approximately 22 subjects are needed per group, or N = 44 total.
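The same estimate can be cross-checked with base R's power.t.test(), which takes the raw mean difference (delta) and the SD instead of Cohen's d; this sketch reuses the pooled SD computed above.
#Cross-check: delta/sd here equals the d supplied to pwr.t.test() above
power.t.test(delta=0.39-0.32,sd=pooledSD,power=0.8,sig.level=0.05,type="two.sample",alternative="two.sided")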
b.What would N have to be to adequately determine whether the SUV in dense right breasts is significantly different from the SUV in dense left breasts?
pooledSD<-sqrt((0.05^2 + 0.07^2)/2)
pooledSD
## [1] 0.06083
library(pwr)
pwr.t.test(d=(0.39-0.36)/pooledSD,power=0.8,sig.level=0.05,type="two.sample",alternative="two.sided")
##
## Two-sample t test power calculation
##
## n = 65.51
## d = 0.4932
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
#Rounding n = 65.51 up to the next whole subject, approximately 66 subjects are needed per group, or N = 132 total.
4.Chi-Square Goodness of Fit:
An unusual distribution of blood groups in patients undergoing one type of surgical procedure is suspected. The expected distribution for the population served by the hospital which performs this surgery is known to be 44% group O, 45% group A, 8% group B and 3% group AB. If a random sample of routine pre-operative blood grouping results is taken and compared with the expected distribution, how would you analyze these data? Perform the analysis and explain your results.
Blood Group   Observed
O             67
A             83
B             29
AB             8
Obs<-c(67,83,29,8)
Exp<-c(.44,.45,.08,.03)
chisq.test(Obs,p=Exp,rescale.p=T)
##
## Chi-squared test for given probabilities
##
## data: Obs
## X-squared = 17.05, df = 3, p-value = 0.0006908
#You would reject the null hypothesis with a p value of 0.0006908. There is a statistically significant difference between the distribution of blood groups from patients undergoing this surgical procedure and those expected from a random sample of the general population.
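As a quick validity check (not asked for in the problem), the expected counts under the null hypothesis can be pulled from the chisq.test object to confirm the approximation is reasonable.
#Expected counts under the null; the smallest is 187*0.03 = 5.61, so all cells are at least 5
chisq.test(Obs,p=Exp,rescale.p=T)$expected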
5.Chi Square - Independence
A manufacturer was considering marketing crackers high in a certain kind of edible fiber as a dieting aid. Dieters would consume some crackers before a meal, filling their stomachs so that they would feel less hungry and eat less. A laboratory studied whether people would in fact eat less in this way.
Overweight female subjects ate crackers with different types of fiber (bran fiber, gum fiber, both, and a control cracker) and were then allowed to eat as much as they wished from a prepared menu. The amount of food they consumed and their weight were monitored, along with any side effects they reported. Unfortunately, some subjects developed uncomfortable bloating and gastric upset from some of the fiber crackers. A contingency table of “Cracker” versus “Bloat” shows the relationship between the four different types of cracker and the four levels of severity of bloating as reported by the subjects.
Perform a Chi-Square test on the data in TABLE E to determine whether Bloating is independent of Cracker (the type of fiber eaten). Explain your results.
#Table E
cracker<-matrix(c(0,2,0,5,4,5,4,2,1,3,2,3,7,2,6,2),nrow=4)
rownames(cracker)<-c("bran","combo","control","gum")
colnames(cracker)<-c("high","low","medium","none")
cracker
## high low medium none
## bran 0 4 1 7
## combo 2 5 3 2
## control 0 4 2 6
## gum 5 2 3 2
chisq.test(cracker)
## Warning: Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cracker
## X-squared = 16.94, df = 9, p-value = 0.04962
#With p = 0.0496 (< 0.05), bloating and cracker (type of fiber) appear to be dependent, although the warning indicates that the chi-squared approximation may be unreliable here because several expected cell counts are small.
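Given the warning about the chi-squared approximation, one possible follow-up (a sketch, not part of the required answer) is Fisher's exact test on the same contingency table.
#Fisher's exact test avoids the large-sample approximation flagged by the warning above
fisher.test(cracker)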
6.Linear Regression.
With the rmr dataset, plot metabolic rate versus body weight. Fit a linear regression model to the relation. According to the fitted model, what is the predicted metabolic rate for a body weight of 70 kg? Give a 95% confidence interval for the slope of the line.
library(ISwR)
attach(rmr)
plot(body.weight,metabolic.rate, main="Metabolic Rate vs Body Weight")
abline(lm(metabolic.rate~body.weight))
fit<-lm(metabolic.rate~body.weight,data=rmr)
summary(fit)
##
## Call:
## lm(formula = metabolic.rate ~ body.weight, data = rmr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -245.7 -114.0 -32.1 105.0 484.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 811.227 76.976 10.54 2.3e-13 ***
## body.weight 7.060 0.978 7.22 7.0e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 158 on 42 degrees of freedom
## Multiple R-squared: 0.554, Adjusted R-squared: 0.543
## F-statistic: 52.1 on 1 and 42 DF, p-value: 7.03e-09
predict(fit,newdata=data.frame(body.weight=70))
## 1
## 1305
qt(.975,42)
## [1] 2.018
confint(fit,"body.weight",level=0.95)
## 2.5 % 97.5 %
## body.weight 5.087 9.032
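If an interval around the predicted rate at 70 kg is also of interest, predict() can return one directly; a minimal sketch using the same fitted model:
#95% confidence interval for the mean metabolic rate at body.weight = 70 kg
predict(fit,newdata=data.frame(body.weight=70),interval="confidence")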
7.Linear Regression. In the juul data set, fit a linear regression model for the square root of the IGF-I concentration versus age to the group of subjects over 25 years old.
library(ISwR)
attach(juul)
summary(juul)
## age menarche sex igf1 tanner
## Min. : 0.17 Min. :1.0 Min. :1.00 Min. : 25 Min. :1.00
## 1st Qu.: 9.05 1st Qu.:1.0 1st Qu.:1.00 1st Qu.:202 1st Qu.:1.00
## Median :12.56 Median :1.0 Median :2.00 Median :314 Median :2.00
## Mean :15.10 Mean :1.5 Mean :1.53 Mean :340 Mean :2.64
## 3rd Qu.:16.86 3rd Qu.:2.0 3rd Qu.:2.00 3rd Qu.:463 3rd Qu.:5.00
## Max. :83.00 Max. :2.0 Max. :2.00 Max. :915 Max. :5.00
## NA's :5 NA's :635 NA's :5 NA's :321 NA's :240
## testvol
## Min. : 1.0
## 1st Qu.: 1.0
## Median : 3.0
## Mean : 7.9
## 3rd Qu.:15.0
## Max. :30.0
## NA's :859
summary(lm(sqrt(igf1)~age,data=juul,subset=age>25))
##
## Call:
## lm(formula = sqrt(igf1) ~ age, data = juul, subset = age > 25)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.864 -1.166 0.102 0.945 4.114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.7103 0.4946 37.83 <2e-16 ***
## age -0.1053 0.0107 -9.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.74 on 120 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.446, Adjusted R-squared: 0.441
## F-statistic: 96.6 on 1 and 120 DF, p-value: <2e-16
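A plot of the fitted relation (not explicitly requested) could look like the sketch below, restricted to subjects over 25 years old; juul25 is a name introduced here for the subset.
#Scatter plot of sqrt(IGF-I) versus age for subjects over 25, with the fitted regression line
juul25<-subset(juul,age>25)
plot(sqrt(igf1)~age,data=juul25,main="sqrt(IGF-I) vs Age (Age > 25)",xlab="Age (years)",ylab="sqrt(IGF-I)")
abline(lm(sqrt(igf1)~age,data=juul25))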
8.In the malaria data set, analyze the log-transformed antibody level versus age. Make a plot of the relation. Do you notice anything peculiar?
library(ISwR)
attach(malaria)
## The following object is masked from juul:
##
## age
summary(malaria)
## subject age ab mal
## Min. : 1.0 Min. : 3.00 Min. : 2 Min. :0.00
## 1st Qu.: 25.8 1st Qu.: 5.75 1st Qu.: 29 1st Qu.:0.00
## Median : 50.5 Median : 9.00 Median : 111 Median :0.00
## Mean : 50.5 Mean : 8.86 Mean : 312 Mean :0.27
## 3rd Qu.: 75.2 3rd Qu.:12.00 3rd Qu.: 374 3rd Qu.:1.00
## Max. :100.0 Max. :15.00 Max. :2066 Max. :1.00
summary(lm(log(ab)~age,data=malaria))
##
## Call:
## lm(formula = log(ab) ~ age, data = malaria)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.075 -1.062 0.118 1.101 2.733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8370 0.3802 10.09 <2e-16 ***
## age 0.1035 0.0395 2.62 0.01 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.48 on 98 degrees of freedom
## Multiple R-squared: 0.0654, Adjusted R-squared: 0.0558
## F-statistic: 6.85 on 1 and 98 DF, p-value: 0.0103
plot(log(ab)~age,data=malaria, main="Log(Antibody Level) vs Age - Malaria", ylab="Log(Antibody Level)", xlab="Age")
#The graph of log(Antibody level) vs Age appears to show a cyclic relationship between the two variables in the Malaria dataset. There seem to be spikes approximately around ages 4, 6, and 11. This could be due to the inherent nature of the disease.
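One way to make the pattern easier to see (a sketch, assuming the plot above is still the active graphics device) is to overlay a lowess smoother on the scatter plot.
#Overlay a lowess smoother to highlight the non-linear pattern noted above
lines(lowess(malaria$age,log(malaria$ab)),col=2)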
9.Perform the following functions on the data in TABLE F.
a.Make a histogram of each variable and comment on its apparent normality.
esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
esteem
## Subject Height Self.Esteem
## 1 1 60 3.6
## 2 2 62 4.1
## 3 3 60 4.6
## 4 4 69 3.8
## 5 5 58 4.4
## 6 6 75 3.2
## 7 7 71 3.1
## 8 8 65 3.8
## 9 9 63 4.1
## 10 10 60 4.3
## 11 11 61 3.7
## 12 12 62 3.5
## 13 13 63 3.2
## 14 14 68 3.7
## 15 15 68 3.3
## 16 16 71 3.4
## 17 17 67 4.1
## 18 18 63 3.8
## 19 19 67 3.4
## 20 20 67 4.0
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'juul':
##
## menarche
attach(esteem)
truehist(Height, main="Height", xlab="Height (inches)")
curve(dnorm(x,mean=mean(Height),sd=sd(Height)),add=T)
truehist(Self.Esteem, main="Self Esteem", xlab= "Self Esteem (1-5 scale)", ymax=1.0, xlim=c(1,5))
curve(dnorm(x,mean=mean(Self.Esteem),sd=sd(Self.Esteem)),add=T)
#They both appear to be more or less normal. Self esteem does not appear to be as normally distributed as height.
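A formal complement to the visual check (not required by the problem) is the Shapiro-Wilk test; a minimal sketch on the same attached variables:
#Shapiro-Wilk normality tests; p > 0.05 would be consistent with approximate normality
shapiro.test(Height)
shapiro.test(Self.Esteem)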
b.Perform a Pearson Product Moment Correlation (r) and determine its significance.
esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
##
## Height, Self.Esteem, Subject
cor.test(Height,Self.Esteem)
##
## Pearson's product-moment correlation
##
## data: Height and Self.Esteem
## t = -3.405, df = 18, p-value = 0.003154
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8367 -0.2537
## sample estimates:
## cor
## -0.626
#With r = -0.626 and p = 0.003, there is a significant negative correlation between height and self esteem.
c.Print out a scatter plot with regression line
esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
##
## Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 4):
##
## Height, Self.Esteem, Subject
plot(Height,Self.Esteem, main="Self Esteem vs Height", ylab="Self Esteem (1-5 scale)", xlab="Height (inches)", ylim=c(1,5))
abline(lm(Self.Esteem~Height), col=2)
plot(Height,Self.Esteem, main="Self Esteem vs Height", ylab="Self Esteem (1-5 scale)", xlab="Height (inches)")
abline(lm(Self.Esteem~Height), col=2)
10.Perform the following functions on the data in TABLE F.
a.Add the confidence bands and prediction bands to the scatter plot in Problem 9.
esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
##
## Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 4):
##
## Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 5):
##
## Height, Self.Esteem, Subject
lm.esteem<-lm(Self.Esteem~Height)
pred.frame<-data.frame(Height=58:75)
pp<-predict(lm.esteem,int="p",newdata=pred.frame)
pc<-predict(lm.esteem,int="c",newdata=pred.frame)
plot(Height,Self.Esteem,ylim=c(1,5), main="Self Esteem vs Height", xlab="Height (inches)", ylab= "Self Esteem (1-5 scale)")
pred.height<-pred.frame$Height
matlines(pred.height,pc,lty=c(1,2,2),col=1)
matlines(pred.height,pp,lty=c(1,3,3),col=2)
plot(Height,Self.Esteem, ylim=c(2.5,5), main="Self Esteem vs Height", xlab="Height (inches)", ylab= "Self Esteem (1-5 scale)")
matlines(pred.height,pc,lty=c(1,2,2),col=1)
matlines(pred.height,pp,lty=c(1,3,3),col=2)
b.Perform a Spearman rank Order Correlation (ρ) and determine its significance.
esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
##
## Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 4):
##
## Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 5):
##
## Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 6):
##
## Height, Self.Esteem, Subject
cor.test(Height,Self.Esteem, method="spearman")
## Warning: Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Height and Self.Esteem
## S = 2137, p-value = 0.004566
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.6067
#There appears to be a statistically significant dependency between the two variables. With rho = -0.6067, there is a decreasing monotonic trend between height and self esteem.
c.Choose which correlation is more appropriate, r or ρ and defend your position.
#I believe the Spearman rank correlation, a nonparametric method, is more appropriate in this case. The self esteem scores are ordinal ratings on a 1-5 scale, which suits a rank-based measure better than Pearson's r, and the relationship appears monotonic rather than strictly linear, which is all that Spearman's rho requires.