Mariam Ishaque

RADI 5007 - HW 3

1.Analysis of Variance (one-way) In a one-way analysis of variance we are trying to find evidence that a single independent variable influences the results for a dependent variable. Suppose, for example, that we are evaluating a multivitamin tablet for iron to see if there is too much variability between individual tablets. Taking three tablets and analyzing each four times, we obtain the following results in mg Fe/g tablet. Use the data listed in Table A to perform an ANOVA calculation using R. In this example the different tablets are the independent variable and the dependent variable is the concentration of iron.

a.Put the data in a data frame containing two columns, one an index for the independent variable and the other for the dependent variable’s values.

b.Perform the analysis of variance, listing degrees of freedom, residuals, F value and p-value.

A<-c(5.67,5.67,5.55,5.57)
B<-c(5.75,5.47,5.43,5.45)
C<-c(4.74,4.45,4.65,4.94)
tablets=data.frame(A,B,C)
tabs=stack(tablets)
tabs
##    values ind
## 1    5.67   A
## 2    5.67   A
## 3    5.55   A
## 4    5.57   A
## 5    5.75   B
## 6    5.47   B
## 7    5.43   B
## 8    5.45   B
## 9    4.74   C
## 10   4.45   C
## 11   4.65   C
## 12   4.94   C
anova(lm(values~ind,data=tabs))
## Analysis of Variance Table
## 
## Response: values
##           Df Sum Sq Mean Sq F value Pr(>F)    
## ind        2  2.058   1.029    45.2  2e-05 ***
## Residuals  9  0.205   0.023                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#There is a significant difference in iron concentration between the three tablets (F = 45.2, p = 2e-05). 
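As a follow-up that is not required by the problem, a Tukey HSD post-hoc comparison could show which tablet pairs differ; a minimal sketch using the same tabs data frame:

#Post-hoc pairwise comparisons (sketch only)
tabs.aov<-aov(values~ind,data=tabs)   #refit the same one-way model with aov() so TukeyHSD() applies
TukeyHSD(tabs.aov)                    #expect the C vs A and C vs B contrasts to drive the significant F value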

2.Analysis of Variance (two-way) The data for the analysis of iron in a multivitamin has another independent variable, the acid used to dissolve the tablet, listed in the Table below. For example, the first two replicates for each tablet might have been obtained by dissolving the tablet in HCl and the third and fourth replicate by dissolving the tablet in HNO3. With two independent variables, we need to examine the effect of any difference between tablets and between acids, as well as any interaction between the acids and the tablets.

a.To accomplish a two-way analysis of variance construct a data frame containing three columns, one for the dependent variable’s values, one providing an index for the tablets and one providing an index for the acid.

b.Perform a two-way ANOVA, listing degrees of freedom, residuals, F values and p-values. State whether any of the factors are significant and whether or not there is a significant interaction between the factors.

A<-c(5.67,5.67,5.55,5.57)
B<-c(5.75,5.47,5.43,5.45)
C<-c(4.74,4.45,4.65,4.94)
tablets=data.frame(A,B,C)
tabs=stack(tablets)
acid<-c("HCl","HCl","HNO3","HNO3","HCl","HCl","HNO3","HNO3","HCl","HCl","HNO3","HNO3")
tabletsacid=data.frame(tabs,acid)
tabletsacid
##    values ind acid
## 1    5.67   A  HCl
## 2    5.67   A  HCl
## 3    5.55   A HNO3
## 4    5.57   A HNO3
## 5    5.75   B  HCl
## 6    5.47   B  HCl
## 7    5.43   B HNO3
## 8    5.45   B HNO3
## 9    4.74   C  HCl
## 10   4.45   C  HCl
## 11   4.65   C HNO3
## 12   4.94   C HNO3
anova(lm(values~ind*acid,data=tabletsacid))
## Analysis of Variance Table
## 
## Response: values
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## ind        2  2.058   1.029   49.91 0.00018 ***
## acid       1  0.002   0.002    0.10 0.75861    
## ind:acid   2  0.079   0.039    1.91 0.22772    
## Residuals  6  0.124   0.021                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Only the tablet type is a significant factor. There is no significant effect from the acid type, and no significant interaction between the tablets and the acid. 
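To visualize why the interaction term is not significant, an interaction plot can be drawn from the same data frame (a sketch, not part of the required answer):

#Roughly parallel tablet profiles across the two acids are consistent with no interaction
with(tabletsacid,interaction.plot(ind,acid,values,xlab="Tablet",ylab="Mean Fe (mg/g)",trace.label="Acid"))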

3.Sample Size Estimation

In a pilot study that evaluated the standardized uptake value (SUV) of 18F-fluorodeoxyglucose in breasts classified as dense and fatty, the following values were determined. Assume that the sample sizes are equal, α = 0.05, and β = 0.20:

Table C
        Right Breast SUV     Left Breast SUV
        Mean      SD         Mean      SD
Dense   0.39      0.05       0.36      0.07
Fatty   0.32      0.10       0.31      0.08

a.Estimate the total number of subjects (N) required to determine whether the SUV in fatty right breasts is significantly different from the SUV in dense right breasts.

pooledSD<-sqrt((0.05^2 + 0.1^2)/2)
pooledSD
## [1] 0.07906
library(pwr)
pwr.t.test(d=(0.39-0.32)/pooledSD,power=0.8,sig.level=0.05,type="two.sample",alternative="two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 21.03
##               d = 0.8854
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#Rounding up, n ≈ 22 per group, so N ≈ 44 subjects in total. 
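The pwr.t.test result can be cross-checked with the normal-approximation formula n ≈ 2*(z_(1-α/2)+z_(1-β))^2/d^2 per group (a rough sketch; it slightly underestimates the t-based answer):

#Normal-approximation sample size per group
d<-(0.39-0.32)/pooledSD
2*(qnorm(0.975)+qnorm(0.80))^2/d^2   #about 20 per group, consistent with pwr.t.test's 21.03 before rounding up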

b.What would N have to be to adequately determine whether the SUV in dense right breasts is significantly different from the SUV in dense left breasts?

pooledSD<-sqrt((0.05^2 + 0.07^2)/2)
pooledSD
## [1] 0.06083
library(pwr)
pwr.t.test(d=(0.39-0.36)/pooledSD,power=0.8,sig.level=0.05,type="two.sample",alternative="two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 65.51
##               d = 0.4932
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#Rounding up, n ≈ 66 per group, so N ≈ 132 subjects in total. 

4.Chi-Square Goodness of Fit:

An unusual distribution of blood groups in patients undergoing one type of surgical procedure is suspected. The expected distribution for the population served by the hospital which performs this surgery is known to be 44% group O, 45% group A, 8% group B and 3% group AB. If a random sample of routine pre-operative blood grouping results is taken and compared with the expected distribution, how would you analyze these data? Perform the analysis and explain your results.

Blood Group    O     A     B    AB
Observed      67    83    29     8

Obs<-c(67,83,29,8)
Exp<-c(.44,.45,.08,.03)
chisq.test(Obs,p=Exp,rescale.p=T)
## 
##  Chi-squared test for given probabilities
## 
## data:  Obs
## X-squared = 17.05, df = 3, p-value = 0.0006908
#You would reject the null hypothesis with a p-value of 0.0006908. There is a statistically significant difference between the distribution of blood groups in patients undergoing this surgical procedure and the distribution expected for the population served by the hospital. 
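For transparency, the expected counts and the chi-square statistic can also be computed by hand (a sketch using the same Obs and Exp vectors):

#Expected counts under the hospital population proportions
n<-sum(Obs)          #187 patients in the sample
E<-n*Exp             #expected counts for O, A, B, AB
E
sum((Obs-E)^2/E)     #chi-square statistic; should reproduce X-squared = 17.05 on 3 df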

5.Chi-Square - Independence

A manufacturer was considering marketing crackers high in a certain kind of edible fiber as a dieting aid. Dieters would consume some crackers before a meal, filling their stomachs so that they would feel less hungry and eat less. A laboratory studied whether people would in fact eat less in this way.

Overweight female subjects ate crackers with different types of fiber (bran fiber, gum fiber, both, and a control cracker) and were then allowed to eat as much as they wished from a prepared menu. The amount of food they consumed and their weight were monitored, along with any side effects they reported. Unfortunately, some subjects developed uncomfortable bloating and gastric upset from some of the fiber crackers. A contingency table of “Cracker” versus “Bloat” shows the relationship between the four different types of cracker and the four levels of severity of bloating as reported by the subjects.

Perform a Chi-Square test on the data in TABLE E to determine whether Bloating is independent of Cracker (the type of fiber eaten). Explain your results.

#Table E

cracker<-matrix(c(0,2,0,5,4,5,4,2,1,3,2,3,7,2,6,2),nrow=4)
rownames(cracker)<-c("bran","combo","control","gum")
colnames(cracker)<-c("high","low","medium","none")
cracker
##         high low medium none
## bran       0   4      1    7
## combo      2   5      3    2
## control    0   4      2    6
## gum        5   2      3    2
chisq.test(cracker)
## Warning: Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  cracker
## X-squared = 16.94, df = 9, p-value = 0.04962
#With p = 0.0496 (< 0.05), bloating and cracker (the type of fiber eaten) appear to be dependent, although the warning indicates that small expected counts make the chi-square approximation questionable.  
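The warning arises because several cells have small expected counts. One hedge against the shaky approximation is a simulated or exact p-value (a sketch, not required by the problem):

#Inspect the expected counts that trigger the warning
chisq.test(cracker)$expected
#Monte Carlo p-value avoids the asymptotic approximation
chisq.test(cracker,simulate.p.value=TRUE,B=10000)
#Fisher's exact test is another option for a small table
fisher.test(cracker)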

6.Linear Regression.

With the rmr dataset, plot metabolic rate versus body weight. Fit a linear regression model to the relation. According to the fitted model, what is the predicted metabolic rate for a body weight of 70 kg? Give a 95% confidence interval for the slope of the line.

library(ISwR)
attach(rmr)
plot(body.weight,metabolic.rate, main="Metabolic Rate vs Body Weight")
abline(lm(metabolic.rate~body.weight))

[Figure: Metabolic Rate vs Body Weight scatter plot with fitted regression line]

fit<-lm(metabolic.rate~body.weight,data=rmr)
summary(fit)
## 
## Call:
## lm(formula = metabolic.rate ~ body.weight, data = rmr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -245.7 -114.0  -32.1  105.0  484.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  811.227     76.976   10.54  2.3e-13 ***
## body.weight    7.060      0.978    7.22  7.0e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 158 on 42 degrees of freedom
## Multiple R-squared:  0.554,  Adjusted R-squared:  0.543 
## F-statistic: 52.1 on 1 and 42 DF,  p-value: 7.03e-09
predict(fit,newdata=data.frame(body.weight=70))
##    1 
## 1305
qt(.975,42)
## [1] 2.018
confint(fit,"body.weight",level=0.95)
##             2.5 % 97.5 %
## body.weight 5.087  9.032
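The qt() call above is the t quantile needed to build the slope interval by hand; completing that arithmetic (a sketch), along with an optional confidence interval for the fitted rate at 70 kg:

#Manual 95% CI for the slope: estimate +/- t(0.975, 42) * standard error
7.060+c(-1,1)*qt(.975,42)*0.978   #roughly 5.09 to 9.03, matching confint()
#Confidence interval around the predicted mean metabolic rate at 70 kg
predict(fit,newdata=data.frame(body.weight=70),interval="confidence")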

7.Linear Regression. In the juul data set, fit a linear regression model for the square root of the IGF-I concentration versus age to the group of subjects over 25 years old.

library(ISwR)
attach(juul)
summary(juul)
##       age           menarche        sex            igf1         tanner    
##  Min.   : 0.17   Min.   :1.0   Min.   :1.00   Min.   : 25   Min.   :1.00  
##  1st Qu.: 9.05   1st Qu.:1.0   1st Qu.:1.00   1st Qu.:202   1st Qu.:1.00  
##  Median :12.56   Median :1.0   Median :2.00   Median :314   Median :2.00  
##  Mean   :15.10   Mean   :1.5   Mean   :1.53   Mean   :340   Mean   :2.64  
##  3rd Qu.:16.86   3rd Qu.:2.0   3rd Qu.:2.00   3rd Qu.:463   3rd Qu.:5.00  
##  Max.   :83.00   Max.   :2.0   Max.   :2.00   Max.   :915   Max.   :5.00  
##  NA's   :5       NA's   :635   NA's   :5      NA's   :321   NA's   :240   
##     testvol    
##  Min.   : 1.0  
##  1st Qu.: 1.0  
##  Median : 3.0  
##  Mean   : 7.9  
##  3rd Qu.:15.0  
##  Max.   :30.0  
##  NA's   :859
summary(lm(sqrt(igf1)~age,data=juul,subset=age>25))
## 
## Call:
## lm(formula = sqrt(igf1) ~ age, data = juul, subset = age > 25)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.864 -1.166  0.102  0.945  4.114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.7103     0.4946   37.83   <2e-16 ***
## age          -0.1053     0.0107   -9.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.74 on 120 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.446,  Adjusted R-squared:  0.441 
## F-statistic: 96.6 on 1 and 120 DF,  p-value: <2e-16
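A quick plot of the fitted relation for the over-25 subset could be added (a sketch; this figure is not part of the original output):

#Scatter plot of sqrt(IGF-I) vs age for subjects over 25, with the fitted line
juul.old<-subset(juul,age>25)
plot(sqrt(igf1)~age,data=juul.old,main="sqrt(IGF-I) vs Age (age > 25)",ylab="sqrt(IGF-I)")
abline(lm(sqrt(igf1)~age,data=juul.old))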

8.In the malaria data set, analyze the log-transformed antibody level versus age. Make a plot of the relation. Do you notice anything peculiar?

library(ISwR)
attach(malaria)
## The following object is masked from juul:
## 
##     age
summary(malaria)
##     subject           age              ab            mal      
##  Min.   :  1.0   Min.   : 3.00   Min.   :   2   Min.   :0.00  
##  1st Qu.: 25.8   1st Qu.: 5.75   1st Qu.:  29   1st Qu.:0.00  
##  Median : 50.5   Median : 9.00   Median : 111   Median :0.00  
##  Mean   : 50.5   Mean   : 8.86   Mean   : 312   Mean   :0.27  
##  3rd Qu.: 75.2   3rd Qu.:12.00   3rd Qu.: 374   3rd Qu.:1.00  
##  Max.   :100.0   Max.   :15.00   Max.   :2066   Max.   :1.00
summary(lm(log(ab)~age,data=malaria))
## 
## Call:
## lm(formula = log(ab) ~ age, data = malaria)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.075 -1.062  0.118  1.101  2.733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.8370     0.3802   10.09   <2e-16 ***
## age           0.1035     0.0395    2.62     0.01 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.48 on 98 degrees of freedom
## Multiple R-squared:  0.0654, Adjusted R-squared:  0.0558 
## F-statistic: 6.85 on 1 and 98 DF,  p-value: 0.0103
plot(log(ab)~age,data=malaria, main="Log(Antibody Level) vs Age - Malaria", ylab="Log(Antibody Level)", xlab="Age")

[Figure: Log(Antibody Level) vs Age - Malaria scatter plot]

#The graph of log(antibody level) vs age appears to show a cyclic relationship between the two variables in the malaria dataset, with spikes at approximately ages 4, 6, and 11. This could be due to the inherent nature of the disease.  
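One way to probe the apparent non-linearity is to overlay a scatterplot smoother (a sketch, not required by the problem):

#A lowess curve highlights departures from a straight-line trend
plot(log(ab)~age,data=malaria,main="Log(Antibody Level) vs Age - Malaria",xlab="Age",ylab="Log(Antibody Level)")
with(malaria,lines(lowess(age,log(ab)),col=2))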

9.Perform the following functions on the data in TABLE F.

a.Make a histogram of each variable and comment on its apparent normality.

esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
esteem
##    Subject Height Self.Esteem
## 1        1     60         3.6
## 2        2     62         4.1
## 3        3     60         4.6
## 4        4     69         3.8
## 5        5     58         4.4
## 6        6     75         3.2
## 7        7     71         3.1
## 8        8     65         3.8
## 9        9     63         4.1
## 10      10     60         4.3
## 11      11     61         3.7
## 12      12     62         3.5
## 13      13     63         3.2
## 14      14     68         3.7
## 15      15     68         3.3
## 16      16     71         3.4
## 17      17     67         4.1
## 18      18     63         3.8
## 19      19     67         3.4
## 20      20     67         4.0
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'juul':
## 
##     menarche
attach(esteem)
truehist(Height, main="Height", xlab="Height (inches)")
curve(dnorm(x,mean=mean(Height),sd=sd(Height)),add=T)

[Figure: Histogram of Height (inches) with normal density overlay]

truehist(Self.Esteem, main="Self Esteem", xlab= "Self Esteem (1-5 scale)", ymax=1.0, xlim=c(1,5))
curve(dnorm(x,mean=mean(Self.Esteem),sd=sd(Self.Esteem)),add=T)

[Figure: Histogram of Self Esteem (1-5 scale) with normal density overlay]

#Both variables appear more or less normal, though Self.Esteem does not look as normally distributed as Height. 
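The visual impression could be backed up with a formal normality check (a sketch, not required by the problem):

#Shapiro-Wilk test: a small p-value would argue against normality
shapiro.test(Height)
shapiro.test(Self.Esteem)
#Normal Q-Q plots are another quick visual check
qqnorm(Height); qqline(Height)
qqnorm(Self.Esteem); qqline(Self.Esteem)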

b.Perform a Pearson Product Moment Correlation (r) and determine its significance.

esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
## 
##     Height, Self.Esteem, Subject
cor.test(Height,Self.Esteem)
## 
##  Pearson's product-moment correlation
## 
## data:  Height and Self.Esteem
## t = -3.405, df = 18, p-value = 0.003154
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8367 -0.2537
## sample estimates:
##    cor 
## -0.626
#There is a significant negative correlation between the two variables (r = -0.626, p = 0.003). 

c.Print out a scatter plot with a regression line.

esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
## 
##     Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 4):
## 
##     Height, Self.Esteem, Subject
plot(Height,Self.Esteem, main="Self Esteem vs Height", ylab="Self Esteem (1-5 scale)", xlab="Height (inches)", ylim=c(1,5))
abline(lm(Self.Esteem~Height), col=2)

[Figure: Self Esteem vs Height scatter plot with regression line, y-axis 1-5]

plot(Height,Self.Esteem, main="Self Esteem vs Height", ylab="Self Esteem (1-5 scale)", xlab="Height (inches)")
abline(lm(Self.Esteem~Height), col=2)

[Figure: Self Esteem vs Height scatter plot with regression line, default y-axis]

10.Perform the following functions on the data in TABLE F.

a.Add the confidence bands and prediction bands to the scatter plot in Problem 9.

esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
## 
##     Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 4):
## 
##     Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 5):
## 
##     Height, Self.Esteem, Subject
lm.esteem<-lm(Self.Esteem~Height)

pred.frame<-data.frame(Height=58:75)
pp<-predict(lm.esteem,int="p",newdata=pred.frame)
pc<-predict(lm.esteem,int="c",newdata=pred.frame)
plot(Height,Self.Esteem,ylim=c(1,5), main="Self Esteem vs Height", xlab="Height (inches)", ylab= "Self Esteem (1-5 scale)")
pred.height<-pred.frame$Height
matlines(pred.height,pc,lty=c(1,2,2),col=1)
matlines(pred.height,pp,lty=c(1,3,3),col=2)

[Figure: Self Esteem vs Height with regression line, confidence bands, and prediction bands, y-axis 1-5]

plot(Height,Self.Esteem, ylim=c(2.5,5), main="Self Esteem vs Height", xlab="Height (inches)", ylab= "Self Esteem (1-5 scale)")
matlines(pred.height,pc,lty=c(1,2,2),col=1)
matlines(pred.height,pp,lty=c(1,3,3),col=2)

[Figure: Self Esteem vs Height with regression line, confidence bands, and prediction bands, y-axis 2.5-5]

b.Perform a Spearman Rank Order Correlation (ρ) and determine its significance.

esteem<-read.table("/Users/mishaque/Desktop/Statistics Course/HW/HW3/TableF.txt",header=T)
attach(esteem)
## The following objects are masked from esteem (position 3):
## 
##     Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 4):
## 
##     Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 5):
## 
##     Height, Self.Esteem, Subject
## The following objects are masked from esteem (position 6):
## 
##     Height, Self.Esteem, Subject
cor.test(Height,Self.Esteem, method="spearman")
## Warning: Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Height and Self.Esteem
## S = 2137, p-value = 0.004566
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##     rho 
## -0.6067
#There appears to be a statistically significant dependency between the two variables. With rho = -0.6067, there appears to be a decreasing monotonic trend between height and self esteem. 

c.Choose which correlation is more appropriate, r or ρ and defend your position.

#I believe the Spearman test, a nonparametric method, is more appropriate for assessing correlation in this case. The self esteem data is ordinal, which suits Spearman's correlation better than Pearson's. Also, the relationship appears to be monotonic, which is what Spearman's ρ measures.