Question A

Exercise 3.1 (a)

1-pnorm(3)
## [1] 0.001349898

Exercise 3.1 (b)

1-pnorm(42, mean=35, sd=6)
## [1] 0.1216725

Exercise 3.1 (c)

dbinom(10,10,0.8)
## [1] 0.1073742

Exercise 3.1 (d)

punif(0.9)
## [1] 0.9

Exercise 3.1 (e)

1-pchisq(6.5, 2)
## [1] 0.03877421

Exercise 3.2

qnorm(0.025)
## [1] -1.959964
qnorm(0.005)   # position of quartile corresponding to 1% of normal distirbution
## [1] -2.575829
qnorm(0.0025)  # position of quartile corresponding to 0.5% of normal distirbution
## [1] -2.807034
qnorm(0.0005)  # position of quartile corresponding to 0.1% of normal distirbution
## [1] -3.290527
  1. 1.95996 is slightly less than 2. 2-1.95996=0.04004. So, we can say that 5% of the normal distribution lies outside an interval with approximately ±0.04004S in standard deviation units.

  2. Position of the quartiles corresponding to 1%: ±2.56s
  3. Position of the quartiles corresponding to 0.5%: ±2.81s
  4. Position of the quartiles corresponding to 0.1%: ±3.29s

Exercise 3.3

dbinom(10,10,0.8)
## [1] 0.1073742

Question B

#install.packages("ISwR")  ###  Data sets for Dalgaard’s book.
library("ISwR")

x <- rnorm(50)
mean(x)
## [1] -0.06785204
sd(x)
## [1] 0.9418259
var(x)
## [1] 0.887036
quantile(x)
##         0%        25%        50%        75%       100% 
## -2.2177161 -0.7776089 -0.1745177  0.5934072  2.1638495
pvec <- seq(0,1,0.1)
pvec
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
attach(juul)
mean(igf1)
## [1] NA
mean(igf1,na.rm=T)
## [1] 340.168
sum(!is.na(igf1))
## [1] 1018
summary(igf1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    25.0   202.2   313.5   340.2   462.8   915.0     321
summary(juul)
##       age            menarche          sex             igf1      
##  Min.   : 0.170   Min.   :1.000   Min.   :1.000   Min.   : 25.0  
##  1st Qu.: 9.053   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:202.2  
##  Median :12.560   Median :1.000   Median :2.000   Median :313.5  
##  Mean   :15.095   Mean   :1.476   Mean   :1.534   Mean   :340.2  
##  3rd Qu.:16.855   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:462.8  
##  Max.   :83.000   Max.   :2.000   Max.   :2.000   Max.   :915.0  
##  NA's   :5        NA's   :635     NA's   :5       NA's   :321    
##      tanner        testvol      
##  Min.   :1.00   Min.   : 1.000  
##  1st Qu.:1.00   1st Qu.: 1.000  
##  Median :2.00   Median : 3.000  
##  Mean   :2.64   Mean   : 7.896  
##  3rd Qu.:5.00   3rd Qu.:15.000  
##  Max.   :5.00   Max.   :30.000  
##  NA's   :240    NA's   :859
detach(juul)
juul$sex <- factor(juul$sex,labels=c("M","F"))
juul$menarche <- factor(juul$menarche,labels=c("No","Yes"))
juul$tanner <- factor(juul$tanner, labels=c("I","II","III","IV","V"))
attach(juul)
summary(juul)
##       age         menarche     sex           igf1        tanner   
##  Min.   : 0.170   No  :369   M   :621   Min.   : 25.0   I   :515  
##  1st Qu.: 9.053   Yes :335   F   :713   1st Qu.:202.2   II  :103  
##  Median :12.560   NA's:635   NA's:  5   Median :313.5   III : 72  
##  Mean   :15.095                         Mean   :340.2   IV  : 81  
##  3rd Qu.:16.855                         3rd Qu.:462.8   V   :328  
##  Max.   :83.000                         Max.   :915.0   NA's:240  
##  NA's   :5                              NA's   :321               
##     testvol      
##  Min.   : 1.000  
##  1st Qu.: 1.000  
##  Median : 3.000  
##  Mean   : 7.896  
##  3rd Qu.:15.000  
##  Max.   :30.000  
##  NA's   :859
juul <- transform(juul,sex=factor(sex,labels=c("M","F")), menarche=factor(menarche,labels=c("No","Yes")), tanner=factor(tanner,labels=c("I","II","III","IV","V")))

hist(x)

mid.age <- c(2.5,7.5,13,16.5,17.5,19,22.5,44.5,70.5)
acc.count <- c(28,46,58,20,31,64,149,316,103)
age.acc <- rep(mid.age,acc.count)    # rep(A, B)   repeat A with B times.
brk <- c(0,5,10,16,17,18,20,25,60,80)
hist(age.acc,breaks=brk)

n <- length(x)
plot(sort(x),(1:n)/n,type="o",ylim=c(0,1))

qqnorm(x)

par(mfrow=c(1,2))
boxplot(IgM)
boxplot(log(IgM))

par(mfrow=c(1,1))

attach(red.cell.folate)
tapply(folate,ventilation,mean)   
## N2O+O2,24h  N2O+O2,op     O2,24h 
##   316.6250   256.4444   278.0000
# The tapply call takes the folate variable, splits it according to ventilation, and computes the mean for each group.
tapply(folate,ventilation,sd)
## N2O+O2,24h  N2O+O2,op     O2,24h 
##   58.71709   37.12180   33.75648
tapply(folate,ventilation,length)
## N2O+O2,24h  N2O+O2,op     O2,24h 
##          8          9          5
xbar <- tapply(folate, ventilation, mean)
s <- tapply(folate, ventilation, sd)
n <- tapply(folate, ventilation, length)
cbind(mean=xbar, std.dev=s, n=n)
##                mean  std.dev n
## N2O+O2,24h 316.6250 58.71709 8
## N2O+O2,op  256.4444 37.12180 9
## O2,24h     278.0000 33.75648 5
tapply(igf1, tanner, mean)
##   I  II III  IV   V 
##  NA  NA  NA  NA  NA
tapply(igf1, tanner, mean, na.rm=T)
##        I       II      III       IV        V 
## 207.4727 352.6714 483.2222 513.0172 465.3344
tapply(igf1, tanner, mean, na.rm=T)
##        I       II      III       IV        V 
## 207.4727 352.6714 483.2222 513.0172 465.3344
aggregate(juul[c("age","igf1")], list(sex=juul$sex), mean, na.rm=T)
##   sex      age     igf1
## 1   M 15.38436 310.8866
## 2   F 14.84363 368.1006
aggregate(juul[c("age","igf1")], juul["sex"], mean, na.rm=T)
##   sex      age     igf1
## 1   M 15.38436 310.8866
## 2   F 14.84363 368.1006
by(juul, juul["sex"], summary)
## sex: M
##       age        menarche   sex          igf1        tanner   
##  Min.   : 0.17   No  :  0   M:621   Min.   : 29.0   I   :291  
##  1st Qu.: 8.85   Yes :  0   F:  0   1st Qu.:176.0   II  : 55  
##  Median :12.38   NA's:621           Median :280.0   III : 34  
##  Mean   :15.38                      Mean   :310.9   IV  : 41  
##  3rd Qu.:16.77                      3rd Qu.:430.2   V   :124  
##  Max.   :83.00                      Max.   :915.0   NA's: 76  
##                                     NA's   :145               
##     testvol      
##  Min.   : 1.000  
##  1st Qu.: 1.000  
##  Median : 3.000  
##  Mean   : 7.896  
##  3rd Qu.:15.000  
##  Max.   :30.000  
##  NA's   :141     
## -------------------------------------------------------- 
## sex: F
##       age        menarche   sex          igf1        tanner   
##  Min.   : 0.25   No  :369   M:  0   Min.   : 25.0   I   :224  
##  1st Qu.: 9.30   Yes :335   F:713   1st Qu.:233.0   II  : 48  
##  Median :12.80   NA's:  9           Median :352.0   III : 38  
##  Mean   :14.84                      Mean   :368.1   IV  : 40  
##  3rd Qu.:16.93                      3rd Qu.:483.0   V   :204  
##  Max.   :75.12                      Max.   :914.0   NA's:159  
##                                     NA's   :176               
##     testvol   
##  Min.   : NA  
##  1st Qu.: NA  
##  Median : NA  
##  Mean   :NaN  
##  3rd Qu.: NA  
##  Max.   : NA  
##  NA's   :713
attach(energy)
expend.lean <- expend[stature=="lean"]
expend.obese <- expend[stature=="obese"]


par(mfrow=c(2,1))
hist(expend.lean,breaks=10,xlim=c(5,13),ylim=c(0,4),col="white")
hist(expend.obese,breaks=10,xlim=c(5,13),ylim=c(0,4),col="grey")

par(mfrow=c(1,1))

boxplot(expend ~ stature)

boxplot(expend.lean,expend.obese)

opar <- par(mfrow=c(2,2), mex=0.8, mar=c(3,3,2,1)+.1)
stripchart(expend ~ stature)
stripchart(expend ~ stature, method="stack")    # stack points that are compeltely identical
stripchart(expend ~ stature, method="jitter")   # jitter offsets all points a random amount vertically.
stripchart(expend ~ stature, method="jitter", jitter=.03)

par(opar)

caff.marital <- matrix(c(652,1537,598,242,36,46,38,21,218,327,106,67), nrow=3,byrow=T)
caff.marital
##      [,1] [,2] [,3] [,4]
## [1,]  652 1537  598  242
## [2,]   36   46   38   21
## [3,]  218  327  106   67
colnames(caff.marital) <- c("0","1-150","151-300",">300")
rownames(caff.marital) <- c("Married","Prev.married","Single")
caff.marital
##                0 1-150 151-300 >300
## Married      652  1537     598  242
## Prev.married  36    46      38   21
## Single       218   327     106   67
names(dimnames(caff.marital)) <- c("marital","consumption")
caff.marital
##               consumption
## marital          0 1-150 151-300 >300
##   Married      652  1537     598  242
##   Prev.married  36    46      38   21
##   Single       218   327     106   67
asDataFrame=as.data.frame(caff.marital)
asDataFrame
##                0 1-150 151-300 >300
## Married      652  1537     598  242
## Prev.married  36    46      38   21
## Single       218   327     106   67
dimnames(asDataFrame)
## [[1]]
## [1] "Married"      "Prev.married" "Single"      
## 
## [[2]]
## [1] "0"       "1-150"   "151-300" ">300"
asTable=as.data.frame(as.table(caff.marital))
asTable
##         marital consumption Freq
## 1       Married           0  652
## 2  Prev.married           0   36
## 3        Single           0  218
## 4       Married       1-150 1537
## 5  Prev.married       1-150   46
## 6        Single       1-150  327
## 7       Married     151-300  598
## 8  Prev.married     151-300   38
## 9        Single     151-300  106
## 10      Married        >300  242
## 11 Prev.married        >300   21
## 12       Single        >300   67
dimnames(asTable)
## [[1]]
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
## 
## [[2]]
## [1] "marital"     "consumption" "Freq"
table(sex)
## sex
##   M   F 
## 621 713
table(sex,menarche)
##    menarche
## sex  No Yes
##   M   0   0
##   F 369 335
table(menarche,tanner)
##         tanner
## menarche   I  II III  IV   V
##      No  221  43  32  14   2
##      Yes   1   1   5  26 202
xtabs(~ tanner + sex, data=juul)
##       sex
## tanner   M   F
##    I   291 224
##    II   55  48
##    III  34  38
##    IV   41  40
##    V   124 204
xtabs(~ dgn + diab + coma, data=stroke)
## , , coma = No
## 
##      diab
## dgn    No Yes
##   ICH  53   6
##   ID  143  21
##   INF 411  64
##   SAH  38   0
## 
## , , coma = Yes
## 
##      diab
## dgn    No Yes
##   ICH  19   1
##   ID   23   3
##   INF  23   2
##   SAH   9   0
ftable(coma + diab ~ dgn, data=stroke)
##     coma  No     Yes    
##     diab  No Yes  No Yes
## dgn                     
## ICH       53   6  19   1
## ID       143  21  23   3
## INF      411  64  23   2
## SAH       38   0   9   0
t(caff.marital)  # transpose
##            marital
## consumption Married Prev.married Single
##     0           652           36    218
##     1-150      1537           46    327
##     151-300     598           38    106
##     >300        242           21     67
# 4.5.2 Marginal tables and relative frequency

tanner.sex <- table(tanner,sex)
tanner.sex
##       sex
## tanner   M   F
##    I   291 224
##    II   55  48
##    III  34  38
##    IV   41  40
##    V   124 204
margin.table(tanner.sex,1)    # marginal count for each row
## tanner
##   I  II III  IV   V 
## 515 103  72  81 328
margin.table(tanner.sex,2)
## sex
##   M   F 
## 545 554
prop.table(tanner.sex,1)     # probablity in each row
##       sex
## tanner         M         F
##    I   0.5650485 0.4349515
##    II  0.5339806 0.4660194
##    III 0.4722222 0.5277778
##    IV  0.5061728 0.4938272
##    V   0.3780488 0.6219512
tanner.sex/sum(tanner.sex)
##       sex
## tanner          M          F
##    I   0.26478617 0.20382166
##    II  0.05004550 0.04367607
##    III 0.03093722 0.03457689
##    IV  0.03730664 0.03639672
##    V   0.11282985 0.18562329
# 4.6 Graphical display of tables
# 4.6.1 Barplots
total.caff <- margin.table(caff.marital,2)
total.caff
## consumption
##       0   1-150 151-300    >300 
##     906    1910     742     330
barplot(total.caff, col="white")

par(mfrow=c(2,2))
barplot(caff.marital, col="white")
barplot(t(caff.marital), col="white")
barplot(t(caff.marital), col="white", beside=T)
barplot(prop.table(t(caff.marital),2), col="white", beside=T)

par(mfrow=c(1,1))

barplot(prop.table(t(caff.marital),2),beside=T,legend.text=colnames(caff.marital),
        col=c("white","grey80","grey50","black"))

# 4.6.2 Dotcharts
dotchart(t(caff.marital), lcolor="black")

# 4.6.3 Piecharts
opar <- par(mfrow=c(2,2),mex=0.8, mar=c(1,1,2,1))
slices <- c("white","grey80","grey50","black")
pie(caff.marital["Married",], main="Married", col=slices)
pie(caff.marital["Prev.married",], main="Previously married", col=slices)
pie(caff.marital["Single",], main="Single", col=slices)
par(opar)

Why dimnames are different?

dimnames() function retrieve or set the dimnames of an object. “dimnames” in the table format and data.frame format are different, because table and data.frame have different data representation. Table has marital status as row names, and consumption levels as column names. However, data.frame has row numbers in each row, and variable names in each column.

o “as.table” presents contingency table with the joint density of one or more categorical variables. Each entry in a contingency table is a count of the number of times a particular set of factors levels occurs in the dataset.

o “as.data.frame” forces and coerces the table into a data frame puts each factor of the contingency table into its own column along with the frequency, rather than keeping the same structure as original table object. Therefore, the input variable for ‘as.data.frame()’ function should be table data.

Question C

[Aha’s] 1) Ah, 95% confidence interval or confidence coefficient means that the population mean is in the range of [X -Z0.25 * sd, X +Z0.25*sd]. If the sample size is large, the 95% confidence interval will be narrow. This also means that the values of [X fall within the 2 standard deviation of the mean. Also, 95% confidence interval means that 95 out of 100 confidence intervals obtained from 100 different samples will include the true population mean. 2) Ah, it is not good that we use higher percentage of confidence interval. As the percentage of confidence interval is higher, it can give us higher confidence, but the confidence interval will be wide, which is not much useful. Expanded confidence interval will give us more confidence, but the wide interval is not much helpful. 3) Ah, t-distribution approaches the standard normal distribution as the sample size increase (i.e., the degree of freedom increase). In this sentence, does the large sample size indicates about >30 as we discussed in class? The number 30 still sounds arbitrary.

[Q’s] 1) In Chapter 6 of Shahbaba, in page 152, it says, “We use X1, X2, … Xn to denote n possible values of X obtained from a sample randomly selected from the population.” However, in the next paragraph, it says, “We use x1, x2, … xn as the specific set of values we have observed in our sample.That is, x1 is the observed value for X1, x2 is the observed value X2, and so forth.”"

As far as I understand, X1, X2, … Xn and x1, x2, … xn are the same. “Random values obtained from sample” and “observed in our sample” sound same. I don’t understand what is different. Aren’t “x1, x2, … xn” also random values, because they are values observed from random values, X1, X2, … Xn. Likewise, Xbar and xbar are the same, aren’t they?

Exercise 4.

inFilePath="~/51_Rstudio/5_Assignment5/birthwt.txt"
birthwtData=read.table(inFilePath, sep="", header=TRUE, stringsAsFactors = T)
#birthwtData
samplesize=nrow(birthwtData)
degree_freedom=samplesize-1
mean_PhyVisit=mean(birthwtData$ftv)
print(paste("Point estimate for the population mean is", mean_PhyVisit))
## [1] "Point estimate for the population mean is 0.793650793650794"
stdev_error=sd(birthwtData$ftv)/sqrt(samplesize)
t_critical_0.05=qt(0.05,degree_freedom)

conf_interval_low=mean_PhyVisit + t_critical_0.05*stdev_error
conf_interval_high=mean_PhyVisit - t_critical_0.05*stdev_error
print(paste("90% confidence interval for the population mean, lower tail: ", conf_interval_low))
## [1] "90% confidence interval for the population mean, lower tail:  0.666284353837966"
print(paste("90% confidence interval for the population mean, upper tail: ", conf_interval_high))
## [1] "90% confidence interval for the population mean, upper tail:  0.921017233463622"

Point estimate for the population mean μ̂ is 0.7937

90% confidence interval for the population mean is [0.9210, 0.6663]

Exercise 7.

sample_size = (1.95*0.5/0.02)^2     # The possible highest standard deviation is 0.5 
print(paste("The number of people to be interviewed: ", sample_size))
## [1] "The number of people to be interviewed:  2376.5625"

Exercise 8.

sample_mean = 320/2000
estimate_populationVariance=sample_mean*(1-sample_mean)
stdev_error_smp_mean=sqrt(estimate_populationVariance/2000)

z_critical_0.05=qnorm(0.05)
conf_interval_90_low = sample_mean + z_critical_0.05*stdev_error_smp_mean
conf_interval_90_high = sample_mean - z_critical_0.05*stdev_error_smp_mean
print(paste("90% confidence interval for the population poroportion of smokers: [",conf_interval_90_low, ", ", conf_interval_90_high,"]", sep=""))
## [1] "90% confidence interval for the population poroportion of smokers: [0.146516212693935, 0.173483787306065]"

Question D. Dalgaard chapter5

  1. Aha In the two-sample t-test, > t.test(expend~stature)

tilde(~) operator is to specify that ‘expend’ is described by ‘stature’. I should be carefule to not be confused and write ‘t.test(stature~expend)’. By the way, if you see the section 5.6, page 106, Welch’s two sample t-test on pre-post data doesn’t need tilde(~) operator in the calling.

  1. Aha and question Ah, in the Welch’s variant of the t-test, default is that I don’t assume that the variance is the same in the two groups. To ass that the variances are the same, I have to specify ‘var.equal=T’. Here is my question. However, how can we know whether the variances in the two groups would be the same or not? Does this variance mean ‘sample variance’, not population variance..? If it means sample variance, do we have to calculate sample variance in each group first before running t.test? They wouldn’t be exactly same each other..? ‘var.equal=T’ is just assumption..?

  2. Aha In the section 5.4, there is explanation about comparison of variance. By the ‘var.test’ function, we can compare the variance of two groups, and test the assumption that the variances are the same. If p-value from ‘var.test’ is not significant, it means we accept alternative hypothesis which is “ratio of variance is not equal to 1”. So, this means, the assumption that two variances are the same is wrong.

  3. Question The section 5.5 explained that two-sample Wilcoxon-rank test is based on calculating the sum of ranks in one group, thus reducing the problem to one of sampling n1 values without replacement from the numbers 1 to n1+n2. I don’t understand this explanation.