Exercise 3.1 (a)
1-pnorm(3)
## [1] 0.001349898
Exercise 3.1 (b)
1-pnorm(42, mean=35, sd=6)
## [1] 0.1216725
Exercise 3.1 (c)
dbinom(10,10,0.8)
## [1] 0.1073742
Exercise 3.1 (d)
punif(0.9)
## [1] 0.9
Exercise 3.1 (e)
1-pchisq(6.5, 2)
## [1] 0.03877421
Exercise 3.2
qnorm(0.025)
## [1] -1.959964
qnorm(0.005) # position of quartile corresponding to 1% of normal distirbution
## [1] -2.575829
qnorm(0.0025) # position of quartile corresponding to 0.5% of normal distirbution
## [1] -2.807034
qnorm(0.0005) # position of quartile corresponding to 0.1% of normal distirbution
## [1] -3.290527
1.95996 is slightly less than 2. 2-1.95996=0.04004. So, we can say that 5% of the normal distribution lies outside an interval with approximately ±0.04004S in standard deviation units.
Position of the quartiles corresponding to 0.1%: ±3.29s
Exercise 3.3
dbinom(10,10,0.8)
## [1] 0.1073742
#install.packages("ISwR") ### Data sets for Dalgaard’s book.
library("ISwR")
x <- rnorm(50)
mean(x)
## [1] -0.06785204
sd(x)
## [1] 0.9418259
var(x)
## [1] 0.887036
quantile(x)
## 0% 25% 50% 75% 100%
## -2.2177161 -0.7776089 -0.1745177 0.5934072 2.1638495
pvec <- seq(0,1,0.1)
pvec
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
attach(juul)
mean(igf1)
## [1] NA
mean(igf1,na.rm=T)
## [1] 340.168
sum(!is.na(igf1))
## [1] 1018
summary(igf1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.0 202.2 313.5 340.2 462.8 915.0 321
summary(juul)
## age menarche sex igf1
## Min. : 0.170 Min. :1.000 Min. :1.000 Min. : 25.0
## 1st Qu.: 9.053 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:202.2
## Median :12.560 Median :1.000 Median :2.000 Median :313.5
## Mean :15.095 Mean :1.476 Mean :1.534 Mean :340.2
## 3rd Qu.:16.855 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:462.8
## Max. :83.000 Max. :2.000 Max. :2.000 Max. :915.0
## NA's :5 NA's :635 NA's :5 NA's :321
## tanner testvol
## Min. :1.00 Min. : 1.000
## 1st Qu.:1.00 1st Qu.: 1.000
## Median :2.00 Median : 3.000
## Mean :2.64 Mean : 7.896
## 3rd Qu.:5.00 3rd Qu.:15.000
## Max. :5.00 Max. :30.000
## NA's :240 NA's :859
detach(juul)
juul$sex <- factor(juul$sex,labels=c("M","F"))
juul$menarche <- factor(juul$menarche,labels=c("No","Yes"))
juul$tanner <- factor(juul$tanner, labels=c("I","II","III","IV","V"))
attach(juul)
summary(juul)
## age menarche sex igf1 tanner
## Min. : 0.170 No :369 M :621 Min. : 25.0 I :515
## 1st Qu.: 9.053 Yes :335 F :713 1st Qu.:202.2 II :103
## Median :12.560 NA's:635 NA's: 5 Median :313.5 III : 72
## Mean :15.095 Mean :340.2 IV : 81
## 3rd Qu.:16.855 3rd Qu.:462.8 V :328
## Max. :83.000 Max. :915.0 NA's:240
## NA's :5 NA's :321
## testvol
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 3.000
## Mean : 7.896
## 3rd Qu.:15.000
## Max. :30.000
## NA's :859
juul <- transform(juul,sex=factor(sex,labels=c("M","F")), menarche=factor(menarche,labels=c("No","Yes")), tanner=factor(tanner,labels=c("I","II","III","IV","V")))
hist(x)
mid.age <- c(2.5,7.5,13,16.5,17.5,19,22.5,44.5,70.5)
acc.count <- c(28,46,58,20,31,64,149,316,103)
age.acc <- rep(mid.age,acc.count) # rep(A, B) repeat A with B times.
brk <- c(0,5,10,16,17,18,20,25,60,80)
hist(age.acc,breaks=brk)
n <- length(x)
plot(sort(x),(1:n)/n,type="o",ylim=c(0,1))
qqnorm(x)
par(mfrow=c(1,2))
boxplot(IgM)
boxplot(log(IgM))
par(mfrow=c(1,1))
attach(red.cell.folate)
tapply(folate,ventilation,mean)
## N2O+O2,24h N2O+O2,op O2,24h
## 316.6250 256.4444 278.0000
# The tapply call takes the folate variable, splits it according to ventilation, and computes the mean for each group.
tapply(folate,ventilation,sd)
## N2O+O2,24h N2O+O2,op O2,24h
## 58.71709 37.12180 33.75648
tapply(folate,ventilation,length)
## N2O+O2,24h N2O+O2,op O2,24h
## 8 9 5
xbar <- tapply(folate, ventilation, mean)
s <- tapply(folate, ventilation, sd)
n <- tapply(folate, ventilation, length)
cbind(mean=xbar, std.dev=s, n=n)
## mean std.dev n
## N2O+O2,24h 316.6250 58.71709 8
## N2O+O2,op 256.4444 37.12180 9
## O2,24h 278.0000 33.75648 5
tapply(igf1, tanner, mean)
## I II III IV V
## NA NA NA NA NA
tapply(igf1, tanner, mean, na.rm=T)
## I II III IV V
## 207.4727 352.6714 483.2222 513.0172 465.3344
tapply(igf1, tanner, mean, na.rm=T)
## I II III IV V
## 207.4727 352.6714 483.2222 513.0172 465.3344
aggregate(juul[c("age","igf1")], list(sex=juul$sex), mean, na.rm=T)
## sex age igf1
## 1 M 15.38436 310.8866
## 2 F 14.84363 368.1006
aggregate(juul[c("age","igf1")], juul["sex"], mean, na.rm=T)
## sex age igf1
## 1 M 15.38436 310.8866
## 2 F 14.84363 368.1006
by(juul, juul["sex"], summary)
## sex: M
## age menarche sex igf1 tanner
## Min. : 0.17 No : 0 M:621 Min. : 29.0 I :291
## 1st Qu.: 8.85 Yes : 0 F: 0 1st Qu.:176.0 II : 55
## Median :12.38 NA's:621 Median :280.0 III : 34
## Mean :15.38 Mean :310.9 IV : 41
## 3rd Qu.:16.77 3rd Qu.:430.2 V :124
## Max. :83.00 Max. :915.0 NA's: 76
## NA's :145
## testvol
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 3.000
## Mean : 7.896
## 3rd Qu.:15.000
## Max. :30.000
## NA's :141
## --------------------------------------------------------
## sex: F
## age menarche sex igf1 tanner
## Min. : 0.25 No :369 M: 0 Min. : 25.0 I :224
## 1st Qu.: 9.30 Yes :335 F:713 1st Qu.:233.0 II : 48
## Median :12.80 NA's: 9 Median :352.0 III : 38
## Mean :14.84 Mean :368.1 IV : 40
## 3rd Qu.:16.93 3rd Qu.:483.0 V :204
## Max. :75.12 Max. :914.0 NA's:159
## NA's :176
## testvol
## Min. : NA
## 1st Qu.: NA
## Median : NA
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
## NA's :713
attach(energy)
expend.lean <- expend[stature=="lean"]
expend.obese <- expend[stature=="obese"]
par(mfrow=c(2,1))
hist(expend.lean,breaks=10,xlim=c(5,13),ylim=c(0,4),col="white")
hist(expend.obese,breaks=10,xlim=c(5,13),ylim=c(0,4),col="grey")
par(mfrow=c(1,1))
boxplot(expend ~ stature)
boxplot(expend.lean,expend.obese)
opar <- par(mfrow=c(2,2), mex=0.8, mar=c(3,3,2,1)+.1)
stripchart(expend ~ stature)
stripchart(expend ~ stature, method="stack") # stack points that are compeltely identical
stripchart(expend ~ stature, method="jitter") # jitter offsets all points a random amount vertically.
stripchart(expend ~ stature, method="jitter", jitter=.03)
par(opar)
caff.marital <- matrix(c(652,1537,598,242,36,46,38,21,218,327,106,67), nrow=3,byrow=T)
caff.marital
## [,1] [,2] [,3] [,4]
## [1,] 652 1537 598 242
## [2,] 36 46 38 21
## [3,] 218 327 106 67
colnames(caff.marital) <- c("0","1-150","151-300",">300")
rownames(caff.marital) <- c("Married","Prev.married","Single")
caff.marital
## 0 1-150 151-300 >300
## Married 652 1537 598 242
## Prev.married 36 46 38 21
## Single 218 327 106 67
names(dimnames(caff.marital)) <- c("marital","consumption")
caff.marital
## consumption
## marital 0 1-150 151-300 >300
## Married 652 1537 598 242
## Prev.married 36 46 38 21
## Single 218 327 106 67
asDataFrame=as.data.frame(caff.marital)
asDataFrame
## 0 1-150 151-300 >300
## Married 652 1537 598 242
## Prev.married 36 46 38 21
## Single 218 327 106 67
dimnames(asDataFrame)
## [[1]]
## [1] "Married" "Prev.married" "Single"
##
## [[2]]
## [1] "0" "1-150" "151-300" ">300"
asTable=as.data.frame(as.table(caff.marital))
asTable
## marital consumption Freq
## 1 Married 0 652
## 2 Prev.married 0 36
## 3 Single 0 218
## 4 Married 1-150 1537
## 5 Prev.married 1-150 46
## 6 Single 1-150 327
## 7 Married 151-300 598
## 8 Prev.married 151-300 38
## 9 Single 151-300 106
## 10 Married >300 242
## 11 Prev.married >300 21
## 12 Single >300 67
dimnames(asTable)
## [[1]]
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
##
## [[2]]
## [1] "marital" "consumption" "Freq"
table(sex)
## sex
## M F
## 621 713
table(sex,menarche)
## menarche
## sex No Yes
## M 0 0
## F 369 335
table(menarche,tanner)
## tanner
## menarche I II III IV V
## No 221 43 32 14 2
## Yes 1 1 5 26 202
xtabs(~ tanner + sex, data=juul)
## sex
## tanner M F
## I 291 224
## II 55 48
## III 34 38
## IV 41 40
## V 124 204
xtabs(~ dgn + diab + coma, data=stroke)
## , , coma = No
##
## diab
## dgn No Yes
## ICH 53 6
## ID 143 21
## INF 411 64
## SAH 38 0
##
## , , coma = Yes
##
## diab
## dgn No Yes
## ICH 19 1
## ID 23 3
## INF 23 2
## SAH 9 0
ftable(coma + diab ~ dgn, data=stroke)
## coma No Yes
## diab No Yes No Yes
## dgn
## ICH 53 6 19 1
## ID 143 21 23 3
## INF 411 64 23 2
## SAH 38 0 9 0
t(caff.marital) # transpose
## marital
## consumption Married Prev.married Single
## 0 652 36 218
## 1-150 1537 46 327
## 151-300 598 38 106
## >300 242 21 67
# 4.5.2 Marginal tables and relative frequency
tanner.sex <- table(tanner,sex)
tanner.sex
## sex
## tanner M F
## I 291 224
## II 55 48
## III 34 38
## IV 41 40
## V 124 204
margin.table(tanner.sex,1) # marginal count for each row
## tanner
## I II III IV V
## 515 103 72 81 328
margin.table(tanner.sex,2)
## sex
## M F
## 545 554
prop.table(tanner.sex,1) # probablity in each row
## sex
## tanner M F
## I 0.5650485 0.4349515
## II 0.5339806 0.4660194
## III 0.4722222 0.5277778
## IV 0.5061728 0.4938272
## V 0.3780488 0.6219512
tanner.sex/sum(tanner.sex)
## sex
## tanner M F
## I 0.26478617 0.20382166
## II 0.05004550 0.04367607
## III 0.03093722 0.03457689
## IV 0.03730664 0.03639672
## V 0.11282985 0.18562329
# 4.6 Graphical display of tables
# 4.6.1 Barplots
total.caff <- margin.table(caff.marital,2)
total.caff
## consumption
## 0 1-150 151-300 >300
## 906 1910 742 330
barplot(total.caff, col="white")
par(mfrow=c(2,2))
barplot(caff.marital, col="white")
barplot(t(caff.marital), col="white")
barplot(t(caff.marital), col="white", beside=T)
barplot(prop.table(t(caff.marital),2), col="white", beside=T)
par(mfrow=c(1,1))
barplot(prop.table(t(caff.marital),2),beside=T,legend.text=colnames(caff.marital),
col=c("white","grey80","grey50","black"))
# 4.6.2 Dotcharts
dotchart(t(caff.marital), lcolor="black")
# 4.6.3 Piecharts
opar <- par(mfrow=c(2,2),mex=0.8, mar=c(1,1,2,1))
slices <- c("white","grey80","grey50","black")
pie(caff.marital["Married",], main="Married", col=slices)
pie(caff.marital["Prev.married",], main="Previously married", col=slices)
pie(caff.marital["Single",], main="Single", col=slices)
par(opar)
dimnames() function retrieve or set the dimnames of an object. “dimnames” in the table format and data.frame format are different, because table and data.frame have different data representation. Table has marital status as row names, and consumption levels as column names. However, data.frame has row numbers in each row, and variable names in each column.
o “as.table” presents contingency table with the joint density of one or more categorical variables. Each entry in a contingency table is a count of the number of times a particular set of factors levels occurs in the dataset.
o “as.data.frame” forces and coerces the table into a data frame puts each factor of the contingency table into its own column along with the frequency, rather than keeping the same structure as original table object. Therefore, the input variable for ‘as.data.frame()’ function should be table data.
[Aha’s] 1) Ah, 95% confidence interval or confidence coefficient means that the population mean is in the range of [X -Z0.25 * sd, X +Z0.25*sd]. If the sample size is large, the 95% confidence interval will be narrow. This also means that the values of [X fall within the 2 standard deviation of the mean. Also, 95% confidence interval means that 95 out of 100 confidence intervals obtained from 100 different samples will include the true population mean. 2) Ah, it is not good that we use higher percentage of confidence interval. As the percentage of confidence interval is higher, it can give us higher confidence, but the confidence interval will be wide, which is not much useful. Expanded confidence interval will give us more confidence, but the wide interval is not much helpful. 3) Ah, t-distribution approaches the standard normal distribution as the sample size increase (i.e., the degree of freedom increase). In this sentence, does the large sample size indicates about >30 as we discussed in class? The number 30 still sounds arbitrary.
[Q’s] 1) In Chapter 6 of Shahbaba, in page 152, it says, “We use X1, X2, … Xn to denote n possible values of X obtained from a sample randomly selected from the population.” However, in the next paragraph, it says, “We use x1, x2, … xn as the specific set of values we have observed in our sample.That is, x1 is the observed value for X1, x2 is the observed value X2, and so forth.”"
As far as I understand, X1, X2, … Xn and x1, x2, … xn are the same. “Random values obtained from sample” and “observed in our sample” sound same. I don’t understand what is different. Aren’t “x1, x2, … xn” also random values, because they are values observed from random values, X1, X2, … Xn. Likewise, Xbar and xbar are the same, aren’t they?
Exercise 4.
inFilePath="~/51_Rstudio/5_Assignment5/birthwt.txt"
birthwtData=read.table(inFilePath, sep="", header=TRUE, stringsAsFactors = T)
#birthwtData
samplesize=nrow(birthwtData)
degree_freedom=samplesize-1
mean_PhyVisit=mean(birthwtData$ftv)
print(paste("Point estimate for the population mean is", mean_PhyVisit))
## [1] "Point estimate for the population mean is 0.793650793650794"
stdev_error=sd(birthwtData$ftv)/sqrt(samplesize)
t_critical_0.05=qt(0.05,degree_freedom)
conf_interval_low=mean_PhyVisit + t_critical_0.05*stdev_error
conf_interval_high=mean_PhyVisit - t_critical_0.05*stdev_error
print(paste("90% confidence interval for the population mean, lower tail: ", conf_interval_low))
## [1] "90% confidence interval for the population mean, lower tail: 0.666284353837966"
print(paste("90% confidence interval for the population mean, upper tail: ", conf_interval_high))
## [1] "90% confidence interval for the population mean, upper tail: 0.921017233463622"
Point estimate for the population mean μ̂ is 0.7937
90% confidence interval for the population mean is [0.9210, 0.6663]
Exercise 7.
sample_size = (1.95*0.5/0.02)^2 # The possible highest standard deviation is 0.5
print(paste("The number of people to be interviewed: ", sample_size))
## [1] "The number of people to be interviewed: 2376.5625"
Exercise 8.
sample_mean = 320/2000
estimate_populationVariance=sample_mean*(1-sample_mean)
stdev_error_smp_mean=sqrt(estimate_populationVariance/2000)
z_critical_0.05=qnorm(0.05)
conf_interval_90_low = sample_mean + z_critical_0.05*stdev_error_smp_mean
conf_interval_90_high = sample_mean - z_critical_0.05*stdev_error_smp_mean
print(paste("90% confidence interval for the population poroportion of smokers: [",conf_interval_90_low, ", ", conf_interval_90_high,"]", sep=""))
## [1] "90% confidence interval for the population poroportion of smokers: [0.146516212693935, 0.173483787306065]"
tilde(~) operator is to specify that ‘expend’ is described by ‘stature’. I should be carefule to not be confused and write ‘t.test(stature~expend)’. By the way, if you see the section 5.6, page 106, Welch’s two sample t-test on pre-post data doesn’t need tilde(~) operator in the calling.
Aha and question Ah, in the Welch’s variant of the t-test, default is that I don’t assume that the variance is the same in the two groups. To ass that the variances are the same, I have to specify ‘var.equal=T’. Here is my question. However, how can we know whether the variances in the two groups would be the same or not? Does this variance mean ‘sample variance’, not population variance..? If it means sample variance, do we have to calculate sample variance in each group first before running t.test? They wouldn’t be exactly same each other..? ‘var.equal=T’ is just assumption..?
Aha In the section 5.4, there is explanation about comparison of variance. By the ‘var.test’ function, we can compare the variance of two groups, and test the assumption that the variances are the same. If p-value from ‘var.test’ is not significant, it means we accept alternative hypothesis which is “ratio of variance is not equal to 1”. So, this means, the assumption that two variances are the same is wrong.
Question The section 5.5 explained that two-sample Wilcoxon-rank test is based on calculating the sum of ranks in one group, thus reducing the problem to one of sampling n1 values without replacement from the numbers 1 to n1+n2. I don’t understand this explanation.