1

The number of complaints received for each of 44 doctors working in the emergency room of a large hospital in a year was recorded. Does the distribution of number of complaints follow the Poisson distribution with mean 3? Use this R script to answer the question. Source: Le, C.T. (2010). Applied categorical data analysis and translational research. P. 203.
Column 1: Number of complaints made
Column 2: Number of doctors

ANS:依照圖形顯示結果應不符合Poisson分配。

dta<-read.table('http://titan.ccunix.ccu.edu.tw/~psycfs/dataM/Data/ERComplaints.txt',h=T)
plot(dta$numberOfComplaint,dta$numberOfDr/sum(dta$numberOfDr),type="l",lty=6)
points(dta$numberOfComplaint,dta$numberOfDr/sum(dta$numberOfDr),pch=16)
segments(dta$numberOfComplaint, rep(0, 7), dta$numberOfComplaint, dpois(0:7, lambda=3),col = 2)

2

A group of students have taken exams in Math, Science, and English. These scores are to be combined into a single grade for each student. An ‘A’ is assigned to the top 20 percent of students, a ‘B’ to the next 20 percent, and so on. The final results are sorted alphabetically by the last names of the students. Use the following R script to complete the task described above.

pupil <- c("John Denver", "Amy Tan", "Bill Clinton", "Dave Kuhn", 
           "Jane Mayer", "Kevin Costner", "Ruben Yin", "Greg Knox", "Joel English",
           "Meg Ryan")

Math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
Science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
English <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)
roster <- data.frame(pupil, Math, Science, English, stringsAsFactors=FALSE)
fullname <- strsplit((roster$pupil), " ")
#str(fullname)
firstname <- sapply(fullname, "[", 1)
lastname <- sapply(fullname, "[", 2)
roster <- data.frame(firstname, lastname, roster[ , -1])
q.math<-quantile(roster$Math,probs=c(.2,.4,.6,.8,1))
q.sci<-quantile(roster$Science,probs=c(.2,.4,.6,.8,1))
q.eng<-quantile(roster$English,probs=c(.2,.4,.6,.8,1))
roster$Math<-cut(roster$Math,breaks=c(-Inf,q.math),labels=toupper(letters[5:1]))
roster$Science<-cut(roster$Science,breaks=c(-Inf,q.sci),labels=toupper(letters[5:1]))
roster$English<-cut(roster$English,breaks=c(-Inf,q.eng),labels=toupper(letters[5:1]))
roster[order(lastname),]
   firstname lastname Math Science English
3       Bill  Clinton    D       E       D
6      Kevin  Costner    C       C       A
1       John   Denver    C       B       B
9       Joel  English    B       B       B
8       Greg     Knox    A       B       A
4       Dave     Kuhn    E       D       E
5       Jane    Mayer    D       E       C
10       Meg     Ryan    B       C       D
2        Amy      Tan    A       A       C
7      Ruben      Yin    E       E       E

3.

School administrators collected attendance data on 316 high school juniors from two urban high schools. The variables of interest are the number of days absent from school, gender of the student, and a standardized test score in mathematics. Source: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Column 1: School ID Column 2: Gender ID, Male = 1, Female = 0 Column 3: Student’s math score Column 4: Number of days absent from school

Recode the school IDs to ‘S01’ and ‘S02’. Recode the gender ID to ‘Female’ and ‘Male’. Create index for students within each school.

dta<-read.table('http://titan.ccunix.ccu.edu.tw/~psycfs/dataM/Data/attendance.txt',h=T)
dta$school[dta$school==1]<-"S01"
dta$school[dta$school==2]<-"S02"
dta$male[dta$male==1]<-"Male"
dta$male[dta$male==0]<-"Female"
dta$index[dta$school=="S01"]<-paste("S01",1:sum(dta$school=="S01"),sep="_")
dta$index[dta$school=="S02"]<-paste("S02",1:sum(dta$school=="S02"),sep="_")
head(dta)
  school   male    math daysabs index
1    S01   Male 56.9888       4 S01_1
2    S01   Male 37.0942       4 S01_2
3    S01 Female 32.2755       2 S01_3
4    S01 Female 29.0567       3 S01_4
5    S01 Female  6.7480       3 S01_5
6    S01 Female 61.6543      13 S01_6

Find the mean math score of each school.

mean(dta$math[dta$school=="S01"])
[1] 42.20373
mean(dta$math[dta$school=="S02"])
[1] 55.38172

3.5

Find the mean math score of each school by gender.

mean(dta$math[dta$male=="Male"])
[1] 47.76658
mean(dta$math[dta$male=="Female"])
[1] 49.68685

3.6

Find the mean number of days of absence of each school.

mean(dta$daysabs[dta$school=="S01"])
[1] 8.132075
mean(dta$daysabs[dta$school=="S02"])
[1] 3.458599

3.7

Find the mean number of days of absence of each school by gender.

mean(dta$daysabs[dta$male=="Male"])
[1] 4.876623
mean(dta$daysabs[dta$male=="Female"])
[1] 6.697531

3.8

Is the number of days of absence related to either gender or math score or both? Does the same relationship hold for both schools?

summary(dta.lm <- lm(daysabs ~ male-1, data=dta))

Call:
lm(formula = daysabs ~ male - 1, data = dta)

Residuals:
   Min     1Q Median     3Q    Max 
-6.698 -4.877 -2.698  2.123 38.302 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
maleFemale   6.6975     0.5818  11.512  < 2e-16 ***
maleMale     4.8766     0.5967   8.173 7.49e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.405 on 314 degrees of freedom
Multiple R-squared:  0.3883,    Adjusted R-squared:  0.3844 
F-statistic: 99.66 on 2 and 314 DF,  p-value: < 2.2e-16
summary(dta.lm <- lm(daysabs ~ math-1, data=dta))

Call:
lm(formula = daysabs ~ math - 1, data = dta)

Residuals:
   Min     1Q Median     3Q    Max 
-9.609 -4.175 -1.276  3.813 36.352 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
math  0.09707    0.00866   11.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.992 on 315 degrees of freedom
Multiple R-squared:  0.2852,    Adjusted R-squared:  0.2829 
F-statistic: 125.7 on 1 and 315 DF,  p-value: < 2.2e-16
summary(dta.lm <- lm(daysabs ~ male+math-1, data=dta))

Call:
lm(formula = daysabs ~ male + math - 1, data = dta)

Residuals:
   Min     1Q Median     3Q    Max 
-8.184 -4.474 -2.372  2.200 41.790 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
maleFemale 10.21191    1.28158   7.968 3.02e-14 ***
maleMale    8.25518    1.24903   6.609 1.66e-10 ***
math       -0.07073    0.02306  -3.067  0.00235 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.308 on 313 degrees of freedom
Multiple R-squared:  0.4062,    Adjusted R-squared:  0.4005 
F-statistic: 71.36 on 3 and 313 DF,  p-value: < 2.2e-16

4

The following R script converts Taiwan dollars into US dollars. Modify it so that it converts US dollars to Euros.

ps. € sign can be obtained by typing “Alt” plus “128”.

dollar2euros <- function(x, xr)
  return(paste("€", format(round(x*100/xr, 0)/100, nsmall=2),
               sep=""))
xr <- 1.13146
x <- round(rnorm(10, mean=30, sd=3))
data.frame(USD=paste0("$",x), EURO=dollar2euros(x, xr))
   USD   EURO
1  $35 €30.93
2  $29 €25.63
3  $24 €21.21
4  $33 €29.17
5  $34 €30.05
6  $32 €28.28
7  $32 €28.28
8  $29 €25.63
9  $25 €22.10
10 $33 €29.17

5

Thr formula P = L (r/(1-(1+r)^(-M)) describes the payment you have to make per month for M number of months if you take out a loan of L amount today at a monthly interest rate of r. Compute how much you will have to pay per month for 10, 15, 20, 25, or 30 years if you borrow NT5,000,000, 10,000,000, or 15,000,000 from a bank that charges you 2%, 5%, or 7% for the monthly interest rate.

P<-function(L,M,r) L*(r/(1-(1+r)^(-M))) 
L<-c(5000000,10000000,15000000)
M<-c(10,15,20,25,30)*12
r<-c(0.02,0.05,0.07)
dta<-expand.grid(L,M,r)
colnames(dta)<-c("L","M","r")
dta<-dta[with(dta,order(dta$L,dta$M)),]
payment<-cbind(dta,P(dta$L,dta$M,dta$r))
colnames(payment)[4]<-"Payment"
head(payment) 
       L   M    r  Payment
1  5e+06 120 0.02 110240.5
16 5e+06 120 0.05 250718.6
31 5e+06 120 0.07 350104.3
4  5e+06 180 0.02 102913.7
19 5e+06 180 0.05 250038.4
34 5e+06 180 0.07 350001.8

6

Import the following text file: cities , into R. The first variable is the name of the city and the second variable is population density (population per square mile). Note that you will have to convert the second variable to numeric type by getting rid of the comma.

dta<-read.fwf("http://titan.ccunix.ccu.edu.tw/~psycfs/dataM/Data/cities10.txt",widths=c(18,26))
dta<-dta[-11,]
dta<-sapply(dta,gsub,pattern='\\s+',replacement='')
dta<-as.data.frame(cbind(dta[,1],as.numeric(gsub(",", "", dta[,2]))))
colnames(dta)<-c("City","PopulationDensity")
dta
              City PopulationDensity
1       NewYork,NY           66834.6
2         Kings,NY           34722.9
3         Bronx,NY           31729.8
4        Queens,NY             20453
5  SanFrancisco,CA           16526.2
6        Hudson,NJ           12956.9
7       Suffolk,MA           11691.6
8  Philadelphia,PA           11241.1
9    Washington,DC              9378
10 AlexandriaIC,VA            8552.2

7

Explain what this R scipt does.

ANS: pval2Pcnt這個function可以將數字轉為百分比符號表示,並可以指定其位數。

# create the function
pval2Pcnt <- function(p, FUN = round, ...){
  percent <- FUN(p * 100, ...)
  paste(percent, "%", sep="")
}
##
# test it
set.seed(2013)
pval2Pcnt(runif(6))
[1] "46%" "95%" "78%" "76%" "25%" "73%"
set.seed(2013)
pval2Pcnt(runif(6), FUN=signif, digits=3)
[1] "46.3%" "95.2%" "78.5%" "76.4%" "25.2%" "72.6%"
###

8

Use the data in the high schools example and write your own functions to solve the following problems:
(a) test if any pairs of the five variables: read , write , math science , and socst , are different in means.
ANS:兩兩變項平均均無顯著差異。

dta<-read.table("http://titan.ccunix.ccu.edu.tw/~psycfs/dataM/Data/hs0.txt",h=T)
lis<-c("read","write","math","science","socst")
lis<-combn(lis,2)
py<-function(y,x) which(x==y)
lisa<-as.integer(sapply(lis,py,colnames(dta)))
lisa<-matrix(lisa,dim(lis)[1],dim(lis)[2])
px<-function(x,dta)
{
j<-t.test(dta[,x[1]],dta[,x[2]])
j$p.value
}
rbind(lis,round(apply(lisa,2,px,dta),3))
     [,1]    [,2]    [,3]      [,4]    [,5]    [,6]      [,7]    [,8]     
[1,] "read"  "read"  "read"    "read"  "write" "write"   "write" "math"   
[2,] "write" "math"  "science" "socst" "math"  "science" "socst" "science"
[3,] "0.581" "0.673" "0.757"   "0.868" "0.89"  "0.378"   "0.715" "0.452"  
     [,9]    [,10]    
[1,] "math"  "science"
[2,] "socst" "socst"  
[3,] "0.812" "0.638"  


(b) test if the 4 different ethnic groups have the same mean scores for each of the 5 variables (individually): read , write , math science , and socst .
ANS: African-amer 的Science與Socst的平均具有顯著差異。

rbind(lis,round(apply(lisa,2,px,dta[dta$race=="white",]),3))
     [,1]    [,2]    [,3]      [,4]    [,5]    [,6]      [,7]    [,8]     
[1,] "read"  "read"  "read"    "read"  "write" "write"   "write" "math"   
[2,] "write" "math"  "science" "socst" "math"  "science" "socst" "science"
[3,] "0.909" "0.967" "0.845"   "0.846" "0.94"  "0.932"   "0.752" "0.873"  
     [,9]    [,10]    
[1,] "math"  "science"
[2,] "socst" "socst"  
[3,] "0.808" "0.694"  
rbind(lis,round(apply(lisa,2,px,dta[dta$race=="african-amer",]),3))
     [,1]    [,2]    [,3]      [,4]    [,5]    [,6]      [,7]    [,8]     
[1,] "read"  "read"  "read"    "read"  "write" "write"   "write" "math"   
[2,] "write" "math"  "science" "socst" "math"  "science" "socst" "science"
[3,] "0.597" "0.982" "0.115"   "0.368" "0.572" "0.064"   "0.698" "0.109"  
     [,9]    [,10]    
[1,] "math"  "science"
[2,] "socst" "socst"  
[3,] "0.347" "0.038"  
rbind(lis,round(apply(lisa,2,px,dta[dta$race=="hispanic",]),3))
     [,1]    [,2]    [,3]      [,4]    [,5]    [,6]      [,7]    [,8]     
[1,] "read"  "read"  "read"    "read"  "write" "write"   "write" "math"   
[2,] "write" "math"  "science" "socst" "math"  "science" "socst" "science"
[3,] "0.939" "0.768" "0.863"   "0.691" "0.667" "0.916"   "0.601" "0.567"  
     [,9]    [,10]    
[1,] "math"  "science"
[2,] "socst" "socst"  
[3,] "0.875" "0.519"  
rbind(lis,round(apply(lisa,2,px,dta[dta$race=="asian",]),3))
     [,1]    [,2]    [,3]      [,4]    [,5]    [,6]      [,7]    [,8]     
[1,] "read"  "read"  "read"    "read"  "write" "write"   "write" "math"   
[2,] "write" "math"  "science" "socst" "math"  "science" "socst" "science"
[3,] "0.081" "0.178" "0.903"   "0.81"  "0.853" "0.095"   "0.08"  "0.18"   
     [,9]    [,10]    
[1,] "math"  "science"
[2,] "socst" "socst"  
[3,] "0.154" "0.913"  

9

The dataset aptitude{cocor} contains two samples of test takers who completed an aptitude test consisting of general knowledge questions, logic tasks, and two measures of intelligence, a and b. Find an efficient way to perform two independent-sample t-test on each of the 4 measurements with R.

t.test(aptitude$sample1$knowledge,aptitude$sample2$knowledge)

    Welch Two Sample t-test

data:  aptitude$sample1$knowledge and aptitude$sample2$knowledge
t = -0.98442, df = 619.08, p-value = 0.3253
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1458623  0.3806504
sample estimates:
mean of x mean of y 
 16.44674  16.82934 
t.test(aptitude$sample1$logic,aptitude$sample2$logic)

    Welch Two Sample t-test

data:  aptitude$sample1$logic and aptitude$sample2$logic
t = -11.018, df = 609.98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.431057 -1.695515
sample estimates:
mean of x mean of y 
 4.903780  6.967066 
t.test(aptitude$sample1$intelligence.b,aptitude$sample2$intelligence.b)

    Welch Two Sample t-test

data:  aptitude$sample1$intelligence.b and aptitude$sample2$intelligence.b
t = 3.5041, df = 566.66, p-value = 0.0004944
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.778104 6.314110
sample estimates:
mean of x mean of y 
 69.87973  65.83362 
t.test(aptitude$sample1$intelligence.a,aptitude$sample2$intelligence.a)

    Welch Two Sample t-test

data:  aptitude$sample1$intelligence.a and aptitude$sample2$intelligence.a
t = -0.48546, df = 604.2, p-value = 0.6275
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1185331  0.6751519
sample estimates:
mean of x mean of y 
 22.06873  22.29042