1

The R script illustrates how several files can be merged together. Investigate how it works.

此數行將資料讀入並了解dta的結構、名稱與前幾項。並將現在工作路徑存為s1。

s1 <- getwd()
dta <- read.table("hs0.txt", header=T)
str(dta)

'data.frame':   200 obs. of  11 variables:
 $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
 $ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
 $ race   : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
 $ ses    : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
 $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
 $ prog   : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
 $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
 $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
 $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
 $ science: int  47 63 58 53 53 63 53 39 58 NA ...
 $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...

names(dta)

 [1] "id"      "female"  "race"    "ses"     "schtyp"  "prog"    "read"   
 [8] "write"   "math"    "science" "socst"

head(dta)

   id female  race    ses schtyp     prog read write math science socst
1  70   male white    low public  general   57    52   41      47    57
2 121 female white middle public vocation   68    59   53      63    61
3  86   male white   high public  general   44    33   54      58    31
4 141   male white   high public vocation   63    44   47      53    56
5 172   male white middle public academic   47    52   57      53    61
6 113   male white middle public academic   44    52   51      63    61

此數行將dta的前六項另存為物件z，並將’read’,‘write’,‘math’,’science’還有’socst’分別與id存為新的檔案z1-z5。

z <- head(dta)
write.table(z[, c(1, 7)], file="z1.txt", quote=F, sep=" ", 
            col.names=T, row.names=F)
write.table(z[, c(1, 8)], file="z2.txt", quote=F, sep=" ", 
            col.names=T, row.names=F)
write.table(z[, c(1, 9)], file="z3.txt", quote=F, sep=" ", 
            col.names=T, row.names=F)
write.table(z[, c(1, 10)], file="z4.txt", quote=F, sep=" ", 
            col.names=T, row.names=F)
write.table(z[, c(1, 11)], file="z5.txt", quote=F, sep=" ", 
            col.names=T, row.names=F)

接下來使用list.files這個指令將工作路徑中的檔案清單儲存為nfiles（包括我們剛剛存的新檔案z1-z5）。

nfiles <- list.files(path=s1)
nfiles

 [1] "aaup2.dat.txt"              "dataM_hw1_U76031019.pdf"   
 [3] "dataM_hw1_U76031019.R"      "dataM_hw1_U76031019.Rmd"   
 [5] "dataM_hw1_U76031019_2.html" "dataM_hw1_U76031019_2.R"   
 [7] "dataM_hw1_U76031019_2.Rmd"  "dataM_hw2_U76031019.html"  
 [9] "dataM_hw2_U76031019.R"      "dataM_hw2_U76031019.Rmd"   
[11] "dataM_hw3_U76031019.html"   "dataM_hw3_U76031019.R"     
[13] "dataM_hw3_U76031019.Rmd"    "dmIntro.pdf"               
[15] "dplyr_note.pdf"             "eduSex.R"                  
[17] "ex4.R"                      "file.csv"                  
[19] "file.dat"                   "file.sas"                  
[21] "hs0.txt"                    "Intro_R.pdf"               
[23] "jsp.csv"                    "juniorSchools.txt"         
[25] "mathAttainment.txt"         "P005.txt"                  
[27] "RDataManip.pdf"             "readingtimes.txt"          
[29] "RInOut.pdf"                 "RIntro.pdf"                
[31] "rsconnect"                  "savedfile"                 
[33] "titanic.raw.rdata"          "v53i07.pdf"                
[35] "verbalIQ.txt"               "z1.txt"                    
[37] "z2.txt"                     "z3.txt"                    
[39] "z4.txt"                     "z5.txt"

這邊結合lapply與read.csv，批次讀入z1-z5並存成一個list物件’filesIn’。

filesIn <- lapply(nfiles[36:40], FUN=read.csv, sep=" ")
filesIn

[[1]]
   id read
1  70   57
2 121   68
3  86   44
4 141   63
5 172   47
6 113   44

[[2]]
   id write
1  70    52
2 121    59
3  86    33
4 141    44
5 172    52
6 113    52

[[3]]
   id math
1  70   41
2 121   53
3  86   54
4 141   47
5 172   57
6 113   51

[[4]]
   id science
1  70      47
2 121      63
3  86      58
4 141      53
5 172      53
6 113      63

[[5]]
   id socst
1  70    57
2 121    61
3  86    31
4 141    56
5 172    61
6 113    61

Reduce(function(x, y) merge(x, y, all=T, by="id"), filesIn, accumulate=F)

   id read write math science socst
1  70   57    52   41      47    57
2  86   44    33   54      58    31
3 113   44    52   51      63    61
4 121   68    59   53      63    61
5 141   63    44   47      53    56
6 172   47    52   57      53    61

2.

A subset of data from the National Longitudinal Survey of Youth is presented in a long format.

2.1

Covert the dataset to a wide format.

這個資料是一個longitudinal的資料，共有4次測量，其中用來辨別個體的變項為id，不會變動的變項為sex跟race，會變動的變項為grade,year,month,math與read，用來識別第幾次測量的變項為time。

dta<-read.csv('http://titan.ccunix.ccu.edu.tw/~psycfs/dataM/Data/nlsy86long.csv')
head(dta)

    id    sex     race time grade year month      math      read
1 2390 Female Majority    1     0    6    67 14.285714 19.047619
2 2560 Female Majority    1     0    6    66 20.238095 21.428571
3 3740 Female Majority    1     0    6    67 17.857143 21.428571
4 4020   Male Majority    1     0    5    60  7.142857  7.142857
5 6350   Male Majority    1     1    7    78 29.761905 30.952381
6 7030   Male Majority    1     0    5    62 14.285714 17.857143

dta2<-reshape(dta, idvar="id", timevar="time",v.names=c('grade','year','month','math','read'), direction="wide")
head(dta2)

    id    sex     race grade.1 year.1 month.1    math.1    read.1 grade.2
1 2390 Female Majority       0      6      67 14.285714 19.047619       2
2 2560 Female Majority       0      6      66 20.238095 21.428571       1
3 3740 Female Majority       0      6      67 17.857143 21.428571       2
4 4020   Male Majority       0      5      60  7.142857  7.142857       1
5 6350   Male Majority       1      7      78 29.761905 30.952381       3
6 7030   Male Majority       0      5      62 14.285714 17.857143       2
  year.2 month.2   math.2   read.2 grade.3 year.3 month.3   math.3
1      8      96 15.47619 29.76190       3     10     119 38.09524
2      8      95 36.90476 32.14286       3     10     119 52.38095
3      8      95 22.61905 45.23810       5     10     122 53.57143
4      8      91 21.42857 21.42857       2      9     112 53.57143
5      9     108 50.00000 50.00000       4     11     132 47.61905
6      8      93 36.90476 46.42857       3     10     117 55.95238
    read.3 grade.4 year.4 month.4   math.4   read.4
1 28.57143       5     12     142 41.66667 45.23810
2 45.23810       6     12     143 58.33333 57.14286
3 69.04762       6     12     144 58.33333 78.57143
4 50.00000       5     12     139 51.19048 59.52381
5 63.09524       6     13     155 71.42857 82.14286
6 64.28571       5     12     139 63.09524 96.42857

2.2

Recode grade 0, 1 and 2 to “kindergarten” “first” and “second”.

dta$grade[dta$grade==0]<-"kindergarten"
dta$grade[dta$grade==1]<-"first"
dta$grade[dta$grade==2]<-"second"
head(dta)

    id    sex     race time        grade year month      math      read
1 2390 Female Majority    1 kindergarten    6    67 14.285714 19.047619
2 2560 Female Majority    1 kindergarten    6    66 20.238095 21.428571
3 3740 Female Majority    1 kindergarten    6    67 17.857143 21.428571
4 4020   Male Majority    1 kindergarten    5    60  7.142857  7.142857
5 6350   Male Majority    1        first    7    78 29.761905 30.952381
6 7030   Male Majority    1 kindergarten    5    62 14.285714 17.857143

3

Given a vector of numerical values, find the most common element (mode) for it. Report only the element and the number of its occurrences. For example, use num <- sample(100, rep=T) to generate (with replacement) a list of random integers from 1 to 100.

num <- sample(100, rep=T)
sort(table(num),d=T)

num
24 51 53 75 81  3  4 10 14 17 22 29 36 41 43 47 59 70 71 79 87 88 89 90 92 
 3  3  3  3  3  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
99  1  5  7  8 11 12 13 21 23 26 27 28 30 32 33 35 38 40 42 44 45 46 54 55 
 2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
56 57 58 61 63 64 66 67 69 72 73 76 77 78 82 83 91 96 97 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

4

The HELP (Health Evaluation and Linkage to Primary Care) study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. Eligible subjects were adults, who spoke Spanish or English, reported alcohol, heroin or cocaine as their first or second drug of choice, resided in proximity to the primary care clinic to which they would be referred or were homeless. Subjects were interviewed at baseline during their detoxification stay and follow-up interviews were undertaken every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.

設定與讀資料，瞭解資料格式

options(digits=3)
options(width=68)
ds <- read.csv("http://www.math.smith.edu/r/data/help.csv")
str(ds)

'data.frame':   453 obs. of  88 variables:
 $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ e2b1        : int  NA NA NA NA NA NA NA 1 1 1 ...
 $ g1b1        : int  0 0 0 0 0 0 NA 0 0 0 ...
 $ i11         : int  NA 8 NA NA 64 1 NA NA 12 13 ...
 $ pcs1        : num  54.2 59.6 58.5 46.6 31.4 ...
 $ mcs1        : num  52.2 41.7 56.8 14.7 40.7 ...
 $ cesd1       : int  7 11 14 44 26 23 NA 18 33 37 ...
 $ indtot1     : int  5 12 99 97 55 12 NA 82 92 104 ...
 $ drugrisk1   : int  0 0 13 0 0 0 NA 0 8 14 ...
 $ sexrisk1    : int  1 0 4 4 4 4 NA 4 4 5 ...
 $ pcrec1      : int  1 0 0 0 0 0 NA 0 0 2 ...
 $ e2b2        : int  NA NA NA 1 NA NA 1 NA NA 2 ...
 $ g1b2        : int  NA NA NA NA 0 NA 0 NA NA 0 ...
 $ i12         : int  NA NA NA NA NA NA 13 NA NA 22 ...
 $ pcs2        : num  NA NA NA 57.6 44.8 ...
 $ mcs2        : num  NA NA NA 23.9 42.4 ...
 $ cesd2       : int  NA NA NA NA 27 NA 47 NA NA 47 ...
 $ indtot2     : int  NA NA NA NA 34 NA 92 NA NA 100 ...
 $ drugrisk2   : int  NA NA NA NA 0 NA 0 NA NA 19 ...
 $ sexrisk2    : int  NA NA NA NA 2 NA 9 NA NA 0 ...
 $ pcrec2      : int  NA NA NA 0 0 NA 2 NA NA 2 ...
 $ e2b3        : int  NA NA NA NA 1 NA NA 2 4 NA ...
 $ g1b3        : int  0 NA NA NA 0 NA 0 0 0 NA ...
 $ i13         : int  NA NA NA NA 13 NA 13 12 27 NA ...
 $ pcs3        : num  52.1 NA NA NA 25 ...
 $ mcs3        : num  56.1 NA NA NA 50.9 ...
 $ cesd3       : int  8 NA NA NA 15 NA 54 25 35 NA ...
 $ indtot3     : int  0 NA NA NA 37 NA 104 92 88 NA ...
 $ drugrisk3   : int  0 NA NA NA 0 NA 0 6 7 NA ...
 $ sexrisk3    : int  1 NA NA NA 4 NA 7 6 8 NA ...
 $ pcrec3      : int  2 NA NA NA 0 NA 2 0 2 NA ...
 $ e2b4        : int  NA NA 2 NA NA NA 1 NA NA NA ...
 $ g1b4        : int  0 NA 0 0 0 NA 0 NA 0 NA ...
 $ i14         : int  8 NA 3 NA 6 NA NA NA 9 NA ...
 $ pcs4        : num  52.3 NA 66.2 57.1 44.7 ...
 $ mcs4        : num  58 NA 13.6 53.5 32.7 ...
 $ cesd4       : int  5 NA 49 20 28 NA 52 NA 27 NA ...
 $ indtot4     : int  34 NA 94 5 58 NA 113 NA 93 NA ...
 $ drugrisk4   : int  0 NA 19 0 0 NA 0 NA 0 NA ...
 $ sexrisk4    : int  3 NA 4 4 2 NA 7 NA 3 NA ...
 $ pcrec4      : int  2 NA 0 0 0 NA 2 NA 2 NA ...
 $ a15a        : int  0 2 0 0 15 0 0 4 1 4 ...
 $ a15b        : int  0 3 0 0 0 0 0 1 134 20 ...
 $ d1          : int  3 22 0 2 12 1 14 1 14 4 ...
 $ e2b         : int  NA NA NA 1 1 NA 1 8 7 3 ...
 $ f1a         : int  3 3 3 0 3 1 3 1 3 2 ...
 $ f1b         : int  2 2 2 0 0 0 1 1 2 3 ...
 $ f1c         : int  3 0 3 1 3 1 3 2 3 3 ...
 $ f1d         : int  0 3 0 3 3 3 1 3 1 0 ...
 $ f1e         : int  2 3 2 2 3 0 3 3 3 1 ...
 $ f1f         : int  3 2 2 2 3 0 3 3 3 2 ...
 $ f1g         : int  3 0 1 1 1 0 3 3 3 3 ...
 $ f1h         : int  0 0 3 3 3 3 1 1 0 0 ...
 $ f1i         : int  2 3 2 0 3 0 3 1 3 3 ...
 $ f1j         : int  3 0 3 0 2 1 3 0 3 1 ...
 $ f1k         : int  3 3 1 1 3 1 3 3 3 3 ...
 $ f1l         : int  0 0 0 2 2 3 0 1 1 0 ...
 $ f1m         : int  1 0 1 2 2 1 0 3 2 3 ...
 $ f1n         : int  2 3 3 2 3 0 3 0 3 3 ...
 $ f1o         : int  2 0 2 0 0 1 3 1 2 1 ...
 $ f1p         : int  2 0 0 NA 3 3 1 0 0 0 ...
 $ f1q         : int  2 0 0 2 3 0 3 2 1 0 ...
 $ f1r         : int  3 2 3 0 3 0 3 2 3 3 ...
 $ f1s         : int  3 0 2 0 3 0 3 0 1 0 ...
 $ f1t         : int  2 0 0 1 3 0 3 0 2 3 ...
 $ g1b         : int  1 1 0 0 0 0 1 1 0 0 ...
 $ i1          : int  13 56 0 5 10 4 13 12 71 20 ...
 $ i2          : int  26 62 0 5 13 4 20 24 129 27 ...
 $ age         : int  37 37 26 39 32 47 49 28 50 39 ...
 $ treat       : int  1 1 0 0 0 1 0 1 0 1 ...
 $ homeless    : int  0 1 0 0 1 0 0 1 1 1 ...
 $ pcs         : num  58.4 36 74.8 61.9 37.3 ...
 $ mcs         : num  25.11 26.67 6.76 43.97 21.68 ...
 $ cesd        : int  49 30 39 15 39 6 52 32 50 46 ...
 $ indtot      : int  39 43 41 28 38 29 38 44 44 44 ...
 $ pss_fr      : int  0 1 13 11 10 5 1 4 5 0 ...
 $ drugrisk    : int  0 0 20 0 0 0 0 7 18 20 ...
 $ sexrisk     : int  4 7 2 4 6 5 8 6 8 0 ...
 $ satreat     : int  0 0 0 1 0 0 1 1 0 1 ...
 $ drinkstatus : int  1 1 1 0 1 1 NA 1 1 1 ...
 $ daysdrink   : int  177 2 3 196 2 31 NA 47 62 115 ...
 $ anysubstatus: int  1 1 1 1 1 1 NA 1 1 1 ...
 $ daysanysub  : int  177 2 3 189 2 31 NA 47 31 115 ...
 $ linkstatus  : int  1 NA 0 0 1 0 0 0 0 0 ...
 $ dayslink    : int  225 NA 365 343 57 365 334 365 365 382 ...
 $ female      : int  0 0 0 1 0 1 1 0 1 0 ...
 $ substance   : Factor w/ 3 levels "alcohol","cocaine",..: 2 1 3 3 2 2 2 1 1 3 ...
 $ racegrp     : Factor w/ 4 levels "black","hispanic",..: 1 4 1 4 1 1 1 4 4 4 ...

names(ds)

 [1] "id"           "e2b1"         "g1b1"         "i11"         
 [5] "pcs1"         "mcs1"         "cesd1"        "indtot1"     
 [9] "drugrisk1"    "sexrisk1"     "pcrec1"       "e2b2"        
[13] "g1b2"         "i12"          "pcs2"         "mcs2"        
[17] "cesd2"        "indtot2"      "drugrisk2"    "sexrisk2"    
[21] "pcrec2"       "e2b3"         "g1b3"         "i13"         
[25] "pcs3"         "mcs3"         "cesd3"        "indtot3"     
[29] "drugrisk3"    "sexrisk3"     "pcrec3"       "e2b4"        
[33] "g1b4"         "i14"          "pcs4"         "mcs4"        
[37] "cesd4"        "indtot4"      "drugrisk4"    "sexrisk4"    
[41] "pcrec4"       "a15a"         "a15b"         "d1"          
[45] "e2b"          "f1a"          "f1b"          "f1c"         
[49] "f1d"          "f1e"          "f1f"          "f1g"         
[53] "f1h"          "f1i"          "f1j"          "f1k"         
[57] "f1l"          "f1m"          "f1n"          "f1o"         
[61] "f1p"          "f1q"          "f1r"          "f1s"         
[65] "f1t"          "g1b"          "i1"           "i2"          
[69] "age"          "treat"        "homeless"     "pcs"         
[73] "mcs"          "cesd"         "indtot"       "pss_fr"      
[77] "drugrisk"     "sexrisk"      "satreat"      "drinkstatus" 
[81] "daysdrink"    "anysubstatus" "daysanysub"   "linkstatus"  
[85] "dayslink"     "female"       "substance"    "racegrp"

選擇特定的columns，存為newds。其中有一行unlames(newds)應該是打錯字，可能是colnames之類的。應該不太重要。

newds <- ds[, c("cesd","female","i1","i2","id","treat","f1a",
   "f1b","f1c","f1d","f1e","f1f","f1g","f1h","f1i","f1j","f1k",
   "f1l","f1m","f1n","f1o","f1p","f1q","f1r","f1s","f1t")]
attach(newds)
str(newds[ ,1:10])

'data.frame':   453 obs. of  10 variables:
 $ cesd  : int  49 30 39 15 39 6 52 32 50 46 ...
 $ female: int  0 0 0 1 0 1 1 0 1 0 ...
 $ i1    : int  13 56 0 5 10 4 13 12 71 20 ...
 $ i2    : int  26 62 0 5 13 4 20 24 129 27 ...
 $ id    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ treat : int  1 1 0 0 0 1 0 1 0 1 ...
 $ f1a   : int  3 3 3 0 3 1 3 1 3 2 ...
 $ f1b   : int  2 2 2 0 0 0 1 1 2 3 ...
 $ f1c   : int  3 0 3 1 3 1 3 2 3 3 ...
 $ f1d   : int  0 3 0 3 3 3 1 3 1 0 ...

使用package“plyr”，colwise這個function可以讓一些用在vector的function可以columnwise的使用,我覺得跟apply(,2,)有點像。

#install.packages("plyr")
library(plyr)

Warning: package 'plyr' was built under R version 3.2.4

colwise(mean)(newds[ ,1:10])

  cesd female   i1   i2  id treat  f1a  f1b  f1c  f1d
1 32.8  0.236 17.9 22.6 233 0.497 1.63 1.39 1.92 1.57

看newds前五個row的資料

head(newds, n=5)

  cesd female i1 i2 id treat f1a f1b f1c f1d f1e f1f f1g f1h f1i
1   49      0 13 26  1     1   3   2   3   0   2   3   3   0   2
2   30      0 56 62  2     1   3   2   0   3   3   2   0   0   3
3   39      0  0  0  3     0   3   2   3   0   2   2   1   3   2
4   15      1  5  5  4     0   0   0   1   3   2   2   1   3   0
5   39      0 10 13  5     0   3   0   3   3   3   3   1   3   3
  f1j f1k f1l f1m f1n f1o f1p f1q f1r f1s f1t
1   3   3   0   1   2   2   2   2   3   3   2
2   0   3   0   0   3   0   0   0   2   0   0
3   3   1   0   1   3   2   0   0   3   2   0
4   0   1   2   2   2   0  NA   2   0   0   1
5   2   3   2   2   3   0   3   3   3   3   3

stopifnot可以檢查數個判斷式是否均為真，在這邊沒有return任何error，代表判斷式均為真。

stopifnot(cesd >= 0 & cesd <= 60, i1 >= 0, female == 0 | female == 1)

不太確定try的功用，感覺好像跟直接跑一樣。geterrmessage可以記錄error的內容。

try(stopifnot(i1 >= 1))
geterrmessage()

[1] "Error : i1 >= 1 are not all TRUE\n"

comment加入註解

comment(newds) <- "HELP baseline dataset"
comment(newds)

[1] "HELP baseline dataset"

將ds存為一RData“savedfile”

save(ds, file="savedfile")

將newds存為一csv檔“file.csv”

write.csv(newds, "file.csv")

使用package “foreign”將資料存為“file.dat”與“file.sas”

library(foreign)
write.foreign(newds, "file.dat", "file.sas", package="SAS")

看cesd的前10個東西

cesd[1:10]

 [1] 49 30 39 15 39  6 52 32 50 46

選擇cesd>55的內容

cesd[cesd > 55]

 [1] 57 58 57 60 58 56 58 56 57 56

回傳cesd>55的位置

which(cesd > 55)

 [1]  64 116 171 194 231 266 295 305 387 415

排序cesd的前四個元素，預設由小到大排列

sort(cesd)[1:4]

[1] 1 3 3 4

看f1g中有多少missing，FALSE的是沒有遺漏，

table(is.na(f1g))


FALSE  TRUE 
  452     1

將某些變項反序排列，作成新的檔案“cesditems”後，將資料依沒有missing的比例調整，最後展示具有missing的項目總合調整後的結果。

cesditems <- cbind(f1a, f1b, f1c, (3 - f1d), f1e, f1f, f1g, 
                   (3 - f1h), f1i, f1j, f1k, (3 - f1l), f1m, f1n, f1o, 
                   (3 - f1p), f1q, f1r, f1s, f1t)
nmisscesd <- apply(is.na(cesditems), 1, sum)
ncesditems <- cesditems
ncesditems[is.na(cesditems)] <- 0
newcesd <- apply(ncesditems, 1, sum)
imputemeancesd <- 20/(20 - nmisscesd)*newcesd
cbind(newcesd, cesd, nmisscesd, imputemeancesd)[nmisscesd > 0, ]

     newcesd cesd nmisscesd imputemeancesd
[1,]      15   15         1           15.8
[2,]      19   19         1           20.0
[3,]      44   44         1           46.3
[4,]      17   17         1           17.9
[5,]      29   29         1           30.5
[6,]      44   44         1           46.3
[7,]      39   39         1           41.1

創造與i1一樣多個的空白的character物件

drinkstat <- character(length(i1))

依據條件填入字串

drinkstat[i1 == 0] <- "abstinent"
# create moderate group
drinkstat[(i1 > 0 & i1 <= 1 & i2 <= 3 & female == 1) | 
            (i1 > 0 & i1 <= 2 & i2 <= 4 & female == 0)] <- "moderate"
# create highrisk group
drinkstat[(( i1>1 | i2>3 ) & female == 1) | 
            ((i1 > 2 | i2 > 4) & female == 0)] <- "highrisk"

檢查drinstat中是否有遺漏值，結果顯示沒有遺漏

is.na(drinkstat) = is.na(i1) | is.na(i2) | is.na(female)
table(is.na(drinkstat))


FALSE 
  453

將一些變項組合成新的data frame

tmpds <- data.frame(i1, i2, female, drinkstat)
tmpds[361:370, ]

    i1 i2 female drinkstat
361 37 37      0  highrisk
362 25 25      0  highrisk
363 38 38      0  highrisk
364 12 29      0  highrisk
365  6 24      0  highrisk
366  6  6      0  highrisk
367  0  0      0 abstinent
368  0  0      1 abstinent
369  8  8      0  highrisk
370 32 32      0  highrisk

依照drinkstat和female的值條件篩選

tmpds[tmpds$drinkstat == "moderate" & tmpds$female == 1, ]

    i1 i2 female drinkstat
116  1  1      1  moderate
137  1  3      1  moderate
225  1  2      1  moderate
230  1  1      1  moderate
264  1  1      1  moderate
266  1  1      1  moderate
394  1  1      1  moderate

將cesd切分，並計算組別。right=F代表上界為開區間。

table(cut(cesd, c(0, 16, 22, 60), right=FALSE))


 [0,16) [16,22) [22,60) 
     46      42     364

計算drinkstate中missing的總數，結果沒有missing。

sum(is.na(drinkstat))

[1] 0

計算drinkstate各個變項的數量

table(drinkstat, exclude="NULL")

drinkstat
abstinent  highrisk  moderate 
       68       357        28

依照drinkstate和female作列聯表

table(drinkstat, female, exclude="NULL")

Warning in as.vector(exclude, typeof(x)): 強制變更過程中產生了 NA

           female
drinkstat     0   1
  abstinent  42  26
  highrisk  283  74
  moderate   21   7

將female=1和female=0分別重新編碼成gender=female、gender=male

gender = factor(female, c(0,1), c("male","Female"))
table(female)

female
  0   1 
346 107

table(gender)

gender
  male Female 
   346    107

只選擇女生的資料並計算平均（共4種方法）

detach(newds)
newds <- ds[order(ds$cesd, ds$i1), ]
newds[1:5, c("cesd", "i1", "id")]

    cesd i1  id
199    1  3 233
394    3  1 139
349    3 13 418
417    4  4 251
85     4  9  95

females <- ds[ds$female == 1, ]
attach(females)
mean(cesd)

[1] 36.9

with(ds, mean(cesd[female == 1]))

[1] 36.9

tapply(ds$cesd, ds$female, mean)

   0    1 
31.6 36.9

aggregate(ds$cesd, list(ds$female), mean)

  Group.1    x
1       0 31.6
2       1 36.9

5

An instructor has 12 students is his class: Amy, Brad, Chad, Daisy, Ed, Fey, George, Hillary, Ike, Jane, Kim, Luke. For a class assignment, he needs to permute these names at random. Help him accomplish it with R.

sample(c('Amy','Brad','Chad','Daisy','Ed','Fey','George','Hillary','Ike','Jane','Kim','Luke'),12)

 [1] "Amy"     "Daisy"   "Ike"     "Jane"    "Brad"    "Luke"   
 [7] "Kim"     "George"  "Chad"    "Ed"      "Hillary" "Fey"

7

The ‘cabbages’ data set in the ‘MASS’ package has 4 variables: Cult, Date, HeadWt, VitC. Ignore the last variable and make ‘Date’ a numeric variable. Plot 95% CIs for the variable ‘HeadWt’ by ‘Cult’ and ‘Date’, jointly.

dat<-MASS::cabbages
dat<-dat[,-dim(dta)[2]]
dat$Date<-as.numeric(dat$Date)
dta <- ddply(dat, c("Cult", "Date"),summarize,ll=mean(HeadWt)+c(-1.96,1.96)*sd(HeadWt)/sqrt(length(HeadWt)))
dta$bnd <- rep(c("l", "u"), 6)
dtaw <- reshape2::dcast(dta, Cult + Date ~ bnd,value.var="ll")
s <- 1:6; d <- 2*pi/1000
plot(c(0.5,4), c(0.5, 4), type="n", xlab="Date", ylab="HeadWt")
segments(dtaw$Date[s]+d*(s-1), dtaw$l[s], dtaw$Date[s]+d*(s-1),
         dtaw$u[s], col=dtaw$Cult, lwd=2)
points(dtaw$Date[s]+d*(s-1), (dtaw$l[s]+dtaw$u[s])/2,
       col=dtaw$Cult, pch=16)
legend("topright", legend=c("c39", "c52"), text.col=c(1,2))
grid()

8

The ‘MASS’ library has these two data sets: ‘Animals’ and ‘mammals’. Merge the two files and remove duplicated observations using ‘duplicated’.

ani<-rbind(MASS::Animals,MASS::mammals)
ani[!duplicated(ani),]

                              body   brain
Mountain beaver           1.35e+00    8.10
Cow                       4.65e+02  423.00
Grey wolf                 3.63e+01  119.50
Goat                      2.77e+01  115.00
Guinea pig                1.04e+00    5.50
Dipliodocus               1.17e+04   50.00
Asian elephant            2.55e+03 4603.00
Donkey                    1.87e+02  419.00
Horse                     5.21e+02  655.00
Potar monkey              1.00e+01  115.00
Cat                       3.30e+00   25.60
Giraffe                   5.29e+02  680.00
Gorilla                   2.07e+02  406.00
Human                     6.20e+01 1320.00
African elephant          6.65e+03 5712.00
Triceratops               9.40e+03   70.00
Rhesus monkey             6.80e+00  179.00
Kangaroo                  3.50e+01   56.00
Golden hamster            1.20e-01    1.00
Mouse                     2.30e-02    0.40
Rabbit                    2.50e+00   12.10
Sheep                     5.55e+01  175.00
Jaguar                    1.00e+02  157.00
Chimpanzee                5.22e+01  440.00
Rat                       2.80e-01    1.90
Brachiosaurus             8.70e+04  154.50
Mole                      1.22e-01    3.00
Pig                       1.92e+02  180.00
Arctic fox                3.38e+00   44.50
Owl monkey                4.80e-01   15.50
Roe deer                  1.48e+01   98.20
Verbet                    4.19e+00   58.00
Chinchilla                4.25e-01    6.40
Ground squirrel           1.01e-01    4.00
Arctic ground squirrel    9.20e-01    5.70
African giant pouched rat 1.00e+00    6.60
Lesser short-tailed shrew 5.00e-03    0.14
Star-nosed mole           6.00e-02    1.00
Nine-banded armadillo     3.50e+00   10.80
Tree hyrax                2.00e+00   12.30
N.A. opossum              1.70e+00    6.30
Big brown bat             2.30e-02    0.30
European hedgehog         7.85e-01    3.50
Galago                    2.00e-01    5.00
Genet                     1.41e+00   17.50
Grey seal                 8.50e+01  325.00
Rock hyrax-a              7.50e-01   12.30
Water opossum             3.50e+00    3.90
Yellow-bellied marmot     4.05e+00   17.00
Little brown bat          1.00e-02    0.25
Slow loris                1.40e+00   12.50
Okapi                     2.50e+02  490.00
Baboon                    1.06e+01  179.50
Desert hedgehog           5.50e-01    2.40
Giant armadillo           6.00e+01   81.00
Rock hyrax-b              3.60e+00   21.00
Raccoon                   4.29e+00   39.20
E. American mole          7.50e-02    1.20
Musk shrew                4.80e-02    0.33
Echidna                   3.00e+00   25.00
Brazilian tapir           1.60e+02  169.00
Tenrec                    9.00e-01    2.60
Phalanger                 1.62e+00   11.40
Tree shrew                1.04e-01    2.50
Red fox                   4.24e+00   50.40

9

The data set lq2002{multilevel} cotains responses from a sample of 2,042 soldiers to a survey of 19 items. Items 1-11 measure leadership climate, items 12-14 measure task significance, and items 15-19 measure the hostility felt by the soldier. The soldiers came from 49 companies. For this problem, use only the company with the largest number of soldiers and only the hostility items.

Convert the scores greater than or equal to 4 to 1 and 0 otherwise for all 5 items.

library(multilevel)

Warning: package 'multilevel' was built under R version 3.2.4

Loading required package: nlme

Warning: package 'nlme' was built under R version 3.2.4

Loading required package: MASS

data(lq2002)
#head(lq2002)
which.max(table(lq2002$COMPID))

13 
 9

lq<-lq2002[which(lq2002$COMPID=="13"),15:19]
lqq<-apply(lq,2,function(x) {ifelse(x >= 4,1,0)})
head(lqq)

    TSIG02 TSIG03 HOSTIL01 HOSTIL02 HOSTIL03
234      0      0        0        0        1
235      1      1        0        0        0
236      1      1        1        1        1
237      0      0        1        1        0
238      0      0        0        0        1
239      0      0        0        0        0

Compute the proportions of 1’s for the five items and present them in descending order.

sort(apply(lqq,2,mean),decreasing =T)

  TSIG02   TSIG03 HOSTIL03 HOSTIL01 HOSTIL02 
  0.6061   0.5960   0.1313   0.1111   0.0606

HW3

U_76031019

1

2.

2.1

2.2

3

4

5

7

8

9