The R script illustrates how several files can be merged together. Investigate how it works.
此數行將資料讀入並了解dta的結構、名稱與前幾項。並將現在工作路徑存為s1。
s1 <- getwd()
dta <- read.table("hs0.txt", header=T)
str(dta)
'data.frame': 200 obs. of 11 variables:
$ id : int 70 121 86 141 172 113 50 11 84 48 ...
$ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
$ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
$ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
$ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
$ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
$ read : int 57 68 44 63 47 44 50 34 63 57 ...
$ write : int 52 59 33 44 52 52 59 46 57 55 ...
$ math : int 41 53 54 47 57 51 42 45 54 52 ...
$ science: int 47 63 58 53 53 63 53 39 58 NA ...
$ socst : int 57 61 31 56 61 61 61 36 51 51 ...
names(dta)
[1] "id" "female" "race" "ses" "schtyp" "prog" "read"
[8] "write" "math" "science" "socst"
head(dta)
id female race ses schtyp prog read write math science socst
1 70 male white low public general 57 52 41 47 57
2 121 female white middle public vocation 68 59 53 63 61
3 86 male white high public general 44 33 54 58 31
4 141 male white high public vocation 63 44 47 53 56
5 172 male white middle public academic 47 52 57 53 61
6 113 male white middle public academic 44 52 51 63 61
此數行將dta的前六項另存為物件z,並將’read’,‘write’,‘math’,’science’還有’socst’分別與id存為新的檔案z1-z5。
z <- head(dta)
write.table(z[, c(1, 7)], file="z1.txt", quote=F, sep=" ",
col.names=T, row.names=F)
write.table(z[, c(1, 8)], file="z2.txt", quote=F, sep=" ",
col.names=T, row.names=F)
write.table(z[, c(1, 9)], file="z3.txt", quote=F, sep=" ",
col.names=T, row.names=F)
write.table(z[, c(1, 10)], file="z4.txt", quote=F, sep=" ",
col.names=T, row.names=F)
write.table(z[, c(1, 11)], file="z5.txt", quote=F, sep=" ",
col.names=T, row.names=F)
接下來使用list.files這個指令將工作路徑中的檔案清單儲存為nfiles(包括我們剛剛存的新檔案z1-z5)。
nfiles <- list.files(path=s1)
nfiles
[1] "aaup2.dat.txt" "dataM_hw1_U76031019.pdf"
[3] "dataM_hw1_U76031019.R" "dataM_hw1_U76031019.Rmd"
[5] "dataM_hw1_U76031019_2.html" "dataM_hw1_U76031019_2.R"
[7] "dataM_hw1_U76031019_2.Rmd" "dataM_hw2_U76031019.html"
[9] "dataM_hw2_U76031019.R" "dataM_hw2_U76031019.Rmd"
[11] "dataM_hw3_U76031019.html" "dataM_hw3_U76031019.R"
[13] "dataM_hw3_U76031019.Rmd" "dmIntro.pdf"
[15] "dplyr_note.pdf" "eduSex.R"
[17] "ex4.R" "file.csv"
[19] "file.dat" "file.sas"
[21] "hs0.txt" "Intro_R.pdf"
[23] "jsp.csv" "juniorSchools.txt"
[25] "mathAttainment.txt" "P005.txt"
[27] "RDataManip.pdf" "readingtimes.txt"
[29] "RInOut.pdf" "RIntro.pdf"
[31] "rsconnect" "savedfile"
[33] "titanic.raw.rdata" "v53i07.pdf"
[35] "verbalIQ.txt" "z1.txt"
[37] "z2.txt" "z3.txt"
[39] "z4.txt" "z5.txt"
這邊結合lapply與read.csv,批次讀入z1-z5並存成一個list物件’filesIn’。
filesIn <- lapply(nfiles[36:40], FUN=read.csv, sep=" ")
filesIn
[[1]]
id read
1 70 57
2 121 68
3 86 44
4 141 63
5 172 47
6 113 44
[[2]]
id write
1 70 52
2 121 59
3 86 33
4 141 44
5 172 52
6 113 52
[[3]]
id math
1 70 41
2 121 53
3 86 54
4 141 47
5 172 57
6 113 51
[[4]]
id science
1 70 47
2 121 63
3 86 58
4 141 53
5 172 53
6 113 63
[[5]]
id socst
1 70 57
2 121 61
3 86 31
4 141 56
5 172 61
6 113 61
Reduce(function(x, y) merge(x, y, all=T, by="id"), filesIn, accumulate=F)
id read write math science socst
1 70 57 52 41 47 57
2 86 44 33 54 58 31
3 113 44 52 51 63 61
4 121 68 59 53 63 61
5 141 63 44 47 53 56
6 172 47 52 57 53 61
A subset of data from the National Longitudinal Survey of Youth is presented in a long format.
Covert the dataset to a wide format.
這個資料是一個longitudinal的資料,共有4次測量,其中用來辨別個體的變項為id,不會變動的變項為sex跟race,會變動的變項為grade,year,month,math與read,用來識別第幾次測量的變項為time。
dta<-read.csv('http://titan.ccunix.ccu.edu.tw/~psycfs/dataM/Data/nlsy86long.csv')
head(dta)
id sex race time grade year month math read
1 2390 Female Majority 1 0 6 67 14.285714 19.047619
2 2560 Female Majority 1 0 6 66 20.238095 21.428571
3 3740 Female Majority 1 0 6 67 17.857143 21.428571
4 4020 Male Majority 1 0 5 60 7.142857 7.142857
5 6350 Male Majority 1 1 7 78 29.761905 30.952381
6 7030 Male Majority 1 0 5 62 14.285714 17.857143
dta2<-reshape(dta, idvar="id", timevar="time",v.names=c('grade','year','month','math','read'), direction="wide")
head(dta2)
id sex race grade.1 year.1 month.1 math.1 read.1 grade.2
1 2390 Female Majority 0 6 67 14.285714 19.047619 2
2 2560 Female Majority 0 6 66 20.238095 21.428571 1
3 3740 Female Majority 0 6 67 17.857143 21.428571 2
4 4020 Male Majority 0 5 60 7.142857 7.142857 1
5 6350 Male Majority 1 7 78 29.761905 30.952381 3
6 7030 Male Majority 0 5 62 14.285714 17.857143 2
year.2 month.2 math.2 read.2 grade.3 year.3 month.3 math.3
1 8 96 15.47619 29.76190 3 10 119 38.09524
2 8 95 36.90476 32.14286 3 10 119 52.38095
3 8 95 22.61905 45.23810 5 10 122 53.57143
4 8 91 21.42857 21.42857 2 9 112 53.57143
5 9 108 50.00000 50.00000 4 11 132 47.61905
6 8 93 36.90476 46.42857 3 10 117 55.95238
read.3 grade.4 year.4 month.4 math.4 read.4
1 28.57143 5 12 142 41.66667 45.23810
2 45.23810 6 12 143 58.33333 57.14286
3 69.04762 6 12 144 58.33333 78.57143
4 50.00000 5 12 139 51.19048 59.52381
5 63.09524 6 13 155 71.42857 82.14286
6 64.28571 5 12 139 63.09524 96.42857
Recode grade 0, 1 and 2 to “kindergarten” “first” and “second”.
dta$grade[dta$grade==0]<-"kindergarten"
dta$grade[dta$grade==1]<-"first"
dta$grade[dta$grade==2]<-"second"
head(dta)
id sex race time grade year month math read
1 2390 Female Majority 1 kindergarten 6 67 14.285714 19.047619
2 2560 Female Majority 1 kindergarten 6 66 20.238095 21.428571
3 3740 Female Majority 1 kindergarten 6 67 17.857143 21.428571
4 4020 Male Majority 1 kindergarten 5 60 7.142857 7.142857
5 6350 Male Majority 1 first 7 78 29.761905 30.952381
6 7030 Male Majority 1 kindergarten 5 62 14.285714 17.857143
Given a vector of numerical values, find the most common element (mode) for it. Report only the element and the number of its occurrences. For example, use num <- sample(100, rep=T) to generate (with replacement) a list of random integers from 1 to 100.
num <- sample(100, rep=T)
sort(table(num),d=T)
num
24 51 53 75 81 3 4 10 14 17 22 29 36 41 43 47 59 70 71 79 87 88 89 90 92
3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
99 1 5 7 8 11 12 13 21 23 26 27 28 30 32 33 35 38 40 42 44 45 46 54 55
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
56 57 58 61 63 64 66 67 69 72 73 76 77 78 82 83 91 96 97
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
The HELP (Health Evaluation and Linkage to Primary Care) study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. Eligible subjects were adults, who spoke Spanish or English, reported alcohol, heroin or cocaine as their first or second drug of choice, resided in proximity to the primary care clinic to which they would be referred or were homeless. Subjects were interviewed at baseline during their detoxification stay and follow-up interviews were undertaken every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.
設定與讀資料,瞭解資料格式
options(digits=3)
options(width=68)
ds <- read.csv("http://www.math.smith.edu/r/data/help.csv")
str(ds)
'data.frame': 453 obs. of 88 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ e2b1 : int NA NA NA NA NA NA NA 1 1 1 ...
$ g1b1 : int 0 0 0 0 0 0 NA 0 0 0 ...
$ i11 : int NA 8 NA NA 64 1 NA NA 12 13 ...
$ pcs1 : num 54.2 59.6 58.5 46.6 31.4 ...
$ mcs1 : num 52.2 41.7 56.8 14.7 40.7 ...
$ cesd1 : int 7 11 14 44 26 23 NA 18 33 37 ...
$ indtot1 : int 5 12 99 97 55 12 NA 82 92 104 ...
$ drugrisk1 : int 0 0 13 0 0 0 NA 0 8 14 ...
$ sexrisk1 : int 1 0 4 4 4 4 NA 4 4 5 ...
$ pcrec1 : int 1 0 0 0 0 0 NA 0 0 2 ...
$ e2b2 : int NA NA NA 1 NA NA 1 NA NA 2 ...
$ g1b2 : int NA NA NA NA 0 NA 0 NA NA 0 ...
$ i12 : int NA NA NA NA NA NA 13 NA NA 22 ...
$ pcs2 : num NA NA NA 57.6 44.8 ...
$ mcs2 : num NA NA NA 23.9 42.4 ...
$ cesd2 : int NA NA NA NA 27 NA 47 NA NA 47 ...
$ indtot2 : int NA NA NA NA 34 NA 92 NA NA 100 ...
$ drugrisk2 : int NA NA NA NA 0 NA 0 NA NA 19 ...
$ sexrisk2 : int NA NA NA NA 2 NA 9 NA NA 0 ...
$ pcrec2 : int NA NA NA 0 0 NA 2 NA NA 2 ...
$ e2b3 : int NA NA NA NA 1 NA NA 2 4 NA ...
$ g1b3 : int 0 NA NA NA 0 NA 0 0 0 NA ...
$ i13 : int NA NA NA NA 13 NA 13 12 27 NA ...
$ pcs3 : num 52.1 NA NA NA 25 ...
$ mcs3 : num 56.1 NA NA NA 50.9 ...
$ cesd3 : int 8 NA NA NA 15 NA 54 25 35 NA ...
$ indtot3 : int 0 NA NA NA 37 NA 104 92 88 NA ...
$ drugrisk3 : int 0 NA NA NA 0 NA 0 6 7 NA ...
$ sexrisk3 : int 1 NA NA NA 4 NA 7 6 8 NA ...
$ pcrec3 : int 2 NA NA NA 0 NA 2 0 2 NA ...
$ e2b4 : int NA NA 2 NA NA NA 1 NA NA NA ...
$ g1b4 : int 0 NA 0 0 0 NA 0 NA 0 NA ...
$ i14 : int 8 NA 3 NA 6 NA NA NA 9 NA ...
$ pcs4 : num 52.3 NA 66.2 57.1 44.7 ...
$ mcs4 : num 58 NA 13.6 53.5 32.7 ...
$ cesd4 : int 5 NA 49 20 28 NA 52 NA 27 NA ...
$ indtot4 : int 34 NA 94 5 58 NA 113 NA 93 NA ...
$ drugrisk4 : int 0 NA 19 0 0 NA 0 NA 0 NA ...
$ sexrisk4 : int 3 NA 4 4 2 NA 7 NA 3 NA ...
$ pcrec4 : int 2 NA 0 0 0 NA 2 NA 2 NA ...
$ a15a : int 0 2 0 0 15 0 0 4 1 4 ...
$ a15b : int 0 3 0 0 0 0 0 1 134 20 ...
$ d1 : int 3 22 0 2 12 1 14 1 14 4 ...
$ e2b : int NA NA NA 1 1 NA 1 8 7 3 ...
$ f1a : int 3 3 3 0 3 1 3 1 3 2 ...
$ f1b : int 2 2 2 0 0 0 1 1 2 3 ...
$ f1c : int 3 0 3 1 3 1 3 2 3 3 ...
$ f1d : int 0 3 0 3 3 3 1 3 1 0 ...
$ f1e : int 2 3 2 2 3 0 3 3 3 1 ...
$ f1f : int 3 2 2 2 3 0 3 3 3 2 ...
$ f1g : int 3 0 1 1 1 0 3 3 3 3 ...
$ f1h : int 0 0 3 3 3 3 1 1 0 0 ...
$ f1i : int 2 3 2 0 3 0 3 1 3 3 ...
$ f1j : int 3 0 3 0 2 1 3 0 3 1 ...
$ f1k : int 3 3 1 1 3 1 3 3 3 3 ...
$ f1l : int 0 0 0 2 2 3 0 1 1 0 ...
$ f1m : int 1 0 1 2 2 1 0 3 2 3 ...
$ f1n : int 2 3 3 2 3 0 3 0 3 3 ...
$ f1o : int 2 0 2 0 0 1 3 1 2 1 ...
$ f1p : int 2 0 0 NA 3 3 1 0 0 0 ...
$ f1q : int 2 0 0 2 3 0 3 2 1 0 ...
$ f1r : int 3 2 3 0 3 0 3 2 3 3 ...
$ f1s : int 3 0 2 0 3 0 3 0 1 0 ...
$ f1t : int 2 0 0 1 3 0 3 0 2 3 ...
$ g1b : int 1 1 0 0 0 0 1 1 0 0 ...
$ i1 : int 13 56 0 5 10 4 13 12 71 20 ...
$ i2 : int 26 62 0 5 13 4 20 24 129 27 ...
$ age : int 37 37 26 39 32 47 49 28 50 39 ...
$ treat : int 1 1 0 0 0 1 0 1 0 1 ...
$ homeless : int 0 1 0 0 1 0 0 1 1 1 ...
$ pcs : num 58.4 36 74.8 61.9 37.3 ...
$ mcs : num 25.11 26.67 6.76 43.97 21.68 ...
$ cesd : int 49 30 39 15 39 6 52 32 50 46 ...
$ indtot : int 39 43 41 28 38 29 38 44 44 44 ...
$ pss_fr : int 0 1 13 11 10 5 1 4 5 0 ...
$ drugrisk : int 0 0 20 0 0 0 0 7 18 20 ...
$ sexrisk : int 4 7 2 4 6 5 8 6 8 0 ...
$ satreat : int 0 0 0 1 0 0 1 1 0 1 ...
$ drinkstatus : int 1 1 1 0 1 1 NA 1 1 1 ...
$ daysdrink : int 177 2 3 196 2 31 NA 47 62 115 ...
$ anysubstatus: int 1 1 1 1 1 1 NA 1 1 1 ...
$ daysanysub : int 177 2 3 189 2 31 NA 47 31 115 ...
$ linkstatus : int 1 NA 0 0 1 0 0 0 0 0 ...
$ dayslink : int 225 NA 365 343 57 365 334 365 365 382 ...
$ female : int 0 0 0 1 0 1 1 0 1 0 ...
$ substance : Factor w/ 3 levels "alcohol","cocaine",..: 2 1 3 3 2 2 2 1 1 3 ...
$ racegrp : Factor w/ 4 levels "black","hispanic",..: 1 4 1 4 1 1 1 4 4 4 ...
names(ds)
[1] "id" "e2b1" "g1b1" "i11"
[5] "pcs1" "mcs1" "cesd1" "indtot1"
[9] "drugrisk1" "sexrisk1" "pcrec1" "e2b2"
[13] "g1b2" "i12" "pcs2" "mcs2"
[17] "cesd2" "indtot2" "drugrisk2" "sexrisk2"
[21] "pcrec2" "e2b3" "g1b3" "i13"
[25] "pcs3" "mcs3" "cesd3" "indtot3"
[29] "drugrisk3" "sexrisk3" "pcrec3" "e2b4"
[33] "g1b4" "i14" "pcs4" "mcs4"
[37] "cesd4" "indtot4" "drugrisk4" "sexrisk4"
[41] "pcrec4" "a15a" "a15b" "d1"
[45] "e2b" "f1a" "f1b" "f1c"
[49] "f1d" "f1e" "f1f" "f1g"
[53] "f1h" "f1i" "f1j" "f1k"
[57] "f1l" "f1m" "f1n" "f1o"
[61] "f1p" "f1q" "f1r" "f1s"
[65] "f1t" "g1b" "i1" "i2"
[69] "age" "treat" "homeless" "pcs"
[73] "mcs" "cesd" "indtot" "pss_fr"
[77] "drugrisk" "sexrisk" "satreat" "drinkstatus"
[81] "daysdrink" "anysubstatus" "daysanysub" "linkstatus"
[85] "dayslink" "female" "substance" "racegrp"
選擇特定的columns,存為newds。其中有一行unlames(newds)應該是打錯字,可能是colnames之類的。應該不太重要。
newds <- ds[, c("cesd","female","i1","i2","id","treat","f1a",
"f1b","f1c","f1d","f1e","f1f","f1g","f1h","f1i","f1j","f1k",
"f1l","f1m","f1n","f1o","f1p","f1q","f1r","f1s","f1t")]
attach(newds)
str(newds[ ,1:10])
'data.frame': 453 obs. of 10 variables:
$ cesd : int 49 30 39 15 39 6 52 32 50 46 ...
$ female: int 0 0 0 1 0 1 1 0 1 0 ...
$ i1 : int 13 56 0 5 10 4 13 12 71 20 ...
$ i2 : int 26 62 0 5 13 4 20 24 129 27 ...
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ treat : int 1 1 0 0 0 1 0 1 0 1 ...
$ f1a : int 3 3 3 0 3 1 3 1 3 2 ...
$ f1b : int 2 2 2 0 0 0 1 1 2 3 ...
$ f1c : int 3 0 3 1 3 1 3 2 3 3 ...
$ f1d : int 0 3 0 3 3 3 1 3 1 0 ...
使用package“plyr”,colwise這個function可以讓一些用在vector的function可以columnwise的使用,我覺得跟apply(,2,)有點像。
#install.packages("plyr")
library(plyr)
Warning: package 'plyr' was built under R version 3.2.4
colwise(mean)(newds[ ,1:10])
cesd female i1 i2 id treat f1a f1b f1c f1d
1 32.8 0.236 17.9 22.6 233 0.497 1.63 1.39 1.92 1.57
看newds前五個row的資料
head(newds, n=5)
cesd female i1 i2 id treat f1a f1b f1c f1d f1e f1f f1g f1h f1i
1 49 0 13 26 1 1 3 2 3 0 2 3 3 0 2
2 30 0 56 62 2 1 3 2 0 3 3 2 0 0 3
3 39 0 0 0 3 0 3 2 3 0 2 2 1 3 2
4 15 1 5 5 4 0 0 0 1 3 2 2 1 3 0
5 39 0 10 13 5 0 3 0 3 3 3 3 1 3 3
f1j f1k f1l f1m f1n f1o f1p f1q f1r f1s f1t
1 3 3 0 1 2 2 2 2 3 3 2
2 0 3 0 0 3 0 0 0 2 0 0
3 3 1 0 1 3 2 0 0 3 2 0
4 0 1 2 2 2 0 NA 2 0 0 1
5 2 3 2 2 3 0 3 3 3 3 3
stopifnot可以檢查數個判斷式是否均為真,在這邊沒有return任何error,代表判斷式均為真。
stopifnot(cesd >= 0 & cesd <= 60, i1 >= 0, female == 0 | female == 1)
不太確定try的功用,感覺好像跟直接跑一樣。geterrmessage可以記錄error的內容。
try(stopifnot(i1 >= 1))
geterrmessage()
[1] "Error : i1 >= 1 are not all TRUE\n"
comment加入註解
comment(newds) <- "HELP baseline dataset"
comment(newds)
[1] "HELP baseline dataset"
將ds存為一RData“savedfile”
save(ds, file="savedfile")
將newds存為一csv檔“file.csv”
write.csv(newds, "file.csv")
使用package “foreign”將資料存為“file.dat”與“file.sas”
library(foreign)
write.foreign(newds, "file.dat", "file.sas", package="SAS")
看cesd的前10個東西
cesd[1:10]
[1] 49 30 39 15 39 6 52 32 50 46
選擇cesd>55的內容
cesd[cesd > 55]
[1] 57 58 57 60 58 56 58 56 57 56
回傳cesd>55的位置
which(cesd > 55)
[1] 64 116 171 194 231 266 295 305 387 415
排序cesd的前四個元素,預設由小到大排列
sort(cesd)[1:4]
[1] 1 3 3 4
看f1g中有多少missing,FALSE的是沒有遺漏,
table(is.na(f1g))
FALSE TRUE
452 1
將某些變項反序排列,作成新的檔案“cesditems”後,將資料依沒有missing的比例調整,最後展示具有missing的項目總合調整後的結果。
cesditems <- cbind(f1a, f1b, f1c, (3 - f1d), f1e, f1f, f1g,
(3 - f1h), f1i, f1j, f1k, (3 - f1l), f1m, f1n, f1o,
(3 - f1p), f1q, f1r, f1s, f1t)
nmisscesd <- apply(is.na(cesditems), 1, sum)
ncesditems <- cesditems
ncesditems[is.na(cesditems)] <- 0
newcesd <- apply(ncesditems, 1, sum)
imputemeancesd <- 20/(20 - nmisscesd)*newcesd
cbind(newcesd, cesd, nmisscesd, imputemeancesd)[nmisscesd > 0, ]
newcesd cesd nmisscesd imputemeancesd
[1,] 15 15 1 15.8
[2,] 19 19 1 20.0
[3,] 44 44 1 46.3
[4,] 17 17 1 17.9
[5,] 29 29 1 30.5
[6,] 44 44 1 46.3
[7,] 39 39 1 41.1
創造與i1一樣多個的空白的character物件
drinkstat <- character(length(i1))
依據條件填入字串
drinkstat[i1 == 0] <- "abstinent"
# create moderate group
drinkstat[(i1 > 0 & i1 <= 1 & i2 <= 3 & female == 1) |
(i1 > 0 & i1 <= 2 & i2 <= 4 & female == 0)] <- "moderate"
# create highrisk group
drinkstat[(( i1>1 | i2>3 ) & female == 1) |
((i1 > 2 | i2 > 4) & female == 0)] <- "highrisk"
檢查drinstat中是否有遺漏值,結果顯示沒有遺漏
is.na(drinkstat) = is.na(i1) | is.na(i2) | is.na(female)
table(is.na(drinkstat))
FALSE
453
將一些變項組合成新的data frame
tmpds <- data.frame(i1, i2, female, drinkstat)
tmpds[361:370, ]
i1 i2 female drinkstat
361 37 37 0 highrisk
362 25 25 0 highrisk
363 38 38 0 highrisk
364 12 29 0 highrisk
365 6 24 0 highrisk
366 6 6 0 highrisk
367 0 0 0 abstinent
368 0 0 1 abstinent
369 8 8 0 highrisk
370 32 32 0 highrisk
依照drinkstat和female的值條件篩選
tmpds[tmpds$drinkstat == "moderate" & tmpds$female == 1, ]
i1 i2 female drinkstat
116 1 1 1 moderate
137 1 3 1 moderate
225 1 2 1 moderate
230 1 1 1 moderate
264 1 1 1 moderate
266 1 1 1 moderate
394 1 1 1 moderate
將cesd切分,並計算組別。right=F代表上界為開區間。
table(cut(cesd, c(0, 16, 22, 60), right=FALSE))
[0,16) [16,22) [22,60)
46 42 364
計算drinkstate中missing的總數,結果沒有missing。
sum(is.na(drinkstat))
[1] 0
計算drinkstate各個變項的數量
table(drinkstat, exclude="NULL")
drinkstat
abstinent highrisk moderate
68 357 28
依照drinkstate和female作列聯表
table(drinkstat, female, exclude="NULL")
Warning in as.vector(exclude, typeof(x)): 強制變更過程中產生了 NA
female
drinkstat 0 1
abstinent 42 26
highrisk 283 74
moderate 21 7
將female=1和female=0分別重新編碼成gender=female、gender=male
gender = factor(female, c(0,1), c("male","Female"))
table(female)
female
0 1
346 107
table(gender)
gender
male Female
346 107
只選擇女生的資料並計算平均(共4種方法)
detach(newds)
newds <- ds[order(ds$cesd, ds$i1), ]
newds[1:5, c("cesd", "i1", "id")]
cesd i1 id
199 1 3 233
394 3 1 139
349 3 13 418
417 4 4 251
85 4 9 95
females <- ds[ds$female == 1, ]
attach(females)
mean(cesd)
[1] 36.9
with(ds, mean(cesd[female == 1]))
[1] 36.9
tapply(ds$cesd, ds$female, mean)
0 1
31.6 36.9
aggregate(ds$cesd, list(ds$female), mean)
Group.1 x
1 0 31.6
2 1 36.9
An instructor has 12 students is his class: Amy, Brad, Chad, Daisy, Ed, Fey, George, Hillary, Ike, Jane, Kim, Luke. For a class assignment, he needs to permute these names at random. Help him accomplish it with R.
sample(c('Amy','Brad','Chad','Daisy','Ed','Fey','George','Hillary','Ike','Jane','Kim','Luke'),12)
[1] "Amy" "Daisy" "Ike" "Jane" "Brad" "Luke"
[7] "Kim" "George" "Chad" "Ed" "Hillary" "Fey"
The ‘cabbages’ data set in the ‘MASS’ package has 4 variables: Cult, Date, HeadWt, VitC. Ignore the last variable and make ‘Date’ a numeric variable. Plot 95% CIs for the variable ‘HeadWt’ by ‘Cult’ and ‘Date’, jointly.
dat<-MASS::cabbages
dat<-dat[,-dim(dta)[2]]
dat$Date<-as.numeric(dat$Date)
dta <- ddply(dat, c("Cult", "Date"),summarize,ll=mean(HeadWt)+c(-1.96,1.96)*sd(HeadWt)/sqrt(length(HeadWt)))
dta$bnd <- rep(c("l", "u"), 6)
dtaw <- reshape2::dcast(dta, Cult + Date ~ bnd,value.var="ll")
s <- 1:6; d <- 2*pi/1000
plot(c(0.5,4), c(0.5, 4), type="n", xlab="Date", ylab="HeadWt")
segments(dtaw$Date[s]+d*(s-1), dtaw$l[s], dtaw$Date[s]+d*(s-1),
dtaw$u[s], col=dtaw$Cult, lwd=2)
points(dtaw$Date[s]+d*(s-1), (dtaw$l[s]+dtaw$u[s])/2,
col=dtaw$Cult, pch=16)
legend("topright", legend=c("c39", "c52"), text.col=c(1,2))
grid()
The ‘MASS’ library has these two data sets: ‘Animals’ and ‘mammals’. Merge the two files and remove duplicated observations using ‘duplicated’.
ani<-rbind(MASS::Animals,MASS::mammals)
ani[!duplicated(ani),]
body brain
Mountain beaver 1.35e+00 8.10
Cow 4.65e+02 423.00
Grey wolf 3.63e+01 119.50
Goat 2.77e+01 115.00
Guinea pig 1.04e+00 5.50
Dipliodocus 1.17e+04 50.00
Asian elephant 2.55e+03 4603.00
Donkey 1.87e+02 419.00
Horse 5.21e+02 655.00
Potar monkey 1.00e+01 115.00
Cat 3.30e+00 25.60
Giraffe 5.29e+02 680.00
Gorilla 2.07e+02 406.00
Human 6.20e+01 1320.00
African elephant 6.65e+03 5712.00
Triceratops 9.40e+03 70.00
Rhesus monkey 6.80e+00 179.00
Kangaroo 3.50e+01 56.00
Golden hamster 1.20e-01 1.00
Mouse 2.30e-02 0.40
Rabbit 2.50e+00 12.10
Sheep 5.55e+01 175.00
Jaguar 1.00e+02 157.00
Chimpanzee 5.22e+01 440.00
Rat 2.80e-01 1.90
Brachiosaurus 8.70e+04 154.50
Mole 1.22e-01 3.00
Pig 1.92e+02 180.00
Arctic fox 3.38e+00 44.50
Owl monkey 4.80e-01 15.50
Roe deer 1.48e+01 98.20
Verbet 4.19e+00 58.00
Chinchilla 4.25e-01 6.40
Ground squirrel 1.01e-01 4.00
Arctic ground squirrel 9.20e-01 5.70
African giant pouched rat 1.00e+00 6.60
Lesser short-tailed shrew 5.00e-03 0.14
Star-nosed mole 6.00e-02 1.00
Nine-banded armadillo 3.50e+00 10.80
Tree hyrax 2.00e+00 12.30
N.A. opossum 1.70e+00 6.30
Big brown bat 2.30e-02 0.30
European hedgehog 7.85e-01 3.50
Galago 2.00e-01 5.00
Genet 1.41e+00 17.50
Grey seal 8.50e+01 325.00
Rock hyrax-a 7.50e-01 12.30
Water opossum 3.50e+00 3.90
Yellow-bellied marmot 4.05e+00 17.00
Little brown bat 1.00e-02 0.25
Slow loris 1.40e+00 12.50
Okapi 2.50e+02 490.00
Baboon 1.06e+01 179.50
Desert hedgehog 5.50e-01 2.40
Giant armadillo 6.00e+01 81.00
Rock hyrax-b 3.60e+00 21.00
Raccoon 4.29e+00 39.20
E. American mole 7.50e-02 1.20
Musk shrew 4.80e-02 0.33
Echidna 3.00e+00 25.00
Brazilian tapir 1.60e+02 169.00
Tenrec 9.00e-01 2.60
Phalanger 1.62e+00 11.40
Tree shrew 1.04e-01 2.50
Red fox 4.24e+00 50.40
The data set lq2002{multilevel} cotains responses from a sample of 2,042 soldiers to a survey of 19 items. Items 1-11 measure leadership climate, items 12-14 measure task significance, and items 15-19 measure the hostility felt by the soldier. The soldiers came from 49 companies. For this problem, use only the company with the largest number of soldiers and only the hostility items.
library(multilevel)
Warning: package 'multilevel' was built under R version 3.2.4
Loading required package: nlme
Warning: package 'nlme' was built under R version 3.2.4
Loading required package: MASS
data(lq2002)
#head(lq2002)
which.max(table(lq2002$COMPID))
13
9
lq<-lq2002[which(lq2002$COMPID=="13"),15:19]
lqq<-apply(lq,2,function(x) {ifelse(x >= 4,1,0)})
head(lqq)
TSIG02 TSIG03 HOSTIL01 HOSTIL02 HOSTIL03
234 0 0 0 0 1
235 1 1 0 0 0
236 1 1 1 1 1
237 0 0 1 1 0
238 0 0 0 0 1
239 0 0 0 0 0
sort(apply(lqq,2,mean),decreasing =T)
TSIG02 TSIG03 HOSTIL03 HOSTIL01 HOSTIL02
0.6061 0.5960 0.1313 0.1111 0.0606