"Learning the R Reference Card 2.0"
"Date: 2013-04-01"
by_daigazi
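sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
Description: Return an array obtained from an input array by sweeping out a summary statistic.
x: an array. MARGIN: 1 or 2, meaning rows or columns. STATS: the summary statistic to sweep out, for example a single number or a vector. FUN defaults to "-" and can be replaced by "/" and so on.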
A <- array(1:24, dim = 4:2)
A
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 13 17 21
## [2,] 14 18 22
## [3,] 15 19 23
## [4,] 16 20 24
## no warnings in normal use
sweep(A, 1, 5)
## , , 1
##
## [,1] [,2] [,3]
## [1,] -4 0 4
## [2,] -3 1 5
## [3,] -2 2 6
## [4,] -1 3 7
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 8 12 16
## [2,] 9 13 17
## [3,] 10 14 18
## [4,] 11 15 19
prop.table(x, margin = NULL)
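Description: This is really sweep(x, margin, margin.table(x, margin), "/") for newbies, except that if margin has length zero, then one gets x/sum(x).
prop stands for proportion: sum over each row or each column, then divide every element by its row or column total.
prop.table(x, margin = 1) is not equivalent to sweep(x, 1, sum, FUN = "/"). That call is wrong because STATS must be a value, not a function.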
x <- matrix(1:4, 2)
x
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
prop.table(x, margin = 1)
## [,1] [,2]
## [1,] 0.2500 0.7500
## [2,] 0.3333 0.6667
# the call above is equivalent to
sweep(x, 1, margin.table(x, 1), FUN = "/") ## correct
## [,1] [,2]
## [1,] 0.2500 0.7500
## [2,] 0.3333 0.6667
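The equivalence can be checked directly. The sketch below uses a made-up matrix `m` (not from the original notes): it first sweeps out column means with the default `FUN = "-"`, then reproduces `prop.table` with `FUN = "/"`.

```r
# A small illustrative matrix (hypothetical data)
m <- matrix(1:6, nrow = 2)

# Subtract each column's mean from that column (FUN defaults to "-")
centered <- sweep(m, 2, colMeans(m))

# Divide each row by its row total; this is what prop.table(m, 1) does
row_share <- sweep(m, 1, rowSums(m), FUN = "/")

# Compare with prop.table
same_as_prop_table <- isTRUE(all.equal(row_share, prop.table(m, 1)))
```

Note that STATS is always a computed value (`colMeans(m)`, `rowSums(m)`), never the function itself.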
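split(x, f, drop = FALSE, ...)
Description: split divides the data in the vector x into the groups defined by f. The replacement forms replace values corresponding to such a division. unsplit reverses the effect of split.
split groups x according to f. x is a vector or data frame; f is a factor, or something that as.factor() can turn into one.
x is partitioned by the levels of f, and a list is returned whose names are the factor levels.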
set.seed(1234)
x = sample(1:30, 30, replace = T)
x
## [1] 4 19 19 19 26 20 1 7 20 16 21 17 9 28 9 26 9 9 6 7 10 10 5
## [24] 2 7 25 16 28 25 2
y = as.factor(sample(1:3, 30, replace = T))
levels(y)
## [1] "1" "2" "3"
split(x, f = y)
## $`1`
## [1] 19 19 26 1 7 9 9 6 10 10 7 25
##
## $`2`
## [1] 4 19 21 17 28 26 9 2 25 16
##
## $`3`
## [1] 20 20 16 9 7 5 28 2
names(split(x, f = y)) # inspect the names of the returned list
## [1] "1" "2" "3"
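A common follow-up is to compute a per-group summary and then put the pieces back together. The sketch below uses made-up `scores` and `group` vectors (hypothetical names) to show `split`, `sapply` and `unsplit` together.

```r
# Hypothetical data: five scores in two groups
scores <- c(90, 72, 85, 60, 77)
group <- factor(c("a", "b", "a", "b", "a"))

# One list element per factor level
parts <- split(scores, group)

# Summarise each group
group_means <- sapply(parts, mean)

# unsplit() restores the original order
restored <- unsplit(parts, group)
```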
choose computes the number of combinations (the binomial coefficient).
choose(5, 2) # the result is 10
## [1] 10
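As a cross-check, the same number falls out of the factorial formula and out of enumerating the combinations with `combn`. This is only an illustrative sketch; the variable names are made up.

```r
n <- 5
k <- 2

# Binomial coefficient from the factorial formula n! / (k! (n - k)!)
by_formula <- factorial(n) / (factorial(k) * factorial(n - k))

# combn() lists every combination as one column
all_pairs <- combn(n, k)
n_pairs <- ncol(all_pairs)
```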
Description: Create a contingency table (optionally a sparse matrix) from cross-classifying factors, usually contained in a data frame, using a formula interface.
In other words, a new table is built from the cross-classifying factors.
It works much like a pivot table in Excel. For a concrete example see http://cos.name/cn/topic/11566#post-157688
cbind(1,1:7) # bind as columns
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 4
## [5,] 1 5
## [6,] 1 6
## [7,] 1 7
rbind(1,1:7) # bind as rows
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1 1 1 1 1 1 1
## [2,] 1 2 3 4 5 6 7
text <- "V1 V2 V3
2006 1 1871
2006 2 1828
2006 3 2126
2006 4 2172
2006 5 2340
2006 6 2397
2006 7 2389
2006 8 2444
2006 9 2430
2006 10 2490
2006 11 2554
2006 12 2736
2007 1 2404
2007 2 2289
2007 3 2604
2007 4 2646
2007 5 2741
2007 6 2889
2007 7 2811
2007 8 2796
2007 9 2890
2007 10 2854
2007 11 2878
2007 12 2958"
text;class(text)
## [1] "character"
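text <- gsub(" +", " ", text) # replace every run of spaces with a single space so that sep = " " parses cleanly
tab <- read.table(textConnection(text), sep=" ", header=TRUE)
dt.tab <- xtabs(V3~V2+V1, tab) # tab is the data set
# each cell holds the sum of V3 for that combination of V2 and V1
addmargins(dt.tab) # add margin totals (a Sum row and a Sum column)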
## V1
## V2 2006 2007 Sum
## 1 1871 2404 4275
## 2 1828 2289 4117
## 3 2126 2604 4730
## 4 2172 2646 4818
## 5 2340 2741 5081
## 6 2397 2889 5286
## 7 2389 2811 5200
## 8 2444 2796 5240
## 9 2430 2890 5320
## 10 2490 2854 5344
## 11 2554 2878 5432
## 12 2736 2958 5694
## Sum 27777 32760 60537
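The pivot-table behaviour is easiest to see when one combination occurs more than once, because `xtabs` then sums the values. A minimal sketch on a made-up `sales` data frame (hypothetical names and numbers):

```r
# Hypothetical data: month 1 of 2007 appears twice
sales <- data.frame(
  year   = c(2006, 2006, 2007, 2007, 2007),
  month  = c(1, 2, 1, 1, 2),
  amount = c(10, 20, 30, 5, 40)
)

# Left of ~ is the value to sum, right of ~ are the classifying variables
tab <- xtabs(amount ~ month + year, data = sales)

# The duplicated cell is the sum 30 + 5
jan_2007 <- tab["1", "2007"]

# Margins as in the example above
with_totals <- addmargins(tab)
```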
# another example
head(esoph);str(esoph)
## 'data.frame': 88 obs. of 5 variables:
## $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
xtabs(cbind(ncases, ncontrols) ~ ., data = esoph)
## , , tobgp = 0-9g/day, = ncases
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 0 0 0 0
## 35-44 0 0 0 2
## 45-54 1 6 3 4
## 55-64 2 9 9 5
## 65-74 5 17 6 3
## 75+ 1 2 1 2
##
## , , tobgp = 10-19, = ncases
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 0 0 0 1
## 35-44 1 3 0 0
## 45-54 0 4 6 3
## 55-64 3 6 8 6
## 65-74 4 3 4 1
## 75+ 2 1 1 1
##
## , , tobgp = 20-29, = ncases
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 0 0 0 0
## 35-44 0 1 0 2
## 45-54 0 5 1 2
## 55-64 3 4 3 2
## 65-74 2 5 2 1
## 75+ 0 0 0 0
##
## , , tobgp = 30+, = ncases
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 0 0 0 0
## 35-44 0 0 0 0
## 45-54 0 5 2 4
## 55-64 4 3 4 5
## 65-74 0 0 1 1
## 75+ 1 1 0 0
##
## , , tobgp = 0-9g/day, = ncontrols
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 40 27 2 1
## 35-44 60 35 11 3
## 45-54 46 38 16 4
## 55-64 49 40 18 10
## 65-74 48 34 13 4
## 75+ 18 5 1 2
##
## , , tobgp = 10-19, = ncontrols
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 10 7 1 1
## 35-44 14 23 6 3
## 45-54 18 21 14 4
## 55-64 22 21 15 7
## 65-74 14 10 12 2
## 75+ 6 3 1 1
##
## , , tobgp = 20-29, = ncontrols
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 6 4 0 1
## 35-44 7 14 2 4
## 45-54 10 15 5 3
## 55-64 12 17 6 3
## 65-74 7 9 3 1
## 75+ 0 3 0 0
##
## , , tobgp = 30+, = ncontrols
##
## alcgp
## agegp 0-39g/day 40-79 80-119 120+
## 25-34 5 7 2 2
## 35-44 8 8 1 0
## 45-54 4 7 4 4
## 55-64 6 6 4 6
## 65-74 2 0 1 1
## 75+ 3 1 0 0
## Output is not really helpful ... flat tables are better:
ftable(xtabs(cbind(ncases, ncontrols) ~ ., data = esoph))
## ncases ncontrols
## agegp alcgp tobgp
## 25-34 0-39g/day 0-9g/day 0 40
## 10-19 0 10
## 20-29 0 6
## 30+ 0 5
## 40-79 0-9g/day 0 27
## 10-19 0 7
## 20-29 0 4
## 30+ 0 7
## 80-119 0-9g/day 0 2
## 10-19 0 1
## 20-29 0 0
## 30+ 0 2
## 120+ 0-9g/day 0 1
## 10-19 1 1
## 20-29 0 1
## 30+ 0 2
## 35-44 0-39g/day 0-9g/day 0 60
## 10-19 1 14
## 20-29 0 7
## 30+ 0 8
## 40-79 0-9g/day 0 35
## 10-19 3 23
## 20-29 1 14
## 30+ 0 8
## 80-119 0-9g/day 0 11
## 10-19 0 6
## 20-29 0 2
## 30+ 0 1
## 120+ 0-9g/day 2 3
## 10-19 0 3
## 20-29 2 4
## 30+ 0 0
## 45-54 0-39g/day 0-9g/day 1 46
## 10-19 0 18
## 20-29 0 10
## 30+ 0 4
## 40-79 0-9g/day 6 38
## 10-19 4 21
## 20-29 5 15
## 30+ 5 7
## 80-119 0-9g/day 3 16
## 10-19 6 14
## 20-29 1 5
## 30+ 2 4
## 120+ 0-9g/day 4 4
## 10-19 3 4
## 20-29 2 3
## 30+ 4 4
## 55-64 0-39g/day 0-9g/day 2 49
## 10-19 3 22
## 20-29 3 12
## 30+ 4 6
## 40-79 0-9g/day 9 40
## 10-19 6 21
## 20-29 4 17
## 30+ 3 6
## 80-119 0-9g/day 9 18
## 10-19 8 15
## 20-29 3 6
## 30+ 4 4
## 120+ 0-9g/day 5 10
## 10-19 6 7
## 20-29 2 3
## 30+ 5 6
## 65-74 0-39g/day 0-9g/day 5 48
## 10-19 4 14
## 20-29 2 7
## 30+ 0 2
## 40-79 0-9g/day 17 34
## 10-19 3 10
## 20-29 5 9
## 30+ 0 0
## 80-119 0-9g/day 6 13
## 10-19 4 12
## 20-29 2 3
## 30+ 1 1
## 120+ 0-9g/day 3 4
## 10-19 1 2
## 20-29 1 1
## 30+ 1 1
## 75+ 0-39g/day 0-9g/day 1 18
## 10-19 2 6
## 20-29 0 0
## 30+ 1 3
## 40-79 0-9g/day 2 5
## 10-19 1 3
## 20-29 0 3
## 30+ 1 1
## 80-119 0-9g/day 1 1
## 10-19 1 1
## 20-29 0 0
## 30+ 0 0
## 120+ 0-9g/day 2 2
## 10-19 1 1
## 20-29 0 0
## 30+ 0 0
Description: Create 'flat' contingency tables.
Usage: ftable(x, ...)
x R objects which can be interpreted as factors (including character strings), or a list (or data frame) whose components can be so interpreted, or a contingency table object of class "table" or "ftable".
That is, x must be a factor, a list or data frame whose components can be read as factors, or a table or ftable object.
head(Titanic)
str(Titanic)
## Start with a contingency table.
ftable(Titanic, row.vars = 1:3) # row.vars selects the variables shown as rows; the remaining variable goes into the columns
ftable(Titanic, row.vars = 1:2, col.vars = "Survived") # col.vars selects the variables shown as columns
ftable(Titanic, row.vars = 2:1, col.vars = "Survived")
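To see what the row and column variables do to the shape of the result, the sketch below flattens the built-in `Titanic` table with two row variables and inspects its dimensions. `as.data.frame` is shown as an alternative long format; the variable names are illustrative.

```r
# Class (4 levels) x Sex (2 levels) become 8 rows;
# Age (2) x Survived (2) become 4 columns
flat <- ftable(Titanic, row.vars = c("Class", "Sex"))
flat_dim <- dim(flat)

# Flattening only rearranges the counts, it does not change them
total_kept <- sum(flat) == sum(Titanic)

# Long format: one row per cell, with the count in column Freq
long <- as.data.frame(Titanic)
```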
## Start with a data frame.
x <- ftable(mtcars[c("cyl", "vs", "am", "gear")])
x
levels(as.factor(mtcars$gear))
ftable(x, row.vars = c(2, 4))
## Start with expressions, use table()'s 'dnn' to change labels
ftable(mtcars$cyl, mtcars$vs, mtcars$am, mtcars$gear, row.vars = c(2, 4), dnn = c("Cylinders",
"V/S", "Transmission", "Gears")) ## dnn supplies the names of the row and column variables
replace(x, list, values)
x vector
list an index vector
values replacement values
replace replaces the values in x with indices given in list by those given in values.
If necessary, the values in values are recycled.
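In one call, replace puts the values given in values at the positions of x given by list. It returns the modified vector; x itself is unchanged until the result is assigned back.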
set.seed(123)
x = sample(1:10, 5, replace = T)
x
## [1] 3 8 5 9 10
l = c(1, 3, 4)
value = c(100, 200, 300)
# list(a=matrix(sample(1:20,30),ncol=4,nrow=5,byrow=T))
x = replace(x, l, value)
x
## [1] 100 8 200 300 10
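Two details are easy to miss: `replace` returns a copy, and `values` is recycled when it is shorter than the index. A short sketch with a made-up vector `v`:

```r
v <- c(3, 8, 5, 9, 10)

# A single replacement value is recycled over all three positions
w <- replace(v, c(1, 3, 4), 0)

# The index can also be a logical condition
capped <- replace(v, v > 8, NA)

# v itself is untouched because the results went to new names
v_unchanged <- identical(v, c(3, 8, 5, 9, 10))
```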
Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(A, margin = seq_along(dim(A)), FUN = sum, quiet = FALSE)
Aye <- sample(c("Yes", "Si", "Oui"), 177, replace = TRUE)
Bee <- sample(c("Hum", "Buzz"), 177, replace = TRUE)
Sea <- sample(c("White", "Black", "Red", "Dead"), 177, replace = TRUE)
(A <- table(Aye, Bee, Sea)) # unlike a bare A <- table(Aye, Bee, Sea), the outer parentheses assign to A and also print it
## , , Sea = Black
##
## Bee
## Aye Buzz Hum
## Oui 8 5
## Si 10 8
## Yes 9 11
##
## , , Sea = Dead
##
## Bee
## Aye Buzz Hum
## Oui 6 5
## Si 9 8
## Yes 5 10
##
## , , Sea = Red
##
## Bee
## Aye Buzz Hum
## Oui 8 10
## Si 7 3
## Yes 4 3
##
## , , Sea = White
##
## Bee
## Aye Buzz Hum
## Oui 3 9
## Si 8 12
## Yes 8 8
addmargins(A)
## , , Sea = Black
##
## Bee
## Aye Buzz Hum Sum
## Oui 8 5 13
## Si 10 8 18
## Yes 9 11 20
## Sum 27 24 51
##
## , , Sea = Dead
##
## Bee
## Aye Buzz Hum Sum
## Oui 6 5 11
## Si 9 8 17
## Yes 5 10 15
## Sum 20 23 43
##
## , , Sea = Red
##
## Bee
## Aye Buzz Hum Sum
## Oui 8 10 18
## Si 7 3 10
## Yes 4 3 7
## Sum 19 16 35
##
## , , Sea = White
##
## Bee
## Aye Buzz Hum Sum
## Oui 3 9 12
## Si 8 12 20
## Yes 8 8 16
## Sum 19 29 48
##
## , , Sea = Sum
##
## Bee
## Aye Buzz Hum Sum
## Oui 25 29 54
## Si 34 31 65
## Yes 26 32 58
## Sum 85 92 177
ftable(A) # equivalent to ftable(A,row.vars=c(1,2),col.vars=3)
## Sea Black Dead Red White
## Aye Bee
## Oui Buzz 8 6 8 3
## Hum 5 5 10 9
## Si Buzz 10 9 7 8
## Hum 8 8 3 12
## Yes Buzz 9 5 4 8
## Hum 11 10 3 8
ftable(addmargins(A)) # equivalent to ftable(addmargins(A),row.vars=c(1,2),col.vars=3)
## Sea Black Dead Red White Sum
## Aye Bee
## Oui Buzz 8 6 8 3 25
## Hum 5 5 10 9 29
## Sum 13 11 18 12 54
## Si Buzz 10 9 7 8 34
## Hum 8 8 3 12 31
## Sum 18 17 10 20 65
## Yes Buzz 9 5 4 8 26
## Hum 11 10 3 8 32
## Sum 20 15 7 16 58
## Sum Buzz 27 20 19 19 85
## Hum 24 23 16 29 92
## Sum 51 43 35 48 177
# Non-commutative functions - note differences between resulting tables:
ftable(addmargins(A, c(1, 3), FUN = list(Sum = sum, list(Min = min, Max = max))))
## Margins computed over dimensions
## in the following order:
## 1: Aye
## 2: Sea
## Sea Black Dead Red White Min Max
## Aye Bee
## Oui Buzz 8 6 8 3 3 8
## Hum 5 5 10 9 5 10
## Si Buzz 10 9 7 8 7 10
## Hum 8 8 3 12 3 12
## Yes Buzz 9 5 4 8 4 9
## Hum 11 10 3 8 3 11
## Sum Buzz 27 20 19 19 19 27
## Hum 24 23 16 29 16 29
ftable(addmargins(A, c(3, 1), FUN = list(list(Min = min, Max = max), Sum = sum)))
## Margins computed over dimensions
## in the following order:
## 1: Sea
## 2: Aye
## Sea Black Dead Red White Min Max
## Aye Bee
## Oui Buzz 8 6 8 3 3 8
## Hum 5 5 10 9 5 10
## Si Buzz 10 9 7 8 7 10
## Hum 8 8 3 12 3 12
## Yes Buzz 9 5 4 8 4 9
## Hum 11 10 3 8 3 11
## Sum Buzz 27 20 19 19 14 27
## Hum 24 23 16 29 11 33
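# Same layout as the first call, but the Sum rows differ in the Min and Max columns: here Min/Max are computed first and then summed (14 = 3 + 7 + 4), whereas the first call summed first and then took Min/Max (19 = min(27, 20, 19, 19))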
ftable(addmargins(A, c(1, 3), FUN = list(list(Min = min, Max = max), Sum = sum)))
## Margins computed over dimensions
## in the following order:
## 1: Aye
## 2: Sea
## Sea Black Dead Red White Sum
## Aye Bee
## Oui Buzz 8 6 8 3 25
## Hum 5 5 10 9 29
## Si Buzz 10 9 7 8 34
## Hum 8 8 3 12 31
## Yes Buzz 9 5 4 8 26
## Hum 11 10 3 8 32
## Min Buzz 8 5 4 3 20
## Hum 5 5 3 8 21
## Max Buzz 10 9 8 8 35
## Hum 11 10 10 12 43
# Similar to the first call, except that Sum is now a column and Min/Max are rows
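The "non-commutative" remark can be reproduced by hand. The sketch below copies the Buzz counts from the tables above into a plain matrix (the name `buzz` is made up) and compares "sum first, then min" with "min first, then sum".

```r
# Buzz counts by Aye (rows) and Sea (columns), copied from the output above
buzz <- matrix(c( 8, 6, 8, 3,
                 10, 9, 7, 8,
                  9, 5, 4, 8),
               nrow = 3, byrow = TRUE,
               dimnames = list(Aye = c("Oui", "Si", "Yes"),
                               Sea = c("Black", "Dead", "Red", "White")))

# Margin over Aye first (column sums), then Min over Sea
min_of_sums <- min(colSums(buzz))

# Margin over Sea first (row minima), then Sum over Aye
sum_of_mins <- sum(apply(buzz, 1, min))
```

The two values match the Sum/Buzz/Min cells of the first and second `addmargins` calls respectively, which is why the order of the margins matters.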
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
length.out = NULL, along.with = NULL, ...)
seq.int(from, to, by, length.out, along.with, ...)
seq_along(along.with)
seq_len(length.out)
seq.Date: seq(from, to, by, length.out = NULL, along.with = NULL, ...)
by can be specified in several ways.
A number, taken to be in days. A string in units of days, such as "1 days", also works.
An object of class difftime (see the example).
A character string, containing one of "day", "week", "month" or "year". # the unit can be day, week, month or year
This can optionally be preceded by a (positive or negative) integer and a space, or followed by "s".
seq_along(dim(A))
## [1] 1 2 3
dim(A)
## [1] 3 2 4
seq_len(length.out = 10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq_along(along.with = 10)
## [1] 1
seq(along.with = 10)
## [1] 1
seq(along.with = seq_len(length.out = 10))
## [1] 1 2 3 4 5 6 7 8 9 10
seq(along.with = rep(1, 5))
## [1] 1 2 3 4 5
seq.int(from = 1.1, to = 10, by = 1) # the fractional part is kept
## [1] 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1
seq.int(from = 1.1, by = 1, length.out = 10) # unlike the previous line, the result printed here is integers; this may depend on the R version
## [1] 1 2 3 4 5 6 7 8 9 10
seq(as.Date("2000/1/1"), as.Date("2000/1/2"), by = "0.5 days")
## Error: invalid '(to - from)/by' in 'seq'
seq(as.Date("2000/1/1"), as.Date("2000/1/2"), by = "1 days")
## [1] "2000-01-01" "2000-01-02"
t = seq(as.POSIXlt("2012-12-01 00:00:00"), as.POSIXlt("2012-12-02 00:00:00"),
by = difftime(as.POSIXlt("2012-12-01 00:15:00"), as.POSIXlt("2012-12-01 00:00:00"),
units = "mins"))
length(t)
## [1] 97
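One practical reason to prefer `seq_len` and `seq_along` over `1:n` is the empty case. The sketch below (illustrative values only) shows that pitfall, plus `length.out` with numbers and with dates.

```r
n <- 0

# 1:0 counts down and yields c(1, 0), which is rarely what a loop wants
colon_len <- length(1:n)

# seq_len(0) is an empty vector, so a loop over it runs zero times
safe_len <- length(seq_len(n))

# Fix the number of points instead of the step
grid <- seq(0, 1, length.out = 5)

# Three weekly dates starting from a given day
weeks <- seq(as.Date("2000-01-01"), by = "week", length.out = 3)
```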
The function is strsplit, not strisplit.
Split the elements of a character vector x into substrings according to the matches to substring split within them.
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
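With fixed = TRUE, split is matched literally rather than as a regular expression.
noquote(strsplit("A text I want to display with spaces", NULL)[[1]]) # with split = NULL or "", x is split into single characters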
## [1] A t e x t I w a n t t o d i s p l a y w i t h s p a c e
## [36] s
x <- c(as = "asfef", qu = "qwerty", "yuiop[", "b", "stuff.blah.yech")
# split x on the letter e
strsplit(x, "e") # the matched "e" is dropped; the pieces on either side of it are returned
## $as
## [1] "asf" "f"
##
## $qu
## [1] "qw" "rty"
##
## [[3]]
## [1] "yuiop["
##
## [[4]]
## [1] "b"
##
## [[5]]
## [1] "stuff.blah.y" "ch"
strsplit(x, "e", fixed = T)
## $as
## [1] "asf" "f"
##
## $qu
## [1] "qw" "rty"
##
## [[3]]
## [1] "yuiop["
##
## [[4]]
## [1] "b"
##
## [[5]]
## [1] "stuff.blah.y" "ch"
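With the letter "e" the `fixed` argument makes no visible difference, because "e" has no special meaning in a regular expression. A dot does. The sketch below (the string is taken from the example vector above) contrasts the three ways of splitting on ".".

```r
s <- "stuff.blah.yech"

# As a regular expression, "." matches every character,
# so only empty pieces are left
regex_parts <- strsplit(s, ".")[[1]]

# Literal match of the dot
fixed_parts <- strsplit(s, ".", fixed = TRUE)[[1]]

# Escaping the dot in the regular expression gives the same result
escaped_parts <- strsplit(s, "\\.")[[1]]
```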
# Example, see http://cos.name/cn/topic/104102#post-217293
library(igraph)
b = "1-2,1-3,1-4,1-6,1-8,1-10,2-3,2-4,2-5,2-7,2-9,3-4,3-5,3-6,3-10,4-6,4-7,4-9,4-10,5-7,5-8,5-10,6-7,6-8,6-9,7-9,7-10,8-10,9-10"
b = strsplit(b, ",")[[1]]
b = as.numeric(unlist(strsplit(b, "-")))
# unlist example
l.ex <- list(a = list(1:5, LETTERS[1:5]), b = "Z", c = NA)
unlist(l.ex, recursive = FALSE)
## $a1
## [1] 1 2 3 4 5
##
## $a2
## [1] "A" "B" "C" "D" "E"
##
## $b
## [1] "Z"
##
## $c
## [1] NA
unlist(l.ex, recursive = TRUE)
## a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 b c
## "1" "2" "3" "4" "5" "A" "B" "C" "D" "E" "Z" NA
b1 = b[seq(1, 57, by = 2)]
b2 = b[seq(2, 58, by = 2)]
r = data.frame(from = b1, to = b2)
g = graph.data.frame(d = r, directed = F) # creates an object of class igraph
plot(g, layout = layout.fruchterman.reingold, vertex.label = 1:10)
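The odd/even indexing with `seq(1, 57, by = 2)` depends on knowing the number of edges in advance. A possible alternative, sketched here in base R on a shortened made-up edge string, is to reshape the split result into a two-column matrix, which works for any number of edges.

```r
# Shortened, hypothetical edge list in the same "from-to" format
edges_txt <- "1-2,1-3,2-3"

# First split into edges, then split each edge into its two endpoints
endpoints <- strsplit(strsplit(edges_txt, ",")[[1]], "-")

# Fill row by row: one edge per row, columns from and to
edge_mat <- matrix(as.numeric(unlist(endpoints)), ncol = 2, byrow = TRUE,
                   dimnames = list(NULL, c("from", "to")))

edge_df <- as.data.frame(edge_mat)
```

A data frame like `edge_df` has the same shape as `r` above.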
which(x, arr.ind = FALSE, useNames = TRUE)
arrayInd(ind, .dim, .dimnames = NULL, useNames = FALSE)
x a logical vector or array. NAs are allowed and omitted (treated as if FALSE).
x is a logical vector or array. NAs are allowed; they are dropped, that is, treated as if FALSE.
arr.ind logical; should array indices be returned when x is an array
ind integer-valued index vector, as resulting from which(x).
.dim dim(.) integer vector
.dimnames optional list of character dimnames(.), of which only .dimnames[[1]] is used.
useNames logical indicating if the value of arrayInd() should have (non-null) dimnames at all.
df = data.frame(u = c(5, 10, 15, 20, 30, 40, 60, 80, 100), lot1 = c(118, 58,
42, 35, 27, 25, 21, 19, 18), lot2 = c(69, 35, 26, 21, 18, 16, 13, 12, 12))
which(df$u == 15)
## [1] 3
mat = as.matrix(df)
which(mat%%3 == 0, arr.ind = T) # returns the row and column of each matching element
## row col
## [1,] 3 1
## [2,] 5 1
## [3,] 7 1
## [4,] 3 2
## [5,] 5 2
## [6,] 7 2
## [7,] 9 2
## [8,] 1 3
## [9,] 4 3
## [10,] 5 3
## [11,] 8 3
## [12,] 9 3
arrayInd(which(mat%%3 == 0))
## Error: argument ".dim" is missing, with no default
arrayInd(which(mat%%3 == 0), .dim = c(1:3)) # ind comes from which(x); .dim must be the dimensions of the array being indexed, here dim(mat) = c(9, 3). With .dim = 1:3 the indices are decoded as positions in a 1 x 2 x 3 array, so the output below is not meaningful for mat.
## [,1] [,2] [,3]
## [1,] 1 1 2
## [2,] 1 1 3
## [3,] 1 1 1
## [4,] 1 2 3
## [5,] 1 2 1
## [6,] 1 2 2
## [7,] 1 2 3
## [8,] 1 1 1
## [9,] 1 2 2
## [10,] 1 1 3
## [11,] 1 2 1
## [12,] 1 1 2
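For comparison, here is a sketch of `arrayInd` with `.dim` set to the dimensions of the matrix being indexed. It uses a small made-up matrix `m2` rather than `mat`, and checks the result against `which(..., arr.ind = TRUE)`.

```r
# Hypothetical 3 x 2 matrix
m2 <- matrix(c(5, 10, 15, 118, 58, 42), nrow = 3)

# Linear (column-major) positions of the multiples of 3
hits <- which(m2 %% 3 == 0)

# Decode the linear positions into (row, column) pairs
pos <- arrayInd(hits, .dim = dim(m2))

# which() can do both steps at once
pos_direct <- which(m2 %% 3 == 0, arr.ind = TRUE)
```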
library(lubridate)
## Attaching package: 'lubridate'
## The following object(s) are masked from 'package:igraph':
##
## %--%
floor_date(as.POSIXlt("2012-01-31 23:46:09"), "day")
## [1] "2012-01-31 CST"
ceiling_date(as.POSIXct("2012-01-31 23:46:09"), "day")
## [1] "2012-02-01 CST"
floor_date(as.POSIXct("2012-01-31 23:46:09"), "month")
## [1] "2012-01-01 CST"
ceiling_date(as.POSIXct("2012-01-31 23:46:09"), "month")
## [1] "2012-02-01 CST"
t = class(floor_date(as.POSIXct("2012-01-31 23:46:09"), "month"))
t
## [1] "POSIXct" "POSIXt"
mode(t) # t holds the class names as a character vector, so its mode is "character", not "function"
## [1] "character"
## build a timestamp string with second precision
library(stringr)
str_c(ceiling_date(as.POSIXct("2012-01-31 23:46:09"), "month"), "00:00:00",
sep = " ")
## [1] "2012-02-01 00:00:00"
str_c(t, "00:00:00", sep = " ") # no error, but not the intended result: t holds the class names, not the date
## [1] "POSIXct 00:00:00" "POSIXt 00:00:00"
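The confusion here is between an object and the value of `class()` applied to it. A base-R sketch (no lubridate or stringr needed; the variable names are made up) that keeps the two apart and builds the timestamp string with `format`:

```r
# A date-time object; the time zone is fixed so the result is reproducible
d <- as.POSIXct("2012-02-01 00:00:00", tz = "UTC")

# class() returns a character vector of class names, not the date
cls <- class(d)
cls_mode <- mode(cls)

# format() turns the date-time itself into a string with seconds
stamp <- format(d, "%Y-%m-%d %H:%M:%S")
```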
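attributes(object) returns a list of the special attributes of object. The intrinsic attributes mode and length are not included.
R objects are either atomic or recursive. All elements of an atomic object share one basic type, such as numeric or character, and the elements are not objects themselves. The elements of a recursive object can be objects of different types, and each element is an object. A function is one kind of recursive object. Example: see item 13.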
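These distinctions can be checked with `is.atomic`, `is.recursive` and `attributes`. A small sketch with made-up objects:

```r
v <- c(a = 1, b = 2)      # named numeric vector: atomic
l <- list(1, "a", TRUE)   # list with mixed element types: recursive

v_atomic <- is.atomic(v)
v_recursive <- is.recursive(v)
l_recursive <- is.recursive(l)
f_recursive <- is.recursive(mean)   # functions count as recursive objects

# attributes() lists names, dim and so on, but not mode or length
v_attr <- names(attributes(v))
m_attr <- names(attributes(matrix(1:4, 2)))
```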
Evaluate an R expression in an environment constructed from data, possibly modifying the original data.
with(data, expr, ...)
within(data, expr, ...)
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
aq = within(airquality, {
a = log(Ozone)
})
str(aq) # within() returned a modified copy with a new column a; airquality itself is unchanged
## 'data.frame': 153 obs. of 7 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ a : num 3.71 3.58 2.48 2.89 NA ...
df = data.frame(u = c(5, 10, 15, 20, 30, 40, 60, 80, 100), lot1 = c(118, 58,
42, 35, 27, 25, 21, 19, 18), lot2 = c(69, 35, 26, 21, 18, 16, 13, 12, 12))
with(df, list(summary(glm(lot1 ~ log(u), family = Gamma)), summary(glm(lot2 ~
log(u), family = Gamma)))) # returns both summaries because they are wrapped in list(); with {} only the last value would be returned
## [[1]]
##
## Call:
## glm(formula = lot1 ~ log(u), family = Gamma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.0401 -0.0376 -0.0264 0.0290 0.0864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.016554 0.000928 -17.9 4.3e-07 ***
## log(u) 0.015343 0.000415 37.0 2.8e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.002446)
##
## Null deviance: 3.51283 on 8 degrees of freedom
## Residual deviance: 0.01673 on 7 degrees of freedom
## AIC: 37.99
##
## Number of Fisher Scoring iterations: 3
##
##
## [[2]]
##
## Call:
## glm(formula = lot2 ~ log(u), family = Gamma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.0557 -0.0293 0.0103 0.0171 0.0637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.023908 0.001326 -18.0 4.0e-07 ***
## log(u) 0.023599 0.000577 40.9 1.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.001813)
##
## Null deviance: 3.118557 on 8 degrees of freedom
## Residual deviance: 0.012672 on 7 degrees of freedom
## AIC: 27.03
##
## Number of Fisher Scoring iterations: 3
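The difference between the two functions is what they return: `with` returns the value of the expression, while `within` returns a modified copy of the data. A minimal sketch on a made-up data frame `d0`:

```r
d0 <- data.frame(x = 1:3, y = c(2, 4, 6))

# with(): evaluate an expression using the columns, get its value back
total <- with(d0, sum(x * y))

# within(): new variables created in the block become new columns
d1 <- within(d0, {
  ratio <- y / x
})

# The original data frame is not modified
original_names <- names(d0)
```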
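Clustering methods fall into two broad families: hierarchical clustering and dynamic (partitioning) clustering.
Hierarchical clustering:
Steps: center and standardize the data -> compute distances -> run the hierarchical clustering -> plot the result with plot
Along the way the number of clusters has to be chosen, and no fully satisfactory method exists for that. R provides rect.hclust(), which in essence marks the clusters for a given number of clusters or a given height threshold.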
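The steps can be sketched as follows. This is only an illustration: the built-in `USArrests` data set, complete linkage and k = 4 are assumptions chosen for the example, not choices made in the original notes.

```r
# 1. Center and standardize the variables
x_std <- scale(USArrests)

# 2. Pairwise Euclidean distances between observations
d <- dist(x_std)

# 3. Hierarchical clustering (complete linkage is the default method)
hc <- hclust(d, method = "complete")

# 4. Plot the dendrogram and mark k clusters on it
plot(hc)
rect.hclust(hc, k = 4)

# Cluster membership for the same k
groups <- cutree(hc, k = 4)
```

`rect.hclust` also accepts `h` instead of `k` to cut at a height threshold.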
Dynamic clustering:
temp = lf.111[, c(4, 7)]
temp = temp[temp$traveltime < 240, ]
cor(temp$traveltime, temp$weight) # weak correlation
hist(temp$traveltime)
hist(temp$weight, breaks = 200)
plot(density(temp$weight))
plot(temp$weight, temp$traveltime)
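# temp was already filtered to traveltime < 240 above
km = kmeans(scale(temp), 4, nstart = 20)
plot(temp[c("weight", "traveltime")], col = km$cluster)
# note: km$centers are in standardized units because kmeans ran on scale(temp), so they do not line up with the raw-data axes
points(km$centers[, c("weight", "traveltime")], col = 1:4, pch = 8, cex = 2)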
Partial String Matching (pmatch). It should not be confused with charmatch (item 23).
pmatch("", "") # returns NA
## [1] NA
pmatch("m", c("mean", "median", "mode")) # returns NA
## [1] NA
pmatch("med", c("mean", "median", "mode")) # returns 2
## [1] 2
pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = FALSE)
## [1] NA 2 1
pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = TRUE)
## [1] NA 2 2
## compare
charmatch(c("", "ab", "ab"), c("abc", "ab"))
## [1] 0 2 2
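A typical use of `pmatch` is resolving an abbreviation to a full option name and rejecting ambiguous input. The helper `resolve_choice` below is hypothetical, written only for this sketch.

```r
choices <- c("mean", "median", "mode")

# Hypothetical helper: return the full name for a unique abbreviation
resolve_choice <- function(abbrev) {
  i <- pmatch(abbrev, choices)
  if (is.na(i)) {
    stop("no unique match for '", abbrev, "'")
  }
  choices[i]
}

full_med <- resolve_choice("med")
full_mo <- resolve_choice("mo")

# "m" matches all three choices, so pmatch() gives NA and the helper stops
ambiguous <- tryCatch(resolve_choice("m"), error = function(e) NA_character_)
```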
newiris <- iris
newiris$Species <- NULL
kc <- kmeans(newiris, 3)
plot(newiris[c("Sepal.Length", "Sepal.Width")], col = kc$cluster)
points(kc$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
Example: http://www.cookbook-r.com/Strings/Creating_strings_from_variables/
Usage
paste (..., sep = " ", collapse = NULL)
paste0(..., collapse = NULL)
Arguments
... one or more R objects, to be converted to character vectors.
sep a character string to separate the terms. Not NA_character_.
collapse an optional character string to separate the results. Not NA_character_.
sep joins the corresponding elements of the arguments; collapse then joins those results into one string.
a = LETTERS[1:5]
b = letters[1:5]
paste(a, b, sep = ",")
## [1] "A,a" "B,b" "C,c" "D,d" "E,e"
paste(a, b, sep = "\t")
## [1] "A\ta" "B\tb" "C\tc" "D\td" "E\te"
paste(a, b, sep = ",", collapse = "-")
## [1] "A,a-B,b-C,c-D,d-E,e"
paste("hello", 2, collapse = "")
## [1] "hello 2"
paste("hello", 2, sep = "", collapse = "")
## [1] "hello2"
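Two patterns come up often: building a vector of labels with `paste0`, and collapsing a vector into one delimited string. A short sketch with made-up labels:

```r
# paste0() is paste() with sep = ""; the prefix is recycled over 1:3
ids <- paste0("id_", 1:3)

# collapse joins the elements of a single vector into one string
csv_line <- paste(ids, collapse = ",")

# With one vector and no collapse, the result keeps its length
unchanged_len <- length(paste(ids))
```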
x = seq(0.1, 2.5, length = 10)
m = 10000
z = rnorm(m)
dim(x) = length(x)
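p = apply(x, 1, FUN = function(x, y) {
mean(y <= x)
}, y = z)
# z is the vector created by z = rnorm(m). The key point is how it is passed in: as the extra argument y = z, written outside the braces of the function.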
phi = pnorm(x)
print(round(rbind(x, p, phi), 3)) # round(x, 3) rounds to three decimal places
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## x 0.10 0.367 0.633 0.900 1.167 1.433 1.700 1.967 2.233 2.500
## p 0.54 0.644 0.736 0.816 0.882 0.926 0.956 0.976 0.987 0.994
## phi 0.54 0.643 0.737 0.816 0.878 0.924 0.955 0.975 0.987 0.994
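The same Monte Carlo estimate of the normal CDF can be written without giving `x` a `dim` attribute. The sketch below uses `sapply` and, as a cross-check, the empirical CDF from `ecdf`; the seed and variable names are arbitrary choices for the example.

```r
set.seed(1)
z <- rnorm(10000)
x <- seq(0.1, 2.5, length.out = 10)

# Proportion of simulated values at or below each point of x
p <- sapply(x, function(q) mean(z <= q))

# ecdf(z) is a function that computes exactly this proportion
p_ecdf <- ecdf(z)(x)

# Largest gap between the estimate and the exact normal CDF
max_err <- max(abs(p - pnorm(x)))
```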
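For details see the examples at http://cos.name/cn/topic/12987#post-159604
# String concatenation:
paste()
# String splitting:
strsplit() #strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
# Counting the characters in a string:
nchar()
# Extracting substrings:
substr(x, start, stop)
substring(text, first, last = 1e+06)
substr(x, start, stop) <- value
substring(text, first, last = 1e+06) <- value
# Character translation and case conversion:
chartr(old, new, x)
tolower(x)
toupper(x)
casefold(x, upper = FALSE)
# Exact pattern matching:
grep()
# Approximate (fuzzy) matching:
agrep()
# Replacement:
gsub()
# grep(), sub(), gsub(), regexpr() and gregexpr() use regular expressions by default; perl = TRUE switches them to Perl-compatible regular expressions.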
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE,
useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
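A few calls side by side make the differences between these functions concrete. The sketch uses a made-up vector `fruit`; the `gsub(" +", " ", ...)` line is the same idiom used earlier to squeeze repeated spaces.

```r
fruit <- c("apple pie", "banana", "grape")

# grep(): which elements match (indices, or the values themselves)
idx <- grep("ap", fruit)
vals <- grep("ap", fruit, value = TRUE)

# sub() replaces the first match in each element, gsub() replaces all
first_only <- sub("a", "A", fruit)
all_matches <- gsub("a", "A", fruit)

# " +" means one or more spaces
squeezed <- gsub(" +", " ", "a   b  c")

# regexpr(): position of the first match; gregexpr(): all positions
first_pos <- as.integer(regexpr("an", "banana"))
all_pos <- as.integer(gregexpr("an", "banana")[[1]])
```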
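Example: http://cos.name/cn/topic/6021
# simplest case: assigning a vector
assign(a, c(1:5)) # a warning rather than an error: a is not a single string here, so only its first element is used as the variable name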
## Warning: only the first element is used as variable name
assign("a", c(1:5)) #ok
a = 1:4
assign("a[1]", 5)
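a # a is still 1:4, because assign() created a new variable literally named "a[1]" instead of modifying an element of a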
## [1] 1 2 3 4
get("a[1]") == 2 # FALSE: the variable named "a[1]" holds 5
## [1] FALSE
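`assign` and `get` are most useful when the variable name is built at run time. The sketch below (names `v1` to `v3` are made up) shows that use, and the ordinary way to change a single element, which `assign("a[1]", ...)` does not do.

```r
# Create v1, v2, v3 with names built by paste0()
for (i in 1:3) {
  assign(paste0("v", i), i^2)
}
second <- get("v2")

# To change one element of a vector, use subassignment
a <- 1:4
a[1] <- 5L

# This creates a separate variable whose name is the string "a[1]"
assign("a[1]", 99)
odd_name_exists <- exists("a[1]")
odd_value <- get("a[1]")
```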
# a slightly more complex case
eval(parse(text = paste("assign('a',", 1, ")", sep = "")))
a
## [1] 1
charmatch("", "") # returns 1
## [1] 1
charmatch("m", c("mean", "median", "mode")) # returns 0
## [1] 0
charmatch("med", c("mean", "median", "mode")) # returns 2
## [1] 2
charmatch("med", c("mean", "median", "mode"), nomatch = 5)
## [1] 2
charmatch("m", c("mean", "median", "mode"), nomatch = 5)
## [1] 0