Data Types in R

Wush Wu
Taiwan R User Group

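What does the data we analyze actually mean?

Data types in analysis

Classification of measurement scales

Nominal data
Ordinal data
Interval data
Ratio data

	Measurement scale	Variable type	Property
1	Nominal	Qualitative	Category
2	Ordinal	Qualitative	Rank order
3	Interval	Quantitative	Magnitude of differences
4	Ratio	Quantitative	Ratios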

Nominal data

The numbers serve only as labels; the values themselves carry no meaning.

Gender
Domain
Cookie

Male	Female
0	1

Male	Female
1	2
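To see why the numeric codes of nominal data carry no meaning, the sketch below encodes one made-up gender vector in the two ways shown above. The vector `gender` and the names `code_a` and `code_b` are illustrative assumptions, not part of the original slides.

```r
# Hypothetical sample; the values are made up for illustration.
gender <- c("male", "female", "female", "male", "female")

# Two equally valid codings of the same nominal variable.
code_a <- ifelse(gender == "male", 0, 1)  # male = 0, female = 1
code_b <- ifelse(gender == "male", 1, 2)  # male = 1, female = 2

# Arithmetic depends on the arbitrary coding, so it is not meaningful.
mean(code_a)  # 0.6
mean(code_b)  # 1.6

# Counting categories gives the same answer under either coding.
counts_a <- as.vector(table(code_a))
counts_b <- as.vector(table(code_b))
identical(counts_a, counts_b)  # TRUE
```

Only operations that treat the codes as labels, such as counting, survive a change of coding.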

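Ordinal data

The values have an order, but the differences between them are not meaningful.

Hardness scale (Mohs)
Rankings
Sorted lists

Hardness	Mineral	Chemical formula	Absolute hardness
1	Talc	Mg₃Si₄O₁₀(OH)₂	1
2	Gypsum	CaSO₄·2H₂O	2
3	Calcite	CaCO₃	9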
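In R, ordinal data of this kind can be represented as an ordered factor. The following is a minimal sketch using the three minerals from the table; the variable names `mohs_levels` and `samples` are assumptions made for illustration.

```r
# Levels listed from softest to hardest; the order is defined,
# but the distance between neighbouring levels is not.
mohs_levels <- c("talc", "gypsum", "calcite")
samples <- factor(c("calcite", "talc", "gypsum"),
                  levels = mohs_levels, ordered = TRUE)

# Comparisons and sorting respect the declared order.
samples[1] > samples[2]       # TRUE: calcite is harder than talc
as.character(sort(samples))   # "talc" "gypsum" "calcite"
as.character(max(samples))    # "calcite"
```

Arithmetic such as `samples[1] - samples[2]` is deliberately not defined for factors, which matches the idea that differences on an ordinal scale have no meaning.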

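Interval data

Differences are meaningful, but ratios are not. The values have a meaningful unit (1) but no true zero (0). They can be added and subtracted.

Temperature
Time

Temperature

0 °F = -17.78 °C
0 °C = 32.00 °F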
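The claim that interval data supports differences but not ratios can be checked with a few lines of arithmetic. This is a small sketch; the helper `celsius_to_fahrenheit` and the sample temperatures are assumptions introduced for illustration.

```r
# Standard conversion formula between the two scales.
celsius_to_fahrenheit <- function(celsius) celsius * 9 / 5 + 32

celsius <- c(10, 20)
fahrenheit <- celsius_to_fahrenheit(celsius)  # 50 68

# Differences agree up to a constant unit factor.
diff(fahrenheit) / diff(celsius)  # 1.8

# Ratios depend on where each scale puts its zero.
celsius[2] / celsius[1]         # 2
fahrenheit[2] / fahrenheit[1]   # 1.36
```

"Twice as hot" is true in Celsius and false in Fahrenheit for the same pair of temperatures, so the ratio says nothing about the temperatures themselves.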

Time

POSIX time 0: 1970-01-01 08:00:00 +0800
Common Era (Gregorian) years
Republic of China (Minguo) years
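The sketch below illustrates both points: POSIX time 0 displayed in Taipei time, and the fact that Common Era and Minguo years differ only in where zero sits. It assumes the time zone name `Asia/Taipei` is available on the system; the sample years are arbitrary.

```r
# POSIX time 0, created in UTC and displayed in Taipei time.
epoch <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
epoch_taipei <- format(epoch, "%Y-%m-%d %H:%M:%S %z", tz = "Asia/Taipei")
epoch_taipei  # "1970-01-01 08:00:00 +0800"

# Minguo year = Common Era year - 1911 (1912 is Minguo year 1).
ce_years <- c(1970, 2014)
minguo_years <- ce_years - 1911  # 59 103

# Differences agree between the two calendars; ratios do not.
diff(ce_years) == diff(minguo_years)  # TRUE
ce_years[2] / ce_years[1]
minguo_years[2] / minguo_years[1]
```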

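Ratio data

Both differences and ratios are meaningful. The values have both a unit (1) and a true zero (0). They can be added, subtracted, multiplied and divided.

Revenue
Stock prices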

[Figure: plot of chunk TWII]

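Quiz

Give one more example of each of the four data types.
Which data types can the mode, median, interquartile range and arithmetic mean be applied to?
Is a growth rate meaningful for interval data?
When applying machine learning algorithms, can values of each data type be used directly?
- Regression
- Decision Tree
What data type is each of the following?
- 140.118.1.1
- #R,Text Mining Series,Taiwan R User Group

	Measurement scale	Variable type	Property
1	Nominal	Qualitative	Category
2	Ordinal	Qualitative	Rank order
3	Interval	Quantitative	Magnitude of differences
4	Ratio	Quantitative	Ratios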

Data types in R

Classification of R data types

Data-related

Homogeneous	Heterogeneous
Atomic Type	List
Matrix	Data Frame
Array

Others

Things we agreed not to cover today...

functions
environments
expressions

Atomic Type

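All data is built by combining atomic types.

Character

The most general data type; used for text-related work, e.g. setting the title of a plot.
Wrap the text in " or ' when entering it.
- Use \" or \' to enter a literal " or ' inside the text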

plot(cars, main = "\"hello world\"")

[Figure: plot of chunk character-figure-title]

Commonly used character functions:

String concatenation: `paste`

x <- "abc"
y <- "bbb"
paste(x, y, sep = ",")

## [1] "abc,bbb"

String splitting: `strsplit`

strsplit(x, "b")

## [[1]]
## [1] "a" "c"

String length: `nchar`

nchar(x)

## [1] 3

Extracting a substring: `substring`

substring(x, 1, 2)

## [1] "ab"
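The four functions are usually combined. As a small sketch, the following splits a comma-separated tag string (adapted from the quiz earlier, with the leading `#` dropped) and inspects the pieces; the variable names `tags`, `parts` and `joined` are assumptions.

```r
tags <- "R,Text Mining Series,Taiwan R User Group"

# strsplit returns a list with one element per input string,
# so [[1]] extracts the character vector for our single string.
parts <- strsplit(tags, ",", fixed = TRUE)[[1]]

nchar(parts)             # 1 18 19
substring(parts, 1, 4)   # "R" "Text" "Taiw"

# collapse joins the elements of one vector into a single string.
joined <- paste(parts, collapse = " | ")
joined
```

Note the difference between `sep`, which separates the arguments of `paste`, and `collapse`, which joins the elements of a vector.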

Logical

Produced by comparisons, or entered directly as T, TRUE, F or FALSE

x <- 1
x < 2

## [1] TRUE

x <= 1

## [1] TRUE

Commonly used for flow control

if (T) {
    print("This is TRUE")
} else {
    print("This is FALSE")
}

## [1] "This is TRUE"

Common operators for logical values

NOT

!TRUE

## [1] FALSE

AND

T & T

## [1] TRUE

OR

T | F

## [1] TRUE

Integer and Numeric

Nothing special here, so let's move on!

`+`

1 + 2

## [1] 3

`-`

1 - 2

## [1] -1

`*`

1 * 2

## [1] 2

`/`

1L/2L

## [1] 0.5
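The last example shows that dividing two integers with `/` yields a non-integer result. A short sketch of how the integer and numeric types interact, using only base R:

```r
# The L suffix creates an integer; a bare number is numeric (double).
class(1L)   # "integer"
class(1)    # "numeric"

# `/` always returns a double, even for integer operands.
typeof(1L / 2L)   # "double"

# Integer division and remainder keep the integer type.
5L %/% 2L               # 2
5L %% 2L                # 1
is.integer(5L %/% 2L)   # TRUE

# Mixing integer and numeric promotes the result to numeric.
is.integer(1L + 1L)   # TRUE
is.integer(1L + 1)    # FALSE
```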

Vector-based data structures

Every atomic type has a length

length(0L)

## [1] 1

Adding `dim` turns a vector into a matrix or array

x <- 1L
dim(x) <- c(1, 1)
x

##      [,1]
## [1,]    1

Use `:` or `c()` to build a vector quickly

1:3

## [1] 1 2 3

c(1L, 2L, 3L)

## [1] 1 2 3

Almost all built-in operations and functions are vectorized

1:3 + 2:4

## [1] 3 5 7

1:2 + 1:3

## Warning: longer object length is not a multiple of shorter object length

## [1] 2 4 4

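Vector-based data structures

Vectors of different lengths are recycled automatically; a warning is issued when the lengths do not line up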

1:2 + 1:3

## Warning: longer object length is not a multiple of shorter object length

## [1] 2 4 4

  1 2 1 (2)    1:2, recycled; the fourth value is not used
+ 1 2 3        1:3
= 2 4 4
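The recycling step can be made explicit with `rep_len`, which repeats a vector until it reaches a given length. This is a sketch of what R does implicitly; the names `short`, `long` and `recycled` are assumptions.

```r
short <- 1:2
long <- 1:3

# Repeat the shorter vector until it matches the longer one.
recycled <- rep_len(short, length(long))  # 1 2 1

# Same result as 1:2 + 1:3, but without the warning.
recycled + long  # 2 4 4

# When the longer length is a multiple of the shorter one,
# implicit recycling is silent.
1:2 + 1:4  # 2 4 4 6
```

Silent recycling is convenient for scalars but can hide bugs when lengths differ unintentionally, which is why the warning in the non-multiple case is worth taking seriously.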

Vectorized code is much faster

f1 <- function() 1:1000 + 1
f2 <- function() {
    r <- integer(1000)
    for (i in 1:1000) r[i] <- i + 1
    r
}

	expr	median (nanoseconds)
1	f2()	1215099.00
2	f1()	3589.50
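The slides do not say how the timings above were measured. As a rough way to reproduce the comparison, the sketch below uses base R's `system.time`; the repetition count `n_reps` is an arbitrary assumption, and the absolute numbers will differ by machine and R version.

```r
f1 <- function() 1:1000 + 1
f2 <- function() {
    r <- integer(1000)
    for (i in 1:1000) r[i] <- i + 1
    r
}

# Both versions must produce the same values before timing them.
same_result <- isTRUE(all.equal(f1(), f2()))

# Repeat each call many times so the elapsed time is measurable.
n_reps <- 1000
time_vectorized <- system.time(for (k in seq_len(n_reps)) f1())[["elapsed"]]
time_loop <- system.time(for (k in seq_len(n_reps)) f2())[["elapsed"]]

c(vectorized = time_vectorized, loop = time_loop)
```

A dedicated benchmarking package would give more precise medians, but `system.time` is enough to see the direction of the difference.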

Homogeneity of vectors

Character > Numeric > Integer > Logical

x <- c(1L, 2, "3")
class(x)

## [1] "character"

x

## [1] "1" "2" "3"
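The ordering Character > Numeric > Integer > Logical can be verified pair by pair: combining two types always yields the more general one. A minimal sketch in base R:

```r
# Each pair is coerced to the more general of its two types.
class(c(TRUE, 1L))    # "integer"
class(c(1L, 2.5))     # "numeric"
class(c(2.5, "a"))    # "character"

# The values themselves are converted, not just the label.
c(TRUE, 1L)    # 1 1
c(TRUE, "a")   # "TRUE" "a"
```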

Change one element and all of them change

x <- matrix(1:4, 2, 2)
class(x)

## [1] "matrix"

x[2]

## [1] 2

x[2] <- as.character(x[2])
x

##      [,1] [,2]
## [1,] "1"  "3" 
## [2,] "2"  "4"

A vector of R objects: List

Heterogeneity of lists

x <- list(1L, 2, "3")
x

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] "3"

It even accepts functions

x <- list(1L, 2, "3", mean)
x[[4]]

## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x1067098e0>
## <environment: namespace:base>

data.frame for tabular data

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.10	3.50	1.40	0.20	setosa
4.90	3.00	1.40	0.20	setosa
4.70	3.20	1.30	0.20	setosa
4.60	3.10	1.50	0.20	setosa
5.00	3.60	1.40	0.20	setosa
5.40	3.90	1.70	0.40	setosa

Every column must have the same length

length(iris$Sepal.Length)

## [1] 150

length(iris$Species)

## [1] 150

It is a kind of list

class(iris)

## [1] "data.frame"

is.list(iris)

## [1] TRUE

data.frame(a = 1L, b = "2")

##   a b
## 1 1 2
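Because a data.frame is a list of equal-length columns, list tools work on it column by column. The sketch below shows this with `iris`, and also sets `stringsAsFactors` explicitly, since its default changed in R 4.0.0 (character columns were converted to factors before that).

```r
# A data.frame has one list element per column.
length(iris)   # 5
names(iris)

# sapply walks over the columns, just as it would over a list.
column_classes <- sapply(iris, class)
column_classes

# Being explicit keeps the column type the same across R versions.
df <- data.frame(a = 1L, b = "2", stringsAsFactors = FALSE)
class(df$b)   # "character"
df[["a"]]     # 1
```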

Converting between data types

`as`

as.character
as.logical
as.integer
as.numeric

as.numeric("2")

## [1] 2

as.integer("a")

## Warning: NAs introduced by coercion

## [1] NA

NA stands for a missing value

Be careful: applying an operator to NA usually yields NA

z <- c(1:3, NA)
is.na(z)

## [1] FALSE FALSE FALSE  TRUE

z == NA

## [1] NA NA NA NA
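Since `==` cannot be used to find missing values, the usual approach is `is.na` together with the `na.rm` argument that many summary functions accept. A short sketch using the same vector `z`:

```r
z <- c(1:3, NA)

# A single NA makes the whole summary NA by default.
sum(z)                  # NA

# na.rm = TRUE drops missing values before computing.
sum(z, na.rm = TRUE)    # 6
mean(z, na.rm = TRUE)   # 2

# is.na gives a logical mask that can count or filter.
sum(is.na(z))    # 1
z[!is.na(z)]     # 1 2 3
```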

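Quiz

Among the atomic types character, logical, integer and numeric, which can represent qualitative data and which can represent quantitative data?
Judging by the conversions between character, numeric, integer and logical, which is the most general type?
There are two more atomic types. What are they?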

factor

Commonly used for qualitative variables

levels

x <- c(2, 1, 2, 2)
x

## [1] 2 1 2 2

x <- factor(c(2, 1, 2, 2), levels = 2:1)
x

## [1] 2 1 2 2
## Levels: 2 1

levels(x)

## [1] "2" "1"

A factor is really an integer vector

The integers are indices into the levels

x <- factor(c(2, 1, 2, 2), levels = 2:1)
as.integer(x)

## [1] 1 2 1 1

levels(x)

## [1] "2" "1"

levels(x)[as.integer(x)]

## [1] "2" "1" "2" "2"
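One practical consequence of factors being index vectors is a common pitfall: converting a factor of number-like labels directly to a number returns the indices, not the labels. A minimal sketch with a made-up factor `f`:

```r
# Hypothetical factor whose labels look like numbers.
f <- factor(c("10", "20", "10"))

# Direct conversion returns the level indices.
as.integer(f)                 # 1 2 1

# Go through the labels to recover the original numbers.
as.numeric(as.character(f))   # 10 20 10

# Equivalent: convert the levels once, then index by the factor.
as.numeric(levels(f))[f]      # 10 20 10
```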

POSIXct, commonly used for handling time

Convert to and from strings with `format` and `as.POSIXct`

as.POSIXct("2014-01-01 23:50:34")

## [1] "2014-01-01 23:50:34 CST"

format(Sys.time())

## [1] "2014-02-24 10:37:15"

format(Sys.time(), "%Y%m%d %H%M%S")

## [1] "20140224 103715"

Internally an 8-byte number (stored as a double, not an integer)

The unit is seconds
0 is 1970-01-01 00:00:00 GMT

as.integer(Sys.time())

## [1] 1393209435

x <- as.POSIXct("1-01-01 00:00:00")
as.integer(x)

## Warning: NAs introduced by coercion

## [1] NA
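The NA above appears because the number of seconds falls outside the range of R's 32-bit integers, not because the time itself is invalid. The sketch below shows the same effect with a date far in the future (the year 2100 is an arbitrary choice) and how `as.numeric` avoids it.

```r
x <- as.POSIXct("2100-01-01 00:00:00", tz = "UTC")

# POSIXct values are stored as doubles.
typeof(x)   # "double"

# as.numeric keeps the full value in seconds since 1970-01-01 UTC.
seconds <- as.numeric(x)
seconds   # 4102444800

# The value is too large for an integer, so as.integer(x)
# would return NA with a warning.
seconds > .Machine$integer.max   # TRUE
```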

How data types affect functions

Which data types can the mode, median, interquartile range and arithmetic mean be applied to?

summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##
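`summary` chooses its output by column type: quantiles and the mean for numeric columns, counts for the factor. The same distinction can be seen by calling the individual functions directly, as in this sketch (the variable names are assumptions).

```r
# Nominal column: only counts, and hence the mode, make sense.
species_counts <- table(iris$Species)
modes <- names(species_counts)[species_counts == max(species_counts)]
modes   # all three species tie at 50

# Quantitative column: median, interquartile range and mean all apply.
median(iris$Sepal.Length)   # 5.8
IQR(iris$Sepal.Length)      # 1.3
mean(iris$Sepal.Length)     # about 5.843

# Asking for the mean of a factor returns NA with a warning.
factor_mean <- suppressWarnings(mean(iris$Species))
factor_mean   # NA
```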

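Organizing data

A common data-preparation workflow

Arrange the data into atomic vectors of equal length
Combine them into a data.frame (tabular data)
Apply higher-level tools for analysis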

age <- c("18", "17", "10")
name <- c("Wush", "c3h3", "+1")
tool <- c("R", "R,Python", "Python")
df <- data.frame(age = age, name = name, tool = tool)
df

##   age name     tool
## 1  18 Wush        R
## 2  17 c3h3 R,Python
## 3  10   +1   Python

Selecting from a vector with `[`

`[` + logical

x <- 1:10
x

##  [1]  1  2  3  4  5  6  7  8  9 10

head(1:10 < 3)

## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

x[1:10 < 3]

## [1] 1 2

`[` + integer/numeric

x[1:3]

## [1] 1 2 3

`[` + character

names(x) <- letters[1:10]
x

##  a  b  c  d  e  f  g  h  i  j 
##  1  2  3  4  5  6  7  8  9 10

x[c("a", "c")]

## a c 
## 1 3
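Two further idioms build on the three index types above: negative integers exclude positions, and `which` turns a logical condition into integer positions. A brief sketch using the same named vector:

```r
x <- 1:10
names(x) <- letters[1:10]

# Negative indices drop the given positions.
unname(x[-(1:7)])          # 8 9 10

# which converts a logical mask into positions.
unname(which(x > 8))       # 9 10

# A logical condition can be used directly as the index.
unname(x[x %% 2 == 0])     # 2 4 6 8 10
```

`unname` is used here only to keep the printed output short; the selected elements keep their names otherwise.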

Selecting from matrices and arrays

x <- matrix(1:4, 2, 2)
x

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

x[1, 2]

## [1] 3

x[1:2, 1]

## [1] 1 2

x <- array(1:8, c(2, 2, 2))
x

## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

x[1, 2, 1]

## [1] 3

Preserving the matrix or array structure

x <- matrix(1:4, 2, 2)
2

## [1] 2

x[1:2, 1]

## [1] 1 2

x[1:2, 1, drop = FALSE]

##      [,1]
## [1,]    1
## [2,]    2

data.frame works similarly

iris[1:2, 1:2]

##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0

head(iris[, 3])

## [1] 1.4 1.4 1.3 1.5 1.4 1.7
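The effect of `drop` is easiest to see by checking the class and dimensions of each result. The following sketch covers both the matrix and the data.frame case:

```r
x <- matrix(1:4, 2, 2)

# Selecting one column drops the matrix structure by default.
is.matrix(x[1:2, 1])            # FALSE
dim(x[1:2, 1, drop = FALSE])    # 2 1

# The same applies to a data.frame.
class(iris[, 3])                 # "numeric"
class(iris[, 3, drop = FALSE])   # "data.frame"

# List-style indexing with a single index always keeps the data.frame.
class(iris[3])                   # "data.frame"
```

Passing `drop = FALSE` is a common defensive habit in functions where the number of selected columns is not known in advance.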

Selecting from a list

(x <- list(a = 1L, b = "2"))

## $a
## [1] 1
## 
## $b
## [1] "2"

x[2]

## $b
## [1] "2"

x[[2]]

## [1] "2"

x[["a"]]

## [1] 1

x$a

## [1] 1
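The difference between `[` and `[[` is that the former returns a sub-list while the latter returns the element itself. A short sketch that checks this on the same list:

```r
x <- list(a = 1L, b = "2")

# Single brackets return a list containing the selected elements.
class(x[2])                 # "list"
length(x[c("a", "b")])      # 2

# Double brackets and $ return the element itself.
class(x[[2]])               # "character"
class(x$a)                  # "integer"

# Asking for a name that does not exist gives NULL rather than an error.
is.null(x$c)                # TRUE
```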

Anything you can select can also be replaced: `[<-`

(x <- matrix(1:4, 2, 2))

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

(x[1, 2] <- 100)

## [1] 100

x

##      [,1] [,2]
## [1,]    1  100
## [2,]    2    4

(x <- list(a = 1, b = "2"))

## $a
## [1] 1
## 
## $b
## [1] "2"

x$b <- pi
x

## $a
## [1] 1
## 
## $b
## [1] 3.142

Extending a vector

c(1:4, 5)

## [1] 1 2 3 4 5

x <- 1:4
x[5] <- 5
x

## [1] 1 2 3 4 5

length(x) <- 6
x

## [1]  1  2  3  4  5 NA

Shrinking a vector

x <- 1:4
length(x) <- 3
x

## [1] 1 2 3

Quiz

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

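Select the rows where Species is setosa and Petal.Length is at most 1.5
Pick out the Species and Sepal.Length columns
Standardize Sepal.Length (mean 0, variance 1)
Sort the data by Sepal.Length (?order)
Compute the mean of Sepal.Length
Count how many times each Species appears in iris
Select the rows whose Species name contains the letter "g"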
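One possible set of base R answers, using only the selection tools introduced so far, is sketched below. The result names are assumptions, and other solutions are equally valid.

```r
# 1. Rows where Species is setosa and Petal.Length is at most 1.5.
small_setosa <- iris[iris$Species == "setosa" & iris$Petal.Length <= 1.5, ]

# 2. Only the Species and Sepal.Length columns.
two_columns <- iris[, c("Species", "Sepal.Length")]

# 3. Standardized Sepal.Length; scale returns a one-column matrix,
#    so as.vector turns it back into a plain vector.
z_sepal_length <- as.vector(scale(iris$Sepal.Length))

# 4. Rows sorted by Sepal.Length.
sorted_iris <- iris[order(iris$Sepal.Length), ]

# 5. Mean of Sepal.Length.
mean_sepal_length <- mean(iris$Sepal.Length)

# 6. Number of rows per species.
species_counts <- table(iris$Species)

# 7. Rows whose species name contains the letter "g".
with_g <- iris[grepl("g", as.character(iris$Species)), ]
```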

Big challenge

age <- c("18", "17", "10")
name <- c("Wush", "c3h3", "+1")
tool <- c("R", "R,Python", "Python")
df <- data.frame(age = age, name = name, tool = tool)
df

##   age name     tool
## 1  18 Wush        R
## 2  17 c3h3 R,Python
## 3  10   +1   Python

How would you tidy up the table above?
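The slides leave the answer open. One possible reading, assumed here, is that `age` should be a number and that each row should hold a single tool. Under those assumptions, a base R sketch could look like this (the names `tools` and `tidy` are illustrative):

```r
age <- c("18", "17", "10")
name <- c("Wush", "c3h3", "+1")
tool <- c("R", "R,Python", "Python")
df <- data.frame(age = age, name = name, tool = tool,
                 stringsAsFactors = FALSE)

# Ages were entered as text; convert them to integers.
df$age <- as.integer(df$age)

# Split the comma-separated tools into one vector per person.
tools <- strsplit(df$tool, ",", fixed = TRUE)

# Repeat each person's row once per tool and flatten the tools.
n_tools <- lengths(tools)
tidy <- data.frame(name = rep(df$name, n_tools),
                   age = rep(df$age, n_tools),
                   tool = unlist(tools),
                   stringsAsFactors = FALSE)
tidy
```

With one tool per row, counting users per tool becomes a plain `table(tidy$tool)`.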

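Bonus: an introduction to dplyr

dplyr by Hadley Wickham

http://blog.rstudio.org/2014/01/17/introducing-dplyr/

A data-manipulation package released in January 2014 that improves on its predecessor, plyr
Better performance! (All hail Romain Francois of Rcpp fame)
Its functions are meant to work with "all" tabular data:
- data.frame
- data.table (library(data.table);?data.table)
- SQL databases, e.g. SQLite, PostgreSQL, MySQL, Google BigQuery
It provides the following basic verbs, which can be combined to build more advanced operations: filter, select, mutate, arrange, summarise, group_by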

filter

Select the rows where Species is setosa and Petal.Length is at most 1.5

library(dplyr)
head(filter(iris, Species == "setosa", Petal.Length <= 1.5))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          4.6         3.4          1.4         0.3  setosa

select

Pick out the Species and Sepal.Length columns

head(select(iris, Species, Sepal.Length))

##   Species Sepal.Length
## 1  setosa          5.1
## 2  setosa          4.9
## 3  setosa          4.7
## 4  setosa          4.6
## 5  setosa          5.0
## 6  setosa          5.4

mutate

Standardize Sepal.Length (mean 0, variance 1)

head(mutate(iris, zSepal.Length = scale(Sepal.Length)))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species zSepal.Length
## 1          5.1         3.5          1.4         0.2  setosa       -0.8977
## 2          4.9         3.0          1.4         0.2  setosa       -1.1392
## 3          4.7         3.2          1.3         0.2  setosa       -1.3807
## 4          4.6         3.1          1.5         0.2  setosa       -1.5015
## 5          5.0         3.6          1.4         0.2  setosa       -1.0184
## 6          5.4         3.9          1.7         0.4  setosa       -0.5354

arrange

Sort the data by Sepal.Length (?order)

head(arrange(iris, Sepal.Length), 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          4.3         3.0          1.1         0.1  setosa
## 2          4.4         2.9          1.4         0.2  setosa
## 3          4.4         3.0          1.3         0.2  setosa

Sort the data by Sepal.Length, then Petal.Length (?order)

head(arrange(iris, Sepal.Length, Petal.Length), 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          4.3         3.0          1.1         0.1  setosa
## 2          4.4         3.0          1.3         0.2  setosa
## 3          4.4         3.2          1.3         0.2  setosa

summarise

Compute the mean of Sepal.Length

head(summarise(iris, mean(Sepal.Length)))

##   mean(Sepal.Length)
## 1              5.843

group_by

Count how many times each Species appears in iris

iris.g <- group_by(iris, Species)
head(iris.g, 2)

## Source: local data frame [2 x 5]
## Groups: Species
## 
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa

summarise(iris.g, length(Species))

## Source: local data frame [3 x 2]
## 
##      Species length(Species)
## 1  virginica              50
## 2 versicolor              50
## 3     setosa              50
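For comparison, the grouped summaries above can also be written in base R without any package. This sketch uses `table`, `tapply` and `aggregate`; it is offered as an equivalent, not as the approach the slides recommend.

```r
# Count rows per group.
species_counts <- table(iris$Species)
as.vector(species_counts)   # 50 50 50

# Apply a function to one column within each group.
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
mean_by_species   # setosa 5.006, versicolor 5.936, virginica 6.588

# The same summary returned as a data.frame.
mean_table <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_table
```

The dplyr verbs express the same steps, with the advantage that the identical code can also run against data.table objects or SQL databases.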
