dplyr 函數包是處理矩形資料(dataframe、tibble 格式)的利器, 函數大多以 function( 物標物件, 變數 ) 的格式做表達,形式簡潔可讀性高, 且函數間可透過 %>% (pipe) 運算符號(Ctrl + Shift + M) 組合,進而輕鬆實現彈性、高效率的資料處理。
stockdata.csv : 2018-01-01 至 2018-12-31 部分台股上市公司代碼、產業別、每日調整後開高收低價格。 (資料來源TEJ+台灣經濟新報)
eps.csv : 2018年台股上市公司全年累計每股盈餘(eps)。 (資料來源 TEJ+台灣經濟新報)
#設定工作目錄
setwd("C:/r_example")
#讀取資料
stockData <- read.csv("stockdata.csv",
header = TRUE,
sep = ",",
stringsAsFactors = FALSE) %>% as_tibble()
epsData <- read.csv("eps.csv",
header = TRUE,
sep = ",",
stringsAsFactors = FALSE) %>% as_tibble()## [1] "tbl_df" "tbl" "data.frame"
## [1] "code" "ind" "date" "open" "high" "low" "close"
## [1] 28478 7
| code | ind | date | open | high | low | close |
|---|---|---|---|---|---|---|
| 1101 | 1 | 20180102 | 27.68 | 27.79 | 27.37 | 27.75 |
| 1102 | 1 | 20180102 | 25.90 | 26.00 | 25.67 | 25.81 |
| 1103 | 1 | 20180102 | 12.12 | 12.21 | 11.94 | 12.21 |
| 1104 | 1 | 20180102 | 20.66 | 20.70 | 20.57 | 20.66 |
| 1108 | 1 | 20180102 | 8.35 | 8.40 | 8.35 | 8.40 |
| 1109 | 1 | 20180102 | 12.68 | 13.17 | 12.68 | 13.02 |
| 1110 | 1 | 20180102 | 14.56 | 14.56 | 14.31 | 14.46 |
| 1201 | 2 | 20180102 | 20.65 | 20.65 | 20.35 | 20.50 |
| 1203 | 2 | 20180102 | 23.97 | 24.21 | 23.97 | 24.07 |
| 1210 | 2 | 20180102 | 27.46 | 27.54 | 27.33 | 27.42 |
將資料照所選欄位排序
## # A tibble: 28,478 x 7
## code ind date open high low close
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1101 1 20181228 30.6 30.9 30.6 30.8
## 2 1101 1 20181227 30.9 30.9 30.6 30.6
## 3 1101 1 20181226 30.9 31.0 30.6 30.6
## 4 1101 1 20181225 30.3 31.0 30.2 31.0
## 5 1101 1 20181224 30.8 31.1 30.7 30.9
## 6 1101 1 20181222 30.6 30.8 30.5 30.6
## 7 1101 1 20181221 30.6 30.8 30.5 30.6
## 8 1101 1 20181220 30.9 31.2 30.5 30.6
## 9 1101 1 20181219 30.9 31.1 30.8 31.1
## 10 1101 1 20181218 31.0 31.3 30.7 30.8
## # ... with 28,468 more rows
選擇資料中所需欄位
## # A tibble: 28,478 x 3
## close code ind
## <dbl> <int> <int>
## 1 27.8 1101 1
## 2 25.8 1102 1
## 3 12.2 1103 1
## 4 20.7 1104 1
## 5 8.4 1108 1
## 6 13.0 1109 1
## 7 14.5 1110 1
## 8 20.5 1201 2
## 9 24.1 1203 2
## 10 27.4 1210 2
## # ... with 28,468 more rows
## [1] "code" "ind" "date" "open"
## # A tibble: 28,478 x 4
## code ind date open
## <int> <int> <int> <dbl>
## 1 1101 1 20180102 27.7
## 2 1102 1 20180102 25.9
## 3 1103 1 20180102 12.1
## 4 1104 1 20180102 20.7
## 5 1108 1 20180102 8.35
## 6 1109 1 20180102 12.7
## 7 1110 1 20180102 14.6
## 8 1201 2 20180102 20.6
## 9 1203 2 20180102 24.0
## 10 1210 2 20180102 27.5
## # ... with 28,468 more rows
## # A tibble: 28,478 x 6
## code date open high low close
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1101 20180102 27.7 27.8 27.4 27.8
## 2 1102 20180102 25.9 26 25.7 25.8
## 3 1103 20180102 12.1 12.2 11.9 12.2
## 4 1104 20180102 20.7 20.7 20.6 20.7
## 5 1108 20180102 8.35 8.4 8.35 8.4
## 6 1109 20180102 12.7 13.2 12.7 13.0
## 7 1110 20180102 14.6 14.6 14.3 14.5
## 8 1201 20180102 20.6 20.6 20.4 20.5
## 9 1203 20180102 24.0 24.2 24.0 24.1
## 10 1210 20180102 27.5 27.5 27.3 27.4
## # ... with 28,468 more rows
依照條件篩選資料
## # A tibble: 48 x 7
## code ind date open high low close
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1590 5 20180102 523. 526. 501. 511.
## 2 1590 5 20180103 516. 520. 511. 515.
## 3 1590 5 20180104 518. 522. 513. 520.
## 4 1590 5 20180105 525. 525. 509. 512.
## 5 1590 5 20180108 515. 515. 498. 501.
## 6 1590 5 20180226 506. 512. 503. 510.
## 7 1590 5 20180227 518. 529. 518. 522.
## 8 1590 5 20180301 522. 533. 513. 516.
## 9 1590 5 20180302 508. 517. 508. 513.
## 10 1590 5 20180329 488. 507. 484. 507.
## # ... with 38 more rows
## # A tibble: 163 x 7
## code ind date open high low close
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1101 1 20180424 30.0 31.1 30.0 31.0
## 2 1101 1 20180425 30.9 31.1 30.3 30.5
## 3 1101 1 20180426 30.6 31.1 30.6 30.8
## 4 1101 1 20180427 31.1 31.1 30.4 30.6
## 5 1101 1 20180430 30.9 31.0 30.5 31.0
## 6 1101 1 20180502 31.0 31.4 30.9 31.0
## 7 1101 1 20180504 30.1 30.4 30.0 30.0
## 8 1101 1 20180507 30.4 31.4 30.4 30.7
## 9 1101 1 20180508 30.7 31.3 30.4 31.0
## 10 1101 1 20180509 30.9 30.9 30.7 30.8
## # ... with 153 more rows
#從資料中篩選出股票代碼(code)在特定名單(codeList)內者
codeList <- c(1101,1102,1103)
stockData %>% filter( code %in% codeList)## # A tibble: 741 x 7
## code ind date open high low close
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1101 1 20180102 27.7 27.8 27.4 27.8
## 2 1102 1 20180102 25.9 26 25.7 25.8
## 3 1103 1 20180102 12.1 12.2 11.9 12.2
## 4 1101 1 20180103 27.7 28.4 27.5 28.4
## 5 1102 1 20180103 25.9 26 25.7 26
## 6 1103 1 20180103 12.2 12.2 12.0 12.1
## 7 1101 1 20180104 28.4 28.4 27.9 28.2
## 8 1102 1 20180104 26.1 26.4 26 26.2
## 9 1103 1 20180104 12.1 12.5 12.1 12.1
## 10 1101 1 20180105 28.5 28.5 28.1 28.3
## # ... with 731 more rows
將現有資料欄位經過計算後,形成新欄位
#將資料中的最高價(high)減去最低價(low),形成最大價差(maxDif)欄位,並將其取log
stockData %>% mutate(maxDif = high-low,
log_maxDif = log(maxDif))## # A tibble: 28,478 x 9
## code ind date open high low close maxDif log_maxDif
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1101 1 20180102 27.7 27.8 27.4 27.8 0.420 -0.868
## 2 1102 1 20180102 25.9 26 25.7 25.8 0.330 -1.11
## 3 1103 1 20180102 12.1 12.2 11.9 12.2 0.27 -1.31
## 4 1104 1 20180102 20.7 20.7 20.6 20.7 0.130 -2.04
## 5 1108 1 20180102 8.35 8.4 8.35 8.4 0.05 -3.00
## 6 1109 1 20180102 12.7 13.2 12.7 13.0 0.49 -0.713
## 7 1110 1 20180102 14.6 14.6 14.3 14.5 0.25 -1.39
## 8 1201 2 20180102 20.6 20.6 20.4 20.5 0.300 -1.20
## 9 1203 2 20180102 24.0 24.2 24.0 24.1 0.24 -1.43
## 10 1210 2 20180102 27.5 27.5 27.3 27.4 0.21 -1.56
## # ... with 28,468 more rows
## # A tibble: 28,478 x 8
## code ind date open high low close n
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1101 1 20180102 27.7 27.8 27.4 27.8 1
## 2 1102 1 20180102 25.9 26 25.7 25.8 2
## 3 1103 1 20180102 12.1 12.2 11.9 12.2 3
## 4 1104 1 20180102 20.7 20.7 20.6 20.7 4
## 5 1108 1 20180102 8.35 8.4 8.35 8.4 5
## 6 1109 1 20180102 12.7 13.2 12.7 13.0 6
## 7 1110 1 20180102 14.6 14.6 14.3 14.5 7
## 8 1201 2 20180102 20.6 20.6 20.4 20.5 8
## 9 1203 2 20180102 24.0 24.2 24.0 24.1 9
## 10 1210 2 20180102 27.5 27.5 27.3 27.4 10
## # ... with 28,468 more rows
## # A tibble: 28,478 x 8
## code ind date open high low close lClose
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1101 1 20180102 27.7 27.8 27.4 27.8 3.32
## 2 1102 1 20180102 25.9 26 25.7 25.8 3.25
## 3 1103 1 20180102 12.1 12.2 11.9 12.2 2.50
## 4 1104 1 20180102 20.7 20.7 20.6 20.7 3.03
## 5 1108 1 20180102 8.35 8.4 8.35 8.4 2.13
## 6 1109 1 20180102 12.7 13.2 12.7 13.0 2.57
## 7 1110 1 20180102 14.6 14.6 14.3 14.5 2.67
## 8 1201 2 20180102 20.6 20.6 20.4 20.5 3.02
## 9 1203 2 20180102 24.0 24.2 24.0 24.1 3.18
## 10 1210 2 20180102 27.5 27.5 27.3 27.4 3.31
## # ... with 28,468 more rows
將資料依照特定欄位內容分組
## # A tibble: 1 x 7
## code ind date open high low close
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1417 4 20181030 4.23 4.29 4.19 4.26
## # A tibble: 6 x 7
## # Groups: ind [4]
## code ind date open high low close
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1108 1 20181026 7.15 7.16 6.99 6.99
## 2 1217 2 20181030 6.61 6.68 6.55 6.58
## 3 1417 4 20181030 4.23 4.29 4.19 4.26
## 4 1538 5 20181102 4.7 4.7 4.7 4.7
## 5 1538 5 20181105 4.7 4.7 4.7 4.7
## 6 1538 5 20181106 4.7 4.7 4.7 4.7
將資料做總結,常搭配group_by()使用
## # A tibble: 4 x 2
## ind sum_close
## <int> <dbl>
## 1 1 18.6
## 2 2 39.4
## 3 4 27.7
## 4 5 55.0
#將數列往"下"移動一格,計算報酬率時常用
stockData %>% arrange(code,date) %>% group_by(code) %>% mutate(ret = (close/lag(close,1)) -1)## # A tibble: 28,478 x 8
## # Groups: code [116]
## code ind date open high low close ret
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1101 1 20180102 27.7 27.8 27.4 27.8 NA
## 2 1101 1 20180103 27.7 28.4 27.5 28.4 0.0220
## 3 1101 1 20180104 28.4 28.4 27.9 28.2 -0.00388
## 4 1101 1 20180105 28.5 28.5 28.1 28.3 0.00248
## 5 1101 1 20180108 28.4 29.1 28.2 29.0 0.0244
## 6 1101 1 20180109 28.9 29.3 28.8 29.1 0.00241
## 7 1101 1 20180110 29.1 29.2 28.9 29.2 0.00413
## 8 1101 1 20180111 29.1 29.1 28.4 28.5 -0.0247
## 9 1101 1 20180112 28.7 28.8 28.4 28.6 0.00246
## 10 1101 1 20180115 28.6 28.7 28.5 28.7 0.00665
## # ... with 28,468 more rows
## # A tibble: 4 x 1
## ind
## <int>
## 1 1
## 2 2
## 3 4
## 4 5
## # A tibble: 28,478 x 8
## code ind date open high low close eps
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1101 1 20180102 27.7 27.8 27.4 27.8 2.03
## 2 1102 1 20180102 25.9 26 25.7 25.8 1.74
## 3 1103 1 20180102 12.1 12.2 11.9 12.2 1.39
## 4 1104 1 20180102 20.7 20.7 20.6 20.7 2.16
## 5 1108 1 20180102 8.35 8.4 8.35 8.4 -0.25
## 6 1109 1 20180102 12.7 13.2 12.7 13.0 1.32
## 7 1110 1 20180102 14.6 14.6 14.3 14.5 0.11
## 8 1201 2 20180102 20.6 20.6 20.4 20.5 1.31
## 9 1203 2 20180102 24.0 24.2 24.0 24.1 1.86
## 10 1210 2 20180102 27.5 27.5 27.3 27.4 3.28
## # ... with 28,468 more rows
更多join系列動圖看 這裡
情境:
希望以股價、EPS資料做分析,研究市場是否會給予產業內獲利的公司股價正面反應?
又不同產業間,狀況是否有差異?
請使用本範例提供的stockdata.csv及eps.csv進行操作,
獲利的定義為eps大於10,正面反應的定義為全年累計報酬為正。
步驟如下:
請將eps.csv合併至stockdata.csv並篩選出eps大於10者。(left_join, filter)
請將個股分群後依照代碼、日期排序並計算單日簡單報酬、全年累積報酬。(gorup_by, arrange, mutate, lag, cumprod)
計算各產業平均全年累計報酬。(group_by, summarize)