R語言量化投資實作工作坊

dplyr 介紹:

dplyr 函數包是處理矩形資料(dataframe、tibble 格式)的利器，函數大多以 function( 物標物件, 變數 ) 的格式做表達，形式簡潔可讀性高，且函數間可透過 %>% (pipe) 運算符號(Ctrl + Shift + M) 組合，進而輕鬆實現彈性、高效率的資料處理。

安裝並載入函數包:

tidyverse中包含了dplyr、ggplot2，與stringr等數據處理常用包，直接呼叫即可，

細節可參考官方網站

#第一次使用tidyverse，請先安裝
#install.packages("tidyverse")
library(tidyverse)

範例資料介紹:

stockdata.csv : 2018-01-01 至 2018-12-31 部分台股上市公司代碼、產業別、每日調整後開高收低價格。 (資料來源TEJ+台灣經濟新報)
eps.csv : 2018年台股上市公司全年累計每股盈餘(eps)。 (資料來源 TEJ+台灣經濟新報)

#設定工作目錄
setwd("C:/r_example")

#讀取資料
stockData <- read.csv("stockdata.csv",
                      header = TRUE,
                      sep = ",",
                      stringsAsFactors = FALSE) %>% as_tibble()

epsData <- read.csv("eps.csv",
                    header = TRUE,
                    sep = ",",
                    stringsAsFactors = FALSE) %>% as_tibble()

#查看資料的類型
class(stockData)

## [1] "tbl_df"     "tbl"        "data.frame"

#觀看資料所有的欄位(變數)名稱
colnames(stockData)

## [1] "code"  "ind"   "date"  "open"  "high"  "low"   "close"

#觀看資料筆數、變數數量
dim(stockData)

## [1] 28478     7

#觀看資料前10筆資料
head(stockData,10)

code	ind	date	open	high	low	close
1101	1	20180102	27.68	27.79	27.37	27.75
1102	1	20180102	25.90	26.00	25.67	25.81
1103	1	20180102	12.12	12.21	11.94	12.21
1104	1	20180102	20.66	20.70	20.57	20.66
1108	1	20180102	8.35	8.40	8.35	8.40
1109	1	20180102	12.68	13.17	12.68	13.02
1110	1	20180102	14.56	14.56	14.31	14.46
1201	2	20180102	20.65	20.65	20.35	20.50
1203	2	20180102	23.97	24.21	23.97	24.07
1210	2	20180102	27.46	27.54	27.33	27.42

dplyr常用函數介紹:

Arrange

將資料照所選欄位排序

  #將資料依照股票代碼(code)從小排到大，且依日期(date)從大排到小
  stockData %>% arrange(code,-date)

## # A tibble: 28,478 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20181228  30.6  30.9  30.6  30.8
##  2  1101     1 20181227  30.9  30.9  30.6  30.6
##  3  1101     1 20181226  30.9  31.0  30.6  30.6
##  4  1101     1 20181225  30.3  31.0  30.2  31.0
##  5  1101     1 20181224  30.8  31.1  30.7  30.9
##  6  1101     1 20181222  30.6  30.8  30.5  30.6
##  7  1101     1 20181221  30.6  30.8  30.5  30.6
##  8  1101     1 20181220  30.9  31.2  30.5  30.6
##  9  1101     1 20181219  30.9  31.1  30.8  31.1
## 10  1101     1 20181218  31.0  31.3  30.7  30.8
## # ... with 28,468 more rows

Select

選擇資料中所需欄位

  #從資料中依序選出代碼(code)、產業別(ind)、以及收盤價(close)欄位
  stockData %>% select(close,code,ind)

## # A tibble: 28,478 x 3
##    close  code   ind
##    <dbl> <int> <int>
##  1  27.8  1101     1
##  2  25.8  1102     1
##  3  12.2  1103     1
##  4  20.7  1104     1
##  5   8.4  1108     1
##  6  13.0  1109     1
##  7  14.5  1110     1
##  8  20.5  1201     2
##  9  24.1  1203     2
## 10  27.4  1210     2
## # ... with 28,468 more rows

  #依照特定名單選取所需欄位
  colList <- colnames(stockData)[c(1:4)]
  colList

## [1] "code" "ind"  "date" "open"

  stockData %>% select(colList)

## # A tibble: 28,478 x 4
##     code   ind     date  open
##    <int> <int>    <int> <dbl>
##  1  1101     1 20180102 27.7 
##  2  1102     1 20180102 25.9 
##  3  1103     1 20180102 12.1 
##  4  1104     1 20180102 20.7 
##  5  1108     1 20180102  8.35
##  6  1109     1 20180102 12.7 
##  7  1110     1 20180102 14.6 
##  8  1201     2 20180102 20.6 
##  9  1203     2 20180102 24.0 
## 10  1210     2 20180102 27.5 
## # ... with 28,468 more rows

  #將資料中的產業(ind)欄位刪除
  stockData %>% select(-ind)

## # A tibble: 28,478 x 6
##     code     date  open  high   low close
##    <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101 20180102 27.7   27.8 27.4   27.8
##  2  1102 20180102 25.9   26   25.7   25.8
##  3  1103 20180102 12.1   12.2 11.9   12.2
##  4  1104 20180102 20.7   20.7 20.6   20.7
##  5  1108 20180102  8.35   8.4  8.35   8.4
##  6  1109 20180102 12.7   13.2 12.7   13.0
##  7  1110 20180102 14.6   14.6 14.3   14.5
##  8  1201 20180102 20.6   20.6 20.4   20.5
##  9  1203 20180102 24.0   24.2 24.0   24.1
## 10  1210 20180102 27.5   27.5 27.3   27.4
## # ... with 28,468 more rows

Filter

依照條件篩選資料

  #從資料中篩選出收盤價(close)高於500元者
  stockData %>% filter(close>= 500)

## # A tibble: 48 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1590     5 20180102  523.  526.  501.  511.
##  2  1590     5 20180103  516.  520.  511.  515.
##  3  1590     5 20180104  518.  522.  513.  520.
##  4  1590     5 20180105  525.  525.  509.  512.
##  5  1590     5 20180108  515.  515.  498.  501.
##  6  1590     5 20180226  506.  512.  503.  510.
##  7  1590     5 20180227  518.  529.  518.  522.
##  8  1590     5 20180301  522.  533.  513.  516.
##  9  1590     5 20180302  508.  517.  508.  513.
## 10  1590     5 20180329  488.  507.  484.  507.
## # ... with 38 more rows

  #從資料中篩選出股票代碼(code)為1101,且收盤價(close)大於30元者
  stockData %>% filter( (code == 1101) & (close> 30) )

## # A tibble: 163 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20180424  30.0  31.1  30.0  31.0
##  2  1101     1 20180425  30.9  31.1  30.3  30.5
##  3  1101     1 20180426  30.6  31.1  30.6  30.8
##  4  1101     1 20180427  31.1  31.1  30.4  30.6
##  5  1101     1 20180430  30.9  31.0  30.5  31.0
##  6  1101     1 20180502  31.0  31.4  30.9  31.0
##  7  1101     1 20180504  30.1  30.4  30.0  30.0
##  8  1101     1 20180507  30.4  31.4  30.4  30.7
##  9  1101     1 20180508  30.7  31.3  30.4  31.0
## 10  1101     1 20180509  30.9  30.9  30.7  30.8
## # ... with 153 more rows

  #從資料中篩選出股票代碼(code)在特定名單(codeList)內者
  codeList <- c(1101,1102,1103)
  stockData %>% filter( code %in% codeList)

## # A tibble: 741 x 7
##     code   ind     date  open  high   low close
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20180102  27.7  27.8  27.4  27.8
##  2  1102     1 20180102  25.9  26    25.7  25.8
##  3  1103     1 20180102  12.1  12.2  11.9  12.2
##  4  1101     1 20180103  27.7  28.4  27.5  28.4
##  5  1102     1 20180103  25.9  26    25.7  26  
##  6  1103     1 20180103  12.2  12.2  12.0  12.1
##  7  1101     1 20180104  28.4  28.4  27.9  28.2
##  8  1102     1 20180104  26.1  26.4  26    26.2
##  9  1103     1 20180104  12.1  12.5  12.1  12.1
## 10  1101     1 20180105  28.5  28.5  28.1  28.3
## # ... with 731 more rows

Mutate

將現有資料欄位經過計算後，形成新欄位

  #將資料中的最高價(high)減去最低價(low)，形成最大價差(maxDif)欄位，並將其取log
  stockData %>% mutate(maxDif = high-low,
                       log_maxDif = log(maxDif))

## # A tibble: 28,478 x 9
##     code   ind     date  open  high   low close maxDif log_maxDif
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>  <dbl>      <dbl>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8  0.420     -0.868
##  2  1102     1 20180102 25.9   26   25.7   25.8  0.330     -1.11 
##  3  1103     1 20180102 12.1   12.2 11.9   12.2  0.27      -1.31 
##  4  1104     1 20180102 20.7   20.7 20.6   20.7  0.130     -2.04 
##  5  1108     1 20180102  8.35   8.4  8.35   8.4  0.05      -3.00 
##  6  1109     1 20180102 12.7   13.2 12.7   13.0  0.49      -0.713
##  7  1110     1 20180102 14.6   14.6 14.3   14.5  0.25      -1.39 
##  8  1201     2 20180102 20.6   20.6 20.4   20.5  0.300     -1.20 
##  9  1203     2 20180102 24.0   24.2 24.0   24.1  0.24      -1.43 
## 10  1210     2 20180102 27.5   27.5 27.3   27.4  0.21      -1.56 
## # ... with 28,468 more rows

  #將資料排序序號形成新欄位(n)
  stockData %>% mutate( n = row_number() )

## # A tibble: 28,478 x 8
##     code   ind     date  open  high   low close     n
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl> <int>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8     1
##  2  1102     1 20180102 25.9   26   25.7   25.8     2
##  3  1103     1 20180102 12.1   12.2 11.9   12.2     3
##  4  1104     1 20180102 20.7   20.7 20.6   20.7     4
##  5  1108     1 20180102  8.35   8.4  8.35   8.4     5
##  6  1109     1 20180102 12.7   13.2 12.7   13.0     6
##  7  1110     1 20180102 14.6   14.6 14.3   14.5     7
##  8  1201     2 20180102 20.6   20.6 20.4   20.5     8
##  9  1203     2 20180102 24.0   24.2 24.0   24.1     9
## 10  1210     2 20180102 27.5   27.5 27.3   27.4    10
## # ... with 28,468 more rows

  #將收盤價(close)取log後形成新欄位(lClose)
  stockData %>% mutate(lClose = log(close))

## # A tibble: 28,478 x 8
##     code   ind     date  open  high   low close lClose
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8   3.32
##  2  1102     1 20180102 25.9   26   25.7   25.8   3.25
##  3  1103     1 20180102 12.1   12.2 11.9   12.2   2.50
##  4  1104     1 20180102 20.7   20.7 20.6   20.7   3.03
##  5  1108     1 20180102  8.35   8.4  8.35   8.4   2.13
##  6  1109     1 20180102 12.7   13.2 12.7   13.0   2.57
##  7  1110     1 20180102 14.6   14.6 14.3   14.5   2.67
##  8  1201     2 20180102 20.6   20.6 20.4   20.5   3.02
##  9  1203     2 20180102 24.0   24.2 24.0   24.1   3.18
## 10  1210     2 20180102 27.5   27.5 27.3   27.4   3.31
## # ... with 28,468 more rows

Group_By

將資料依照特定欄位內容分組

  #將資料依照產業(ind)分群，並查詢每群中最低價的收盤價(close)
  
  #分組前
  stockData %>% filter(close == min(close))

## # A tibble: 1 x 7
##    code   ind     date  open  high   low close
##   <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
## 1  1417     4 20181030  4.23  4.29  4.19  4.26

  #分組後
  stockData %>% group_by(ind) %>% filter(close == min(close))

## # A tibble: 6 x 7
## # Groups:   ind [4]
##    code   ind     date  open  high   low close
##   <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>
## 1  1108     1 20181026  7.15  7.16  6.99  6.99
## 2  1217     2 20181030  6.61  6.68  6.55  6.58
## 3  1417     4 20181030  4.23  4.29  4.19  4.26
## 4  1538     5 20181102  4.7   4.7   4.7   4.7 
## 5  1538     5 20181105  4.7   4.7   4.7   4.7 
## 6  1538     5 20181106  4.7   4.7   4.7   4.7

Summarise

將資料做總結，常搭配group_by()使用

  #查看每個產業(ind)的收盤價(close)平均數(mean)
  stockData %>% group_by(ind) %>%  summarise(sum_close = mean(close))

## # A tibble: 4 x 2
##     ind sum_close
##   <int>     <dbl>
## 1     1      18.6
## 2     2      39.4
## 3     4      27.7
## 4     5      55.0

其他常用函數

lag

  #將數列往"下"移動一格,計算報酬率時常用
  stockData %>% arrange(code,date) %>% group_by(code) %>% mutate(ret = (close/lag(close,1)) -1)

## # A tibble: 28,478 x 8
## # Groups:   code [116]
##     code   ind     date  open  high   low close      ret
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1  1101     1 20180102  27.7  27.8  27.4  27.8 NA      
##  2  1101     1 20180103  27.7  28.4  27.5  28.4  0.0220 
##  3  1101     1 20180104  28.4  28.4  27.9  28.2 -0.00388
##  4  1101     1 20180105  28.5  28.5  28.1  28.3  0.00248
##  5  1101     1 20180108  28.4  29.1  28.2  29.0  0.0244 
##  6  1101     1 20180109  28.9  29.3  28.8  29.1  0.00241
##  7  1101     1 20180110  29.1  29.2  28.9  29.2  0.00413
##  8  1101     1 20180111  29.1  29.1  28.4  28.5 -0.0247 
##  9  1101     1 20180112  28.7  28.8  28.4  28.6  0.00246
## 10  1101     1 20180115  28.6  28.7  28.5  28.7  0.00665
## # ... with 28,468 more rows

distinct

  #將指定變數中的"非重複"值抽出
  stockData %>% distinct(ind)

## # A tibble: 4 x 1
##     ind
##   <int>
## 1     1
## 2     2
## 3     4
## 4     5

left_join

  #以左表為主表，將右表為副表對應合併，左表沒有者忽略
  left_join(stockData,epsData,by = c("code" = "code"))

## # A tibble: 28,478 x 8
##     code   ind     date  open  high   low close   eps
##    <int> <int>    <int> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1101     1 20180102 27.7   27.8 27.4   27.8  2.03
##  2  1102     1 20180102 25.9   26   25.7   25.8  1.74
##  3  1103     1 20180102 12.1   12.2 11.9   12.2  1.39
##  4  1104     1 20180102 20.7   20.7 20.6   20.7  2.16
##  5  1108     1 20180102  8.35   8.4  8.35   8.4 -0.25
##  6  1109     1 20180102 12.7   13.2 12.7   13.0  1.32
##  7  1110     1 20180102 14.6   14.6 14.3   14.5  0.11
##  8  1201     2 20180102 20.6   20.6 20.4   20.5  1.31
##  9  1203     2 20180102 24.0   24.2 24.0   24.1  1.86
## 10  1210     2 20180102 27.5   27.5 27.3   27.4  3.28
## # ... with 28,468 more rows

更多join系列動圖看這裡

實作範例:

情境:

希望以股價、EPS資料做分析，研究市場是否會給予產業內獲利的公司股價正面反應?

又不同產業間，狀況是否有差異?

請使用本範例提供的stockdata.csv及eps.csv進行操作，

獲利的定義為eps大於10，正面反應的定義為全年累計報酬為正。

步驟如下:

請將eps.csv合併至stockdata.csv並篩選出eps大於10者。(left_join, filter)
請將個股分群後依照代碼、日期排序並計算單日簡單報酬、全年累積報酬。(gorup_by, arrange, mutate, lag, cumprod)
計算各產業平均全年累計報酬。(group_by, summarize)

R語言量化投資實作工作坊_dplyr實作

中山大學財務管理研究所蔣維倫

2019年9月16日

dplyr 介紹:

安裝並載入函數包:

範例資料介紹:

dplyr常用函數介紹:

Arrange

Select

Filter

Mutate

Group_By

Summarise

其他常用函數

lag

distinct

left_join

實作範例:

R語言量化投資實作工作坊_dplyr實作

中山大學 財務管理研究所 蔣維倫

2019年9月16日

dplyr 介紹:

安裝並載入函數包:

範例資料介紹:

dplyr常用函數介紹:

Arrange

Select

Filter

Mutate

Group_By

Summarise

其他常用函數

lag

distinct

left_join

實作範例:

中山大學財務管理研究所蔣維倫